The life of a book at Distributed Proofreaders

January 1, 2016

This post walks through the life of a book at DP from its beginnings as a physical book to its final form as a beautiful ePub, using Uncle Wiggily’s Auto Sled by Howard Roger Garis, recently posted to Project Gutenberg (eBook number 50405), as a study.

Aside: I didn’t help with this particular book in any way, but rather selected it based on its length, language, beautiful illustrations, and wonderful example of a final ePub.


Selecting a book

The process begins when a volunteer (usually referred to as a Content Provider) finds a book they want as an eBook. They first have to get a clearance from Project Gutenberg Literary Archive Foundation (PGLAF) that the book is in the public domain, and legal to be reproduced. pgdp.net and Project Gutenberg are both in the United States and thus must adhere to US Copyright law. DP and PG sites hosted in other countries are able to work on and host books that are in the public domain in their respective countries, but aren’t in the public domain in the US.

Figuring out if a book is in the public domain can be oddly complicated — which is why we leave it to the professionals at PGLAF — but a general rule of thumb is that if it was published in the US before 1923, it’s probably in the public domain in the US.

Uncle Wiggily’s is copyright 1922, so just barely under the wire.

Getting the initial text

After receiving clearance, the volunteer either scans the book in or finds the page images from Google Books, The Internet Archive (usually through their OpenLibrary site), or a slew of other image providers. The images will likely need some level of cleaning to deskew or despeckle them after being scanned in. The images are then run through OCR software to get an initial, raw copy of the text.

Page images of Uncle Wiggily’s were obtained from Google Books.

Note that Google Books and The Internet Archive stop here — eBooks you download from them contain only the text obtained from OCR. PDFs contain the page images with the underlying OCR available for selection and searching. The Internet Archive provides an ePub format, but it’s of the raw OCR text — not a pleasant reading experience.

At DP, this is just the first step in the process of refining and creating an eBook.

Loading the book into DP

Once the page images and text are available, a Project Manager will take up the mantle and guide the book (referred to as a project) through DP. Note that the Project Manager may have acted as Content Provider as well, may have been asked by the Content Provider to manage the book, or may have found the project on one of DP’s internal lists of available scans ready for adoption.

Either way, the Project Manager will create a new project at DP for the book (e.g., Uncle Wiggily’s project page). They’ll fill in a slew of metadata about the project so that proofreaders will be able to find it. This includes information like the name, author, the language the book is written in, and its genre. They will then add the page images and text.

Unleash the proofreaders!

Up until now the process hasn’t been very distributed and may, in fact, have all been done by a single individual. But now that the book has been loaded and is ready for proofreading, many people can work on it at once.

The book starts out in P1, the first proofreading round. Proofreading volunteers can select any book available in this round and start proofreading pages. How they select which project to work on is completely up to them. They might browse the list of all available projects in the round or search for those matching a specific genre and/or language.

Once they find a project and click on ‘Start Proofreading,’ they are presented with an interface that shows the page image and the text. Their job is straightforward: make the text match the image and follow some basic proofreading guidelines. After they make whatever changes they think are necessary to the text, they save the page and can either get a new page from the project or stop proofreading. Other volunteers may be working on the book at the same time, each on a separate page.

After all pages have been proofread, the project is moved into two other proofreading rounds in series: P2 and P3. While any volunteer can proofread books in P1, the subsequent rounds have entrance criteria to ensure each level has ever-increasing proofreading experience and critical eyes.

The time it takes to go through the proofreading rounds can vary from minutes to years depending on the size of the book, the complexity of the pages, the quality of the initial OCR, and most importantly, how many volunteers are interested in working on it!

Uncle Wiggily’s meagre 33 pages soared through all three proofreading rounds in 4.5 hours.

Formatting: a bold move

Proofreading focuses on the page text, not how it’s formatted — that’s for the F1 and F2 formatting rounds. It’s in these rounds that all formatting happens, including things like bold, italics, and underlining, as well as marking poetry and other non-paragraph text for when the book is combined back together. These rounds are also fully distributed and, not surprisingly, there’s a set of formatting guidelines as well.

Uncle Wiggily’s completed both formatting rounds in roughly 12 hours.
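For the programmers reading along, the order of the rounds described so far can be sketched in a few lines of Python. This is only an illustration of the sequence in this post, not DP's actual code; the function names and the page counters are my own assumptions.

```python
# Round order as described in this post. The list and helper names are
# hypothetical labels for illustration, not taken from the DP site code.
ROUNDS = ["P1", "P2", "P3", "F1", "F2"]


def next_round(current):
    """Return the round that follows `current`, or None once F2 is complete."""
    index = ROUNDS.index(current)
    if index + 1 < len(ROUNDS):
        return ROUNDS[index + 1]
    return None


def advance(project_round, pages_done, pages_total):
    """Move a project forward only when every page in the round is finished."""
    if pages_done < pages_total:
        return project_round
    return next_round(project_round)
```

With Uncle Wiggily’s 33 pages, `advance("P1", 32, 33)` leaves the project in P1, while `advance("P3", 33, 33)` moves it on to F1. A result of `None` after F2 stands for a project that has left the rounds and is waiting for post-processing.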

Stitching them all back up again

Now that the pages have been proofread and formatted, they wait for a Post-Processor to pick them up and stick them together into their final form. The Project Manager may perform this step, or it may be someone else. The Post-Processor will do a wide range of sanity checks on the text to ensure consistency, merge hyphenated words that break across pages, and many other bits. They’ll create at least a plain-text version of the book for uploading to Project Gutenberg. Nowadays HTML versions are also very common and are further used to make ePubs for eBook readers.
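To give a flavour of one of those checks, here is a minimal Python sketch of rejoining a word that was hyphenated across a page break. It is not the tooling Post-Processors actually use; the function name and the lowercase-letter heuristic are assumptions of mine, and a real Post-Processor would still check each case by eye, since some hyphens (as in "well-known") belong in the word.

```python
def join_pages(pages):
    """Join page texts, merging a word split by a hyphen at a page break.

    Assumed heuristic: if a page ends with "-" and the next page starts
    with a lowercase letter, treat it as one word broken across pages.
    """
    joined = ""
    for page in pages:
        text = page.strip()
        if not joined:
            joined = text
        elif joined.endswith("-") and text[:1].islower():
            # Drop the trailing hyphen and glue the two halves together.
            joined = joined[:-1] + text
        else:
            # Otherwise keep the page boundary as a plain line break.
            joined = joined + "\n" + text
    return joined
```

For example, `join_pages(["He climbed into the auto-", "mobile and drove off."])` yields a single line containing the whole word "automobile".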

Books like Uncle Wiggily with illustrations require even more care. Unlike page texts that are often scanned in at a relatively low resolution in black and white, illustrations are often in color and always at a higher resolution. Post-Processors will take great care in cropping, color balancing, and doing other image processing on the illustrations before including them in the HTML and ePub versions.

Smoooooooth reading

Often, but not always, Post-Processors will submit the books to what is called the smooth reading round. This is an opportunity for people to read the book as a book, but with a careful eye to anything that looks amiss. Humans are great at noticing when things are not quite right, and what better way to do it than reading the book! If the reader spies something amiss they can let the Post-Processor know and have it corrected.

Posted to Project Gutenberg

Now that the eBook is completed, it’s posted to Project Gutenberg! Each eBook gets a unique number from Project Gutenberg which is recorded in the DP project record.

Uncle Wiggily’s Auto Sled was given number 50405 and was posted in several different formats.

Every book posted from DP includes a credit line in the text that recognizes the Project Manager and Post-Processor individually and the team at DP as a collective. If the images were sourced from another provider, they are also recognized in the credit line.

Uncle Wiggily’s credit line looks like this:

E-text prepared by David Edwards, Emmy, and the Online Distributed Proofreading Team (http://www.pgdp.net) from page images generously made available by the Google Books Library Project (http://books.google.com)

Preserving history, one page at a time

As you can see, there are many different ways to help create an eBook as a DP volunteer. The best thing about DP is that you can do only the parts you enjoy and only as much of those parts as you enjoy.

Interested in helping a book on its journey? It’s easy to get started as a proofreader — just:

  1. Create an account at DP
  2. After you register, find a project and start proofreading!

Or you can smooth read a book without even creating an account.


Proofing with Maps

August 8, 2015

While proofing for Distributed Proofreaders, I often find myself opening up a mapping application to locate rivers, towns, buildings, forts, streets, etc. that are mentioned, described, or central to a project. Sometimes it’s to figure out where they are. Sometimes it’s to try and see what’s being described.

map

For example, Early Western Travels, 1748-1846, Volume XXIII, describes some rock formations that the footnote identified as being in Dawson and Valley Counties, Montana. Using that information, I was able to view a photo of the rock formations. I’ve also found remote tiny towns that still exist in the American West — one even had a preserved historical district.

Florizel’s Folly (in progress at DP) led me to Brighton, England, and Yellowstone’s Living Geology: Earthquakes and Mountains (also in progress) led me to Old Faithful.

I posted in the DP forums about this and found another proofreader who was using mapping software to locate parks that were mentioned in old bird books as locations of certain birds. This person was interested in whether the parks have the same birds.

Of course, I look at maps because I love maps. So starting with a specific reference point from a book, I can get lost for half an hour or more exploring, envisioning, and virtually visiting. Anywhere. And how exciting when I get a chance to visit in person a site I’ve visited before via mapping software; for example, the Pony Express Statue in Sacramento Old Town.

If you haven’t tried this before, do! You may find yourself addicted.

This post was contributed by WebRover, a DP volunteer.


Sunday School Stories

April 4, 2015

Maybee’s Stepping Stones by Archie Fell is a book of Sunday school stories for each week of the year. As I read it, I experienced a wide range of emotions — love, kindness, patience, life, death, naughtiness, guilt, fear, consequences, tolerance, forgiveness, family, community, happiness, sorrow, adversity, hope, loneliness, sadness, joy….

frontispiece

I gasped with alarm when Dick shot himself; when Tryphosa was overcome with the fire. I wanted to cry when Dick lay in the woods unheard, when Phosy and Aunty McFane became ill, and I rejoiced when Mrs. Harte and Bill Finnegan went to the Sabbath School, and when Dan Harte resolved to overcome his addiction to alcohol. I shared the children’s frustrations as they struggled with doing the right thing, and smiled unashamedly when their good deeds worked near miracles.

The stories may be old-fashioned, and based on Christianity, but the lessons are for us all, whether we believe in a god or not, whether our deeds are in person or via social media, whether we are young or old. We can all put out a hand in comfort and together we can grow in strength no matter what our trials and tribulations.

“She had just been reading a chapter in the Bible out loud, and Aunty McFane said there was a promise for every ache she had. Isn’t it funny,” he continued, turning to Miss Marvin, “that folks just as different as can be find exactly what they want in the Bible?” — Maybee’s Stepping Stones, page 224.

Reading these stories, I couldn’t help but reminisce about when I was a little girl going to Sunday school.

Denomination meant nothing to us so the church we attended was the one within walking distance — I think it was Presbyterian. Our parents didn’t seem particularly religious, but they did make us go to Sunday school. Our father had in mind that if we weren’t christened it would be easier for us if we wanted to marry someone of strong faith in a particular church.

I never did work out my father’s beliefs. I suspect my mother was quite devout, although I did not know her to go to church, and she didn’t speak about religion much. She did go to a Catholic primary school — she had me shocked and in fits of laughter when she told me of the time she had to stand in front of an open fire with a piece of soap in her mouth because she had sworn at the nuns.

…  then she tried scrubbing the inside of his mouth with soap-suds — Maybee’s Stepping Stones, page 19.

My sister only recently told me the story of her second son who, at age six, when admonished for swearing, was threatened with a similar fate of having his mouth washed out with soap. The little boy went to the bathroom, grabbed some soap, foamed it up in his mouth, and went out to his mother and said, “Now I can swear.” I think there’s quite a bit of my mother’s determined spirit in both my sister and my nephew. The same son said to my sister the other week: “Do what you want, mother, you will anyway.”

My mother also told the story of a family member who was a Major in the Salvation Army. I heard her say many times that only the good die young. And I learnt that she had a very difficult time accepting the death of a daughter before I was born.

Upon the pine coffin, the girls in Miss Cox’s class laid a wreath of beautiful hot-house flowers; but all over the lid, and inside, around the pale face and over the white robe, were fresh, fragrant pond-lilies, their subtile perfume filling the room. — Maybee’s Stepping Stones, page 149.

We had Sunday School stories, much like those told in Maybee’s Stepping Stones. We collected a stamp for each story lesson we attended. When our stamp sheet was full, we were presented with a little book.

We had our “Sunday best” clothes, and how we did love dressing up, putting on our delicate little dresses with ribbons and bows, and polishing our little shoes. Going to Sunday school was exciting and something to look forward to. It added a purpose to our lives, spiritual and social.

But she made her appearance, bright and early, Sabbath morning, comparatively quite docile, submitted to be washed, shampooed, braided, and ruffled, with a most martyr-like air, and came out from the process not so very unlike the five other girls, among whom Say seated her, with such a happy look in her own blue eyes. Just to see her sitting there more than repaid the trouble. — Maybee’s Stepping Stones, page 106.

Our Sunday school was at the back of the church in a prefabricated corrugated steel “Nissen hut” like those used for temporary accommodation during the war years. The building is still there but it is no longer a church, and the hut has been replaced by a brick addition attached to the main building.

I mentioned above it was within walking distance. Back then, there was a church nearby almost everywhere. I thought about this in recent years when a neighbour who had become almost housebound because of poor vision and declining mobility told me that one of the things she missed most was being able to walk to church. Her old church building was still there, too, at the end of the street where she had lived most of her adult life, along with the convent buildings that had been converted first to a school, and then to an art gallery, and now left to crumble. The nearest church for her was now on the other side of town. Buses don’t run on Sundays in this small community so, with few friends or family interested in taking her to church, she had only television services to comfort her.

So much inward soul searching from a little children’s book — literary merit?… Well, the stories stand up to the test of time, is all I can conclude.

This post was contributed by a DP volunteer.


From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

March 7, 2015

Because my wife’s native language is Cebuano, I am always on the lookout for resources in that language. Although widely spoken in the Southern Philippines, with about 30 million native speakers, the language lacks any official status and is mainly used in informal settings. Primary schools switch to a mix of English and Tagalog (re-branded as “Filipino” to make it a national language) after the first two years, and most official business takes place in English. As a result, there are very few publications in Cebuano.

Back in 2006, I was able to obtain a set of scans of John U. Wolff’s Dictionary of Cebuano Visayan from somebody in the Philippines, and not much later I found a second set of scans available online from Cornell University. Immediately I noticed that this is a great resource for people who would like to study the language: it gives detailed grammatical information, and includes numerous sample sentences. Of course, it does have its issues: its use of non-standard orthography makes it less acceptable to most speakers of the language, and the way the information on verb usage is encoded is hard to understand even for a determined student. But still it would be very nice to have this book in a digital format.

Since the dictionary dates from 1972, at first I had little hope it could be re-published in Project Gutenberg; however, I got in touch with the author, now Emeritus Professor at Cornell University, and after consultation with the publisher he gave me permission to process the book for Project Gutenberg. Later on, I also noticed a very liberal “Public Domain” notice on this copy, stating that the book would enter the public domain in 1982.

Quickly, the process of preparing the scans for Distributed Proofreaders started: splitting all scans into columns, preparing instructions for the sometimes complex entries, and preparing several projects (one for each letter), such that proofers wouldn’t be shocked by a 2500+ page count, and more importantly, that work on it could be done in all rounds at once, and post-processing could get an early start with the first few letters.

When the first parts started to return from the site, the massive work of post-processing such a huge work started. Fiddling with regular expressions and custom-made conversion scripts in a combination of Perl and XSLT, I managed to massage the original typographic tagging to a far more useful structural tagging, such that all the various elements encoded in the dictionary were marked as such, with grammar labels, entries and sub-entries, sample sentences, and their translations having their own tag. This would also enable a spell-check of the entire document, in which the dictionary itself proved highly useful, because one of the first things I did with the data was to convert it into a SQL database, and build a web interface around it, which enabled me to quickly look up words in their context, and then use this interface to locate remaining issues in the data.
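To make that step a little more concrete, here is a small sketch of the idea: turn typographic tags into named fields with a regular expression, then load the result into a SQL table for lookups. It is written in Python with SQLite purely for illustration; the real work was done in Perl and XSLT, and the markup, the two sample entries, the table layout, and all function names below are simplified assumptions, far simpler than the dictionary's real entries.

```python
import re
import sqlite3

# Hypothetical typographic markup: a bold headword, an italic grammar
# label, then the gloss. The real entries are much more complex.
RAW_ENTRIES = [
    "<b>balay</b> <i>n</i> house",
    "<b>tubig</b> <i>n</i> water",
]

ENTRY_PATTERN = re.compile(
    r"<b>(?P<headword>.+?)</b>\s*<i>(?P<label>.+?)</i>\s*(?P<gloss>.+)"
)


def to_structural(raw):
    """Convert one typographically tagged line into named fields."""
    match = ENTRY_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {raw!r}")
    return match.groupdict()


def build_database(raw_entries):
    """Load the structured entries into an in-memory SQLite table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE entry (headword TEXT, label TEXT, gloss TEXT)"
    )
    rows = [to_structural(raw) for raw in raw_entries]
    connection.executemany(
        "INSERT INTO entry VALUES (:headword, :label, :gloss)", rows
    )
    return connection


def look_up(connection, headword):
    """Return (label, gloss) pairs for an exact headword match."""
    cursor = connection.execute(
        "SELECT label, gloss FROM entry WHERE headword = ?", (headword,)
    )
    return cursor.fetchall()
```

Calling `look_up(build_database(RAW_ENTRIES), "balay")` returns the grammar label and gloss for that headword. Lines that do not fit the pattern raise an error rather than being skipped, which is the kind of signal that helps locate remaining issues in the data.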

When all this was done, I was able to produce a huge (almost 10 megabyte) HTML and text file for submission to Project Gutenberg, and a nice PDF version which could be used to reprint the book; and, even better, I could publish the web interface on the website I set up to promote Bohol. All files required to process the dictionary are available online as well.

Since the introduction of tablet computers, I wanted to also create some software for them, and I got that opportunity in 2013, when I got three months of paid leave as part of my severance payment when my employer decided to close the Dutch office in which I was working. In that period, I dived into the architecture of Android apps, and basically re-coded the functionality of the website for a smartphone, in such a way that all the data was on-board and could be accessed without the need to be connected. Although the app was basically finished by October 2013, it took me quite some time to publish it, as I was occupied with other things: a magnitude 7.2 earthquake in Bohol destroyed my in-laws’ house (as well as many other buildings, including some of its most beautiful historical churches). Also, I wanted to add some more features and polish the icons being used, and was investigating a way to earn something from the app. Seeing that this was not going to happen soon, I decided to release the Cebuano-English Dictionary App for free, and also publish the complete source code, hoping it will prove a great resource for all with an interest in the Cebuano language, and hoping the source code will be helpful in building similar dictionaries for other languages. (Unfortunately, I won’t be making a version for the iPhone, as Apple requires DRM on all apps distributed through their iTunes store, and in general their conditions are incompatible with the GPL-3 I am using for my code.)

Of course, all this wouldn’t have been possible without the diligent proofreading of many volunteers at Distributed Proofreaders — for that, daghang salamat (many thanks)!


A Day at Waterloo

January 7, 2015

Early last year I downloaded A Week at Waterloo in 1815, by Lady Magdalene De Lancey, from Project Gutenberg. I was soon caught by the story, written by the widow of Colonel Sir William Howe De Lancey.

De Lancey

Col. Sir William Howe De Lancey

Sir William was mortally wounded in a skirmish, the day before the big battle at Waterloo, when he was riding at Wellington’s side. He was hit in his side by a cannonball that threw him off his horse. He was not killed immediately, but survived his wound for six days. When his men saw he wasn’t dead yet, they moved him to a barn, where he was left for several hours, till the fight was over, and he could be transported to a nearby farm.

When his wife, who was staying in Antwerp, heard that he was wounded, or maybe dead, she didn’t hesitate to look for transport that would bring her to her husband. When she finally got there, after much trouble, she was relieved to find him still alive.

The cannonball left a gigantic bruise on Sir William’s side. In those circumstances, nobody could really tell the severity of his wound. Lady De Lancey nursed her husband, never leaving his side. But he didn’t make it. After his death, examination revealed that the cannonball had broken several ribs, which had penetrated his lung.

I was very much touched by this story. Sir William seemed to have been a good man, and his comrades, his superiors, and his family speak very highly of him in this memoir. You can find a full review of it here.

Now to my own story. A few weeks after I read the book, my dog died. I was very sad, and so was my son. We felt lost in the house. We decided it would do us good to get out and make a day-trip. I proposed that we should go to Waterloo, as it is only an hour’s drive from our place, but I had never been there. So the next Saturday we went.

memorial

Waterloo Memorial at Evere

First we visited the cemetery in Evere, where the British casualties are entombed, and there is a beautiful memorial monument on top of the tomb. The illustration is on page 118 of the book. If you look at it, you can see on the left stairs going down. This is where you enter the tomb, and inside there are niches containing the remains of the officers. I soon found William De Lancey’s last resting place, and stood a few minutes in silence, honouring this brave man, and his fellow officers and soldiers. (The soldiers with lower rank are also buried there, outside the tomb, but within the outer walls that you can see around the monument.)

Afterwards we went to Waterloo, where we visited the Wellington Museum, located in the house where Wellington had his headquarters. On a wall in one of the rooms was a newspaper page, and in the bottom right corner I could read amongst the names of the casualties: William De Lancey, mortally wounded.

We also climbed the stairs to the top of the Waterloo Lion, from where we had a view over the entire battlefield. Later we also visited Napoleon’s last headquarters.

This day was a very interesting experience, being at the place where so many gave their lives. But it was William De Lancey and his wife who touched my heart.

Thank you, all the members of the Distributed Proofreaders team, who worked so hard to make this book available for the world!

This post was contributed by Eevee, a DP volunteer.


Typesetting

October 8, 2014

Typesetting is a topic close to the hearts of many DPers, and the foundation on which the books we work on were built.

I learnt typewriting on a manual typewriter when I was at school. A classmate secured a job as an editor with a magazine based on the skills she learnt in the course we were doing. I was so envious! Editor on a magazine, with no work experience, and no qualification. A few years earlier, when asked by a teacher what I wanted to be, I replied I wanted to be a journalist, not because I wanted to be a writer, but because I wanted to work on newspapers, with those monstrous printing presses and the glorious smell of ink, and fiddly bits of lead.

I did manage to become a journalist and editor, but the huge presses were ageing, and typesetting was becoming regarded as no more than wordprocessing on a computer. I remember being chastised for the miles of galley paper that spewed out of the printer one time when I forgot to close off the heading command properly and ended up with a whole article in 72-point Times, a somewhat expensive mistake as rolls of galley proof paper were not cheap.

setting type

Working on the book, Typesetting, by A. A. Stewart, for Project Gutenberg, I couldn’t help but reflect and wish I could have been an apprentice hand compositor and daydream about what the publishing industry must have been like when each character had to be manually placed in the composing stick; when the characters of each font were housed in separate type cases; when measurements for line lengths, page sizes, and margins had to be mentally calculated quickly and accurately; when justification of lines was achieved by manually placing a mix of spaces of different widths (and even resorting to “pieces of paper or thin card” if metal thin spaces were not at hand).

type case

Imagine being able to set type and read the text upside down; to have the dexterity to take a piece of type from the case and place it in the composing stick; to proofread the lines of type and correct mistakes before justifying the lines.

upside down

How arduous the correction process, where “Simple errors like the exchanging of one type for another of the same width, the turning of an inverted character, or the transposition of letters or words, are corrected by pressing the line at both ends to lift it up about one-third of its height and picking out the wrong types with the finger and thumb. The line is then dropped in place and the right types put in.”

Not to mention having to wash the type and placing each character back into the proper slots in the proper cases, so that the type pieces could be used over and over.

Sitting at my computer, selecting fonts, messing about with HTML and CSS coding, I still want to be an apprentice hand compositor. “Typesetting, a primer of information about working at the case, justifying, spacing, correcting, making-up, and other operations employed in setting type by hand”, is an excellent training manual that gives me an insight into what I would have been doing had I been able to achieve my dream.

This post was contributed by a DP volunteer.


Introducing . . . Harrison Ford! (1918 Edition)

September 24, 2014

cover

I was looking for an online copy of The Cruise of the Make-Believes by Tom Gallon to find out if a certain bit of punctuation was a colon or semi-colon (semi-colon, by the way), and, as one does when Googling, I found other links with the same title: one was to Turner Classic Movies and another to the Internet Movie Database. Yes! My book was turned into a movie in 1918. A silent movie!

I pointed the link out to a friend, and she noted who one of the leading men was: Harrison Ford. Now we both knew that in 1918, it wasn’t Han Solo or Indiana Jones or the President. Time for more searching.

This not-at-all make-believe Harrison Ford was born in 1884 in Kansas City, Missouri. He started in the theater on the east coast before moving to Hollywood in 1915. He was in over forty films. His final film was his only talkie, Love in High Gear, which was released in 1932. He then took his career back to the stage and also began directing there. During World War II, he toured with the USO. He died in 1957, without having had children, and is no relation to that other Harrison Ford.

Harrison Ford

One of his hobbies was collecting old books . . . a man after DP’s own heart. I’ve no idea why this book was chosen to be made into a movie. It seems no better nor worse than any of our other romances. Old books, you just never know where they will take you. This one took me on a hunt to find the first dreamy Harrison Ford.

You can read more about this Harrison Ford here or here or here, all of which I used for my information with gratitude.

This post was contributed by a DP volunteer.

