From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

Because my wife’s native language is Cebuano, I am always on the lookout for resources in that language. Although widely spoken in the Southern Philippines, with about 30 million native speakers, the language lacks any official status and is mainly used in informal settings. Primary schools switch to a mix of English and Tagalog (re-branded as “Filipino” to make it a national language) after the first two years, and most official business takes place in English. As a result, there are very few publications in Cebuano.

Back in 2006, I was able to obtain a set of scans of John U. Wolff’s Dictionary of Cebuano Visayan from somebody in the Philippines, and not much later I found a second set of scans available online from Cornell University. I noticed immediately that this is a great resource for people who would like to study the language: it gives detailed grammatical information and includes numerous sample sentences. Of course, it does have its issues: its use of non-standard orthography makes it less acceptable to most speakers of the language, and the way the information on verb usage is encoded is hard to understand even for a determined student. Still, it would be very nice to have this book in a digital format.

Since the dictionary dates from 1972, at first I had little hope it could be re-published in Project Gutenberg; however, I got in touch with the author, now Emeritus Professor at Cornell University, and after consultation with the publisher he gave me permission to process the book for Project Gutenberg. Later on, I also noticed a very liberal “Public Domain” notice on this copy, stating that the book would enter the public domain in 1982.

The process of preparing the scans for Distributed Proofreaders started quickly: splitting all scans into columns, writing instructions for the sometimes complex entries, and setting up several projects (one for each letter). That way, proofers wouldn’t be shocked by a 2500+ page count and, more importantly, work could proceed in all rounds at once and post-processing could get an early start with the first few letters.

When the first parts started to return from the site, the massive task of post-processing such a huge book began. Fiddling with regular expressions and custom-made conversion scripts in a combination of Perl and XSLT, I managed to massage the original typographic tagging into a far more useful structural tagging, in which all the various elements encoded in the dictionary were marked as such: grammar labels, entries and sub-entries, sample sentences, and their translations each got their own tag. This also enabled a spell-check of the entire document, in which the dictionary itself proved highly useful: one of the first things I did with the data was to convert it into an SQL database and build a web interface around it, which let me quickly look up words in their context and locate remaining issues in the data.
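To give an idea of what moving from typographic to structural tagging can look like, here is a minimal sketch in Python. It is not the actual Perl and XSLT used for the dictionary: the tag names (`<b>`, `<i>`, `<pos>`, `<ex>`), the list of grammar labels, the table layout, and the two sample entries are all assumptions made up for illustration, and the entries are not quoted from the dictionary. The sketch turns bold text into a headword, decides per italic run whether it is a grammar label or a sample sentence, loads the result into an in-memory SQLite database, and then looks up a word in context.

```python
import re
import sqlite3

# Made-up sample lines in a hypothetical typographic markup:
# <b> marks bold text, <i> marks anything printed in italics.
SAMPLE = """\
<b>balay</b> <i>n</i> house. <i>Dako ang balay.</i> The house is big.
<b>tubig</b> <i>n</i> water. <i>Bugnaw ang tubig.</i> The water is cold.
"""

# Assumed set of grammar labels; a real dictionary would use many more.
GRAMMAR_LABELS = {"n", "v", "a"}


def retag(line):
    """Split a line into (headword, body) and give italic runs a structural tag."""
    match = re.match(r"<b>(.+?)</b>\s*(.*)", line)
    if match is None:
        return None
    headword, rest = match.groups()

    def classify(italic):
        text = italic.group(1)
        # An italic run is either a known grammar label or a sample sentence.
        tag = "pos" if text in GRAMMAR_LABELS else "ex"
        return f"<{tag}>{text}</{tag}>"

    return headword, re.sub(r"<i>(.+?)</i>", classify, rest)


def build_db(entries):
    """Load (headword, body) pairs into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (headword TEXT, body TEXT)")
    conn.executemany("INSERT INTO entry VALUES (?, ?)", entries)
    return conn


def find_in_context(conn, word):
    """Return every entry whose body mentions the given word."""
    cursor = conn.execute(
        "SELECT headword, body FROM entry WHERE body LIKE ? ORDER BY headword",
        (f"%{word}%",),
    )
    return cursor.fetchall()


entries = [retag(line) for line in SAMPLE.splitlines()]
conn = build_db(entries)
for headword, body in find_in_context(conn, "tubig"):
    print(headword, "|", body)
```

Once sample sentences carry their own tag, they can be pulled out and checked separately from the English glosses, which is what makes a spell-check against the dictionary's own headwords practical.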

When all this was done, I was able to produce a huge (almost 10 megabyte) HTML and text file for submission to Project Gutenberg, and a nice PDF version which could be used to reprint the book; and, even better, I could publish the web interface on the website I set up to promote Bohol. All files required to process the dictionary are available online as well.

Ever since tablet computers were introduced, I had wanted to create some software for them as well. I got that opportunity in 2013, when I received three months of paid leave as part of my severance payment after my employer decided to close the Dutch office in which I was working. In that period, I dived into the architecture of Android apps and basically re-coded the functionality of the website for a smartphone, in such a way that all the data was on-board and could be accessed without the need to be connected. Although the app was basically finished by October 2013, it took me quite some time to publish it. I was occupied with other things: a 7.2 earthquake in Bohol destroyed my in-laws’ house (as well as many other buildings, including some of the island’s most beautiful historical churches). I also wanted to add some more features and polish the icons being used, and was investigating a way to earn something from the app. Seeing that this was not going to happen soon, I decided to release the Cebuano-English Dictionary App for free and to publish the complete source code as well, hoping it will prove a great resource for all with an interest in the Cebuano language, and hoping the source code will be helpful in building similar dictionaries for other languages. (Unfortunately, I won’t be making a version for the iPhone, as Apple requires DRM on all apps distributed through their iTunes store, and in general their conditions are incompatible with the GPL-3 I am using for my code.)

Of course, all this wouldn’t have been possible without the diligent proofreading of many volunteers at Distributed Proofreaders — for that, daghang salamat (many thanks)!

6 Responses to From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

  1. Sarah Jensen says:

    Thank you for sharing your story. This is why I love Distributed Proofreaders; there are so many great projects being worked on every day! You’ve gone above and beyond to make this project especially valuable. Well done!

  2. Jeroen, with all that work, I would like to download the dictionary just to see the finished product.
    As an aside, I would ask if Cebuano is in any way related to Cebu City. We used to fly into the Cebu City area, when I was in the Navy during the Viet Nam Era, and I still remember the bus rides from the airfield into town to stay at a hotel there.
    quentin

    • Hi Quentin, Yes, Cebuano is the language spoken in the city of Cebu, which is on the island of Cebu, as well as surrounding islands, and in large parts of Mindanao. I guess if you would return now, you wouldn’t recognize the area (and it would be great if you still have some old photographs from that era to compare..) The dictionary production itself also dates from that era, in which the US still poured a lot of money into SE Asian Studies; the preparation of this dictionary (the data, not the app) was funded by US tax-payers, which is one of the reasons it is freely available now.

  3. susanskinner says:

    Amazing work, Jeroen! I love working on your Philippines-related projects and this looks like it will be an amazing resource.

  4. Bel says:

    Great work! I’m so glad there are so many people at DP and elsewhere who are dedicated to the great project of making all these books accessible to the world at large.

  5. trigger says:

    I’m so proud of this project. This was the project that got me hooked on DP and I never looked back. Thanks, Jeroen, for bringing this and other wonderful projects to DP and the world.
