From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

March 7, 2015

Because my wife’s native language is Cebuano, I am always on the lookout for resources in that language. Although widely spoken in the Southern Philippines, with about 30 million native speakers, the language lacks any official status and is mainly used in informal settings. Primary schools switch to a mix of English and Tagalog (re-branded as “Filipino” to make it a national language) after the first two years, and most official business takes place in English. As a result, there are very few publications in Cebuano.

Back in 2006, I was able to obtain a set of scans of John U. Wolff’s Dictionary of Cebuano Visayan from somebody in the Philippines, and not much later I found a second set of scans available online from Cornell University. I noticed immediately that this is a great resource for people who would like to study the language: it gives detailed grammatical information and includes numerous sample sentences. Of course, it does have its issues: its use of non-standard orthography makes it less acceptable to most speakers of the language, and the way the information on verb usage is encoded is hard to understand even for a determined student. But still, it would be very nice to have this book in a digital format.

Since the dictionary dates from 1972, at first I had little hope it could be re-published in Project Gutenberg; however, I got in touch with the author, now Emeritus Professor at Cornell University, and after consultation with the publisher he gave me permission to process the book for Project Gutenberg. Later on, I also noticed a very liberal “Public Domain” notice on this copy, stating that the book would enter the public domain in 1982.

Preparing the scans for Distributed Proofreaders began soon afterwards: splitting all scans into columns, writing instructions for the sometimes complex entries, and setting up several projects (one for each letter). That way proofers wouldn’t be shocked by a 2500+ page count and, more importantly, work could proceed in all rounds at once, and post-processing could get an early start with the first few letters.

When the first parts started to return from the site, the massive job of post-processing such a huge work began. Fiddling with regular expressions and custom-made conversion scripts in a combination of Perl and XSLT, I managed to massage the original typographic tagging into a far more useful structural tagging, in which all the various elements encoded in the dictionary were marked as such: grammar labels, entries and sub-entries, sample sentences, and their translations each got their own tag. This also enabled a spell-check of the entire document, in which the dictionary itself proved highly useful. One of the first things I did with the data was to convert it into a SQL database and build a web interface around it, which let me quickly look up words in their context and then locate remaining issues in the data.
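To give an idea of what this kind of conversion involves, here is a minimal sketch in Python. It is not the Perl and XSLT code actually used for the dictionary. The input markup (`<b>` for headwords, `<i>` for both grammar labels and sample sentences), the output tag names (`form`, `pos`, `eg`), the label set, the table layout, and the sample entry are all assumptions made up for illustration.

```python
import re
import sqlite3

# Assumed set of grammar labels; the real dictionary uses a richer system.
POS_LABELS = {"n", "v", "a"}


def to_structural(line: str) -> str:
    """Turn (hypothetical) typographic tags into structural ones."""
    # The first bold run at the start of a line is taken to be the headword.
    line = re.sub(r"^<b>(.*?)</b>", r"<form>\1</form>", line, count=1)

    # Italic runs that match a known label become <pos>;
    # any other italic run is treated as a sample sentence.
    def italic(match: re.Match) -> str:
        text = match.group(1)
        tag = "pos" if text in POS_LABELS else "eg"
        return f"<{tag}>{text}</{tag}>"

    return re.sub(r"<i>(.*?)</i>", italic, line)


def build_index(entries):
    """Load structurally tagged entries into an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entry (head TEXT, body TEXT)")
    for entry in entries:
        match = re.match(r"<form>(.*?)</form>", entry)
        if match:  # skip lines without a recognisable headword
            db.execute("INSERT INTO entry VALUES (?, ?)", (match.group(1), entry))
    return db


def lookup(db, word):
    """Return the full text of every entry whose headword equals `word`."""
    rows = db.execute("SELECT body FROM entry WHERE head = ?", (word,))
    return [row[0] for row in rows]


# Illustrative entry, not quoted from the dictionary.
SAMPLE = "<b>balay</b> <i>n</i> house. <i>Dako ang balay.</i> The house is big."
STRUCTURED = to_structural(SAMPLE)
DB = build_index([STRUCTURED])
```

With the entries in a table like this, `lookup(DB, "balay")` returns the tagged entry, and a query on the `body` column could list every sample sentence in which a suspicious spelling occurs, which is roughly the kind of check described above.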

When all this was done, I was able to produce a huge (almost 10 megabyte) HTML and text file for submission to Project Gutenberg, and a nice PDF version which could be used to reprint the book; and, even better, I could publish the web interface on the website I set up to promote Bohol. All files required to process the dictionary are available online as well.

Since the introduction of tablet computers, I had wanted to create some software for them as well, and I got that opportunity in 2013, when I received three months of paid leave as part of my severance package after my employer decided to close the Dutch office in which I was working. In that period, I dived into the architecture of Android apps and basically re-coded the functionality of the website for a smartphone, in such a way that all the data is on board and can be accessed without a network connection. Although the app was basically finished by October 2013, it took me quite some time to publish it. I was occupied with other things: a magnitude 7.2 earthquake in Bohol destroyed my in-laws’ house (as well as many other buildings, including some of the island’s most beautiful historical churches). I also wanted to add some more features and polish the icons, and I was investigating a way to earn something from the app. Seeing that this was not going to happen soon, I decided to release the Cebuano-English Dictionary App for free and to publish the complete source code as well, hoping it will prove a great resource for all with an interest in the Cebuano language, and that the source code will be helpful in building similar dictionaries for other languages. (Unfortunately, I won’t be making a version for the iPhone, as Apple requires DRM on all apps distributed through their iTunes store, and in general their conditions are incompatible with the GPL-3 I am using for my code.)

Of course, all this wouldn’t have been possible without the diligent proofreading of many volunteers at Distributed Proofreaders — for that, daghang salamat (many thanks)!

