From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

March 7, 2015

Because my wife’s native language is Cebuano, I am always on the lookout for resources in that language. Although widely spoken in the Southern Philippines, with about 30 million native speakers, the language lacks any official status and is mainly used in informal settings. Primary schools switch to a mix of English and Tagalog (re-branded as “Filipino” to make it a national language) after the first two years, and most official business takes place in English. As a result, there are very few publications in Cebuano.

Back in 2006, I was able to obtain a set of scans of John U. Wolff’s Dictionary of Cebuano Visayan from somebody in the Philippines, and not much later I found a second set of scans available online from Cornell University. Immediately I noticed that this is a great resource for people who would like to study the language: it gives detailed grammatical information, and includes numerous sample sentences. Of course, it does have its issues: its use of non-standard orthography makes it less acceptable to most speakers of the language, and the way the information on verb usage is encoded is hard to understand even for a determined student. But still it would be very nice to have this book in a digital format.

Since the dictionary dates from 1972, at first I had little hope it could be re-published in Project Gutenberg; however, I got in touch with the author, now Emeritus Professor at Cornell University, and after consultation with the publisher he gave me permission to process the book for Project Gutenberg. Later on, I also noticed a very liberal “Public Domain” notice on this copy, stating that the book would enter the public domain in 1982.

Quickly, the process of preparing the scans for Distributed Proofreaders started: splitting all scans into columns, preparing instructions for the sometimes complex entries, and preparing several projects (one for each letter), such that proofers wouldn’t be shocked by a 2500+ page count, and more importantly, that work on it could be done in all rounds at once, and post-processing could get an early start with the first few letters.

When the first parts started to return from the site, the massive task of post-processing such a huge work began. Fiddling with regular expressions and custom-made conversion scripts in a combination of Perl and XSLT, I managed to massage the original typographic tagging into far more useful structural tagging, such that all the various elements encoded in the dictionary were marked as such, with grammar labels, entries and sub-entries, sample sentences, and their translations each having their own tag. This also enabled a spell-check of the entire document, in which the dictionary itself proved highly useful: one of the first things I did with the data was to convert it into a SQL database and build a web interface around it, which let me quickly look up words in their context and locate remaining issues in the data.
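To give a feel for that step, here is a minimal sketch in Python rather than the Perl and XSLT I actually used. Everything in it is an assumption made for illustration: the <b>/<i> input markup, the field names, the table layout, and the two sample entries (which are invented, not quoted from the dictionary). It turns typographically tagged lines into structured records, loads them into an in-memory SQLite database, and looks up a word in the context of the sample sentences.

```python
import re
import sqlite3

# Invented sample lines in a hypothetical typographic markup:
# <b> marks the headword, <i> marks grammar labels and Cebuano sentences.
SAMPLE = (
    "<b>balay</b> <i>n</i> house. <i>Dako ang balay.</i> The house is big.\n"
    "<b>tubig</b> <i>n</i> water. <i>Bugnaw ang tubig.</i> The water is cold.\n"
)

# One pattern per entry shape; a real dictionary needs many more of these.
ENTRY = re.compile(
    r"<b>(?P<form>[^<]+)</b>\s*"      # headword
    r"<i>(?P<pos>[^<]+)</i>\s*"       # grammar label
    r"(?P<gloss>[^<]+?)\.\s*"         # English gloss, up to the first period
    r"<i>(?P<example>[^<]+)</i>\s*"   # sample sentence
    r"(?P<translation>.+)"            # translation of the sample sentence
)


def to_structural(line):
    """Return a dict of named fields, or None if the line does not fit."""
    match = ENTRY.fullmatch(line.strip())
    return match.groupdict() if match else None


def build_db(text):
    """Load every recognised entry into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE entry "
        "(form TEXT, pos TEXT, gloss TEXT, example TEXT, translation TEXT)"
    )
    rows = [e for e in map(to_structural, text.splitlines()) if e]
    db.executemany(
        "INSERT INTO entry VALUES "
        "(:form, :pos, :gloss, :example, :translation)",
        rows,
    )
    return db


def find_in_context(db, word):
    """List (headword, sample sentence) pairs whose sentence contains word."""
    cursor = db.execute(
        "SELECT form, example FROM entry WHERE example LIKE ? ORDER BY form",
        (f"%{word}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_db(SAMPLE)
    for form, example in find_in_context(db, "ang"):
        print(form, "|", example)
```

Lines that do not match the pattern come back as None instead of being guessed at, which is roughly how the remaining problem entries surfaced for manual checking.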

When all this was done, I was able to produce a huge (almost 10 megabyte) HTML and text file for submission to Project Gutenberg, and a nice PDF version which could be used to reprint the book; and, even better, I could publish the web interface on the website I set up to promote Bohol. All files required to process the dictionary are available online as well.

Since the introduction of tablet computers, I wanted to also create some software for them, and I got that opportunity in 2013, when I got three months of paid leave as part of my severance payment when my employer decided to close the Dutch office in which I was working. In that period, I dived into the architecture of Android apps, and basically re-coded the functionality of the website for a smartphone, in such a way that all the data was on-board and could be accessed without the need to be connected. Although the app was basically finished by October 2013, it took me quite some time to publish it, as I was occupied with other things: a magnitude 7.2 earthquake in Bohol destroyed my in-laws’ house (as well as many other buildings, including some of its most beautiful historical churches). Also, I wanted to add some more features and polish the icons being used, and was investigating a way to earn something from the app. Seeing that this was not going to happen soon, I decided to release the Cebuano-English Dictionary App for free, and also publish the complete source code, hoping it will prove a great resource for all with an interest in the Cebuano language, and hoping the source code will be helpful in building similar dictionaries for other languages. (Unfortunately, I won’t be making a version for the iPhone, as Apple requires DRM on all apps distributed through their iTunes store, and in general their conditions are incompatible with the GPL-3 I am using for my code.)

Of course, all this wouldn’t have been possible without the diligent proofreading of many volunteers at Distributed Proofreaders — for that, daghang salamat (many thanks)!

Uncle Sam’s Place and Prospects (1899)

February 15, 2011

In Barbara Tuchman’s The Proud Tower, there is a chapter (“End of a Dream”) about America’s decision to become a colonial power. At the time, we had just won the Spanish-American War and there was a major political conflict between those who believed that America was destined to become an imperial power (the pro-imperialism forces) and those who believed that becoming an imperial power would destroy America’s principles of self-government and isolation (the anti-imperialism forces). The pro-imperialism forces won.

The political paper “Outlook: Uncle Sam’s Place and Prospects in International Politics” gives a feel for the nature of that political conflict. That paper was read by Newton MacMillan before The Fortnightly Club (Oswego, N. Y.) on May 2, 1899.

In that paper, Mr. MacMillan argued for colonization of the Philippines, in part, because it was needed to protect our interests in capturing the Chinese market. He cited statistics about the value of our exports to China and talked of conspiracies of other foreign powers to shut us out of that market.

But how long is this to continue? With our experience of tariffs we need not be reminded that low prices do not command markets. Continental Europe does not like us. We saw that during the Spanish war, and we have heard it since in various impatient declarations of hostility, at Berlin or Vienna, far more significant than official assurances of distinguished consideration. Indeed, if Germany, or France, or Russia does not openly break with us, it is because fear or prudence is stronger than inclination. The moment any one or all of them combined feels able to slam the door in our face without fear of reprisals, the door will be slammed.

He argued for setting up a base for operations against any attempts to shut us out of China.

So, in great measure, the Philippines mean for us a foothold in the East and a strong leverage on China. Would our co-operation be sought at this time, as it has been, not only by England but by Germany, if George Dewey had not sailed his ships into the harbor of Manila on the night of the 30th of April, 1898, dodging the sunken mines and torpedoes, that he might on the morrow fire “the shot heard round the world?” On that day and since then the world learned that we are a nation not only of shopkeepers and money-grabbers, but also of fighters; that in a prolonged war we stand unconquerable, irresistable. A year and a day ago we were a nation; to-day we are a power, and have only to assert ourselves as such.

Then, after arguing that we must subject the Philippines to our rule for economic reasons, he argued that colonization was needed to “make us less corrupt”:

But if, on the other hand, we set up good government in the colonies, how long shall we be content with misrule at home? Not long, I promise you. “It is one of the most beautiful compensations of this life,” says the wise man, “that no man can sincerely try to help another without helping himself.” No less true is this of nations. The eyes of the world are upon us and the conscience of civilization will hold us strictly accountable. As we deal with those ignorant wards whom the God of Battles has given into our keeping, even so shall we be dealt with. And in uplifting them from barbarism so shall we be uplifted.

He even stated that:

I believe the present low tone of our internal politics to be due to the long and peaceful isolation of the Republic.

In other words, peaceful isolation is bad, but colonizing foreign lands will save our soul. Interesting arguments.

