Post-Processing Fornander’s Hawaiian Antiquities

March 1, 2024

When searching for works to prepare as e-books at Distributed Proofreaders, I always try to find works that are still interesting today, add some diversity to Project Gutenberg’s collection, or are of significant cultural or historical importance.

Another criterion is that the works should be manageable by the volunteers here at Distributed Proofreaders, and in this, I like to explore the edges of what is possible. Each e-book on the site goes through multiple proofreading and formatting rounds, with volunteers carefully comparing the image of each page with the computer-generated text derived from it. Once all the pages have completed these steps, a post-processor carefully assembles them into an e-book.

Collections of folklore are always popular and interesting. They are timeless and offer an insight into the culture of a people. Over the years, I’ve added a couple of books with Hawaiian folklore from various authors, and, while digging deeper for more, I hit upon the mother-lode of many of these works: the Fornander Collection of Hawaiian Antiquities and Folk-Lore, a huge collection of material collected in the late 19th Century by Abraham Fornander, published between 1916 and 1920, in three large volumes, by the Bishop Museum Press in Honolulu.

Abraham Fornander was born in Sweden, on the island of Öland, on 4 November 1812, the son of a clergyman. He studied theology at the University of Uppsala, but dropped out and left Sweden to become a whaler. In 1838, he arrived in Hawaii. Here, he became a coffee planter, land-surveyor, and journalist. He also officially became a citizen of the (then still independent) Kingdom of Hawaii, and married Pinao Alanakapu, a Hawaiian chiefess. He started to promote public education and took up various official roles as inspector, governor, and judge. This allowed him to travel around the Hawaiian islands and collect a lot of information about Hawaiian mythology and the Hawaiian language. He used much of his collected materials to publish his Account of the Polynesian Race (a work I hope to tackle at some later date). After his death, he left a massive collection of notes and papers. These ended up in the Bernice P. Bishop Museum and ultimately were published, together with English translations, from 1916 to 1920. The first volume of Fornander’s collection is now available on Project Gutenberg (the following two volumes are still in progress at Distributed Proofreaders at the time of writing).

The volumes are bilingual, with the English translation on the left and Hawaiian original on the right. Since the Hawaiian language, as written at that time, used only standard letters and no diacritics, it is not that difficult for non-speakers to deal with. In fact, the Hawaiian alphabet is surprisingly short, with just 13 letters: five vowels, a e i o u (each with a long pronunciation and a short one, but here not distinguished); seven consonants, h k l m n p w; and the glottal stop (not shown in this text). Since every syllable in Hawaiian consists of at most a single consonant followed by a vowel or diphthong, to non-natives some words may appear long and repetitious, and names in particular can become pretty long, although there are also plenty of very short words to compensate.

Like many indigenous languages, Hawaiian is an endangered language. It was still widely spoken in the 19th Century, when the Hawaiian islands were an independent kingdom that maintained diplomatic relations with many countries. The Hawaiian Kingdom’s constitution was written in Hawaiian. Literacy was promoted and newspapers were regularly printed. However, through the machinations of American businessmen, the government of Queen Liliʻuokalani was overthrown in 1893, and after being run as a “Republic” for a short while, the territory was annexed by the United States in 1898. This led to the demise of the Hawaiian language. In 1896, English was made the sole official language, and the use of Hawaiian in schools was systematically suppressed. Only in the 1950s did this trend slowly begin to reverse, with renewed interest in the language and indigenous culture, even as Hawaii became a U.S. state in 1959. Hawaiian dictionaries were published, and a revival movement gained traction in the 1970s, with schools once again teaching children the language. However, it is still spoken by only a small fraction of the current population of Hawaii.

Having Fornander’s collection easily accessible will be very valuable to learners of the language (even though the language used is probably archaic and the spelling differs a bit from modern Hawaiian) and to students of its folklore and history. The collection starts off, appropriately, with a mythological description of the discovery of the islands and the origins of the Hawaiian people. The first volume further includes, among many others, the popular story of Umi, a fifteenth-century chief or king, who usurped the throne from his older half-brother, then ruled for about 35 years and united the Hawaiian islands into a single kingdom.

Since today only about 24,000 speakers of Hawaiian remain, the hope of finding enough native speakers to help us out with this project was limited. We needed to ask non-Hawaiian-speaking volunteers to work on Hawaiian pages, even if they didn’t know a single word. Hawaiian is an Austronesian language, remotely related to languages such as Malay or Tagalog, so speakers of those might occasionally recognize a word, although it will often require some linguistic training to see the relationship (and that really is no help in proofreading those pages). Hawaiian is more closely related to Polynesian languages such as Tongan, Samoan, or Tahitian, and speakers of those languages can probably get some of the gist of the stories (but speakers of those languages are also not easily found).

So how to deal with such a massive and complex work?

Well, first, praise where praise is due: The many volunteers at Distributed Proofreaders dutifully ploughed through the Hawaiian pages and fixed a lot of errors left behind by the optical character recognition process (which turns scanned images into editable text). When I received the work to post-process, most of the hard work had already been done.

Still, post-processing a work like this is a considerable challenge. Post-processors have to create both text and HTML files for Project Gutenberg and make them both easily readable. First, I needed to untwine the English and Hawaiian text (which in the original book were on alternating pages), such that both the English and Hawaiian text became continuous texts, at least at the chapter level. To do this, I simply made two copies of the text file, removed the Hawaiian part from one and the English part from the other, and then recombined them, so that the Hawaiian follows the corresponding English chapter.
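The untwining step can be illustrated with a small Python sketch. It is not the procedure I actually used (that was done with copies of the text file), and it rests on assumptions: the hypothetical function `untwine` takes pages as `(chapter, text)` pairs in book order, the pages alternate strictly, starting with English, and the chapter of each page is already known.

```python
def untwine(pages):
    """Regroup alternating English/Hawaiian pages by chapter.

    pages: list of (chapter, text) tuples in book order. Assumption:
    pages alternate strictly, English first, then Hawaiian.
    Returns a list of (chapter, language, text) tuples in which the
    Hawaiian text of a chapter follows the English text of that chapter.
    """
    chapters = {}  # dicts keep insertion order, so chapters stay in book order
    for index, (chapter, text) in enumerate(pages):
        language = "en" if index % 2 == 0 else "haw"
        parts = chapters.setdefault(chapter, {"en": [], "haw": []})
        parts[language].append(text)

    result = []
    for chapter, parts in chapters.items():
        result.append((chapter, "en", "\n".join(parts["en"])))
        result.append((chapter, "haw", "\n".join(parts["haw"])))
    return result
```

In a real book, the alternation is broken by front matter, plates, and blank pages, so the language of each page would more likely have to be marked by hand than derived from its position.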

Once the untwining was done, I started to add tags to demarcate chapter headings, poetry, tables, and footnotes, convert quotation marks to their proper curly shapes, etc., and deal with the issues the proofreaders noted. Then I came to the task of checking the entire text for remaining spelling issues, and that in a language I do not speak, without the help of a spelling-checker, and in an obsolete spelling.

Luckily, I’ve done this a few times before, and developed a few tools to help me make this easier. During my preparation, I tagged each fragment of text in my file with the language it is written in. This enables me to create word-lists, which I can inspect. Words that occur many times can be safely ignored, but those that are rare or unique may need some further inspection. Since I color-code by frequency, rare words jump out.
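As a rough illustration of the word-list idea (not my actual tools), the following Python sketch counts words per language and picks out the rare ones. The input format, a list of `(language, text)` fragments, and the names `word_lists` and `rare_words` are assumptions made for this example.

```python
import re
from collections import Counter

# A "word" here is any run of Unicode letters; digits and punctuation split words.
WORD = re.compile(r"[^\W\d_]+")

def word_lists(fragments):
    """Count word frequencies separately for each language.

    fragments: iterable of (language, text) tuples (an assumed format).
    Returns a dict mapping each language to a Counter of lower-cased words.
    """
    counts = {}
    for language, text in fragments:
        words = (word.lower() for word in WORD.findall(text))
        counts.setdefault(language, Counter()).update(words)
    return counts

def rare_words(counter, max_count=1):
    """Return, in alphabetical order, the words seen at most max_count times."""
    return sorted(word for word, n in counter.items() if n <= max_count)
```

Words returned by `rare_words` are the candidates for closer inspection; the frequent ones can usually be trusted.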

Using the word-list, I can identify suspect words, but that doesn’t always help. Then I can turn to another tool and generate a KWIC (Keyword in Context) index. This allows me to see how each word is used, and, based on that, I can often decide how to deal with it.
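For readers unfamiliar with the idea, a KWIC index can be sketched in a few lines of Python. This is a minimal illustration, not my actual tool: the function name `kwic` and the `width` parameter are made up for the example, and it works on plain text only, without the language tags or color-coding.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword, with context on both sides.

    The keyword is matched as a whole word, ignoring case. The left context
    is right-aligned so that the keywords line up in a column.
    """
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines
```

Printing the result of `kwic(text, "Umi")` shows every occurrence of the name with its immediate surroundings, which is often enough to tell a genuine variant from a scanning error.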

The illustration below shows how this works for the name Kekakapuomaluihi. At a glance, I can see this is used in Hawaiian and English. It is mentioned in the index (yellow background), pointing to the page on which it can actually be found, and its meaning is explained in a footnote (pink background).

Finally, I wanted to align the text in parallel columns, such that the English and Hawaiian could be read side-by-side, as in the original. This is less straightforward than it sounds, because sometimes a paragraph on the left is the equivalent of two on the right, and sometimes paragraph boundaries do not match. To make this work, I give all paragraphs in one language a label, and give the matching paragraphs in the other language the same label. This way, my software knows which paragraphs to place next to each other.
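The labeling scheme can be sketched as follows in Python. This only illustrates the principle and is not my actual software: the function `align` and the `(label, text)` input format are assumptions for the example.

```python
def align(english, hawaiian):
    """Pair up paragraphs of two languages that carry the same label.

    english, hawaiian: lists of (label, paragraph) tuples in reading order.
    Several paragraphs may share one label, which is how one paragraph on
    the left can be matched with two on the right.
    Returns a list of (label, english_paragraphs, hawaiian_paragraphs) rows.
    """
    def group(paragraphs):
        grouped = {}  # insertion order keeps the labels in reading order
        for label, text in paragraphs:
            grouped.setdefault(label, []).append(text)
        return grouped

    en, haw = group(english), group(hawaiian)
    # Labels that occur only in the Hawaiian text are appended at the end.
    labels = list(en) + [label for label in haw if label not in en]
    return [(label, en.get(label, []), haw.get(label, [])) for label in labels]
```

Each row can then be rendered as one table row with two cells, so that matching paragraphs always start at the same height, however unequal their lengths.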

Having gone through all those steps, I was at last able to submit the work to Project Gutenberg. Now the first volume of Fornander’s monumental collection is freely available to all those interested in Hawaiian culture. At the time of writing, volume two is almost ready as well, and volume three is in the final formatting round at Distributed Proofreaders.

This post was contributed by Jeroen Hellingman, a Distributed Proofreaders volunteer who was the Project Manager and Post-Processor for the Fornander Collection.


From Paper to App: How Distributed Proofreaders Got a Cebuano Dictionary on Smartphones

March 7, 2015

Because my wife’s native language is Cebuano, I am always on the lookout for resources in that language. Although widely spoken in the Southern Philippines, with about 30 million native speakers, the language lacks any official status and is mainly used in informal settings. Primary schools switch to a mix of English and Tagalog (re-branded as “Filipino” to make it a national language) after the first two years, and most official business takes place in English. As a result, there are very few publications in Cebuano.

Back in 2006, I was able to obtain a set of scans of John U. Wolff’s Dictionary of Cebuano Visayan from somebody in the Philippines, and not much later I found a second set of scans available online from Cornell University. Immediately I noticed that this is a great resource for people who would like to study the language: it gives detailed grammatical information, and includes numerous sample sentences. Of course, it does have its issues: its use of non-standard orthography makes it less acceptable to most speakers of the language, and the way the information on verb usage is encoded is hard to understand even for a determined student. But still it would be very nice to have this book in a digital format.

Since the dictionary dates from 1972, at first I had little hope it could be re-published in Project Gutenberg; however, I got in touch with the author, now Emeritus Professor at Cornell University, and after consultation with the publisher he gave me permission to process the book for Project Gutenberg. Later on, I also noticed a very liberal “Public Domain” notice on this copy, stating that the book would enter the public domain in 1982.

Quickly, the process of preparing the scans for Distributed Proofreaders started: splitting all scans into columns, preparing instructions for the sometimes complex entries, and preparing several projects (one for each letter), such that proofers wouldn’t be shocked by a 2500+ page count, and more importantly, that work on it could be done in all rounds at once, and post-processing could get an early start with the first few letters.

When the first parts started to return from the site, the massive task of post-processing such a huge work began. Fiddling with regular expressions and custom-made conversion scripts in a combination of Perl and XSLT, I managed to massage the original typographic tagging into a far more useful structural tagging, such that all the various elements encoded in the dictionary were marked as such, with grammar labels, entries and sub-entries, sample sentences, and their translations each having their own tag. This would also enable a spell-check of the entire document. Here the dictionary itself proved highly useful: one of the first things I did with the data was to convert it into a SQL database and build a web interface around it, which enabled me to quickly look up words in their context and so locate remaining issues in the data.
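To give an impression of what looking up words in their context in a database amounts to, here is a minimal Python sketch using SQLite. It is not the schema or code of the actual website: the table `entry`, its two columns, and the function names are assumptions, and the sample rows used with it are illustrative, not quoted from Wolff’s dictionary.

```python
import sqlite3

def build_db(entries):
    """Create an in-memory database from (headword, body) tuples.

    The single-table layout is a simplification made for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (headword TEXT, body TEXT)")
    conn.executemany("INSERT INTO entry VALUES (?, ?)", entries)
    return conn

def find_in_context(conn, word):
    """Return all entries whose body mentions word, ordered by headword."""
    cursor = conn.execute(
        "SELECT headword, body FROM entry WHERE body LIKE ? ORDER BY headword",
        (f"%{word}%",),
    )
    return cursor.fetchall()
```

A word that turns up in a sample sentence but nowhere as a headword, or the other way round, is a natural candidate for a closer look.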

When all this was done, I was able to produce a huge (almost 10 megabyte) HTML and text file for submission to Project Gutenberg, and a nice PDF version which could be used to reprint the book; and, even better, I could publish the web interface on the website I set up to promote Bohol. All files required to process the dictionary are available online as well.

Since the introduction of tablet computers, I wanted to also create some software for them, and I got that opportunity in 2013, when I got three months of paid leave as part of my severance payment when my employer decided to close the Dutch office in which I was working. In that period, I dived into the architecture of Android apps, and basically re-coded the functionality of the website for a smartphone, in such a way that all the data was on-board and could be accessed without the need to be connected. Although the app was basically finished by October 2013, it took me quite some time to publish it, as I was occupied with other things after a magnitude 7.2 earthquake in Bohol destroyed my in-laws’ house (as well as many other buildings, including some of its most beautiful historical churches). Also, I wanted to add some more features and polish the icons being used, and was investigating a way to earn something from the app. Seeing that this was not going to happen soon, I decided to release the Cebuano-English Dictionary App for free, and also publish the complete source code, hoping it will prove a great resource for all with an interest in the Cebuano language, and hoping the source code will be helpful in building similar dictionaries for other languages. (Unfortunately, I won’t be making a version for the iPhone, as Apple requires DRM on all apps distributed through their iTunes store, and in general their conditions are incompatible with the GPL-3 I am using for my code.)

Of course, all this wouldn’t have been possible without the diligent proofreading of many volunteers at Distributed Proofreaders — for that, daghang salamat (many thanks)!


Castes and Tribes of Southern India by Edgar Thurston

April 23, 2014

Back in 1995-96, I lived in India for about one and a half years, with the initial idea of making a number of multimedia productions on Indian art, culture, and history, but ended up mostly working on Indian language dictionary databases….

Among the sources I encountered in India were the various multi-volume sets entitled “Castes and Tribes of…” for various regions, such as the Central Provinces, Bengal, the Punjab, and, one of the biggest sets, the seven-volume work covering Southern India. All these books were put together at the behest of the British Government by officials and their Indian assistants in the late nineteenth and early twentieth century. There are even several volumes on “Criminal Tribes”. These sets describe, in all its intricate detail, the mind-boggling complexity of Indian society a hundred years ago: a society that has been changing quickly and has already lost much of this complexity (sometimes for the better, but sometimes not), and that is today changing at an even faster rate, losing much of its colourful diversity in the process.

For one of the multimedia productions, I proposed to digitize the entire set, and produce a CD “Castes and Tribes of India”, to make this massive piece of work available again. The project never made it.

Malayan Devil-Dancers

The books are of an encyclopedic nature. After a relatively short general introduction, they treat the castes and tribes in alphabetical order, in articles that sometimes encompass just a single paragraph, but are sometimes long enough to fill a monograph. For the time, many of these volumes are lavishly illustrated with photographs (the original set on the Central Provinces even used collotypes, a costly raster-free reproduction technique that preserves the sharpness and details of the original photographs). As the articles are written by various people, and often based on older publications or articles, their quality and scope vary somewhat, but in general, they give an interesting overview of each caste or tribe described. Since the terms “caste” and “tribe” are used liberally, you can also find very interesting articles in those books on, for example, Anglo-Indians, Mar Thoma Christians, and Cochin Jews, and groups living in almost stone-age conditions such as the Irula, as well as the highly secluded Nambudiri Brahmins.

A few years after my return to Holland, in 1998, I managed to purchase an original 1909 copy of Thurston’s 7-volume set on Southern India, as well as reprints of many of the other sets, for digitization and inclusion in Project Gutenberg. At that time, I started scanning these volumes, but just as quickly stopped doing so, as I found out that the scanning would damage the costly volumes, and put the project on hold. I did continue with the (less costly) 4-volume facsimile reprint on the Central Provinces. Several years later, I purchased a scanner that would cause less damage to the books, and continued scanning. Shortly afterwards, I discovered that scans were being added to the Internet Archive collection, so I no longer needed to scan the remaining volumes (except for a few missing pages). Starting from 2006, the projects appeared on the Distributed Proofreading site, slowly but steadily making their way through the rounds, until, finally, the last volume left the rounds in 2011. Then the huge task of post-processing this work started, complicated by the many words with accents and the numerous tables in the books. Finally, on 21 June 2013, the entire set got posted on Project Gutenberg.

Almost 18 years after first envisioning this project, and 15 years after starting work on it, one of the biggest projects I’ve worked on for Project Gutenberg has come to a close. If somebody is interested, the original volumes are for sale (they won’t be cheap though). For the time being, I will leave the remaining sets to be picked up by other volunteers.

The 7 volumes of Castes and Tribes of Southern India are available here: I, II, III, IV, V, VI, VII.

The 4 volumes of The Tribes and Castes of the Central Provinces of India are available here: I, II, III, IV.


Farthest North

November 10, 2010

In our time of comfortable air-travel, it is hard to imagine that just 115 years ago neither the North nor the South Pole had ever been set foot upon by mankind, but reaching the North Pole is exactly what the Norwegian explorer Fridtjof Nansen set out to do in 1893, with a purpose-built ship, the “Fram”, and a team of hardened men all willing to risk their lives for this mission.

The "Fram" in the ice (from Farthest North, volume II)

Nansen’s richly illustrated book, Farthest North; Being the Record of a Voyage of Exploration of the Ship “Fram” 1893-96 and of a Fifteen Months’ Sleigh Journey by Dr. Nansen and Lieut. Johansen, originally published in Norwegian in 1897, was translated into English the same year. In it, Nansen describes the inventiveness he used to organize such a mission with fairly limited means. The “Fram” had a number of interesting innovations. Its hull was specially designed to be lifted by the drifting ice, instead of being crushed; and the ship featured a windmill, to provide electricity, and thus some bright electric light through the long polar night.

Volume I (Project Gutenberg ebook 30197) describes the planning phase, and how the “Fram” set out to sail as far North as it could, intentionally letting itself be captured by the ice, in an attempt to drift further North than any ship had reached before.

Volume II (Project Gutenberg ebook 34120) continues with the even more dare-devil attempt by Nansen and his companion Johansen to reach the North Pole on sledges pulled by dogs. During this trip, they killed their dogs one by one, feeding the weaker or exhausted ones to the remaining ones, the same much-criticized method that brought Amundsen to the South Pole 16 years later. They reached the then record-setting latitude of 86 degrees 15 minutes North before heading South again. During this trip South, they lost their remaining dogs, and were finally forced to build a hut and stop for the winter on Franz Josef Land, North of Spitsbergen. Here they survived by shooting over-curious polar bears and using them as food, until they were able to reach a small outpost built by the British explorer Jackson, just a few miles from where they had survived the winter with so much hardship.

From here, the journey home proceeded smoothly, and the two explorers were reunited with their crew, who had survived another winter in the ice on the “Fram” before being able to return home.

The "Fram" as it is today, in the Fram Museum

The book is a pleasant read, and is illustrated with hundreds of photographs and drawings, including a number of color plates from pencil sketches made during the trip.

Nansen went on to become an influential statesman, with an important role in Norwegian independence, and an even more important role in saving millions of lives in Russia, Armenia, and Turkey with the High Commission for Refugees, for which he was awarded the Nobel Peace Prize in 1922.

After being the ship that reached the northernmost latitude, the “Fram” also became the ship to reach the southernmost latitude, during Amundsen’s expedition to the South Pole in 1911. Today, the “Fram” can still be visited: it has been pulled ashore in Oslo, and a museum, the Frammuseet, has been built around it. The visit will be even more impressive after reading this book, when, after a short inspection of the cabins where these men lived for three years, you realize some of the hardships they must have gone through. (This visit can easily be combined with a visit to the Norwegian Maritime Museum, the Kon-Tiki Museum, and the Viking Ship Museum, all located within walking distance of each other.)