Friday, October 12, 2007

Spell Center and Linguistics

Kerianne pointed out to me that I haven't posted in a while. Realizing that this is in fact true, I've come up with a cohesive topic to write about. Since I've been working on Spell Center for the past 3 months, I'd like to formally introduce it and explain why it's necessary.

Translation projects are often initiated by some sort of linguist, sometimes a PhD candidate looking for a doctoral thesis topic or someone in a similar situation. These high-class intellectuals perform detailed analysis of the language, including phonology, morphology, and several other words ending in -ology that most normal people never utter.

Often, a written language doesn't yet exist, so linguists have to work with locals to decide what alphabet should be used. Frequently they choose the alphabet of a trade language, like Thai or English (Latin), because the similarity will make learning the trade language easier. One language project that I've seen in progress started in a script very similar to Lanna (Northern Thai), then was converted into Thai, and has now been converted to Latin letters. This sort of change is sometimes made for political reasons, other times for practical reasons (language learning). Sometimes an entirely new alphabet is designed. Computers make these conversions rather easy, especially if converting amounts to respelling (like ที่น to Tim).

Once the linguist has done the initial work and created an alphabet with phonology and grammar, the written language is taught to local people and a translation project begins. SIL has found it most effective to make use of local people for most of the translation, and so computers are introduced to tribal people at an early stage. The translation work is typed into computers (usually running Windows XP and sometimes Mac OS X or Linux) using tools like Paratext and Translator's Workbench, developed by farang non-profit organizations.

Enter Spell Center. In trade languages like English, French, and Thai there are usually several dictionaries and word lists available for spell checkers to check against. However, in minority languages, the only list of words is the corpus of work that has been created. In Spell Center (and other apps like Paratext) we parse the entire corpus of work and get a list of all the words used. The user can then look through this list of words, decide which ones are spelled correctly, and fix the ones that are spelled wrong (pictured to the right).
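To make the word list idea concrete, here is a minimal sketch in Python. It is not Spell Center's actual code. The function name `build_word_list` and the tokenizing rule are my own assumptions, and the simple regular expression only behaves well for Latin-style scripts (scripts with combining vowel marks, like Thai, would need smarter word-breaking).

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters. This is a deliberately naive rule;
# a real tool would need per-language word-breaking.
WORD_RE = re.compile(r"[^\W\d_]+")

def build_word_list(texts):
    """Count every distinct word (lowercased) across all texts in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts

# Hypothetical two-document corpus
corpus = ["The cat saw the dog.", "A dog barked"]
word_list = build_word_list(corpus)
for word, count in sorted(word_list.items()):
    print(word, count)
```

The result is the only "dictionary" a minority language has at this stage: every word that has been typed so far, along with how often it appears.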

While existing applications do this already, Spell Center exists so that we can add additional algorithms to help out the translator. For instance, we can assume that if a short word occurs only once or twice (rather than 125 times), it's probably a typo, and we can mark it wrong. We also allow the translator to be unsure and mark a word with a '?' to come back to later, when someone more knowledgeable is available. Finally, the bottom pane shows every place where the word was used, along with its surrounding context, to help the user remember the meaning of the word.
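The "short and rare means probably a typo" guess could be sketched roughly like this in Python. The thresholds (`max_len=4`, `max_count=2`) and the function names are assumptions I made up for illustration, not the values or names Spell Center uses. Anything the guess doesn't flag starts out as '?' so a person can decide.

```python
def initial_status(word, count, max_len=4, max_count=2):
    """Guess a starting status for a word from its length and frequency.

    Assumed thresholds: a word of at most max_len letters that appears
    at most max_count times is treated as a likely typo.
    """
    if len(word) <= max_len and count <= max_count:
        return "wrong"
    return "?"  # undecided: left for the translator to review

def triage(counts):
    """Map every word in a {word: count} table to its guessed status."""
    return {word: initial_status(word, count) for word, count in counts.items()}

# Hypothetical counts taken from a corpus
statuses = triage({"teh": 1, "the": 125, "translation": 1})
for word, status in statuses.items():
    print(word, status)
```

The guess is only a starting point. A rare short word can be perfectly valid, so the user always has the final say.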

We'll be integrating this with an open source application called Enchant, which is an engine that can be used to provide spelling suggestions for translation editors and word processors.

After Spell Center comes to a relatively stable place, we can consider extending the functionality to create a concordance very quickly.
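For the curious, a concordance is basically a list of every occurrence of a word with some text on either side. A rough Python sketch, under my own assumptions (the `concordance` name, the `width` parameter, and the whole-word matching rule are all hypothetical, not a planned design):

```python
import re

def concordance(texts, word, width=30):
    """Return each occurrence of `word` with up to `width` characters of context."""
    # Match the word only when it is not part of a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            lines.append(text[start:end].replace("\n", " "))
    return lines

# Hypothetical corpus
for line in concordance(["the cat sat", "a cat"], "cat", width=4):
    print(line)
```

This is the same kind of lookup the context pane needs, which is why a concordance is a natural next step.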

A requirement of Spell Center is that it should be easy for new users to learn, since it probably won't be used much by PhD linguists but rather by local people doing translation work. The interface I have here in the screenshot will change quite a bit before it's actually released. The current plan is to have it at a relatively stable place by the beginning of November so that we can try it out on a user and find all those design issues that we never really thought about. I'm quite excited for this stage. It's the moment of truth, when I get to find out if all the work I've been doing has been worthwhile, and I get to see it in use. This suddenly isn't some obscure project for school that has no life cycle beyond getting a decent grade; this is actually going to be used by people!
