An Gramadóir: Developers' Guide

Chapter 2. Starting a new language

2.1. Statistical support

The first thing you should do if you're interested in porting An Gramadóir is contact me. Assuming your language is one of the 2000+ languages for which my web crawler is running, I will create a new language pack for you using the crawler's data. If you don't have a clean word list, there will be some preliminary work involved in constructing one.

Even if you have a word list in place, the web crawler can be used to augment the word list or even to find potential errors in it by statistical means. The crawler generates the following files for each language:

A.toadd.txt: This is the main list of candidate words to be considered for addition to the word list; these words pass through all of the statistical filters.
A.toaddcap.txt: Same as A.toadd.txt, but consisting of words appearing primarily in upper case in the corpus. These words are therefore usually (but not always) proper names of one kind or another.
A.accent.txt: Pairs of words that both pass through the filters but differ only in the presence or absence of one or more diacritical marks.
A.glanacc.txt: Same as A.accent.txt, but each pair consists of one word that is already in the "clean" word list (labelled "z" in the file) and one word which is not (labelled "y"). In most cases, the "y" word is incorrect and this is an efficient way to build up a "replacement file" (see Section 3.1.3).
A.pollute.txt: High frequency words that also appear in the aspell English word list (or another "polluting" language that you can specify); many of these words are correct, especially the highest frequency words, but as you get deeper in the list quite a few are really pollution.
A.3gram.txt: High frequency words that have one or more "suspect" three-letter sequences in them. The filters must "learn" what correctly spelled words look like based on (1) some number-crunching on the raw corpus and (2) any edits to this and the other files. So initially there will be a mixture of correct and incorrect words in this file, but the mixture improves as the language model improves.
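
To make the ideas behind A.accent.txt, A.glanacc.txt and A.3gram.txt more concrete, here is a minimal Python sketch. It is not the crawler's actual code: the function names, the in-memory word lists, the "^" and "$" word-boundary markers, and the min_count threshold are all assumptions chosen for illustration, and the real filters are statistical and more involved.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'é' becomes 'e'.

    Letters with no Unicode decomposition (such as 'ø') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_pairs(words):
    """Return sorted pairs of words that differ only in diacritical marks."""
    groups = defaultdict(set)
    for w in words:
        groups[strip_diacritics(w)].add(w)
    pairs = []
    for variants in groups.values():
        ordered = sorted(variants)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.append((a, b))
    return sorted(pairs)


def split_by_clean_list(pairs, clean):
    """Keep pairs with exactly one member in the clean list.

    Each result is (known_word, unknown_word), mirroring the idea of a pair
    made of one word from the clean list and one word that is not in it.
    """
    result = []
    for a, b in pairs:
        if (a in clean) != (b in clean):
            result.append((a, b) if a in clean else (b, a))
    return result


def count_trigrams(words):
    """Count three-letter sequences, with hypothetical boundary markers."""
    counts = Counter()
    for w in words:
        padded = f"^{w}$"
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts


def suspect_trigrams(word, trigram_counts, min_count=1):
    """Return the trigrams of word seen fewer than min_count times."""
    padded = f"^{word}$"
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return [t for t in trigrams if trigram_counts[t] < min_count]
```

In this sketch, a word would be a candidate for a list like A.3gram.txt when suspect_trigrams returns a non-empty list. As the trusted word list grows, count_trigrams sees more legitimate sequences and fewer correct words are flagged, which is the "learning" effect described above.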
