An Gramadóir: Developers' Guide
Of course, until you actually add some real grammatical rules to the language pack input files, the Perl module will function as a simple spell checker only. In this chapter I'll describe the syntax of the input files and some tricks for building them quickly.
In case you're just curious about a single file (what it does or how to create it), here are brief descriptions of each of the files, with links to the more detailed descriptions later in this chapter.
3grams-xx.txt. List of 3-grams, sorted by frequency.
aonchiall-xx.in. Disambiguation rules.
Changes. ChangeLog to be included in the Lingua::XX::Gramadoir distribution.
comhshuite-xx.in. List of set phrases.
configure. Script used to create the langpack Makefile.
COPYING. License for the language pack (not necessarily for the Perl module).
earraidi-xx.bs. Database of misspellings and replacements.
eile-xx.bs. Database of non-standard spellings and replacements.
freq-xx.txt. Frequency counts for words in the lexicon.
giorr-xx.pre. Optional preprocessing step used by the segmentation module.
giorr-xx.txt. List of abbreviations that are usually followed by a period/full stop.
lexicon-xx.bs. Main database of words and parts of speech, compressed.
macra-xx.meta.pl. Macro definitions for use in input files.
morph-xx.txt. Morphological rules.
nocombo-xx.txt. List of morphologically non-productive words.
pos-xx.txt. Table of parts of speech and internally-used numerical codes.
README. Language pack README; it will also be included in the general Perl module distribution.
rialacha-xx.in. Grammatical rules and exceptions.
token-xx.in. Language-specific tokenization rules.
triail.xml. Expected output of the Perl module's test script.
unigram-xx.pre. Optional preprocessing step used before applying the unigram tagger.
unigram-xx.txt. List of all parts of speech, sorted by frequency.
If you'd like your grammar checker to have at least the functionality of a spell checker, you'll need to assemble a large word list (though it is worth mentioning that, for some languages, it is possible to implement a tool that performs interesting checks without necessarily recognizing each word, e.g. Igbo "vowel harmony" rules). Most languages will want a tagged list, with part-of-speech information associated to each word.
Part-of-speech markup is added to input texts as XML tags; you'll need to choose these tags first. If you haven't provided me with a tagged word list (e.g. if you're just starting with a word list from a spell checker) the default language pack will simply tag all words with <U> ("unknown" part of speech). If you just want a fancy spell checker this is sufficient. Otherwise you can place your tags (e.g. <N>, <V>, <N plural="y">, etc.) in pos-xx.txt and assign a numerical code to each (used internally).
There are a couple of mild restrictions:
The numerical codes must be integers between 1 and 65535, excluding 10 (used as a file delimiter). [1]
Code 127 has a special meaning across all languages: it is used to mark up words which are correct but are very rare or might hide common misspellings. A good example in Irish is ata, a past participle meaning "swollen", which does not appear in my corpus of over 20 million words except as a misspelling of atá (a form of the verb "to be"). Words like yor and cant are well-known examples in English.
The XML tags must be ASCII capital letters, excluding B, E, F, X, Y, and Z (which are all tags added to the XML stream by An Gramadóir while checking grammar; see the FAQ for explanations of these). This leaves 20 possible tags, which should be more than enough in light of the fact that you can refine the semantics of your tags by adding XML attributes where appropriate.
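For instance, a hypothetical pos-en.txt consistent with the fictional lexicon in Example 3-1 below might look like the following (the tag names <A> and <R> and the column layout shown here are assumptions, not a prescribed format; check an existing language pack for the canonical layout):
<U>              1
<N>              31
<N plural="y">   32
<V>              33
<A>              36
<R>              37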
The files lexicon-xx.bs and lexicon-xx.txt contain the main database of recognized words. The former is the compressed version that comes in the language pack tarball; the latter is the uncompressed version that you should use for editing (adding words, part-of-speech tags, etc.). If you don't see lexicon-xx.txt you can recreate it using:
$ make lexicon-xx.txt
Conversely, if you ever do a make dist, the compressed version will be updated correctly, taking into account any additions or changes made to lexicon-xx.txt. The file lexicon-xx.txt contains one word per line followed by whitespace and one of the numerical grammatical codes from pos-xx.txt; e.g.:
Example 3-1. An excerpt from a fictional lexicon-en.txt
dipper 31
dire 36
direct 33
direct 36
direct 37
directed 36
direction 31
directional 36
directions 32
Note that ambiguous words should be listed multiple times, once for each possible part of speech (we are thinking in the example above of the word direct as either a verb, adjective, or adverb). The word list need not be alphabetized, but this is probably a good idea for maintenance purposes! The only requirement is that all of the codes for a single ambiguous word must appear contiguously.
As noted earlier, in the default language pack, all grammatical codes are initially set to "1" (<U>) as placeholders, until a proper tagged word list can be constructed.
The file eile-xx.bs is a "replacement" file which contains on each line a non-standard or dialect spelling of a legitimate word followed by a suggested replacement. The file earraidi-xx.bs is similar, but should be used for true misspellings. The only difference in functionality between the two files is how the replacements are reported to the end-user. I built the file eile-en.bs in the English language pack by collating the specifically American and British word lists that are distributed with ispell. The Irish file eile-ga.bs is a by-product of my work on dialect support for Irish language spell checkers. The replacement "word" is allowed to contain spaces, e.g.
spellchecker spell checker
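The file earraidi-en.bs uses the same two-column layout; a hypothetical excerpt of true misspellings and their suggested replacements might look like this (these particular entries are only illustrative):
recieve receive
definately definitely
alot a lot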
The file morph-xx.txt encodes morphological rules and other spelling changes for your language; it is structured as a sequence of substitutions, one per line, using Perl regular expression syntax, with fields separated by whitespace. When an unknown word is encountered, these replacements are applied recursively (depth first, to a maximum depth of 6) until a match is found.
So, for example, this file is where you can specify customized rules for decapitalization (the default language pack provides standard rules for this, while for Irish it is substantially more complicated). You can also use it to strip common prefixes and suffixes in much the same way as the "affix file" is used for ispell or for aspell (but, unlike those programs, allowing several levels of recursion). For Irish, morph-ga.txt is also used to encode many of the spelling reforms that were introduced as part of the "Official Standard" in the 1940s.
The syntax is simpler than it first appears. Each line represents a single rule, and contains four whitespace-separated fields. The first field contains the pattern to be replaced, the second field is the replacement (backreferences allowed, which moves us beyond the usual realm of finite state morphology), and the third field is a code indicating the "violence level" the change represents. Level -1 means that no message should be reported if the rule is applied and the modified word is found (as in the default rule which turns uppercase words into lowercase). Level 0 means that a message is given which just alerts the user that the surface form was not found in the database but that the modified version was. Level 1 indicates that the rule applies only to non-standard or variant forms and will be reported as such (e.g. for American English you could define a level 1 rule that changes ^anaesth to anesth, or globally changes centre to center, etc.). Level 2 indicates that the rule applies only when the surface form is truly incorrect in some way.
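To illustrate the recursive search described above, here is a minimal Perl sketch; it is not the actual Lingua::XX::Gramadoir code, and the rule list and lexicon are hypothetical, but it shows how pattern/replacement/level rules could be applied depth first until a known word is found, giving up after 6 rewrites:
use strict;
use warnings;

# Hypothetical rules: [pattern, replacement callback, level].
my @rules = (
    [ qr/^([A-Z])/, sub { lc $1 },    -1 ],   # decapitalize, report nothing
    [ qr/anaesth/,  sub { 'anesth' },  1 ],   # variant -> standard spelling
);
my %lexicon = map { $_ => 1 } qw(anesthesia direct);

# Return the known form reached from $word, or undef after 6 levels.
sub lookup {
    my ($word, $depth) = @_;
    return $word if $lexicon{$word};
    return undef if $depth >= 6;
    for my $rule (@rules) {
        my ($pat, $rep) = @$rule;
        (my $candidate = $word) =~ s/$pat/$rep->()/e;
        next if $candidate eq $word;              # rule did not apply
        my $found = lookup($candidate, $depth + 1);
        return $found if defined $found;
    }
    return undef;
}

print lookup("Anaesthesia", 0), "\n";             # prints "anesthesia"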
False positives can be avoided by placing words that are not morphologically productive in the file nocombo-xx.txt.
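For example, if morph-en.txt had a rule stripping a plural -s, then listing mass nouns in a hypothetical nocombo-en.txt (one word per line; the format here is an assumption) would prevent forms like "informations" from being accepted:
information
equipment
furniture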
[1] This is a white lie; the legal numerical codes are, in actuality, precisely those positive integers corresponding to Unicode code points. This means there are more than a million possible codes (but it also means that you need to avoid the so-called surrogates, 55296 to 57343). Hopefully no one will ever need to know this.