An Gramadóir: Developers' Guide
Chapter 3. A tour of the language pack
The remainder of the files in the language pack require less attention, and some can be ignored entirely.
The grammar checker performs simple segmentation of the input text into sentences. You can customize this for your language by editing the files giorr-xx.txt and giorr-xx.pre. The default language pack uses statistical methods to extract likely abbreviations from a text corpus (i.e. words that appear almost exclusively followed by a period "."). You'll find these in giorr-xx.txt. You may also want to uncomment the lines in giorr-xx.pre so that one-letter abbreviations are escaped properly. Any other unusual conventions for ends of sentences should be encoded here.
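To make the abbreviation heuristic concrete, here is a minimal sketch of one way to find words that are almost always followed by a period. It is not the method used to build giorr-xx.txt, whose details are not given here; the function name `likely_abbreviations` and the `min_count` and `min_ratio` parameters are assumptions chosen for illustration.

```python
import re
from collections import Counter


def likely_abbreviations(text, min_count=3, min_ratio=0.95):
    """Return words followed by "." in at least min_ratio of their occurrences.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the language pack.
    """
    total = Counter()
    with_period = Counter()
    # Each match is a word plus an optional period immediately after it.
    for word, dot in re.findall(r"(\w+)(\.?)", text):
        total[word] += 1
        if dot:
            with_period[word] += 1
    # Keep words seen often enough that the ratio is meaningful.
    return sorted(
        w for w, n in total.items()
        if n >= min_count and with_period[w] / n >= min_ratio
    )
```

Words that merely end a sentence once or twice are filtered out by `min_count`; on a real corpus both thresholds would need tuning, and the resulting list should be reviewed by hand.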
You can specify how the grammar checker tokenizes the input stream by adding rules to the file token-xx.in. You can use this to deal with URLs, email addresses, monetary amounts, ordinals (1st, 2nd, ...), etc. in a clean, uniform way. The syntax of this file looks like:
regex:tag
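As a rough illustration of how regex:tag rules can be read and applied, the sketch below parses such rules and tags a single token. Everything specific in it is an assumption: the tag names `URL` and `ORD`, the two example patterns, the choice to split each rule on its last colon, and the use of Python regular expressions (the engine's own regex dialect and tag set may differ).

```python
import re

# Hypothetical rules in regex:tag form; patterns and tag names are made up.
EXAMPLE_RULES = [
    r"https?://\S+:URL",
    r"[0-9]+(st|nd|rd|th):ORD",
]


def parse_rules(lines):
    """Turn regex:tag lines into (compiled pattern, tag) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: split on the LAST colon, because the regex part
        # may itself contain colons (as in the URL pattern above).
        pattern, tag = line.rsplit(":", 1)
        rules.append((re.compile(pattern), tag))
    return rules


def tag_token(token, rules):
    """Return the tag of the first rule matching the whole token, else None."""
    for pattern, tag in rules:
        if pattern.fullmatch(token):
            return tag
    return None
```

For instance, `tag_token("2nd", parse_rules(EXAMPLE_RULES))` yields the hypothetical tag `ORD`, while an ordinary word matches no rule and yields `None`.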
3grams-xx.txt and freq-xx.txt. The first of these files contains a list of all 3-character sequences appearing above a certain frequency threshold in the corpus generated by my web crawler. Zero-width word boundaries are treated as characters in these counts; the initial word boundary is denoted "<" and the terminal word boundary is denoted ">". This is (currently) just used to improve the error messages when unknown words are encountered; if the word contains "suspect" 3-grams, An Gramadóir will report that it is "possibly a foreign word". Eventually I'd like to allow An Gramadóir to skip over entire sentences or paragraphs that it detects as being written in a different language. The file freq-xx.txt contains frequency counts from the same corpus for all words appearing in lexicon-xx.txt. There should be no reason to edit either file directly. Periodically I can provide updates if the web corpora grow substantially, or if new words are added to the lexicon.
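The 3-gram convention described above can be sketched in a few lines. This is only an illustration of the boundary markers and the idea of "suspect" 3-grams, not the engine's actual code; the function names and the use of `>=` for the frequency threshold are assumptions.

```python
from collections import Counter


def trigrams(word):
    """List the 3-grams of a word, with "<" and ">" as word boundaries."""
    padded = "<" + word + ">"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def frequent_trigrams(words, threshold):
    """Collect 3-grams occurring at least `threshold` times in a word list."""
    counts = Counter(g for w in words for g in trigrams(w))
    return {g for g, n in counts.items() if n >= threshold}


def suspect_trigrams(word, known):
    """Return the 3-grams of `word` that are absent from the known set."""
    return [g for g in trigrams(word) if g not in known]
```

Under this sketch, the word "agus" contributes the four 3-grams `<ag`, `agu`, `gus` and `us>`, and an unknown word whose 3-grams mostly fall outside the known set would be a candidate for the "possibly a foreign word" message.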
You only need to worry about the Changes file before releasing a version of the Perl module; this gets copied as-is into Lingua-XX-Gramadoir. The README and COPYING files only need to be changed if you prefer a license other than the GPL. Also be aware that you have control over the licensing of two distinct packages: the language pack and the generated Perl module. The default behavior is to generate a README for the Perl module which says something like "Lingua-XX-Gramadoir is released under the same terms as the gramadoir-xx package from which it was built:", and then to include the language pack README verbatim. Let me know if you need to do something more complicated.
The file configure is run once, when you first set up the language pack, to create the Makefile with all of the targets you'll need for day-to-day development. Only a couple of lines of this script should ever need to be edited. The first variable, GRAMDIR, should point to the directory containing the gramadoir "engine". If you're working out of CVS, this should probably be ../engine. The next line contains the version number and should be edited when you want to create a new release of the language pack and/or Perl module.