An Gramadóir: Developers' Guide
Chapter 3. A tour of the language pack

3.3. An assortment of less important files

The remainder of the files in the language pack require less attention, and some can be ignored entirely.

3.3.1. Segmentation

The grammar checker performs simple segmentation of the input text into sentences. You can customize this for your language by editing the files giorr-xx.txt and giorr-xx.pre. The default language pack uses statistical methods to extract likely abbreviations from a text corpus (i.e. words that appear almost exclusively followed by a period "."). You'll find these in giorr-xx.txt. You may also want to uncomment the lines in giorr-xx.pre so that one-letter abbreviations are escaped properly. Any other unusual conventions for ends of sentences should be encoded here as well.
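The idea of "words that appear almost exclusively followed by a period" can be sketched in a few lines. This is only an illustration in Python, not the procedure actually used to build giorr-xx.txt; the function name likely_abbreviations and the thresholds min_count and min_ratio are assumptions made for the example.

```python
import re
from collections import Counter

def likely_abbreviations(text, min_count=3, min_ratio=0.9):
    """Return words followed by "." in at least min_ratio of their occurrences.

    Hypothetical helper: min_count and min_ratio are illustrative thresholds,
    not values taken from An Gramadóir.
    """
    total = Counter()
    with_period = Counter()
    # Each match is a word plus an optional period directly after it.
    for word, dot in re.findall(r"(\w+)(\.?)", text):
        total[word] += 1
        if dot:
            with_period[word] += 1
    return sorted(
        word for word, n in total.items()
        if n >= min_count and with_period[word] / n >= min_ratio
    )

sample = ("Dr. Smith met Dr. Jones. Dr. Brown saw Smith and Jones and Brown. "
          "Then Smith left.")
print(likely_abbreviations(sample))  # prints ['Dr']
```

In the sample, "Smith" is just as frequent as "Dr" but is never followed by a period, so only "Dr" is reported. On a real corpus the thresholds would need tuning, and the resulting list would still deserve a manual review.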

3.3.2. Tokenization

You can specify how the grammar checker tokenizes the input stream by adding rules to the file token-xx.in. You can use this to deal with URLs, email addresses, monetary amounts, ordinals (1st, 2nd, ...), etc. in a clean, uniform way. Each rule in this file has the form:

regex:tag

The rules are applied in the order they are specified in the file. Applying a rule amounts to matching the regular expression globally in the input and surrounding the matched text with the specified tag. The regular expression will not match within or across already-recognized tokens, so you will want to give rules for longer, more complicated tokens like URLs first.
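The effect of rule order can be seen in a small simulation. The following Python sketch is not the engine's implementation. It assumes that a rule line is split at its last colon (so that the regex itself may contain colons), and the tag names URL and NUM and the angle-bracket output markup are hypothetical.

```python
import re

def parse_rule(line):
    """Split a "regex:tag" line at its LAST colon (an assumption of this sketch)."""
    pattern, _, tag = line.rpartition(":")
    return pattern, tag

def apply_rules(text, rules):
    """Apply (regex, tag) rules in order; tagged tokens are never re-examined."""
    segments = [(text, None)]  # (chunk, tag); tag None means "not yet a token"
    for pattern, tag in rules:
        regex = re.compile(pattern)
        next_segments = []
        for chunk, chunk_tag in segments:
            if chunk_tag is not None:
                # Already recognized: later rules cannot match inside or across it.
                next_segments.append((chunk, chunk_tag))
                continue
            pos = 0
            for m in regex.finditer(chunk):
                if m.start() == m.end():
                    continue  # skip empty matches
                if m.start() > pos:
                    next_segments.append((chunk[pos:m.start()], None))
                next_segments.append((m.group(0), tag))
                pos = m.end()
            if pos < len(chunk):
                next_segments.append((chunk[pos:], None))
        segments = next_segments
    return segments

def render(segments):
    """Show tokens with illustrative <TAG>...</TAG> markup."""
    return "".join(
        chunk if tag is None else f"<{tag}>{chunk}</{tag}>"
        for chunk, tag in segments
    )

# Hypothetical rules: the longer, more complicated token (the URL) comes first.
rules = [parse_rule(r"https?://\S+:URL"), parse_rule(r"\d+:NUM")]
text = "See http://example.com/page2 on the 3 pages"
print(render(apply_rules(text, rules)))
# prints: See <URL>http://example.com/page2</URL> on the <NUM>3</NUM> pages
```

If the two rules are swapped, the digit rule claims the "2" in "page2" first, and the URL rule can then only match the truncated "http://example.com/page". This is the reason for listing rules for longer tokens first.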

3.3.3. Files to leave alone

3grams-xx.txt and freq-xx.txt. The first of these files contains a list of all 3-character sequences appearing above a certain frequency threshold in the corpus generated by my web crawler. Zero-width word boundaries are treated as characters in these counts; the initial word boundary is denoted "<" and the terminal word boundary is denoted ">". This is (currently) just used to improve the error messages when unknown words are encountered; if the word contains "suspect" 3-grams, An Gramadóir will report that it is "possibly a foreign word". Eventually I'd like to allow An Gramadóir to skip over entire sentences or paragraphs that it detects as being written in a different language. The file freq-xx.txt contains frequency counts from the same corpus for all words appearing in lexicon-xx.txt. There should be no reason to edit either file directly. Periodically I can provide updates if the web corpora grow substantially, or if new words are added to the lexicon.
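To make the boundary convention concrete, here is a minimal Python sketch of how 3-grams with "<" and ">" markers might be extracted and checked. The function names, the tiny word list, and the rule "a 3-gram is suspect if it is not in the frequent set" are all assumptions made for illustration; the text above does not specify how the engine decides that a 3-gram is suspect.

```python
from collections import Counter

def trigrams(word):
    """3-character sequences of a word, with "<" and ">" as word boundaries."""
    padded = "<" + word + ">"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def frequent_trigrams(words, min_count=2):
    """3-grams seen at least min_count times (min_count is an illustrative threshold)."""
    counts = Counter(g for w in words for g in trigrams(w))
    return {g for g, n in counts.items() if n >= min_count}

def suspect_trigrams(word, known):
    """3-grams of the word that are absent from the frequent set."""
    return [g for g in trigrams(word) if g not in known]

print(trigrams("bean"))                 # prints ['<be', 'bea', 'ean', 'an>']
known = frequent_trigrams(["bean", "fear", "bear"])
print(suspect_trigrams("bear", known))  # prints []
print(suspect_trigrams("xyz", known))   # prints ['<xy', 'xyz', 'yz>']
```

A real list would of course be built from a large corpus rather than three words, which is why the file is generated for you instead of being edited by hand.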
You only need to worry about the Changes file before releasing a version of the Perl module; this gets copied as-is into Lingua-XX-Gramadoir. The README and COPYING files only need to be changed if you prefer a license other than the GPL. Also be aware that you have control over the licensing of two distinct packages: the language pack and the generated Perl module. The default behavior is to generate a README for the Perl module which says something like "Lingua-XX-Gramadoir is released under the same terms as the gramadoir-xx package from which it was built:", and then to include the language pack README verbatim. Let me know if you need to do something more complicated.
The file configure is run once when you first set up the language pack; it creates the Makefile with all of the targets you'll need for day-to-day development. Only a couple of lines of this script should ever need to be edited. The first variable, GRAMDIR, should point to the directory containing the gramadoir "engine". If you're working out of CVS, this should probably be ../engine. The next line contains the version number and should be edited when you want to create a new release of the language pack and/or Perl module.
