An Gramadóir: Developers' Guide
Chapter 3. A tour of the language pack
The grammar checker per se is generated from three input files that share the same basic syntax, to be described in the sections below. Complicated "meta" scripts convert these (more or less) human-readable files into the Perl scripts which actually find and mark up the grammatical errors.
The structure of all three input files is essentially the same. I've included a flex/bison parser in the distribution that can be used for error-checking these files during development (see the poncin target in the Makefile). Also, those who might prefer a formal (BNF-like) grammar can look at the files ponc.in.l and ponc.in.y.
Lines beginning with a # or lines containing only whitespace are ignored. All other lines contain "rules", which are structured as follows:
phrase:action
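As a rough illustration of this line format (not the flex/bison parser shipped with the distribution), the following Python sketch skips comments and blank lines and splits each remaining line into its phrase and action. The helper name parse_rules is hypothetical, and splitting at the last colon is an assumption made here for simplicity.

```python
def parse_rules(text):
    """Return a list of (phrase, action) pairs from the text of a rule file.

    Assumption: the action is everything after the LAST colon on the line.
    """
    rules = []
    for line in text.splitlines():
        # Lines beginning with '#' or containing only whitespace are ignored.
        if line.startswith("#") or not line.strip():
            continue
        phrase, sep, action = line.rpartition(":")
        if not sep:
            raise ValueError("not a rule (no colon found): " + line)
        rules.append((phrase, action))
    return rules


sample = "# a comment\n\n<T>ANYTHING</T> <[^AN]>ANYTHING</[^AN]>:CUPLA\n"
rules = parse_rules(sample)
```

Here rules holds a single pair: the phrase up to the final colon, and the action CUPLA.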
A phrase is a simplified description of the regular expression you want to match in the marked up text stream. The phrase syntax is the same for all three files: one or more words, optionally surrounded by tags, separated by single spaces. A word can either be an explicit regular expression (e.g. [Aa]ch to match upper or lowercase ach) or one of a collection of macros defined in the file macra-xx.meta.pl (e.g. for Irish LENITEDDFST expands to the regular expression [DdFfSsTt]h[^<]*). Complicated regular expressions should be defined as macros in macra-xx.meta.pl; simple expressions such as optional substrings or alternation are fine. In such cases, you should avoid using "non-capturing parentheses" and use plain (capturing) parentheses; the conversion scripts will treat these correctly when generating the final Perl code.
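The distinction between explicit regular expressions and macros can be sketched in a few lines of Python. The table below is only a stand-in for macra-xx.meta.pl: the LENITEDDFST expansion is the one quoted above, while the ANYTHING expansion and the helper name expand_word are assumptions for illustration.

```python
import re

# Hypothetical stand-in for the macro table in macra-xx.meta.pl.
MACROS = {
    "LENITEDDFST": r"[DdFfSsTt]h[^<]*",  # expansion quoted in the text
    "ANYTHING": r"[^<]+",                # assumed expansion
}


def expand_word(word):
    """Return the regular expression for one word of a phrase.

    A known macro name is replaced by its expansion; anything else is
    taken to be an explicit regular expression and returned unchanged.
    """
    return MACROS.get(word, word)


lenited = re.fullmatch(expand_word("LENITEDDFST"), "theach")
unlenited = re.fullmatch(expand_word("LENITEDDFST"), "teach")
explicit = re.fullmatch(expand_word("[Aa]ch"), "Ach")
```

The first and third matches succeed; the second fails because teach lacks the h after its initial consonant.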
The real power comes from being able to specify part of speech tags as regular expressions; these take one of the following forms:
No tag at all. Something like direct will match any appearance of the word direct, ignoring part of speech tags. This can be useful in the disambiguation module before all parts of speech have been resolved. Otherwise, if you know what the part of speech ought to be, it is clearer to specify it (and faster too).
A literal tag. For instance <T>[Aa]n</T> refers specifically to the Irish word an as the definite article and not, say, as an interrogative particle. There is an important distinction here between tags in pos-xx.txt that sometimes admit attributes (often true for nouns and verbs, e.g. <N gender="m" pl="y">, <N gender="m" pl="n">) and those that do not (often "closed-class" parts of speech like prepositions <S>). In the first case, it is still legal to mark up a word with something like <N>direction</N>, but internally this expands to a macro that matches any noun tag, ignoring attributes.
A macro tag. More generally, you can define your own macros in macra-xx.meta.pl in order to match some property in the attributes. For example, if you specify a "gender" attribute for nouns (in addition to some others like singular vs. plural, etc.) it is natural to define a macro <NM> that can be used to match any masculine noun, independent of the other attributes.
Tags with simple character classes. You can specify a range of tags using the usual regular expression (square bracket) notation: <[AN]> matches any adjective or noun tag.
In Irish, we have the rule:
<V cop="y">[Bb]a</V> <[AN]>UNLENITED</[AN]>:SEIMHIU
which requires that both nouns and adjectives be lenited following the past copula ba.
Negation is allowed as well; this rule flags a definite article preceding a word that is not a noun or adjective:
<T>ANYTHING</T> <[^AN]>ANYTHING</[^AN]>:CUPLA
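To make the tag forms concrete, here is a small Python sketch of how the CUPLA rule might look as a regular expression over the marked-up text. It is hand-written for illustration and is not the Perl produced by the conversion scripts; the helper tagged and the expansion chosen for ANYTHING are assumptions.

```python
import re

ANYTHING = r"[^<]+"  # assumed expansion of the ANYTHING macro


def tagged(tag, word):
    """Regex for `word` wrapped in `tag`.

    `tag` is a tag name or a character class such as "[AN]" or "[^AN]".
    Any attributes on the opening tag are ignored.
    """
    return rf"<{tag}(?: [^>]*)?>{word}</{tag}>"


# <T>ANYTHING</T> <[^AN]>ANYTHING</[^AN]>:CUPLA
cupla = re.compile(tagged("T", ANYTHING) + " " + tagged("[^AN]", ANYTHING))

# Article followed by a verb: the rule should fire.
flagged = cupla.search('<T>an</T> <V p="y" t="caite">bhí</V>')
# Article followed by a noun: the rule should stay silent.
clean = cupla.search('<T>an</T> <N gender="m" pl="n">fear</N>')
```

The same helper covers the literal form (tagged("T", ...)) and the character-class forms; a macro tag such as <NM> would need its own expansion that inspects the attributes.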
The file comhshuite-xx.in is the simplest of the three; each line contains a multiword "set phrase" in the phrase portion of the rule, followed by the part of speech tag that should be assigned to the given phrase as the action portion. For instance, the phrase le haghaidh appears in the Irish version, followed by the single (opening) part of speech tag <S>, indicating that it is to be treated as a preposition. Since this filter is applied before any disambiguation occurs, the phrase portion should consist of the words to be lumped together with no additional markup.
Dealing with idiomatic expressions in this way improves the performance of the part-of-speech tagger (in terms of both speed and accuracy). It also allows us to report an error when a word which is almost always used in a set phrase is mistakenly used in some other context.
The file aonchiall-xx.in contains rules for disambiguating parts of speech; for instance, the word an in Irish can either be the definite article or an interrogative particle. You will find a sequence of rules in aonchiall-ga.in which indicate, for instance, that if an is followed by a verb, preposition, or pronoun, we expect it to be an interrogative (and in most other cases it is the article). This kind of disambiguation is obviously a necessary preliminary step before one can try to apply grammatical rules depending on part of speech.
More specifically, the phrase portion of a rule in aonchiall-xx.in is required to contain a single word marked up with <B></B>. Naturally, this is the word to disambiguate and the phrase is the context in which the disambiguation is to occur. The full syntax used by An Gramadóir for an ambiguous word looks something like this:
<B><Z><J/><R/><V/></Z>direct</B>
with the list of all possible part of speech tags given within the <Z> markup (note that trailing slashes are required on these tags to ensure valid XML). If you don't care about matching the part of speech tags for an ambiguous word, it is acceptable to leave out the <Z> markup entirely:
<B>direct</B>
will match any ambiguous instance of the word "direct". It is also common to define macros to match certain regular expressions in the part of speech tags; for example, one could define a macro ANYNOUN to match any sequence of tags containing <N[^>]*/>; then the following will match all ambiguous words that can possibly be resolved as nouns:
<B><Z>ANYNOUN</Z>ANYTHING</B>
As in comhshuite-xx.in, the "action" portion consists of a single part of speech tag, representing the disambiguated part of speech when the given phrase is matched.
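The effect of such a rule can be sketched in Python as follows. This is only an approximation of the idea, not the generated Perl: the body chosen for ANYNOUN, the expansion of ANYTHING, and the helper resolve are all assumptions, and the sketch uses non-capturing groups purely to keep its own group numbering simple.

```python
import re

ANYTHING = r"[^<]+"                           # assumed expansion
ANY_TAGS = r"(?:<[^>]*/>)*"                   # zero or more candidate tags
ANYNOUN = ANY_TAGS + r"<N[^>]*/>" + ANY_TAGS  # hypothetical macro body


def resolve(text, z_pattern, word_pattern, tag_name):
    """Rewrite matching ambiguous words with the single tag `tag_name`."""
    pattern = rf"<B><Z>{z_pattern}</Z>({word_pattern})</B>"
    return re.sub(
        pattern,
        lambda m: f"<{tag_name}>{m.group(1)}</{tag_name}>",
        text,
    )


# One candidate is a noun tag, so the word is resolved.
resolved = resolve('<B><Z><N pl="n"/><V/></Z>walk</B>', ANYNOUN, ANYTHING, "N")
# No noun tag among the candidates, so the text is left alone.
untouched = resolve("<B><Z><J/><R/><V/></Z>direct</B>", ANYNOUN, ANYTHING, "N")
```

A real rule would normally also carry context words around the <B> element; they are omitted here to keep the example short.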
The rules specified in aonchiall-xx.in are applied (in the order they appear) twice. The second pass is quite useful for Irish, since it allows rules to apply in cases where the contextual parts of speech were only disambiguated during the first pass.
The latest versions of the engine admit an extension that allows certain part-of-speech tags to be excluded in a given context. This is done by prefixing an exclamation point (!) to the action portion of the rule. So, for example, this rule (for Irish) indicates that eclipsed words should not be tagged as past tense verbs:
<B><Z>ANYPAST</Z>ECLIPSED</B>:!<V p="y" t="caite">
Note also that the action portion can be given as a regular expression, and all matching tags will be eliminated from consideration:
[Dd]o <B><Z>ANYVERB</Z>ANYTHING</B>:!<V[^>]+>
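A minimal Python sketch of the exclusion idea follows. It assumes that the pattern after the exclamation point is compared against each whole candidate tag inside <Z>; how the engine really performs the comparison is not shown in this chapter, and the helper name exclude is hypothetical.

```python
import re


def exclude(z_content, action_regex):
    """Drop every candidate tag in `z_content` that matches `action_regex`.

    `z_content` is the text between <Z> and </Z>, i.e. a run of
    self-closing candidate tags.
    """
    candidates = re.findall(r"<[^>]*/>", z_content)
    kept = [t for t in candidates if not re.fullmatch(action_regex, t)]
    return "".join(kept)


# Action "!<V[^>]+>": remove every verb tag from the candidates.
remaining = exclude('<N pl="n"/><V p="y" t="caite"/>', r"<V[^>]+>")
```

After this call only the noun candidate is left, so the word is no longer ambiguous.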
If an ambiguity is not resolved after two passes through aonchiall-xx.in, then the default behavior is to simply assign the candidate tag with the highest overall frequency in your language. The file unigram-xx.txt consists of a list of the legal part-of-speech tags for your language, sorted by frequency from highest to lowest. Sometimes it helps in disambiguation to "lump together" several tags (e.g. by stripping attributes that have no use in grammar checking). This can be achieved by placing appropriate substitutions in unigram-xx.pre. After you have a first version up and running, you can create or update unigram-xx.txt with this command:
$ cat big.txt | gramdev-xx.pl --minic > unigram-xx.txt
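The frequency fallback amounts to picking the candidate that appears earliest in the unigram list. A short Python sketch, assuming the list has already been read into memory with one tag per entry (the helper name and the sample tags are made up for the example):

```python
def pick_default(candidates, unigram_order):
    """Return the candidate tag that ranks highest in `unigram_order`.

    `unigram_order` lists tags from most to least frequent. Candidates
    missing from the list are treated as least frequent.
    """
    rank = {tag: i for i, tag in enumerate(unigram_order)}
    return min(candidates, key=lambda tag: rank.get(tag, len(rank)))


# Hypothetical frequency order, most frequent first.
order = ["<N>", "<V>", "<A>"]
choice = pick_default(["<A>", "<V>"], order)
```

With this order the ambiguous word would be tagged as a verb, since <V> comes before <A> in the list.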
In fact, it is even possible for An Gramadóir to apply statistical methods to help find candidate rules for aonchiall-xx.in. I've implemented the algorithm from Eric Brill's paper Unsupervised learning of disambiguation rules for part of speech tagging so that the output is suitable for use in aonchiall-xx.in (and so that the highest-scoring rules come first). Run it as follows:
$ cat big.txt | gramdev-xx.pl --brill > rules.txt
The file rialacha-xx.in contains the grammatical rules proper, and lists any exceptions to these rules.
The phrase portion of a rule in rialacha-xx.in is converted to a regular expression which matches a grammatical error. The action portion consists simply of a macro which expands to the error message you want to be displayed when the rule applies. These macros are defined in messages.txt. Perhaps the most common rule for Irish is SEIMHIU which expands to "Séimhiú ar iarraidh" ("Missing lenition"). Certain macros can also take a parameter inside curly braces: the action BACHOIR{ina} expands to "Ba chóir duit /ina/ a úsáid anseo" ("You ought to use /ina/ here") with the parameter inserted between the slashes.
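The expansion of an action into its message, including the optional parameter, can be sketched like this in Python. The dictionary is a hypothetical in-memory stand-in (the actual layout of messages.txt is not described here), and the "{}" placeholder convention is an assumption of the sketch.

```python
import re

# Hypothetical stand-in for messages.txt; "{}" marks the parameter slot.
MESSAGES = {
    "SEIMHIU": "Séimhiú ar iarraidh",
    "BACHOIR": "Ba chóir duit /{}/ a úsáid anseo",
}


def expand_action(action):
    """Turn an action such as SEIMHIU or BACHOIR{ina} into its message."""
    m = re.fullmatch(r"(\w+)(?:\{([^}]*)\})?", action)
    if m is None or m.group(1) not in MESSAGES:
        raise ValueError("unknown action: " + action)
    template = MESSAGES[m.group(1)]
    param = m.group(2)
    # Insert the parameter, if one was given, into the message template.
    return template if param is None else template.replace("{}", param)


plain = expand_action("SEIMHIU")
with_param = expand_action("BACHOIR{ina}")
```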
Two very important rules are included in the default language pack:
<X>ANYTHING</X>:UNKNOWN
<F>ANYTHING</F>:UNCOMMON
Words not found in the lexicon are marked up with the tag <X>, and so the first rule reports such words as "unknown". Words that are found in the lexicon, but appear there with part of speech code 127 (see above), are given the special tag <F> and so the second rule reports these as "uncommon".
In earlier versions, the exceptions were kept in a separate input file called eisceacht-xx.in. We now find it more convenient to store the exceptions together with the rules to which they apply in the file rialacha-xx.in. Following each rule, one has the option of including a block of patterns representing exceptions to the rule that are actually grammatical and should not be reported as errors. For instance, the word dhá ("two") causes lenition in general, but not, for instance, when preceded by the possessive adjective ár. To implement this exception, it is placed in rialacha-ga.in immediately following the general rule, and with the action portion of the rule set to OK:
<A>[Dd]há</A> UNLENITED:SEIMHIU
<D>[Áá]r</D> <E><A>[Dd]há</A> UNLENITED</E>:OK
When the exception requires more context than the rule itself, as in this example, the words corresponding to the rule must be enclosed within <E> tags to avoid potential ambiguities. You can specify as many exceptions as you like to a single rule, but note that exceptions only apply to the rule that they follow.
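The interplay between a rule and its exceptions can be sketched in Python as follows. The regular expressions are written by hand for the dhá example and are not what the conversion scripts generate; the UNLENITED pattern is a crude stand-in for the real macro, and the capture group in the exception plays the role of the <E> markup.

```python
import re

# Crude, hypothetical stand-in for the UNLENITED macro.
UNLENITED = r"[BbCcDdFfGgMmPpSsTt][^h<][^<]*"
ANY_TAGGED = r"<[^>]+>" + UNLENITED + r"</[^>]+>"

# <A>[Dd]há</A> UNLENITED:SEIMHIU
RULE = r"<A>[Dd]há</A> " + ANY_TAGGED
# <D>[Áá]r</D> <E><A>[Dd]há</A> UNLENITED</E>:OK  (group 1 is the <E> span)
EXCEPTIONS = [r"<D>[Áá]r</D> (<A>[Dd]há</A> " + ANY_TAGGED + r")"]


def find_errors(text, rule, exceptions):
    """Return rule matches that are not covered by any exception."""
    ok_spans = {
        m.span(1) for exc in exceptions for m in re.finditer(exc, text)
    }
    return [
        m.group(0) for m in re.finditer(rule, text) if m.span() not in ok_spans
    ]


reported = find_errors("<A>dhá</A> <N>teach</N>", RULE, EXCEPTIONS)
excused = find_errors("<D>ár</D> <A>dhá</A> <N>dteach</N>", RULE, EXCEPTIONS)
```

In the first call the rule fires and the match is reported; in the second the exception covers exactly the same span, so nothing is reported.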
It is a good idea to include one or more sample sentences for each rule in rialacha-xx.in. These are given on lines beginning with #., which for Irish I usually put on the line directly preceding the rule that they illustrate. When you build the Perl module, these sentences are extracted into the plain text file triail in the language pack directory, and are also used to generate a test script for the Perl module. The "expected output" when the grammar checker is applied to triail is stored in triail.xml. The command:
$ make test
will rebuild the Perl module and test scripts (if necessary), and then compare the results of checking triail with the contents of triail.xml, complaining with great bitterness when they differ. When new sample sentences are added, you'll need to update triail.xml; use
$ make triail.xml-update
to do this (but be sure before you update that you haven't accidentally broken any other rules).
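For readers who want to see which sentences end up in triail, the extraction step described above is easy to approximate. This Python sketch is not the build script itself; it assumes only that a sample sentence is whatever follows the #. prefix on its line, and the sample text in it is a placeholder.

```python
def extract_samples(rules_text):
    """Collect the sample sentences from the text of a rules file.

    Sample sentences are the lines beginning with "#."; the prefix and
    any surrounding whitespace are stripped.
    """
    return [
        line[2:].strip()
        for line in rules_text.splitlines()
        if line.startswith("#.")
    ]


# Placeholder rules text; the sentence is not a real test sentence.
text = (
    "# ordinary comment\n"
    "#. A sample sentence goes here.\n"
    "<X>ANYTHING</X>:UNKNOWN\n"
)
samples = extract_samples(text)
```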