An Gramadóir: Developers' Guide

Chapter 1. An Overview
The first version of An Gramadóir was written as a (pretty simple-minded) sed script consisting entirely of substitutions:
s/de [bcdfgmpt][^h][^ ]*/<E msg="lenition">&<E>/g;
s/de s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&<E>/g;
s/mo [aeiouáéíóú][^h][^ ]*/<E msg="apostrophe">&<E>/g;
s/mo [bcdfgmpt][^h][^ ]*/<E msg="lenition">&<E>/g;
s/mo s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&<E>/g;
s/sa [bcfgmp][^h][^ ]*/<E msg="lenition">&<E>/g;
The latest versions are written in Perl and are infinitely more intelligent, though I've maintained this essentially "stateless" design. [1] The input text is passed through a series of filters, each of which adds some XML markup. I'll illustrate this with a trivial English language example.
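Schematically, the whole program is just a chain of such filters. The toy Perl snippet below (hypothetical code, not gram-xx.pl itself) shows the shape of that chain; the real stages are walked through one by one in the steps that follow:

use strict;
use warnings;

# Each filter maps the whole text to the same text with more markup added,
# so all state travels in the stream itself.
sub strip_markup { my ($t) = @_; $t =~ s/<[^>]*>//g; $t }     # toy preprocessor
sub wrap_line    { my ($t) = @_; "<line>$t</line>" }          # toy segmenter

my @pipeline = (\&strip_markup, \&wrap_line);
my $text = "A <b>umpire</b>.";
$text = $_->($text) for @pipeline;
print "$text\n";    # <line>A umpire.</line>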
Preprocessing. Each language has a native character encoding that is used internally by An Gramadóir to represent the lexicon and rule sets. It is also the default input encoding for the interface script gram-xx.pl. If text in another encoding is passed to gram-xx.pl, it is converted to the native encoding in the preprocessing step. Also, if the input text contains any SGML-style markup, it will be removed at this stage; otherwise it can interfere with the XML markup inserted by An Gramadóir. In the example below, the preprocessor will simply strip the <b> markup:

A <b>umpire</b>. The status quo. --> A umpire. The status quo.
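A minimal Perl sketch of this step (illustrative only; gram-xx.pl does more, and the encoding names here are just examples):

use strict;
use warnings;
use Encode qw(decode encode);

sub preprocess {
    my ($text, $input_enc, $native_enc) = @_;
    # convert to the native encoding if the input arrived in something else
    $text = encode($native_enc, decode($input_enc, $text))
        if $input_enc ne $native_enc;
    # strip any pre-existing SGML-style markup
    $text =~ s/<[^>]*>//g;
    return $text;
}

print preprocess('A <b>umpire</b>. The status quo.', 'iso-8859-1', 'utf-8'), "\n";
# --> A umpire. The status quo.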
Segmentation. This step breaks the text up into sentences, each of which is marked up with a <line> tag:

A umpire. The status quo. --> <line>A umpire.</line> <line>The status quo.</line>
See Section 3.3.1 for more information on how this is implemented.
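For illustration, a naive segmenter could be written as below; the assumption that every ".", "!", or "?" ends a sentence is purely for the sketch, not the real implementation's:

use strict;
use warnings;

sub segment {
    my ($text) = @_;
    # a "sentence" is any run of text ending in ".", "!" or "?"
    my @sentences = $text =~ /\S[^.!?]*[.!?]+/g;
    return join ' ', map { "<line>$_</line>" } @sentences;
}

print segment('A umpire. The status quo.'), "\n";
# --> <line>A umpire.</line> <line>The status quo.</line>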
Tokenization. Next, each sentence is broken up into words, each of which is marked up with a <c> tag:

<line>A umpire.</line> <line>The status quo.</line> --> <line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>
See Section 3.3.2 for information on how to specify language-specific tokenization rules.
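A crude sketch of the idea (the regular expression defining a "word" here is illustrative only; real languages need the rules from Section 3.3.2):

use strict;
use warnings;

sub tokenize_words {
    my ($sentence) = @_;
    $sentence =~ s/([A-Za-z']+)/<c>$1<\/c>/g;    # crude notion of a "word"
    return $sentence;
}

sub tokenize {
    my ($marked) = @_;
    $marked =~ s{<line>(.*?)</line>}{'<line>' . tokenize_words($1) . '</line>'}ge;
    return $marked;
}

print tokenize('<line>A umpire.</line> <line>The status quo.</line>'), "\n";
# --> <line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>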
Lookup. Next, each word is looked up in the lexicon. Unambiguous words are tagged with their correct part of speech, while ambiguous words are assigned a more complicated markup involving all of their possible parts of speech (e.g. umpire in the example, which can be, a priori, either a noun or a verb). Words that aren't found in the lexicon are sent to the morphology engine in the hope of recognizing them as morphological variants of some known word.
<line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>
--> <line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>
See Section 3.1.2 and Section 3.1.4 for more information on how words are stored and recognized by the morphology engine.
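An idealized sketch of the lookup, with a tiny hard-coded hash standing in for the real lexicon and the morphology fallback omitted; the tag names are taken from the worked example above, and the <X> tag for unknown words is only a placeholder for this sketch:

use strict;
use warnings;

my %lexicon = (
    'a'      => [ '<T def="n">' ],
    'the'    => [ '<T def="y">' ],
    'umpire' => [ '<N/>', '<V/>' ],   # ambiguous: noun or verb
    'status' => [ '<N>' ],
    'quo'    => [ '<F>' ],            # only valid inside a set phrase
);

sub lookup_word {
    my ($word) = @_;
    my $tags = $lexicon{lc $word} or return "<X>$word</X>";   # unknown word
    if (@$tags == 1) {
        my ($name) = $tags->[0] =~ /^<(\w+)/;
        return "$tags->[0]$word</$name>";
    }
    # ambiguous word: list every candidate part of speech inside <B><Z>...</Z>
    return '<B><Z>' . join('', @$tags) . "</Z>$word</B>";
}

print lookup_word('umpire'), "\n";   # <B><Z><N/><V/></Z>umpire</B>
print lookup_word('status'), "\n";   # <N>status</N>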
Chunking. In this step, certain "set phrases" are lumped together to be treated as single units by the grammar checker. In the present example, the word "quo" is marked up with the special tag <F>, which would lead to a warning from the grammar checker unless, as is the case here, it appears in a known set phrase. This is a useful trick.
<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>
--> <line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
See Section 3.2.2 for how to specify these chunks for your language.
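One way to picture the chunking step is as a table of set phrases driving simple merges of adjacent tagged words; the phrase list and the regular expression below are illustrative only, not the format described in Section 3.2.2:

use strict;
use warnings;

my %set_phrases = ( 'status quo' => 'N' );    # phrase => tag of the merged chunk

sub chunk {
    my ($marked) = @_;
    for my $phrase (keys %set_phrases) {
        my ($w1, $w2) = split ' ', $phrase;
        my $tag = $set_phrases{$phrase};
        # merge two adjacent tagged words into a single tagged unit
        $marked =~ s{<\w+[^>]*>\Q$w1\E</\w+>\s*<\w+[^>]*>\Q$w2\E</\w+>}{<$tag>$phrase</$tag>}g;
    }
    return $marked;
}

print chunk('<line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>'), "\n";
# --> <line><T def="y">The</T> <N>status quo</N>.</line>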
Disambiguation. This step uses local contextual cues to resolve any ambiguous part of speech tags. In our example, the fact that "umpire" is preceded by an article is a good indicator that it is a noun and not a verb:
<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
--> <line><T def="n">A</T> <N>umpire</N>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
The syntax of the disambiguation input file is described in Section 3.2.3.
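A single such cue, hard-coded for this example (the real rule syntax is the one described in Section 3.2.3, not this regular expression):

use strict;
use warnings;

sub disambiguate {
    my ($marked) = @_;
    # an ambiguous word (<B>...</B>) directly after an article (<T ...>) is a noun
    $marked =~ s{(<T[^>]*>[^<]*</T>\s*)<B><Z>.*?</Z>(.*?)</B>}{$1<N>$2</N>}g;
    return $marked;
}

print disambiguate('<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line>'), "\n";
# --> <line><T def="n">A</T> <N>umpire</N>.</line>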
Rules. Finally, the actual grammatical rules are applied:
<line><T def="n">A</T> <N>umpire</N>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
--> <?xml version="1.0" encoding="utf-8" standalone="no"?> <!DOCTYPE teacs SYSTEM "https://cadhan.com/dtds/gram-en.dtd"> <teacs> <line><E msg="BACHOIR{an}"><T def="n">A</T> <N>umpire</N></E>.</line> <line><T def="y">The</T> <N>status quo</N>.</line> </teacs>
See Section 3.2.4 for information on how rules and exceptions are specified in the input file rialacha-xx.in.
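In the same spirit as the original sed script, the rule fired in this example might be sketched as a single substitution over the tagged stream; this is not the rialacha-xx.in syntax, and it leaves out the XML declaration and <teacs> wrapper shown above. Only the msg value is taken from the real output:

use strict;
use warnings;

sub apply_rules {
    my ($marked) = @_;
    # flag "A" before a vowel-initial noun and suggest "an"
    $marked =~ s{(<T def="n">A</T>\s*<N>[AEIOUaeiou][^<]*</N>)}{<E msg="BACHOIR{an}">$1</E>}g;
    return $marked;
}

print apply_rules('<line><T def="n">A</T> <N>umpire</N>.</line>'), "\n";
# --> <line><E msg="BACHOIR{an}"><T def="n">A</T> <N>umpire</N></E>.</line>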
Recurse. The basic strategy of An Gramadóir is that of a bottom-up parser, but with grammatical rules applied at each stage of the parse. Empirically at least, the kinds of rules one would normally like to implement seem to be naturally "stratified" according to the amount of phrase structure markup needed to implement them. Simple spell checking is like "level -1", requiring no markup at all. Most rules for Irish are "level 0", requiring part of speech (including gender, number, etc.) markup but no more; they can therefore be implemented with just one pass through the sequence of steps above. For many languages, a natural next step would be to chunk noun phrases and then apply any appropriate rules at this level before proceeding to deeper parsing. See Section 1.4 for more general remarks on this strategy and how it is particularly well-suited to languages with limited resources.
[1] "Stateless" isn't exactly the right word; the program maintains plenty of state, it is just carried around in the text stream itself rather than in so-called "variables", risky abstractions which I'm told are used widely in certain programming languages.