An Gramadóir: Developers' Guide

Chapter 1. An Overview
The first version of An Gramadóir was written as a (pretty simple-minded) sed script consisting entirely of substitutions:
s/de [bcdfgmpt][^h][^ ]*/<E msg="lenition">&<E>/g;
s/de s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&<E>/g;
s/mo [aeiouáéíóú][^h][^ ]*/<E msg="apostrophe">&<E>/g;
s/mo [bcdfgmpt][^h][^ ]*/<E msg="lenition">&<E>/g;
s/mo s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&<E>/g;
s/sa [bcfgmp][^h][^ ]*/<E msg="lenition">&<E>/g;
The latest versions are written in Perl and are infinitely more intelligent, though I've maintained this essentially "stateless" design. [1] The input text is passed through a series of filters, each of which adds some XML markup. I'll illustrate this with a trivial English language example.
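Schematically, the whole program is just a chain of such filters. The toy Perl snippet below (hypothetical code, not gram-xx.pl itself) shows the shape of that chain; the real stages are walked through one by one in the steps that follow:

use strict;
use warnings;

# Each filter maps the whole text to the same text with more markup added,
# so all state travels in the stream itself.
sub strip_markup { my ($t) = @_; $t =~ s/<[^>]*>//g; $t }     # toy preprocessor
sub wrap_line    { my ($t) = @_; "<line>$t</line>" }          # toy segmenter

my @pipeline = (\&strip_markup, \&wrap_line);
my $text = "A <b>umpire</b>.";
$text = $_->($text) for @pipeline;
print "$text\n";    # <line>A umpire.</line>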
Preprocessing. Each language has a native character encoding that is used internally by An Gramadóir to represent the lexicon and rule sets. It is also the default input encoding for the interface script gram-xx.pl. If text in another encoding is passed to gram-xx.pl, it is converted to the native encoding in the preprocessing step. Also, if the input text contains any SGML-style markup, it will be removed at this stage; otherwise it can interfere with the XML markup inserted by An Gramadóir. In the example below, the preprocessor will simply strip the <b> markup:

A <b>umpire</b>. The status quo. --> A umpire. The status quo.
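A minimal Perl sketch of this step (illustrative only; gram-xx.pl does more, and the encoding names here are just examples):

use strict;
use warnings;
use Encode qw(decode encode);

sub preprocess {
    my ($text, $input_enc, $native_enc) = @_;
    # convert to the native encoding if the input arrived in something else
    $text = encode($native_enc, decode($input_enc, $text))
        if $input_enc ne $native_enc;
    # strip any pre-existing SGML-style markup
    $text =~ s/<[^>]*>//g;
    return $text;
}

print preprocess('A <b>umpire</b>. The status quo.', 'iso-8859-1', 'utf-8'), "\n";
# --> A umpire. The status quo.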
Segmentation. This step breaks the text up into sentences, each of which is marked up with a <line> tag:

A umpire. The status quo. --> <line>A umpire.</line> <line>The status quo.</line>
See Section 3.3.1 for more information on how this is implemented.
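For illustration, a naive segmenter could be written as below; the assumption that every ".", "!", or "?" ends a sentence is purely for the sketch, not the real implementation's:

use strict;
use warnings;

sub segment {
    my ($text) = @_;
    # a "sentence" is any run of text ending in ".", "!" or "?"
    my @sentences = $text =~ /\S[^.!?]*[.!?]+/g;
    return join ' ', map { "<line>$_</line>" } @sentences;
}

print segment('A umpire. The status quo.'), "\n";
# --> <line>A umpire.</line> <line>The status quo.</line>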
Tokenization. Next, each sentence is broken up into words, each of which is marked up with a <c> tag:

<line>A umpire.</line> <line>The status quo.</line> --> <line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>
See Section 3.3.2 for information on how to specify language-specific tokenization rules.
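A crude sketch of the idea (the regular expression defining a "word" here is illustrative only; real languages need the rules from Section 3.3.2):

use strict;
use warnings;

sub tokenize_words {
    my ($sentence) = @_;
    $sentence =~ s/([A-Za-z']+)/<c>$1<\/c>/g;    # crude notion of a "word"
    return $sentence;
}

sub tokenize {
    my ($marked) = @_;
    $marked =~ s{<line>(.*?)</line>}{'<line>' . tokenize_words($1) . '</line>'}ge;
    return $marked;
}

print tokenize('<line>A umpire.</line> <line>The status quo.</line>'), "\n";
# --> <line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>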
Lookup. Next, each word is looked up in the lexicon. Unambiguous words are tagged with their correct part of speech, while ambiguous words are assigned a more complicated markup involving all of their possible parts of speech (e.g. umpire in the example, which can be, a priori, either a noun or a verb). Words that aren't found in the lexicon are sent to the morphology engine in the hope of recognizing them as morphological variants of some known word.
<line><c>A</c> <c>umpire</c>.</line> <line><c>The</c> <c>status</c> <c>quo</c>.</line>
--> <line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>
See Section 3.1.2 and Section 3.1.4 for more information on how words are stored and recognized by the morphology engine.
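An idealized sketch of the lookup, with a tiny hard-coded hash standing in for the real lexicon and the morphology fallback omitted; the tag names are taken from the worked example above, and the <X> tag for unknown words is only a placeholder for this sketch:

use strict;
use warnings;

my %lexicon = (
    'a'      => [ '<T def="n">' ],
    'the'    => [ '<T def="y">' ],
    'umpire' => [ '<N/>', '<V/>' ],   # ambiguous: noun or verb
    'status' => [ '<N>' ],
    'quo'    => [ '<F>' ],            # only valid inside a set phrase
);

sub lookup_word {
    my ($word) = @_;
    my $tags = $lexicon{lc $word} or return "<X>$word</X>";   # unknown word
    if (@$tags == 1) {
        my ($name) = $tags->[0] =~ /^<(\w+)/;
        return "$tags->[0]$word</$name>";
    }
    # ambiguous word: list every candidate part of speech inside <B><Z>...</Z>
    return '<B><Z>' . join('', @$tags) . "</Z>$word</B>";
}

print lookup_word('umpire'), "\n";   # <B><Z><N/><V/></Z>umpire</B>
print lookup_word('status'), "\n";   # <N>status</N>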
Chunking. In this step, certain "set phrases" are lumped together to be treated as single units by the grammar checker. In the present example, the word "quo" is marked up with the special tag <F>, which would lead to a warning from the grammar checker unless, as is the case here, it appears in a known set phrase. This is a useful trick.
<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>
--> <line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
See Section 3.2.2 for how to specify these chunks for your language.
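One way to picture the chunking step is as a table of set phrases driving simple merges of adjacent tagged words; the phrase list and the regular expression below are illustrative only, not the format described in Section 3.2.2:

use strict;
use warnings;

my %set_phrases = ( 'status quo' => 'N' );    # phrase => tag of the merged chunk

sub chunk {
    my ($marked) = @_;
    for my $phrase (keys %set_phrases) {
        my ($w1, $w2) = split ' ', $phrase;
        my $tag = $set_phrases{$phrase};
        # merge two adjacent tagged words into a single tagged unit
        $marked =~ s{<\w+[^>]*>\Q$w1\E</\w+>\s*<\w+[^>]*>\Q$w2\E</\w+>}{<$tag>$phrase</$tag>}g;
    }
    return $marked;
}

print chunk('<line><T def="y">The</T> <N>status</N> <F>quo</F>.</line>'), "\n";
# --> <line><T def="y">The</T> <N>status quo</N>.</line>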
Disambiguation. This step uses local contextual cues to resolve any ambiguous part of speech tags. In our example, the fact that "umpire" is preceded by an article is a good indicator that it is a noun and not a verb:
<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
--> <line><T def="n">A</T> <N>umpire</N>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
The syntax of the disambiguation input file is described in Section 3.2.3.
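A single such cue, hard-coded for this example (the real rule syntax is the one described in Section 3.2.3, not this regular expression):

use strict;
use warnings;

sub disambiguate {
    my ($marked) = @_;
    # an ambiguous word (<B>...</B>) directly after an article (<T ...>) is a noun
    $marked =~ s{(<T[^>]*>[^<]*</T>\s*)<B><Z>.*?</Z>(.*?)</B>}{$1<N>$2</N>}g;
    return $marked;
}

print disambiguate('<line><T def="n">A</T> <B><Z><N/><V/></Z>umpire</B>.</line>'), "\n";
# --> <line><T def="n">A</T> <N>umpire</N>.</line>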
Rules. Finally, the actual grammatical rules are applied:
<line><T def="n">A</T> <N>umpire</N>.</line> <line><T def="y">The</T> <N>status quo</N>.</line>
--> <?xml version="1.0" encoding="utf-8" standalone="no"?> <!DOCTYPE teacs SYSTEM "https://cadhan.com/dtds/gram-en.dtd"> <teacs> <line><E msg="BACHOIR{an}"><T def="n">A</T> <N>umpire</N></E>.</line> <line><T def="y">The</T> <N>status quo</N>.</line> </teacs>
See Section 3.2.4 for information on how rules and exceptions are specified in the input file rialacha-xx.in.
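In the same spirit as the original sed script, the rule fired in this example might be sketched as a single substitution over the tagged stream; this is not the rialacha-xx.in syntax, and it leaves out the XML declaration and <teacs> wrapper shown above. Only the msg value is taken from the real output:

use strict;
use warnings;

sub apply_rules {
    my ($marked) = @_;
    # flag "A" before a vowel-initial noun and suggest "an"
    $marked =~ s{(<T def="n">A</T>\s*<N>[AEIOUaeiou][^<]*</N>)}{<E msg="BACHOIR{an}">$1</E>}g;
    return $marked;
}

print apply_rules('<line><T def="n">A</T> <N>umpire</N>.</line>'), "\n";
# --> <line><E msg="BACHOIR{an}"><T def="n">A</T> <N>umpire</N></E>.</line>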
Recurse. The basic strategy of An Gramadóir is that of a bottom-up parser, but with grammatical rules applied at each stage of the parse. Empirically at least, the kinds of rules one would normally like to implement seem to be naturally "stratified" according to the amount of phrase structure markup needed to implement them. Simple spell checking is like "level -1", requiring no markup at all. Most rules for Irish are "level 0", requiring part of speech (including gender, number, etc.) markup but no more; they can therefore be implemented with just one pass through the sequence of steps above. For many languages, a natural next step would be to chunk noun phrases and then apply any appropriate rules at this level before proceeding to deeper parsing. See Section 1.4 for more general remarks on this strategy and how it is particularly well-suited to languages with limited resources.
[1] "Stateless" isn't exactly the right word; the program maintains plenty of state, it is just carried around in the text stream itself rather than in so-called "variables", risky abstractions which I'm told are used widely in certain programming languages.