Developing patterns for Irish
The Wikipedia page for TeX gives a brief description of the TeX hyphenation algorithm; for more detail, see Frank Liang's PhD thesis, where the algorithm was first described.
The Irish patterns consist of rules like the following:
al3i a6ll al2ann geal5a
Roughly speaking, even numbers prevent a word from being broken at the given point, and odd numbers permit a break at the given point. Larger numbers carry stronger weight than lower numbers when two rules apply. The first rule, al3i, permits a hyphen after the “l” in words like béaliata or galinneall. The second rule strongly prevents hyphenation before the “ll” in words like timpeallacht or fealltóir. The third rule weakly prevents hyphenation after the “l” in words like bialann or dialann. Note that it also applies to the verb gealann, theoretically preventing a desirable hyphenation, but is overridden by the fourth rule, which permits a hyphen at this same spot (since 5 is greater than 2).
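The interplay of the four rules above can be sketched in a few lines of Python. This is a minimal illustration of Liang-style pattern matching, not the code used to build or apply the Irish patterns; the function names, and the left_min and right_min limits (set to 2 here), are assumptions made for the example.

```python
PATTERNS = ["al3i", "a6ll", "al2ann", "geal5a"]


def parse_pattern(pattern):
    """Split 'al3i' into its letters 'ali' and gap weights [0, 0, 3, 0].

    weights[i] is the number written before letters[i]; the final entry
    is the number written after the last letter (0 where none is given).
    """
    letters = []
    weights = [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert hyphens where the highest matching weight is odd.

    left_min and right_min are assumed minimum fragment lengths,
    not values taken from the Irish pattern file.
    """
    text = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(text) + 1)       # scores[i] = gap before text[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # When several rules cover a gap, the largest number wins.
                position = start + offset
                scores[position] = max(scores[position], weight)
            start = text.find(letters, start + 1)
    pieces = []
    last = 0
    for k in range(1, len(word)):
        # The gap before word[k] is the gap before text[k + 1].
        allowed = left_min <= k <= len(word) - right_min
        if allowed and scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("béaliata", PATTERNS))   # béal-iata (al3i)
print(hyphenate("bialann", PATTERNS))    # bialann   (al2ann blocks the break)
print(hyphenate("gealann", PATTERNS))    # geal-ann  (geal5a beats al2ann)
```

With only these four rules loaded, timpeallacht comes back unbroken, because a6ll is the only rule that matches; the full rule set would of course find other break points in it.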
One basic heuristic involves lenition;
a word should never be broken between a lenitable
consonant and the “h” indicating the lenition orthographically.
Thus you will find patterns like c2h and d2h in the rule set. Conversely, if an “h” appears after a vowel or non-lenitable consonant, it is usually a good candidate for a hyphen point, as in Bói-héam-ach or Faran-haít. This results in patterns of the form i1h and n5h6a.
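To see how the lenition rules behave in isolation, here is a small Python sketch that applies only the four patterns quoted above and reports where a break would be allowed. It is an illustration under simplifying assumptions (no word-boundary markers, no minimum fragment lengths), and the helper name break_points is hypothetical.

```python
LENITION_PATTERNS = ["c2h", "d2h", "i1h", "n5h6a"]


def break_points(word, patterns):
    """Return the indices k at which a break before word[k] is allowed."""
    text = word.lower()
    scores = [0] * (len(text) + 1)       # scores[k] = gap before text[k]
    for pattern in patterns:
        letters = "".join(ch for ch in pattern if not ch.isdigit())
        # weights[i] is the digit written just before letters[i], else 0.
        weights = []
        pending = 0
        for ch in pattern:
            if ch.isdigit():
                pending = int(ch)
            else:
                weights.append(pending)
                pending = 0
        weights.append(pending)          # digit after the last letter
        start = text.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                position = start + offset
                scores[position] = max(scores[position], weight)
            start = text.find(letters, start + 1)
    # Odd scores permit a break; even scores (including 0) do not.
    return {k for k in range(1, len(text)) if scores[k] % 2 == 1}


print(break_points("Bóihéamach", LENITION_PATTERNS))  # {3}: Bói-héamach
print(break_points("Faranhaít", LENITION_PATTERNS))   # {5}: Faran-haít
print(break_points("fiche", LENITION_PATTERNS))       # set(): c2h blocks c-h
```

Only the break before the “h” is found for Bóihéamach here; the second break shown in the text (héam-ach) comes from other rules in the full set.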
Another basic heuristic, for syncopated words, is to place a hyphen at the point of syncopation, e.g. ciog-al and ciog-lach.
Results
The resulting hyphenation patterns are very much morphological rather than phonological (my personal preference). As a consequence, they do not always agree with hyphenations I've found in actual printed texts. For instance:
These patterns | Corpus |
---|---|
Ceilt-each | Ceil-teach |
siosc-adh | sios-cadh |
craic-eann | crai-ceann |
ceann-aithe | cean-naithe |
tuairt-eáil | tuair-teáil |
comh-alta | com-halta |
The last example is of course an abomination of the worst kind.
Known bugs or ambiguities
The word “record” is a well-known example of an English word where the proper hyphenation depends on context (verb re-cord vs. noun rec-ord). A strict adherence to morphological hyphenation in Irish leads to a number of amusing (and highly improbable) ambiguities, many arising from the not-particularly-distinctive form of the imperfect autonomous:
- bhrach-taí “used to be fermented” (broad stem) vs. bhracht-aí “sappiest”
- cháint-í “most critical” (no hyphen) vs. cháin-tí “used to be taxed”
- Cheilt-í “most Celtic” (no hyphen) vs. Cheil-tí “used to be concealed”
- chist-í “treasures” (no hyphen) vs. chis-tí “used to be handicapped”
- gcoirtí “may you tan” (no hyphen) vs. gcoir-tí “used to be worn out”
- chreataí “used to be shaken” (no hyphen) vs. chreat-aí “shakiest”
- doir-tear “breeds” vs. doirt-ear “spills”
- fhuad-ar “bustle” vs. fhua-dar “they sewed”
- fhuaf-ar “odious” vs. fhua-far “one will sew”
- ghais-tí “used to gush” vs. ghaist-í “traps” (no hyphen)
- gheal-taí “used to be brightened” vs. ghealt-aí “lunatics”
- ghor-taí “used to be incubated” vs. ghort-aí “may you injure”
- na haist-í “hatches” (no hyphen) vs. na hais-tí “essays”
- lé-amar “we read” (past) vs. léam-ar “lemur”
- luad-ar “movement” vs. lua-dar “they mentioned”
- meat-aí “most perishable” vs. mea-taí “used to waste”
- réalt-aí “stars” vs. réal-taí “used to be developed”
- ríf-ear “reefer” (Collins Gem) vs. rí-fear “will tighten”
- shá-dar “they stabbed” vs. shád-ar “solder”
- Shá-imis “we stab” vs. Sháim-is “Sami language”
- shá-faí “one would stab” vs. sháf-aí “ax handle” (genitive)
- speal-ta “mowed” vs. spealt-a “milt” (genitive, no hyphen)
- thiom-áin-tí “used to be driven” vs. thiom-áint-í “drives”
In the rules file, these words are placed inside a \hyphenation{} statement so that TeX will not apply the usual rule set to them.
Note that there is also a potential difficulty with words like bainte, which can be viewed morphologically as bain+te (i.e. past participle) or as baint+e (genitive of a second declension noun). The same holds if the noun form admits a plural, e.g. bhaint+í could also be the imperfect autonomous bhain+tí. The current set of patterns is designed to allow the past participle hyphenation in most cases. Here are the other words for which this is relevant: athoscailte, bainte, ceilte, cigilte, coigilte, cuimilte, deighilte, déroinnte, diomailte, dúbailte, easmailte, eitilte, fóinte, foroinnte, fuascailte, meilte, múscailte, oscailte, roinnte, satailte, streachailte, tochailte, tomhailte, tríroinnte, tuirlingte. Other past participles have the same ambiguity “accidentally”: ciste (cist=“a cyst”), coirte (coirt=“tree bark”), deilte (deilt=“delta”), feilte (feilt=“felt”). The noun cruachta is a third declension example.
Finally, there are some true bugs. They are extremely rare: as of version 1.0 the patterns do not produce any hyphen points which are not in the database and miss just 10 out of 314,639 hyphen points. This is not to say that you won't discover any bad hyphenations, but that they are the fault of the underlying database and not of the algorithms used to produce the patterns.