Developing patterns for Irish
I plan on including a detailed description of the initial data set and bootstrapping process here. For now, just a couple of quick rules while I'm thinking of them:
One basic heuristic involves lenition; a word should never be broken between a lenitable consonant and the "h" indicating the lenition orthographically. Thus you will find patterns like "c4h", "d4h", etc. in the rule set. Conversely, if an "h" appears after a non-lenitable consonant, it is usually a good candidate for a hyphen point, as in Bói-héam-ach or Faran-haít. This results in patterns of the form "i1h", "n1h", etc.
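As a rough illustration of how patterns such as "c4h" and "i1h" interact, here is a minimal Python sketch of Liang-style pattern matching, the scheme TeX uses to apply hyphenation patterns. It is not the code used to build or apply the real rule set: the function names, the catch-all "1h" pattern, and the minimum fragment lengths are assumptions made for the example, and the toy pattern list contains only the four patterns quoted above.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 'c4h' into its letters and inter-letter levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)   # level for the gap before letters[pos]
        else:
            pos += 1
    return letters, levels

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Insert hyphens where the highest matching level is odd.

    left_min and right_min are assumed minimum fragment lengths.
    """
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i] = gap before padded[i]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:         # odd level allows a break before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)

# Toy rule set: only the four patterns quoted in the text.
PATTERNS = ["i1h", "n1h", "c4h", "d4h"]

print(hyphenate("Bóihéamach", PATTERNS))        # Bói-héamach
print(hyphenate("Faranhaít", PATTERNS))         # Faran-haít
# "1h" is a hypothetical catch-all, used only to show "c4h" overriding it.
print(hyphenate("buachaill", ["1h"]))           # buac-haill
print(hyphenate("buachaill", ["1h", "c4h"]))    # buachaill
```

Odd levels permit a break and even levels forbid one, with the highest level at each position winning. That is why a level-4 pattern like "c4h" suppresses a break that a lower-level pattern would otherwise allow. The toy set finds only the breaks before "h"; the remaining break in Bói-héam-ach would come from other patterns in the full rule set.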
Another basic heuristic is, for syncopated words, to include a hyphen at the point of syncopation; e.g. ciog-al and ciog-lach.
Results
The resulting hyphenation patterns are very much "etymological"
rather than "phonological". As a consequence, they do not always
agree with hyphenations I've found in actual printed texts.
For instance:
My patterns | Corpus |
Ceilt-each | Ceil-teach |
siosc-adh | sios-cadh |
craic-eann | crai-ceann |
ceann-aithe | cean-naithe |
tuairt-eáil | tuair-teáil |
comh-alta | com-halta |
The last, of course, is an abomination of the worst kind.
Known bugs or ambiguities
- The word "record" is a well-known example of
an English word where the proper hyphenation depends on
context (verb "re-cord" vs. noun "rec-ord").
A strict adherence to "morphological" hyphenation in Irish
leads to a number of amusing (and highly improbable) ambiguities,
many arising from the not-particularly-distinctive form
of the imperfect autonomous:
- bhrach-taí "used to be fermented" (broad stem) vs. bhracht-aí "sappiest"
- cháint-í "most critical" (no hyphen) vs. cháin-tí "used to be taxed"
- Cheilt-í "most Celtic" (no hyphen) vs. Cheil-tí "used to be concealed"
- chist-í "treasures" (no hyphen) vs. chis-tí "used to be handicapped"
- gcoirtí "may you tan" (no hyphen) vs. gcoir-tí "used to be worn out"
- chreataí "used to be shaken" (no hyphen) vs. chreat-aí "shakiest"
- doir-tear "breeds" vs. doirt-ear "spills"
- fhuad-ar "bustle" vs. fhua-dar "they sewed"
- fhuaf-ar "odious" vs. fhua-far "one will sew"
- ghais-tí "used to gush" vs. ghaist-í "traps" (no hyphen)
- gheal-taí "used to be brightened" vs. ghealt-aí "lunatics"
- ghor-taí "used to be incubated" vs. ghort-aí "may you injure"
- na haist-í "hatches" (no hyphen) vs. na hais-tí "essays"
- lé-amar "we read" (past) vs. léam-ar "lemur"
- luad-ar "movement" vs. lua-dar "they mentioned"
- meat-aí "most perishable" vs. mea-taí "used to waste"
- réalt-aí "stars" vs. réal-taí "used to be developed"
- ríf-ear "reefer" (Collins Gem) vs. rí-fear "will tighten"
- shá-dar "they stabbed" vs. shád-ar "solder"
- Shá-imis "we stab" vs. Sháim-is "Sami language"
- shá-faí "one would stab" vs. sháf-aí "ax handle" (genitive)
- speal-ta "mowed" vs. spealt-a "milt" (genitive, no hyphen)
- thiom-áin-tí "used to be driven" vs. thiom-áint-í "drives"
Words like these can be placed in a \hyphenation{} statement so that TeX will not apply the usual rule set to them.
- Note that there is also a potential difficulty with words like bainte, which can be viewed morphologically as bain+te (i.e. past participle) or as baint+e (genitive of a second declension noun). The same holds if the noun form admits a plural, e.g. bhaint+í could also be the imperfect autonomous bhain+tí. The current set of patterns is designed to allow the past participle hyphenation in most cases. Here are the other words for which this is relevant: athoscailte, bainte, ceilte, cigilte, coigilte, cuimilte, deighilte, déroinnte, diomailte, dúbailte, easmailte, eitilte, fóinte, foroinnte, fuascailte, meilte, múscailte, oscailte, roinnte, satailte, streachailte, tochailte, tomhailte, tríroinnte, tuirlingte. Other past participles have the same ambiguity "accidentally": ciste (cist="a cyst"), coirte (coirt="tree bark"), deilte (deilt="delta"), feilte (feilt="felt"). The noun cruachta is a third declension example.
- Finally, there are some true "bugs". They are extremely rare: as of version 1.0, the patterns do not produce any hyphen points that are not in the database, and they miss just 10 out of 314,639 hyphen points. This is not to say that you won't discover any bad hyphenations, only that any you do find are the fault of the underlying database and not of the algorithms used to produce the patterns.
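To put those figures in perspective, here is a small Python sketch of how missed and spurious hyphen points could be counted against a reference database. The function names and the two sample pairs are assumptions made for illustration; this is not the tooling actually used to evaluate the patterns.

```python
def hyphen_points(hyphenated):
    """Return the set of letter offsets at which a hyphenated form breaks."""
    points, offset = set(), 0
    for ch in hyphenated:
        if ch == "-":
            points.add(offset)
        else:
            offset += 1
    return points

def score(pairs):
    """Count (missed, spurious, total) over (database form, pattern output) pairs."""
    missed = spurious = total = 0
    for expected, produced in pairs:
        want, got = hyphen_points(expected), hyphen_points(produced)
        total += len(want)
        missed += len(want - got)      # in the database, not found by the patterns
        spurious += len(got - want)    # found by the patterns, not in the database
    return missed, spurious, total

# Hypothetical sample: the second output misses one of the database's breaks.
sample = [("ciog-al", "ciog-al"), ("Bói-héam-ach", "Bói-héamach")]
print(score(sample))                   # (1, 0, 3)

# The version 1.0 figures quoted above: 10 missed out of 314,639, none spurious.
miss_rate = 10 / 314639
print(f"{miss_rate:.6%}")
```

On the quoted figures the miss rate comes to roughly 0.003% of the hyphen points in the database.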