Developing patterns for Irish

I plan on including a detailed description of the initial data set and bootstrapping process here. For now, just a couple of quick rules while I'm thinking of them:

One basic heuristic involves lenition; a word should never be broken between a lenitable consonant and the "h" indicating the lenition orthographically. Thus you will find patterns like "c4h", "d4h", etc. in the rule set. Conversely, if an "h" appears after a non-lenitable consonant, it is usually a good candidate for a hyphen point, as in Bói-héam-ach or Faran-haít. This results in patterns of the form "i1h", "n1h", etc.
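As a rough illustration of the lenition rule above, here is a minimal Python sketch. It assumes the lenitable consonants are b, c, d, f, g, m, p, s and t, and that the patterns are TeX-style (Liang) patterns in which an even digit forbids a break and an odd digit permits one. The names LENITABLE, lenition_patterns and breaks_lenition are made up for this example and are not part of the actual rule set, which may not contain every pattern generated here.

```python
# Assumption: the consonants that can be lenited (written with a following "h").
LENITABLE = set("bcdfgmpst")


def lenition_patterns():
    """Return inhibiting patterns such as "c4h", one per lenitable consonant."""
    return [f"{c}4h" for c in sorted(LENITABLE)]


def breaks_lenition(hyphenated):
    """Return True if any hyphen falls between a lenitable consonant and "h"."""
    word = hyphenated.lower()
    for i, ch in enumerate(word):
        # Only look at hyphens that have a letter on both sides.
        if ch == "-" and 0 < i < len(word) - 1:
            if word[i - 1] in LENITABLE and word[i + 1] == "h":
                return True
    return False


if __name__ == "__main__":
    print(lenition_patterns())
    for candidate in ["comh-alta", "com-halta", "Bói-héam-ach", "Faran-haít"]:
        print(candidate, breaks_lenition(candidate))
```

Only "com-halta" is flagged here; the breaks in Bói-héam-ach and Faran-haít fall after "i" and "n", which are not lenitable.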

Another basic heuristic applies to syncopated words: place a hyphen at the point of syncopation, e.g. ciog-al and ciog-lach.
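The syncopation heuristic can be sketched in the same spirit. The helper below (syncope_hyphenate, a hypothetical name) takes an unsyncopated form and its syncopated relative and puts a hyphen where their spellings first diverge. This is only an approximation of "the point of syncopation": it assumes the two forms are spelled identically up to that point, which will not hold when the stem vowels change as well.

```python
def syncope_hyphenate(base, syncopated):
    """Hyphenate both forms at the end of their shared initial spelling.

    Assumption: the shared prefix ends exactly at the point of syncopation.
    If there is no usable shared prefix, both words are returned unchanged.
    """
    # Length of the longest common prefix of the two spellings.
    n = 0
    while n < min(len(base), len(syncopated)) and base[n] == syncopated[n]:
        n += 1
    # No shared stem, or one word is a prefix of the other: nothing to mark.
    if n == 0 or n == len(base) or n == len(syncopated):
        return base, syncopated
    return f"{base[:n]}-{base[n:]}", f"{syncopated[:n]}-{syncopated[n:]}"


if __name__ == "__main__":
    print(syncope_hyphenate("ciogal", "cioglach"))
```

For the pair in the text this gives ciog-al and ciog-lach.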


The resulting hyphenation patterns are very much "etymological" rather than "phonological". As a consequence, they do not always agree with the hyphenations I've found in actual printed texts. For instance:

My patterns      Corpus
-----------      ------
Ceilt-each       Ceil-teach
siosc-adh        sios-cadh
craic-eann       crai-ceann
ceann-aithe      cean-naithe
tuairt-eáil      tuair-teáil
comh-alta        com-halta

The last, of course, is an abomination of the worst kind, since it breaks the word between the "m" and the "h" that marks its lenition.

Known bugs or ambiguities