A Joint Model of Orthography and Morphological Segmentation


We present a model of morphological segmentation that jointly learns to segment and restore orthographic changes, e.g., funniest ↦ fun-y-est. We term this form of analysis canonical segmentation and contrast it with the traditional surface segmentation, which segments a surface form into a sequence of substrings, e.g., funniest ↦ funn-i-est. We derive an importance sampling algorithm for approximate inference in the model and report experimental results on English, German and Indonesian.
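
The distinction between the two analyses can be illustrated with a minimal sketch. This is not the model from the paper: the helper name `is_surface_segmentation` is hypothetical, and it only checks the defining property that surface segments concatenate back to the word, whereas canonical segments need not, because orthographic changes have been restored.

```python
from typing import List


def is_surface_segmentation(word: str, segments: List[str]) -> bool:
    """Return True if the segments are substrings that concatenate
    back to the surface form exactly (hypothetical helper)."""
    return "".join(segments) == word


word = "funniest"
surface = ["funn", "i", "est"]   # surface segmentation from the abstract
canonical = ["fun", "y", "est"]  # canonical segmentation from the abstract

# The surface analysis splits the word as written.
print(is_surface_segmentation(word, surface))    # True
# The canonical analysis restores "y" and undoes the doubled "n",
# so its segments no longer spell the surface form.
print(is_surface_segmentation(word, canonical))  # False
```

The check makes the contrast concrete: a canonical segmenter has to model the mapping from the surface form to the restored form as well as the segmentation, which is why the two are learned jointly.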

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies