Stochastic Contextual Edit Distance and Probabilistic FSTs

Abstract

String similarity is most often measured by weighted or unweighted edit distance d(x, y). Ristad and Yianilos (1998) defined stochastic edit distance, a probability distribution p(y | x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.
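
To make the notion of a distribution p(y | x) concrete, the following is a minimal sketch of the context-free baseline that the abstract generalizes, not the contextual model or the transducer construction of the paper. It assumes a locally normalized model in which, at each input position, the process inserts a symbol, deletes the next input symbol, substitutes it, or copies it, and stops only after the input is consumed. The function name edit_prob, the parameter names, and the default probabilities are hypothetical choices for illustration; in practice the parameters would be trained from data.

```python
from itertools import product


def edit_prob(x, y, alphabet, p_ins=0.1, p_del=0.1, p_sub=0.1):
    """Return p(y | x) under a context-free stochastic edit model.

    Assumptions (illustrative, not taken from the paper):
      - alphabet has at least two symbols and contains every symbol of x and y;
      - insertion mass p_ins is spread uniformly over the alphabet;
      - substitution mass p_sub is spread uniformly over the other symbols;
      - copying gets the remaining mass 1 - p_ins - p_del - p_sub;
      - once x is consumed, the model inserts (p_ins) or stops (1 - p_ins).
    """
    k = len(alphabet)
    if k < 2:
        raise ValueError("alphabet must contain at least two symbols")
    p_copy = 1.0 - p_ins - p_del - p_sub
    ins_each = p_ins / k          # probability of inserting one specific symbol
    sub_each = p_sub / (k - 1)    # probability of one specific substitution

    n, m = len(x), len(y)
    # alpha[i][j]: total probability of all edit sequences that have
    # consumed x[:i] and produced y[:j].
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            a = alpha[i][j]
            if a == 0.0:
                continue
            if j < m:                      # insert y[j]
                alpha[i][j + 1] += a * ins_each
            if i < n:                      # delete x[i]
                alpha[i + 1][j] += a * p_del
            if i < n and j < m:            # copy or substitute x[i] -> y[j]
                step = p_copy if x[i] == y[j] else sub_each
                alpha[i + 1][j + 1] += a * step
    # Stop after all of x has been consumed and all of y produced.
    return alpha[n][m] * (1.0 - p_ins)


if __name__ == "__main__":
    alphabet = "ab"
    print(edit_prob("ab", "ab", alphabet))   # the unedited string is most likely
    print(edit_prob("ab", "ba", alphabet))
    # Sanity check: probabilities of all outputs up to length 6 nearly sum to 1.
    total = sum(
        edit_prob("ab", "".join(t), alphabet)
        for length in range(7)
        for t in product(alphabet, repeat=length)
    )
    print(total)
```

Because every edit here has a fixed probability regardless of its surroundings, this sketch cannot capture effects such as a typo being more likely after a particular character. Letting those probabilities depend on contextual features is the generalization the abstract describes.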

Publication
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics