Parsing has long been a central problem in Natural Language Processing, and parsers have been further improved with neural networks and pre-trained language models. Dependency parsers, for example, are still deployed in various commercial and academic systems that process natural language. The speed and accuracy of these parsers are crucial to a positive experience for users of downstream applications built on them. I will present two advances: significantly improving the speed of dependency parsing (EMNLP, 2021) and considerably improving the accuracy of unsupervised constituency parsing (Findings of ACL, 2022). In the first part of the talk, I will mainly show how a simple preprocessing step on the model's edge scores can make dependency parsing asymptotically optimal, with quadratic (rather than cubic) complexity in the sentence length. Since the number of edges the model must score is already quadratic in the sentence length, the algorithm has optimal complexity. In the second part of the talk, I will show how co-training and intuitions from spectral learning combine with pre-trained language models to yield effective multilingual unsupervised parsing. Our parsing algorithm is based on identifying nodes that dominate a substring in the sentence, alternating in co-training steps between an "outside string" view and an "inside string" view. Experiments with treebanks in English, Chinese and Korean show that this method is effective. Joint work with Miloš Stanojević and Nickil Maveli.
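As a rough illustration (not part of the talk materials, and not the paper's implementation), the short NumPy sketch below shows why quadratic time is a natural lower bound for graph-based dependency parsing: a biaffine-style arc scorer, used here purely as an illustrative assumption, already has to produce a score for every head-dependent pair, i.e. on the order of n^2 scores for a sentence of length n. The preprocessing of these scores described in the talk, which lets the decoding step match that bound, is not reproduced here.

    # Minimal sketch (illustrative only): producing the arc-score matrix that a
    # graph-based dependency parser consumes already costs Theta(n^2) in the
    # sentence length n, so an O(n^2) decoder is asymptotically optimal.
    import numpy as np

    rng = np.random.default_rng(0)

    n, d = 12, 64                     # toy sentence length and embedding size (assumed values)
    H = rng.normal(size=(n + 1, d))   # "head" representations (index 0 = ROOT)
    D = rng.normal(size=(n + 1, d))   # "dependent" representations
    W = rng.normal(size=(d, d))       # illustrative biaffine weight matrix

    # scores[h, m] = score of the candidate arc h -> m.
    # This single product touches all (n+1)^2 head-dependent pairs.
    scores = H @ W @ D.T

    assert scores.shape == (n + 1, n + 1)
    print("number of candidate arcs:", scores.size)   # (n+1)^2, quadratic in n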
Shay Cohen is a Reader at the University of Edinburgh (School of Informatics). Before this, he was a postdoctoral research scientist in the Department of Computer Science at Columbia University and held an NSF/CRA Computing Innovation Fellowship. He received his B.Sc. and M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from Carnegie Mellon University in 2011. His research interests span a range of topics in natural language processing and machine learning, focusing on structured prediction (for example, parsing) and text generation.