
Locally Typical Sampling

Today's probabilistic language generators often fall short of producing coherent and fluent text, even though the underlying models perform remarkably well on standard metrics such as perplexity. This dichotomy has …
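The abstract breaks off before the method itself; as a rough illustration, the sketch below implements the filtering step usually described for locally typical sampling: keep the tokens whose surprisal is closest to the distribution's conditional entropy, up to a probability mass tau, then renormalize before sampling. The function name and the default tau are illustrative choices, not taken from the paper or its released code.

```python
import torch

def locally_typical_filter(logits: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """Keep the tokens whose surprisal is closest to the distribution's entropy.

    Illustrative sketch: rank tokens by |surprisal - entropy|, keep the
    smallest set whose probability mass reaches `tau`, and renormalize.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1, keepdim=True)   # H(p)
    deviation = (-log_probs - entropy).abs()                   # | -log p(x) - H |
    order = deviation.argsort(dim=-1)                          # most "typical" first
    sorted_probs = probs.gather(-1, order)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep tokens up to (and including) the one that pushes the mass past tau.
    keep = cumulative - sorted_probs < tau
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, order, keep)
    filtered = torch.where(mask, probs, torch.zeros_like(probs))
    return filtered / filtered.sum(dim=-1, keepdim=True)

# next_token = torch.multinomial(locally_typical_filter(logits, tau=0.95), 1)
```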

Naturalistic Causal Probing for Morpho-Syntax

Probing has become a go-to methodology for interpreting and analyzing deep neural models in natural language processing. However, there is still a lack of understanding of the limitations and weaknesses of various types of probes. In this work, we …

On the Effect of Anticipation on Reading Times

Over the past two decades, numerous studies have demonstrated that less predictable (i.e., higher surprisal) words take more time to read. In general, these previous studies implicitly assumed the reading process to be purely responsive: readers …

Testing the Predictions of Surprisal Theory in 11 Languages

A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its …
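For reference, the standard definition from the works cited in the abstract (Hale, 2001; Levy, 2008): the surprisal of a word is its negative log-probability given the preceding context, s(w_t) = -log p(w_t | w_{<t}), so less predictable words carry higher surprisal.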

A Cross-Linguistic Pressure for Uniform Information Density in Word Order

While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these …
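For readers unfamiliar with uniform information density (UID), one common operationalization in this literature measures how evenly surprisal is spread across a sentence, for example as the variance of per-word surprisal (lower variance = more uniform). The sketch below is an illustrative calculation of that quantity, not the paper's specific experimental setup.

```python
import math

def surprisal_variance(token_probs):
    """Variance of per-token surprisal: one common way to quantify how
    (non-)uniformly information is spread across a sentence."""
    surprisals = [-math.log2(p) for p in token_probs]
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

# Example: conditional probabilities assigned to each word of one sentence.
print(surprisal_variance([0.2, 0.05, 0.6, 0.1]))
```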

On Decoding Strategies for Neural Text Generators

When generating text from probabilistic models, the chosen decoding strategy has a profound effect on the resulting text. Yet the properties elicited by various decoding strategies do not always transfer across natural language generation tasks. For …
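The abstract is cut off, but since it concerns decoding strategies, a brief sketch of two widely used ones may help orient readers. This is generic background rather than the paper's analysis, and both function names are illustrative.

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Zero out (set to -inf) everything outside the k most probable tokens."""
    cutoff = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return torch.where(logits >= cutoff, logits, torch.full_like(logits, float("-inf")))

def nucleus_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the smallest set of most-probable tokens whose cumulative mass reaches p."""
    sorted_logits, order = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    keep_sorted = sorted_probs.cumsum(dim=-1) - sorted_probs < p
    keep = torch.zeros_like(keep_sorted).scatter(-1, order, keep_sorted)
    return torch.where(keep, logits, torch.full_like(logits, float("-inf")))

# probs = torch.softmax(nucleus_filter(logits, p=0.9), dim=-1)
# next_token = torch.multinomial(probs, 1)
```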

Differentiable Subset Pruning of Transformer Heads

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a …
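As background for head pruning, the sketch below shows where per-head gates enter a multi-head self-attention computation. It is a generic gating illustration, not the paper's differentiable subset-pruning procedure (which additionally constrains how many gates stay open); all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class GatedMultiheadAttention(nn.Module):
    """Minimal multi-head self-attention with one learnable gate per head."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gates = nn.Parameter(torch.ones(num_heads))  # 1.0 = head kept, 0.0 = pruned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head).
        q, k, v = (t.view(batch, seq, self.num_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v
        # Gate each head's output; a zero gate removes the head entirely.
        heads = heads * self.gates.view(1, -1, 1, 1)
        merged = heads.transpose(1, 2).reshape(batch, seq, d_model)
        return self.out(merged)
```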

Efficient Computation of Expectations under Spanning Tree Distributions

We give a general framework for inference in spanning tree models. We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models. Our algorithms …
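As minimal background for this truncated abstract, the sketch below computes the log-partition function of an edge-factored, non-projective spanning-tree distribution via Tutte's Matrix-Tree theorem and recovers edge marginals (first-order expectations of edge indicators) by automatic differentiation. This is standard background machinery, not the paper's more general expectation algorithms, and the function name is illustrative.

```python
import torch

def spanning_tree_log_partition(log_weights: torch.Tensor, root: int = 0) -> torch.Tensor:
    """Log of the total weight of all arborescences rooted at `root`.

    `log_weights[i, j]` is the log-weight of directed edge i -> j; the
    diagonal is ignored. Tutte's theorem: delete the root's row and column
    from the in-degree Laplacian and take the (log-)determinant.
    """
    n = log_weights.size(0)
    weights = log_weights.exp() * (1 - torch.eye(n, dtype=log_weights.dtype))  # no self-loops
    laplacian = torch.diag(weights.sum(dim=0)) - weights
    keep = [i for i in range(n) if i != root]
    return torch.logdet(laplacian[keep][:, keep])

# Edge marginals (first-order expectations of edge indicators) by autodiff:
theta = torch.randn(4, 4, requires_grad=True)
log_z = spanning_tree_log_partition(theta)
log_z.backward()
marginals = theta.grad   # marginals[i, j] = p(edge i -> j appears in the tree)
```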

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to …

On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those …