Ari Holtzman


Nightly Research Statement

My goal is to help create the foundations for a science of generative models. Since the field is changing quickly, I'll give a sense of what I'm thinking about right now. I've singled out my three biggest current fascinations below:

  1. Fundamental Laws of Language Models. While scaling laws are interesting, they are definitely not the whole story, and because of that they're likely not stable in slightly different scenarios. I think there's room for many more fundamental laws. For instance, most LLMs are trained only on positive signal: predicting which words come after which other words. This makes it statistically harder to learn inhibitory signal, such as what not to say in certain contexts. Can we describe this discrepancy quantitatively, especially how it scales with data? Does it scale the same way with synthetic data designed to help propagate inhibitory aspects of language and other media?
  2. Lowering the Noise Floor. Right now, the noise floor in models is higher than it needs to be. For instance, because of the way we tokenize data, models must learn the orthographic relationship between "cat" and "cats" without seeing the character similarity and without learning the more general rule of "+s" pluralization in English. This lowers their statistical power when it comes to getting the most out of the data. Many have argued that because LLMs appear to learn the relationship between these tokens, this problem doesn't matter. I disagree: I think models are wasting sample complexity on learning individual pairs of connections rather than general rules, and that this comes at the cost of learning other features. We need to find representations, both of the data and in the way we design architectures, that allow the model to get the most out of limited data and limited training steps.
  3. LLMs as Tools. Right now, LLMs are mostly useful in augmenting a human doing a task; there are very few tasks that are directly automatable via LLM, even when the LLM is given APIs to other systems. Industrial revolutions tend to happen when at least some part of a useful process can be fully automated. Doing this requires defining tasks in such a way that we can detect and handle errors more elegantly, having clear notions of error tolerance in various parts of the pipeline, and being able to carve predictable subsystems out of currently wildly unpredictable LLMs. Perhaps LLMs will always be very unreliable, in which case we need to make them faster and smaller in order to be able to heavily filter what they produce. Regardless, thinking of LLMs as tools, something that one would call in day-to-day production code with "import llm" to get tasks done that require some level of semantic understanding (such as reorganizing input data into useful categories), is going to be a big next step for LLMs.
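
As a rough illustration of the third item, here is a minimal Python sketch of what "heavily filtering" an unreliable model inside ordinary code might look like. Everything in it is an assumption made for illustration: `classify_with_filter`, `make_stub_llm`, the prompt wording, and the example labels are hypothetical names, and the stub stands in for a real model call rather than any actual library.

```python
from typing import Callable, Optional, Sequence


def classify_with_filter(
    item: str,
    labels: Sequence[str],
    llm: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return one of `labels`, or None if every attempt is rejected."""
    prompt = f"Assign exactly one of {list(labels)} to: {item}"
    allowed = {label.lower(): label for label in labels}
    for _ in range(max_attempts):
        answer = llm(prompt).strip().lower()
        if answer in allowed:  # accept only outputs we can verify
            return allowed[answer]
    return None  # surface the failure instead of guessing


def make_stub_llm(bad_answers: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: junk `bad_answers` times, then a label."""
    calls = {"n": 0}

    def stub(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] <= bad_answers:
            return "I'm not sure, maybe..."
        return " billing\n"

    return stub


labels = ["Billing", "Shipping", "Other"]
# Two bad answers fit inside the three-attempt budget, so the filter recovers.
recovered = classify_with_filter("Charged twice for one order", labels, make_stub_llm(2))
# Five bad answers exceed the budget, so the caller gets an explicit None.
gave_up = classify_with_filter("Charged twice for one order", labels, make_stub_llm(5))
print(recovered, gave_up)
```

The design choice worth noting is that the error tolerance lives in the caller: the attempt budget and the set of acceptable outputs are explicit, so the surrounding pipeline can decide what to do when the model never produces a usable answer.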

Research Foci

My primary interest is in generative models: how they work and how we can get them to generate text and other media that communicate with humans in useful and novel ways. Lately, I’ve been thinking about how language models fit the definition of complex systems, systems in which we understand the low-level components (neurons) but can’t explain or even fully describe the high-level behaviors (e.g., in-context learning) as they emerge with more data and parameters. In the spirit of complex systems, I want to create a taxonomy of model behavior, analogous to the periodic table of elements in chemistry, which hardly explains complex chemical processes in its own right but gives a description of elementary components and their interactions that can be used to build up more complex hypotheses. Currently, we rely on benchmark performance or vague intuitive descriptions to pinpoint specific phenomena, which means most hypotheses rely on imprecise vocabulary that won’t stand the test of time.

In the short term, I’m interested in thinking about how we can map out what models can and can’t do, which I believe will naturally relate to long-form generation. It is incredibly difficult to evaluate long-form generation rigorously, and it is hard to show long-form generations in PowerPoint slides, which has made it difficult for the academic community to coordinate on the issues in long-form generation. In the medium term, I think we need to tackle the non-objective aspects of language: almost all communication is open to interpretation, and relies on a pragmatic attempt at cooperation to bridge this gap. Focusing on easy-to-evaluate aspects of language doesn’t do it justice. Perhaps looking at indirect evaluation, where we evaluate what generated language can be used for rather than whether it is “correct”, can help move researchers in that direction. My long-term goal is to create discursive machines.

Selected Publications


Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?

Ari Holtzman, Peter West, Luke Zettlemoyer


QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers*, Artidoro Pagnoni*, Ari Holtzman, Luke Zettlemoyer


Contrastive Decoding: Open-ended Text Generation as Optimization

ACL 2023

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis


Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

EMNLP 2022

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer


Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right

EMNLP 2021

=Ari Holtzman, =Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer

= equal contribution


The Curious Case of Neural Text Degeneration

ICLR 2019

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi

Useful Stuff

Materials from the Academic Job Market