Large Language Models, Spring 2023

Course Description

Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. In this course, we offer a self-contained introduction to language modeling and its applications. We start with the probabilistic foundations of language models, i.e., covering what constitutes a language model from a formal, theoretical perspective. We then discuss how to construct and curate training corpora, and introduce many of the neural-network architectures often used to instantiate language models at scale. The course covers aspects of systems programming, discussion of privacy and harms, as well as applications of language models in NLP and beyond.

News

3. 1. 2023 Class website is online!
20. 2. 2023 Update on the previous announcement from January 30th: the Large Language Models course can count towards the core elective courses for the Data Science master’s program, rather than the core courses. Indeed, the course is now listed as a core elective course for the Data Science master’s program, so no additional action is required upon registering for the course through MyStudies.
20. 2. 2023 First draft of the notes for the first part of the course is online!
24. 2. 2023 The iPad class notes have been posted. The same link will contain updated notes for the first part of the course throughout the semester.
9. 3. 2023 The first part of the first assignment has been released!
25. 4. 2023 First draft of the notes for the second part of the course is online!
25. 5. 2023 The second assignment has been released together with the LaTeX source code!
4. 8. 2023 The sample exam and the exam review sheet have been released!

Syllabus and Schedule

On the Use of Class Time

Lectures

There are two lecture slots for LLM each week: the first one on Tuesdays 14-16 in CAB G 61 and the second one on Fridays 10-11 in CAB G 61.

Both lectures will be given in person and live broadcast on Zoom; the password is available on the course Moodle page.

Lectures will be recorded—links to the Zoom recordings will be posted on the course Moodle page.

Discussion Sections

Discussion sections (tutorials) will take place Thursdays 16-18 in NO C 60 and on Zoom (same link as the lectures).

Syllabus

Date	Time	Module	Topic	Lecturer	Summary	Material	Reading
21. 2. 2023	1 hour		Introduction and Overview	Ryan/Mrinmaya/Ce/Florian	The lecturers will contextualize large language models in NLP and computer science more broadly. Thereby, we will also motivate why the topic necessitates a separate course. We will also go over the course schedule and logistics.	Introductory Slides
21. 2. 2023	1 hour	Probabilistic Foundations	Basic Measure Theory	Ryan	Language modeling is about placing probability on infinite sets of strings. Measure theory is the primary tool used for the rigorous study of probability theory. This lecture shows why defining a language model rigorously requires a careful measure-theoretic treatment. We use the classic infinite coin toss model as an illuminating example. Then, we will get into some basic measure-theoretic definitions that will be useful in formally defining language models.		Du, Li, et al. A Measure-Theoretic Characterization of Tight Language Models. arXiv, 2022.
24. 2. 2023	1 hour		Defining a Language Model	Ryan	We will continue to introduce definitions and facts from basic measure theory, building up to a formal definition of a language model, which will be our working definition throughout the class.
28. 2. 2023	2 hours		Tight Language Models	Ryan	The primary goal of this lecture is to introduce the notion of tightness, which will be a recurring theoretical concept in the first part of the course. Informally, a language model is tight when it only places probability mass on finite strings. We introduce the Borel-Cantelli lemmata and prove a precise characterization of tight language models.		Du, Li, et al. A Measure-Theoretic Characterization of Tight Language Models. arXiv, 2022; Chen, Yining, et al. Recurrent Neural Networks as Weighted Language Recognizers. arXiv, 2017.
3. 3. 2023	1 hour	Modeling Foundations	The Language Modeling Task	Ryan	In this lecture, we introduce the language modeling task, which we define to be any attempt to learn a language model from finite data. We will discuss various objectives that one might wish to optimize to induce a language model from data. We also discuss various regularization techniques and their use in combatting overfitting.
7. 3. 2023	2 hours		Finite-State Language Models	Ryan	Finite-state language models have a storied history in NLP. They are a natural generalization of n-gram models, which were the standard in the field from the 1980s until the late 2010s. In terms of theory, we introduce probabilistic finite-state automata as a generalization of finite-state automata from the classic theory of computation. Additionally, we give a simple, closed-form characterization of tightness. We also show how the model of Bengio et al. (2003), the first successful neural language model, is naturally viewed as a probabilistic finite-state automaton.		Bengio, Yoshua, et al. A neural probabilistic language model. J. Mach. Learn. Res., 2003.
10. 3. 2023	1 hour		Pushdown Language Models	Ryan	In many ways, human language is more naturally modeled by a context-free grammar than by a finite-state automaton. This lecture discusses how to use weighted context-free grammars, specifically when implemented as weighted pushdown automata, to construct language models. In the case of a 1-stack pushdown language model, we give an iterative algorithm to determine tightness. We also discuss pushdown language models with more than one stack. In this case, determining whether such a language model is tight is undecidable. Learning the nuts and bolts of pushdown language models is more than just a historical artifact: The definitions provided in this lecture will serve as a basis for proofs about the capacity of recurrent neural networks. Indeed, our proof that it is undecidable to determine the tightness of a recurrent neural language model with infinite precision is as simple as demonstrating an encoding of a 2-stack pushdown language model as a recurrent neural network.
14. 3. 2023	2 hours	Neural Network Modeling	Recurrent Neural Language Models	Ryan	Finite-state language models, by construction, can only look at a finite amount of context. Recurrent neural networks are a formalism that overcomes this limitation. In this lecture, we give a formal definition of a recurrent neural language model (RNNLM). We give examples of tight and non-tight RNNLMs as well as characterize the vanishing gradient problem.
17. 3. 2023	1 hour		Variants of RNNLMs	Ryan	We discuss several popular variants of the RNN, most notably the LSTM and GRU. We give a formal argument showing that these variants mitigate the vanishing gradient problem.
21. 3. 2023	2 hours		Representational Capacity of RNNLMs	Ryan	In this lecture, we explore the representational capacity of RNNLMs. We show that, if the activation function is a hard thresholding operation, then RNNLMs have the same expressive capacity as a finite-state LM. However, we show that RNNLMs can implicitly represent finite-state LMs that are much larger. Additionally, if the activation function is a saturated sigmoid or a ReLU and we assume infinite precision arithmetic, we show how an RNN can emulate a Turing machine.		Siegelmann H. T. and Sontag E. D. On the computational power of neural nets. Computational learning theory. 1992.
24. 3. 2023	1 hour		Transformer-based Language Models	Ryan	Introduced in 2017 by Vaswani et al., Transformers have quickly become the most popular architecture for neural language modeling. They are the basis for recent large language models, e.g., GPT-3 and PaLM. This lecture gives the definition of a Transformer and overviews details, e.g., residual connections, layer normalization, and position embeddings.
28. 3. 2023	2 hours		Efficient Attention	Ryan	There is an ever-growing bag of tricks that speed up the computation of the attention mechanism in Transformer-based language models. This lecture overviews those tricks and various generalizations of the Transformer, which are becoming increasingly necessary to scale up Transformer LMs on academic hardware. We will also discuss multi-headed attention, sparse attention, and Transformer variants tailored for long documents. Where possible, we prove guarantees for the methods.
31. 3. 2023	1 hour		Representational Capacity of Transformer-based Language Models	Ryan	Inspired by the Turing completeness of RNNs, we study the representational capacity of Transformers. Although the connection to automata is not as straightforward as with RNNs, we discuss how to think about Transformers as formal models and show that, assuming an unbounded number of layers and infinite precision, Transformers are Turing complete.
4. 4. 2023	2 hours	Modeling Potpourri	Tokenization	Ryan	Throughout the class, we have assumed access to the alphabet Σ. This lecture discusses how we should choose Σ. We discuss various facts about natural language that influence Σ, e.g., morphology and syntax. Then, we introduce the byte-pair encoding algorithm, an automatic procedure for inducing Σ, and give an analysis of its correctness and runtime.
		Easter Break
18. 4. 2023	2 hours	Modeling Potpourri	Generating Text from a Language Model	Ryan	A popular use case for language modeling is the generation of text. This lecture overviews various strategies for deterministically and stochastically generating text. We discuss beam search, ancestral sampling, as well as various sampling adaptors, e.g., top-k, nucleus, and locally typical sampling.
21. 4. 2023	1 hour	Training, Fine Tuning and Inference	Transfer Learning	Mrinmaya		Slides
25. 4. 2023	2 hours		Parameter efficient finetuning	Mrinmaya		Slides
28. 4. 2023	1 hour		Prompting and zero-shot inference	Mrinmaya		Slides
2. 5. 2023	2 hours	Parallelism and Scaling up	Scaling up	Ce		Slides
5. 5. 2023	1 hour	Parallelism and Scaling up	Parallelism	Ce		Slides
9. 5. 2023	2 hours	Applications and the Benefits of Scale	Multimodality	Mrinmaya		Slides
12. 5. 2023	1 hour	Applications and the Benefits of Scale	Additional Topics	Mrinmaya		Slides
16. 5. 2023	2 hours	Analysis	Analysis and Probing	Tiago/Ryan	Many language models are uninterpretable, i.e., it is hard to know why a language model prefers one prediction to another. This lecture overviews a variety of recent techniques for better understanding language models’ behavior and interpreting their predictions.	Slides
19. 5. 2023	1 hour	Analysis	Cognitive Modeling	Ethan/Alex/Ryan	Language models show remarkable linguistic capabilities. This lecture treats the question: Do language models process language as humans do? We discuss the performance of language models on a wide variety of cognitive benchmarks in an attempt to tease apart how language models are similar and dissimilar to human language processing. We will also discuss the implications of language modeling for language science.	Slides 1, Slides 2
23. 5. 2023	2 hours	Security and Misuse	Security and Misuse	Florian	Machine learning models are remarkably brittle, and prone to all kinds of exploits. Language models are no different: we will see how tampering with model inputs or training data can lead to arbitrarily bad outcomes. We will also discuss how language models could be exploited for nefarious purposes such as large-scale spam campaigns. On the other hand, language models could also prove useful as a defensive tool, e.g., for automated online content moderation or for dispelling misinformation.	Slides
26. 5. 2023	1 hour		Harms and Ethical Concerns	Florian	Language models work extremely well, until they don’t! What are some of the harms that large-scale deployment of language models can bring? We will discuss ways in which models can perpetuate or exacerbate issues in training data (biases, toxicity, etc.) and the difficulty in aligning models with particular ethical principles or truths.	Slides
30. 5. 2023	2 hours		Memorization and Privacy	Florian	We look into language models’ remarkable ability to memorize training data, and the risks this may pose for privacy or copyright. We will look at different ways to define memorization and privacy for textual models, and understand the different threats they aim to address. We will then review methods for provably guaranteeing the confidentiality and privacy of machine learning systems, and debate their adequacy in the context of textual models.	Slides
2. 6. 2023	1 hour		The data lifecycle	Florian	So far, most of the course has been about models. But what would these models be without the right data? We will discuss the lifecycle of modern training sets for language models, to understand how design choices in the data collection and maintenance process influence the model’s “world view”. We will review emerging guidelines and best practices for managing and documenting machine learning datasets across their lifetime.	Slides
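
To make the informal notion of tightness from the "Tight Language Models" lecture concrete, here is a minimal Python sketch. It assumes a toy model, not one taken from the course notes, in which the probability of emitting the end-of-string symbol depends only on the position t. The function name `finite_string_mass` and both end-of-string schedules are hypothetical choices made for illustration.

```python
def finite_string_mass(eos_prob, max_len):
    """Probability that generation halts after at most max_len symbols.

    eos_prob(t) is the (assumed) probability of emitting end-of-string
    at position t, given that generation has not halted before t.
    """
    survive = 1.0  # probability that no end-of-string was emitted so far
    mass = 0.0     # accumulated probability of strings of length <= max_len
    for t in range(max_len + 1):
        p = eos_prob(t)
        mass += survive * p
        survive *= 1.0 - p
    return mass


# Constant end-of-string probability: the mass on finite strings approaches 1.
tight = finite_string_mass(lambda t: 0.1, 1000)

# End-of-string probability decaying like 1 / (t + 2)^2: the mass on finite
# strings stays below 1/2, so the remainder sits on infinite sequences.
non_tight = finite_string_mass(lambda t: 1.0 / (t + 2) ** 2, 1000)

print(tight, non_tight)
```

In this toy setting the first schedule yields a tight model, while the second does not: the survival probability is the telescoping product of (1 - 1/k^2) for k ≥ 2, which converges to 1/2 rather than to 0.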

Organisation

Live Chat

In addition to class time, there will also be a RocketChat-based live chat hosted on ETH’s servers. Students are free to ask questions of the teaching staff and of others in public or private (direct message). There are specific channels for each of the two assignments as well as for reporting errata in the course notes and slides. All data from the chat will be deleted from ETH servers at the course’s conclusion.

Important: There are a few important points you should keep in mind about the course live chat:

RocketChat will be the main communications hub for the course. You are responsible for receiving all messages broadcast in the RocketChat.
Your username should be firstname.lastname. This is required because only enrolled students may participate in the chat, and we will remove users whom we cannot validate.
Tag your questions as described in the document on How to use Rycolab Course RocketChat channels. The document also contains other general remarks about the use of RocketChat.
Search for answers in the appropriate channels before posting a new question.
Ask questions on public channels as much as possible.
Reply to posts in threads.
The chat supports LaTeX for easier discussion of technical material. See How to use LaTeX in RocketChat.
We highly recommend you download the desktop app here.

This is the link to the main channel. To make the chat easier to moderate, we have created a number of other channels on RocketChat. The full list is:

LLM General Channel for the general organisational discussions.
LLM Announcements Channel for the announcements by the teaching team.
LLM Content Questions for your questions about the content of the course.
LLM Errata for reporting typos and errors in the course lecture notes and the slides.
Find Assignment Partners for finding teammates for the course assignments.

If you feel like you would benefit from any other channel, feel free to suggest it to the teaching team!

Course Notes

We will prepare the course lecture notes as we go! The individual chapters will be published in the course syllabus and updated throughout the semester. Please report all errata to the teaching staff; we created an errata channel in RocketChat.

Links to the course notes:

Other useful literature:

Grading

Marks for the course will be determined by the following formula:

50% Final Exam
50% Assignments
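
As a rough illustration of the formula above, the following Python sketch computes the weighted mark. It assumes that both components are expressed on the same scale and that the assignment mark has already been aggregated into a single number; the function name `final_mark` and these assumptions are illustrative and not part of the official grading rules.

```python
def final_mark(exam, assignments, exam_weight=0.5):
    """Weighted course mark: 50% final exam, 50% assignments by default.

    Assumes (hypothetically) that `exam` and `assignments` are on the
    same scale; no rounding is applied.
    """
    if not 0.0 <= exam_weight <= 1.0:
        raise ValueError("exam_weight must lie in [0, 1]")
    return exam_weight * exam + (1.0 - exam_weight) * assignments


example = final_mark(exam=5.0, assignments=4.5)
print(example)  # 4.75
```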

On the Final Exam

The final exam is comprehensive and should be assumed to cover all the material in the slides and class notes.

On the Class Assignments

There will be 2 larger assignments in the course.

We require the solutions to be properly typeset. We recommend using LaTeX (with Overleaf), but Markdown files with MathJax for the mathematical expressions are also fine.

The first assignment will be of a more theoretical nature and will be released shortly after the start of the semester.

Assignment Deadlines

Both assignments are due on Tuesday, August 15th at 23:59.

Large Language Models Lecturers

Assistant Professor in Computer Science

Assistant Professor in Computer Science

Assistant Professor of Computer Science

Large Language Models Guest Lecturers

Postdoc

Postdoc

Postdoc

Large Language Models Teaching Assistants

PhD Student

PhD Student

MSc Student

MSc Student

Master’s Student

MSc Student

PhD Student

PhD Student

MSc Student

PhD Student

MSc Student

PhD Student

MSc Student

Master’s Student

PhD Student