Data Governance and Transparency for Large Language Models

Abstract

The continued growth of LLMs and their wide-scale adoption in commercial applications such as ChatGPT make it increasingly important to (a) develop more transparent ways to source their training data, and (b) investigate that data, both for research purposes and for ethical issues. This talk will discuss the current state of affairs and some data governance lessons learned from BigScience, an open-source effort to train a multilingual LLM, including an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.

Date
Jun 9, 2023 2:30 PM — 3:30 PM
Location
OAT S13

Bio

Anna Rogers is an assistant professor in the Computer Science Department at the IT University of Copenhagen. Her research interests span the intersection of linguistics, natural language processing, and machine learning. She is also currently a program co-chair of ACL 2023.