The continued growth of LLMs and their wide-scale adoption in commercial applications such as ChatGPT make it increasingly important to (a) develop more transparent ways to source their training data and (b) investigate that data, both for research and to address ethical issues. This talk will discuss the current state of affairs and some data governance lessons learned from BigScience, an open-source effort to train a multilingual LLM, including an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.
Anna Rogers is an assistant professor in the Computer Science Department at the IT University of Copenhagen. She has a wide range of research interests that span the intersection of linguistics, natural language processing, and machine learning. She is also currently a co-program-chair of ACL 2023.