Canada AI Day
Schedule
1:00 pm - 1:05 pm
Organizers
1:05 pm - 1:35 pm
Colin Raffel, University of Toronto
Title: Training Performant Language Models on Openly Licensed Text
Abstract:
Large language models (LLMs) are typically trained on trillions of words of copyrighted text that is used without explicit permission. Even with incredibly meager compensation, the cost of actually paying for the creation of this pretraining data would dwarf current model training costs. This has led rights holders to initiate a slew of lawsuits against companies developing LLMs, in turn leading the companies to make claims like “... it would be impossible to train today’s leading AI models without using copyrighted materials.” In this talk, I will present our work that aims to invalidate this claim by curating an 8TB dataset of openly licensed and public domain text from a diverse range of more than 25 sources, including governmental texts, historical books, research papers, educational video transcripts, and more. Crucially, I will also present the results of training an LLM on this text that attains competitive performance with budget-matched models trained on unlicensed data.
1:35 pm - 2:05 pm
Freda Shi, University of Waterloo
Title: Linguistic Insights Deepen Our Understanding of AI Systems: The Cases of Reference Frames and Logical Reasoning
Abstract:
Understanding why neural AI models behave the way they do, particularly when they fail, has become one of the central topics in recent years. In this talk, I will demonstrate how linguistic principles provide a powerful lens to diagnose and predict model behaviors, ultimately contributing to the development of more robust and interpretable AI systems. I will exemplify this line of research in our group with two specific cases: (1) multimodal reasoning with an awareness of reference frames, and (2) logical reasoning across multiple semantically equivalent forms. By analyzing how these AI models behave, I will highlight patterns and limitations in their decision-making processes, offering insights into their shortcomings and potential improvements. I will conclude by discussing my recent thoughts on interdisciplinary collaboration in advancing both AI research and linguistics.
2:05 pm - 2:15 pm
2:15 pm - 2:45 pm
Victor Zhong, University of Waterloo
Title: From Text-to-SQL to Agentic Search: The Future of Natural Language Data Interfaces
Abstract:
The ability to query databases using natural language has made significant strides with text-to-SQL models, unlocking access to structured data for non-expert users. However, as data environments grow increasingly complex—spanning massive enterprise databases and extending into heterogeneous data lakes containing unstructured and multi-modal information—the limitations of traditional text-to-SQL approaches become apparent. We chart the progression from foundational benchmarks like WikiSQL and Spider, through the real-world complexities captured by Spider 2.0, toward a future where language agents transcend SQL generation to perform autonomous, agentic search and reasoning across diverse, schema-less data landscapes. We envision a new generation of natural language interfaces—not just translating queries, but actively orchestrating complex data exploration and analysis in real-world settings.
2:45 pm - 3:15 pm
Sivan Sabato, McMaster University
Title: Classifier Fairness and How to Measure It
Abstract:
Discrimination by AI is widespread, resulting in some groups being treated unfairly by systems that incorporate AI. The field of algorithmic fairness studies methods for combating algorithmic discrimination. An essential step is formalizing a notion of fairness for classifiers. In this talk, I will demonstrate why this is more challenging than one might initially assume, and discuss types of formal fairness notions. I will then consider the challenges of measuring unfairness and auditing the fairness of classifiers that are not directly accessible, such as proprietary classifiers used by private companies. I will present a principled approach for quantifying unfairness, and methods for drawing conclusions on the unfairness of classifiers from limited aggregate statistics.
3:15 pm - 3:30 pm
3:30 pm - 3:40 pm
Ming Hou, Department of National Defence (DND), Canada
3:40 pm - 3:50 pm
Nima Shahbazi, Collective[i]
3:50 pm - 5:30 pm
Building the Future of AI: Data, Reasoning, and Trust - moderated by Filippo Sposini
Panelists:
Colin Raffel
Sivan Sabato
Freda Shi
Victor Zhong
Ming Hou
Nima Shahbazi
Audience Q&A
5:30 pm - 5:35 pm