Visiting Researcher talk: Dr Fazl Barez | 05 May 2025
Talk Title
Open Problems in Machine Unlearning for AI Safety
Speaker
Dr Fazl Barez
About the Speaker
Dr Fazl Barez is a Senior Research Fellow at the University of Oxford, affiliated with the Torr Vision Group (TVG) at Engineering Science and the AI Governance Initiative (AIGI) at the Oxford Martin School. His work spans academia, leading industry labs, AI Safety Institutes, and non-profits, with a focus on advancing safe AI practices. His research has contributed to shaping both academic and industry standards.
He has led research with the UK AI Safety Institute on machine unlearning for AI safety, developed the N2G algorithm adopted by OpenAI to evaluate Sparse Autoencoders, and spearheaded the Alan Turing Institute’s response to the UK House of Lords on large language models, which informed parliamentary inquiries. He has also collaborated with Anthropic on research papers examining deception in LLMs, reward hacking, and related topics.
He is affiliated with Cambridge’s Centre for the Study of Existential Risk (CSER), NTU’s Digital Trust Centre, the University of Edinburgh’s School of Informatics, and is a member of ELLIS. In 2024–2025, he served as a Research Consultant with Anthropic’s Alignment team. His previous roles include researcher positions at Amazon and Huawei, as well as Co-director and Head of Research at Apart Research.
Talk Description
Dr Fazl Barez’s recent talk at NTU examined the challenges and limitations of machine unlearning as a tool for ensuring AI safety. Machine unlearning is often used to selectively suppress specific types of knowledge so that AI models are aligned with particular use cases, but Dr Barez cautioned that the approach has several limitations. One concern relates to the “pluralistic nature” of knowledge: even after unlearning specific data, an AI model may inadvertently recombine the pieces of information that remain to generate dangerous insights. This raises ethical and practical questions, because suppressing such knowledge might also hinder valuable applications. Dr Barez also described unintended side effects of unlearning, which can degrade a model’s ability to perform useful tasks. Additionally, he noted that assessing the effectiveness of machine unlearning remains a significant challenge, as AI models may retain traces of removed knowledge or relearn it through indirect means.
During the Q&A session, Dr Barez emphasised that unlearning alone cannot serve as a “silver bullet” for AI safety. He encouraged NTU researchers to reflect on this open problem, pursue research into alternative approaches, and explore how the concepts presented could be applied within their own studies.