Background
Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across diverse tasks such as text generation, machine translation, summarization, and question answering. Models such as GPT-3, PaLM, and LLaMA have achieved breakthroughs in conversational AI, code generation, and multi-modal tasks. Their success stems from their massive scale in both parameter count and training data size, which enables them to capture complex patterns and relationships in language. Ensuring that LLMs are safe is therefore crucial. One of our recent studies investigates LLM inconsistency across multi-round conversations and is described below.
Despite these impressive abilities, LLMs frequently exhibit a critical flaw known as conversational inconsistency: they provide contradictory information across multiple dialogue turns. For example, when asked about the location of Normandy, an LLM might correctly answer “France,” but then change its response to “Germany” when the user challenges its initial answer, even though the first response was factually correct. This inconsistency poses significant risks in high-stakes domains such as healthcare and legal advice, where reliable and trustworthy information is crucial for decision-making.
Traditional training objectives for LLMs, which focus primarily on next-token prediction in single-turn contexts, fail to adequately address conversational consistency. While these approaches excel at capturing local coherence and linguistic patterns, they struggle with maintaining global consistency across multiple dialogue turns. The fundamental limitation lies in their inability to model long-range dependencies and contextual dynamics inherent in multi-turn conversations. This shortcoming is particularly problematic as LLMs are increasingly integrated into professional and decision-support systems where maintaining consistent, accurate responses is paramount for user trust and safety.
What We Do
We introduce Conversationally Consistent Supervised Fine-Tuning (CC-SFT), a new way to make AI language models more consistent in conversations. Instead of training the model only to give good individual answers, we explicitly train it to stay consistent across dialogue turns. The method combines three goals: getting the first answer right, getting the follow-up answer right, and making sure the two answers do not contradict each other, as sketched below.
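As a rough illustration, the snippet below sketches how such a three-part objective could be combined in PyTorch. The function name `cc_sft_loss`, the weighting coefficients `alpha`, `beta`, and `gamma`, and the placeholder consistency term are assumptions made for this sketch only, not the exact formulation used in CC-SFT.

```python
import torch
import torch.nn.functional as F

def cc_sft_loss(logits_r1, labels_r1, logits_r2, labels_r2,
                emb_r1, emb_r2, alpha=1.0, beta=1.0, gamma=0.5):
    """Illustrative combination of first-answer accuracy, follow-up accuracy,
    and a consistency penalty (all names and weights are assumptions)."""
    # Term 1: standard next-token cross-entropy on the first-round answer.
    loss_r1 = F.cross_entropy(logits_r1.reshape(-1, logits_r1.size(-1)),
                              labels_r1.reshape(-1), ignore_index=-100)
    # Term 2: cross-entropy on the follow-up answer, after the user's challenge.
    loss_r2 = F.cross_entropy(logits_r2.reshape(-1, logits_r2.size(-1)),
                              labels_r2.reshape(-1), ignore_index=-100)
    # Term 3: semantic divergence between the two answers. A simple placeholder
    # (distance between mean-pooled answer embeddings) stands in here; the
    # Wasserstein-based measure is sketched further below.
    divergence = torch.norm(emb_r1.mean(dim=1) - emb_r2.mean(dim=1),
                            dim=-1).mean()
    return alpha * loss_r1 + beta * loss_r2 + gamma * divergence
```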
Another innovation lies in how we measure and enforce consistency between answers. We use a mathematical technique called the Wasserstein distance (also known as Earth Mover’s Distance) to compare how similar the meanings of the two answers are, even when they use different words to say the same thing. By balancing these three goals during training (accuracy of the first answer, accuracy of the second answer, and consistency between them), we help the model learn to stand by its correct answers when challenged, rather than changing its mind inappropriately. Check out our open-source code repositories on our Harvard AI Robotics Lab GitHub account.
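The sketch below shows one way such a consistency signal could be computed: an entropy-regularized (Sinkhorn) approximation of the Wasserstein distance between the token-embedding clouds of the two answers. The cosine cost, the regularization strength `eps`, and the helper name `sinkhorn_wasserstein` are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_wasserstein(emb_a, emb_b, eps=0.1, n_iters=100):
    """Approximate Wasserstein distance between two sets of token embeddings.

    emb_a: (n, d) token embeddings of the first answer.
    emb_b: (m, d) token embeddings of the follow-up answer.
    """
    # Cosine cost between every token pair of the two answers, bounded in [0, 2].
    a_norm = F.normalize(emb_a, dim=-1)
    b_norm = F.normalize(emb_b, dim=-1)
    cost = 1.0 - a_norm @ b_norm.t()                    # (n, m)

    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)                       # uniform mass per token
    b = torch.full((m,), 1.0 / m)

    # Sinkhorn-Knopp iterations on the Gibbs kernel.
    K = torch.exp(-cost / eps)
    u = torch.ones(n) / n
    v = torch.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)

    plan = u[:, None] * K * v[None, :]                  # approximate transport plan
    return torch.sum(plan * cost)                       # total transport cost

# Usage: a small distance means the two answers convey similar meaning,
# so the consistency penalty in the training objective stays small.
emb_first = torch.randn(12, 768)    # e.g., embeddings of the first answer's tokens
emb_second = torch.randn(15, 768)   # e.g., embeddings of the challenged re-answer
print(sinkhorn_wasserstein(emb_first, emb_second).item())
```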
Selected Publications
To be online in February 2025.