2025 WS: Michael Wiegand

22.09.2025

Just say no: Investigating the reluctance of language models to provide honest but discouraging feedback

This research project aims to investigate the extent to which different Large Language Models (LLMs) tend to refrain from giving truthful feedback, particularly when it involves delivering unpleasant or discouraging information to users. To achieve this, a diverse set of potentially awkward or sensitive situations must be identified and systematically evaluated. We aim to understand which types of scenarios language models are most reluctant to respond to with honest discouragement, and why. For instance, does this reluctance depend on the size of the model? Does it depend on the degree or type of post-training (e.g. instruction tuning, reinforcement learning from human feedback etc)? Or are other factors involved?

Special attention will be given to the wording of prompts, as certain ways of asking for advice may be especially unlikely to elicit honest or critical responses. Ideally, the project will also explore strategies to elicit more truthful feedback from LLMs, even when such responses may be unwelcome or negative.

Prerequisites for students:

Good programming knowledge of Python
General understanding of large language models (LLMs), including the different stages of their development (e.g. pre-training, reinforcement learning from human feedback, etc.)
Experience working with LLMs, particularly in zero-shot and few-shot classification and prompt engineering; ideally also experience with fine-tuning
Familiarity with quantitative methods, especially evaluation techniques in NLP contexts
Familiarity with supervised learning in general
Basic knowledge of how to design and set up experiments in NLP
Openness to work on language data

Project open to: Business Analytics, Data Science, Digital Humanities (students should be comfortable with coding in Python)

Number of students: 1-4