Human-Centric Evaluation

How do we know an AI system actually helped? How do we evaluate AI from the perspective of the person using it, not just the system producing it?

Most AI evaluation measures whether the system works. I'm interested in whether it's useful. This means designing evaluations around what users actually need: whether an LLM response truly aligns with the user's intent, and whether it genuinely helps them make a decision. Part of this also involves making evaluations easier for users to investigate, interpret, and trust.
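As a toy illustration only, and not a description of any particular study design, a human-centric eval might collect ratings on exactly these criteria (intent alignment, decision support, and interpretability) rather than model-centric scores alone. The schema, names, and 1-to-5 scale below are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class HumanCentricRating:
    """One rater's judgment of one LLM response (hypothetical schema)."""
    response_id: str
    intent_alignment: int   # 1-5: did the response address what the user meant?
    decision_support: int   # 1-5: did it help the user make their decision?
    interpretability: int   # 1-5: could the user investigate and trust the result?

def summarize(ratings: list[HumanCentricRating]) -> dict[str, float]:
    """Average each criterion across raters."""
    n = len(ratings)
    return {
        "intent_alignment": sum(r.intent_alignment for r in ratings) / n,
        "decision_support": sum(r.decision_support for r in ratings) / n,
        "interpretability": sum(r.interpretability for r in ratings) / n,
    }
```

The point of a sketch like this is the unit of analysis: each row is a user's judgment of usefulness, not a system-side accuracy score.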

Upcoming: work on chart evaluation at Microsoft.