Human-Centric Evaluation

How do we know an AI system actually helped? How do we evaluate AI from the perspective of the person using it, not just the system producing it?

Most AI evaluation measures whether the system works. I'm interested in whether it's useful. This means designing evaluations around what users actually need: whether an LLM response truly aligns with the user's intent, and whether it genuinely helps them make a decision. Part of this also involves making evaluations easier for users to investigate, interpret, and trust.
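As a toy illustration only, and not a description of any particular study design, a human-centric eval might collect ratings on exactly these criteria (intent alignment, decision support, and interpretability) rather than model-centric scores alone. The schema, names, and 1-to-5 scale below are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class HumanCentricRating:
    """One rater's judgment of one LLM response (hypothetical schema)."""
    response_id: str
    intent_alignment: int   # 1-5: did the response address what the user meant?
    decision_support: int   # 1-5: did it help the user make their decision?
    interpretability: int   # 1-5: could the user investigate and trust the result?

def summarize(ratings: list[HumanCentricRating]) -> dict[str, float]:
    """Average each criterion across raters."""
    n = len(ratings)
    return {
        "intent_alignment": sum(r.intent_alignment for r in ratings) / n,
        "decision_support": sum(r.decision_support for r in ratings) / n,
        "interpretability": sum(r.interpretability for r in ratings) / n,
    }
```

The point of a sketch like this is the unit of analysis: each row is a user's judgment of usefulness, not a system-side accuracy score.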

Upcoming: work on chart evaluation at Microsoft.