Harshita Diddee

6715 Gates Hillman Center, Carnegie Mellon University

LTI PhD @ Carnegie Mellon University

Hello!

I am a third-year Ph.D. student at Carnegie Mellon University (LTI, School of Computer Science), where I am advised by Daphne Ippolito.

My vision is to make LLMs reliable for open-ended real-world tasks by improving measurement validity, that is, by closing the gap between appearing capable on an evaluation benchmark and succeeding at what users actually need.

I pursue this in two ways: (1) enabling users to anticipate hidden failures (see our work on why this matters, and our tool, which helps reveal where narrow, static evaluation benchmarks miss real-world use cases or diverge in their conclusions about model competence); and (2) reducing the cost of such failures by making opaque, verbose LLM outputs easier to verify before errors become consequential, for example, an incorrect tax return buried in a 100,000-token response (ongoing work at CMU and Adobe Research).

Last summer, I worked with Amazon Core Search to make retail LLM-as-Judges less brittle—addressing a measurement-validity failure mode where a judge can appear reliable while missing fundamental relevance signals (preprint forthcoming). Previously, as a Pre-Doctoral Research Fellow at Microsoft Research India advised by Kalika Bali, I studied text and speech LLM compression under extreme data scarcity (text; speech).

I earned my B.Tech. in Computer Science from Bharati Vidyapeeth’s College of Engineering as the Best Outgoing Student of the Class of 2021 and of the Computer Science Department.

My earliest formal education started about 20 years ago when I began learning Odissi, an Indian Classical Dance originating in Odisha. Dashavatar (The Ten Primary Avatars of Vishnu) and Battu are some recitals in this richly diverse art form (I am the dancer in the pink-purple costume :)). Consequently, I am also deeply invested in music, in any and every form or language (here is me singing a Hindi Bhajan).

News

May 2026	Joined Adobe Research in San Jose, CA as a Research Scientist Intern!
Jan 2026	We released BenchBrowser: a tool to help you build more trust in the validity of the evaluations you see for the capabilities you care about! Given your custom use case, BenchBrowser retrieves test cases from 20+ popular benchmarks so you can see how your capability is being evaluated, and provides a workspace for you to assess whether such evaluations lead to consistent conclusions about whether a model is good for your task.
May 2025	Joined Amazon’s Core Search Team as an Applied Science Intern. Will be in Palo Alto, CA for the Summer! Let’s connect if you are interested in Data Selection and/or Language Model Personalization!
Jan 2025	Chasing Random: Instruction Selection Strategies Fail to Generalize accepted to NAACL Findings! Code released here!
Feb 2024	INMT-Lite accepted to LREC 2024! Code and Paper out!
Oct 2023	Two papers accepted to EMNLP 2023: MEGA and Fifty Shades of Bias: Normative Ratings of Gender Bias in GPT Generated English Text!
Apr 2023	Accepted an offer to join LTI @ Carnegie Mellon University for my PhD!
Nov 2022	I’ll be attending ALPS Winter School this January! Let’s chat if you are interested in referenceless NLG evaluations and Quality Estimation. You can check out my poster on our preliminary constraint review of QE here!
Nov 2022	CodeFed: Federated Speech Recognition for Low-Resource Code-Switching Detection has been accepted to ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)!
Oct 2022	Too Brittle To Touch: Comparing the Stability of Quantization and Distillation towards developing Low-Resource MT Models was accepted to WMT (Research Track): Check out the Preprint and Code! Headed to EMNLP to present it! Let’s chat if you’re interested in Data Quality Estimation, constrained generation or anything Low-Resource :)
Aug 2022	Joining the organizing committee for The 2022 IEEE Spoken Language Technology Workshop, SLT 2022 Hackathon!
Jul 2022	Visiting Johns Hopkins University for the month as a part of JSALT’22: Contributing to the Speech Translation for Under-Resourced Languages Track
May 2022	Presenting our work A Collaborative Approach to Developing Language Technology Interventions for Endangered Languages and leading the panel (Panel A) at ComputEL-5, ACL’22

Selected Papers

BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity

Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, and 1 more author

Preprint Feb 2026

Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a “poetry” benchmark may never test for haikus, while “instruction-following” benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BENCHBROWSER, a retriever that surfaces evaluation items relevant to natural-language use cases across 20+ benchmark suites. Validated by a human study confirming high retrieval precision, BENCHBROWSER generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability’s facets) and low convergent validity (lack of stable rankings when measuring the same capability). BENCHBROWSER helps quantify a critical gap between practitioner intent and what benchmarks actually test. The tool and its code are publicly available.

Chasing Random: Instruction Selection Strategies Fail to Generalize

Harshita Diddee and Daphne Ippolito

In Findings of the Association for Computational Linguistics: NAACL 2025 Apr 2025

Prior work (Zhou et al., 2023a) has shown that language models can be tuned to follow user instructions using only a small set of high-quality instructions. This has accelerated the development of methods that filter large, noisy instruction-tuning datasets down to a high-quality subset which works just as well. However, the performance of these methods is typically not demonstrated across a uniform experimental setup, and thus their generalization capabilities are not well established. In this work, we analyze popular selection strategies across different source datasets, selection budgets and evaluation benchmarks. Our results indicate that selection strategies generalize poorly, often failing to consistently outperform even random baselines. We also analyze the cost-performance trade-offs of using these strategies: our findings reveal that the cost of selection can often exceed the cost of fine-tuning on the full dataset, while yielding only marginal, and sometimes no, gains compared to tuning on the full dataset or a random subset.

NoveltyBench: Evaluating Creativity and Diversity in Language Models

Yiming Zhang, Harshita Diddee, Susan Holm, and 5 more authors

In Second Conference on Language Modeling Apr 2025

Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly struggle with mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize creativity alongside quality.