LLM Eval

NIH Text Evaluation

A text-evaluation workflow for NIH-style project material, focused on rubric, data scope, badcase review, bias awareness, and human review.

rubric · badcase · bias · human review

Problem

Text-heavy evaluation is unreliable if reviewers and models do not share stable criteria. The underlying product problem is making judgments comparable across reviewers and auditable after the fact.

Workflow

  1. Separate dataset scope and summary-generation pipeline records.
  2. Define rubric dimensions before asking for model output.
  3. Use badcase flags for thin abstracts, institution halo, geography skew, missing funding context, and synthetic-label risk.
  4. Keep final judgment in a human-review loop instead of treating model output as an expert decision.

Evidence

Rubric structure

Scientific value, methodology, team, social impact, and resource-use dimensions.
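One way to pin these dimensions down before any model output is requested is to fix them as data and validate every score set against them. The sketch below is illustrative only: the identifier names and the 1-5 scale are assumptions, not part of the project material.

```python
# Dimension names mirror the rubric above; the identifiers and the
# 1-5 scale are assumptions made for this sketch.
RUBRIC_DIMENSIONS = (
    "scientific_value",
    "methodology",
    "team",
    "social_impact",
    "resource_use",
)
SCALE_MIN, SCALE_MAX = 1, 5


def validate_scores(scores: dict[str, int]) -> dict[str, int]:
    """Reject score sets with missing, extra, or out-of-range dimensions."""
    missing = set(RUBRIC_DIMENSIONS) - set(scores)
    extra = set(scores) - set(RUBRIC_DIMENSIONS)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, value in scores.items():
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"{name}={value} is outside {SCALE_MIN}-{SCALE_MAX}")
    return scores
```

Rejecting incomplete score sets, rather than defaulting missing dimensions, keeps reviewer and model scores comparable on the same criteria.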

Data scope note

Separates main data, pipeline-summary records, and enhanced-analysis samples.
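A minimal sketch of that separation, assuming each record carries a hypothetical `scope` field with one of three assumed labels. The field name and label strings are invented for illustration.

```python
# Scope labels are assumed names for the three groups described above.
SCOPES = ("main", "pipeline_summary", "enhanced_analysis")


def split_by_scope(records: list[dict]) -> dict[str, list[dict]]:
    """Partition records by scope; refuse records with an unknown scope."""
    buckets: dict[str, list[dict]] = {scope: [] for scope in SCOPES}
    for record in records:
        scope = record.get("scope")
        if scope not in buckets:
            raise ValueError(f"unknown scope: {scope!r}")
        buckets[scope].append(record)
    return buckets
```

Failing on an unlabeled record, instead of silently placing it in the main data, keeps pipeline-generated summaries from leaking into counts reported for the main dataset.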

Badcase checklist

Covers institution halo, geography skew, research-area skew, thin abstract, missing funding context, and synthetic-label risk.
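Some of these checks can be approximated mechanically, as in the hedged sketch below. The field names (`abstract`, `funding_context`, `label_source`, `state`, `research_area`) and both thresholds are assumptions. Institution halo is deliberately left out, since it concerns how a score was influenced and is better judged by a human reviewer.

```python
from collections import Counter

THIN_ABSTRACT_WORDS = 80  # assumed cutoff, not a project value
SKEW_SHARE = 0.5          # assumed: one category holding over half the sample


def record_flags(record: dict) -> list[str]:
    """Per-record badcase flags based on hypothetical record fields."""
    flags = []
    if len(record.get("abstract", "").split()) < THIN_ABSTRACT_WORDS:
        flags.append("thin_abstract")
    if not record.get("funding_context"):
        flags.append("missing_funding_context")
    if record.get("label_source") == "synthetic":
        flags.append("synthetic_label_risk")
    return flags


def skew_flags(records: list[dict]) -> list[str]:
    """Sample-level flags: is one state or research area dominating?"""
    flags = []
    checks = (("state", "geography_skew"), ("research_area", "research_area_skew"))
    for field_name, flag in checks:
        counts = Counter(r[field_name] for r in records if r.get(field_name))
        if counts:
            top_share = counts.most_common(1)[0][1] / sum(counts.values())
            if top_share > SKEW_SHARE:
                flags.append(flag)
    return flags
```

The flags mark records for closer reading; they do not lower or adjust any score on their own.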

Sample report

A demo report format for input summary, rubric scores, flags, and human-review notes.
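As a rough illustration of that format, the sketch below holds the four parts in one structure and renders them as plain text. The class name, field names, and status wording are assumptions. The report stays "pending human review" until a reviewer adds notes, in line with keeping final judgment with humans.

```python
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    """Hypothetical container for one demo report."""
    input_summary: str
    rubric_scores: dict[str, int]
    flags: list[str] = field(default_factory=list)
    human_review_notes: str = ""

    @property
    def status(self) -> str:
        # Model output alone never closes a report.
        if self.human_review_notes.strip():
            return "reviewed"
        return "pending human review"

    def render(self) -> str:
        lines = [f"Input summary: {self.input_summary}", "Rubric scores:"]
        lines += [f"  {name}: {score}" for name, score in self.rubric_scores.items()]
        lines.append("Flags: " + (", ".join(self.flags) or "none"))
        lines.append(f"Human-review notes: {self.human_review_notes or '(none yet)'}")
        lines.append(f"Status: {self.status}")
        return "\n".join(lines)
```

A usage example with made-up values:

```python
report = EvalReport(
    input_summary="Example project summary",
    rubric_scores={"methodology": 4},
    flags=["thin_abstract"],
)
print(report.render())
```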

Boundary

  • This does not replace NIH expert review.
  • This does not claim a completed public benchmark.
  • This does not prove model accuracy or commercial impact.

Role Mapping

  • LLM Eval / data product: maps task, rubric, samples, and human review.
  • Model strategy product: turns fuzzy quality into comparable evaluation language.
  • AI product: explains model output risk in terms product and research teams can share.