LLM Text Evaluation Framework

LLM Evaluation & Benchmarking App


This Streamlit app compares an LLM response against a reference answer across seven weighted criteria defined in the repository: relevance, accuracy, completeness, coherence, creativity, tone, and intent alignment. Runs are stored in a SQLite database, results are visualized with Plotly, and the app can be run with Docker or a local Python workflow.
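The weighted aggregation described above can be sketched as follows; the weight values and the `composite_score` helper are illustrative assumptions, not the app's actual implementation.

```python
# Hypothetical sketch of a weighted composite score over the seven criteria.
# The criterion names follow the app; the specific weights are assumptions.
WEIGHTS = {
    "relevance": 0.20,
    "accuracy": 0.20,
    "completeness": 0.15,
    "coherence": 0.15,
    "creativity": 0.10,
    "tone": 0.10,
    "intent_alignment": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into one weighted value."""
    total = sum(WEIGHTS.values())  # normalize in case weights don't sum to 1
    return sum(scores[name] * w for name, w in WEIGHTS.items()) / total
```

Normalizing by the weight total keeps the composite in [0, 1] even after a user edits the weights in configuration.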

When it is useful

Use it when you are benchmarking prompts or models, teaching structured evaluation, or want a repeatable replacement for ad-hoc notebooks with simple dashboards. It is a lab and QA helper, not an independent judge of truth.

What you can do

  • Enter a prompt, model output, and expected answer (terminology follows the app) and inspect per-criterion scores plus charts on the home page.
  • Open the history view to revisit past evaluations and compare patterns over time.
  • Adjust weights in configuration as documented; criteria importance is configurable, not universal truth.
  • Deploy or develop using the GitHub repository instructions (Docker Compose or uv + Streamlit paths).
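The history view above implies that each run is persisted. A minimal sketch of how runs could be stored in SQLite and replayed later is shown below; the table layout and helper names are assumptions, not the app's actual schema.

```python
# Hypothetical sketch: persisting evaluation runs so a history view can
# revisit them. The schema is an assumption, not the app's real one.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the app would use a file on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS runs (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP,
           prompt TEXT,
           model_output TEXT,
           expected_answer TEXT,
           scores_json TEXT
       )"""
)

def save_run(prompt: str, model_output: str, expected_answer: str, scores: dict) -> None:
    """Insert one evaluation run; per-criterion scores are stored as JSON."""
    conn.execute(
        "INSERT INTO runs (prompt, model_output, expected_answer, scores_json)"
        " VALUES (?, ?, ?, ?)",
        (prompt, model_output, expected_answer, json.dumps(scores)),
    )
    conn.commit()

def history() -> list[tuple[str, dict]]:
    """Return (prompt, scores) pairs in chronological order."""
    rows = conn.execute(
        "SELECT prompt, scores_json FROM runs ORDER BY id"
    ).fetchall()
    return [(p, json.loads(s)) for p, s in rows]
```

Storing scores as a JSON column keeps the schema stable if the set of criteria or weights changes between runs.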

Limits

  • Scores come from automated heuristics and models (similarity and text stats); they can disagree with human experts or favor certain writing styles.
  • “Accuracy” inside the tool does not verify facts against the world; pair with human review for high-stakes or regulated content.
  • Features marked as future in the README (batch exports, API, etc.) are not promised here.
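The first limit above can be demonstrated with any lexical-similarity measure. The sketch below uses `difflib.SequenceMatcher` as a stand-in for whatever similarity metric the app actually uses (an assumption): a factually equivalent paraphrase scores well below a near-verbatim copy.

```python
# Illustrates why heuristic similarity scores can disagree with humans:
# a correct paraphrase gets a lower lexical score than a near-verbatim copy.
# SequenceMatcher is a stand-in metric, not necessarily what the app uses.
from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

reference = "Water boils at 100 degrees Celsius at sea level."
paraphrase = "At sea-level pressure, the boiling point of water is 100 C."
verbatim = "Water boils at 100 degrees Celsius at sea level!"
```

Both candidate answers are factually correct, yet the paraphrase is penalized for using different wording, which is exactly why high-stakes content still needs human review.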
