LLM Text Evaluation Framework
Score model answers on seven criteria, chart trends, and keep a local history
This Streamlit app compares an LLM response against a reference answer on seven weighted criteria defined in the repository: relevance, accuracy, completeness, coherence, creativity, tone, and intent alignment. It stores each run in a local SQLite database, renders Plotly charts, and can be run via Docker or a local Python workflow.
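The overall score is a weighted combination of the seven criteria. The sketch below illustrates one way such an aggregation can work; the criterion names follow the app, but the weights, function name, and 0-to-1 scale are assumptions, not the repository's implementation.

```python
# Illustrative sketch only: the criterion names come from the app, but the
# weights and the aggregation function below are assumptions, not repo code.
CRITERIA_WEIGHTS = {
    "relevance": 0.20,
    "accuracy": 0.20,
    "completeness": 0.15,
    "coherence": 0.15,
    "creativity": 0.10,
    "tone": 0.10,
    "intent_alignment": 0.10,
}

def weighted_overall(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-1) into one weighted score."""
    total_weight = sum(CRITERIA_WEIGHTS.values())
    return sum(scores[name] * w for name, w in CRITERIA_WEIGHTS.items()) / total_weight

example = {name: 0.8 for name in CRITERIA_WEIGHTS}
print(weighted_overall(example))  # prints 0.8 when every criterion scores 0.8
```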
When it is useful
Use it when you are benchmarking prompts or models, teaching structured evaluation, or want a repeatable replacement for ad-hoc notebooks with simple dashboards. It is a lab and QA helper, not an independent judge of truth.
What you can do
- Enter a prompt, the model output, and the expected answer (terminology follows the app) and inspect per-criterion scores plus charts on the home view.
- Open the history view to revisit past evaluations and compare patterns over time.
- Adjust weights in configuration as documented; criterion importance is configurable, not a universal truth (see the sketch after this list).
- Deploy or develop by following the GitHub repository instructions (Docker Compose or uv + Streamlit paths).
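As a mental model for what "adjust weights in configuration" can look like, here is a hypothetical Streamlit sidebar sketch. The repository documents its actual mechanism; the slider UI, variable names, and normalization step below are assumptions.

```python
# Hypothetical weight-configuration sketch -- not the app's real mechanism.
import streamlit as st

criteria = ["relevance", "accuracy", "completeness", "coherence",
            "creativity", "tone", "intent_alignment"]

weights = {}
st.sidebar.header("Criterion weights")
for name in criteria:
    # Each weight defaults to an equal share and can be tuned per run.
    weights[name] = st.sidebar.slider(name, 0.0, 1.0, 1.0 / len(criteria))

# Normalize so the weights always sum to 1, whatever the sliders show.
total = sum(weights.values()) or 1.0
normalized = {name: w / total for name, w in weights.items()}
```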
Limits
- Scores come from automated heuristics and models (similarity measures and text statistics); they can disagree with human experts or favor certain writing styles (see the sketch after this list).
- “Accuracy” inside the tool does not verify facts against the world; pair with human review for high-stakes or regulated content.
- Features marked as future in the README (batch exports, API, etc.) are not promised here.
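To make the first limitation concrete, the standard-library sketch below shows how a purely lexical similarity measure can reward a near-copy containing a factual error over a correct paraphrase. The app's own heuristics may be more sophisticated; this example is not taken from its code.

```python
# Why similarity-based scoring can reward surface overlap: a near-copy with a
# wrong fact can outscore a correct paraphrase. Uses only the standard library.
from difflib import SequenceMatcher

reference = "The Eiffel Tower was completed in 1889."
near_copy = "The Eiffel Tower was completed in 1989."            # wrong year, high overlap
paraphrase = "Construction of the Eiffel Tower finished in 1889."  # correct, lower overlap

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity(reference, near_copy))   # higher score despite the factual error
print(similarity(reference, paraphrase))  # lower score despite being correct
```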





