MCP Scorecard


EvalView: pytest for AI Agents

A Harvard Extension School graduate in Tel Aviv built a regression testing framework for AI agents — golden baseline diffing that catches behavioral drift after prompt changes, model swaps, or tool updates. 45 stars, no API keys required.
io.github.hidai25

Observability platforms show you what your AI agent did. Eval platforms score how well it did. io.github.hidai25/evalview-mcp answers a different question: did my agent break?

The Server

EvalView, built by Hidai Bar-Mor (GitHub, Tel Aviv), is a pytest-style testing framework for AI agents. The README frames the positioning directly:

"Unlike observability platforms (LangSmith) that show you what happened, or eval platforms (Braintrust) that score how good your agent is, EvalView answers: 'Did my agent break?'"

EvalView README

The core workflow: save a golden baseline of your agent's behavior, then run checks after any change — prompt edits, model swaps, tool updates. EvalView diffs the results and reports one of four statuses: PASSED (behavior matches), TOOLS_CHANGED (different tools called), OUTPUT_CHANGED (same tools, different output quality), or REGRESSION (significant score drop). Deterministic tool-call and sequence scoring means no LLM-as-judge dependency for basic checks, though a statistical mode with LLM judge caching is available for deeper evaluation.
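The diffing logic described above can be sketched in a few lines. This is an illustrative reconstruction, not EvalView's actual API — the `Run` class, `classify` function, and the `REGRESSION_DROP` threshold are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Run:
    tools: list[str]   # tool names in call order
    score: float       # deterministic output-quality score, 0..1

REGRESSION_DROP = 0.2  # assumed threshold for a "significant" score drop

def classify(golden: Run, current: Run) -> str:
    """Compare a current run against the golden baseline (hypothetical sketch)."""
    if current.tools != golden.tools:
        return "TOOLS_CHANGED"          # different tools called
    if golden.score - current.score >= REGRESSION_DROP:
        return "REGRESSION"             # significant score drop
    if abs(current.score - golden.score) > 1e-9:
        return "OUTPUT_CHANGED"         # same tools, different output quality
    return "PASSED"                     # behavior matches

golden = Run(tools=["search", "summarize"], score=0.92)
print(classify(golden, Run(["search", "summarize"], 0.92)))  # PASSED
print(classify(golden, Run(["search", "fetch"], 0.92)))      # TOOLS_CHANGED
print(classify(golden, Run(["search", "summarize"], 0.85)))  # OUTPUT_CHANGED
print(classify(golden, Run(["search", "summarize"], 0.60)))  # REGRESSION
```

Because tool-call and sequence comparison is a plain equality check, this path needs no LLM-as-judge — which is the point of the deterministic mode.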

Features

The tool set is broader than basic diffing:

- Framework-native adapters for LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, and Ollama.
- Forbidden tool contracts — declare tools that must never be called, and the test hard-fails immediately.
- HTML trace replay for forensic debugging with full prompt/completion spans.
- Multi-reference goldens (up to five variants) for non-deterministic agents.
- CI/CD integration with a GitHub Action, exit codes, PR comments, and JSON output.
- Eight MCP tools, including test creation, snapshot runs, skill validation, and visual report generation.
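A forbidden tool contract amounts to a hard assertion over the agent's tool-call trace. Here is a minimal sketch of the idea — the `ForbiddenToolError` class and `enforce_contract` function are hypothetical names, not EvalView's real interface:

```python
class ForbiddenToolError(AssertionError):
    """Raised when an agent calls a tool the contract forbids (illustrative)."""

def enforce_contract(tool_calls: list[str], forbidden: set[str]) -> None:
    # Hard-fail on the first forbidden call rather than scoring the run.
    for i, name in enumerate(tool_calls):
        if name in forbidden:
            raise ForbiddenToolError(
                f"forbidden tool '{name}' called at position {i}"
            )

enforce_contract(["search", "summarize"], forbidden={"delete_db"})  # passes
try:
    enforce_contract(["search", "delete_db"], forbidden={"delete_db"})
except ForbiddenToolError as e:
    print(e)  # forbidden tool 'delete_db' called at position 1
```

Treating this as a hard failure rather than a score component is the design choice: a destructive tool call should never be averaged away by an otherwise good run.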

The Builder

Bar-Mor describes himself as a finance professional and software engineering master's graduate from Harvard Extension School. His GitHub history spans several years of projects — from Harvard coursework (web development, data science) to a Chrome extension using GPT-3, a Unity fighting game, and household management apps. EvalView represents a more focused bet on AI agent infrastructure.

The repo was created in November 2025; it is written in Python, Apache 2.0 licensed, and published on PyPI as evalview. 45 stars, 5 forks. The MCP integration provides agent-level access to the testing workflow, but the core value is the testing framework itself — this is a developer tool that happens to also be an MCP server, rather than the other way around.

Score: 66. No flags.

Sources: Hidai Bar-Mor — GitHub · EvalView — repo · PyPI · Scorecard: io.github.hidai25 (score 66)
