MCP Scorecard


EvalView: pytest for AI Agents

A Harvard Extension School graduate in Tel Aviv built a regression testing framework for AI agents — golden baseline diffing that catches behavioral drift after prompt changes, model swaps, or tool updates. 45 stars, no API keys required.
io.github.hidai25

Observability platforms show you what your AI agent did. Eval platforms score how well it did. io.github.hidai25/evalview-mcp answers a different question: did my agent break?

The Server

EvalView, built by Hidai Bar-Mor (GitHub, Tel Aviv), is a pytest-style testing framework for AI agents. The README frames the positioning directly:

"Unlike observability platforms (LangSmith) that show you what happened, or eval platforms (Braintrust) that score how good your agent is, EvalView answers: 'Did my agent break?'"

EvalView README

The core workflow: save a golden baseline of your agent's behavior, then run checks after any change — prompt edits, model swaps, tool updates. EvalView diffs the results and reports one of four statuses: PASSED (behavior matches), TOOLS_CHANGED (different tools called), OUTPUT_CHANGED (same tools, different output quality), or REGRESSION (significant score drop). Deterministic tool-call and sequence scoring means no LLM-as-judge dependency for basic checks, though a statistical mode with LLM judge caching is available for deeper evaluation.
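The diffing logic described above can be sketched in a few lines. This is an illustrative reconstruction, not EvalView's actual API — the `Run` class, `classify` function, and the `REGRESSION_DROP` threshold are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Run:
    tools: list[str]   # tool names in call order
    score: float       # deterministic output-quality score, 0..1

REGRESSION_DROP = 0.2  # assumed threshold for a "significant" score drop

def classify(golden: Run, current: Run) -> str:
    """Compare a current run against the golden baseline (hypothetical sketch)."""
    if current.tools != golden.tools:
        return "TOOLS_CHANGED"          # different tools called
    if golden.score - current.score >= REGRESSION_DROP:
        return "REGRESSION"             # significant score drop
    if abs(current.score - golden.score) > 1e-9:
        return "OUTPUT_CHANGED"         # same tools, different output quality
    return "PASSED"                     # behavior matches

golden = Run(tools=["search", "summarize"], score=0.92)
print(classify(golden, Run(["search", "summarize"], 0.92)))  # PASSED
print(classify(golden, Run(["search", "fetch"], 0.92)))      # TOOLS_CHANGED
print(classify(golden, Run(["search", "summarize"], 0.85)))  # OUTPUT_CHANGED
print(classify(golden, Run(["search", "summarize"], 0.60)))  # REGRESSION
```

Because tool-call and sequence comparison is a plain equality check, this path needs no LLM-as-judge — which is the point of the deterministic mode.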

Features

The tool set is broader than basic diffing:

- Framework-native adapters for LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, and Ollama.
- Forbidden tool contracts — declare tools that must never be called, and the test hard-fails immediately.
- HTML trace replay for forensic debugging with full prompt/completion spans.
- Multi-reference goldens (up to five variants) for non-deterministic agents.
- CI/CD integration with a GitHub Action, exit codes, PR comments, and JSON output.
- Eight MCP tools, including test creation, snapshot runs, skill validation, and visual report generation.
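A forbidden tool contract amounts to a hard assertion over the agent's tool-call trace. Here is a minimal sketch of the idea — the `ForbiddenToolError` class and `enforce_contract` function are hypothetical names, not EvalView's real interface:

```python
class ForbiddenToolError(AssertionError):
    """Raised when an agent calls a tool the contract forbids (illustrative)."""

def enforce_contract(tool_calls: list[str], forbidden: set[str]) -> None:
    # Hard-fail on the first forbidden call rather than scoring the run.
    for i, name in enumerate(tool_calls):
        if name in forbidden:
            raise ForbiddenToolError(
                f"forbidden tool '{name}' called at position {i}"
            )

enforce_contract(["search", "summarize"], forbidden={"delete_db"})  # passes
try:
    enforce_contract(["search", "delete_db"], forbidden={"delete_db"})
except ForbiddenToolError as e:
    print(e)  # forbidden tool 'delete_db' called at position 1
```

Treating this as a hard failure rather than a score component is the design choice: a destructive tool call should never be averaged away by an otherwise good run.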

The Builder

Bar-Mor describes himself as a finance professional and software engineering master's graduate from Harvard Extension School. His GitHub history spans several years of projects — from Harvard coursework (web development, data science) to a Chrome extension using GPT-3, a Unity fighting game, and household management apps. EvalView represents a more focused bet on AI agent infrastructure.

The repo was created in November 2025; it is written in Python, Apache 2.0 licensed, and published on PyPI as evalview. 45 stars, 5 forks. The MCP integration provides agent-level access to the testing workflow, but the core value is the testing framework itself — this is a developer tool that happens to also be an MCP server, rather than the other way around.

Score: 66. No flags.

Sources: Hidai Bar-Mor — GitHub · EvalView — repo · PyPI · Scorecard: io.github.hidai25 (score 66)
