evaluation

Build evaluation frameworks for agent systems

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-evaluation&locale=en&source=copy

Agent Evaluation Framework - Building an Evaluation Framework for Agent Systems

Skills Overview


The Evaluation skill helps you build a comprehensive evaluation framework for non-deterministic agent systems using multi-dimensional scoring criteria and continuous monitoring pipelines. This ensures system quality and catches regressions.

Suitable Use Cases

1. Systematically Test Agent Performance


When you need to validate an agent’s performance on complex tasks, this skill provides an outcome-focused evaluation approach. Because an agent may take different valid paths to the same goal, traditional step-based testing fails. This skill teaches you how to design multi-dimensional scoring criteria (factual accuracy, completeness, citation accuracy, source quality, and tool efficiency) to determine whether the agent achieves the correct results while following a reasonable process.
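
As a minimal sketch of outcome-focused scoring, the check below grades the final answer against a list of required facts instead of asserting a specific tool sequence. The `outcome_score` helper and the sample answers are illustrative, not part of the skill itself.

```python
# Sketch of outcome-focused evaluation: score the final answer against
# required facts instead of checking a fixed sequence of tool calls.

def outcome_score(final_answer: str, required_facts: list[str]) -> float:
    """Fraction of required facts present in the agent's final answer."""
    if not required_facts:
        return 1.0
    hits = sum(1 for fact in required_facts if fact.lower() in final_answer.lower())
    return hits / len(required_facts)

# Two runs that took different tool paths can both pass:
answer_a = "Revenue grew 12% in 2023, driven by the EMEA region."
answer_b = "In 2023 the EMEA region led growth; revenue was up 12%."
facts = ["12%", "2023", "EMEA"]
assert outcome_score(answer_a, facts) == 1.0
assert outcome_score(answer_b, facts) == 1.0
```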

2. Validate Context Engineering Choices


When you optimize prompts, context windows, or tool configurations, you need a systematic way to evaluate the real impact of these changes. This skill guides you to build an evaluation pipeline that compares quality scores, token usage, and efficiency metrics across different context strategies on the same test set. It also uses context degradation tests to identify performance thresholds.
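
One way to sketch such a pipeline, assuming hypothetical `run_agent` and `judge_quality` callables standing in for your agent harness and scoring function:

```python
# Sketch: run each context strategy over the same test set and compare
# mean quality and mean token usage. run_agent and judge_quality are
# illustrative stand-ins for your own harness and scorer.
from statistics import mean

def compare_strategies(test_set, strategies, run_agent, judge_quality):
    """Return per-strategy mean quality and token usage on a shared test set."""
    report = {}
    for name, config in strategies.items():
        results = [run_agent(case, config) for case in test_set]
        report[name] = {
            "mean_quality": mean(judge_quality(r) for r in results),
            "mean_tokens": mean(r["tokens"] for r in results),
        }
    return report

# Toy demo with stubbed agent and judge:
demo = compare_strategies(
    test_set=[{"q": "a"}, {"q": "b"}],
    strategies={"full_context": {}, "compressed": {}},
    run_agent=lambda case, cfg: {"tokens": 100, "answer": case["q"]},
    judge_quality=lambda r: 0.9,
)
assert demo["full_context"]["mean_tokens"] == 100
```

Because every strategy sees the identical test set, differences in the report reflect the context change rather than test-case variance.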

3. Continuous Monitoring in Production


After deployment, you need to continuously track agent quality. This skill provides a production monitoring plan: randomly sample live interactions for evaluation, set quality-degradation alerts, and maintain trend-analysis dashboards so you can confirm that agent behavior in production meets expectations.
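
A minimal sketch of that monitoring loop; the `sample_rate`, `window`, and `alert_threshold` values are illustrative defaults, not recommendations from the skill:

```python
# Sketch of production monitoring: sample a fraction of live interactions,
# score them, and alert when the rolling mean quality dips below a threshold.
import random
from collections import deque
from statistics import mean

class QualityMonitor:
    def __init__(self, sample_rate=0.05, window=100, alert_threshold=0.8):
        self.sample_rate = sample_rate          # fraction of traffic to score
        self.scores = deque(maxlen=window)      # rolling window of scores
        self.alert_threshold = alert_threshold  # minimum acceptable mean

    def observe(self, interaction, score_fn) -> bool:
        """Maybe score one interaction; return True if an alert should fire."""
        if random.random() < self.sample_rate:
            self.scores.append(score_fn(interaction))
        return (
            len(self.scores) == self.scores.maxlen
            and mean(self.scores) < self.alert_threshold
        )
```

The rolling window doubles as the data source for a trend dashboard: plotting its mean over time shows gradual quality drift before the alert threshold is crossed.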

Core Features

1. Multi-Dimensional Evaluation Criteria Design


Create a comprehensive scoring system that covers factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. Convert each dimension into a numeric score (0.0 to 1.0), compute an overall score using weights based on use-case requirements, and set clear pass/fail thresholds.
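
The weighting step can be sketched as follows; the specific weights and threshold are illustrative and should be tuned to your use case:

```python
# Sketch of a weighted overall score with a pass/fail threshold.
# Weights and threshold below are illustrative, not prescriptive.

WEIGHTS = {
    "factual_accuracy": 0.35,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.10,
}
PASS_THRESHOLD = 0.75

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0.0, 1.0]."""
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)

scores = {
    "factual_accuracy": 0.9,
    "completeness": 0.8,
    "citation_accuracy": 1.0,
    "source_quality": 0.7,
    "tool_efficiency": 0.6,
}
# 0.35*0.9 + 0.25*0.8 + 0.15*1.0 + 0.15*0.7 + 0.10*0.6 = 0.83
assert overall_score(scores) >= PASS_THRESHOLD
```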

2. LLM-as-Judge Automated Evaluation


Use large language models as judges to evaluate large test sets at scale. Design evaluation prompts that capture the target dimensions: provide a clear task description, the agent’s output, reference answers (if available), and an evaluation rubric with level descriptions, then request structured judgments.

3. Test Set Stratification


Build test sets according to complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interactions, deep reasoning). Sample from real usage patterns and add known edge cases to ensure coverage of all complexity levels.
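
A sketch of that stratification, assuming each test case records an illustrative `complexity` field:

```python
# Sketch of a stratified test set: bucket cases by complexity level,
# sample a fixed quota per bucket, then append known edge cases so every
# level is covered. Field names are illustrative.
import random

LEVELS = ("simple", "medium", "complex", "very_complex")

def build_test_set(cases, per_level=25, edge_cases=()):
    """Sample up to per_level cases from each complexity bucket."""
    by_level = {level: [] for level in LEVELS}
    for case in cases:
        by_level[case["complexity"]].append(case)
    sampled = []
    for level in LEVELS:
        pool = by_level[level]
        sampled.extend(random.sample(pool, min(per_level, len(pool))))
    return sampled + list(edge_cases)
```

Sampling `cases` from real usage logs keeps the distribution realistic, while the fixed per-level quota guarantees the hard end of the spectrum is never underrepresented.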

Common Questions

How do you evaluate non-deterministic agent systems?


Agents may take completely different valid paths to achieve the goal across different runs. Traditional evaluation methods that check specific steps will not work in this scenario. The solution is to use an outcome-oriented evaluation approach: assess whether the agent produces the correct final result while following a reasonable process, rather than forcing a specific execution path.

What is an LLM-as-Judge evaluation?


LLM-as-judge is an automated evaluation method that uses a large language model to assess agent outputs. Its core is to design effective evaluation prompts that capture the target evaluation dimensions. During evaluation, you should provide a clear task description, the agent’s output, reference answers (if available), an evaluation rubric with level descriptions, and request structured judgments. This approach can scale to large test sets and provide consistent judgments, but it should be complemented with human evaluation to capture edge cases.

What is the “95% discovery” in the BrowseComp research?


The study shows that three factors explain 95% of the differences in agent performance: token usage (80%), number of tool calls (about 10%), and model selection (about 5%). This means evaluations should use realistic token budgets rather than unlimited resources; upgrading to a newer model yields more benefit than giving an older model a larger token budget; and the finding validates architectures that distribute work across multiple agents with independent context windows.