evaluation
Build evaluation frameworks for agent systems
Category: AI Skill Development
Download and extract to your skills directory
Copy command and send to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-evaluation&locale=en&source=copy
Agent Evaluation Framework - Building an Evaluation Framework for Agent Systems
Skill Overview
The Evaluation skill helps you build a comprehensive evaluation framework for non-deterministic agent systems using multi-dimensional scoring criteria and continuous monitoring pipelines, so you can maintain system quality and catch regressions.
Suitable Use Cases
1. Systematically Test Agent Performance
When you need to validate an agent's performance on complex tasks, this skill provides an outcome-focused evaluation approach. Because the agent may take different valid paths to the same goal, traditional step-based testing fails. This skill teaches you how to design multi-dimensional scoring criteria (factual accuracy, completeness, citation accuracy, source quality, and tool efficiency) to determine whether the agent achieves the correct results while following a reasonable process.
2. Validate Context Engineering Choices
When you optimize prompts, context windows, or tool configurations, you need a systematic way to evaluate the real impact of these changes. This skill guides you to build an evaluation pipeline that compares quality scores, token usage, and efficiency metrics across different context strategies on the same test set. It also uses context degradation tests to identify performance thresholds.
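Such a comparison pipeline can be sketched as follows. This is a minimal illustration, assuming hypothetical `run_agent` and `judge_quality` interfaces that you would supply; they are not part of any specific library.

```python
# Hypothetical sketch: compare context strategies on the same test set,
# tracking quality scores and token usage per strategy.
def compare_strategies(test_set, strategies, run_agent, judge_quality):
    results = {}
    for name, config in strategies.items():
        scores, tokens = [], []
        for case in test_set:
            # run_agent is assumed to return the output text and token count.
            output = run_agent(case["task"], config)
            scores.append(judge_quality(case, output["text"]))
            tokens.append(output["tokens_used"])
        results[name] = {
            "mean_score": sum(scores) / len(scores),
            "mean_tokens": sum(tokens) / len(tokens),
        }
    return results
```

Running every strategy against the same test set is what makes the quality and token numbers directly comparable.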
3. Continuous Monitoring in Production
After deployment, you need to continuously track agent quality. This skill provides a production monitoring plan: by randomly sampling interactions for evaluation, setting quality-degradation alerts, and maintaining trend-analysis dashboards, you can ensure that agent behavior in production meets expectations.
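The sampling-and-alerting loop above might look like this minimal sketch. The sampling rate and alert threshold are illustrative assumptions, not values prescribed by the skill.

```python
import random

# Hypothetical sketch: sample a fraction of production interactions for
# offline evaluation, then alert when mean quality drops below a threshold.
def sample_interactions(interactions, rate=0.05, seed=None):
    rng = random.Random(seed)
    return [i for i in interactions if rng.random() < rate]

def check_quality(scores, alert_threshold=0.8):
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "alert": mean < alert_threshold}
```

Feeding `mean_score` into a dashboard over time gives you the trend analysis; the `alert` flag is what you would wire to a pager or notification channel.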
Core Features
1. Multi-Dimensional Evaluation Criteria Design
Create a comprehensive scoring system that covers factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. Convert each dimension into a numeric score (0.0 to 1.0), compute an overall score using weights based on use-case requirements, and set clear pass/fail thresholds.
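A minimal sketch of such a weighted scoring system is shown below. The specific weights and the 0.8 pass threshold are illustrative assumptions; you would tune both to your use case.

```python
# Hypothetical sketch: weighted multi-dimensional score with a pass/fail
# threshold. Weights are illustrative and should sum to 1.0.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.20,
    "source_quality": 0.15,
    "tool_efficiency": 0.10,
}

def overall_score(dimension_scores, weights=WEIGHTS):
    # Each dimension score is expected to be in [0.0, 1.0].
    return sum(weights[d] * dimension_scores[d] for d in weights)

def passes(dimension_scores, threshold=0.8):
    return overall_score(dimension_scores) >= threshold
```

Keeping every dimension on the same 0.0-1.0 scale is what makes the weighted sum meaningful and the threshold easy to reason about.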
2. LLM-as-Judge Automated Evaluation
Use large language models as judges to enable scalable evaluation of large test sets. Design evaluation prompts that capture the target evaluation dimensions: provide a clear task description, the agent's output, reference answers (if available), and an evaluation rubric with level descriptions, then request a structured judgment.
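The prompt structure described above can be sketched as below. The rubric wording, the JSON response format, and the absence of any specific LLM client are all assumptions; you would plug in your own model call.

```python
import json

# Hypothetical sketch of an LLM-as-judge prompt builder and a parser for
# the structured judgment the judge model is asked to return.
def build_judge_prompt(task, output, rubric, reference=None):
    parts = [
        "You are evaluating an agent's output.",
        f"Task: {task}",
        f"Agent output: {output}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append(f"Rubric:\n{rubric}")
    parts.append(
        'Respond with JSON: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}'
    )
    return "\n\n".join(parts)

def parse_judgment(raw_response):
    judgment = json.loads(raw_response)
    if not 0.0 <= judgment["score"] <= 1.0:
        raise ValueError("judge score out of range")
    return judgment
```

Requesting a structured (here, JSON) judgment is what lets you aggregate scores automatically across a large test set instead of reading free-form critiques.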
3. Test Set Stratification
Build test sets according to complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interactions, deep reasoning). Sample from real usage patterns and add known edge cases to ensure coverage of all complexity levels.
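The stratification step might be sketched as follows. The `complexity` field name and the per-tier sample size are illustrative assumptions about how your test cases are labeled.

```python
import random

# Hypothetical sketch: build a stratified test set by complexity tier,
# sampling from real usage and appending known edge cases.
TIERS = ["simple", "medium", "complex", "very_complex"]

def stratify(cases, per_tier, edge_cases=(), seed=None):
    rng = random.Random(seed)
    test_set = []
    for tier in TIERS:
        pool = [c for c in cases if c["complexity"] == tier]
        k = min(per_tier, len(pool))
        test_set.extend(rng.sample(pool, k))
    test_set.extend(edge_cases)
    return test_set
```

Sampling per tier rather than from the whole pool is what guarantees coverage of every complexity level, even when production traffic is dominated by simple cases.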
Common Questions
How do you evaluate non-deterministic agent systems?
Agents may take completely different valid paths to achieve the goal across different runs. Traditional evaluation methods that check specific steps will not work in this scenario. The solution is to use an outcome-oriented evaluation approach: assess whether the agent produces the correct final result while following a reasonable process, rather than forcing a specific execution path.
What is an LLM-as-Judge evaluation?
LLM-as-judge is an automated evaluation method that uses a large language model to assess agent outputs. Its core is designing effective evaluation prompts that capture the target evaluation dimensions. During evaluation, provide a clear task description, the agent's output, reference answers (if available), and an evaluation rubric with level descriptions, and then request a structured judgment. This approach scales to large test sets and provides consistent judgments, but it should be complemented with human evaluation to catch edge cases.
What is the “95% discovery” in the BrowseComp research?
The study shows that three factors explain 95% of the variance in agent performance: token usage (about 80%), number of tool calls (about 10%), and model choice (about 5%). This means evaluations should use realistic token budgets rather than assume unlimited resources; upgrading to a newer model provides more benefit than increasing the token budget of an older one; and the finding validates an architecture pattern that distributes work across multiple agents with independent context windows.