agent-orchestration-improve-agent

Systematically improve the performance of an existing agent through performance analysis, prompt-engineering optimization, and continuous iteration.

name: agent-orchestration-improve-agent
description: "Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration."

Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

Use this skill when

  • Improving an existing agent's performance or reliability

  • Analyzing failure modes, prompt quality, or tool usage

  • Running structured A/B tests or evaluation suites

  • Designing iterative optimization workflows for agents

Do not use this skill when

  • You are building a brand-new agent from scratch

  • There are no metrics, feedback, or test cases available

  • The task is unrelated to agent performance or prompt quality

Instructions

  • Establish baseline metrics and collect representative examples.

  • Identify failure modes and prioritize high-impact fixes.

  • Apply prompt and workflow improvements with measurable goals.

  • Validate with tests and roll out changes in controlled stages.

Safety

  • Avoid deploying prompt changes without regression testing.

  • Roll back quickly if quality or safety metrics regress.

Phase 1: Performance Analysis and Baseline Metrics

    Comprehensive analysis of agent performance using context-manager for historical data collection.

    1.1 Gather Performance Data

    Use: context-manager
    Command: analyze-agent-performance $ARGUMENTS --days 30

    Collect metrics including:

  • Task completion rate (successful vs failed tasks)

  • Response accuracy and factual correctness

  • Tool usage efficiency (correct tools, call frequency)

  • Average response time and token consumption

  • User satisfaction indicators (corrections, retries)

  • Hallucination incidents and error patterns
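
    A minimal sketch of turning raw interaction logs into these numbers. The InteractionLog fields and the log schema are illustrative assumptions, not a format defined by context-manager.

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class InteractionLog:
        # Hypothetical per-task record; field names are illustrative assumptions.
        succeeded: bool
        correct_tool_calls: int
        total_tool_calls: int
        latency_ms: float
        input_tokens: int
        output_tokens: int
        user_retried: bool

    def baseline_metrics(logs: list[InteractionLog]) -> dict[str, float]:
        """Aggregate per-task logs into the Phase 1 baseline metrics."""
        total = len(logs)
        tool_calls = sum(log.total_tool_calls for log in logs) or 1
        return {
            "task_success_rate": sum(log.succeeded for log in logs) / total,
            "tool_call_efficiency": sum(log.correct_tool_calls for log in logs) / tool_calls,
            "avg_latency_ms": mean(log.latency_ms for log in logs),
            "avg_tokens_per_task": mean(log.input_tokens + log.output_tokens for log in logs),
            "retry_rate": sum(log.user_retried for log in logs) / total,
        }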

    1.2 User Feedback Pattern Analysis

    Identify recurring patterns in user interactions:

  • Correction patterns: Where users consistently modify outputs

  • Clarification requests: Common areas of ambiguity

  • Task abandonment: Points where users give up

  • Follow-up questions: Indicators of incomplete responses

  • Positive feedback: Successful patterns to preserve

    1.3 Failure Mode Classification

    Categorize failures by root cause:

  • Instruction misunderstanding: Role or task confusion

  • Output format errors: Structure or formatting issues

  • Context loss: Long conversation degradation

  • Tool misuse: Incorrect or inefficient tool selection

  • Constraint violations: Safety or business rule breaches

  • Edge case handling: Unusual input scenarios
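
    Keyword tagging over error notes is one lightweight way to pre-sort failures into these buckets before a human review pass. The matching rules below are placeholder assumptions.

    from enum import Enum

    class FailureMode(Enum):
        INSTRUCTION_MISUNDERSTANDING = "instruction_misunderstanding"
        OUTPUT_FORMAT_ERROR = "output_format_error"
        CONTEXT_LOSS = "context_loss"
        TOOL_MISUSE = "tool_misuse"
        CONSTRAINT_VIOLATION = "constraint_violation"
        EDGE_CASE = "edge_case"

    # Illustrative keyword rules only; real triage still needs a human pass.
    RULES = [
        (FailureMode.INSTRUCTION_MISUNDERSTANDING, ("misread", "ignored the instruction", "wrong task")),
        (FailureMode.OUTPUT_FORMAT_ERROR, ("json", "schema", "format", "markdown")),
        (FailureMode.TOOL_MISUSE, ("wrong tool", "tool call", "api error")),
        (FailureMode.CONTEXT_LOSS, ("forgot", "earlier in the conversation", "context")),
        (FailureMode.CONSTRAINT_VIOLATION, ("policy", "not allowed", "unsafe")),
    ]

    def classify(error_note: str) -> FailureMode:
        text = error_note.lower()
        for mode, keywords in RULES:
            if any(k in text for k in keywords):
                return mode
        return FailureMode.EDGE_CASE  # default bucket for manual review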

    1.4 Baseline Performance Report

    Generate quantitative baseline metrics:

    Performance Baseline:
  • Task Success Rate: [X%]

  • Average Corrections per Task: [Y]

  • Tool Call Efficiency: [Z%]

  • User Satisfaction Score: [1-10]

  • Average Response Latency: [Xms]

  • Token Efficiency Ratio: [X:Y]

Phase 2: Prompt Engineering Improvements

    Apply advanced prompt optimization techniques using prompt-engineer agent.

    2.1 Chain-of-Thought Enhancement

    Implement structured reasoning patterns:

    Use: prompt-engineer
    Technique: chain-of-thought-optimization

  • Add explicit reasoning steps: "Let's approach this step-by-step..."

  • Include self-verification checkpoints: "Before proceeding, verify that..."

  • Implement recursive decomposition for complex tasks

  • Add reasoning trace visibility for debugging
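
    A minimal sketch of wrapping an existing system prompt with these reasoning and self-verification instructions. The wording of COT_SUFFIX is an illustrative assumption, not output of the prompt-engineer agent.

    COT_SUFFIX = """
    Before answering:
    1. Restate the task in one sentence.
    2. Work through the solution step by step, numbering each step.
    3. Before proceeding, verify that every requirement is addressed and
       that no step contradicts another.
    4. If verification fails, revise the reasoning and re-check once.
    Show the numbered reasoning, then give the final answer under "Answer:".
    """

    def add_chain_of_thought(base_prompt: str) -> str:
        """Append structured reasoning and self-verification instructions."""
        return base_prompt.rstrip() + "\n" + COT_SUFFIX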

    2.2 Few-Shot Example Optimization

    Curate high-quality examples from successful interactions:

  • Select diverse examples covering common use cases

  • Include edge cases that previously failed

  • Show both positive and negative examples with explanations

  • Order examples from simple to complex

  • Annotate examples with key decision points

    Example structure:

    Good Example:
    Input: [User request]
    Reasoning: [Step-by-step thought process]
    Output: [Successful response]
    Why this works: [Key success factors]

    Bad Example:
    Input: [Similar request]
    Output: [Failed response]
    Why this fails: [Specific issues]
    Correct approach: [Fixed version]
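
    The curated examples can then be rendered into a prompt section in the order chosen above. The FewShotExample record below is an illustrative assumption that mirrors this template.

    from dataclasses import dataclass

    @dataclass
    class FewShotExample:
        request: str
        reasoning: str
        output: str
        note: str            # "why this works" or "why this fails"
        is_positive: bool

    def render_examples(examples: list[FewShotExample]) -> str:
        """Render examples (already ordered simple to complex) for the prompt."""
        blocks = []
        for ex in examples:
            label = "Good Example" if ex.is_positive else "Bad Example"
            blocks.append(
                f"{label}:\nInput: {ex.request}\nReasoning: {ex.reasoning}\n"
                f"Output: {ex.output}\nNote: {ex.note}"
            )
        return "\n\n".join(blocks)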

    2.3 Role Definition Refinement

    Strengthen agent identity and capabilities:

  • Core purpose: Clear, single-sentence mission

  • Expertise domains: Specific knowledge areas

  • Behavioral traits: Personality and interaction style

  • Tool proficiency: Available tools and when to use them

  • Constraints: What the agent should NOT do

  • Success criteria: How to measure task completion

    2.4 Constitutional AI Integration

    Implement self-correction mechanisms:

    Constitutional Principles:
  • Verify factual accuracy before responding

  • Self-check for potential biases or harmful content

  • Validate output format matches requirements

  • Ensure response completeness

  • Maintain consistency with previous responses

    Add critique-and-revise loops:

  • Initial response generation

  • Self-critique against principles

  • Automatic revision if issues detected

  • Final validation before output
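
    A minimal sketch of that loop. The generate and critique callables stand in for model calls and are assumptions; only the control flow is the point here.

    from typing import Callable

    PRINCIPLES = [
        "Factual claims are verifiable.",
        "No biased or harmful content.",
        "Output format matches the requirements.",
        "The response is complete.",
        "The response is consistent with earlier turns.",
    ]

    def critique_and_revise(
        task: str,
        generate: Callable[[str], str],                    # hypothetical model call
        critique: Callable[[str, list[str]], list[str]],   # returns violated principles
        max_revisions: int = 2,
    ) -> str:
        response = generate(task)
        for _ in range(max_revisions):
            issues = critique(response, PRINCIPLES)
            if not issues:
                break  # passes all constitutional checks
            # Ask for a revision that addresses only the flagged issues.
            response = generate(f"{task}\n\nRevise this draft to fix: {issues}\n\nDraft:\n{response}")
        return response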

    2.5 Output Format Tuning

    Optimize response structure:

  • Structured templates for common tasks

  • Dynamic formatting based on complexity

  • Progressive disclosure for detailed information

  • Markdown optimization for readability

  • Code block formatting with syntax highlighting

  • Table and list generation for data presentation

Phase 3: Testing and Validation

    Comprehensive testing framework with A/B comparison.

    3.1 Test Suite Development

    Create representative test scenarios:

    Test Categories:
  • Golden path scenarios (common successful cases)

  • Previously failed tasks (regression testing)

  • Edge cases and corner scenarios

  • Stress tests (complex, multi-step tasks)

  • Adversarial inputs (potential breaking points)

  • Cross-domain tasks (combining capabilities)
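
    One way to encode the suite so the same cases drive regression runs and the A/B comparison below; the structure is an assumption, not a prescribed format.

    from dataclasses import dataclass
    from enum import Enum

    class Category(Enum):
        GOLDEN_PATH = "golden_path"
        REGRESSION = "regression"        # previously failed tasks
        EDGE_CASE = "edge_case"
        STRESS = "stress"
        ADVERSARIAL = "adversarial"
        CROSS_DOMAIN = "cross_domain"

    @dataclass
    class TestCase:
        case_id: str
        category: Category
        prompt: str
        expected_behavior: str   # rubric text used by automated or human scoring

    suite = [TestCase("reg-001", Category.REGRESSION,
                      "Summarize the attached report in three bullet points.",
                      "Exactly three bullets, no invented figures.")]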

    3.2 A/B Testing Framework

    Compare original vs improved agent:

    Use: parallel-test-runner
    Config:
    - Agent A: Original version
    - Agent B: Improved version
    - Test set: 100 representative tasks
    - Metrics: Success rate, speed, token usage
    - Evaluation: Blind human review + automated scoring

    Statistical significance testing:

  • Minimum sample size: 100 tasks per variant

  • Confidence level: 95% (p < 0.05)

  • Effect size calculation (Cohen's d)

  • Power analysis for future tests
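
    For binary success rates, a two-proportion z-test gives the p-value; Cohen's h (the proportion analogue of Cohen's d) serves as the effect size. A stdlib-only sketch:

    from math import asin, erfc, sqrt

    def ab_significance(successes_a: int, n_a: int, successes_b: int, n_b: int):
        """Two-sided two-proportion z-test plus Cohen's h effect size."""
        p_a, p_b = successes_a / n_a, successes_b / n_b
        pooled = (successes_a + successes_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = erfc(abs(z) / sqrt(2))                 # two-sided p-value
        cohens_h = 2 * asin(sqrt(p_b)) - 2 * asin(sqrt(p_a))
        return z, p_value, cohens_h

    # Example: 72/100 successes for the original agent vs 85/100 for the improved one.
    z, p, h = ab_significance(72, 100, 85, 100)
    significant = p < 0.05                               # the 95% confidence threshold above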

    3.3 Evaluation Metrics

    Comprehensive scoring framework:

    Task-Level Metrics:

  • Completion rate (binary success/failure)

  • Correctness score (0-100% accuracy)

  • Efficiency score (steps taken vs optimal)

  • Tool usage appropriateness

  • Response relevance and completeness

    Quality Metrics:

  • Hallucination rate (factual errors per response)

  • Consistency score (alignment with previous responses)

  • Format compliance (matches specified structure)

  • Safety score (constraint adherence)

  • User satisfaction prediction

    Performance Metrics:

  • Response latency (time to first token)

  • Total generation time

  • Token consumption (input + output)

  • Cost per task (API usage fees)

  • Memory/context efficiency
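
    One way to compare variants across these dimensions is a weighted composite per task; the metric names and weights below are illustrative assumptions.

    # Each input score is assumed to be normalized to [0, 1] before weighting.
    WEIGHTS = {"completion": 0.4, "correctness": 0.3, "efficiency": 0.1,
               "format_compliance": 0.1, "safety": 0.1}

    def composite_score(scores: dict[str, float]) -> float:
        """Weighted roll-up of one task's per-dimension scores."""
        return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)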

    3.4 Human Evaluation Protocol

    Structured human review process:

  • Blind evaluation (evaluators don't know version)

  • Standardized rubric with clear criteria

  • Multiple evaluators per sample (inter-rater reliability)

  • Qualitative feedback collection

  • Preference ranking (A vs B comparison)
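
    Inter-rater reliability over paired pass/fail judgments can be checked with Cohen's kappa; the sketch assumes exactly two evaluators per sample.

    def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
        """Agreement between two evaluators, corrected for chance agreement."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        p_a_yes = sum(rater_a) / n
        p_b_yes = sum(rater_b) / n
        expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
        if expected == 1:
            return 1.0   # degenerate case: agreement is guaranteed by construction
        return (observed - expected) / (1 - expected)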

Phase 4: Version Control and Deployment

    Safe rollout with monitoring and rollback capabilities.

    4.1 Version Management

    Systematic versioning strategy:

    Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
    Example: customer-support-v2.3.1

    MAJOR: Significant capability changes
    MINOR: Prompt improvements, new examples
    PATCH: Bug fixes, minor adjustments

    Maintain version history:

  • Git-based prompt storage

  • Changelog with improvement details

  • Performance metrics per version

  • Rollback procedures documented
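
    A small helper for this version scheme; it assumes the exact agent-name-vMAJOR.MINOR.PATCH format shown above and nothing else.

    import re

    VERSION_RE = re.compile(r"^(?P<name>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$")

    def bump(version: str, level: str) -> str:
        """bump('customer-support-v2.3.1', 'minor') -> 'customer-support-v2.4.0'"""
        m = VERSION_RE.match(version)
        if not m:
            raise ValueError(f"unrecognized version string: {version}")
        major, minor, patch = int(m["major"]), int(m["minor"]), int(m["patch"])
        if level == "major":
            major, minor, patch = major + 1, 0, 0
        elif level == "minor":
            minor, patch = minor + 1, 0
        elif level == "patch":
            patch += 1
        else:
            raise ValueError(f"unknown bump level: {level}")
        return f"{m['name']}-v{major}.{minor}.{patch}"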

    4.2 Staged Rollout

    Progressive deployment strategy:

  • Alpha testing: Internal team validation (5% traffic)

  • Beta testing: Selected users (20% traffic)

  • Canary release: Gradual increase (20% → 50% → 100%)

  • Full deployment: After success criteria met

  • Monitoring period: 7-day observation window
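
    Deterministic hash bucketing is one way to hold each user on the same variant while the rollout percentage increases; the sketch assumes a stable user ID and is not tied to any particular serving stack.

    import hashlib

    def routed_to_candidate(user_id: str, rollout_percent: int, salt: str = "canary") -> bool:
        """Stable assignment: the same user stays on the same variant at a given stage."""
        digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100      # stable bucket in [0, 100)
        return bucket < rollout_percent

    # 20% beta stage: roughly one in five users sees the improved agent.
    use_new_agent = routed_to_candidate("user-1234", 20)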

    4.3 Rollback Procedures

    Quick recovery mechanism:

    Rollback Triggers:
  • Success rate drops >10% from baseline

  • Critical errors increase >5%

  • User complaints spike

  • Cost per task increases >20%

  • Safety violations detected

    Rollback Process:

  • Detect issue via monitoring

  • Alert team immediately

  • Switch to previous stable version

  • Analyze root cause

  • Fix and re-test before retry
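
    The triggers above can be evaluated mechanically against live metrics; the dictionary keys are assumptions, and the >10% success-rate drop is read here as 10 percentage points.

    def rollback_reasons(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
        """Return human-readable reasons (if any) to switch back to the stable version."""
        reasons = []
        if current["success_rate"] < baseline["success_rate"] - 0.10:
            reasons.append("success rate dropped >10% from baseline")
        if current["critical_error_rate"] > baseline["critical_error_rate"] + 0.05:
            reasons.append("critical errors increased >5%")
        if current["cost_per_task"] > baseline["cost_per_task"] * 1.20:
            reasons.append("cost per task increased >20%")
        if current.get("safety_violations", 0) > 0:
            reasons.append("safety violations detected")
        return reasons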

    4.4 Continuous Monitoring

    Real-time performance tracking:

  • Dashboard with key metrics

  • Anomaly detection alerts

  • User feedback collection

  • Automated regression testing

  • Weekly performance reports

Success Criteria

    Agent improvement is successful when:

  • Task success rate improves by ≥15%

  • User corrections decrease by ≥25%

  • No increase in safety violations

  • Response time remains within 10% of baseline

  • Cost per task doesn't increase >5%

  • Positive user feedback increases

Post-Deployment Review

    After 30 days of production use:

  • Analyze accumulated performance data

  • Compare against baseline and targets

  • Identify new improvement opportunities

  • Document lessons learned

  • Plan next optimization cycle

Continuous Improvement Cycle

    Establish regular improvement cadence:

  • Weekly: Monitor metrics and collect feedback

  • Monthly: Analyze patterns and plan improvements

  • Quarterly: Major version updates with new capabilities

  • Annually: Strategic review and architecture updates

Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.
