daily-news-report

Scrapes content from a preset URL list, filters for high-quality technical news, and generates a daily Markdown report.

name: daily-news-report
description: Scrapes content based on a preset URL list, filters high-quality technical information, and generates daily Markdown reports.
argument-hint: [optional: date]
disable-model-invocation: false
user-invocable: true
allowed-tools: Task, WebFetch, Read, Write, Bash(mkdir*), Bash(date*), Bash(ls*), mcp__chrome-devtools__*

Daily News Report v3.0

> Architecture Upgrade: Main Agent Orchestration + SubAgent Execution + Browser Scraping + Smart Caching

Core Architecture

┌─────────────────────────────────────────────────────────────────────┐
│ Main Agent (Orchestrator) │
│ Role: Scheduling, Monitoring, Evaluation, Decision, Aggregation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 1. Init │ → │ 2. Dispatch │ → │ 3. Monitor │ → │ 4. Evaluate │ │
│ │ Read Config │ │ Assign Tasks│ │ Collect Res │ │ Filter/Sort │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 5. Decision │ ← │ Enough 20? │ │ 6. Generate │ → │ 7. Update │ │
│ │ Cont/Stop │ │ Y/N │ │ Report File │ │ Cache Stats │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
↓ Dispatch ↑ Return Results
┌─────────────────────────────────────────────────────────────────────┐
│ SubAgent Execution Layer │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker A │ │ Worker B │ │ Browser │ │
│ │ (WebFetch) │ │ (WebFetch) │ │ (Headless) │ │
│ │ Tier1 Batch │ │ Tier2 Batch │ │ JS Render │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Structured Result Return │ │
│ │ { status, data: [...], errors: [...], metadata: {...} } │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘

Configuration Files

This skill uses the following configuration files:

| File | Purpose |
| --- | --- |
| sources.json | Source configuration, priorities, scrape methods |
| cache.json | Cached data, historical stats, deduplication fingerprints |

Execution Process Details

Phase 1: Initialization

Steps:
1. Determine date (user argument or current date)
2. Read sources.json for source configurations
3. Read cache.json for historical data
4. Create output directory NewsReport/
5. Check if a partial report exists for today (append mode)
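
A minimal initialization sketch in Python (the function name and return shape are illustrative; the file and directory names come from this skill):

```python
import json
from datetime import date
from pathlib import Path

def init_run(user_date: str | None = None) -> dict:
    """Phase 1: resolve the report date, load configs, prepare the output dir."""
    report_date = user_date or date.today().isoformat()   # user argument wins
    sources = json.loads(Path("sources.json").read_text())
    cache_file = Path("cache.json")
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    out_dir = Path("NewsReport")
    out_dir.mkdir(exist_ok=True)                          # Bash(mkdir*) equivalent
    report_path = out_dir / f"{report_date}-news-report.md"
    return {
        "date": report_date,
        "sources": sources,
        "cache": cache,
        "report_path": report_path,
        "append": report_path.exists(),  # partial report from earlier today
    }
```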

Phase 2: Dispatch SubAgents

Strategy: Parallel dispatch, batch execution, early stopping mechanism

Wave 1 (Parallel):
- Worker A: Tier1 Batch A (HN, HuggingFace Papers)
- Worker B: Tier1 Batch B (OneUsefulThing, Paul Graham)

Wait for results → Evaluate count

If < 15 high-quality items:
Wave 2 (Parallel):
- Worker C: Tier2 Batch A (James Clear, FS Blog)
- Worker D: Tier2 Batch B (HackerNoon, Scott Young)

If still < 20 items:
Wave 3 (Browser):
- Browser Worker: ProductHunt, Latent Space (require JS rendering)
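
In code terms, the wave loop looks roughly like the sketch below. `dispatch_wave` stands in for a batch of parallel Task calls, the wave labels are illustrative, and the quality cutoff of 4 is an assumption since the skill does not pin one down:

```python
WAVES = [
    ["Tier1 Batch A", "Tier1 Batch B"],  # Wave 1: HN/HF Papers, OneUsefulThing/PG
    ["Tier2 Batch A", "Tier2 Batch B"],  # Wave 2: James Clear/FS, HackerNoon/Scott Young
    ["Browser"],                         # Wave 3: ProductHunt, Latent Space
]

def high_quality_count(items: list[dict]) -> int:
    # "high quality" read as quality_score >= 4 -- an assumption
    return sum(1 for x in items if x.get("quality_score", 0) >= 4)

def collect(dispatch_wave) -> list[dict]:
    """Run the waves in order, stopping early once the thresholds are met."""
    items = list(dispatch_wave(WAVES[0]))   # Wave 1 (workers run in parallel)
    if high_quality_count(items) < 15:
        items += dispatch_wave(WAVES[1])    # Wave 2 only if Wave 1 fell short
    if len(items) < 20:
        items += dispatch_wave(WAVES[2])    # Wave 3 only as a last resort
    return items
```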

Phase 3: SubAgent Task Format

Task format received by each SubAgent:

task: fetch_and_extract
sources:
  - id: hn
    url: https://news.ycombinator.com
    extract: top_10
  - id: hf_papers
    url: https://huggingface.co/papers
    extract: top_voted

output_schema:
  items:
    - source_id: string      # Source identifier
      title: string          # Title
      summary: string        # 2-4 sentence summary
      key_points: string[]   # Max 3 key points
      url: string            # Original URL
      keywords: string[]     # Keywords
      quality_score: 1-5     # Quality score

constraints:
  filter: "Cutting-edge Tech/Deep Tech/Productivity/Practical Info"
  exclude: "General Science/Marketing Puff/Overly Academic/Job Posts"
  max_items_per_source: 10
  skip_on_error: true

return_format: JSON
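
For reference, the same schema expressed as typed structures (a sketch; the type names are mine, the fields come from the schema above):

```python
from typing import Literal, TypedDict

class NewsItem(TypedDict):
    source_id: str          # source identifier, e.g. "hn"
    title: str
    summary: str            # 2-4 sentences
    key_points: list[str]   # max 3
    url: str
    keywords: list[str]
    quality_score: int      # 1-5

class WorkerResult(TypedDict):
    status: Literal["success", "partial", "failed"]
    data: list[NewsItem]
    errors: list[str]
    metadata: dict
```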

Phase 4: Main Agent Monitoring & Feedback

Main Agent Responsibilities:

Monitoring:
- Check SubAgent return status (success/partial/failed)
- Count collected items
- Record success rate per source

Feedback Loop:
- If a SubAgent fails, decide whether to retry or skip
- If a source fails persistently, mark as disabled
- Dynamically adjust source selection for subsequent batches

Decision:
- Items >= 25 AND HighQuality >= 20 → Stop scraping
- Items < 15 → Continue to next batch
- All batches done but < 20 → Generate with available content (Quality over Quantity)
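
These rules collapse into one small decision function; a sketch reusing `high_quality_count` from the Phase 2 sketch:

```python
def decide(items: list[dict], batches_remaining: bool) -> str:
    """Main Agent stop/continue decision, mirroring the rules above."""
    if len(items) >= 25 and high_quality_count(items) >= 20:
        return "stop"          # enough material collected
    if len(items) < 15 and batches_remaining:
        return "continue"      # dispatch the next batch
    # The 15-24 item middle ground is not pinned down by the skill;
    # generating with available content is the reading taken here.
    return "generate"          # quality over quantity
```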

Phase 5: Evaluation & Filtering

Deduplication:
- Exact URL match
- Title similarity (>80% considered duplicate)
- Check cache.json to avoid history duplicates

Score Calibration:
- Unify scoring standards across SubAgents
- Adjust weights based on source credibility
- Bonus points for manually curated high-quality sources

Sorting:
- Descending order by quality_score
- Sort by source priority if scores are equal
- Take Top 20
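
A compact sketch of the dedupe-and-rank step. The skill does not name a similarity algorithm, so difflib's SequenceMatcher is an assumption here:

```python
from difflib import SequenceMatcher

def dedupe_and_rank(items: list[dict], seen_urls: set[str],
                    priority: dict[str, int]) -> list[dict]:
    """Dedupe then rank; seen_urls comes from cache.json,
    priority maps source_id -> rank (lower is better)."""
    kept: list[dict] = []
    for item in items:
        if item["url"] in seen_urls:
            continue                                  # exact URL match
        if any(SequenceMatcher(None, item["title"].lower(),
                               k["title"].lower()).ratio() > 0.8
               for k in kept):
            continue                                  # >80% title similarity
        kept.append(item)
    kept.sort(key=lambda x: (-x["quality_score"],      # score first
                             priority.get(x["source_id"], 99)))  # then source priority
    return kept[:20]                                  # Top 20
```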

Phase 6: Browser Scraping (MCP Chrome DevTools)

For pages requiring JS rendering, use a headless browser:

Process:
1. Call mcp__chrome-devtools__new_page to open page
2. Call mcp__chrome-devtools__wait_for to wait for content load
3. Call mcp__chrome-devtools__take_snapshot to get page structure
4. Parse snapshot to extract required content
5. Call mcp__chrome-devtools__close_page to close page
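
Expressed as a generic tool-call sequence (`call_tool` is a hypothetical dispatcher standing in for the agent's MCP invocations; the tool names are from the list above, but the parameter names and arity are assumptions):

```python
def scrape_with_browser(call_tool, url: str):
    """Headless-browser path for JS-rendered pages; a sketch."""
    call_tool("mcp__chrome-devtools__new_page", url=url)         # 1. open page
    call_tool("mcp__chrome-devtools__wait_for", text="...")      # 2. wait for content
    snapshot = call_tool("mcp__chrome-devtools__take_snapshot")  # 3. page structure
    # 4. parsing the snapshot to extract content is left to the worker
    call_tool("mcp__chrome-devtools__close_page")                # 5. close page
    return snapshot
```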

Applicable Scenarios:
- ProductHunt (403 on WebFetch)
- Latent Space (Substack JS rendering)
- Other SPA applications

Phase 7: Generate Report

Output:
- Directory: NewsReport/
- Filename: YYYY-MM-DD-news-report.md
- Format: Standard Markdown

Content Structure:
- Title + Date
- Statistical Summary (Source count, items collected)
- 20 High-Quality Items (Template based)
- Generation Info (Version, Timestamps)
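
A minimal writer sketch that fills the template shown in the Output Template section below (`ctx` is the dict from the Phase 1 sketch):

```python
def write_report(ctx: dict, items: list[dict]) -> None:
    """Render the Top 20 into Markdown; a sketch of Phase 7."""
    n_sources = len({i["source_id"] for i in items})
    lines = [
        f"# Daily News Report ({ctx['date']})",
        "",
        f"> Curated from {n_sources} sources today, containing {len(items)} high-quality items",
        "",
    ]
    for n, item in enumerate(items, 1):
        lines += [
            f"{n}. {item['title']}",
            f"   - Summary: {item['summary']}",
            f"   - Source: {item['url']}",
            f"   - Score: {'⭐' * item['quality_score']} ({item['quality_score']}/5)",
            "",
        ]
    mode = "a" if ctx["append"] else "w"   # append to a partial report from today
    with open(ctx["report_path"], mode, encoding="utf-8") as f:
        f.write("\n".join(lines))
```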

Phase 8: Update Cache

Update cache.json:
- last_run: Record this run info
- source_stats: Update stats per source
- url_cache: Add processed URLs
- content_hashes: Add content fingerprints
- article_history: Record included articles
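
A sketch of the cache update; the keys follow the list above, while hashing titles for content_hashes is an assumption the skill does not pin down:

```python
import hashlib
import json
import time
from pathlib import Path

def update_cache(cache: dict, items: list[dict], path: str = "cache.json") -> None:
    """Phase 8: persist run info and dedup fingerprints."""
    cache["last_run"] = {"timestamp": time.time(), "items": len(items)}
    cache.setdefault("url_cache", []).extend(i["url"] for i in items)
    cache.setdefault("content_hashes", []).extend(
        hashlib.sha256(i["title"].encode()).hexdigest() for i in items)
    cache.setdefault("article_history", []).extend(
        {"title": i["title"], "url": i["url"]} for i in items)
    Path(path).write_text(json.dumps(cache, ensure_ascii=False, indent=2))
```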

SubAgent Call Examples

Using general-purpose Agent

Since custom agents require a session restart before they are discovered, use the general-purpose agent and inject the worker prompt:

Task Call:

subagent_type: general-purpose
model: haiku
prompt: |
  You are a stateless execution unit. Only do the assigned task and return structured JSON.

  Task: Scrape the following URLs and extract content

  URLs:
  - https://news.ycombinator.com (Extract Top 10)
  - https://huggingface.co/papers (Extract top voted papers)

  Output Format:
  {
    "status": "success" | "partial" | "failed",
    "data": [
      {
        "source_id": "hn",
        "title": "...",
        "summary": "...",
        "key_points": ["...", "...", "..."],
        "url": "...",
        "keywords": ["...", "..."],
        "quality_score": 4
      }
    ],
    "errors": [],
    "metadata": { "processed": 2, "failed": 0 }
  }

  Filter Criteria:
  - Keep: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
  - Exclude: General Science/Marketing Puff/Overly Academic/Job Posts

  Return JSON directly, no explanation.

Using worker Agent (Requires session restart)

Task Call:

subagent_type: worker
prompt: |
  task: fetch_and_extract
  input:
    urls:
      - https://news.ycombinator.com
      - https://huggingface.co/papers
  output_schema:
    - source_id: string
      title: string
      summary: string
      key_points: string[]
      url: string
      keywords: string[]
      quality_score: 1-5
  constraints:
    filter: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
    exclude: General Science/Marketing Puff/Overly Academic

Output Template

# Daily News Report (YYYY-MM-DD)

> Curated from N sources today, containing 20 high-quality items
> Generation Time: X min | Version: v3.0
>
> Warning: Sub-agent 'worker' not detected. Running in generic mode (Serial Execution). Performance might be degraded. (Included only when running in degraded mode; see Compatibility & Fallback.)


1. Title

   - Summary: 2-4 line overview
   - Key Points:
     1. Point one
     2. Point two
     3. Point three
   - Source: Link
   - Keywords: keyword1 keyword2 keyword3
   - Score: ⭐⭐⭐⭐⭐ (5/5)

2. Title

   ...

Generated by Daily News Report v3.0
Sources: HN, HuggingFace, OneUsefulThing, ...

Constraints & Principles

- Quality over Quantity: Low-quality content does not enter the report.
- Early Stop: Stop scraping once 20 high-quality items are reached.
- Parallel First: SubAgents in the same batch execute in parallel.
- Fault Tolerance: Failure of a single source does not affect the whole process.
- Cache Reuse: Avoid re-scraping the same content.
- Main Agent Control: All decisions are made by the Main Agent.
- Fallback Awareness: Detect sub-agent availability; gracefully degrade if unavailable.
Expected Performance

| Scenario | Expected Time | Note |
| --- | --- | --- |
| Optimal | ~2 min | Tier1 sufficient, no browser needed |
| Normal | ~3-4 min | Requires Tier2 supplement |
| Browser Needed | ~5-6 min | Includes JS-rendered pages |

Error Handling

| Error Type | Handling |
| --- | --- |
| SubAgent Timeout | Log error, continue to next |
| Source 403/404 | Mark disabled, update sources.json |
| Extraction Failed | Return raw content, Main Agent decides |
| Browser Crash | Skip source, log entry |

Compatibility & Fallback

To ensure usability across different Agent environments, the following checks must be performed:

- Environment Check:
  - During Phase 1 initialization, attempt to detect whether the worker sub-agent exists.
  - If it does not exist (or the plugin is not installed), automatically switch to Serial Execution Mode.
- Serial Execution Mode (see the sketch after this list):
  - Do not use a parallel block.
  - The Main Agent executes the scraping task for each source sequentially.
  - Slower, but guarantees basic functionality.
- User Alert:
  - MUST include a clear warning in the generated report header indicating the current degraded mode.
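
A minimal sketch of the serial fallback, assuming a hypothetical fetch_one helper that wraps a WebFetch call for a single source:

```python
def collect_serial(fetch_one, sources: list[dict]) -> list[dict]:
    """Degraded path: no worker sub-agent available, so the Main Agent
    scrapes each source itself, one at a time."""
    items: list[dict] = []
    for src in sources:
        try:
            items.extend(fetch_one(src))   # fetch_one wraps a WebFetch call
        except Exception:
            continue                       # fault tolerance: skip failed source
    return items
```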

      daily-news-report - Agent Skills