daily-news-report

Scrapes content from a preset URL list, filters for high-quality technical news, and generates a daily Markdown report.

name: daily-news-report
description: Scrapes content based on a preset URL list, filters high-quality technical information, and generates daily Markdown reports.
argument-hint: [optional: date]
disable-model-invocation: false
user-invocable: true
allowed-tools: Task, WebFetch, Read, Write, Bash(mkdir*), Bash(date*), Bash(ls*), mcp__chrome-devtools__*

Daily News Report v3.0

> Architecture Upgrade: Main Agent Orchestration + SubAgent Execution + Browser Scraping + Smart Caching

Core Architecture

┌─────────────────────────────────────────────────────────────────────┐
│ Main Agent (Orchestrator) │
│ Role: Scheduling, Monitoring, Evaluation, Decision, Aggregation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 1. Init │ → │ 2. Dispatch │ → │ 3. Monitor │ → │ 4. Evaluate │ │
│ │ Read Config │ │ Assign Tasks│ │ Collect Res │ │ Filter/Sort │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 5. Decision │ ← │ Enough 20? │ │ 6. Generate │ → │ 7. Update │ │
│ │ Cont/Stop │ │ Y/N │ │ Report File │ │ Cache Stats │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
↓ Dispatch ↑ Return Results
┌─────────────────────────────────────────────────────────────────────┐
│ SubAgent Execution Layer │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker A │ │ Worker B │ │ Browser │ │
│ │ (WebFetch) │ │ (WebFetch) │ │ (Headless) │ │
│ │ Tier1 Batch │ │ Tier2 Batch │ │ JS Render │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Structured Result Return │ │
│ │ { status, data: [...], errors: [...], metadata: {...} } │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘

Configuration Files

This skill uses the following configuration files:

| File | Purpose |
| --- | --- |
| sources.json | Source configuration, priorities, scrape methods |
| cache.json | Cached data, historical stats, deduplication fingerprints |

Execution Process Details

Phase 1: Initialization

Steps:
1. Determine date (user argument or current date)
2. Read sources.json for source configurations
3. Read cache.json for historical data
4. Create output directory NewsReport/
5. Check if a partial report exists for today (append mode)
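
A minimal initialization sketch in Python (the function name and return shape are illustrative; the file and directory names come from this skill):

```python
import json
from datetime import date
from pathlib import Path

def init_run(user_date: str | None = None) -> dict:
    """Phase 1: resolve the report date, load configs, prepare the output dir."""
    report_date = user_date or date.today().isoformat()   # user argument wins
    sources = json.loads(Path("sources.json").read_text())
    cache_file = Path("cache.json")
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    out_dir = Path("NewsReport")
    out_dir.mkdir(exist_ok=True)                          # Bash(mkdir*) equivalent
    report_path = out_dir / f"{report_date}-news-report.md"
    return {
        "date": report_date,
        "sources": sources,
        "cache": cache,
        "report_path": report_path,
        "append": report_path.exists(),  # partial report from earlier today
    }
```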

Phase 2: Dispatch SubAgents

Strategy: Parallel dispatch, batch execution, early stopping mechanism

Wave 1 (Parallel):
- Worker A: Tier1 Batch A (HN, HuggingFace Papers)
- Worker B: Tier1 Batch B (OneUsefulThing, Paul Graham)

Wait for results → Evaluate count

If < 15 high-quality items:
Wave 2 (Parallel):
- Worker C: Tier2 Batch A (James Clear, FS Blog)
- Worker D: Tier2 Batch B (HackerNoon, Scott Young)

If still < 20 items:
Wave 3 (Browser):
- Browser Worker: ProductHunt, Latent Space (require JS rendering)
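
In code terms, the wave loop looks roughly like the sketch below. `dispatch_wave` stands in for a batch of parallel Task calls, the wave labels are illustrative, and the quality cutoff of 4 is an assumption since the skill does not pin one down:

```python
WAVES = [
    ["Tier1 Batch A", "Tier1 Batch B"],  # Wave 1: HN/HF Papers, OneUsefulThing/PG
    ["Tier2 Batch A", "Tier2 Batch B"],  # Wave 2: James Clear/FS, HackerNoon/Scott Young
    ["Browser"],                         # Wave 3: ProductHunt, Latent Space
]

def high_quality_count(items: list[dict]) -> int:
    # "high quality" read as quality_score >= 4 -- an assumption
    return sum(1 for x in items if x.get("quality_score", 0) >= 4)

def collect(dispatch_wave) -> list[dict]:
    """Run the waves in order, stopping early once the thresholds are met."""
    items = list(dispatch_wave(WAVES[0]))   # Wave 1 (workers run in parallel)
    if high_quality_count(items) < 15:
        items += dispatch_wave(WAVES[1])    # Wave 2 only if Wave 1 fell short
    if len(items) < 20:
        items += dispatch_wave(WAVES[2])    # Wave 3 only as a last resort
    return items
```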

Phase 3: SubAgent Task Format

Task format received by each SubAgent:

task: fetch_and_extract
sources:
  - id: hn
    url: https://news.ycombinator.com
    extract: top_10
  - id: hf_papers
    url: https://huggingface.co/papers
    extract: top_voted

output_schema:
  items:
    - source_id: string      # Source identifier
      title: string          # Title
      summary: string        # 2-4 sentence summary
      key_points: string[]   # Max 3 key points
      url: string            # Original URL
      keywords: string[]     # Keywords
      quality_score: 1-5     # Quality score

constraints:
  filter: "Cutting-edge Tech/Deep Tech/Productivity/Practical Info"
  exclude: "General Science/Marketing Puff/Overly Academic/Job Posts"
  max_items_per_source: 10
  skip_on_error: true

return_format: JSON
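
For reference, the same schema expressed as typed structures (a sketch; the type names are mine, the fields come from the schema above):

```python
from typing import Literal, TypedDict

class NewsItem(TypedDict):
    source_id: str          # source identifier, e.g. "hn"
    title: str
    summary: str            # 2-4 sentences
    key_points: list[str]   # max 3
    url: str
    keywords: list[str]
    quality_score: int      # 1-5

class WorkerResult(TypedDict):
    status: Literal["success", "partial", "failed"]
    data: list[NewsItem]
    errors: list[str]
    metadata: dict
```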

Phase 4: Main Agent Monitoring & Feedback

Main Agent Responsibilities:

Monitoring:
- Check SubAgent return status (success/partial/failed)
- Count collected items
- Record success rate per source

Feedback Loop:
- If a SubAgent fails, decide whether to retry or skip
- If a source fails persistently, mark as disabled
- Dynamically adjust source selection for subsequent batches

Decision:
- Items >= 25 AND HighQuality >= 20 → Stop scraping
- Items < 15 → Continue to next batch
- All batches done but < 20 → Generate with available content (Quality over Quantity)
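
These rules collapse into one small decision function; a sketch reusing `high_quality_count` from the Phase 2 sketch:

```python
def decide(items: list[dict], batches_remaining: bool) -> str:
    """Main Agent stop/continue decision, mirroring the rules above."""
    if len(items) >= 25 and high_quality_count(items) >= 20:
        return "stop"          # enough material collected
    if len(items) < 15 and batches_remaining:
        return "continue"      # dispatch the next batch
    # The 15-24 item middle ground is not pinned down by the skill;
    # generating with available content is the reading taken here.
    return "generate"          # quality over quantity
```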

Phase 5: Evaluation & Filtering

Deduplication:
- Exact URL match
- Title similarity (>80% considered duplicate)
- Check cache.json to avoid history duplicates

Score Calibration:
- Unify scoring standards across SubAgents
- Adjust weights based on source credibility
- Bonus points for manually curated high-quality sources

Sorting:
- Descending order by quality_score
- Sort by source priority if scores are equal
- Take Top 20
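
A compact sketch of the dedupe-and-rank step. The skill does not name a similarity algorithm, so difflib's SequenceMatcher is an assumption here:

```python
from difflib import SequenceMatcher

def dedupe_and_rank(items: list[dict], seen_urls: set[str],
                    priority: dict[str, int]) -> list[dict]:
    """Dedupe then rank; seen_urls comes from cache.json,
    priority maps source_id -> rank (lower is better)."""
    kept: list[dict] = []
    for item in items:
        if item["url"] in seen_urls:
            continue                                  # exact URL match
        if any(SequenceMatcher(None, item["title"].lower(),
                               k["title"].lower()).ratio() > 0.8
               for k in kept):
            continue                                  # >80% title similarity
        kept.append(item)
    kept.sort(key=lambda x: (-x["quality_score"],      # score first
                             priority.get(x["source_id"], 99)))  # then source priority
    return kept[:20]                                  # Top 20
```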

Phase 6: Browser Scraping (MCP Chrome DevTools)

For pages requiring JS rendering, use a headless browser:

Process:
1. Call mcp__chrome-devtools__new_page to open page
2. Call mcp__chrome-devtools__wait_for to wait for content load
3. Call mcp__chrome-devtools__take_snapshot to get page structure
4. Parse snapshot to extract required content
5. Call mcp__chrome-devtools__close_page to close page
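
Expressed as a generic tool-call sequence (`call_tool` is a hypothetical dispatcher standing in for the agent's MCP invocations; the tool names are from the list above, but the parameter names and arity are assumptions):

```python
def scrape_with_browser(call_tool, url: str):
    """Headless-browser path for JS-rendered pages; a sketch."""
    call_tool("mcp__chrome-devtools__new_page", url=url)         # 1. open page
    call_tool("mcp__chrome-devtools__wait_for", text="...")      # 2. wait for content
    snapshot = call_tool("mcp__chrome-devtools__take_snapshot")  # 3. page structure
    # 4. parsing the snapshot to extract content is left to the worker
    call_tool("mcp__chrome-devtools__close_page")                # 5. close page
    return snapshot
```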

Applicable Scenarios:
- ProductHunt (403 on WebFetch)
- Latent Space (Substack JS rendering)
- Other SPA applications

Phase 7: Generate Report

Output:
- Directory: NewsReport/
- Filename: YYYY-MM-DD-news-report.md
- Format: Standard Markdown

Content Structure:
- Title + Date
- Statistical Summary (Source count, items collected)
- 20 High-Quality Items (Template based)
- Generation Info (Version, Timestamps)
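
A minimal writer sketch that fills the template shown in the Output Template section below (`ctx` is the dict from the Phase 1 sketch):

```python
def write_report(ctx: dict, items: list[dict]) -> None:
    """Render the Top 20 into Markdown; a sketch of Phase 7."""
    n_sources = len({i["source_id"] for i in items})
    lines = [
        f"# Daily News Report ({ctx['date']})",
        "",
        f"> Curated from {n_sources} sources today, containing {len(items)} high-quality items",
        "",
    ]
    for n, item in enumerate(items, 1):
        lines += [
            f"{n}. {item['title']}",
            f"   - Summary: {item['summary']}",
            f"   - Source: {item['url']}",
            f"   - Score: {'⭐' * item['quality_score']} ({item['quality_score']}/5)",
            "",
        ]
    mode = "a" if ctx["append"] else "w"   # append to a partial report from today
    with open(ctx["report_path"], mode, encoding="utf-8") as f:
        f.write("\n".join(lines))
```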

Phase 8: Update Cache

Update cache.json:
- last_run: Record this run info
- source_stats: Update stats per source
- url_cache: Add processed URLs
- content_hashes: Add content fingerprints
- article_history: Record included articles
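
A sketch of the cache update; the keys follow the list above, while hashing titles for content_hashes is an assumption the skill does not pin down:

```python
import hashlib
import json
import time
from pathlib import Path

def update_cache(cache: dict, items: list[dict], path: str = "cache.json") -> None:
    """Phase 8: persist run info and dedup fingerprints."""
    cache["last_run"] = {"timestamp": time.time(), "items": len(items)}
    cache.setdefault("url_cache", []).extend(i["url"] for i in items)
    cache.setdefault("content_hashes", []).extend(
        hashlib.sha256(i["title"].encode()).hexdigest() for i in items)
    cache.setdefault("article_history", []).extend(
        {"title": i["title"], "url": i["url"]} for i in items)
    Path(path).write_text(json.dumps(cache, ensure_ascii=False, indent=2))
```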

SubAgent Call Examples

Using general-purpose Agent

Since custom agents require a session restart before they are discovered, use the general-purpose agent and inject the worker prompt:

Task Call:

subagent_type: general-purpose
model: haiku
prompt: |
  You are a stateless execution unit. Only do the assigned task and return structured JSON.

  Task: Scrape the following URLs and extract content

  URLs:
  - https://news.ycombinator.com (Extract Top 10)
  - https://huggingface.co/papers (Extract top voted papers)

  Output Format:
  {
    "status": "success" | "partial" | "failed",
    "data": [
      {
        "source_id": "hn",
        "title": "...",
        "summary": "...",
        "key_points": ["...", "...", "..."],
        "url": "...",
        "keywords": ["...", "..."],
        "quality_score": 4
      }
    ],
    "errors": [],
    "metadata": { "processed": 2, "failed": 0 }
  }

  Filter Criteria:
  - Keep: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
  - Exclude: General Science/Marketing Puff/Overly Academic/Job Posts

  Return JSON directly, no explanation.

Using worker Agent (Requires session restart)

Task Call:

subagent_type: worker
prompt: |
  task: fetch_and_extract
  input:
    urls:
      - https://news.ycombinator.com
      - https://huggingface.co/papers
  output_schema:
    - source_id: string
      title: string
      summary: string
      key_points: string[]
      url: string
      keywords: string[]
      quality_score: 1-5
  constraints:
    filter: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
    exclude: General Science/Marketing Puff/Overly Academic

Output Template

# Daily News Report (YYYY-MM-DD)

> Curated from N sources today, containing 20 high-quality items
> Generation Time: X min | Version: v3.0
>
> Warning: Sub-agent 'worker' not detected. Running in generic mode (Serial Execution). Performance might be degraded. (Included only when running in degraded mode; see Compatibility & Fallback.)


1. Title

   - Summary: 2-4 line overview
   - Key Points:
     1. Point one
     2. Point two
     3. Point three
   - Source: Link
   - Keywords: keyword1 keyword2 keyword3
   - Score: ⭐⭐⭐⭐⭐ (5/5)

2. Title

   ...

Generated by Daily News Report v3.0
Sources: HN, HuggingFace, OneUsefulThing, ...

Constraints & Principles

- Quality over Quantity: Low-quality content does not enter the report.
- Early Stop: Stop scraping once 20 high-quality items are reached.
- Parallel First: SubAgents in the same batch execute in parallel.
- Fault Tolerance: Failure of a single source does not affect the whole process.
- Cache Reuse: Avoid re-scraping the same content.
- Main Agent Control: All decisions are made by the Main Agent.
- Fallback Awareness: Detect sub-agent availability; gracefully degrade if unavailable.
Expected Performance

| Scenario | Expected Time | Note |
| --- | --- | --- |
| Optimal | ~2 min | Tier1 sufficient, no browser needed |
| Normal | ~3-4 min | Requires Tier2 supplement |
| Browser Needed | ~5-6 min | Includes JS-rendered pages |

Error Handling

| Error Type | Handling |
| --- | --- |
| SubAgent Timeout | Log error, continue to next |
| Source 403/404 | Mark disabled, update sources.json |
| Extraction Failed | Return raw content, Main Agent decides |
| Browser Crash | Skip source, log entry |

Compatibility & Fallback

To ensure usability across different Agent environments, the following checks must be performed:

- Environment Check:
  - During Phase 1 initialization, attempt to detect whether the worker sub-agent exists.
  - If it does not exist (or the plugin is not installed), automatically switch to Serial Execution Mode.
- Serial Execution Mode (see the sketch after this list):
  - Do not use a parallel block.
  - The Main Agent executes the scraping task for each source sequentially.
  - Slower, but guarantees basic functionality.
- User Alert:
  - MUST include a clear warning in the generated report header indicating the current degraded mode.
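
A minimal sketch of the serial fallback, assuming a hypothetical fetch_one helper that wraps a WebFetch call for a single source:

```python
def collect_serial(fetch_one, sources: list[dict]) -> list[dict]:
    """Degraded path: no worker sub-agent available, so the Main Agent
    scrapes each source itself, one at a time."""
    items: list[dict] = []
    for src in sources:
        try:
            items.extend(fetch_one(src))   # fetch_one wraps a WebFetch call
        except Exception:
            continue                       # fault tolerance: skip failed source
    return items
```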

      daily-news-report - Agent Skills