Architecture
WebCanon is a retrieval pipeline + policy engine + evidence log. Given a URL, it builds a verified retrieval plan (robots → llms.txt → fetch → extract), executes it, and returns extracted, provenance-bearing content.
Scope: WebCanon’s purpose is correct, policy-aware scraping of a given URL. Web search engines are out of scope; finding candidate URLs is a separate concern. The pipeline diagram below shows the original vision (including a search adapter) for context; the supported entry point is retrieve_url. Scraping and AI reasoning are injectable (see customization.md).
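To make the ordering robots → llms.txt → fetch → extract concrete, here is a minimal sketch of a plan builder that gates everything on the robots decision. It is not WebCanon's code: `RetrievalPlan`, `build_plan`, and the `example-bot` user agent are hypothetical names, and the standard library's `urllib.robotparser` stands in for the RFC 9309 engine in webcanon.robots.

```python
from dataclasses import dataclass, field
from urllib.robotparser import RobotFileParser


@dataclass
class RetrievalPlan:
    """Hypothetical plan record: the URL, the robots verdict, and the ordered steps."""
    url: str
    allowed: bool
    steps: list[str] = field(default_factory=list)


def build_plan(url: str, robots_txt: str, user_agent: str = "example-bot") -> RetrievalPlan:
    """Evaluate robots rules first; only an allowed URL gets the later steps."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network I/O
    if not parser.can_fetch(user_agent, url):
        # Blocked by policy: the plan stops at the robots step.
        return RetrievalPlan(url=url, allowed=False, steps=["robots"])
    return RetrievalPlan(
        url=url,
        allowed=True,
        steps=["robots", "llms.txt", "fetch", "extract"],
    )


ROBOTS = """\
User-agent: *
Disallow: /private/
"""

public = build_plan("https://example.com/docs/intro", ROBOTS)
private = build_plan("https://example.com/private/report", ROBOTS)
print(public.allowed, public.steps)
print(private.allowed, private.steps)
```

Keeping the plan as plain data means the policy decision can be inspected and logged before any request for the page itself is made.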
Pipeline
                    ┌──────────────────────┐
                    │     User Request     │
                    │  URL or Search Query │
                    └──────────┬───────────┘
                               ▼
                    ┌──────────────────────┐
                    │     Input Router     │  url / search / ai
                    └──────────┬───────────┘
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
    ┌──────────────┐   ┌────────────────┐  ┌────────────────┐
    │ Search       │   │ Origin Manifest│  │ URL Normalizer │
    │ Adapter      │   │ Collector      │  │ Canonicalizer  │
    └──────┬───────┘   └───────┬────────┘  └───────┬────────┘
           │                   ▼                   │
           │           ┌────────────────┐          │
           │           │ robots.txt     │          │
           │           │ llms.txt       │          │
           │           │ sitemap.xml    │          │
           │           └───────┬────────┘          │
           └───────────────────┼───────────────────┘
                               ▼
                    ┌──────────────────────┐
                    │  Retrieval Planner   │  robots + llms + sitemap
                    └──────────┬───────────┘
                               ▼
                    ┌──────────────────────┐
                    │  Fetch Orchestrator  │  HTTP + SSRF guard
                    └──────────┬───────────┘
                               ▼
                    ┌──────────────────────┐
                    │      Extractor       │  HTML → Markdown
                    └──────────┬───────────┘
                               ▼
                    ┌──────────────────────┐
                    │   Evidence Bundle    │  markdown + provenance
                    └──────────────────────┘
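The Origin Manifest Collector step in the diagram works per origin, so it first has to reduce a page URL to its origin and then derive the well-known file locations. The sketch below shows one way to do that with `urllib.parse`. The names `origin_for` and `manifest_urls_for` are hypothetical and are not the signatures of the helpers in webcanon.urls. Placing sitemap.xml at the origin root is also an assumption, since sites often declare sitemap locations in robots.txt instead.

```python
from urllib.parse import urlsplit

# Well-known files shown in the pipeline diagram.
WELL_KNOWN = ("robots.txt", "llms.txt", "sitemap.xml")
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_for(url: str) -> str:
    """Return scheme://host[:port], lowercased, with default ports dropped."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    host = parts.hostname  # urlsplit lowercases the hostname
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    if parts.port is not None and parts.port != DEFAULT_PORTS[parts.scheme]:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}"


def manifest_urls_for(url: str) -> dict[str, str]:
    """Map each well-known file name to its assumed location on the URL's origin."""
    origin = origin_for(url)
    return {name: f"{origin}/{name}" for name in WELL_KNOWN}


print(origin_for("HTTPS://Example.COM:443/a/b?x=1"))
print(manifest_urls_for("http://example.com:8080/page"))
```

Two URLs that differ only in path, query, letter case of the host, or an explicit default port map to the same origin, so their manifests only need to be collected once.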
Module map (Python)
| Memo concept | Python module | Notes |
|---|---|---|
| URL Normalizer / Canonicalizer | webcanon.urls | normalize_url, origin_of, manifest_url |
| Origin Manifest Collector | webcanon.client (_load_robots, _load_llms) | fetches well-known files per origin |
| Robots Policy Engine | webcanon.robots | RFC 9309 parser + evaluator (pure, I/O-free) |
| LLMs Resolver | webcanon.llms | parse + ordered candidate resolution |
| Sitemap Resolver | webcanon.sitemap | urlset + sitemapindex parsing |
| Fetch Orchestrator | webcanon.fetch | manual redirects, per-hop SSRF re-check |
| SSRF guard | webcanon.ssrf | DNS-resolved IP range checks |
| Extractor | webcanon.extract | stdlib HTML → Markdown baseline |
| Evidence / RBOM | webcanon.types, webcanon.provenance | RetrievalResult, sha256 hashes |
| Client / Pipeline | webcanon.client | WebCanon.retrieve_url |
| CLI | webcanon.cli | fetch, inspect |
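The "DNS-resolved IP range checks" entry for the SSRF guard can be illustrated with the standard `ipaddress` and `socket` modules. This is a sketch under assumptions, not the contents of webcanon.ssrf: `is_public_ip` and `resolve_public_addresses` are hypothetical names, and the policy shown (reject anything that is not a globally routable unicast address) is one reasonable choice among several.

```python
import ipaddress
import socket


def is_public_ip(address: str) -> bool:
    """Return True only for globally routable unicast addresses."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # is_global excludes private, loopback, link-local, and reserved ranges;
    # multicast is rejected explicitly.
    return ip.is_global and not ip.is_multicast


def resolve_public_addresses(host: str, port: int) -> list[str]:
    """Resolve host and refuse it if ANY resolved address is non-public."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # sockaddr[0] is the address; strip any IPv6 scope id such as %eth0.
    addresses = sorted({info[4][0].split("%")[0] for info in infos})
    blocked = [a for a in addresses if not is_public_ip(a)]
    if blocked:
        raise PermissionError(f"{host} resolves to non-public address(es): {blocked}")
    return addresses


for candidate in ("8.8.8.8", "127.0.0.1", "10.0.0.5", "169.254.169.254", "::1"):
    print(candidate, is_public_ip(candidate))
```

With manual redirect handling, a check like this would be repeated for the target of every hop, which matches the "per-hop SSRF re-check" note in the table. Connecting to the exact address that was checked, rather than resolving the name a second time, is a common way to close the DNS rebinding gap.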
Design principle
Don’t pass search results to the AI. Build a verified retrieval plan from them, and pass only fetched, transformed, and provenance-bearing context.
Everything in the pipeline is designed to be measurable and replaceable. Extractors, search providers, and (later) headless renderers plug in behind narrow interfaces so the standard layer stays stable while implementations improve.
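As an illustration of a "narrow interface" combined with provenance-bearing output, the sketch below defines a one-method extractor protocol and a bundle function that records SHA-256 hashes of both the raw and the extracted content. All names here (`Extractor`, `PlainTextExtractor`, `Evidence`, `bundle`) are hypothetical and do not describe the actual types in webcanon.types or webcanon.provenance. The extractor emits plain text rather than Markdown to keep the example short.

```python
import hashlib
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Protocol


class Extractor(Protocol):
    """Narrow interface: anything with this method can be plugged in."""
    def extract(self, html: str) -> str: ...


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""
    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class PlainTextExtractor:
    """Baseline implementation using only the standard library."""
    def extract(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return "\n".join(collector.chunks)


@dataclass(frozen=True)
class Evidence:
    url: str
    content: str
    raw_sha256: str      # hash of the bytes as fetched
    content_sha256: str  # hash of what is handed downstream


def bundle(url: str, raw_html: str, extractor: Extractor) -> Evidence:
    """Run the injected extractor and attach hashes for later verification."""
    content = extractor.extract(raw_html)
    return Evidence(
        url=url,
        content=content,
        raw_sha256=hashlib.sha256(raw_html.encode("utf-8")).hexdigest(),
        content_sha256=hashlib.sha256(content.encode("utf-8")).hexdigest(),
    )


PAGE = "<html><body><h1>Title</h1><script>x()</script><p>Hello</p></body></html>"
evidence = bundle("https://example.com/", PAGE, PlainTextExtractor())
print(evidence.content)
```

Because `bundle` depends only on the protocol, swapping in a different extractor changes `content` and `content_sha256` while `raw_sha256` stays the same, which makes it easy to compare implementations against the same fetched input.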
Roadmap
| Version | Scope |
|---|---|
| v0.1 (this release) | URL retrieval quality baseline: normalize, robots, llms, sitemap, SSRF fetch, basic extraction, provenance, CLI |
| v0.2 | Full llms.txt resolution polish, manifest caching with TTL, malicious-llms.txt fixtures |
| v0.3 | Framework adapters (LangChain/LlamaIndex/MCP), async path. (Search adapters are out of scope for this module.) |
| v0.4 | Readability/Trafilatura extractors, Playwright renderer, table/code preservation, LLM-assisted repair, quality scoring |
| v1.0 | Conformance test suite, stable API, security review, Docker image, docs site, RAG/MCP integrations |