Architecture

WebCanon is a retrieval pipeline + policy engine + evidence log. Given a URL, it builds a verified retrieval plan (robots → llms.txt → fetch → extract), executes it, and returns extracted, provenance-bearing content.

Scope: WebCanon’s purpose is correct, policy-aware scraping of a given URL. Web search engines are out of scope — finding candidate URLs is a separate concern. The pipeline diagram below shows the original vision (including a search adapter) for context; the supported entry point is retrieve_url. Scraping and AI reasoning are injectable (see customization.md).

Pipeline

                ┌──────────────────────┐
                │  User Request         │
                │  URL or Search Query  │
                └──────────┬───────────┘
                           ▼
                ┌──────────────────────┐
                │ Input Router          │  url / search / ai
                └──────────┬───────────┘
       ┌───────────────────┼───────────────────┐
       ▼                   ▼                   ▼
┌──────────────┐   ┌────────────────┐   ┌────────────────┐
│ Search       │   │ Origin Manifest│   │ URL Normalizer  │
│ Adapter      │   │ Collector      │   │ Canonicalizer   │
└──────┬───────┘   └───────┬────────┘   └───────┬────────┘
       │                   ▼                    │
       │          ┌────────────────┐            │
       │          │ robots.txt     │            │
       │          │ llms.txt       │            │
       │          │ sitemap.xml    │            │
       │          └───────┬────────┘            │
       └──────────────────┼─────────────────────┘
                          ▼
                ┌──────────────────────┐
                │ Retrieval Planner     │  robots + llms + sitemap
                └──────────┬───────────┘
                           ▼
                ┌──────────────────────┐
                │ Fetch Orchestrator    │  HTTP + SSRF guard
                └──────────┬───────────┘
                           ▼
                ┌──────────────────────┐
                │ Extractor             │  HTML → Markdown
                └──────────┬───────────┘
                           ▼
                ┌──────────────────────┐
                │ Evidence Bundle       │  markdown + provenance
                └──────────────────────┘

Module map (Python)

Memo concept	Python module	Notes
URL Normalizer / Canonicalizer	`webcanon.urls`	`normalize_url`, `origin_of`, `manifest_url`
Origin Manifest Collector	`webcanon.client` (`_load_robots`, `_load_llms`)	fetches well-known files per origin
Robots Policy Engine	`webcanon.robots`	RFC 9309 parser + evaluator (pure, I/O-free)
LLMs Resolver	`webcanon.llms`	parse + ordered candidate resolution
Sitemap Resolver	`webcanon.sitemap`	urlset + sitemapindex parsing
Fetch Orchestrator	`webcanon.fetch`	manual redirects, per-hop SSRF re-check
SSRF guard	`webcanon.ssrf`	DNS-resolved IP range checks
Extractor	`webcanon.extract`	stdlib HTML → Markdown baseline
Evidence / RBOM	`webcanon.types`, `webcanon.provenance`	`RetrievalResult`, sha256 hashes
Client / Pipeline	`webcanon.client`	`WebCanon.retrieve_url`
CLI	`webcanon.cli`	`fetch`, `inspect`

Design principle

Don’t pass search results to the AI. Build a verified retrieval plan from them, and pass only fetched, transformed, and provenance-bearing context.

Everything in the pipeline is designed to be measurable and replaceable. Extractors, search providers, and (later) headless renderers plug in behind narrow interfaces so the standard layer stays stable while implementations improve.

Roadmap

Version	Scope
v0.1 (this release)	URL retrieval quality baseline: normalize, robots, llms, sitemap, SSRF fetch, basic extraction, provenance, CLI
v0.2	Full `llms.txt` resolution polish, manifest caching with TTL, malicious-`llms.txt` fixtures
v0.3	Framework adapters (LangChain/LlamaIndex/MCP), async path. (Search adapters are out of scope for this module.)
v0.4	Readability/Trafilatura extractors, Playwright renderer, table/code preservation, LLM-assisted repair, quality scoring
v1.0	Conformance test suite, stable API, security review, Docker image, docs site, RAG/MCP integrations