WebCanon documentation
Policy-aware web retrieval for AI — robots.txt, llms.txt, sitemap.xml,
fetching, extraction, and provenance in one layer.
WebCanon turns a URL into trustworthy, policy-checked, citation-ready
context for LLMs. It evaluates robots.txt (RFC 9309), resolves LLM-friendly
alternatives via llms.txt (optionally with your own AI), fetches behind an
SSRF guard, converts HTML into structured Markdown, and returns full
provenance for every retrieved document.
Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope. The scraping and AI reasoning components are injectable. For a Japanese overview, see README.ja.md.
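To make "fetches behind an SSRF guard" more concrete, here is a minimal sketch of the kind of pre-fetch check such a guard might perform, using only the Python standard library. The names `is_public_address` and `is_safe_url` are hypothetical and are not part of WebCanon's API; the actual guard may work differently.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip: str) -> bool:
    """Return True only for globally routable, non-multicast addresses."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    addr = ipaddress.ip_address(ip.split("%", 1)[0])
    return addr.is_global and not addr.is_multicast


def is_safe_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Hypothetical pre-fetch check (an assumption, not WebCanon's API).

    Accepts only http(s) URLs whose host resolves exclusively to public IPs.
    `resolve` is injectable so the check can be exercised without DNS.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except (OSError, ValueError):
        return False  # unresolvable host or malformed port: reject
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)
```

A check like this only validates the addresses seen at resolution time. A production guard would typically also connect to the validated IP directly and re-check every redirect hop, since DNS answers can change between the check and the fetch.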
Install
pip install webcanon
Quick start
from webcanon import WebCanon
client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
print(result.document.markdown) # extracted Markdown
print(result.policy.robots.verdict) # e.g. "allowed_implicit"
print(result.provenance.source_hash) # sha256 of the source body
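The `source_hash` above is described as a SHA-256 of the source body. As a rough illustration of how such a hash supports provenance, the sketch below builds a small record and later checks whether a body still matches it. `ProvenanceRecord`, `make_provenance`, and `matches` are hypothetical names for this example; WebCanon's real provenance object and its exact hash encoding (for instance, plain hex versus a prefixed form) are not shown here and may differ.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record shape; not WebCanon's actual provenance class."""
    url: str
    fetched_at: str   # ISO 8601 timestamp in UTC
    source_hash: str  # hex-encoded SHA-256 of the raw response body


def make_provenance(url: str, body: bytes) -> ProvenanceRecord:
    """Hash the untransformed bytes so later extraction steps cannot alter the digest."""
    return ProvenanceRecord(
        url=url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        source_hash=hashlib.sha256(body).hexdigest(),
    )


def matches(record: ProvenanceRecord, body: bytes) -> bool:
    """Return True if `body` is byte-identical to what the record was built from."""
    return hashlib.sha256(body).hexdigest() == record.source_hash
```

Hashing the raw bytes rather than the extracted Markdown keeps the record tied to what the server actually returned, which is what a citation ultimately needs to point back to.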
Documentation
| Document | Contents |
|---|---|
| Architecture | Pipeline overview, module map, roadmap |
| Policy model | How robots / llms / sitemap authorities differ |
| robots.txt compliance | RFC 9309 parsing & matching rules |
| llms.txt resolution | llms.txt parsing & URL resolution |
| Extraction quality | HTML→Markdown extraction & quality scoring |
| Customization | Injectable fetcher / extractor / AI resolver hooks |
| AI models | Supported model strings per provider (Anthropic / OpenAI / Gemini) |
| Security | SSRF guard, prompt-injection firewall, provenance |
| AI framework affinity | Fit with LangChain/LlamaIndex/MCP |
| Branching & commits | Branch model, Conventional Commits, versioning |
| Publishing | Step-by-step PyPI release procedure |
The retrieval constitution
- Search results are leads, not sources.
- robots.txt is evaluated before fetch.
- llms.txt can guide retrieval, not override policy.
- Every transformed document must retain provenance.
- Web content is untrusted input.
- Markdown is an interface, not the source of truth.
- Extraction quality must be measurable.
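The "robots.txt is evaluated before fetch" principle can be sketched with the standard library's `urllib.robotparser`. This is an illustration only: the stdlib parser predates RFC 9309 and its rule matching can differ from the RFC in some cases, so it is not a stand-in for WebCanon's own evaluator. `is_allowed`, `fetch_if_allowed`, and the `ExampleBot` user agent are hypothetical names chosen for the example.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Evaluate an already-downloaded robots.txt body against a URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_if_allowed(
    url: str,
    robots_body: str,
    fetch: Callable[[str], bytes],
    user_agent: str = "ExampleBot",  # hypothetical user agent
) -> Optional[bytes]:
    """Call `fetch` only after the policy check passes; otherwise return None."""
    if not is_allowed(robots_body, user_agent, url):
        return None  # policy says no: the network request is never made
    return fetch(url)


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

calls = []  # records which URLs the stub fetcher was asked for


def stub_fetch(url: str) -> bytes:
    calls.append(url)
    return b"<html>ok</html>"


allowed = fetch_if_allowed("https://example.com/docs/api", ROBOTS_TXT, stub_fetch)
blocked = fetch_if_allowed("https://example.com/private/data", ROBOTS_TXT, stub_fetch)
```

Here `blocked` is `None` and `calls` contains only the allowed URL, which shows the ordering the constitution asks for: the policy decision gates the fetch instead of being checked afterwards.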