WebCanon documentation

Policy-aware web retrieval for AI — robots.txt, llms.txt, sitemap.xml, fetching, extraction, and provenance in one layer.

WebCanon turns a URL into trustworthy, policy-checked, citation-ready context for LLMs. It evaluates robots.txt (RFC 9309), resolves LLM-friendly alternatives via llms.txt (optionally with your own AI), fetches behind an SSRF guard, converts HTML into structured Markdown, and returns full provenance for every retrieved document.

Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope. Scraping and AI reasoning are injectable. For an overview in Japanese, see README.ja.md.

Install

pip install webcanon

Quick start

from webcanon import WebCanon

client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

print(result.document.markdown)        # extracted Markdown
print(result.policy.robots.verdict)    # e.g. "allowed_implicit"
print(result.provenance.source_hash)   # sha256 of the source body
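
The source_hash field above is described as a SHA-256 of the source body. As a rough illustration of what such a digest allows, the sketch below computes one with Python's standard library and uses it to detect a changed body. The helper name source_hash and the choice to hash the raw response bytes are assumptions for illustration, not WebCanon's confirmed implementation.

```python
import hashlib


def source_hash(body: bytes) -> str:
    """Return the hex SHA-256 digest of a raw response body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical response body standing in for a fetched page.
body = b"<html><body><h1>API</h1></body></html>"
digest = source_hash(body)

# Identical bytes always give the same digest; any change to the body gives a
# different one, so a stored hash can confirm a citation still matches its source.
print(source_hash(body) == digest)
print(source_hash(body + b" ") == digest)
```

Hashing the bytes as received, rather than the extracted Markdown, keeps the digest tied to the original source instead of to the output of a particular extractor version.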

Documentation

Document                 Contents
Architecture             Pipeline overview, module map, roadmap
Policy model             How robots / llms / sitemap authorities differ
robots.txt compliance    RFC 9309 parsing & matching rules
llms.txt resolution      llms.txt parsing & URL resolution
Extraction quality       HTML→Markdown extraction & quality scoring
Customization            Injectable fetcher / extractor / AI resolver hooks
AI models                Supported model strings per provider (Anthropic / OpenAI / Gemini)
Security                 SSRF guard, prompt-injection firewall, provenance
AI framework affinity    Fit with LangChain/LlamaIndex/MCP
Branching & commits      Branch model, Conventional Commits, versioning
Publishing               Step-by-step PyPI release procedure

The retrieval constitution

  1. Search results are leads, not sources.
  2. robots.txt is evaluated before fetch.
  3. llms.txt can guide retrieval, not override policy.
  4. Every transformed document must retain provenance.
  5. Web content is untrusted input.
  6. Markdown is an interface, not the source of truth.
  7. Extraction quality must be measurable.
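
Rule 2 (robots.txt is evaluated before fetch) can be sketched as a gate in front of the fetch step. The example below uses Python's standard urllib.robotparser as a stand-in; it is not WebCanon's parser and may not follow every RFC 9309 matching rule. The robots.txt body, the example-bot user agent, and the may_fetch and retrieve helpers are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice it is downloaded from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Decide from an already-downloaded robots.txt whether url may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def retrieve(url: str, user_agent: str = "example-bot") -> str:
    """Check policy first; only an allowed URL proceeds to the fetch step."""
    if not may_fetch(ROBOTS_TXT, user_agent, url):
        return "blocked"
    return "fetch"  # placeholder: a real pipeline would fetch and extract here


print(retrieve("https://example.com/docs/api"))        # prints "fetch"
print(retrieve("https://example.com/private/report"))  # prints "blocked"
```

Putting the policy check ahead of any network request for the page means a disallowed URL is never downloaded at all, rather than downloaded and then discarded.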