Customization hooks
WebCanon’s pipeline runs through three replaceable callables. All default to the
built-in implementations; override any of them on RetrievalConfig.
| Hook | Type | Replaces | Default |
|---|---|---|---|
| fetcher | Fetcher | the scraping transport | webcanon.fetch.fetch (SSRF-guarded httpx) |
| extractor | Extractor | HTML → Markdown conversion | webcanon.extract.extract_html |
| ai_resolver | AiResolver | AI reasoning over llms.txt + URL | none (rule-based resolve_candidates) |
Default identity
The default User-Agent product token is WebCanon (WebCanon/<version>),
configurable via UserAgentConfig.
The two flows
URL only (no AI)
- Fetch robots.txt.
- Evaluate the target URL against it (does a Disallow rule for User-agent: * or our token apply?).
- Return the scraped content and the robots recommendation together.
- Return the rule-based HTML → Markdown conversion together with them.
result = WebCanon().retrieve_url("https://example.com/page")
result.document.markdown # rule-based Markdown
result.policy.robots.recommendation # "recommended" / "not_recommended" / ...
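The robots evaluation step in this flow can be illustrated with the standard library alone. This is a minimal sketch, not WebCanon's own implementation: `robots_allows` and the sample `ROBOTS` text are hypothetical names introduced here, and the real pipeline returns a richer recommendation than a boolean.

```python
from urllib.robotparser import RobotFileParser

def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network access
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt with a wildcard group and a group for our token.
ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: WebCanon
Disallow: /docs/internal/
"""

print(robots_allows(ROBOTS, "WebCanon", "https://example.com/page"))
print(robots_allows(ROBOTS, "WebCanon", "https://example.com/docs/internal/x"))
```

A group that names the crawler's own token takes precedence over the wildcard group, which is why the check is made with the product token rather than with `*`.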
URL + AI enabled (ai_reasoning=True)
- Fetch robots.txt.
- Evaluate the target URL against it.
- Hand the URL + parsed llms.txt + robots recommendation to the ai_resolver. The AI decides how to scrape, e.g. read a different URL, or send a specific request header (some docs expose a Markdown variant via an Accept header or a .md URL).
- Return the content fetched per the AI’s decision, the llms.txt-derived decision, and the robots recommendation.
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
result.policy.llms.resolved_by # "ai" or "rule_based"
result.policy.llms.applied_headers # headers the AI asked us to send
result.selected_source.final_url # the URL actually fetched
If no ai_resolver is configured, WebCanon falls back to the built-in
rule-based resolver (exact llms.txt match → .md variant → original URL).
Built-in AI resolvers (Anthropic / OpenAI / Gemini, via env vars)
WebCanon ships ai_resolver implementations for three providers. Enable one
from the environment so the CLI and the library share a single switch:
| Variable | Meaning |
|---|---|
| WEBCANON_AI_PROVIDER | anthropic / openai / gemini to enable; unset / none to disable |
| WEBCANON_AI_MODEL | model id (per-provider default if unset) |
| provider API key | ANTHROPIC_API_KEY / OPENAI_API_KEY / GEMINI_API_KEY (or GOOGLE_API_KEY) |
| Provider | WEBCANON_AI_PROVIDER | Extra | Default model | Resolver class |
|---|---|---|---|---|
| Anthropic (Claude) | anthropic | pip install "webcanon[ai]" | claude-opus-4-8 | AnthropicAiResolver |
| OpenAI | openai | pip install "webcanon[openai]" | gpt-5 | OpenAiAiResolver |
| Google Gemini | gemini | pip install "webcanon[gemini]" | gemini-2.5-pro | GeminiAiResolver |
CLI — pick the provider/model via env vars or flags (flags win; --ai-provider
implies --ai):
# Flags
export OPENAI_API_KEY=sk-...
webcanon fetch https://example.com/docs/api --ai-provider openai --ai-model gpt-4o
# Or env vars
export WEBCANON_AI_PROVIDER=openai # or anthropic / gemini
export OPENAI_API_KEY=sk-...
# optional: export WEBCANON_AI_MODEL=gpt-4o
webcanon fetch https://example.com/docs/api --ai
| | Environment variable | CLI flag |
|---|---|---|
| Provider | WEBCANON_AI_PROVIDER | --ai-provider {anthropic,openai,gemini} |
| Model | WEBCANON_AI_MODEL | --ai-model MODEL |
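The precedence described for the CLI (flags win over environment variables, and --ai-provider implies --ai) can be sketched as a small pure function. This is an illustration under assumptions, not the CLI's actual code: `resolve_ai_settings` is a hypothetical helper, and treating "AI requested but no provider configured" as disabled is a guess made for the sketch.

```python
from typing import Mapping, Optional, Tuple

def resolve_ai_settings(
    env: Mapping[str, str],
    flag_ai: bool = False,
    flag_provider: Optional[str] = None,
    flag_model: Optional[str] = None,
) -> Tuple[bool, Optional[str], Optional[str]]:
    """Return (ai_enabled, provider, model) with flags taking priority over env vars."""
    raw = env.get("WEBCANON_AI_PROVIDER", "").strip().lower()
    env_provider = None if raw in ("", "none") else raw  # unset / none disables
    provider = flag_provider or env_provider             # flag wins
    model = flag_model or env.get("WEBCANON_AI_MODEL") or None
    wants_ai = flag_ai or flag_provider is not None       # --ai-provider implies --ai
    # Assumption: without any provider there is nothing to enable.
    return (wants_ai and provider is not None), provider, model

# Flag overrides the environment.
print(resolve_ai_settings({"WEBCANON_AI_PROVIDER": "anthropic"}, flag_provider="openai"))
```

Passing the environment as a mapping (for example `os.environ`) keeps the function easy to test without mutating process state.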
Library — ai_resolver_from_env() returns the configured resolver or None:
from webcanon import WebCanon
from webcanon.ai import ai_resolver_from_env
from webcanon.config import RetrievalConfig
client = WebCanon(RetrievalConfig(ai_resolver=ai_resolver_from_env()))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
print(result.policy.llms.resolved_by) # "ai" when the model rerouted
The model is handed the URL + parsed llms.txt + robots verdict and returns a
URL read-through plus safe content-negotiation headers. Its choice is never
trusted: the URL is re-evaluated against robots.txt and the SSRF guard, and
only allowlisted headers are sent (see security.md). If the selected
provider’s SDK package (for example anthropic) isn’t installed or the API call
fails, the resolver declines and WebCanon falls back to the rule engine.
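The decline-and-fall-back behaviour can be pictured as a thin wrapper around two callables. The sketch below is a generic pattern and not WebCanon's internals: `resolve_with_fallback` and both resolver signatures are hypothetical, and only the "ai" / "rule_based" labels come from the `resolved_by` values shown earlier.

```python
from typing import Callable, Optional, Tuple

AiFn = Callable[[str], Optional[str]]   # returns a URL, or None for "no opinion"
RuleFn = Callable[[str], str]           # always returns a URL

def resolve_with_fallback(
    url: str, ai_resolver: Optional[AiFn], rule_based: RuleFn
) -> Tuple[str, str]:
    """Return (chosen_url, resolved_by), preferring the AI hint when one is given."""
    if ai_resolver is not None:
        try:
            hint = ai_resolver(url)
        except Exception:
            hint = None  # missing SDK or API error: treat as a decline
        if hint is not None:
            return hint, "ai"
    return rule_based(url), "rule_based"

def failing_ai(url: str) -> Optional[str]:
    raise RuntimeError("provider unavailable")

print(resolve_with_fallback("https://example.com/docs", failing_ai, lambda u: u))
```

Catching the failure inside the wrapper means a provider outage degrades retrieval to rule-based resolution instead of failing the request.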
Writing a custom ai_resolver
from typing import Optional
from webcanon import AiContext, AiHint

def my_ai(ctx: AiContext) -> Optional[AiHint]:  # Optional keeps it 3.9-compatible
    # ctx.requested_url, ctx.origin
    # ctx.llms_manifest -> parsed llms.txt (title/summary/links) or None
    # ctx.llms_url -> the llms.txt URL (or None)
    # ctx.robots_recommendation / ctx.robots_verdict
    if ctx.llms_manifest:
        for link in ctx.llms_manifest.links:
            if link.url.endswith(".md"):
                return AiHint(url=link.url, reason="llms.txt markdown doc")
    return AiHint(headers={"Accept": "text/markdown"}, reason="prefer markdown")
    # return None => no opinion; proceed normally
AiHint fields:
- url: the URL to fetch (None keeps the requested URL).
- headers: extra request headers to send.
- reason: recorded in result.policy.llms.reason (provenance).
- extra: free-form dict for your own bookkeeping.
Policy is never overridden
robots.txt is re-evaluated for whatever URL the AI chooses, against that
URL’s own origin — a cross-origin hint causes the target host’s robots.txt
to be loaded and evaluated, so a hint can never bypass another site’s rules. If
the chosen URL is disallowed, the entire hint is dropped (URL and
headers) and WebCanon continues with normal resolution (or raises PolicyError
when llms.strategy="force"). The AI is untrusted: it can guide retrieval,
not bypass policy. See security.md and
policy-model.md.
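The re-evaluation rule can be sketched with the standard-library robots parser. This is an illustrative model only: `vet_hint` and the in-memory `robots_by_origin` mapping are hypothetical stand-ins for the real step that loads robots.txt from the hinted URL's host, and the "force" strategy that raises PolicyError is not modelled.

```python
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def origin_of(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def vet_hint(
    hint_url: str,
    hint_headers: Mapping[str, str],
    robots_by_origin: Mapping[str, str],
    user_agent: str = "WebCanon",
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (url, headers) if the hinted URL is allowed, else None (whole hint dropped)."""
    # Evaluate against the robots.txt of the hinted URL's own origin.
    robots_txt = robots_by_origin.get(origin_of(hint_url), "")
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, hint_url):
        return None  # URL and headers are discarded together
    return hint_url, dict(hint_headers)

ROBOTS_BY_ORIGIN = {
    "https://example.com": "User-agent: *\nDisallow: /private/\n",
    "https://other.example": "User-agent: *\nDisallow: /\n",
}
print(vet_hint("https://other.example/docs.md", {"Accept": "text/markdown"}, ROBOTS_BY_ORIGIN))
```

Returning None for the whole hint, rather than keeping the headers, mirrors the rule that a disallowed URL invalidates everything the resolver proposed.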
Injected headers are restricted
Headers an ai_resolver (or custom caller) supplies are limited to a safe
allowlist (Accept, Accept-Language, Accept-Encoding, If-None-Match,
If-Modified-Since). Credential-like headers (Authorization, Cookie, …)
are dropped, and all injected headers are dropped on cross-origin redirects
so they cannot leak to another host. User-Agent is always sent from
UserAgentConfig.
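The two header rules above (allowlist filtering, and dropping injected headers on cross-origin redirects) can be sketched as follows. The function names are hypothetical and the code is a simplified model rather than the library's implementation; in particular, the origin comparison does not normalise default ports.

```python
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import urlsplit

# Allowlist taken from the text above, compared case-insensitively.
ALLOWED_HEADERS = {
    "accept", "accept-language", "accept-encoding",
    "if-none-match", "if-modified-since",
}

def filter_injected_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only allowlisted headers; Authorization, Cookie, etc. are dropped."""
    return {k: v for k, v in headers.items() if k.lower() in ALLOWED_HEADERS}

def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)

def headers_for_redirect(
    headers: Mapping[str, str], from_url: str, to_url: str
) -> Dict[str, str]:
    """Carry injected headers across a redirect only when the origin is unchanged."""
    return dict(headers) if origin(from_url) == origin(to_url) else {}

safe = filter_injected_headers({"Accept": "text/markdown", "Cookie": "sid=1"})
print(safe)
print(headers_for_redirect(safe, "https://example.com/a", "https://cdn.example.net/a"))
```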
Writing a fetcher / extractor
from webcanon.fetch import FetchResponse
from webcanon.extract import ExtractedDocument

def my_fetcher(url, *, config, user_agent, headers=None) -> FetchResponse:
    ...  # MUST still enforce the SSRF guard (see webcanon.ssrf.assert_safe_url)

def my_extractor(body, *, content_type) -> ExtractedDocument:
    ...  # e.g. wrap Trafilatura / Readability
A custom fetcher is responsible for honouring the SSRF guard and the transport
limits in config (timeout, redirects, body size, content types).
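To give a sense of what an SSRF guard has to reject, here is a deliberately small sketch built on the standard library. It is not `webcanon.ssrf.assert_safe_url` and should not replace it: `is_blocked_ip` and `url_literal_is_safe` are hypothetical names, and the sketch only inspects IP literals, whereas a real guard must also resolve hostnames and check every resolved address.

```python
import ipaddress
from urllib.parse import urlsplit

def is_blocked_ip(address: str) -> bool:
    """True for addresses a fetcher should never reach (internal or special ranges)."""
    ip = ipaddress.ip_address(address)  # raises ValueError if not an IP literal
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )

def url_literal_is_safe(url: str) -> bool:
    """Accept only http(s) URLs whose host, when an IP literal, is not blocked."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        return not is_blocked_ip(parts.hostname)
    except ValueError:
        # A DNS name: this sketch lets it through, a real guard must resolve it.
        return True

print(url_literal_is_safe("http://169.254.169.254/latest/meta-data"))
print(url_literal_is_safe("https://example.com/"))
```

A custom fetcher would run a check of this kind before the first request and again on every redirect target.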
Headless browser (JavaScript-rendered pages)
For single-page apps and client-rendered content, use the built-in
PlaywrightFetcher, which renders the page in a real headless browser and
returns the post-JavaScript HTML. Playwright is an optional dependency:
pip install "webcanon[headless]"
python -m playwright install chromium
from webcanon import WebCanon
from webcanon.config import RetrievalConfig
from webcanon.headless import PlaywrightFetcher
client = WebCanon(RetrievalConfig(
    fetcher=PlaywrightFetcher(
        browser="chromium",        # or "firefox" / "webkit"
        wait_until="networkidle",  # good default for SPAs
        wait_selector="#app",      # optional: wait for a content container
        extra_wait_ms=0,           # optional fixed delay
    )
))
result = client.retrieve_url("https://example.com/spa")
print(result.document.html) # rendered HTML
print(result.document.markdown) # extracted from the rendered DOM
It enforces the SSRF guard for the target and the final (post-navigation)
URL, and applies the FetchConfig timeout and body-size limits. If Playwright
is not installed, it raises a clear FetchError telling you how to install it.
document.html and document.markdown semantics
document.html always holds raw HTML (or None), and document.markdown
always holds Markdown. How they’re populated depends on what was fetched:
| Fetched content | document.markdown | document.html |
|---|---|---|
| HTML (AI off, or AI/llms chose an HTML page) | rule-based conversion of that HTML | that HTML |
| Markdown via AI/llms reroute (final URL ≠ requested) | the fetched Markdown | the originally-requested URL’s HTML, fetched separately |
| Markdown directly from the requested URL | the fetched Markdown | None (no distinct HTML exists) |
The separate “original HTML” fetch (second row) is best-effort and
policy-aware: it works like curl/httpx by default, but if the original URL
is disallowed by robots.txt (in respect mode), errors, or isn’t HTML, then
document.html is None and the Markdown result still succeeds. This keeps the
module usable as a general fetcher while staying governance-friendly for
robots-sensitive deployments.
document.html is included in to_dict() but not in to_document() (the
RAG/document shape stays lean).
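The three rows of the table above can be restated as a small decision function. This is a sketch for illustration, not the library's code: `Doc`, `populate`, and the `html_to_markdown` callback are hypothetical, and `original_html` stands for the result of the best-effort second fetch, which may be None.

```python
from typing import Callable, NamedTuple, Optional

class Doc(NamedTuple):
    markdown: str
    html: Optional[str]

def populate(
    fetched_body: str,
    fetched_is_markdown: bool,
    rerouted: bool,
    original_html: Optional[str],
    html_to_markdown: Callable[[str], str],
) -> Doc:
    """Fill markdown/html according to what was fetched and whether a reroute happened."""
    if not fetched_is_markdown:
        # Row 1: HTML was fetched, so convert it and keep the HTML itself.
        return Doc(markdown=html_to_markdown(fetched_body), html=fetched_body)
    if rerouted:
        # Row 2: Markdown came from another URL; html is the best-effort original.
        return Doc(markdown=fetched_body, html=original_html)
    # Row 3: the requested URL itself served Markdown, so no distinct HTML exists.
    return Doc(markdown=fetched_body, html=None)

def fake_convert(html: str) -> str:
    return "converted:" + html  # stand-in for a real HTML to Markdown converter

print(populate("# Title", True, True, None, fake_convert))
```

In the rerouted case a None `original_html` still yields a usable result, which matches the statement that the Markdown result succeeds even when the separate HTML fetch does not.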