Customization hooks

WebCanon’s pipeline runs through three replaceable callables. fetcher and extractor default to the built-in implementations, and ai_resolver defaults to none (the rule-based resolver is used instead). Override any of them on RetrievalConfig.

| Hook | Type | Replaces | Default |
| --- | --- | --- | --- |
| fetcher | Fetcher | the scraping transport | webcanon.fetch.fetch (SSRF-guarded httpx) |
| extractor | Extractor | HTML → Markdown conversion | webcanon.extract.extract_html |
| ai_resolver | AiResolver | AI reasoning over llms.txt + URL | none (rule-based resolve_candidates) |

Default identity

The default User-Agent product token is WebCanon (WebCanon/<version>), configurable via UserAgentConfig.

The two flows

URL only (no AI)

  1. Fetch robots.txt.
  2. Evaluate the target URL against it (is the path disallowed for User-agent: * or for our token?).
  3. Fetch the page and convert its HTML to Markdown with the rule-based extractor.
  4. Return the scraped content, the Markdown, and the robots recommendation together.

from webcanon import WebCanon

result = WebCanon().retrieve_url("https://example.com/page")
result.document.markdown              # rule-based Markdown
result.policy.robots.recommendation   # "recommended" / "not_recommended" / ...
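Steps 1 and 2 can be approximated with Python's standard urllib.robotparser. This is only a sketch, not WebCanon's implementation: the robots.txt body and the robots_recommendation helper are made up for illustration, and only the two recommendation strings shown above are used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: WebCanon
Disallow: /drafts/
"""


def robots_recommendation(robots_body: str, url: str, token: str = "WebCanon") -> str:
    """Map a robots.txt verdict for `url` onto a recommendation string."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    # A group naming our token takes precedence over the "*" group.
    return "recommended" if parser.can_fetch(token, url) else "not_recommended"


print(robots_recommendation(ROBOTS_TXT, "https://example.com/page"))
print(robots_recommendation(ROBOTS_TXT, "https://example.com/drafts/a"))
```

Because the file contains a group for the WebCanon token, that group alone decides the verdict for WebCanon; the `*` group only applies to agents without a group of their own.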

URL + AI enabled (ai_reasoning=True)

  1. Fetch robots.txt.
  2. Evaluate the target URL against it.
  3. Hand the URL, the parsed llms.txt, and the robots recommendation to the ai_resolver. The AI decides how to scrape: for example, read a different URL or send a specific request header (some docs expose a Markdown variant via an Accept header or a .md URL).
  4. Return the content fetched per the AI’s decision, the llms.txt-derived decision, and the robots recommendation.

# client is a WebCanon instance whose RetrievalConfig sets ai_resolver (see the Library example below)
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
result.policy.llms.resolved_by      # "ai" or "rule_based"
result.policy.llms.applied_headers  # headers the AI asked us to send
result.selected_source.final_url    # the URL actually fetched

If no ai_resolver is configured, WebCanon falls back to the built-in rule-based resolver (exact llms.txt match → .md variant → original URL).
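The fallback order (exact llms.txt match → .md variant → original URL) can be pictured as an ordered candidate list. The sketch below is an assumption-heavy illustration, not the built-in resolver: candidate_order and md_variant are hypothetical names, "exact match" is assumed to mean a link in llms.txt equal to the requested URL, and the .md variant is assumed to be the path with ".md" appended.

```python
from typing import List, Tuple
from urllib.parse import urlsplit, urlunsplit


def md_variant(url: str) -> str:
    """Append .md to the path, leaving query and fragment untouched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.rstrip("/") + ".md"))


def candidate_order(requested_url: str, llms_links: List[str]) -> List[Tuple[str, str]]:
    """Return (label, url) candidates in the order they would be tried."""
    ordered = []
    if requested_url in llms_links:  # assumed meaning of "exact llms.txt match"
        ordered.append(("llms_exact", requested_url))
    path = urlsplit(requested_url).path.rstrip("/")
    if path and not path.endswith(".md"):
        ordered.append(("md_variant", md_variant(requested_url)))
    ordered.append(("original", requested_url))

    # Keep the first occurrence of each URL so nothing is tried twice.
    seen = set()
    unique = []
    for label, url in ordered:
        if url not in seen:
            seen.add(url)
            unique.append((label, url))
    return unique


for label, url in candidate_order("https://example.com/docs/api", []):
    print(label, url)
```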

Built-in AI resolvers (Anthropic / OpenAI / Gemini, via env vars)

WebCanon ships ai_resolver implementations for three providers. Enable one from the environment so the CLI and the library share a single switch:

| Variable | Meaning |
| --- | --- |
| WEBCANON_AI_PROVIDER | anthropic, openai, or gemini to enable; unset or none to disable |
| WEBCANON_AI_MODEL | model id (per-provider default if unset) |
| provider API key | ANTHROPIC_API_KEY / OPENAI_API_KEY / GEMINI_API_KEY (or GOOGLE_API_KEY) |

| Provider | WEBCANON_AI_PROVIDER | Extra | Default model | Resolver class |
| --- | --- | --- | --- | --- |
| Anthropic (Claude) | anthropic | pip install "webcanon[ai]" | claude-opus-4-8 | AnthropicAiResolver |
| OpenAI | openai | pip install "webcanon[openai]" | gpt-5 | OpenAiAiResolver |
| Google Gemini | gemini | pip install "webcanon[gemini]" | gemini-2.5-pro | GeminiAiResolver |

CLI — pick the provider/model via env vars or flags (flags win; --ai-provider implies --ai):

# Flags
export OPENAI_API_KEY=sk-...
webcanon fetch https://example.com/docs/api --ai-provider openai --ai-model gpt-4o

# Or env vars
export WEBCANON_AI_PROVIDER=openai          # or anthropic / gemini
export OPENAI_API_KEY=sk-...
# optional: export WEBCANON_AI_MODEL=gpt-4o
webcanon fetch https://example.com/docs/api --ai

| Setting | Environment variable | CLI flag |
| --- | --- | --- |
| Provider | WEBCANON_AI_PROVIDER | --ai-provider {anthropic,openai,gemini} |
| Model | WEBCANON_AI_MODEL | --ai-model MODEL |
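The precedence rule (flags win over environment variables; unset or none disables AI) can be sketched in a few lines. This is not the CLI's actual code: effective_provider and effective_model are hypothetical helpers, and the case-insensitive matching and the ValueError for unknown values are assumptions.

```python
from typing import Mapping, Optional

PROVIDERS = {"anthropic", "openai", "gemini"}


def effective_provider(flag: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Pick the provider: the CLI flag wins, then WEBCANON_AI_PROVIDER."""
    raw = flag if flag else env.get("WEBCANON_AI_PROVIDER", "")
    value = raw.strip().lower()
    if value in ("", "none"):
        return None  # AI disabled
    if value not in PROVIDERS:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unknown AI provider: {raw!r}")
    return value


def effective_model(flag: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Pick the model id; None means 'use the per-provider default'."""
    return flag or env.get("WEBCANON_AI_MODEL") or None


env = {"WEBCANON_AI_PROVIDER": "gemini", "WEBCANON_AI_MODEL": "gemini-2.5-pro"}
print(effective_provider("openai", env))  # the flag overrides the env var
print(effective_provider(None, env))      # falls back to the env var
```

Passing the environment in as a mapping (os.environ in real use) keeps the logic easy to test without touching the process environment.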

Library — ai_resolver_from_env() returns the configured resolver or None:

from webcanon import WebCanon
from webcanon.ai import ai_resolver_from_env
from webcanon.config import RetrievalConfig

client = WebCanon(RetrievalConfig(ai_resolver=ai_resolver_from_env()))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
print(result.policy.llms.resolved_by)  # "ai" when the model rerouted

The model receives the URL, the parsed llms.txt, and the robots verdict, and returns a URL to read plus safe content-negotiation headers. Its choice is never trusted: the URL is re-evaluated against robots.txt and the SSRF guard, and only allowlisted headers are sent (see security.md). If the selected provider’s package isn’t installed or the API call fails, the resolver declines and WebCanon falls back to the rule engine.

Writing a custom ai_resolver

from typing import Optional
from webcanon import AiContext, AiHint

def my_ai(ctx: AiContext) -> Optional[AiHint]:  # Optional keeps it 3.9-compatible
    # ctx.requested_url, ctx.origin
    # ctx.llms_manifest  -> parsed llms.txt (title/summary/links) or None
    # ctx.llms_url       -> the llms.txt URL (or None)
    # ctx.robots_recommendation / ctx.robots_verdict
    if ctx.llms_manifest:
        for link in ctx.llms_manifest.links:
            if link.url.endswith(".md"):
                return AiHint(url=link.url, reason="llms.txt markdown doc")
    # Returning None instead means "no opinion; proceed normally".
    return AiHint(headers={"Accept": "text/markdown"}, reason="prefer markdown")

AiHint fields:

  • url — the URL to fetch (None keeps the requested URL).
  • headers — extra request headers to send.
  • reason — recorded in result.policy.llms.reason (provenance).
  • extra — free-form dict for your own bookkeeping.

Policy is never overridden

robots.txt is re-evaluated for whatever URL the AI chooses, against that URL’s own origin. A cross-origin hint causes the target host’s robots.txt to be loaded and evaluated, so a hint can never bypass another site’s rules. If the chosen URL is disallowed, the entire hint is dropped (URL and headers) and WebCanon continues with normal resolution (or raises PolicyError when llms.strategy="force"). The AI is untrusted: it can guide retrieval, not bypass policy. See security.md and policy-model.md.
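The "drop the whole hint" rule can be illustrated with the standard urllib.robotparser. The sketch below is a simplified model rather than WebCanon's code: apply_hint and the ROBOTS_BY_ORIGIN mapping are hypothetical, robots.txt bodies are supplied in a dict instead of being fetched, a missing robots.txt is assumed to allow everything, and the llms.strategy="force" path is not modelled.

```python
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies keyed by origin (stand-in for real fetches).
ROBOTS_BY_ORIGIN = {"https://other.example": "User-agent: *\nDisallow: /\n"}


def origin_of(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def allowed_by_robots(url: str, robots: Mapping[str, str], token: str = "WebCanon") -> bool:
    """Evaluate `url` against the robots.txt of its own origin."""
    parser = RobotFileParser()
    parser.parse(robots.get(origin_of(url), "").splitlines())
    return parser.can_fetch(token, url)


def apply_hint(
    requested_url: str,
    hint_url: Optional[str],
    hint_headers: Mapping[str, str],
    robots: Mapping[str, str],
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) to use; a disallowed hint is dropped entirely."""
    target = hint_url or requested_url
    if allowed_by_robots(target, robots):
        return target, dict(hint_headers)
    # URL and headers are discarded together; normal resolution takes over.
    return requested_url, {}


print(apply_hint(
    "https://example.com/docs",
    "https://other.example/docs.md",
    {"Accept": "text/markdown"},
    ROBOTS_BY_ORIGIN,
))
```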

Injected headers are restricted

Headers an ai_resolver (or custom caller) supplies are limited to a safe allowlist (Accept, Accept-Language, Accept-Encoding, If-None-Match, If-Modified-Since). Credential-like headers (Authorization, Cookie, …) are dropped, and all injected headers are dropped on cross-origin redirects so they cannot leak to another host. User-Agent is always sent from UserAgentConfig.
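As a rough illustration of the two rules above, the sketch below filters headers against the listed allowlist and drops them when a redirect leaves the original origin. The function names are hypothetical and the origin comparison is a simplification; WebCanon's own checks may differ.

```python
from typing import Dict, Mapping
from urllib.parse import urlsplit

# The allowlist named in the text, lower-cased for case-insensitive matching.
ALLOWED_HEADERS = frozenset({
    "accept",
    "accept-language",
    "accept-encoding",
    "if-none-match",
    "if-modified-since",
})


def filter_injected_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only allowlisted headers; Authorization, Cookie, etc. are dropped."""
    return {name: value for name, value in headers.items()
            if name.lower() in ALLOWED_HEADERS}


def same_origin(a: str, b: str) -> bool:
    # Simplification: an explicit default port counts as a different origin,
    # which errs on the side of dropping headers.
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.hostname, pa.port) == (pb.scheme, pb.hostname, pb.port)


def headers_for_redirect(injected: Mapping[str, str], from_url: str, to_url: str) -> Dict[str, str]:
    """Injected headers survive same-origin redirects only."""
    return dict(injected) if same_origin(from_url, to_url) else {}


hint = {"Accept": "text/markdown", "Authorization": "Bearer secret"}
safe = filter_injected_headers(hint)
print(safe)
print(headers_for_redirect(safe, "https://example.com/a", "https://cdn.example.net/a"))
```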

Writing a fetcher / extractor

from webcanon.fetch import FetchResponse
from webcanon.extract import ExtractedDocument

def my_fetcher(url, *, config, user_agent, headers=None) -> FetchResponse:
    ...  # MUST still enforce the SSRF guard (see webcanon.ssrf.assert_safe_url)

def my_extractor(body, *, content_type) -> ExtractedDocument:
    ...  # e.g. wrap Trafilatura / Readability

A custom fetcher is responsible for honouring the SSRF guard and the transport limits in config (timeout, redirects, body size, content types).
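To give a sense of what an SSRF guard typically checks, here is a minimal stand-alone sketch using only the standard library. It is not webcanon.ssrf.assert_safe_url, whose behaviour is not shown here; looks_safe, ip_is_blocked, and system_resolve are hypothetical names, and a real custom fetcher should call WebCanon's own guard.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit


def system_resolve(host: str) -> Iterable[str]:
    """Resolve a hostname to IP strings with the OS resolver."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def ip_is_blocked(text: str) -> bool:
    ip = ipaddress.ip_address(text.split("%")[0])  # drop any IPv6 zone id
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def looks_safe(url: str, resolve: Callable[[str], Iterable[str]] = system_resolve) -> bool:
    """True only for http(s) URLs whose host resolves solely to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = list(resolve(parts.hostname))
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    return bool(addresses) and not any(ip_is_blocked(a) for a in addresses)


# Fake resolvers keep the demo offline.
print(looks_safe("http://internal.test/", lambda host: ["10.0.0.5"]))
print(looks_safe("https://example.com/", lambda host: ["93.184.216.34"]))
```

The resolver is injectable so the check can be tested without network access. The same check should also be applied to every redirect target, not only the first URL.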

Headless browser (JavaScript-rendered pages)

For single-page apps and client-rendered content, use the built-in PlaywrightFetcher, which renders the page in a real headless browser and returns the post-JavaScript HTML. Playwright is an optional dependency:

pip install "webcanon[headless]"
python -m playwright install chromium

from webcanon import WebCanon
from webcanon.config import RetrievalConfig
from webcanon.headless import PlaywrightFetcher

client = WebCanon(RetrievalConfig(
    fetcher=PlaywrightFetcher(
        browser="chromium",        # or "firefox" / "webkit"
        wait_until="networkidle",  # good default for SPAs
        wait_selector="#app",      # optional: wait for a content container
        extra_wait_ms=0,           # optional fixed delay
    )
))
result = client.retrieve_url("https://example.com/spa")
print(result.document.html)      # rendered HTML
print(result.document.markdown)  # extracted from the rendered DOM

PlaywrightFetcher enforces the SSRF guard for both the target URL and the final (post-navigation) URL, and applies the FetchConfig timeout and body-size limits. If Playwright is not installed, it raises a clear FetchError telling you how to install it.

document.html and document.markdown semantics

document.html always holds raw HTML (or None), and document.markdown always holds Markdown. How they’re populated depends on what was fetched:

| Fetched content | document.markdown | document.html |
| --- | --- | --- |
| HTML (AI off, or AI/llms chose an HTML page) | rule-based conversion of that HTML | that HTML |
| Markdown via AI/llms reroute (final URL ≠ requested) | the fetched Markdown | the originally-requested URL’s HTML, fetched separately |
| Markdown directly from the requested URL | the fetched Markdown | None (no distinct HTML exists) |

The separate “original HTML” fetch (second row) is best-effort and policy-aware: it works like curl/httpx by default, but if the original URL is disallowed by robots.txt (in respect mode), the fetch fails, or the response isn’t HTML, then document.html is None and the Markdown result still succeeds. This keeps the module usable as a general fetcher while staying governance-friendly for robots-sensitive deployments.
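The three cases can be written as one small function. This is a sketch of the semantics described above, not WebCanon's code: populate_document and fake_convert are hypothetical, and original_html stands for the result of the best-effort second fetch (None when it was disallowed, failed, or returned non-HTML).

```python
from typing import Callable, Optional, Tuple


def populate_document(
    fetched_body: str,
    fetched_is_markdown: bool,
    rerouted: bool,
    original_html: Optional[str],
    html_to_markdown: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Return (markdown, html) for the three fetched-content cases."""
    if not fetched_is_markdown:
        # HTML was fetched: convert it and keep the raw HTML.
        return html_to_markdown(fetched_body), fetched_body
    if rerouted:
        # Markdown came from a different URL: html is the original page, if available.
        return fetched_body, original_html
    # The requested URL itself served Markdown: no distinct HTML exists.
    return fetched_body, None


def fake_convert(html: str) -> str:
    """Placeholder converter used only to make the demo runnable."""
    return "md:" + html


print(populate_document("<h1>Hi</h1>", False, False, None, fake_convert))
print(populate_document("# Hi", True, True, None, fake_convert))
```

In the second call the reroute succeeded but the original HTML fetch did not, so html is None while the Markdown is still returned.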

document.html is included in to_dict() but not in to_document() (the RAG/document shape stays lean).