robots.txt compliance (RFC 9309)

webcanon.robots is a pure, I/O-free implementation of the Robots Exclusion Protocol (RFC 9309). HTTP status handling lives in the fetch/client layer; parsing and matching live here so they can be unit-tested without a network.

Parsing

Lines are split on # to strip comments, then on the first :.
User-agent, Allow, Disallow, and Sitemap records are recognised.
Consecutive User-agent lines (with no rule in between) share the rule block that follows them.
Rules appearing before any User-agent are attributed to *.

User-agent matching

policy.rules_for("WebCanonBot")

A group token matches the crawler if it is a case-insensitive prefix of (or substring within) the product token. If any non-* group matches, those rules are merged and used; otherwise the * group is the fallback.

Path matching

Allow / Disallow patterns support:

* — matches any run of characters.
$ — anchors the end of the path (only meaningful as the final character).
Everything else is matched literally (after re.escape).

Matching is evaluated against path[?query], percent-decoded.

Precedence

The most specific rule wins, where specificity is the pattern length (excluding a trailing $).
On a length tie, Allow wins over Disallow.
An empty Disallow: value means “allow everything” and never produces a positive match.
/robots.txt itself is always implicitly allowed.

Transport-level verdicts

Handled in webcanon.client._load_robots, per RFC 9309 §2.3.1:

HTTP result	Meaning	Effect
2xx	robots available	parse and evaluate
4xx	robots unavailable	treat as “allow all”
5xx / network error / timeout	robots unreachable	`unknown_unreachable` → deny by default

Caching should not exceed 24h (RobotsConfig.max_cache_seconds); a persistent cache is planned for v0.2.

Worked examples

robots.txt	URL	Verdict
`Disallow: /private`	`/private/a`	`disallowed_explicit`
`Disallow: /private`	`/public`	`allowed_implicit`
`Disallow: /docs` + `Allow: /docs/public`	`/docs/public/a`	`allowed_explicit`
`Disallow: /docs` + `Allow: /docs/public`	`/docs/secret`	`disallowed_explicit`
`Disallow: /*.pdf$`	`/file.pdf`	`disallowed_explicit`
`Disallow: /*.pdf$`	`/file.pdf?x=1`	`allowed_implicit`

These are exercised in tests/test_robots.py.