Extraction quality
Extraction turns fetched bytes into LLM-ready Markdown plus a quality signal. WebCanon treats extraction as a measurable, replaceable step, not a one-shot HTML→Markdown call.
v0.1 baseline (webcanon.extract)
The shipped extractor uses only the standard library (`html.parser`). It:
- Drops non-content tags: `script`, `style`, `nav`, `footer`, `aside`, `noscript`, `template`, `svg`.
- Preserves headings (`#`–`######`), paragraphs, lists (nested), `pre`/`code` blocks, blockquotes, bold/italic, and links (`[text](href)`).
- Extracts the `<title>`.
- Collects all link hrefs into `document.links`.
- Detects hidden text (`hidden`, `aria-hidden="true"`, `display:none`, `visibility:hidden`) and raises a warning; hidden content is a common prompt-injection vector (see `security.md`).
- Returns a coarse `quality_score` in `[0, 1]` based on the ratio of extracted text to raw HTML length.
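
The steps above can be illustrated with a minimal sketch built on `html.parser`. It is not the shipped `webcanon.extract` code: `SketchExtractor`, `looks_hidden`, and `sketch_extract` are hypothetical names, Markdown rendering is omitted, and the sketch assumes that hidden text is only flagged (not removed) and that the score is the plain length ratio clamped to `[0, 1]`.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is discarded (the list given above).
DROP_TAGS = {"script", "style", "nav", "footer", "aside", "noscript", "template", "svg"}


def looks_hidden(attrs):
    """Return True if the attribute list marks an element as hidden."""
    attr = dict(attrs)
    if "hidden" in attr:  # boolean attribute, its value is None
        return True
    if (attr.get("aria-hidden") or "").strip().lower() == "true":
        return True
    style = (attr.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class SketchExtractor(HTMLParser):
    """Hypothetical, simplified stand-in for the baseline extractor."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.links = []
        self.warnings = []
        self._drop_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self._drop_depth += 1
        if looks_hidden(attrs):
            # Assumption: only warn; the text itself is left in place.
            self.warnings.append(f"hidden content in <{tag}>")
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # all hrefs, including dropped regions

    def handle_endtag(self, tag):
        if tag in DROP_TAGS and self._drop_depth > 0:
            self._drop_depth -= 1

    def handle_data(self, data):
        if self._drop_depth == 0 and data.strip():
            self.text_parts.append(data)


def sketch_extract(html):
    """Return extracted text, links, warnings, and a coarse quality score."""
    parser = SketchExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())  # collapse whitespace
    score = min(1.0, len(text) / len(html)) if html else 0.0
    return {"text": text, "links": parser.links,
            "warnings": parser.warnings, "quality_score": score}
```

Calling `sketch_extract(html)` on a page with a `<nav>` block and a `<div hidden>` would leave the navigation text out, keep its links, and record one warning for the hidden element.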
If the response is already Markdown or plain text (`content_type` indicates `markdown` or `text/plain`), the body is passed through with `quality_score = 1.0`.
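
A small sketch of that pass-through check, assuming the decision is made from the media type alone. The function name `passthrough` and the exact matching rule are illustrative assumptions, not WebCanon's API.

```python
def passthrough(body, content_type):
    """Return (body, 1.0) for Markdown or plain text, else None.

    Hypothetical helper: None means the HTML extraction path should run.
    """
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if "markdown" in media_type or media_type == "text/plain":
        return body, 1.0
    return None
```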
Quality dimensions (planned)
Higher-quality extractors should report a structured score:
| Dimension | Question |
|---|---|
| Extraction rate | How much main content survived boilerplate removal? |
| Link preservation | Were in-content links kept? |
| Table preservation | Were tables rendered as Markdown tables? |
| Code preservation | Were code blocks kept verbatim? |
| Duplication | Is repeated boilerplate present? |
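
One possible shape for such a structured score, sketched as a dataclass. Everything here is an assumption for illustration: the field names mirror the table, each value is taken to lie in `[0, 1]`, `overall` is an unweighted mean, and `link_preservation` uses a naive substring check for Markdown link targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityReport:
    """Hypothetical structured score; every field is in [0, 1]."""
    extraction_rate: float
    link_preservation: float
    table_preservation: float
    code_preservation: float
    duplication: float  # share of repeated boilerplate, lower is better

    def overall(self):
        # Assumption: unweighted mean, with duplication inverted.
        parts = [self.extraction_rate, self.link_preservation,
                 self.table_preservation, self.code_preservation,
                 1.0 - self.duplication]
        return sum(parts) / len(parts)


def link_preservation(source_hrefs, markdown):
    """Fraction of in-content hrefs that appear as Markdown link targets."""
    if not source_hrefs:
        return 1.0  # nothing to preserve
    kept = sum(1 for href in source_hrefs if f"]({href})" in markdown)
    return kept / len(source_hrefs)
```

Reporting the dimensions separately, rather than only the mean, keeps a failure in one area (for example, lost tables) visible even when the overall number looks acceptable.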
Pluggable extractors (planned)
The standard layer is the interface, not the implementation. Future extractors plug in behind a common shape and are selected per input:
```python
from typing import Protocol

class Extractor(Protocol):
    """Common shape every extractor plugs in behind."""
    name: str
    def can_handle(self, response) -> bool: ...
    def extract(self, response) -> "ExtractedDocument": ...
```
Planned extractor implementations: Readability, Trafilatura, and an optional LLM-assisted repair pass for DOMs that defeat rule-based extraction.
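
As a rough illustration of per-input selection, the sketch below implements two toy extractors against the protocol and picks the first one whose `can_handle` returns true. The `Response` and `ExtractedDocument` dataclasses, both toy extractors, and the first-match rule in `select_extractor` are all assumptions made for this example; the source does not specify how selection works.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Response:  # hypothetical stand-in for the fetched response
    content_type: str
    body: str


@dataclass
class ExtractedDocument:  # hypothetical, reduced to three fields
    markdown: str
    quality_score: float
    extractor: str


class Extractor(Protocol):
    name: str

    def can_handle(self, response: Response) -> bool: ...
    def extract(self, response: Response) -> ExtractedDocument: ...


class PassthroughExtractor:
    name = "passthrough"

    def can_handle(self, response):
        return response.content_type.startswith(("text/markdown", "text/plain"))

    def extract(self, response):
        return ExtractedDocument(response.body, 1.0, self.name)


class FallbackExtractor:
    name = "fallback"

    def can_handle(self, response):
        return True  # accepts anything, so it belongs last in the list

    def extract(self, response):
        # Placeholder: a real extractor would convert HTML to Markdown here.
        return ExtractedDocument(response.body, 0.0, self.name)


def select_extractor(extractors: Sequence[Extractor], response: Response) -> Extractor:
    """Assumed policy: first extractor that accepts the response wins."""
    for extractor in extractors:
        if extractor.can_handle(response):
            return extractor
    raise LookupError("no extractor can handle this response")
```

Because `Extractor` is a `Protocol`, the toy classes satisfy it structurally and do not need to inherit from it.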
Available now: a Playwright headless fetcher for JS-heavy pages, `webcanon.headless.PlaywrightFetcher` (optional `webcanon[headless]` extra). It renders the page in a real browser and feeds the post-JavaScript HTML into whichever extractor is configured. See `customization.md`.
Conformance fixtures (planned)
A `fixtures/html/` corpus (`article-basic`, `docs-page`, `ecommerce-product`,
`spa-rendered`, `table-heavy`, `hostile-hidden-text`) will pin extraction
behaviour across extractors so the standard stays comparable.
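
A sketch of how such a corpus could be exercised. The layout is an assumption, since the fixture format is not yet defined: each `<name>.html` is paired with a `<name>.expected.md` holding the pinned output, and an extractor is modelled as a plain function from HTML to Markdown. `run_conformance` is a hypothetical helper.

```python
from pathlib import Path
from typing import Callable, Dict


def run_conformance(
    fixture_dir: Path,
    extractors: Dict[str, Callable[[str], str]],
) -> Dict[str, Dict[str, bool]]:
    """Map fixture name -> extractor name -> whether output matched."""
    results: Dict[str, Dict[str, bool]] = {}
    for html_path in sorted(fixture_dir.glob("*.html")):
        expected_path = html_path.with_name(html_path.stem + ".expected.md")
        if not expected_path.exists():
            continue  # fixture has no pinned output yet
        html = html_path.read_text(encoding="utf-8")
        expected = expected_path.read_text(encoding="utf-8").strip()
        results[html_path.stem] = {
            name: extract(html).strip() == expected
            for name, extract in extractors.items()
        }
    return results
```

Exact string comparison is the strictest option; a real suite might instead compare structured scores per fixture so that harmless whitespace differences do not fail the run.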