Crawlability & Indexation
Can search engines and AI crawlers reach every page that should be indexed, and none that shouldn’t be?
- robots.txt present, valid, not blocking critical paths
- sitemap.xml referenced in robots.txt and contains only indexable URLs
- Single canonical domain: all four host variants (http/https, each with and without www) 301 to one canonical
- Self-referencing canonical tags on indexable pages
- noindex directives audited — only on pages we want excluded
- Indexation coverage (submitted vs indexed) — gap explained, not accidental
- Soft 404s — zero tolerated on important pages
- Orphan pages identified, then either linked or deleted
- Crawl budget concentrated on valuable URLs (log-file analysis)
- Rendered HTML === raw HTML for content (JavaScript rendering parity)
Common failures:
- Paginated archive URLs and faceted navigation eating crawl budget
- Staging / preview URLs indexed by accident
- noindex left over from a migration
- Crawlers blocked at the WAF or CDN (common with Cloudflare defaults)
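
The robots.txt and sitemap items above can be spot-checked with a short script. This is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt body, the critical paths, and the crawler names are hypothetical placeholders, not values from this audit. Note that the standard-library parser applies the first matching rule and does not understand wildcards, so its verdicts can differ from how Google evaluates the same file. Treat a hit as a prompt to investigate, not as proof.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical inputs: replace with the site's real robots.txt body,
# the paths that must stay crawlable, and the crawlers you care about.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /blog/

Sitemap: https://www.example.com/sitemap.xml
"""
CRITICAL_PATHS = ["/", "/products/", "/blog/"]
USER_AGENTS = ["Googlebot", "GPTBot"]


def audit_robots(robots_txt, critical_paths, user_agents):
    """Return (blocked, sitemaps) for a robots.txt body.

    blocked  -- list of (user_agent, path) pairs the file disallows
    sitemaps -- list of URLs declared on Sitemap: lines (may be empty)
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [
        (agent, path)
        for agent in user_agents
        for path in critical_paths
        if not parser.can_fetch(agent, path)
    ]
    # site_maps() returns None when no Sitemap: line is present (Python 3.8+).
    sitemaps = parser.site_maps() or []
    return blocked, sitemaps


blocked, sitemaps = audit_robots(ROBOTS_TXT, CRITICAL_PATHS, USER_AGENTS)
for agent, path in blocked:
    print(f"BLOCKED for {agent}: {path}")
if not sitemaps:
    print("No Sitemap: line found in robots.txt")
```

With the sample input, `/blog/` is reported as blocked for both crawlers, which is the kind of accidental rule the checklist is meant to catch. A check like this only reads robots.txt: a block at the WAF or CDN will not show up here and has to be confirmed from real crawler requests or server logs.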