Technical GEO
Technical GEO for marketing engineers: robots.txt, structured data, llms.txt | Suparanku
Technical GEO has two pillars — crawler access and server-rendered HTML. Allow retrieval bots like OAI-SearchBot, PerplexityBot and ClaudeBot, then verify with logs. Structured data helps when it carries real facts (price, rating, specs). llms.txt is a cheap extra, not a citation factor.
What engineering owns in GEO
Content is the marketer’s job; whether AI can read the site at all is an infrastructure problem. The two pillars the technical side owns are crawler access and rendering — if either fails, nothing else matters. Structured data helps in specific, well-evidenced cases, and llms.txt is a cheap optional extra. This guide covers all of it at implementation level.
1. Admit the AI crawlers
Retrieval bots vs training scrapers
Not all AI crawlers do the same job. Retrieval bots — OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot — fetch pages to power live answers and citations; blocking them removes you from AI answers. Training scrapers — GPTBot, anthropic-ai — collect data for model training; blocking them only affects training, not your visibility in search. OpenAI documents its bots by purpose,* so you can opt out of training while staying visible and citable.
This matters in practice: an Otterly analysis of over one million AI citations (2026) found that 73% of sites have technical barriers blocking AI crawler access.†
# Retrieval: powers citations; allow if you want AI visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

# Training: affects model training only, not citations
User-agent: GPTBot
Allow: /
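Before deploying a robots.txt like the one above, it can help to check that the rules parse the way you intend. The sketch below uses Python's standard urllib.robotparser for this. The policy string is a hypothetical variation (retrieval bots allowed, GPTBot blocked to opt out of training), and the URL is a placeholder. The standard-library parser may differ from each crawler's own matching logic in edge cases, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: stay citable, opt out of training.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user agent, whether the given rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["OAI-SearchBot", "PerplexityBot", "GPTBot"],
    "https://example.com/page/",  # placeholder URL
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```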
Two caveats. First, robots.txt is honored by the major bots but not by all; Bytespider ignores it. Second, user-agent strings can be spoofed, so a log line that says "OAI-SearchBot" proves nothing by itself. To know who really visited, validate log entries against official IP ranges (for example, OpenAI publishes its searchbot IP list as searchbot.json).
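The IP check can be sketched as follows. The JSON shape used here (a "prefixes" list with "ipv4Prefix" keys) is an assumption about how such range files are laid out, and the CIDR blocks are documentation placeholders rather than real crawler ranges. Inspect the published file and adjust the keys before relying on this.

```python
import ipaddress
import json

# Assumed file shape with placeholder (TEST-NET) ranges, not real crawler IPs.
SAMPLE_RANGES_JSON = """
{"prefixes": [
  {"ipv4Prefix": "192.0.2.0/24"},
  {"ipv4Prefix": "198.51.100.16/28"}
]}
"""


def load_networks(ranges_json: str) -> list:
    """Parse the range file into a list of ip_network objects."""
    data = json.loads(ranges_json)
    return [
        ipaddress.ip_network(entry["ipv4Prefix"])
        for entry in data["prefixes"]
        if "ipv4Prefix" in entry
    ]


def is_official(ip: str, networks: list) -> bool:
    """True if the client IP from a log line falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
print(is_official("192.0.2.55", networks))   # inside the first placeholder range
print(is_official("203.0.113.9", networks))  # outside both ranges
```

A request whose user agent claims to be a retrieval bot but whose IP fails this check is most likely spoofed and can be excluded from crawl statistics.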
Also check that your CDN or WAF bot protection isn’t silently blocking AI crawlers. An allow rule in robots.txt means nothing if the firewall returns 403.
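One way to spot a silent firewall block is to count 403 responses per AI crawler in the access log. The sketch below assumes logs in the common "combined" format (status code after the quoted request, user agent as the last quoted field); the sample lines and the token list are illustrative, so adapt both to your own log format and bot list.

```python
import re
from collections import Counter

# Tokens to look for in the user-agent field (illustrative list).
AI_BOT_TOKENS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot", "GPTBot"]

# Combined log format tail: "<request>" <status> <bytes> "<referer>" "<user agent>"
LINE_RE = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def blocked_by_bot(log_lines) -> Counter:
    """Count 403 responses per AI crawler token."""
    blocked = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        status, user_agent = match.group(1), match.group(2).lower()
        if status != "403":
            continue
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                blocked[token] += 1
                break
    return blocked


SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing/ HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.11 - - [01/Jan/2026:00:00:02 +0000] "GET / HTTP/1.1" 200 9120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.12 - - [01/Jan/2026:00:00:03 +0000] "GET /blog/ HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]
counts = blocked_by_bot(SAMPLE_LOG)
print(dict(counts))
```

Any non-zero count for a bot that robots.txt allows points at the WAF or CDN layer rather than at the crawl rules.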
HTML that doesn’t depend on JavaScript
Vercel’s analysis found that major AI crawlers do not execute JavaScript — across 500,000+ GPTBot requests, zero traces of JS execution.** A client-side-rendered page is blank to ChatGPT-, Claude- and Perplexity-class fetchers. Google is the exception: it can render JavaScript for AI Overviews when not blocked — though since December 2025 Google excludes non-200 pages from rendering entirely. SSR, SSG or prerendering remains the safe baseline so the body text is present in the initial HTML.
Verification is one command:
curl -s -A "GPTBot" https://example.com/page/ | grep "key copy"
If the copy isn’t in the initial HTML, the rendering strategy needs work.
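A plain text search can give a false positive when the copy appears only inside a script payload (for example, hydration data) and not in the rendered markup. The following sketch, built on the standard html.parser module, checks for the copy in text outside script and style elements. The two HTML strings are made-up examples of a server-rendered page and a client-side shell.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def copy_in_initial_html(html: str, key_copy: str) -> bool:
    """True if key_copy appears in the markup's text outside scripts and styles."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return key_copy.lower() in text.lower()


SSR_HTML = "<html><body><main><h1>Pricing starts at $29</h1></main></body></html>"
CSR_HTML = '<html><body><div id="root"></div><script>render("Pricing starts at $29")</script></body></html>'

print(copy_in_initial_html(SSR_HTML, "Pricing starts at $29"))
print(copy_in_initial_html(CSR_HTML, "Pricing starts at $29"))
```

Feed it the body returned by the curl command above, or by any HTTP client that does not execute JavaScript.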
2. Structured data: useful, but not where you’d expect
Two facts up front. Google officially states that structured data is not required for generative AI search and that there is no special schema.org markup to add for AI features.*** And the strongest controlled experiment to date — Ahrefs tracking 1,885 pages that added JSON-LD against matched controls — found no citation uplift on any AI platform.††
There is one proven exception, and it’s where to start:
- Product / SoftwareApplication + Offer with real attributes: pages with Product/Review schema filled with concrete price, rating and specifications were cited 61.7% of the time versus 41.6% for generic schema types, with the strongest effect for low-authority domains. Explicit price is also one of the four citation “gatekeepers” identified in a SIGIR ’26 study of 252,000 controlled probes.‡ The value is not the tag; it’s the machine-readable facts the tag carries. Price misinformation is a frequent AI-answer error; correct machine-readable values counteract it.
- Organization: what actually canonicalizes the entity is not the markup itself but sameAs links to official profiles plus consistent brand facts across the web. Use it to anchor legal name, address and spelling variants.
- FAQPage: the wrapper itself is not a signal; a pure Q&A format tested at −5.7% influence versus non-Q&A pages. FAQ helps only when each answer carries evidence density (numbers, definitions, comparisons) rather than short isolated replies.
- Article + Person: authorship and dates support E-E-A-T, but note Google’s own framing: E-E-A-T is not a direct ranking factor. Trust is its core, and the weight is highest on YMYL topics.
After implementation, validate with both the Rich Results Test and the Schema.org validator.
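As an illustration of attribute-rich markup, the sketch below builds a Product + Offer JSON-LD object and wraps it in a script tag. The property names come from schema.org; the function names and all product values (name, price, rating, specs) are hypothetical placeholders, and in a real build they would come from your product database, never hard-coded.

```python
import json


def product_jsonld(name, price, currency, rating, review_count, specs):
    """Build a schema.org Product with Offer, AggregateRating and spec properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock",
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        },
        "additionalProperty": [
            {"@type": "PropertyValue", "name": key, "value": value}
            for key, value in specs.items()
        ],
    }


def render_jsonld_script(data: dict) -> str:
    """Serialize to a script tag; escape "</" so the JSON cannot close the tag early."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical product values for illustration only.
snippet = render_jsonld_script(
    product_jsonld(
        name="Example Widget Pro",
        price=29.0,
        currency="USD",
        rating=4.6,
        review_count=128,
        specs={"Weight": "1.2 kg", "Battery life": "10 h"},
    )
)
print(snippet)
```

The point of the example is the filled-in price, rating and specification fields: a Product tag with only a name carries none of the facts the cited studies associate with higher citation rates.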
3. llms.txt: publish it, but know what it is
Honest framing first: llms.txt is not a ranking or citation factor today. Google does not support it and is not planning to (Gary Illyes), and John Mueller has noted that no major AI system is confirmed to use it for answers.‡‡ A meta-synthesis of 54 studies scored it 2.0 out of 9.5 — no credible evidence that it influences AI citations in any way. The only verified behavior: OpenAI crawls llms.txt on some sites.
So why publish it at all? Because it’s cheap. A Markdown summary of the site’s structure and key content at the site root costs nothing if generated automatically, and it positions you for whatever agents do adopt it.
One operational rule: never maintain it by hand. Hand-edited llms.txt files tend to go stale as soon as content changes. Generate it at build time from your content collections; this site’s llms.txt is built automatically from every article and glossary term.
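A build-time generator can be very small. The sketch below assumes the layout described in the llms.txt proposal (a title heading, a one-line summary as a blockquote, then sections of Markdown links) and a hypothetical in-memory content collection; it is not this site's actual build script. In a real pipeline the entries would be read from your CMS or content directory.

```python
def build_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render an llms.txt body from a mapping of section name to entry dicts."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for entry in entries:
            lines.append(f"- [{entry['title']}]({entry['url']}): {entry['description']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical content collection; replace with data loaded at build time.
SECTIONS = {
    "Articles": [
        {
            "title": "Technical GEO",
            "url": "https://example.com/technical-geo/",
            "description": "Crawler access, rendering, structured data and llms.txt.",
        },
    ],
    "Glossary": [
        {
            "title": "Retrieval bot",
            "url": "https://example.com/glossary/retrieval-bot/",
            "description": "A crawler that fetches pages to power live AI answers.",
        },
    ],
}

llms_txt = build_llms_txt("Example Site", "Guides on generative engine optimization.", SECTIONS)
print(llms_txt)
```

Writing the returned string to llms.txt in the build output directory keeps the file in sync with every publish, with no manual step to forget.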
Verification checklist
- robots.txt explicitly allows the retrieval bots you want (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot)
- WAF/CDN doesn’t 403 AI crawler user agents (check access logs)
- curl -A "GPTBot" shows body copy in the initial HTML
- Product pages carry attribute-rich Product/Offer schema with real price and specs; Organization schema has sameAs links
- llms.txt is generated at build time (a cheap extra, not a citation factor)
- Server logs show real AI crawler visits, validated against official IP ranges and reviewed regularly
With this foundation in place, content improvements show up directly in measurement. Without it, the best article in the world is invisible to AI.
* OpenAI, “Overview of OpenAI Crawlers” (as of May 2025)
** Vercel, “The rise of the AI crawler” (January 2025)
*** Google Search Central, “AI Features and Your Website” (as of December 2025)
† OtterlyAI, “The AI Citation Economy: 1+ Million Data Points” (2026)
†† Ahrefs, “We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved.” (May 2026)
‡ Vishwakarma et al., “What Gets Cited: Competitive GEO in AI Answer Engines”, SIGIR ’26
‡‡ Search Engine Land, “Google says normal SEO works … and LLMS.txt won’t be used” (July 2025)
Sources
- OpenAI, "Overview of OpenAI Crawlers"
- Vercel, "The rise of the AI crawler"
- Google Search Central, "AI Features and Your Website"
- Search Engine Land, "Google says normal SEO works … and LLMS.txt won't be used"
- Ahrefs, "We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved."
- Vishwakarma et al. (Sprinklr), "What Gets Cited: Competitive GEO in AI Answer Engines" (SIGIR '26)
- OtterlyAI, "The AI Citation Economy: 1+ Million Data Points"