Technical GEO

Technical GEO for marketing engineers: robots.txt, structured data, llms.txt | Suparanku

Technical GEO has two pillars — crawler access and server-rendered HTML. Allow retrieval bots like OAI-SearchBot, PerplexityBot and ClaudeBot, then verify with logs. Structured data helps when it carries real facts (price, rating, specs). llms.txt is a cheap extra, not a citation factor.

Maksim Gurchenkov (CEO, Apurichoumi Inc.) Jun 11, 2026 ↻ Jun 12, 2026

What engineering owns in GEO

Content is the marketer’s job; whether AI can read the site at all is an infrastructure problem. The two pillars the technical side owns are crawler access and rendering — if either fails, nothing else matters. Structured data helps in specific, well-evidenced cases, and llms.txt is a cheap optional extra. This guide covers all of it at implementation level.

1. Admit the AI crawlers

Retrieval bots vs training scrapers

Not all AI crawlers do the same job. Retrieval bots — OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot — fetch pages to power live answers and citations; blocking them removes you from AI answers. Training scrapers — GPTBot, anthropic-ai — collect data for model training; blocking them only affects training, not your visibility in search. OpenAI documents its bots by purpose,* so you can opt out of training while staying visible and citable.

This matters in practice: an OtterlyAI analysis of over one million AI citations (2026) found that 73% of sites have technical barriers blocking AI crawler access.†

# Retrieval — powers citations; allow if you want AI visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

# Training — affects model training only, not citations
# (switch Allow to Disallow below to opt out of training without losing citations)
User-agent: GPTBot
Allow: /
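A quick way to confirm that a robots.txt file says what you intend is to parse it rather than read it. The sketch below uses Python's standard `urllib.robotparser`; the `SAMPLE` file and the `allowed_bots` helper are hypothetical names for illustration, and the bot list is simply the one named in this article. Note that Python's parser applies rules in file order, which can differ from a given crawler's own matching logic, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Bots this article treats as retrieval crawlers (taken from the text above).
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot"]


def allowed_bots(robots_txt: str, url: str, bots=RETRIEVAL_BOTS) -> dict:
    """Return {bot: True/False} for whether robots_txt lets each bot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Hypothetical robots.txt: retrieval allowed, training opted out.
SAMPLE = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for bot, ok in allowed_bots(SAMPLE, "https://example.com/page/").items():
        print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

In a real check you would pass the text of your deployed robots.txt and a handful of representative URLs instead of `SAMPLE`.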

Two caveats. First, robots.txt is honored by the major bots but not by all of them; Bytespider, for one, has been reported to ignore it. Second, user-agent strings are trivially spoofed, so a log line that says “OAI-SearchBot” proves nothing by itself. To know who really visited, validate log entries against official IP ranges (for example, OpenAI publishes its searchbot IP list as searchbot.json).
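The IP validation step can be sketched with Python's standard `ipaddress` module. This is a minimal sketch under one loud assumption: that the published range file is JSON with a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` key. Check the actual file before relying on that shape. The addresses in `SAMPLE_RANGES` are documentation ranges, not real crawler IPs, and the helper names are hypothetical.

```python
import ipaddress
import json

# Assumed shape of a published range file: {"prefixes": [{"ipv4Prefix": "..."}]}.
# The addresses below are documentation ranges (RFC 5737), not real crawler IPs.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv4Prefix": "198.51.100.16/28"}]}
""")


def load_networks(ranges: dict) -> list:
    """Turn the parsed JSON into ip_network objects, accepting v4 or v6 keys."""
    networks = []
    for entry in ranges.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_official(ip: str, networks: list) -> bool:
    """True if ip falls inside any of the published networks."""
    address = ipaddress.ip_address(ip)
    return any(address in net for net in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES)
    for ip in ["192.0.2.44", "203.0.113.9"]:
        print(ip, "official" if is_official(ip, nets) else "unverified")
```

Run against the client IPs in your access log, this separates requests that merely claim a crawler's user agent from those that come from its published ranges.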

Also check that your CDN or WAF bot protection isn’t silently blocking AI crawlers. An allow rule in robots.txt means nothing if the firewall returns 403.
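One rough way to probe for a user-agent-based firewall rule is to request the same URL under several user agents and compare status codes. The sketch below uses only the standard library; `fetch_status` and `blocked_agents` are hypothetical helper names. It is an approximation: many WAFs also score IP reputation, so a clean result from your own machine does not prove real crawlers get through, and access logs remain the authoritative source.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status the server sends to this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 403, 429, 503 and similar land here


def blocked_agents(url: str, agents: list, fetch=fetch_status) -> dict:
    """Return {agent: status} for every agent that did not get a 200."""
    statuses = {agent: fetch(url, agent) for agent in agents}
    return {agent: code for agent, code in statuses.items() if code != 200}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot"]
    print(blocked_agents("https://example.com/page/", agents))
```

The `fetch` parameter is injectable so the comparison logic can be exercised without network access.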

HTML that doesn’t depend on JavaScript

Vercel’s analysis found that major AI crawlers do not execute JavaScript — across 500,000+ GPTBot requests, zero traces of JS execution.** A client-side-rendered page is blank to ChatGPT-, Claude- and Perplexity-class fetchers. Google is the exception: it can render JavaScript for AI Overviews when not blocked — though since December 2025 Google excludes non-200 pages from rendering entirely. SSR, SSG or prerendering remains the safe baseline so the body text is present in the initial HTML.

Verification is one command:

curl -s -A "GPTBot" https://example.com/page/ | grep "key copy"

If the copy isn’t in the initial HTML, the rendering strategy needs work.
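A plain text search over the response can give a false positive when the copy exists only inside a script payload (for example, serialized state that the client later renders). The sketch below is one way to tighten the check: it strips `<script>` and `<style>` content with the standard `html.parser` and searches what is left. The class and function names are hypothetical, and the two sample documents are invented for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def copy_in_initial_html(html: str, phrase: str) -> bool:
    """True if phrase appears in the markup's text outside script/style."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()


# Invented samples: a client-rendered shell and a server-rendered page.
CSR = ('<html><body><div id="root"></div>'
       '<script>window.__DATA__ = {"copy": "Pricing starts at $49"};</script>'
       '</body></html>')
SSR = ('<html><body><main><h1>Pricing</h1>'
       '<p>Pricing starts\n at $49 per seat.</p></main></body></html>')
```

Here the phrase is found in `SSR` but not in `CSR`, even though a raw text search would match both.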

2. Structured data: useful, but not where you’d expect

Two facts up front. Google officially states that structured data is not required for generative AI search and that there is no special schema.org markup to add for AI features.*** And the strongest controlled experiment to date — Ahrefs tracking 1,885 pages that added JSON-LD against matched controls — found no citation uplift on any AI platform.††

There is one proven exception, the first item below, and it’s where to start. The other schema types help only under the conditions noted:

Product / SoftwareApplication + Offer with real attributes: pages whose Product/Review schema was filled with concrete price, rating and specifications were cited 61.7% of the time, versus 41.6% for generic schema types, with the strongest effect for low-authority domains. Explicit price is also one of the four citation “gatekeepers” identified in a SIGIR ’26 study of 252,000 controlled probes.‡ The value is not the tag but the machine-readable facts the tag carries. Price misinformation is a frequent AI-answer error; correct machine-readable values counteract it.
Organization — what actually canonicalizes the entity is not the markup itself but sameAs links to official profiles plus consistent brand facts across the web. Use it to anchor legal name, address and spelling variants.
FAQPage — the wrapper itself is not a signal: a pure Q&A format tested at −5.7% influence versus non-Q&A pages. FAQ helps only when each answer carries evidence density — numbers, definitions, comparisons — rather than short isolated replies.
Article + Person — authorship and dates support E-E-A-T, but note Google’s own framing: E-E-A-T is not a direct ranking factor. Trust is its core, and the weight is highest on YMYL topics.
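To make "attribute-rich" concrete, here is a minimal sketch that builds Product + Offer and Organization JSON-LD from real values and wraps it in a script tag. The property names (`offers`, `aggregateRating`, `additionalProperty`, `legalName`, `sameAs`) are schema.org vocabulary; the helper functions and every sample value are hypothetical. In practice the values should come from the same source of truth that renders the visible page, so the markup never disagrees with the copy.

```python
import json


def product_jsonld(name, price, currency, rating, review_count, specs):
    """Build Product + Offer JSON-LD carrying concrete, machine-readable facts."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        },
        "additionalProperty": [
            {"@type": "PropertyValue", "name": key, "value": value}
            for key, value in specs.items()
        ],
    }


def organization_jsonld(legal_name, url, same_as):
    """Build Organization JSON-LD anchored by sameAs links to official profiles."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "legalName": legal_name,
        "url": url,
        "sameAs": list(same_as),
    }


def script_tag(data: dict) -> str:
    """Serialize to a JSON-LD script tag, escaping any '</' inside the payload."""
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    product = product_jsonld("Example Widget", 49, "USD", 4.6, 128,
                             {"weight": "1.2 kg"})
    print(script_tag(product))
```

The output still needs to pass the validators mentioned in the next paragraph; this sketch only covers assembly.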

After implementation, validate with both the Rich Results Test and the Schema.org validator.

3. llms.txt: publish it, but know what it is

Honest framing first: llms.txt is not a ranking or citation factor today. Google does not support it and is not planning to (Gary Illyes), and John Mueller has noted that no major AI system is confirmed to use it for answers.‡‡ A meta-synthesis of 54 studies scored it 2.0 out of 9.5 — no credible evidence that it influences AI citations in any way. The only verified behavior: OpenAI crawls llms.txt on some sites.

So why publish it at all? Because it’s cheap. A Markdown summary of the site’s structure and key content at the site root costs nothing if generated automatically, and it positions you for whatever agents do adopt it.

One operational rule: never maintain it by hand. Hand-edited llms.txt files tend to go stale as soon as content changes. Generate it at build time from your content collections; this site’s llms.txt is built automatically from every article and glossary term.
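Build-time generation can be as small as the sketch below. It assumes the layout from the llms.txt proposal (an H1 site name, a blockquote summary, then H2 sections containing Markdown link lists) and a hypothetical `Page` record that your build would fill from its content collections. It is not this site's actual generator.

```python
from dataclasses import dataclass


@dataclass
class Page:
    section: str       # e.g. "Articles" or "Glossary"
    title: str
    url: str
    summary: str = ""


def build_llms_txt(site_name: str, description: str, pages: list) -> str:
    """Render pages as Markdown: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site_name}", "", f"> {description}"]
    sections = {}
    for page in pages:
        sections.setdefault(page.section, []).append(page)
    for section, items in sections.items():
        lines += ["", f"## {section}", ""]
        for page in sorted(items, key=lambda p: p.title):
            suffix = f": {page.summary}" if page.summary else ""
            lines.append(f"- [{page.title}]({page.url}){suffix}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        Page("Articles", "Technical GEO", "https://example.com/technical-geo/",
             "Crawler access and rendering."),
        Page("Glossary", "GEO", "https://example.com/glossary/geo/"),
    ]
    print(build_llms_txt("Example Site", "Guides on GEO.", pages))
```

Writing the returned string to `llms.txt` at the site root as a build step keeps the file in sync with published content.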

Verification checklist

robots.txt explicitly allows the retrieval bots you want (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot)
WAF/CDN doesn’t 403 AI crawler user agents (check access logs)
curl -A "GPTBot" shows body copy in the initial HTML
Product pages carry attribute-rich Product/Offer schema with real price and specs; Organization schema has sameAs links
llms.txt is generated at build time (cheap extra — not a citation factor)
Server logs show real AI crawler visits, validated against official IP ranges and reviewed regularly
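The log-related checklist items can be partly automated. The sketch below tallies AI crawler hits by claimed user agent and HTTP status, assuming the access log uses the common Combined Log Format. The sample lines, IPs and user-agent strings are invented for illustration. Because it matches on the claimed user agent only, the results should still be cross-checked against official IP ranges.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Bot tokens named in this article.
AI_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot", "GPTBot"]


def tally_ai_hits(lines) -> Counter:
    """Count hits per (bot, status) for lines whose user agent names an AI bot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not in the assumed log format
        agent = match.group("agent").lower()
        for bot in AI_BOTS:
            if bot.lower() in agent:
                hits[(bot, int(match.group("status")))] += 1
                break
    return hits


# Invented sample lines; IPs are documentation addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Jun/2026:10:00:00 +0000] "GET /page/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [12/Jun/2026:10:00:05 +0000] "GET /page/ HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.12 - - [12/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (bot, status), count in sorted(tally_ai_hits(SAMPLE_LOG).items()):
        print(bot, status, count)
```

A non-zero count of 403 responses for a retrieval bot is the signal that a firewall rule is overriding robots.txt.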

With this foundation in place, content improvements show up directly in measurement. Without it, the best article in the world is invisible to AI.

* OpenAI, “Overview of OpenAI Crawlers” (as of May 2025)
** Vercel, “The rise of the AI crawler” (January 2025)
*** Google Search Central, “AI Features and Your Website” (as of December 2025)
† OtterlyAI, “The AI Citation Economy: 1+ Million Data Points” (2026)
†† Ahrefs, “We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved.” (May 2026)
‡ Vishwakarma et al., “What Gets Cited: Competitive GEO in AI Answer Engines”, SIGIR ’26
‡‡ Search Engine Land, “Google says normal SEO works … and LLMS.txt won’t be used” (July 2025)