Structuring your data for the AI era: schema.org, JSON-LD and beyond


Structured data (schema.org in JSON-LD) is the most underestimated lever for getting cited by AI engines. But schema quality matters more than presence: thin markup actually performs worse than no markup at all. Beyond schema.org, a full stack is emerging — llms.txt, NLWeb, MCP — to make your site readable by AI agents.

When you type a question into ChatGPT, Perplexity, or Google AI Overview, the answer rarely cites sources at random. Behind the choice of which site gets cited and which gets ignored lies a factor most companies underestimate: structured data. Not the content visible to humans — the content machines read behind the scenes. Schema.org, JSON-LD, entity graphs: these are the signals that help AI engines understand who you are, what you publish, and why you're authoritative. I discussed this in my article on GEO: optimizing for generative engines is a paradigm shift. This article goes deeper — it covers the technical foundation that makes that optimization possible.

How AI engines consume your site

To understand why structured data matters, you first need to understand how AI systems access your content. There are two distinct pathways, and they don't process the same signals.

Pathway 1: via the search index. Google AI Overview and Bing Copilot rely on their existing index — the one Googlebot and Bingbot build by crawling your site. In this case, the JSON-LD you placed on your pages has been pre-processed and integrated into the knowledge graph. The AI reasons over semantically enriched data. This is where schema.org has the most impact.

Pathway 2: direct real-time fetch. ChatGPT and Perplexity, when they crawl a page live, read raw HTML without a full rendering engine. Empirical tests (SearchViu, November 2025) confirmed that these agents treat JSON-LD as plain text — they don't parse it as structured data. For them, what matters is your clean semantic HTML and your llms.txt file.

In practice: JSON-LD is crucial upstream, at the indexing and knowledge-graph level. But for agents crawling you in real time, it's the quality of your HTML and metadata that makes the difference. The two approaches are complementary, not competing.
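The two pathways read different layers of the same page. A minimal sketch of what that looks like in practice (the title, description, and JSON-LD fields here are illustrative placeholders, not a prescribed template):

```html
<!doctype html>
<html lang="en">
<head>
  <title>Structuring your data for the AI era</title>
  <meta name="description" content="How schema.org, JSON-LD and llms.txt make a site readable by AI engines.">
  <!-- Pathway 1: indexing crawlers (Googlebot, Bingbot) parse this JSON-LD into the knowledge graph -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Structuring your data for the AI era"
  }
  </script>
</head>
<body>
  <!-- Pathway 2: real-time agents read the raw HTML, so clean semantic structure carries the signal -->
  <article>
    <h1>Structuring your data for the AI era</h1>
    <p>Structured data is the most underestimated lever for getting cited by AI engines.</p>
  </article>
</body>
</html>
```

The same page serves both audiences: the script block feeds the index, the semantic elements (article, h1, meta description) feed the live crawler.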

Schema.org and JSON-LD: what actually makes a difference

Having schema.org on your site is common advice. But recent studies reveal a counterintuitive nuance: thin schema performs worse than no schema at all. The Growth Marshal study (n=730 AI citations) shows that rich, well-populated schema achieves a 61.7% citation rate, versus 59.8% for pages without schema — but pages with minimal, poorly-filled schema drop to 41.6%. Quality beats quantity.

So which types and properties actually matter for AI visibility?

Identity first. Organization with sameAs pointing to Wikidata, Wikipedia, LinkedIn — this is your entity anchor. LLMs use these cross-references to verify that you exist and resolve ambiguity. Add a stable @id (e.g., https://yoursite.com/#organization) to create a persistent identifier in the graph.

Content next. BlogPosting with an author typed as Person (not a plain string), publisher, datePublished, and dateModified. The mainEntityOfPage property helps LLMs identify the primary subject. FAQPage remains the highest-impact type for AI citation — 67% citation rate according to Frase.io — but only when Q&A is the primary content of the page.

AI-native properties. speakable signals extractable passages for AI. mentions explicitly links entities referenced in your content. And sameAs — which I keep coming back to — is arguably the most impactful property for LLM recognition. It connects your local entity to the global knowledge web that AI systems use as their source of truth.
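Putting those pieces together, here is a minimal sketch of an Organization plus BlogPosting graph. Every URL, name, and Wikidata identifier below is a placeholder to adapt to your own entities:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://yoursite.com/#organization",
      "name": "Your Company",
      "url": "https://yoursite.com",
      "logo": "https://yoursite.com/logo.png",
      "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",
        "https://en.wikipedia.org/wiki/Your_Company",
        "https://www.linkedin.com/company/your-company"
      ]
    },
    {
      "@type": "BlogPosting",
      "@id": "https://yoursite.com/blog/structured-data#article",
      "headline": "Structuring your data for the AI era",
      "mainEntityOfPage": "https://yoursite.com/blog/structured-data",
      "author": {
        "@type": "Person",
        "name": "Jane Doe",
        "sameAs": "https://www.linkedin.com/in/jane-doe"
      },
      "publisher": { "@id": "https://yoursite.com/#organization" },
      "datePublished": "2026-04-02T09:00:00+02:00",
      "dateModified": "2026-04-10T14:30:00+02:00"
    }
  ]
}
```

Note how publisher references the Organization by its @id rather than duplicating the data — that single stable identifier is what lets the pieces resolve into one entity graph.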

Beyond schema.org: the complete 2026 stack

Schema.org is no longer alone. In 2026, a multi-layered stack is emerging to make websites readable by AI systems and autonomous agents.

llms.txt — proposed by Jeremy Howard in 2024 — is a Markdown file placed at your site's root that gives LLMs a curated semantic map of your key content. Where JSON-LD describes each page individually, llms.txt offers a navigable overview. Adopted by Anthropic, Vercel, and Hugging Face. Google doesn't use it (confirmed), but its value is real for RAG pipelines and documentation agents. I cover it in detail in my dedicated article.

NLWeb may be the most significant development. Created by R.V. Guha — the inventor of RSS, RDF, and schema.org himself — this Microsoft project turns any website into a conversational interface by consuming its existing schema.org markup and RSS feeds. Every NLWeb instance is also an MCP server. The message is clear: the structured data world is officially pivoting toward AI agent consumption.

MCP, A2A, and WebMCP form the agent protocol layer. MCP (97 million SDK downloads per month) connects agents to tools. A2A (Google) enables agent-to-agent coordination. WebMCP (W3C draft, February 2026) standardizes exposing web capabilities to AI agents through the browser. I detailed MCP in a dedicated article.

And in the middle of all this, RSS/Atom feeds are experiencing an unexpected renaissance: a freshness signal for crawlers (Google Feedfetcher crawls them hourly), a primary data source for NLWeb, and a structured feed directly consumable by AI pipelines.
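Per the llms.txt proposal, the file is plain Markdown: an H1 with the site name, a blockquote summary, then sections of annotated links. A hypothetical sketch, with placeholder paths and titles:

```markdown
# Your Company

> Consulting and publishing on SEO, GEO, and AI visibility for B2B websites.

## Articles

- [Structuring your data for the AI era](https://yoursite.com/blog/structured-data.md): schema.org, JSON-LD, and the emerging AI-readability stack
- [What is GEO?](https://yoursite.com/blog/geo.md): optimizing content for generative engines

## Optional

- [About](https://yoursite.com/about.md): company background and team
```

The file lives at https://yoursite.com/llms.txt; the Optional section flags content an agent can skip when its context window is tight.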

Costly mistakes — and the checklist to get it right

Structured data mistakes aren't harmless. They can actively hurt your visibility, not just fail to help. Here are the most common and most costly.

Minimal or sloppy schema. This is the number one mistake. Publishing an Organization with just a name and URL, without sameAs, without logo, without contactPoint — that's worse than having nothing. AI engines interpret an empty schema as a low-authority signal.

Author as a string instead of an object. Writing "author": "John Smith" instead of "author": {"@type": "Person", "name": "John Smith"} breaks the entity graph. The AI can't resolve the author's identity or link it to other authority signals.

FAQPage on a page where Q&A isn't the primary content. Google has tightened the rules: FAQ schema is only eligible for rich results when questions and answers are the primary content of the page. Appending them at the bottom of a blog post is now considered spam.

Wrong date formats. ISO 8601 is mandatory: 2026-04-02T09:00:00+02:00. Not 04/02/2026, not April 2, 2026.

Markup describing invisible content. Schema for content inside closed tabs, behind JavaScript, or simply absent from the page — Google can apply a manual action.

The checklist to get it right:

- Implement Organization with sameAs and @id on every page.
- Add BlogPosting with typed author, publisher, dates, and mainEntityOfPage.
- Use BreadcrumbList on every non-homepage URL.
- Connect entities to each other via @id instead of duplicating data.
- Validate with Google's Rich Results Test and the schema.org validator.
- Monitor Google Search Console's Enhancements tab weekly.
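For the BreadcrumbList item on the checklist, a minimal sketch (URLs and page names are placeholders; per Google's guidelines, the final ListItem may omit the item URL because it represents the current page):

```json
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://yoursite.com/" },
    { "@type": "ListItem", "position": 2, "name": "Blog", "item": "https://yoursite.com/blog/" },
    { "@type": "ListItem", "position": 3, "name": "Structuring your data for the AI era" }
  ]
}
```

Run a snippet like this through the Rich Results Test before deploying: position must start at 1 and increment, and the trail must match the page's actual location in your site hierarchy.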

Measuring impact: the KPIs that matter

Implementing structured data without measuring its impact is flying blind. Here are the metrics to track, in order of priority.

Level 1 — Rich Results. In Google Search Console, the Enhancements tab shows impressions, clicks, and errors for your rich results by schema type. Studies show a CTR lift of 3 to 7 percentage points for pages with active rich results. This is your first indicator: zero errors = full eligibility.

Level 2 — AI Visibility. Measure your citation rate in AI responses. How? By regularly querying ChatGPT, Perplexity, and Gemini with your strategic keywords and tracking mention frequency. Tools like Otterly.ai or Semrush's AI module can automate this monitoring. For Google AI Overviews, BrightEdge and Semrush offer dedicated tracking. I explain the full methodology in my article on auditing LLM presence.

Level 3 — AI Referral Traffic. In Google Analytics 4, track sources from chatgpt.com, perplexity.ai, and gemini.google.com. This traffic is still small in absolute volume, but it's growing fast and its conversion rate often exceeds standard organic traffic — users arrive with a precise intent.

Level 4 — Business Impact. Segment sessions from rich results in GA4 and measure their conversion rate vs. standard organic sessions. Structured data only justifies itself if it generates value — not vanity metrics.

Our free SEO & GEO audit tool lets you check the state of your structured data in seconds. And if you want a complete action plan to optimize your AI visibility — from schema.org to GEO strategy — let's talk.

Further reading