Foundation · 8 min read

The Three Layers of GEO: Training Data, Retrieval, and Entity Signals

Jag · Mar 03, 2026

When a CMO asks me "how do we get mentioned by ChatGPT?", my first question back is always: "Which ChatGPT do you mean?" That might sound pedantic, but it matters enormously. The answer a large language model gives today draws from at least three distinct information layers — each with different timelines, different update mechanisms, and different optimisation levers. Conflating them is the single most common strategic mistake I see marketing teams make when they start thinking about generative engine optimisation. So let's pull the layers apart, examine what actually happens inside an AI response, and work out where you can realistically intervene.

Why Thinking in Layers Changes Your GEO Strategy

Most marketers approach GEO the same way they approached SEO in 2010: as one monolithic channel. Publish good content, build links, hope for the best. But generative AI engines don't work like a single index. They combine pre-trained knowledge, real-time retrieval, and structured entity understanding into a blended response. If you optimise for only one layer, you leave two-thirds of the opportunity on the table — and you won't understand why your interventions aren't producing results.

Generative Engine Optimisation (GEO) is the practice of structuring your brand's content and digital presence so that AI language models accurately cite, reference, and recommend you when answering relevant queries. To do GEO well, you need to understand the architecture you're optimising for. And that architecture has three layers.


Layer 1: Training Data — What the Model Already "Knows"

Every large language model starts with a training corpus — a massive collection of text scraped from the web, books, academic papers, and other sources. GPT-4, Claude, Gemini — they all have a knowledge cutoff date. Anything the model learned during training is baked into its weights. It doesn't "look up" this information; it already has it, encoded as statistical patterns across billions of parameters.

Here's the thing: you can't change training data retroactively. If your brand was poorly represented — or entirely absent — in the corpus that trained a model released in early 2025, no amount of content published in March 2025 will fix that specific version's knowledge. Training data is historical. It reflects what existed on the web during the crawl window, which for most frontier models is roughly 6 to 18 months before release.

So what can you actually do about this layer? You play the long game. Content you publish today, if it gets cited on authoritative sources, indexed broadly, and referenced across multiple domains, stands a strong chance of being included in future training runs. According to a 2024 analysis by BrightEdge, 58% of AI-generated answers in informational queries drew on content that was at least 12 months old — suggesting that the training data layer still dominates for many query types.

Practical Interventions for the Training Data Layer

  1. Publish substantive, well-structured content on your own domain that clearly associates your brand name with your core topics. This is the raw material future training runs will ingest.
  2. Earn citations on high-authority third-party sources — Wikipedia, industry publications, academic papers, government databases. These tend to be over-represented in training corpora.
  3. Ensure your content is crawlable. If your best thought leadership is locked behind login walls or rendered entirely in JavaScript, it may never make it into a training dataset.
  4. Maintain consistency in how you describe your brand, products, and expertise. Models learn patterns — if ten sources describe you differently, the model's representation of you becomes noisy and unreliable.

Worth noting: this layer rewards patience. You're not optimising for next quarter's AI mentions. You're building the foundation that makes your brand a durable part of the model's knowledge. Think of it as the GEO equivalent of domain authority — slow to build, hard to lose.
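Point 3 above — crawlability — is the easiest to check concretely. Major AI training crawlers identify themselves with user agents such as GPTBot (OpenAI) and ClaudeBot (Anthropic), and they respect robots.txt. A minimal sketch, using only Python's standard library, of verifying whether a robots.txt file would let such a crawler in (the sample rules and paths here are illustrative, not a real site's configuration):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt rules -- illustrative only. A real check would
# fetch https://yourdomain.com/robots.txt instead of using a string.
SAMPLE_ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
""".splitlines()

def crawler_access(robots_lines, user_agent, path):
    """Return True if the given crawler may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)

# Public thought leadership is reachable by the training crawler...
print(crawler_access(SAMPLE_ROBOTS_TXT, "GPTBot", "/insights/geo-layers"))  # True
# ...but anything under /private/ will never enter a training corpus.
print(crawler_access(SAMPLE_ROBOTS_TXT, "GPTBot", "/private/report.pdf"))   # False
```

The same check applies to content rendered entirely in JavaScript: if the crawler can fetch the URL but the meaningful text only appears after client-side rendering, it may still be invisible to the training pipeline.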

Layer 2: Retrieval — What the Model Looks Up in Real Time

This is where things get interesting — and where most GEO activity produces its fastest results. Nearly every major AI assistant now uses some form of retrieval-augmented generation (RAG). When you ask Perplexity a question, or use Bing Chat, or trigger a Google AI Overview, the system doesn't just rely on what it was trained on. It performs a live web search, pulls in fresh sources, and synthesises those results into its answer.

The retrieval layer is closer to traditional SEO in many respects. The AI engine sends a query — or a decomposed set of sub-queries — to a search index. It retrieves a set of candidate documents. Then it reads, summarises, and cites from those documents. If your page ranks well for relevant queries and is structured in a way that's easy for an AI to extract information from, you have a meaningful chance of being cited.

But there are important differences from conventional search. AI retrieval systems tend to favour content that provides direct, concise answers with clear attributions. Long, meandering blog posts that bury the answer under 800 words of preamble perform poorly. The AI is looking for extractable claims — sentences that can be lifted and cited with minimal rewriting.
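To make the "extractable claims" idea concrete, here is a toy sketch of the retrieve-and-cite step. Production engines use embedding similarity and learned rerankers, not keyword overlap — this heuristic is purely illustrative of why direct, concise claims win: they pack more query-relevant terms into fewer words.

```python
def score_passage(query, passage):
    """Crude relevance score: keyword overlap with a length penalty.

    This is a stand-in for embedding similarity, not any real engine's
    algorithm. The length penalty mimics the preference for passages
    that answer directly rather than burying the claim in preamble.
    """
    q_terms = set(query.lower().split())
    p_terms = passage.lower().split()
    overlap = sum(1 for t in p_terms if t in q_terms)
    return overlap / (len(p_terms) ** 0.5)

def retrieve_and_cite(query, corpus, top_k=1):
    """Return the top-k passages an engine might lift as citations."""
    ranked = sorted(corpus, key=lambda p: score_passage(query, p), reverse=True)
    return ranked[:top_k]

corpus = [
    "GEO is the practice of structuring content so AI models cite your brand.",
    "In this post we will eventually, after some background, get to what GEO "
    "might mean for marketing teams and why it could matter to some of them.",
]

# Both passages mention GEO, but the concise definition scores higher.
print(retrieve_and_cite("what is GEO", corpus))
```

Both candidate passages contain the query terms, yet the one-sentence definition wins purely on concision — the same structural property that makes a claim easy for an AI to lift and cite with minimal rewriting.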

- 58% of AI answers draw on content at least 12 months old (BrightEdge, 2024)
- 3.2x more likely to be cited if content includes structured claims (Semrush, 2025)
- 44% of Perplexity citations come from pages ranking in Google's top 5 (Authoritas, 2024)

What Makes Content Retrieval-Friendly?

A good example of retrieval-layer optimisation done well is how HubSpot structures its marketing glossary pages. Each page leads with a crisp one-sentence definition, expands into practical detail, and ends with related questions — all of which makes them extremely retrievable by AI engines. You don't need HubSpot's domain authority to apply the same structural principles.

Layer 3: Entity Signals — How AI Understands Who You Are

This is the layer most marketers haven't thought about yet, and in my experience it's the one that will matter most over the next two to three years. Entity signals are the structured and semi-structured data points that help AI systems understand what your brand is — not just what your website says, but how the broader web defines and categorises you.

Think of it this way: when someone asks an AI "what's the best project management tool for remote teams?", the model doesn't just retrieve blog posts. It draws on an internal representation of entities — companies, products, categories, relationships. If the model has a strong, well-connected entity representation for your brand that associates it with "project management", "remote teams", and "positive user sentiment", you're more likely to be mentioned.

Entity signals come from multiple sources: your Google Business Profile, your Knowledge Panel, Wikidata entries, Crunchbase, LinkedIn company pages, schema markup on your site, mentions in structured databases, and the consistency of how third-party sources describe you. At Arclign, we've started calling this your entity footprint — the sum total of structured signals that tell AI systems who you are, what you do, and how you relate to other entities in your space.
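The most direct piece of that footprint you control is schema markup on your own site. A minimal sketch of a schema.org Organization block, built in Python and emitted as JSON-LD — the company details are placeholders, and "entity footprint" is this article's term rather than a schema.org concept:

```python
import json

# Placeholder company details -- substitute your own. The property names
# (@context, @type, name, description, url, sameAs) are standard
# schema.org / JSON-LD vocabulary.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Consultancy",
    "description": "A consultancy specialising in generative engine optimisation.",
    "url": "https://www.example.com",
    # sameAs links tie this entity to its profiles elsewhere, helping
    # AI systems stitch scattered signals into one coherent identity.
    "sameAs": [
        "https://www.linkedin.com/company/example",
        "https://www.crunchbase.com/organization/example",
    ],
}

# Emit the JSON-LD to embed in a <script type="application/ld+json"> tag.
print(json.dumps(organization, indent=2))
```

Note how the description uses the same category language ("generative engine optimisation") you would want third-party sources to use — consistency between your own structured data and external mentions is itself an entity signal.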

The brands getting this right aren't just publishing content — they're actively managing their structured identity. Canva, for instance, has an exceptionally clean entity footprint: consistent descriptions across platforms, a well-maintained Wikipedia article, robust schema markup, and thousands of third-party mentions that all use similar language to describe its core product category. That consistency is a signal that AI models can rely on.

How Entity Signals Differ from Traditional Brand SEO

Some marketers hear "entity signals" and think it's just brand SEO repackaged. It isn't — though the two are related. Traditional brand SEO focuses on making sure your branded search results look good in Google. Entity optimisation for GEO focuses on making sure AI systems have a coherent, accurate, and well-connected understanding of your brand's identity.

The difference matters because AI models don't just retrieve your homepage and read it. They build internal representations — sometimes called knowledge graphs or entity embeddings — that encode relationships. "Arclign is a consultancy. It specialises in generative engine optimisation. It works with B2B SaaS companies. Its team includes former SEO strategists." Each of those associations is an entity signal, and they come not from a single page but from the pattern of information across many sources.

Three Things That Strengthen Your Entity Footprint

  1. Schema markup on your own site. Machine-readable identity data is the most direct entity signal you control.
  2. Claimed, consistent profiles on structured databases — Wikidata, Crunchbase, LinkedIn, your Google Business Profile. These are primary sources AI systems draw on when building entity representations.
  3. Consistent third-party descriptions. When many independent sources describe you in similar language, the model's representation of you sharpens; when they diverge, it blurs.

How the Three Layers Interact

The three layers aren't independent — they reinforce each other. Content that performs well in retrieval today is more likely to be included in next year's training data. A strong entity footprint makes it easier for retrieval systems to identify your content as authoritative. And training data shapes the model's baseline understanding of entities, which influences how it interprets and weights retrieved content.

This is why a comprehensive GEO strategy can't focus on just one layer. I've seen companies pour resources into producing retrieval-optimised content — perfectly structured, FAQ-rich, AI-friendly — while completely neglecting their entity signals. The result? They get cited occasionally, but never recommended. The AI mentions their content as a source, but doesn't identify the brand as an authority in the space. The entity layer was missing.

Conversely, some brands have strong entity footprints — everyone knows who they are — but their actual content is poorly structured for retrieval. They rely on the training data layer almost entirely, which means they only show up in AI answers where the model has memorised information about them. For any query that triggers live retrieval, they're invisible.

A Prioritisation Framework for Marketing Teams

So where should you start? My take: it depends on your brand's current position. Here's a simple framework I use with clients at Arclign.

  1. Audit your current AI visibility. Search for your brand and your core topics in ChatGPT, Perplexity, and Google AI Overviews. Note where you're mentioned, where you're absent, and where you're misrepresented. This tells you which layers are working and which aren't.
  2. Fix your entity layer first. If AI models don't have a clear understanding of who you are, optimising content won't help as much as it should. Clean up your structured data, ensure consistency across platforms, and claim your profiles on key databases.
  3. Make your existing content retrieval-friendly. You probably already have content that covers your core topics. Restructure it: lead with definitions, add FAQ sections, use clear headings, include specific claims with supporting data.
  4. Build for the training layer over time. Invest in earning mentions on authoritative third-party sources. Contribute to industry publications. Get cited in research. This is a 6-to-18-month investment that pays dividends across future model versions.
  5. Measure and iterate. GEO measurement is still maturing, but tools from companies like Semrush and Authoritas are beginning to offer AI citation tracking. Monitor which content gets cited, in which AI engines, and adjust accordingly.
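Step 1 of this framework is far more useful if you log results in a consistent shape rather than ad-hoc notes. A minimal sketch of such an audit log — the engines, topics, and findings are illustrative, and the data itself comes from manually querying each engine:

```python
from collections import defaultdict

# Each entry: (engine, query topic, finding). The three finding labels
# mirror step 1 above: mentioned, absent, or misrepresented.
audit = [
    ("ChatGPT",      "project management tools", "mentioned"),
    ("Perplexity",   "project management tools", "absent"),
    ("AI Overviews", "project management tools", "absent"),
    ("ChatGPT",      "remote team software",     "misrepresented"),
]

def visibility_gaps(entries):
    """Group engine/topic pairs by finding so the biggest gaps stand out."""
    by_finding = defaultdict(list)
    for engine, topic, finding in entries:
        by_finding[finding].append((engine, topic))
    return dict(by_finding)

gaps = visibility_gaps(audit)
print(f"Absent from {len(gaps.get('absent', []))} engine/topic pairs")
print(f"Misrepresented in: {gaps.get('misrepresented', [])}")
```

Re-running the same query set monthly against the same log shape turns a one-off audit into the measurement loop described in step 5.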

The temptation is to skip straight to content production. Resist it. Without the entity foundation, you're building on sand.

What This Means for the Next 12 Months

The brands that will benefit most from the shift to AI-mediated search are those that understand these layers and invest across all three. It's not enough to be "AI-friendly" in some vague sense. You need a specific strategy for each layer, with different tactics, different timelines, and different success metrics.

Training data is your long-term moat. Retrieval is your near-term opportunity. Entity signals are the connective tissue that makes both layers work harder. And the interaction between them is where the real strategic advantage lies — because most of your competitors are still treating GEO as "SEO but for AI", which means they're optimising for one layer at best.

That gap won't last forever. As GEO matures as a discipline, the three-layer model will become common knowledge. The question is whether you'll have built your foundation by then, or whether you'll be playing catch-up.

Frequently Asked Questions

What are the three layers of generative engine optimisation (GEO)?

The three layers of GEO are training data, retrieval, and entity signals. The training data layer refers to information already encoded in an AI model's weights from its pre-training corpus. The retrieval layer involves real-time web searches that AI engines perform to supplement their knowledge with fresh content. The entity signal layer is the structured and semi-structured data that helps AI systems understand what a brand is, what it does, and how it relates to other entities. An effective GEO strategy requires interventions at all three layers.

How does AI training data affect whether my brand is mentioned in AI answers?

AI models like GPT-4 and Claude are trained on massive text corpora scraped from the web, books, and academic sources. If your brand was well-represented in those sources during the crawl window — typically 6 to 18 months before a model's release — the model may already 'know' about you and include you in relevant answers. You can't change what's already in a trained model, but you can influence future training runs by publishing substantive content, earning citations on high-authority sites like Wikipedia and industry publications, and maintaining consistent brand descriptions across the web.

What are entity signals in GEO and why do they matter?

Entity signals are the structured and semi-structured data points that help AI systems build an internal representation of your brand — including what you are, what category you belong to, and how you relate to other known entities. These signals come from sources like Google Knowledge Panels, Wikidata, Crunchbase, LinkedIn, schema markup, and the consistency of third-party descriptions. Entity signals matter because they influence whether AI models recommend your brand as an authority in a given space, not just cite a single piece of your content. Brands with strong, consistent entity footprints are significantly more likely to appear in AI-generated recommendations.

How can I make my content more likely to be retrieved and cited by AI search engines?

To optimise content for AI retrieval, structure it so that key claims and definitions appear early in each section rather than being buried in long preambles. Use clear H2 and H3 headings that mirror natural question phrasing. Include specific data points, named entities, and attributable statistics — AI systems cite specific claims more readily than vague generalisations. Adding FAQ sections with standalone answers is particularly effective, because each question-answer pair maps directly to a potential AI query. According to Semrush's 2025 research, content with structured claims is 3.2 times more likely to be cited in AI-generated responses.

What's the difference between GEO and traditional SEO?

GEO (Generative Engine Optimisation) and traditional SEO share some foundations — both benefit from authoritative content, strong backlinks, and good site structure — but they differ in important ways. SEO focuses on ranking pages in a list of search results, while GEO focuses on being cited, referenced, or recommended within AI-generated answers. GEO requires attention to three distinct layers: training data, real-time retrieval, and entity signals. It also demands a different content structure, emphasising extractable claims, clear definitions, and machine-readable identity data. The two disciplines are complementary, not competing; strong SEO performance tends to improve GEO visibility, particularly in the retrieval layer.

I opened by asking "which ChatGPT do you mean?" — and now you can see why the question matters. The answer a user gets depends on what the model was trained on, what it retrieves in real time, and how well it understands your brand as an entity. Three layers, three sets of levers, three different timelines for results. If there's one thing I'd want you to take away, it's this: stop treating AI visibility as a single problem with a single solution. Map your current presence across all three layers, identify the gaps, and build a strategy that addresses each one. The companies doing this now — systematically and patiently — are the ones that will own their categories in AI search by 2027.

Find out where you stand in AI search

Get a free GEO audit showing exactly how ChatGPT and Perplexity describe your brand today — and what it'll take to reach the top.

Book a Free Audit