Search engines don't read pages. They read chunks. Your beautiful design might be severing the connection between your user's question and your answer. The fix is at the code level: replace div soup with <section> tags, keep headers and answers physically adjacent in the DOM, and use <aside> for self-contained takeaways.
We used to build websites for human eyes. We obsessed over above-the-fold content, whitespace, and visual hierarchy. We treated HTML as a skeleton to hang CSS on. As long as it looked right, the code didn't matter. In the age of Generative Engine Optimization (GEO), that indifference is a liability.
When an AI crawler (GPTBot, a custom RAG agent) visits your site, it ignores your CSS and looks at your DOM. It chops your code into fixed-size chunks to feed a vector database. If your HTML structure doesn't align with these invisible chunk boundaries, you're shredding your own content. A heading lands in chunk A. The answer lands in chunk B. Context is severed. Retrieval fails. The AI tells the user: "I couldn't find an answer."
We broke down the theoretical mechanics of the guillotine effect and vector drift in our deep dive on The Semantic Schism of Fixed-Size Chunking. If you want to understand why the ingestion pipeline fails, start there. This guide focuses on the code: auditing your actual HTML tags and implementing chunk-aware formatting to guide the robot's knife.
The Mechanics of the Cut
Most RAG (Retrieval-Augmented Generation) pipelines are surprisingly primitive. They don't read a whole article to understand nuance. They use fixed-size chunking.
The pipeline sets a token limit (e.g. 500 tokens). It counts 1, 2, 3... 500. CUT. Then it starts the next chunk. It doesn't care that it just sliced through the middle of a sentence. It doesn't care that it separated an <h2> from its <p>.
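The counting-and-cutting behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not any specific pipeline's code: it assumes a "token" is just a whitespace-separated word (real pipelines use a model tokenizer), it uses a tiny limit of 8 instead of 500 so the effect is visible, and the name fixed_size_chunks and the sample page are made up for the example.

```python
def fixed_size_chunks(text, limit):
    """Split text into consecutive chunks of at most `limit` tokens.

    Assumption: a token is a whitespace-separated word. The cut position
    depends only on the count, never on tags or sentence boundaries.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]


# Hypothetical page: a heading, some filler markup or copy, then the answer.
page = (
    "<h2>What is the pricing?</h2> "
    + "filler " * 6
    + "<p>The plan costs $50/month.</p>"
)

chunks = fixed_size_chunks(page, limit=8)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

With this input the heading ends up in the first chunk and the price in the second, which is exactly the separation of an <h2> from its <p> that the paragraph above warns about.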
The Div Soup Problem
Modern web development (especially with frameworks like React and Tailwind) has led to div soup: nested layers of generic <div> tags that offer no semantic meaning.
<div>
  <div>
    <div>What is the pricing?</div>
  </div>
  <div></div>
  <div>
    <div>$50/month</div>
  </div>
</div>
To a dumb chunker this is just a stream of text. No signal that "Pricing" and "$50" belong together. They're words floating in a sea of generic tags. When the knife falls, it falls randomly.
Strategic Framework: Chunk-Aware HTML
The fix is chunk-aware formatting: use HTML5 semantic tags to signal logical boundaries. You're drawing dotted lines on the page and telling the robot: if you must cut, cut here.
Smart crawlers (using DOM-aware splitters such as the HTML splitters in LangChain) look for semantic containers. They try to keep the contents of a <section>, <article>, or <aside> tag together. Wrap every distinct topic in a <section> tag.
Before:
<h3>How do I reset my password?</h3>
<div></div>
<p>Go to settings and click reset.</p>
After:
<section>
  <h3>How do I reset my password?</h3>
  <p>Go to settings and click reset.</p>
</section>
By wrapping the question and answer in a <section>, you create a hard container. Even as the chunk limit approaches, the parser tries to preserve this unit or treats it as a priority block.
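To make the idea of a hard container concrete, here is one way a container-aware splitter could work, sketched with Python's standard html.parser module. It is not LangChain's implementation or any crawler's actual code. The class name ContainerSplitter, the CONTAINER_TAGS set, and the second FAQ entry are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumption: these are the tags treated as "do not cut inside" containers.
CONTAINER_TAGS = {"section", "article", "aside"}


class ContainerSplitter(HTMLParser):
    """Emit the text of each outermost container as one chunk."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many containers are currently open
        self.chunks = []    # finished chunks, one per outermost container
        self._buffer = []   # text pieces of the container being read

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost container just closed: flush it as one chunk.
                self.chunks.append(" ".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        text = data.strip()
        # Text outside any container is ignored in this sketch.
        if text and self.depth > 0:
            self._buffer.append(text)


def split_by_container(html):
    parser = ContainerSplitter()
    parser.feed(html)
    parser.close()
    return parser.chunks


faq = """
<section>
  <h3>How do I reset my password?</h3>
  <p>Go to settings and click reset.</p>
</section>
<section>
  <h3>How do I change my email?</h3>
  <p>Open your profile and edit the email field.</p>
</section>
"""

for chunk in split_by_container(faq):
    print(chunk)
```

Each printed chunk holds one question together with its answer, because the cut is driven by the closing </section> tag rather than by a token count. A production splitter would also need a fallback for containers that exceed the chunk limit.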
The hero-image trap is one of the most common design mistakes that kill AI retrieval. Designers love hero images, so you often see this layout: an H2 header (for example, "Benefits of Blueberries"), then a massive high-res photo (alt text "Farm in summer"), then the body text "Blueberries are high in antioxidants..."
The mismatch: the image code (source URL, alt text, responsive source sets, wrapper classes) consumes physical space in the text stream. Often 200-300 tokens worth of code. If the chunker has a 500-token limit, chunk 1 gets the header plus the image code, and chunk 2 gets the body text.
The result: a user asks "What are the benefits of blueberries?". The vector database finds chunk 1 (it has the keyword "benefits"). But chunk 1 contains no text answer, only an image of a farm. The LLM hallucinates or fails. Chunk 2 (which has the answer) is ignored because it lacks the context keyword. This is the same orphaned-data failure mode we mapped in the Semantic Schism breakdown.
The fix: move the visual media below the core semantic pair. Header, then body text (the answer), then image. Keep the label and the value physically touching in the DOM.
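A rough way to check how far apart the label and the value are is to count the markup tokens sitting between the heading and the first paragraph. The sketch below assumes whitespace-separated words as a stand-in for real tokens, so the numbers are only relative. The function name header_to_answer_gap, the "Benefits of Blueberries" heading, and the image filenames are hypothetical.

```python
import re


def header_to_answer_gap(html):
    """Count whitespace-separated tokens between the first </h2> and the
    next <p> tag. Returns -1 if no such pair is found.

    Assumption: whitespace tokens approximate model tokens well enough
    to compare two layouts against each other.
    """
    match = re.search(r"</h2>(.*?)<p[\s>]", html, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return -1
    return len(match.group(1).split())


# Hero-image layout: header, image markup, then the answer.
hero_first = (
    '<h2>Benefits of Blueberries</h2> '
    '<img src="farm.jpg" alt="Farm in summer" '
    'srcset="farm-480.jpg 480w, farm-960.jpg 960w"> '
    '<p>Blueberries are high in antioxidants.</p>'
)

# Fixed layout: header, answer, then the image.
text_first = (
    '<h2>Benefits of Blueberries</h2> '
    '<p>Blueberries are high in antioxidants.</p> '
    '<img src="farm.jpg" alt="Farm in summer" '
    'srcset="farm-480.jpg 480w, farm-960.jpg 960w">'
)

print("hero first:", header_to_answer_gap(hero_first))
print("text first:", header_to_answer_gap(text_first))
```

The hero-first layout reports a non-zero gap made entirely of image markup, while the text-first layout reports zero. Real responsive image markup with wrapper classes and long source sets would push the first number much higher.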
For TL;DR summaries or key takeaways, use the <aside> tag. This signals to the AI that the content is tangentially related to the surrounding text but can stand on its own. Some RAG agents may prioritize <aside> content for generating quick summaries because it's structurally separated from the main narrative flow. The same logic applies to the <details> and <summary> disclosure pattern covered in our semantic HTML guide, and it's why clean tables beat prose for data, as we show in The Token Tax.
The Future is Semantic
We're moving toward a web where HTML is the API. You don't need to redesign your site's visual frontend. You do need to refactor your backend templates. Audit your code. Replace generic <div>s with <section>s. Check the token distance between your headers and your answers. The same front-loading discipline from The Context Window Economy applies at the structural level here.
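A first-pass audit can be as simple as counting generic <div> tags against semantic containers in a template's output. The sketch below uses only Python's standard library. The SEMANTIC_TAGS set, the TagCounter class, and the audit function are illustrative assumptions, not a standard metric or an existing tool.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: which tags count as "semantic" for the purpose of this audit.
SEMANTIC_TAGS = {"section", "article", "aside", "main", "nav", "header", "footer"}


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def audit(html):
    """Return the number of <div> tags and of semantic container tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    semantic = sum(parser.counts[tag] for tag in SEMANTIC_TAGS)
    return {"div": parser.counts["div"], "semantic": semantic}


soup = "<div><div>What is the pricing?</div><div></div><div>$50/month</div></div>"
clean = "<section><h3>What is the pricing?</h3><p>$50/month</p></section>"

print("div soup:", audit(soup))
print("semantic:", audit(clean))
```

A page with many divs and no semantic containers is a candidate for refactoring. The counts say nothing about whether the containers are placed well, so treat them as a prompt for a manual review rather than a score.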
In the chunking economy, the connection between your question and your answer is only as strong as the HTML that binds them. Don't let a div sever your relevance.
References & Further Reading
- MDN Web Docs: HTML5 Semantic Elements. Official documentation on why semantic tags exist and how they describe meaning to machines.
- LangChain Documentation: HTML Header Metadata Splitter. Detailed explanation of how modern AI parsers use HTML hierarchy to split text intelligently.
- Google Search Central: Semantic Structure. Google's guidance on how proper HTML heading structure aids indexing and understanding.

