What Is llms.txt? AI Site Discovery

robots.txt has controlled web crawler behavior since Martijn Koster published the Robots Exclusion Protocol in 1994. For three decades, a short text file at the site root was enough to manage how Google and Bing treated your content. That stopped being sufficient in 2023 and 2024, when AI systems including ChatGPT Search, Perplexity, and Google AI Overviews began reading websites not to index them, but to synthesize answers from them. These systems need different instructions than Googlebot. llms.txt is the file that provides those instructions. Jeremy Howard, founder of fast.ai, proposed the specification in September 2024, and approximately 10% of top websites had implemented it by 2025, according to llmstxt.site. This guide explains what the file contains, why it exists, how to create one, and how to verify it works.

robots.txt Solved a Problem for Search Engines. llms.txt Solves a New Problem.

Martijn Koster published the Robots Exclusion Protocol in 1994 to solve a specific infrastructure problem: web crawlers were overwhelming servers and indexing pages that site owners did not want indexed. The solution was a plain text file at /robots.txt that any crawler could read to learn which paths to avoid.

robots.txt is good at one thing: telling crawlers where not to go. It says nothing about how to interpret the content it does find.

AI systems like OpenAI’s GPT-4o, Anthropic’s Claude, and Google Gemini do not just crawl pages to build an index. They read pages to extract information for training data, to cite in live answers, or to generate AI Overview responses. A site owner might want ChatGPT to cite product pages but not want pricing pages used as training data. robots.txt has no vocabulary for that distinction.

robots.txt also cannot provide a plain-language summary of what a site is about, list the most important pages, or explain content licensing terms. Those three gaps are exactly what llms.txt fills. It sits alongside robots.txt at the domain root and addresses a different audience: AI systems that read for comprehension rather than bots that crawl for indexing.

What Is llms.txt, Exactly?

llms.txt is a plain text file placed at the root of a domain (for example, yourdomain.com/llms.txt) that provides structured, human-readable instructions for large language models (LLMs) and AI crawlers.

Jeremy Howard, founder of fast.ai and co-creator of the ULMFiT transfer learning method, proposed the initial llms.txt specification in September 2024. As of 2026, llms.txt is not an official W3C standard. Adoption is voluntary, and no enforcement mechanism exists. Approximately 10% of top websites had implemented it by 2025, according to the llmstxt.site adoption tracker.

Think of llms.txt as a cover letter and table of contents for your website, written for AI readers rather than human visitors. robots.txt tells crawlers what they cannot access. llms.txt tells AI systems what they should read, how to interpret it, and what they are permitted to do with it.

A standard llms.txt file contains four types of information:

A site name header and a 2 to 4 sentence description of the site
A ## Usage block specifying what AI may and may not do with the content
A ## Content block listing the most important URLs with short descriptions
An optional ## Notes block flagging pages that are accessible but should not be cited

What llms.txt Looks Like: A Walkthrough of the Format

A complete example for a B2B SaaS company:

# Acme Corporation

Acme Corp provides B2B SaaS solutions for supply chain management.
This site contains product documentation, integration guides, pricing
information, and a company blog covering supply chain trends.
Content is updated monthly.

## Usage

- Use content from this site to answer questions about Acme Corp products.
- Do not use this content for AI model training without written permission.
- Check /sitemap.xml for the latest publication dates.
- For commercial content licensing, contact [email protected].

## Content

- /: Homepage with product overview and value proposition
- /product: Full product feature documentation
- /pricing: Current pricing tiers and comparison tables
- /integrations: Integration guides for SAP, Oracle NetSuite, and Salesforce
- /blog: Weekly articles on supply chain management
- /docs: Technical API reference documentation

## Notes

- /legal/: Terms and privacy policy. Do not cite as product guidance.
- /careers/: Job listings. Not relevant to product queries.

Breaking down each section:

The # [Site Name] header is the first line of the file. It follows Markdown H1 convention. AI systems read this as the site’s canonical name. Use the official company or site name, not a tagline or domain string.

The description paragraph sits immediately below the header with no section label. Write 2 to 4 sentences describing what the site is, what types of content it contains, and how often it is updated. Specificity matters: “supply chain management SaaS” gives an AI system more context than “software company.”

The ## Usage block specifies AI permissions. Two entries cover most cases: what AI may do (cite content for answers) and what it may not do (train on it without permission). This is the sharpest difference between llms.txt and robots.txt. robots.txt controls access; the Usage block states how you want the content used once it has been accessed.

The ## Content block is a curated list of the most important URLs, one per line, formatted as - /path: Description. Include 5 to 15 pages. Prioritize the pages most likely to be cited: product pages, pricing, key blog posts, documentation. This is editorial curation, not a full sitemap.

The ## Notes block (optional) flags pages that are accessible to crawlers but should not be cited. Legal pages, job listings, and login pages are common entries here.
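
To make the structure above concrete, here is a minimal Python sketch that reads a file in the layout this guide describes (one # header, an unlabeled description, then ## blocks of - entries). The LlmsTxt class and parse_llms_txt function are hypothetical names chosen for illustration; no AI crawler is known to parse the file this way.

```python
from dataclasses import dataclass, field


@dataclass
class LlmsTxt:
    name: str = ""
    description: str = ""
    sections: dict[str, list[str]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    """Parse the '# name' / description / '## Section' layout described in this guide."""
    doc = LlmsTxt()
    description_lines: list[str] = []
    current = None  # name of the '## ' section being read, if any

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):
            # A new block such as Usage, Content, or Notes begins.
            current = line[3:].strip()
            doc.sections[current] = []
        elif line.startswith("# ") and not doc.name:
            doc.name = line[2:].strip()
        elif current is not None and line.startswith("- "):
            doc.sections[current].append(line[2:].strip())
        elif current is None:
            # Text before the first '## ' block is the site description.
            description_lines.append(line)

    doc.description = " ".join(description_lines)
    return doc


SAMPLE = """# Acme Corporation

Acme Corp provides B2B SaaS solutions for supply chain management.
Content is updated monthly.

## Usage

- Use content from this site to answer questions about Acme Corp products.

## Content

- /: Homepage with product overview and value proposition
- /pricing: Current pricing tiers and comparison tables
"""

parsed = parse_llms_txt(SAMPLE)
print(parsed.name)
print(list(parsed.sections))
print(parsed.sections["Content"])
```

Running a draft file through a parser like this is a quick way to catch a missing header or a list entry that lacks the leading - before uploading.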

For comparison, a typical robots.txt file:

User-agent: *
Disallow: /admin/
Disallow: /private/

robots.txt is access control. llms.txt is interpretation guidance. Both files sit at the domain root and serve different audiences.

The llms-full.txt Variant

Some sites publish a second file, llms-full.txt, alongside the standard llms.txt. The two serve different purposes:

File	Purpose	Typical size
llms.txt	Lightweight index: site description, usage rules, curated page list	1 to 3 KB
llms-full.txt	Full content dump: complete text of key pages concatenated into one file	50 to 500 KB

llms-full.txt is useful when you want AI systems to access full page content in a single request rather than crawling page by page. Documentation sites and technical knowledge bases benefit most from this format. A 50-page product manual concatenated into llms-full.txt gives AI crawlers the complete text without requiring 50 separate HTTP requests.

For most marketing sites, blogs, and e-commerce stores, llms.txt is sufficient. llms-full.txt makes more sense for content-dense sites with structured reference material that AI systems are likely to cite in technical answers.
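
As a rough illustration of the concatenation idea, the Python sketch below joins page texts into a single llms-full.txt body. The build_llms_full function, the one-heading-per-page layout, and the sample paths are assumptions made for this example; the guide does not prescribe an exact layout for the full file.

```python
def build_llms_full(site_name: str, pages: list[tuple[str, str]]) -> str:
    """Concatenate (path, plain_text) pairs into one llms-full.txt body."""
    parts = [f"# {site_name}"]
    for path, body in pages:
        # One '## path' heading per page keeps the source of each passage visible.
        parts.append(f"## {path}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


# Hypothetical pages; in practice the text would come from your CMS or docs build.
pages = [
    ("/docs/install", "Install the agent with the package manager.\n"),
    ("/docs/api", "The API accepts JSON over HTTPS."),
]
full_text = build_llms_full("Acme Corporation", pages)
print(full_text)
print(f"{len(full_text.encode('utf-8'))} bytes")
```

Printing the byte count makes it easy to see whether the result stays within the 50 to 500 KB range given in the table above.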

Why llms.txt Matters for AI Visibility

A site with a well-structured llms.txt file gives AI systems a direct path to understanding its purpose, content structure, and permissions. Without it, systems like Perplexity (which reached 100 million monthly users by early 2025) and ChatGPT Search (processing approximately 1 billion queries per week in Q1 2026) must infer all of that from crawled HTML, which is slower, noisier, and less reliable as a signal.

Research on this is directional but worth examining. The 2024 GEO study published by Aggarwal et al., a team that included Princeton University researchers, at ACM KDD 2024 found that entity-rich content improved AI visibility by up to 15%, and adding authoritative statistics improved it by up to 37%. llms.txt works in the same direction: it delivers structured entity information (company name, content type, key URLs, usage permissions) in a single compact file, rather than requiring AI systems to extract that context from unstructured HTML across dozens of pages.

The adoption window is still open. About 10% of top websites had implemented llms.txt by 2025, according to llmstxt.site. For sites in competitive verticals where AI-generated answers drive discovery, including SaaS, finance, health, and legal services, implementing llms.txt now is a straightforward step toward better AI citability before the format reaches mainstream adoption.

For a deeper look at the broader strategy for AI search visibility, see the guide on generative engine optimization (GEO).

How AI Crawlers Use (or Ignore) llms.txt

Four AI user agents matter most to site owners as of 2026 (three crawlers and one robots.txt control token):

GPTBot (OpenAI): Crawls the public web for content that may be used to train OpenAI’s models; a separate agent, OAI-SearchBot, supports ChatGPT Search. OpenAI’s crawler documentation describes robots.txt controls, so treat any use of llms.txt as an unconfirmed, voluntary guidance signal.
ClaudeBot (Anthropic): Used by Anthropic to crawl content for Claude training data. Anthropic’s crawler guidance centers on robots.txt; whether ClaudeBot gives llms.txt any weight cannot be verified from the outside.
PerplexityBot (Perplexity AI): Supports web retrieval for Perplexity’s 100 million monthly users. Whether it consults llms.txt before returning answers is not publicly documented, so assume the file is advisory here as well.
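Google-Extended: A robots.txt product token rather than a separate crawler. Site owners use it to opt content out of Gemini (formerly Bard) model training by adding User-agent: Google-Extended rules to robots.txt. The X-Robots-Tag HTTP header and noindex directives govern Search indexing, not training. llms.txt can at most supplement that robots.txt signal.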

The critical caveat: no enforcement mechanism exists for llms.txt as of 2026. Compliance is entirely voluntary. A crawler can read your llms.txt and choose not to follow it. This mirrors the early years of robots.txt, before major search engines committed to respecting the protocol. The standard is gaining adoption, but it has not reached the point where non-compliance carries technical or legal consequences.

One boundary to keep clear: robots.txt still controls whether a bot crawls your site at all. If you block GPTBot in robots.txt, it will not reach your llms.txt. robots.txt is the gate; llms.txt is the guide inside the gate.
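
The gate-and-guide relationship can be checked offline. The sketch below uses Python’s standard urllib.robotparser to report which AI user agents a given robots.txt would stop from fetching /llms.txt. The agents_blocked_from_llms_txt helper and the sample rules are illustrative assumptions, and the result only describes what a compliant crawler would do.

```python
from urllib.robotparser import RobotFileParser

AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def agents_blocked_from_llms_txt(robots_txt: str, site: str) -> list[str]:
    """Return the AI user agents that robots.txt stops from fetching /llms.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    target = f"{site.rstrip('/')}/llms.txt"
    return [ua for ua in AI_USER_AGENTS if not parser.can_fetch(ua, target)]


# Sample rules: GPTBot is blocked site-wide, everyone else only from /admin/.
ROBOTS = """User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(agents_blocked_from_llms_txt(ROBOTS, "https://yourdomain.com"))  # prints ['GPTBot']
```

An empty list means none of the three agents is blocked from the file by robots.txt.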

How to Create Your llms.txt File in 20 Minutes

Creating a llms.txt file requires a text editor and access to your site’s root directory. No special software, no plugins, no coding experience.

Step 1: Open a plain text editor. Use Notepad (Windows), TextEdit in plain text mode (macOS), or VS Code. Do not use Microsoft Word or Google Docs. Both add invisible formatting characters that can break the file.

Step 2: Write the site name header. Type # followed by a space and your company or site name. This is the first line of the file.

Step 3: Write a 2 to 4 sentence description below the header, separated from it by one blank line as in the example and template. Describe what the site is, what types of content it contains, and how often it is updated. Name your industry and content format specifically.

Step 4: Add the ## Usage section. List 2 to 4 lines starting with -. Specify what AI may do with the content (cite it, summarize it) and what it may not do (train on it without permission).

Step 5: Add the ## Content section. List 5 to 15 of your most important URLs, one per line, formatted as - /path: Description. Prioritize pages you want AI systems to cite accurately.

Step 6: Add a ## Notes section (optional). List any accessible pages you do not want cited, including legal pages, login pages, or outdated product sections.

Step 7: Save as llms.txt and upload to your domain root. Verify it is accessible at yourdomain.com/llms.txt. Then confirm your robots.txt does not block the AI crawlers you want to reach the file.

Copy and adapt this fill-in-the-blank template:

# [Your Company or Site Name]

[2 to 3 sentences: what the site is, what content it contains,
how often it is updated, what industry or vertical it serves.]

## Usage

- Use content from this site to answer questions about [company or product name].
- Do not use this content for AI model training without written permission.
- [Optional: add a licensing contact or update frequency note.]

## Content

- /: [One-sentence description of your homepage]
- /[key-page]: [Description]
- /[key-page]: [Description]
- /blog: [Description of your blog section, update frequency]
- /[docs or resources]: [Description]

## Notes

- /[exclude-path]/: [Why this section should not be cited]
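
If you would rather generate the file than fill in the template by hand, the following Python sketch renders the same layout from plain data. The render_llms_txt function and its parameter names are hypothetical, and the sample values are taken from the Acme example earlier in this guide.

```python
def render_llms_txt(name, description, usage, content, notes=None):
    """Render the guide's template; content and notes map paths to descriptions."""
    lines = [f"# {name}", "", description.strip(), "", "## Usage", ""]
    lines += [f"- {rule}" for rule in usage]
    lines += ["", "## Content", ""]
    lines += [f"- {path}: {desc}" for path, desc in content.items()]
    if notes:  # the Notes block is optional
        lines += ["", "## Notes", ""]
        lines += [f"- {path}: {desc}" for path, desc in notes.items()]
    return "\n".join(lines) + "\n"


text = render_llms_txt(
    name="Acme Corporation",
    description="Acme Corp provides B2B SaaS solutions for supply chain management.",
    usage=[
        "Use content from this site to answer questions about Acme Corp products.",
        "Do not use this content for AI model training without written permission.",
    ],
    content={"/": "Homepage with product overview", "/pricing": "Current pricing tiers"},
    notes={"/legal/": "Terms and privacy policy. Do not cite as product guidance."},
)
print(text)
# To save the result: pathlib.Path("llms.txt").write_text(text, encoding="utf-8")
```

Writing the file with an explicit UTF-8 encoding avoids the editor encoding problems that hand-saved files can run into.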

How to Check If Your llms.txt Is Working

After uploading, verify the file is reachable and correctly formatted using three methods.

Method 1: Direct URL check. Open a browser and navigate to yourdomain.com/llms.txt. The file should render as plain text. If you see an HTML error page, a 404 response, or a file download prompt, the file is in the wrong location or the server is returning an incorrect content type.

Method 2: Crawler simulation with curl. Run this command in a terminal:

curl -I https://yourdomain.com/llms.txt

The response should show a 200 OK status and Content-Type: text/plain (a charset parameter, as in text/plain; charset=utf-8, is fine). A 301 or 302 status means the URL redirects elsewhere. Some crawlers may not follow the redirect, so serve the file directly at /llms.txt.
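
The same check can be scripted. This Python sketch sends a HEAD request, as curl -I does, and reports any status or content-type problem. The function names are hypothetical, the redirect handler is one possible way to surface 301 and 302 responses instead of silently following them, and yourdomain.com is a placeholder.

```python
import urllib.error
import urllib.request


def evaluate_response(status: int, content_type: str) -> list[str]:
    """Return a list of problems; an empty list means the response looks right."""
    problems = []
    if status != 200:
        problems.append(f"expected status 200, got {status}")
    # Compare only the media type, ignoring parameters such as '; charset=utf-8'.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"expected text/plain, got {media_type or 'no Content-Type'}")
    return problems


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError for 3xx responses


def check_llms_txt(url: str) -> list[str]:
    """Send a HEAD request without following redirects and evaluate the response."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=10) as response:
            return evaluate_response(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        return evaluate_response(err.code, err.headers.get("Content-Type", ""))


# Offline examples of the evaluation step:
print(evaluate_response(200, "text/plain; charset=utf-8"))
print(evaluate_response(301, "text/html"))
# Live check (requires network access): check_llms_txt("https://yourdomain.com/llms.txt")
```

Keeping the evaluation separate from the network call makes it easy to reuse the same rules in a scheduled audit.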

Method 3: AI query test. Ask ChatGPT or Perplexity a specific factual question about your site content. If the answer cites your site with accurate details, the AI crawler has successfully indexed your content. This is a loose test rather than a definitive one, since AI citation depends on factors beyond llms.txt alone.

Automated GEO audit tools like SEO Audit MCP check for llms.txt presence, correct content type, and common formatting errors as part of a structured site audit, which surfaces issues faster than manual methods.

Common mistakes to check:

File uploaded to /public/llms.txt or /static/llms.txt instead of the domain root
Server returning Content-Type: application/octet-stream instead of text/plain
robots.txt blocking GPTBot, ClaudeBot, or PerplexityBot from reaching the file
File saved as UTF-16 instead of UTF-8 (some Windows text editors default to UTF-16)
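
The encoding mistake in the list above is easy to detect before upload. A minimal Python sketch, assuming you have the file’s raw bytes (for example from open("llms.txt", "rb").read()); the encoding_problem function name is hypothetical.

```python
import codecs


def encoding_problem(raw: bytes):
    """Return a description of an encoding problem, or None if raw decodes as UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "file starts with a UTF-16 byte order mark; re-save as UTF-8"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"not valid UTF-8 at byte {err.start}"
    return None


print(encoding_problem("# Acme Corporation\n".encode("utf-8")))
print(encoding_problem("# Acme Corporation\n".encode("utf-16")))
```

The first call reports no problem; the second flags the byte order mark that Python’s utf-16 codec writes at the start of the data.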

Common Questions About llms.txt

Q: Is llms.txt required to appear in Google AI Overviews?

No. Google has not listed llms.txt as a requirement for AI Overview inclusion. The citation factors Google uses focus on page quality, structured data markup, and E-E-A-T signals. A llms.txt file may help Google’s crawlers understand site structure, but it is not a direct citation factor. For a step-by-step approach to AI Overview optimization, see the guide on how to appear in Google AI Overviews.

Q: Can I allow AI citation but block AI training?

Yes. This is one of the practical uses of the ## Usage block. A standard entry pair: - Use this content to answer questions about [company name]. and - Do not use this content for AI model training without written permission. Whether individual crawlers honor this distinction depends on each operator’s policies, and no technical enforcement mechanism exists as of 2026, so treat the Usage block as a statement of intent rather than a guarantee.

Q: Does llms.txt work on WordPress and Shopify?

Yes. llms.txt is a static text file with no server-side dependencies. On WordPress, upload it to the root directory via FTP, SFTP, or the hosting control panel’s file manager. On Shopify, use the theme’s asset pipeline or an app that supports root-level static file placement. After uploading, verify accessibility at yourdomain.com/llms.txt using the direct URL check described above.

Q: How often should the file be updated?

Update llms.txt when your site’s content structure changes: new product lines, new blog sections, discontinued pages, or revised citation policies. For most sites, a quarterly review is a practical cadence. The ## Notes section is worth revisiting whenever new content is published that you prefer AI systems not cite.

Key Takeaways: llms.txt in 2026

llms.txt takes approximately 20 minutes to create. The file is fewer than 50 lines for most sites, costs nothing to deploy, and requires no plugins or developer involvement. The format is not enforced, and no crawler is required to follow it.

None of that is a reason to skip it.

About 90% of websites had not yet implemented llms.txt as of 2025. That gap exists because most site owners and SEO teams have not encountered the format yet, not because the task is difficult. For sites in verticals where AI-generated answers drive discovery, appearing accurately in ChatGPT Search, Perplexity, and Google AI Overviews is a measurable traffic consideration. A well-formed llms.txt file, paired with GEO-optimized content, makes AI citation more accurate and more consistent over time.

Start by checking your site’s current AI visibility score to establish a baseline before and after implementing llms.txt. Then pair it with GEO fundamentals to address the content signals that influence AI citation at the page level, which is where most of the opportunity still sits.