Robots.txt Explained — How to Control What Search Engines Can Crawl

What is robots.txt and what does it actually do

A robots.txt file is a plain text file placed at the root of your website, where crawlers expect to find it: https://yourdomain.com/robots.txt. Its job is to give instructions to search engine crawlers such as Googlebot, telling them which folders, pages, or file types they should or should not crawl.

The important word here is crawl. robots.txt controls crawling, not indexing. Blocking a URL in robots.txt does not automatically remove it from Google search results. If other pages link to that URL, Google may still index the bare URL and show it without a description, because it cannot read the page's content.

It also does not protect private information. The file is publicly readable, and crawlers follow it voluntarily. If a page must stay private, use proper authentication or server-side access controls, not robots.txt.

What it does well: it guides search engines away from pages that waste crawl budget or do not need to appear in search results, such as internal search pages, admin areas, or duplicate filter URLs.

The basic syntax — User-agent, Disallow, Allow, Sitemap

A robots.txt file uses a few simple directives. Once you understand these, you can read or write most files easily.

User-agent — tells the crawler which bot the rule applies to:

User-agent: Googlebot

To target all crawlers:

User-agent: *

Disallow tells a crawler not to fetch any URL whose path starts with the given value:

Disallow: /admin/

Allow — used when you block a folder but want one part to remain crawlable:

Disallow: /images/
Allow: /images/logo.png

Sitemap — tells search engines where your XML sitemap lives:

Sitemap: https://yourdomain.com/sitemap.xml

A clean basic example:

User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
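
To see how a parser reads these directives, here is a minimal sketch that feeds the example above to Python's standard-library urllib.robotparser. The URLs being checked are made up for illustration, and Python's parser only approximates how any particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, chosen only to exercise each rule.
for url in (
    "https://yourdomain.com/",
    "https://yourdomain.com/blog/hello-world",
    "https://yourdomain.com/admin/settings",
    "https://yourdomain.com/search/widgets",
):
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict:9}  {url}")

# site_maps() requires Python 3.8 or newer.
print(parser.site_maps())
```

The first two URLs should come back as crawlable and the last two as blocked, since only /admin/ and /search/ are disallowed.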

Common robots.txt patterns you can use today

Most websites do not need a complex robots.txt file. These five patterns cover the majority of use cases.

1. Allow everything except admin and login pages:

User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://yourdomain.com/sitemap.xml

2. Block internal search results:

User-agent: *
Disallow: /search/
Disallow: /?s=

3. Block parameter-heavy filtered URLs (each rule matches only when that parameter comes first in the query string):

User-agent: *
Disallow: /products?color=
Disallow: /products?sort=

4. Block a staging or test area:

User-agent: *
Disallow: /staging/
Disallow: /test/

5. Allow specific files inside a blocked folder:

User-agent: *
Disallow: /assets/
Allow: /assets/main.css
Allow: /assets/main.js
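
Patterns 3 and 5 both depend on how a crawler matches rules. Under RFC 9309, which Google follows, a rule matches as a prefix of the path plus query string, * stands for any run of characters, a trailing $ anchors the end, the longest matching rule wins, and Allow wins a tie. The sketch below is a simplified illustration of that logic, not any crawler's actual code. The function names are hypothetical, and percent-encoding is ignored.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regex.
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, target: str) -> bool:
    """rules: (directive, pattern) pairs from one user-agent group.
    target: URL path plus '?query', e.g. '/products?color=red'."""
    best_len = -1
    allowed = True  # no matching rule means the URL is crawlable
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty value matches nothing
        # re.match anchors at the start, which gives prefix matching.
        if pattern_to_regex(pattern).match(target):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed

# Pattern 5: the longer Allow rule overrides the folder-level Disallow.
assets = [
    ("Disallow", "/assets/"),
    ("Allow", "/assets/main.css"),
    ("Allow", "/assets/main.js"),
]
print(is_allowed(assets, "/assets/main.css"))         # True
print(is_allowed(assets, "/assets/fonts/icon.woff"))  # False

# Pattern 3: a plain prefix only catches 'color' as the first parameter.
plain = [("Disallow", "/products?color=")]
print(is_allowed(plain, "/products?color=red"))         # False
print(is_allowed(plain, "/products?size=m&color=red"))  # True (not blocked)

# A wildcard variant catches the parameter in any position.
wild = [("Disallow", "/products?*color=")]
print(is_allowed(wild, "/products?size=m&color=red"))   # False
```

If filtered URLs on your site can carry parameters in any order, the wildcard form is the safer of the two, provided the crawlers you care about support wildcards.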

Mistakes that accidentally block Google

This is where most sites run into trouble. One wrong line in robots.txt can stop search engines from crawling valuable content.

Blocking the whole site by mistake:

User-agent: *
Disallow: /

That single slash means do not crawl anything on the site. It is sometimes used during development and then accidentally left in place when the site goes live.
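
Because this mistake usually slips in at launch, a small pre-deployment check can catch it. The sketch below scans for a bare Disallow: / in a group that applies to all crawlers. The function name is hypothetical, and the check deliberately ignores subtleties such as Allow overrides.

```python
def blocks_everything(robots_txt: str) -> bool:
    """Return True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents = []       # user-agents named by the group currently being read
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False

print(blocks_everything("User-agent: *\nDisallow: /\n"))        # True
print(blocks_everything("User-agent: *\nDisallow: /admin/\n"))  # False
```

Running a check like this in a deployment script, and failing the release when it returns True for the production file, is one way to keep a development-only rule from going live.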

Blocking CSS or JavaScript files:

Disallow: /css/
Disallow: /js/

If Google cannot access important CSS or JavaScript, it may not render the page properly, which can hurt indexing and page understanding.

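Confusing noindex with robots.txt: Blocking a page in robots.txt does not remove it from Google. If you need a page removed from search results, use an indexing control such as a noindex robots meta tag or an X-Robots-Tag: noindex response header, and leave the page crawlable so the crawler can see that instruction.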
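
As an illustration of one such indexing control, the sketch below sends an X-Robots-Tag: noindex response header for selected paths. It assumes a Python WSGI application, and the path prefixes are hypothetical. Crawlers only see the header on pages they are allowed to fetch.

```python
# Hypothetical URL prefixes that should stay crawlable but out of search results.
NOINDEX_PREFIXES = ("/thank-you/", "/drafts/")

def app(environ, start_response):
    """Minimal WSGI app that adds 'X-Robots-Tag: noindex' for selected paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # The crawler must be able to fetch the page to see this header,
        # so these paths must not be disallowed in robots.txt.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"placeholder response body\n"]
```

The same header can usually be set in web server configuration instead of application code. The important part is that the affected URLs are not also blocked in robots.txt.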

Blocking important pages accidentally: On content and utility websites, accidentally blocking folders like /tools/, /images/, or /blog/ can wipe out organic visibility for core pages.

Syntax and path errors: Conflicting rules, paths that do not match your real URL structure (paths are case-sensitive), and examples copied without adjusting them to your own site often cause problems that are hard to diagnose.

How to test your robots.txt before it causes problems

Never publish a new robots.txt file without testing it. A file that looks harmless can still block important directories, templates, or tool pages.

Before going live, check whether your key URLs are crawlable. Test your homepage, blog posts, tool pages, images, and any folders you intentionally restricted. That way you can catch bad rules before Google does.

A good testing workflow is simple: paste your file into a robots.txt tester, validate the syntax, then check important URLs one by one. This gives you confidence that search engines can reach the pages you want crawled and will skip the ones you restricted.
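
The URL-by-URL step can also be scripted so it runs before every release. Here is a minimal sketch using Python's standard-library urllib.robotparser. The draft file, URL list, and function name are hypothetical examples. Note that this parser applies the first matching rule in file order, whereas Google picks the longest match, so results can differ when Allow and Disallow rules overlap.

```python
from urllib.robotparser import RobotFileParser

def find_mismatches(robots_txt, expectations, user_agent="Googlebot"):
    """Compare parser verdicts with what you expect for each URL.

    expectations maps URL -> True (should be crawlable) or False (should be blocked).
    Returns the URLs whose actual verdict differs from the expected one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, should_crawl in expectations.items()
        if parser.can_fetch(user_agent, url) != should_crawl
    ]

# Hypothetical draft file with an accidental rule that blocks the blog.
draft = """\
User-agent: *
Disallow: /admin/
Disallow: /blog/
"""

# Hypothetical key URLs and the verdict each one should get.
expected = {
    "https://yourdomain.com/": True,
    "https://yourdomain.com/blog/first-post": True,
    "https://yourdomain.com/tools/": True,
    "https://yourdomain.com/admin/": False,
}

print(find_mismatches(draft, expected))
```

With this draft, the blog post is the only URL reported, which points straight at the stray Disallow: /blog/ line. An empty list means every key URL behaves as intended.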

Final Thoughts

robots.txt is a small file with a big impact. It helps search engines crawl your site more efficiently, reduces wasted crawling on low-value pages, and gives you more control over how your website is explored.

Keep it simple. Use it to guide crawlers, not to hide content. Double-check every rule — especially before a redesign, migration, or launch. And always test it before pushing changes live.

A clean robots.txt file will not magically improve SEO on its own, but a broken one can absolutely hurt it. That is why getting this file right matters.