What’s Important for a Robot

This entry is part of the series:
The Art of Search Engine Optimization

Designing a website nowadays always includes the task of optimizing the website for search engines. Otherwise you might have designed a brilliant website but nobody will be able to find it! Ideally, your site will be in the top 10 search results, i.e. on the first page. This blog series by Marcus Günther and Oliver Schmidt describes how to attain this goal. The first lesson is to master the art of being crawled by a search engine robot.

Search engine robots are programs that automatically browse websites on the Internet and download new or modified webpages for later indexing by the search engines. Other terms for search engine robots are bots, crawlers, spiders, etc., and the task itself is often called crawling or spidering. In order to make a robot spider your website, you either need links from websites already known to the search engine, or you need to submit your website to the search engines, which provide special pages for that purpose.
Before they start crawling your website, all well-behaved robots (e.g. those from Bing, Google and Yahoo!) check the existence and contents of the file /robots.txt.

Guiding Robots with robots.txt

The robots.txt file is a plain text file in which you can specify which content of your website may or may not be indexed by search engine robots. The file’s structure is standardized by the Robots Exclusion Protocol (REP), also known as the Robots Exclusion Standard.
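
A record in robots.txt starts with one or more User-Agent lines naming the robots it applies to, followed by the rules for those robots. A minimal file that allows every robot to crawl the whole website could look like this (an empty Disallow value means that nothing is disallowed):

# Allow all robots to crawl the entire website
User-Agent: *
Disallow: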

You can include or exclude single files or folders as well as file and directory name patterns. For some search engine robots, you can even create rules to exclude or include URLs which contain specific characters, for example the question mark (“?”).
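
For example, for robots that understand such wildcard patterns (Googlebot is one of them), a rule like the following would exclude all URLs containing a question mark, e.g. internal search result pages:

# Exclude all URLs that contain a question mark
User-Agent: Googlebot
Disallow: /*?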

However, it’s not absolutely necessary to have a robots.txt file. If your website is well-organized, chances are very high that bots will crawl all of your content successfully even without that file.

But what if you do not want all of your pages to be indexed? Maybe you provide pages which are useful for visitors but do not need to be indexed at all, e.g. error pages. These pages provide no further information about you, your services or the content of your website, and they might even look strange on result pages. So it’s better to exclude them from indexing by using the following statement in robots.txt:

Disallow: /errors

Another scenario you might want to anticipate: Someone copying your website using tools like wget, mechanize, etc. With a robots.txt file, you can make it harder for them (but beware, these tools have options for ignoring robots.txt):

User-Agent: Wget
User-Agent: Wget/1.6
User-Agent: Wget/1.5.3
Disallow: /

There are more ways to tell search engine robots not to crawl specific pages. These will be covered in the next part of this blog series.

However, the robots.txt file is only a hint for robots. If a robot is not well-behaved, ignores the Robots Exclusion Standard and tries to spider your website anyway, your robots.txt file is nothing more than a blunt sword. To completely lock out such bots, you have to know their “user agent string” and instruct the webserver not to serve any content to them.
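
How you do that depends on your webserver. As a rough sketch, on an Apache web server with mod_rewrite enabled, rules like the following (using the hypothetical user agent string “badBot”) would answer every request from that bot with “403 Forbidden”:

# Refuse all requests whose User-Agent header contains "badBot" (hypothetical example bot)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badBot [NC]
RewriteRule .* - [F,L]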

Nevertheless, you can provide the robots.txt file for all “friendly” search engine robots to specify how they should index your website. That makes their and your life easier. Here’s another example of a short robots.txt file:

# robots.txt file for www.example.org
Sitemap: http://www.example.org/sitemap.xml

# First part – bad bots
User-Agent: badBot
User-Agent: badBot/2.1
Disallow: /

# Second part – Google image bot
User-Agent: googlebot-image
Disallow: /errors
Disallow: /contact
Disallow: /disclaimers
Crawl-delay: 180

# Third part – all other bots
User-Agent: *
Disallow: /errors/
Disallow: /*.pdf$

In the above example, we used some non-standard directives, such as the “$” sign which marks the end of a URL. Google, Bing and Yahoo! can interpret those useful directives, however.

The first part of the example (lines 4-7) denies a bot called “badBot” (as well as another version of that bot) access to all files and directories on the domain http://www.example.org/.

Lines 9-14: For the user agent “googlebot-image”, which is Google’s robot for image search, the example disallows access to some directories that contain no relevant images. The “Crawl-delay” statement tells the bot to wait 180 seconds between two successive page requests.

The third part (lines 16-19) of the example forbids all other bots access to all files below the /errors/ directory as well as to all PDF documents on the domain.

At the beginning of the example, there is a statement which has not been discussed yet in this article: with the Sitemap statement, you can specify where search engine robots can find your sitemap.xml file. The following section describes what the sitemap.xml is all about.

Describing your site’s structure

The sitemap.xml file is simply a file that resides on your website and lists links to all of your assets that robots should spider and index. That’s important because some of your pages might only be reachable via Flash or JavaScript links, which can easily be missed during crawling.

Below is a simple example of a sitemap.xml file listing a single URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/</loc>
    <lastmod>2011-06-01</lastmod>
  </url>
</urlset>

Besides the required <loc> element, each URL entry can carry optional elements such as <lastmod>, <changefreq> and <priority>; all of them are described in the official sitemap.xml specification.

You might guess that creating a sitemap.xml file is not difficult, but it can be quite time-consuming. Google, who invented the sitemap.xml standard, helps you out here with Google’s sitemap generator, a module for the Apache web server that can parse Apache’s log files, scan the website’s root directory and filter URLs in order to create a complete sitemap.xml specific to your website. Additionally, it can automatically submit the newest version of your sitemap to Google, Bing, Yahoo! and Ask.

However, if you have the chance to create that file on your own, that is the better alternative, as automated tools have a hard time distinguishing between real content and, for example, search result pages which should not be indexed.

The Google Sitemap Generator module for the Apache web server

Most blog/CMS frameworks as well as enterprise portals also provide automatic generation of the sitemap.xml file as you add more and more pages.

Once you have created and adjusted the sitemap.xml for your website, there are several ways to publish it. You can

  1. store it anywhere on your webserver and provide its exact location in the robots.txt file (see above),
  2. submit it in the search engine provider’s webmaster tools – such tools exist for several providers, for example Google, Bing and Yahoo!, or
  3. submit it via your search engine provider’s ping service:
    1. Bing sitemap ping service: http://www.bing.com/webmaster/ping.aspx?siteMap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz
    2. Yahoo! Site Explorer
    3. Google sitemap ping service for re-submission: http://www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz

Sitemap files can get very large; therefore, most search engines support adding more than one sitemap.xml. Another useful case for multiple sitemaps is a website which uses different software for different sections (like shop, CMS, blog, etc.).

In order to use multiple sitemaps you have to submit a sitemap index file to your search engine provider and refer to the other sitemaps from there. Google supports adding more than one sitemap.xml in their webmaster central to make that a bit easier.
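
A sitemap index file uses the same XML namespace as a normal sitemap and simply lists the locations of the individual sitemap files. A minimal sketch (with hypothetical file names) could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical example: one sitemap per section of the site -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.org/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.org/sitemap-shop.xml</loc>
  </sitemap>
</sitemapindex>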

Submitting a sitemap.xml file to Bing webmaster tools

The next articles will cover how to structure your website and your URLs, as well as everything around keywords: how to find keywords, what kind of keywords you should focus on, where you can place your keywords, and so on.
