
Following the recent announcements of agreements on the Sitemap protocol, the three main search engines, Google, Yahoo! and MSN Live Search, have just established a standard for the robots.txt file.
As a reminder, robots.txt is a small file placed at the root of a site; it is the first document a search engine's robot reads when it crawls a website. It is used to tell the engine how to handle the site (which directories or files are off limits and must not be crawled; the robots META tags play a similar role at the page level).
Previously, each engine had its own syntax, even if some terms overlapped. That time is now over: the three main search engines have agreed on a standard, which should make life somewhat simpler for webmasters.
As an example, here is the statement as published on the official Yahoo! blog, which still includes a few specifics for that engine (as is the case for the others):
1. Robots.txt Directives
| DIRECTIVE | IMPACT | USE CASE(s) |
| Disallow | Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled. | 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled. |
| Allow | Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies. | This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed, except a small section within it. |
| $ Wildcard Support | Tells a crawler to match everything from the end of a URL, covering a large number of directories without listing specific pages. | 'No Crawl' files with specific patterns, e.g., files of a certain type that always have a certain extension, say pdf. |
| Sitemap Location | Tells a crawler where it can find your sitemaps. | Point to other locations where feeds exist to point the crawlers to the site's content. |
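
To make the "longest rule applies" behaviour concrete, here is a minimal Python sketch of how a crawler might resolve overlapping Allow and Disallow rules. It is an illustration, not any engine's actual implementation. The function names and the example rules are hypothetical, the `*` wildcard (any run of characters) is an assumption that is not described in the table above, and tie-breaking between rules of equal length is left unspecified here because the quoted post does not define it.

```python
import re

def rule_to_regex(path_pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    Assumptions: '*' matches any run of characters, and a trailing '$'
    anchors the match at the end of the URL path.
    """
    anchored = path_pattern.endswith("$")
    if anchored:
        path_pattern = path_pattern[:-1]
    # Escape literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in path_pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, url_path):
    """Return True if url_path may be crawled.

    rules is a list of (directive, pattern) pairs. Among all matching
    rules, the one with the longest pattern wins. On a tie the first
    matching rule is kept (an arbitrary choice made for this sketch).
    A path matched by no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        if pattern and rule_to_regex(pattern).match(url_path):
            if best is None or len(pattern) > len(best[1]):
                best = (directive, pattern)
    return best is None or best[0].lower() == "allow"

# Hypothetical rules: block /private/ except one subfolder, and block PDFs.
EXAMPLE_RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public/"),
    ("Disallow", "/*.pdf$"),
]

for path in ("/private/secret.html", "/private/public/page.html",
             "/docs/report.pdf", "/index.html"):
    print(path, is_allowed(EXAMPLE_RULES, path))
```

With these rules, `/private/public/page.html` is crawlable because the Allow pattern is longer than the Disallow pattern that also matches it, while `/docs/report.pdf` is blocked by the `$`-anchored rule.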
2. HTML META Directives
These directives can either be placed in the HTML of a page or in the HTTP header for non-HTML content like PDF, video, etc. using an X-Robots-Tag. The X-Robots-Tag mechanism allows these directives to be available for all types of documents -- HTML or otherwise. If both forms of the tag, HTML META and X-Robots-Tag in the header are present, the most restrictive one applies.
| DIRECTIVE | IMPACT | USE CASE(s) |
| NOINDEX META Tag | Tells a crawler not to index a given page. | Don't index the page. This allows pages that are crawled to be kept out of the index. |
| NOFOLLOW META Tag | Tells a crawler not to follow a link to other content on a given page. | Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page. |
| NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page. | Present no abstract for the page on search results. |
| NOARCHIVE META Tag | Tells a search engine not to show a "cached" link for a given page. | Do not make a copy of the page available to users from the search engine cache. |
| NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page. | Do not use the ODP (Open Directory Project) title and abstract for this page in Search. |
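
The rule that the most restrictive form wins when both an HTML META tag and an X-Robots-Tag header are present can be sketched in Python as follows. This is an illustration under stated assumptions, not how any engine is known to implement it: since every directive in the table is a restriction, the sketch interprets "most restrictive" as the union of the directives found in both places. The names `split_directives`, `RobotsMetaParser` and `effective_directives` are hypothetical.

```python
from html.parser import HTMLParser

# Directives listed in the table above.
KNOWN_DIRECTIVES = {"noindex", "nofollow", "nosnippet", "noarchive", "noodp"}

def split_directives(value):
    """Split a comma-separated directive string into a set of known names."""
    return {token.strip().lower() for token in value.split(",")} & KNOWN_DIRECTIVES

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attributes.get("content") or "")

def effective_directives(html, headers):
    """Combine META and X-Robots-Tag directives.

    Assumption: the most restrictive result is the union of both sets.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    # HTTP header names are case-insensitive, so normalise the keys.
    lowered = {name.lower(): value for name, value in headers.items()}
    return parser.directives | split_directives(lowered.get("x-robots-tag", ""))

page = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
print(sorted(effective_directives(page, {"X-Robots-Tag": "nofollow"})))
```

For non-HTML documents such as PDFs, the same function can be called with an empty string for `html`, in which case only the header contributes.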
Other REP Directives
Yahoo!-specific REP directives that are not supported by Microsoft and Google include:
- Crawl-Delay: Allows a site to delay the frequency with which a crawler checks for new content
- NOYDIR META Tag: This is similar to the NOODP META Tag above but applies to the Yahoo! Directory, instead of the Open Directory Project
- Robots-nocontent Tag: Allows you to identify the main content of your page so that our crawler targets the right pages on your site for specific search queries by marking out non content parts of your page. We won't use the sections tagged as such for indexing the page or for the abstract in the search results.
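
As an illustration of the Crawl-Delay directive, the short Python sketch below reads it with the standard library's `urllib.robotparser`. The robots.txt content is hypothetical: the `Slurp` user-agent token (Yahoo!'s crawler), the paths and the value `5` are chosen for the example only and do not come from the quoted post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Yahoo!'s crawler, one default group.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay is only defined for the group that declares it.
print(parser.crawl_delay("Slurp"))
print(parser.crawl_delay("OtherBot"))
print(parser.can_fetch("Slurp", "/private/page.html"))
```

Because the directive sits inside the `Slurp` group, other crawlers that read this file see no delay at all, which fits the point that Crawl-Delay is specific to Yahoo! here.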
Source: Yahoo! Search Blog