How to Control Search Engine Robots

Date Published: 17th May 2005
Author: Michael Rock

Wouldn't it be nice to be able to leave some code in your web site to tell
the search engine crawlers to make your site number one? Unfortunately, a
robots.txt file or robots meta tag won't do that, but they can help the crawlers
index your site better and keep the unwanted ones out.


First, a few definitions:


Search Engine Spiders or Crawlers - A web crawler (also
known as a web spider) is a program which browses the World Wide Web in a
methodical, automated manner. Web crawlers are mainly used to create a copy of
all the visited pages for later processing by a search engine, which will index
the downloaded pages to provide fast searches.


A web crawler is one type of bot, or software agent. In general, it starts
with a list of URLs to visit. As it visits these URLs, it identifies all the
hyperlinks in the page and adds them to the list of URLs to visit, recursively
browsing the Web according to a set of policies.
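
The loop described above can be sketched in a few lines of Python. This is
only an illustration of the "list of URLs to visit" idea, not how any real
search engine works: the crawl function, the TOY_WEB dictionary and its page
names are all made up for the example, and no network access is involved.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, queueing each newly discovered link once."""
    frontier = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)         # everything ever queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page site: each key is a page, each value its hyperlinks.
TOY_WEB = {
    "/": ["/about", "/articles"],
    "/about": ["/"],
    "/articles": ["/articles/robots", "/about"],
    "/articles/robots": [],
}

order = crawl(["/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The max_pages limit stands in for the "set of policies" a real crawler
follows; a real crawler would also check robots.txt before fetching a page.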


Robots.txt - The robots exclusion standard, or robots.txt protocol,
is a convention to prevent well-behaved web spiders and
other web robots from accessing all or part of a website. The parts that
should not be accessed are listed in a file called
robots.txt in the top-level directory of the website.


The robots.txt protocol is purely advisory, and relies on the cooperation of
the web robot, so that marking an area of your site out of bounds with
robots.txt does not guarantee privacy. Many web site administrators have been
caught out trying to use the robots file to make private parts of a website
invisible to the rest of the world. However, the file is necessarily publicly
available and is easily checked by anyone with a web browser.


The robots.txt patterns are matched by simple prefix comparison against the
start of the URL path, so care should be taken to make sure that patterns
matching directories have the final '/' character appended: otherwise all files
with names starting with that pattern will match, rather than just those in the
directory intended.
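
The effect of the trailing '/' can be checked with the robots.txt parser in
Python's standard library (urllib.robotparser). The sketch below is only an
illustration: the /archive paths, the www.example.com host and the ExampleBot
name are invented for the example, and other crawlers may implement matching
slightly differently.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.example.com"  # hypothetical site

# Hypothetical paths: one inside the directory, two that merely share its prefix.
PATHS = ["/archive/2005.html", "/archive-index.html", "/archives.html", "/index.html"]


def blocked_paths(disallow_pattern, paths, agent="ExampleBot"):
    """Return the paths that a single Disallow pattern blocks for all crawlers."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_pattern])
    return [p for p in paths if not parser.can_fetch(agent, SITE + p)]


without_slash = blocked_paths("/archive", PATHS)
with_slash = blocked_paths("/archive/", PATHS)

print("Disallow: /archive  blocks", without_slash)
print("Disallow: /archive/ blocks", with_slash)
```

Without the final '/', the two files that only begin with "/archive" are
blocked as well; with it, only the file inside the directory is.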


Meta Tag - Meta tags are used to provide structured data about
data.


In the early 2000s, search engines veered away from reliance on Meta tags, as
many web sites used inappropriate keywords, or were keyword stuffing to obtain
any and all traffic possible.


Some search engines, however, still take Meta tags into some consideration
when delivering results. In recent years, search engines have become smarter,
penalizing websites that cheat (by repeating the same keyword several
times to get a boost in the search ranking). Instead of going up the rankings,
these websites will go down or, on some search engines, will be removed from
the index completely.


Index a site - The act of crawling your site and gathering
information.

How can the robots.txt file and meta tag help you?


In the robots.txt file you can tell harmful web crawlers to leave your web
site alone, and give helpful hints to the ones you want to crawl your site.
Here is an example of how to stop a particular web crawler from searching
your site:


# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /


ia_archiver is the crawler name for the wayback machine that you may have
heard of, and the / after Disallow tells ia_archiver not to index any of your
site. The # lets you write comments to yourself so you
can keep track of what you typed.


Type the above three lines into Notepad (or any plain-text editor) and save
the file to the root directory of your web site as robots.txt. Well-behaved web
crawlers look for this file first at a web site before doing anything else.
This helps the crawler do its job, and lets the web site owner tell the spider
what to do.


Say, for instance, you have some data that you don't want the crawlers to see
(like duplicate content for other browser referrer pages). You can deter
crawlers from indexing the 'duplicate' directory by typing this into your
robots.txt file:


User-agent: *
Disallow: /duplicate/


Or, if you would like to have the robots.txt file created for you, visit
www.rietta.com/robogen. To validate
your robots.txt file and make sure it works properly, you can visit
www.searchengineworld.com/cgi-bin/robotcheck.cgi


The * after User-agent says that this rule applies to all crawlers, and
/duplicate/ after Disallow tells all crawlers to ignore this directory and not
search it. Keep each User-agent line together with its Disallow lines, and
separate one record from the next with a blank line, in order for the file to
function correctly. So this is how you would combine the above two rules in a
robots.txt file:


# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
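

One way to check a file like this before uploading it is to feed it to the
robots.txt parser in Python's standard library (urllib.robotparser). The
sketch below is only an illustration: www.example.com, ExampleBot and the page
names are invented, and it shows how Python's parser reads the rules, which is
not guaranteed to match every crawler. It also shows that, for this parser, a
blank line between User-agent and Disallow ends the record and the rule is
lost.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

# The first record again, but split by a blank line (for comparison only).
SPLIT_RECORD = """\
User-agent: ia_archiver

Disallow: /
"""

SITE = "http://www.example.com"  # hypothetical site


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, agent, path):
    """True if the parsed rules let the named crawler fetch the path."""
    return parser.can_fetch(agent, SITE + path)


rules = load_rules(ROBOTS_TXT)
archiver_home = allowed(rules, "ia_archiver", "/index.html")
other_home = allowed(rules, "ExampleBot", "/index.html")
other_duplicate = allowed(rules, "ExampleBot", "/duplicate/page.html")

split_rules = load_rules(SPLIT_RECORD)
archiver_home_split = allowed(split_rules, "ia_archiver", "/index.html")

print("ia_archiver may fetch /index.html:", archiver_home)
print("ExampleBot may fetch /index.html:", other_home)
print("ExampleBot may fetch /duplicate/page.html:", other_duplicate)
print("ia_archiver may fetch /index.html (split record):", archiver_home_split)
```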


One thing to note that is very important: anyone can access the
robots.txt file of a site. So if you have information that you don't want
anyone to see, don't list it in the robots.txt file. If the directory
that you don't want anyone to see is not linked to from any web page,
crawlers are unlikely to find and index it anyway.


An alternative to blocking indexing of your site with robots.txt is to put a
robots meta tag into the page. It looks like this:


<meta name="robots" content="noindex,nofollow">


You put this into the <head> section of your web page. This line tells the
robot crawlers not to index (search) the page and not to follow any of the
hyperlinks on the page. As another example,


<meta name="robots" content="noindex,follow">


tells the robot crawlers not to index the page, but to follow the hyperlinks on
this page.


Did you know that Google has its own tag?


It looks like this:


<meta name="googlebot" content="noindex,nofollow,noarchive">


This tells the Google robot crawler not to index the page, not to follow any of
the links, and not to store cached versions of your pages.
You will want to prevent caching if you update the content on your site
frequently. This prevents the web user from seeing outdated content that isn't
refreshed because of storage in the cache.
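

To see which directives a page actually carries, the meta tags can be read
back with Python's standard html.parser module. This is a minimal sketch,
assuming the tags use name="robots" or name="googlebot" with a comma-separated
content value; the RobotsMetaParser class, the read_robots_meta helper and the
sample page are made up for the example, and real crawlers may interpret the
tags differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            values = {part.strip().lower() for part in content.split(",") if part.strip()}
            self.directives.setdefault(name, set()).update(values)


def read_robots_meta(html):
    """Return a dict such as {"robots": {"noindex", "follow"}} for a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page carrying both a general tag and a Google-specific tag.
SAMPLE_PAGE = """
<html><head>
<meta name="robots" content="noindex,follow">
<meta name="googlebot" content="noindex, nofollow, noarchive">
</head><body>Hello</body></html>
"""

found = read_robots_meta(SAMPLE_PAGE)
print(sorted(found["robots"]))
print(sorted(found["googlebot"]))
```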


You can use the googlebot tag to talk specifically to Google's robots, either
to avoid complications or if you are optimizing your site for Google's search
engine. This concludes this month's article.


Until the next article have a great day!


Copyright © Michael Rock

(You have permission to copy this article as long as it remains intact with the
author's byline)


Web development contractor (Web Design and Hosting)


Internet Presence

www.TheInternetPresence.com



The owner of this registered company
has over twenty years of experience with DOS, Windows business applications,
numerous programming languages, artistic development, and web design. Other
areas of interest include web marketing, web promotion, and business marketing
and development. Persuaded by those praising his work, he decided to go
into business for himself and highly recommends that everyone else do the same.



Internet Presence was founded in 2003
from a desire to become independent. Less than a year later, Internet Presence
had accounts in three different states, ranging from a locally owned auto
collision repair shop to a glass packaging company that sells its product
worldwide.


This article is free for republishing
Source: http://www.articlealley.com/article_1786_6.html
About the Author
Lois Thompson of Prudco Marketing. Lois is a Web Consultant and moderator of buildingwealth eGroup. http://www.building-your-internet-presence.com/freetips.htm mailto: byipwebservices@freeautobot.com