Crawlability is a big issue for most sites when it comes to Search Engine Optimization (SEO) efforts. Unless all the hindrances and barriers are removed, your site stands a greater risk of not being crawled by search bots like Googlebot and Yahoo! Slurp. The following are common barriers to crawlability:
1. Website Structure should be corrected before any SEO and Link Building campaign
(A). First Layer - The homepage of a website is supposed to introduce the main purpose of the
website. Thus, webmasters should make sure the homepage links to the most important information
files in the root directory of the website.
(B). 2nd Layer - This is the hub for the bottom-line information on a website. The files in this
section will be crawled by search engines after the ones in the root directory, so they should be
files that search engines don't need to find immediately. In other words, they are the second most
important files, after the ones at the first layer of the website structure.
(C). 3rd Layer - This is where the least important website files are. These files should contain the
primary information website visitors may be looking for, so that people don't need to look for anything
deeper than those files. This could be where downloadable files are stored.
The fascinating thing about search engines' behavior is that they treat whatever comes first as more important. Thus, the closer a file is saved to the root of a website directory, the more
importance search engines will give it.
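As a rough illustration of the three layers, here is a minimal Python sketch that counts how many path segments separate a URL from the site root and sorts pages by that depth. The function name url_depth and the example.com URLs are made up for this example, and real search engines weigh many more signals than folder depth.

```python
from urllib.parse import urlsplit

def url_depth(url):
    """Count the non-empty path segments between the site root and the file."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])

# Hypothetical URLs sitting at different depths of a site.
pages = [
    "https://example.com/downloads/manuals/guide.pdf",
    "https://example.com/services.html",
    "https://example.com/services/pricing.html",
]

# Pages closest to the root come first.
pages_by_depth = sorted(pages, key=url_depth)
```

Running a list of your own URLs through something like this is a quick way to spot important pages that are buried several folders deep.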
2. If you have a robots.txt file, check it carefully so that it doesn't interfere with the search engines.
One common mistake is writing the following, which blocks all crawlers from the entire site:
User-agent:*
Disallow:/
instead of the following, which lets crawlers access everything:
User-agent:*
Disallow:
You can learn everything about robots.txt files at: www.robotstxt.org.
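If you want to double-check the two rule sets quoted in this section before uploading anything, Python's standard urllib.robotparser module can evaluate them offline. This is only a sketch: the helper name can_crawl and the example.com page are assumptions for illustration, and individual search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

BLOCK_ALL = "User-agent: *\nDisallow: /"  # the common mistake
ALLOW_ALL = "User-agent: *\nDisallow:"    # the intended rule

page = "https://example.com/index.html"   # hypothetical page

print(can_crawl(BLOCK_ALL, page))  # False
print(can_crawl(ALLOW_ALL, page))  # True
```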
3. Session IDs in your URLs
Search engine spiders do not like to see session IDs in URLs. If you're using session IDs
on your site, please be sure to store them in cookies (which, I guess, spiders don't accept
anyway) instead of including them as part of your URLs. Session IDs normally cause a single page
of content to be visible at multiple URLs, and that would just clutter the SERPs (search engine
results pages) with duplicates. Thus, search engines don't like to crawl URLs with session IDs.
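To see why session IDs create duplicate URLs, here is a small Python sketch that strips session parameters from a URL so that two visits to the same page collapse to one address. The parameter names in SESSION_PARAMS are assumptions (PHPSESSID is PHP's default session name, the others are just common guesses), so adjust the set to whatever your own site uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names, compared case-insensitively.
SESSION_PARAMS = {"phpsessid", "sessionid", "sid", "jsessionid"}

def strip_session_ids(url):
    """Return url with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Two hypothetical visits to the same page with different session IDs.
visit_a = "https://example.com/shop?item=42&PHPSESSID=abc123"
visit_b = "https://example.com/shop?item=42&PHPSESSID=xyz789"

print(strip_session_ids(visit_a) == strip_session_ids(visit_b))  # True
```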
4. Poor Navigation and Internal Links Coding
It's well documented that Google and the other search engines have crawlability problems with AJAX, Flash, and JavaScript, so navigation and internal links coded with them may not be followed.
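A quick way to get a feel for this is to look at your page the way a very simple crawler would. The Python sketch below only collects plain anchor links and never runs any script, so anything that navigates through JavaScript stays invisible to it. The class name LinkCollector, the helper crawlable_links, and the sample markup are all made up for this example, and real search bots are more capable than this.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring javascript: pseudo-links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)

def crawlable_links(html):
    """Return the links a script-less crawler could follow in html."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Hypothetical navigation: one plain link, two script-driven ones.
sample = """
<a href="/about.html">About</a>
<a href="javascript:openPage('contact')">Contact</a>
<span onclick="window.location='/hidden.html'">Hidden</span>
"""

print(crawlable_links(sample))  # ['/about.html']
```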
5. Too many variables/parameters in your URLs?
Search engines are getting better at crawling long, 'yucky' links, but hey, they still don't like them.
Google's webmaster guidelines explain it better:
* If you decide to use dynamic pages - i.e., the URL contains a "?" character, be mindful that not
every search engine spider crawls dynamic pages in addition to static pages. It helps to keep the
parameters short and the number of them to a minimum.
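To check how many parameters your own URLs carry, something like the following Python sketch could help. The function name count_query_params and the example.com URLs are assumptions for illustration. The guideline gives no hard limit, so the code only reports the count and leaves the judgment to you.

```python
from urllib.parse import urlsplit, parse_qsl

def count_query_params(url):
    """Return the number of query string parameters in url."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))

# Hypothetical dynamic and static URLs.
dynamic_url = "https://example.com/product.php?id=7&cat=3&sort=asc&ref=home"
static_url = "https://example.com/products/widget.html"

print(count_query_params(dynamic_url))  # 4
print(count_query_params(static_url))   # 0
```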
6. Code Bloat
Generally, spiders are good at distinguishing code from content, but that doesn't mean we should make
it more difficult by having so much code that the content is hard to find. If you examine the source
code of your web pages and finding the content is a bit problematic, then you might really
have problems with crawlability.
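One rough way to put a number on code bloat is to compare the amount of visible text with the total size of the page source. The Python sketch below does that with the standard html.parser module, skipping anything inside script and style tags. The names TextExtractor and content_ratio are made up for this example, and the ratio is only a hint, not a measure any search engine is known to use.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Gather text that sits outside <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def content_ratio(html):
    """Return visible text length divided by total source length."""
    if not html:
        return 0.0
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace so indentation does not count as content.
    text = " ".join("".join(extractor.chunks).split())
    return len(text) / len(html)

# Hypothetical pages: same content, with and without a pile of inline script.
content = "<p>Hello crawlers, this is the content.</p>"
lean_page = "<html><body>" + content + "</body></html>"
bloated_page = (
    "<html><head><script>" + "var x = 1;" * 200 + "</script></head>"
    "<body>" + content + "</body></html>"
)
```

Here content_ratio(lean_page) comes out far higher than content_ratio(bloated_page), which is the kind of gap worth looking for on your own pages.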
PS: If search bots find it difficult to crawl your site(s), then remember that your site may not be found in the search engines.