Truth About Web Crawlers

Wouldn't it be nice to be able to leave some code in your web site that tells the search engine crawlers to make your site number one? Unfortunately, a robots.txt file or robots meta tag won't do that, but they can help crawlers index your site better and keep unwanted ones out.

First, a few definitions:

Search Engine Spiders or Crawlers - A web crawler (also known as a web spider) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
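The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the crawl function, the get_links callback, the max_pages limit (standing in for a crawl policy) and the toy site map are all hypothetical names invented for the example, and no real pages are downloaded.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    get_links(url) must return the hyperlinks found on that page.
    max_pages is a simple stand-in for a real crawl policy.
    """
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # prevents queueing the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page site: each key is a page, each value its links.
TOY_SITE = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/phones", "/about"],
    "/products/phones": [],
}

order = crawl(["/"], lambda url: TOY_SITE.get(url, []))
print(order)
```

A real crawler would replace the get_links callback with code that downloads each page and extracts its links, and would check robots.txt before fetching.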

Robots.txt - The robots exclusion standard or robots.txt protocol is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website.

The robots.txt protocol is purely advisory and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple comparison against the start of the URL path, so care should be taken to make sure that patterns matching directories have the final / character appended. Otherwise all files with names starting with that pattern will match, rather than just those in the directory intended.
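Python's standard urllib.robotparser module applies this kind of matching, so it can be used to see the effect of the trailing slash. In this sketch the is_allowed helper, the ExampleBot crawler name and the two page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The same rule, written without and with the final / character.
WITHOUT_SLASH = "User-agent: *\nDisallow: /duplicate\n"
WITH_SLASH = "User-agent: *\nDisallow: /duplicate/\n"

for label, rules in (("without slash", WITHOUT_SLASH), ("with slash", WITH_SLASH)):
    for path in ("/duplicate/page.html", "/duplicate-notes.html"):
        verdict = "allowed" if is_allowed(rules, "ExampleBot", path) else "blocked"
        print(f"{label}: {path} is {verdict}")
```

Without the slash, the file /duplicate-notes.html is blocked as well, even though it is not inside the directory.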

Meta Tag - Meta tags are used to provide structured data about a page, in other words data about data.

In the early 2000s, search engines veered away from reliance on meta tags, as many web sites used inappropriate keywords or stuffed in keywords to obtain any and all traffic possible.

Some search engines, however, still take meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that cheat by repeating the same keyword several times to get a boost in the search ranking. Instead of going up the rankings, these websites will go down or, on some search engines, will be removed from the index completely.

Index a site - The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell harmful web crawlers to leave your web site alone, and give helpful hints to the ones you want to crawl your site. Below is an example of how to stop a web crawler from searching your site:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # lets you write comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this file first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages).

You can deter crawlers from indexing the duplicate directory by typing this into your robots.txt file:

User-agent: *
Disallow: /duplicate/

The * after User-agent says that this action applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Each group of User-agent and Disallow lines must be separated from the next group by a blank line in order for the file to function correctly. So this is how you would combine the above two commands in a robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
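One way to check how a well-behaved crawler would read this combined file is to feed it to Python's standard urllib.robotparser module. The sketch below assumes a made-up crawler name, ExampleBot, to stand in for every crawler other than ia_archiver, and the page paths are also invented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot is a hypothetical crawler covered only by the * group.
for agent in ("ia_archiver", "ExampleBot"):
    for path in ("/index.html", "/duplicate/page.html"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent} -> {path}: {verdict}")
```

ia_archiver is blocked everywhere, while other crawlers are blocked only inside /duplicate/.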

One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't mention it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, the crawlers usually won't index it anyway.

An alternative way to block indexing is to put a meta tag into the page itself. It looks like this:

<meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. As another example, <meta name="robots" content="noindex,follow"> tells the robot crawlers not to index the page, but to follow the hyperlinks on it.
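To see what a crawler has to do with such a tag, here is a minimal sketch that pulls the directives out of a page using Python's standard html.parser module. The RobotsMetaParser class, the robots_directives function and the sample page are hypothetical and written only for this illustration. Real crawlers may interpret the tag with additional rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up sample page using the second tag from the text above.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex,follow">'
    "</head><body></body></html>"
)

directives = robots_directives(PAGE)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```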

Did You Know That Google Has Its Own Meta Tag?

It looks like this:

<meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your page. You will want this if you update the content on your site frequently, since it prevents web users from seeing outdated content served from the cache.

You can use this meta tag to talk specifically to Google's robots, either to avoid complications or because you are optimizing your site for Google's search engine.

The Overture Keyword Assistant Tool, Highly Inflated Impressions and Phantom Traffic

Websites that are dependent on search engine traffic rely heavily on detailed keyword research to reach their target audience. Whether the resulting information is used for PPC, SEO or featured ads is beside the point. Simply put, if you want to exploit search traffic, you need accurate data on the number of searches carried out for each particular keyword.

Some companies will subcontract the keyword research to a specialist company and others will tackle it in-house. Regardless of who performs the research, a large number of people will primarily use the information provided by the Overture Keyword Assistant as the foundation of the project. I've been of the view for some time that the data Overture provides is often inflated, especially for primary keywords. Recently I have been conducting tests to ascertain the accuracy of Overture's data, in an effort to prove my suspicions and to see how big the problem is. The results so far are way beyond what I expected.

The SEO Research

Approximately one year ago I set up a new website focused on VoIP phone systems (www.ip-phone-system.co.uk). The website was built to rank highly on Yahoo for the search phrase "phone system" and a number of other keyword phrases. According to Overture, the phrase "phone system" has 350,066 searches performed each month in the UK alone. The website is currently on the first page of results on Yahoo.com and in the top three positions for Yahoo's UK-only search.

With the keyword tool reporting this number of searches and the website's position, you would expect the site to be receiving a large volume of traffic. To put it simply, it does not. For example, over the last two months the site has received only three visits from people searching for "phone system".

This test is not conclusive, because the majority of searches for "phone system" could be performed on another engine that Overture pulls its results from, such as MSN. But you would have to agree that it's not very likely, especially when you consider that the site also ranks in the top three positions for the search phrase "phone system" on MSN.

Overture's keyword tool pulls its results from a number of sources, Yahoo and MSN being the largest in terms of traffic. The site has a large number of top-three listings for apparently high-traffic phrases (e.g. IP Phone, Business Phone System, Office Phone System), yet receives only a very small number of visitors.
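To make the size of the gap concrete, the figures quoted above for the phone system phrase can be put side by side. The short calculation below uses only those numbers. The function name is made up, and it assumes the reported monthly figure held for both months of the observation period, which the source does not confirm.

```python
REPORTED_MONTHLY_SEARCHES = 350_066  # Overture figure for the phrase, UK only
OBSERVED_VISITS = 3                  # visits from that phrase
MONTHS_OBSERVED = 2                  # length of the observation window


def visits_per_reported_search(monthly_searches: int, visits: int, months: int) -> float:
    """Share of reported searches that turned into a visit to the site."""
    return visits / (monthly_searches * months)


share = visits_per_reported_search(
    REPORTED_MONTHLY_SEARCHES, OBSERVED_VISITS, MONTHS_OBSERVED
)
total_reported = REPORTED_MONTHLY_SEARCHES * MONTHS_OBSERVED
print(f"Reported searches over the period: {total_reported:,}")
print(f"Visits per reported search: {share:.6%}")
print(f"Roughly one visit per {round(1 / share):,} reported searches")
```

That works out to roughly one visit for every 233,000 reported searches, from a site listed in the top three positions.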

Phantom Traffic

So what's causing the highly inflated number of impressions the tool returns? I can't say for sure, but I can certainly name a few things that could be contributing significantly to the effect. I'm also going to try to coin a phrase here and call the phenomenon Phantom Traffic, which simply means non-genuine traffic: searches conducted for reasons other than a genuine interest in finding a site relating to a keyword's particular theme. I strongly believe both of the examples below are affecting Overture's data and are contributors of phantom traffic.

1. Manual SEO Position Checking

People manually check the search results to ascertain a website's position. Search phrases that are perceived to be high-traffic will, in theory, have more people conducting optimisation and therefore more people manually checking their positions. More people manually checking their positions causes the number of impressions to be inflated (phantom traffic). This is self-perpetuating: the more people check results, the more the number of impressions is inflated, causing even more people to target the phrase and manually check their positions, and so on.

I'm certain this is impacting the Overture Keyword Suggestion tool significantly enough to cause many sites to chase phantom traffic. I also believe this to be the biggest contributing source of phantom traffic. Many webmasters manually check their rankings every day, and some even more often.

2. Auto Generated Pages Compiled from SERPs (Search Engine Results Pages)

Spam sites gather keyword-rich content from the SERPs. These sites automatically query search engines for their most sought-after keywords (probably researched via the Overture keyword suggestion tool) and copy the results pages, which are highly keyword focused. Quite often these sites will auto-generate tens of thousands of pages, all focused on a select number of keyword variations. These keyword-rich pages are normally buried quite deep in the site because they have no value to human visitors. Each page will be linked using rich anchor text and then pass the relevance back to one of the main pages via an anchor text link.

The idea behind this SEO trick is simply to produce large amounts of optimised content that's linked together in a favourable fashion. Some of the programming that goes into these kinds of practices can be very clever, while some is very basic indeed. The problem is that virtually every page being generated (or regenerated, for that matter) is influencing Overture's data, unless the programmer is using an API key, which is unlikely.

Conclusion

This is very worrying to me, as there must be a large number of people who base their entire keyword research campaigns on the data from Overture. This may cause their entire marketing campaign to focus on nothing more than Phantom Traffic. So what can one do to avoid targeting phrases that mainly consist of phantom traffic?

Well, first of all it's wise to use a combination of data sources. Wordtracker provides similar data to Overture, but it's gathered from different sources. Comparing the two data sources can sometimes highlight phantom traffic. If you notice keywords with an extremely high number of impressions, just ask yourself if it's believable. Common sense can go a long way in this game.
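A comparison like this could be automated along the following lines. This is only a sketch of one possible approach, not a method taken from either tool: the flag_suspect_keywords function and its max_ratio threshold are hypothetical, the counts are made-up placeholders and not real Overture or Wordtracker figures, and it assumes the two sets of counts have already been scaled to comparable units.

```python
def flag_suspect_keywords(source_a: dict, source_b: dict, max_ratio: float = 5.0) -> list:
    """Return keywords whose counts differ between the two sources
    by more than max_ratio, sorted alphabetically.

    Only keywords present in both sources are compared.
    """
    suspects = []
    for keyword in sorted(set(source_a) & set(source_b)):
        low, high = sorted((source_a[keyword], source_b[keyword]))
        if high == 0:
            continue  # neither source reports any searches
        if low == 0 or high / low > max_ratio:
            suspects.append(keyword)
    return suspects


# Placeholder counts, invented purely to exercise the function.
first_source = {"keyword a": 100_000, "keyword b": 1_200, "keyword c": 40}
second_source = {"keyword a": 900, "keyword b": 1_000}

suspects = flag_suspect_keywords(first_source, second_source)
print(suspects)
```

A flagged keyword is not proof of phantom traffic, only a prompt to ask whether the larger figure is believable.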

Personally, I've always advised clients to target sub-primary keyword phrases first and, once rankings are achieved, to focus on the next sub-primary phrase. If you intelligently select sub-primary keyword phrases that include the primary keywords, you are optimising the primary keyword at the same time.

Example

A good example of this is the sub-primary keyword phrase "web design Manchester". The company I work for is currently listed on the first page of the major search engines for this phrase. The primary phrase is "web design", and it is being optimised at the same time because the words are contained in "web design Manchester" (we're currently holding position 11 on Yahoo UK for the search phrase "web design"). The search phrase "web design Manchester" is also one of our best performing keywords because it is so targeted. Anyone searching on that term is specifically interested in web design in the Manchester area.

Optimising in this fashion has several benefits. First of all, sub-primary phrases should be less affected by Phantom Traffic, so the number of impressions you see should be similar to the number of genuine searches carried out. Sub-primary phrases also tend to be less competitive, with fewer people specifically optimising for them (however, this is not always the case). So reaching a traffic-generating position is easier and faster, resulting in faster ROI.

By the time enough sub-primary phrases are optimised for the site to rank well for primary keywords, the campaign will already be bringing in targeted traffic. That means much less pain and wasted effort if the primary keyword turns out to be heavily affected by Phantom Traffic.

The other advantage is that, much of the time, sub-primary phrases are more targeted and the traffic they bring tends to convert much better. I have personally seen this time and time again: sites that have little traffic but enjoy a conversion ratio of 1/3 because the traffic they do receive comes from extremely targeted sub-primary keyword phrases. These websites often outperform sites receiving ten times the amount of traffic from primary keywords. It's all down to specifics though, and what works for some may not work for you. As mentioned before, common sense goes a very long way in this game. Just don't get caught up chasing phantom traffic.

 

James Anderson is the author of SEO Forum Watch and is currently working for Podium Solutions, an Internet Marketing Company, as a Search Engine Optimisation Consultant. James has helped many companies conduct keyword research for SEM campaigns.
