What is a search engine robot?
Yes, robots are everywhere. Search engine robots, spiders,
crawlers and bots, that is. Whatever name they go by, there
is really nothing to be frightened of. These robots are simply
programs that a search engine sends out to surf web pages.
The robots 'spider' through the pages they find to help
crawler-based engines index the vast number of pages on the
World Wide Web. Each search engine has its own robot, sometimes
more than one, and they all do things a bit differently. This
article does not focus on any particular robot or search engine
but rather gives you a general overview of robots, spiders
and crawlers.
How search engine robots work
In essence, the robot goes out and visits Site A, where it
reads the metadata and the entire text on the site. And I
mean text! Not images, not text rendered as an image, just
plain old text. The robot also reads the URLs of any outbound
links pointing to Site B and Site C, and will then spider
Site B and Site C as well. As you can see, a good linking
strategy will prove to be very important.
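The reading step described above can be sketched in a few lines
of Python. This is only an illustration, not how any particular
engine's robot is written: the class name PageReader, the sample
page and the example.com-style addresses are all made up for the
sketch, and a real robot would fetch the page over the network
rather than read it from a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageReader(HTMLParser):
    """Hypothetical helper: collects text, meta tags and link URLs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []   # plain text found on the page
        self.meta = {}         # metadata, e.g. the description tag
        self.links = []        # URLs the robot could visit next
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the address of the page itself.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only real text is kept; images contribute nothing here.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


# Made-up page standing in for "Site A".
SAMPLE_PAGE = """
<html><head>
<title>Site A</title>
<meta name="description" content="Ballet shoes for all ages">
</head><body>
<h1>Welcome to Site A</h1>
<img src="logo.gif" alt="logo">
<p>We sell ballet shoes. See <a href="http://site-b.example/">Site B</a>
and <a href="http://site-c.example/shoes.html">Site C</a>.</p>
</body></html>
"""

reader = PageReader("http://site-a.example/")
reader.feed(SAMPLE_PAGE)
page_text = " ".join(reader.text_parts)

print(reader.meta)   # metadata the robot read
print(page_text)     # the plain text of the page
print(reader.links)  # pages to spider next (Site B and Site C)
```

The list in reader.links is what lets the robot move on from
Site A to Site B and Site C, which is why inbound links matter
so much for being found.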
The pages of Site A, B and C are then stored in the search
engine's database, where they will be indexed. So when you
do a search on Google, for instance, you aren't searching
all the documents on the Internet. You are only looking at
the pages that Google has in its index. You can experiment
with this by doing the same search in different engines -
you'll get varying results. For more information on how each
major search engine builds its index see the article "Who
powers whom?".
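To make the point about searching an index rather than the whole
Internet concrete, here is a toy index in Python. It is a sketch
under simple assumptions: the function names build_index and
search are invented for the example, the three pages are made
up, and real engines store far more than a word-to-page mapping.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up pages that the robot is assumed to have crawled already.
crawled_pages = {
    "http://site-a.example/": "We sell ballet shoes",
    "http://site-b.example/": "Ballet classes for beginners",
    "http://site-c.example/": "Running shoes and boots",
}

index = build_index(crawled_pages)
print(search(index, "ballet shoes"))  # only Site A has both words
print(search(index, "tap shoes"))     # empty: no crawled page mentions "tap"
```

A page that was never crawled never makes it into the index, so
no search can return it. Two engines with different indexes will
give different answers to the same query.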
Ah, but you may ask yourself - how does the search engine
ever find out about my site? As stated before, there are just
too many pages out there. The robot will only crawl the pages
it knows about. So you need to let the search engine know
your site is out there. But how? First, use a robots.txt file.
This file is placed in the root directory of the site, and
search engines look for it there. It does not announce your
site by itself, but it tells a visiting robot which parts of
your site you do or do not want spidered. See the robots.txt
file for this site (feel free to use it): it follows the
Robots Exclusion Protocol.
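As an illustration of what such a file says and how a
well-behaved robot reads it, the sketch below uses Python's
standard urllib.robotparser module. The rules shown are a
made-up example, not the actual robots.txt of this site, and
www.example.com is a placeholder address.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot may crawl everything
# except the two listed directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite robot asks before fetching each URL.
home_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
private_allowed = parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html")

print(home_allowed)     # True: not covered by any Disallow rule
print(private_allowed)  # False: under /private/
```

Keep in mind that robots.txt is a request, not a lock. It
relies on the robot choosing to honor it.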
As mentioned earlier, you can build up your link popularity.
This is actually very effective, although it can be a lot
of work. You need to get good quality sites with relevant
content to link back to yours - the more the better. This
will almost always require you to reciprocate by linking your
site to theirs (to check your link popularity try Marketleap's
Link Popularity Tool).
The other way to get the search engines to crawl your site
is by submitting your URL to the search engine directly. You
can go to each engine individually and fill out a form so
that your URL will be visited the next time the robot goes
out. Not all of the engines allow this type of submission
and some have dropped the service. Also, read the fine print
for each search engine's submission rules. Just because you
submitted doesn't mean you will be indexed. Again, you can
see how important a good linking strategy is: with good
inbound links the spiders will find you anyway, without you
submitting to the engines themselves.
Identifying search engine robots and spiders
When a search engine robot is sent to your site, its visit
will be recorded in your site's log file (statistics) the
same way a user's visit is. You can check your log files to
see whether any of these critters have crawled your site and
which pages they requested. The following was pulled from
the log files for this site:
66.249.71.1 - - [04/May/2005:02:12:43 -0400] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
Now I know Google has been here. Some of the robots are quite
easy to identify (as above). See that name in your log files
and you know the Google robot has been to your site for a
visit. Keep in mind that search engines can have more than
one robot and you may see some that you cannot recognize (the
engine may simply be using a new name for its robot); there
are a lot of search engines out there using robots and the
technology changes constantly.
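If your log file is large, a short script can do the looking
for you. The sketch below assumes the widely used "combined"
log format (IP address first, user agent last in quotes); your
server may log differently. The function name identify_robot
is invented for the example, and the list of robot names is
only a small illustrative sample that you would extend yourself.

```python
import re

# Pattern for one line of a "combined" format access log.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative sample of robot names to look for in the user agent.
KNOWN_ROBOTS = ("Googlebot", "Slurp", "msnbot")


def identify_robot(log_line):
    """Return the name of a known robot found in the line, or None."""
    match = LOG_PATTERN.match(log_line)
    if match is None:
        return None  # not a line in the assumed format
    agent = match.group("agent").lower()
    for name in KNOWN_ROBOTS:
        if name.lower() in agent:
            return name
    return None


robot_line = (
    '66.249.71.1 - - [04/May/2005:02:12:43 -0400] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"'
)
# Made-up visit from an ordinary browser, for comparison.
browser_line = (
    '192.0.2.10 - - [04/May/2005:09:30:01 -0400] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)

print(identify_robot(robot_line))    # Googlebot
print(identify_robot(browser_line))  # None
```

A user agent string is just text supplied by the visitor, so a
match tells you what the visitor claims to be, not proof of who
it is.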
The search engine robots have visited, what next?
OK, so you've got your linking strategy happening and/or
you've made manual submissions to individual search engines.
Now what? Once the search engines have indexed the sites,
they use programs (algorithms) to rank them. This is where
good search engine optimization comes into play. You want
your site to appear in the Top 10 of the 200,000 sites that
an engine returns in the search results for the keyword
"ballet shoes" (that is, of course, if you are selling ballet
shoes!). Please read "What is SEO?" for more information on
optimizing your site for the major search engines.
The spider will be back to crawl your site whenever the engine
next sends it out (the engines vary in their scheduling).
Make sure you keep your site updated and add new content as
often as possible (a page a week if you can manage), and make
sure it is keyword rich. Some search engines do have a limit
on the number of pages they will crawl on any one site, so
optimize well. Also, last but not least, be patient. This
is an iterative process and it takes time before you see any
results. If done properly, you can watch your site climb up
the rankings to be placed high in the search engine results
pages (SERPs).
Helpful links about robots:
Checklist for Search Robot Crawling and Indexing
SpiderHunter.com
Includes a tutorial on cloaking scripts and how to track spiders
from search engines.
Botspot
A bot monitor site, with regular updates and links to the
bots' home pages.
6 May 2005