SEARCH ENGINES AND THEIR CLASSIFICATION
By: Gaurav Srivastava, Anirban Biswas
MSCLIS (2007-09), IIIT-Allahabad
Search engines are the most popular means for us to find study material. Currently, around 400,000,000 searches are performed per day, on average, by search engines. The basis of every search is its keywords: words or short phrases, typically of five to six words. Every such search is termed a query.
To respond to a query, a search engine retrieves web pages that are relevant to the terms of the query from its large database and displays a list of the results to the user. Recall the number of results whenever we search for anything on Google: the list usually has several million entries. Currently, the total number of web pages available on the internet is more than 2.5 trillion. The results are displayed according to the degree of relevance of their content to the keywords searched.
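To make the notion of keyword relevance concrete, here is a toy Python sketch that scores a few invented pages by how many query keywords they contain and sorts them accordingly. The pages and the counting rule are assumptions made purely for illustration; real engines combine far more ranking signals.

# Toy relevance ranking: score each page by how many query keywords
# it contains, then sort the results by that score (highest first).
pages = {
    "page1.html": "search engines crawl and index the web",
    "page2.html": "a directory is edited by humans",
    "page3.html": "search engines rank pages by relevance to the query",
}

def score(text, keywords):
    """Count how many of the query keywords appear in the page text."""
    words = set(text.lower().split())
    return sum(1 for kw in keywords if kw.lower() in words)

query = ["search", "engines", "relevance"]
for url, text in sorted(pages.items(), key=lambda p: score(p[1], query), reverse=True):
    print(score(text, query), url)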
Next, I would like to discuss the classification of search engines.
Classification of Search Engines
There are four types of search engines:
- crawler-based (traditional, common) engines
- directories (mostly human-edited catalogs)
- hybrid engines (meta engines and those using other engines' results)
- pay-per-performance and paid inclusion engines
Crawler-based Search Engines: Crawler-based search engines are also referred to as spiders or web crawlers. These search engines use special software known as a bot, robot, spider, or crawler. These programs run on the search engine's servers. They browse the web pages already present in the engine's repository and find new web sites by following the links on those pages. Alternatively, if someone has submitted his web pages to a search engine, those pages are placed in a queue to be scanned by a web crawler, so that each page can be found by working through the list of pages in the queue.
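The crawling loop just described can be sketched in a few lines of Python. This is only a minimal illustration: the seed URL is a placeholder, and a production crawler would also honor robots.txt, politeness delays, and much more.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception as err:
            print("skipped", url, "-", err)   # unreachable pages are skipped
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)            # newly discovered page
                queue.append(absolute)        # queue it for scanning
        print("crawled", url, "->", len(parser.links), "links found")

crawl("http://example.com/")   # placeholder seed URL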
When a web crawler has found a page to scan, it retrieves the page via an HTTP request. By going through the log files of a web server, we can determine when a web crawler has visited our pages. The web server returns the HTML source code of the page to the web crawler, which then reads the page; this reading process is termed crawling. After the web crawler reads the pages, it compresses them so that they can be conveniently stored in a gigantic repository of web pages. This massive repository is known as the search engine index. The data stored in the index makes it possible to promptly determine the relevance of a page to the query keywords under consideration. If a page is found to be relevant, it is retrieved and incorporated into the results displayed to the user. The process of inserting a page into the index is termed indexing. After a page has been indexed, it appears on search engine results pages for the words and phrases most common on that page. Later, when someone searches the engine for those terms, the page is retrieved from the index and incorporated into the search results.
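As a minimal sketch of the indexing step, assuming the common inverted-index structure in which every word maps to the set of pages containing it (the sample pages below are invented for illustration):

from collections import defaultdict

index = defaultdict(set)   # word -> set of URLs containing that word

def index_page(url, text):
    """Insert a page into the inverted index ('indexing')."""
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Return the pages that contain every keyword of the query."""
    word_sets = [index[w] for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

index_page("a.html", "web crawlers feed the search engine index")
index_page("b.html", "directories are edited by humans")
print(search("search index"))   # -> {'a.html'}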
Google (http://www.google.com) is an example of a crawler-based search engine.
Human-edited directories: Directory-based search engines are a bit different. Their repository is created by manual submission. Directories strictly require manual submission and therefore use various methods to check whether a page is being submitted manually or automatically. After manual submission of a web page, its URL is placed in a queue for review by a human, usually an editor. The editor of the directory visits the URL and then decides whether the pages are relevant to the queries claimed by the owner of that web site. Crawler-based search engines use directories as a major source of new pages to crawl.
To determine rank, a crawler-based search engine revisits web sites regularly after indexing them and checks for any changes made to the pages; in a directory, by contrast, such decisions rest with the editors and their discretion. Since directories are created by experienced editors, they provide better-filtered results. The best-known and most important directories are Yahoo (www.yahoo.com) and DMOZ (www.dmoz.org).
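The manual-review workflow above can be pictured as a simple queue that an editor works through. The following Python sketch is illustrative only; the rule passed to review() is a stand-in for real editorial judgment.

from collections import deque

submissions = deque()   # URLs awaiting human review
directory = {}          # category -> list of approved URLs

def submit(url, category):
    """A site owner manually submits a URL for review."""
    submissions.append((url, category))

def review(approve):
    """An editor works through the queue, deciding on each submission."""
    while submissions:
        url, category = submissions.popleft()
        if approve(url, category):
            directory.setdefault(category, []).append(url)

submit("http://example.com/notes", "Education")
submit("http://example.com/ads", "Shopping")
review(lambda url, category: "ads" not in url)   # toy editorial rule
print(directory)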
Hybrid engines: A hybrid engine uses both of the aforementioned techniques for building its index and ranking pages. Almost all search engines are hybrid search engines nowadays; Yahoo (http://www.yahoo.com) and Google (http://www.google.com) are in fact hybrid engines. As a rule, a hybrid search engine will favor one type of listing over the other. For example, Yahoo is more likely to present its human-powered listings, and Google its crawled listings. Another example is Live Search (previously known as MSN), which presents human-powered listings from LookSmart but also presents crawler-based results (provided by its own web crawler), especially for more obscure queries.
Meta Search Engines: Meta search engines take the results from a number of search engines and integrate them to give a more complete and better output; that is why meta search engines are also known as multi-engines. A natural-language request is translated into queries for several search engines, and each of those engines is directed to process the user's query. The responses from all the search engines are then merged into one list and displayed to the user. With this approach, it is possible to provide the user with more precise and relevant results for his search.
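A rough Python sketch of this fan-out-and-merge idea follows. The engine names and fetch_results() are hypothetical stubs standing in for real engine APIs, whose endpoints and result formats are not specified here.

from concurrent.futures import ThreadPoolExecutor

ENGINES = ["engine-a", "engine-b", "engine-c"]   # hypothetical back-ends

def fetch_results(engine, query):
    """Stub: a real meta engine would call the engine's HTTP API here."""
    return ["%s/result-for-%s-%d" % (engine, query, i) for i in range(3)]

def meta_search(query):
    merged, seen = [], set()
    with ThreadPoolExecutor() as pool:   # query all engines in parallel
        for results in pool.map(lambda e: fetch_results(e, query), ENGINES):
            for r in results:            # keep first occurrence, drop repeats
                if r not in seen:
                    seen.add(r)
                    merged.append(r)
    return merged

print(meta_search("search engines"))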
Examples of
multi-engines are MetaCrawler (http://www.metacrawler.com)
and DogPile (http://www.dogpile.com).
With this, I have tried to give an idea of search engines and their classification. As IT professionals, we must all be aware of the various types of search engines and their working patterns.