SEARCH ENGINES AND THEIR CLASSIFICATION
By: Gaurav Srivastava, Anirban Biswas
MSCLIS (2007-09), IIIT-Allahabad
Search engines are the most popular means for us to find study material. Currently, around 400,000,000 searches are performed per day, on average, by search engines. The basis of every search is its keywords: words or short phrases, typically of five to six words. Every such search is termed a query.
To respond to a query, a search engine retrieves web pages that are relevant to the terms of the query from its large database and displays a list of the results to the user. Recall the number of results whenever we search for anything on Google: the list usually has several million entries. Currently, the total number of web pages available on the internet is more than 2.5 trillion. The results are displayed according to the degree of relevance of their content to the keywords searched.
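To make the notion of keyword relevance concrete, here is a toy Python sketch that scores a few invented pages by how many query keywords they contain and sorts them accordingly. The pages and the counting rule are assumptions made purely for illustration; real engines combine far more ranking signals.

# Toy relevance ranking: score each page by how many query keywords
# it contains, then sort the results by that score (highest first).
pages = {
    "page1.html": "search engines crawl and index the web",
    "page2.html": "a directory is edited by humans",
    "page3.html": "search engines rank pages by relevance to the query",
}

def score(text, keywords):
    """Count how many of the query keywords appear in the page text."""
    words = set(text.lower().split())
    return sum(1 for kw in keywords if kw.lower() in words)

query = ["search", "engines", "relevance"]
for url, text in sorted(pages.items(), key=lambda p: score(p[1], query), reverse=True):
    print(score(text, query), url)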
Next, I would like to discuss the classification of search engines.
Classification of Search Engines
There are four types of search engines:
- crawler-based (traditional, common) engines
- directories (mostly human-edited catalogs)
- hybrid engines (meta engines and those using other engines' results)
- pay-per-performance and paid inclusion engines
Crawler-based Search Engines: Crawler-based search engines are also referred to as spiders or web crawlers. These search engines use special software known as a bot, robot, spider, or crawler. These programs run on the search engine's servers. They browse the web pages already present in the engine's repository and find new web sites by following the links on those pages. Alternatively, if someone has submitted his web pages to a search engine, those pages are placed in a queue to be scanned by a web crawler, so that each page can be found by working through the list of pages in the queue.
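The crawling loop just described can be sketched in a few lines of Python. This is only a minimal illustration: the seed URL is a placeholder, and a production crawler would also honor robots.txt, politeness delays, and much more.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception as err:
            print("skipped", url, "-", err)   # unreachable pages are skipped
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)            # newly discovered page
                queue.append(absolute)        # queue it for scanning
        print("crawled", url, "->", len(parser.links), "links found")

crawl("http://example.com/")   # placeholder seed URL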
When a web crawler has found a page to scan, it retrieves the page via an HTTP request. By going through the log files of a web server, we can determine when a web crawler has visited our pages. The web server returns the HTML source code of the page to the web crawler, which then reads the page; this reading process is termed crawling. After the web crawler reads the pages, it compresses them so that they can be conveniently stored in a gigantic repository of web pages. This massive repository is known as the search engine index. The data stored in the index makes it possible to promptly determine the relevance of a page to the query keywords under consideration. If a page is found to be relevant, it is retrieved and incorporated into the results displayed to the user. The process of inserting a page into the index is termed indexing. After a page has been indexed, it appears on search engine results pages for the words and phrases most common on that page. Later, when someone searches the engine for those terms, the page is retrieved from the index and incorporated into the search results.
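As a minimal sketch of the indexing step, assuming the common inverted-index structure in which every word maps to the set of pages containing it (the sample pages below are invented for illustration):

from collections import defaultdict

index = defaultdict(set)   # word -> set of URLs containing that word

def index_page(url, text):
    """Insert a page into the inverted index ('indexing')."""
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Return the pages that contain every keyword of the query."""
    word_sets = [index[w] for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

index_page("a.html", "web crawlers feed the search engine index")
index_page("b.html", "directories are edited by humans")
print(search("search index"))   # -> {'a.html'}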
Google (http://www.google.com) is an example of a crawler-based search engine.
Human-edited directories: Directory-based search engines are a bit different. Their repository is created by manual submission. Directories strictly require manual submission and therefore use various methods to check whether a page is being submitted manually or automatically. After manual submission of a web page, its URL is placed in a queue for review by a human, usually an editor. The editor of the directory visits the URL and then decides whether the pages are relevant to the queries claimed by the owner of that web site. Crawler-based search engines use directories as a major source of new pages to crawl.
To determine rank, a crawler-based search engine revisits web sites regularly after indexing them and checks for any changes made to the pages; in a directory, by contrast, such decisions rest with the editors and their discretion. Since directories are created by experienced editors, they provide better-filtered results. The best-known and most important directories are Yahoo (www.yahoo.com) and DMOZ (www.dmoz.org).
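The manual-review workflow above can be pictured as a simple queue that an editor works through. The following Python sketch is illustrative only; the rule passed to review() is a stand-in for real editorial judgment.

from collections import deque

submissions = deque()   # URLs awaiting human review
directory = {}          # category -> list of approved URLs

def submit(url, category):
    """A site owner manually submits a URL for review."""
    submissions.append((url, category))

def review(approve):
    """An editor works through the queue, deciding on each submission."""
    while submissions:
        url, category = submissions.popleft()
        if approve(url, category):
            directory.setdefault(category, []).append(url)

submit("http://example.com/notes", "Education")
submit("http://example.com/ads", "Shopping")
review(lambda url, category: "ads" not in url)   # toy editorial rule
print(directory)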
Hybrid engines: A hybrid engine uses both of the aforementioned techniques for building its index and ranking pages. Almost all search engines are hybrid search engines nowadays; Yahoo (http://www.yahoo.com) and Google (http://www.google.com) are in fact hybrid engines. As a rule, a hybrid search engine will favor one type of listing over the other. For example, Yahoo is more likely to present its human-powered listings, and Google its crawled listings. Another example is Live Search (previously known as MSN), which presents human-powered listings from LookSmart but also presents crawler-based results (provided by its own web crawler), especially for more obscure queries.
Meta Search Engines: Meta search engines take the results from a number of search engines and integrate them to give a more complete and better output; that is why meta search engines are also known as multi-engines. A natural-language request is translated into queries for several search engines, and each of those engines is directed to process the user's query. The responses from all the search engines are then merged into one list and displayed to the user. With this approach, it is possible to provide the user with more precise and relevant results for his search.
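A rough Python sketch of this fan-out-and-merge idea follows. The engine names and fetch_results() are hypothetical stubs standing in for real engine APIs, whose endpoints and result formats are not specified here.

from concurrent.futures import ThreadPoolExecutor

ENGINES = ["engine-a", "engine-b", "engine-c"]   # hypothetical back-ends

def fetch_results(engine, query):
    """Stub: a real meta engine would call the engine's HTTP API here."""
    return ["%s/result-for-%s-%d" % (engine, query, i) for i in range(3)]

def meta_search(query):
    merged, seen = [], set()
    with ThreadPoolExecutor() as pool:   # query all engines in parallel
        for results in pool.map(lambda e: fetch_results(e, query), ENGINES):
            for r in results:            # keep first occurrence, drop repeats
                if r not in seen:
                    seen.add(r)
                    merged.append(r)
    return merged

print(meta_search("search engines"))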
Examples of
multi-engines are MetaCrawler (http://www.metacrawler.com)
and DogPile (http://www.dogpile.com).
With this, I have tried to give an idea of search engines and their classification. As IT professionals, we must all be aware of the various types of search engines and their working patterns.