Search engines are the key to finding specific information on the vast expanse of the World Wide Web. Without sophisticated search engines, it would be virtually impossible to locate anything on the Web without knowing a specific URL. But do you know how search engines work? And do you know what makes some search engines more effective than others?
When people use the term search engine in relation to the Web, they are usually referring to the search forms that search through databases of HTML documents initially gathered by a robot.
There are basically three types of search engines: those that are powered by robots (called crawlers, ants, or spiders), those that are powered by human submissions, and those that are a hybrid of the two.
Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, indexing all linked Web sites as well. The crawler returns all that information to a central depository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
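The visit, read, and follow-links cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine is implemented: the `FAKE_WEB` dictionary, its example URLs, and the `crawl` function are all hypothetical stand-ins, and no real network requests are made.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, meta tags, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("Welcome to A", {"keywords": "alpha"},
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("B talks about search", {"keywords": "search"},
                          ["http://a.example/"]),
    "http://c.example/": ("C has no links", {}, []),
}

def crawl(start_url, web=FAKE_WEB):
    """Breadth-first crawl: visit a page, record its content, follow its links."""
    depository = {}              # the "central depository" of gathered pages
    queue = deque([start_url])   # pages waiting to be visited
    while queue:
        url = queue.popleft()
        if url in depository or url not in web:
            continue             # already visited, or the link leads nowhere
        text, meta, links = web[url]
        depository[url] = {"text": text, "meta": meta}
        queue.extend(links)      # follow every link found on the page
    return depository

pages = crawl("http://a.example/")
print(sorted(pages))
```

Skipping URLs that are already in the depository is what keeps this sketch from looping forever when pages link back to each other, as A and B do here.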
Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.
Key Terms To Understanding Web Search Engines
spider trap
A condition of dynamic Web sites in which a search engine’s spider becomes trapped in an endless loop of code.
search engine
A program that searches documents for specified keywords and returns a list of the documents where the keywords were found.
meta tag
A special HTML tag that provides information about a Web page.
deep link
A hyperlink either on a Web page or in the results of a search engine query to a page on a Web site other than the site’s home page.
robot
A program that runs automatically without human intervention.
In both cases, when you query a search engine to locate information, you're actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why sometimes a search on a commercial search engine, such as Yahoo! or Google, will return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still an active link even though it no longer is. It will remain that way until the index is updated.
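To make the difference between searching the index and searching the Web concrete, here is a small sketch of an inverted index, a common structure for this kind of lookup. The function names, the example URLs, and the `snapshot` and `live_web` dictionaries are assumptions made for illustration; real engines use far more elaborate storage.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical snapshot of the Web taken when the crawler last visited.
snapshot = {
    "http://a.example/": "search engines index pages",
    "http://b.example/": "spiders crawl pages",
}
index = build_index(snapshot)

# Later, page B is removed from the live Web, but the index is not updated.
live_web = {"http://a.example/": "search engines index pages"}

hits = search(index, "pages")
dead_links = {url for url in hits if url not in live_web}
print(sorted(hits))
print(sorted(dead_links))
```

The query never touches `live_web`, so the removed page is still returned as a hit until the index is rebuilt from a fresh crawl.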
So why will the same search on different search engines produce different results? Part of the answer is that not all indices are going to be exactly the same. It depends on what the spiders find or what the humans submitted. But more important, not every search engine uses the same algorithm to search through the indices. The algorithm is what the search engines use to determine the relevance of the information in the index to what the user is searching for.
One of the elements that a search engine algorithm scans for is the frequency and location of keywords on a Web page. Pages on which a keyword appears more frequently are typically considered more relevant. But search engine technology is becoming more sophisticated in its attempts to discourage what is known as keyword stuffing, or spamdexing.
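As a rough illustration of scoring by keyword frequency and location, the sketch below counts how often a keyword appears in a page's body and gives extra weight to occurrences in the title. The `keyword_score` function, the `title_weight` of 3.0, and the `max_density` threshold of 0.2 are all assumptions chosen for the example; actual engines do not publish their formulas, and the stuffing check here is deliberately crude.

```python
import re

def keyword_score(title, body, keyword, title_weight=3.0, max_density=0.2):
    """Score a page for one keyword using frequency and location."""
    keyword = keyword.lower()
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    body_words = re.findall(r"[a-z0-9]+", body.lower())
    if not body_words:
        return 0.0
    body_hits = body_words.count(keyword)
    density = body_hits / len(body_words)
    if density > max_density:
        # Assumed rule: an implausibly high keyword density is treated
        # as keyword stuffing and the page gets no credit at all.
        return 0.0
    title_hits = title_words.count(keyword)
    # Location matters: a hit in the title counts more than one in the body.
    return title_weight * title_hits + body_hits

body = "A search engine builds an index of pages so that a search is fast"
print(keyword_score("Search engines", body, "search"))
print(keyword_score("Home", body, "search"))
print(keyword_score("Cheap", "cheap cheap cheap cheap deals", "cheap"))
```

The third call shows the stuffing guard at work: four of the five body words are the keyword, so the page scores zero despite its high raw frequency.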
Another common element that algorithms analyze is the way that pages link to other pages on the Web. By analyzing how pages link to each other, an engine can determine both what a page is about (if the keywords of the linked pages are similar to the keywords on the original page) and whether that page is considered "important" and deserving of a boost in ranking. Just as the technology is becoming increasingly sophisticated at ignoring keyword stuffing, it is also becoming more savvy to Webmasters who build artificial links into their sites in order to build an artificial ranking.
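One well-known family of link analysis techniques passes "importance" from each page to the pages it links to and repeats until the scores settle. The sketch below is a simplified version of that idea, offered as an illustration rather than as any engine's actual ranking algorithm. The `link_scores` function, the `damping` value of 0.85, the iteration count, and the four-page link graph are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Iteratively score pages by the links pointing at them."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}   # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small base score regardless of incoming links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score everywhere.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores

# Hypothetical link graph: pages a, b, and d all link to c; c links to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page c ranks highest because three pages link to it, and page a ranks second because the important page c links to it. A scheme this simple is easy to game with artificial links, which is why engines layer further checks on top.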
Did You Know...
The first tool for searching the Internet, created in 1990, was called "Archie". It downloaded directory listings of all files located on public anonymous FTP servers, creating a searchable database of filenames. A year later, "Gopher" was created. It indexed plain text documents. "Veronica" and "Jughead" came along to search Gopher's index systems. The first actual Web search engine was developed by Matthew Gray in 1993 and was called "Wandex".
Last updated: February 17, 2006
The Web Robots FAQ
Indexed list of frequently asked questions about Web robots.
How Search Engines Work
The term "search engine" is often used generically to describe both
crawler-based search engines and human-powered directories. These two types of
search engines gather their listings in radically different ways.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
In this paper, we present Google, a prototype of a large-scale search engine
which makes heavy use of the structure present in hypertext. Google is designed
to crawl and index the Web efficiently and produce much more satisfying search
results than existing systems.
How Search Engines Rank Web Pages
Search for anything using your favorite crawler-based search engine. Nearly
instantly, the search engine will sort through the millions of pages it knows
about and present you with ones that match your topic. The matches will even be
ranked, so that the most relevant ones come first.
Search Engine Watch - Search Links
Looking for search engines? This section of Search Engine Watch lists some top
choices in various categories.
Robots Exclusion
Sometimes people find they have been indexed by an indexing robot, or that a
resource discovery robot has visited part of a site that for some reason
shouldn't be visited by robots.