This is excerpted from our "Commercial User's Guide to the Internet." Subscription information for this periodical is posted at the end of this message.
--- Computer-Generated Indexes ---
Basic Definition: The use of a computer to systematically contact web servers (computers) and generate
a catalog of the information retrieved. This information can then be searched by web users.
The most common search engines query databases collected by "spiders" and
"Web crawlers." Spiders and web crawlers are programs that systematically
collect Web pages across the Net by repeatedly querying their servers. The
process not only consumes an enormous amount of bandwidth but also
frequently crashes the servers these programs hit. Some of these "robots" only index
links, titles or summaries. Others analyze the full text of retrieved Web
pages. Others incorporate Gopher holes, newsgroup postings or FTP listings
into their databases as well.
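The collection process described above can be sketched in a few lines. This is a minimal illustration only, not the code of any engine named in this article: the TOY_WEB dictionary, its example URLs and the crawl function are all hypothetical, and the "Web" lives in memory so that no real server is queried.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (title, list of linked URLs).
# A real spider would fetch each page over the network instead.
TOY_WEB = {
    "http://a.example/": ("Home of A", ["http://a.example/docs", "http://b.example/"]),
    "http://a.example/docs": ("A Docs", ["http://a.example/"]),
    "http://b.example/": ("Home of B", ["http://a.example/docs", "http://c.example/"]),
    "http://c.example/": ("Home of C", []),
}

def crawl(start_url, web, max_pages=100):
    """Visit pages breadth-first from start_url and return a {url: title} catalog."""
    catalog = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(catalog) < max_pages:
        url = queue.popleft()
        page = web.get(url)
        if page is None:          # dead link: nothing to index
            continue
        title, links = page
        catalog[url] = title      # this sketch indexes titles only
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return catalog

catalog = crawl("http://a.example/", TOY_WEB)
print(catalog)
```

The max_pages cap is one simple way to bound the load a robot places on the servers it visits.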
World-Wide Web Worm
The World-Wide Web Worm (WWWW), found at
http://www.cs.colorado.edu/home/mcbryan/WWWW.html, is the ancestor of the
spiders and Web crawlers that currently roam the Net. In its day, the Worm
collected a database of over 100,000 resources. It still provides the user
with a search interface to this extensive database of Web pages, which is
current to March 7, 1994.
The WWWW search engine locates Web pages or uniform resource locators
(URLs) by means of keywords. You can search URL references, all URL
addresses, document titles or document addresses. The latter two databases,
which are much smaller, can be searched more rapidly.
The WWWW search engine allows you to limit the number of hits returned,
from as few as five to as many as 5,000. It supports "AND" or "OR"
Boolean logical operators. It does not distinguish between upper and lower
case.
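A keyword search of this kind might look like the sketch below. It is an assumption-laden illustration, not the Worm's actual code: the search function, the sample TITLES list and the plain substring matching are all hypothetical. It shows only the three behaviors named above: AND/OR combination, a cap on hits, and case-insensitive matching.

```python
# Hypothetical sample of (document title, URL) pairs.
TITLES = [
    ("Internet Commerce Guide", "http://one.example/"),
    ("Gopher and FTP Archives", "http://two.example/"),
    ("COMMERCE on the Web", "http://three.example/"),
]

def search(records, keywords, mode="AND", max_hits=5):
    """Return up to max_hits records whose title contains the keywords.

    mode="AND" requires every keyword; mode="OR" requires at least one.
    Matching ignores upper and lower case.
    """
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    wanted = [k.lower() for k in keywords]
    hits = []
    for title, url in records:
        text = title.lower()
        if combine(k in text for k in wanted):
            hits.append((title, url))
            if len(hits) == max_hits:  # stop once the requested limit is reached
                break
    return hits

and_hits = search(TITLES, ["commerce", "internet"], mode="AND")
or_hits = search(TITLES, ["commerce", "gopher"], mode="OR")
print(and_hits)
print(or_hits)
```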
WebCrawler
WebCrawler, located at http://webcrawler.com/, is a "spider" program. It
roams the Net systematically indexing Web pages. WebCrawler is operated by
America Online as a public service to the Internet.
Since its founding in 1994, the WebCrawler has indexed over 150,000
different Web pages. The remainder of its database, which consists of
tables listing the nearly 1,500,000 other known Web pages, comprises
nearly 100 megabytes of data.
WebCrawler's indexer parses queries into keywords based on spaces and
punctuation. Each word is reduced to lower case. Any endings are stripped.
The terms are checked against a stop list, to see if any are so common as
to be rendered irrelevant. The resulting keywords are fed to the index.
To keep its index small, the WebCrawler does not index certain words it
finds in source files. These include common words, like "WWW" or "Web."
Such terms, contained in nearly every Web document, are not useful words to
query. It also throws out combinations of letters and numbers.
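The query-parsing steps and the indexing filter described above can be sketched as follows. The stop list, the suffix list and all function names are hypothetical, since the WebCrawler's own lists are not given here. The sketch also assumes that "combinations of letters and numbers" means tokens that mix letters with digits.

```python
import re

# Hypothetical stop list and suffix list; the WebCrawler's own are not given.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or", "www", "web"}
SUFFIXES = ("ing", "es", "ed", "s")

def strip_ending(word):
    """Crudely remove one common English ending, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_query(query):
    """Split on spaces and punctuation, lower-case, strip endings, drop stop words."""
    words = re.findall(r"[A-Za-z0-9]+", query)
    stems = [strip_ending(w.lower()) for w in words]
    return [s for s in stems if s not in STOP_WORDS]

def is_indexable(word):
    """Reject stop words and tokens that mix letters with digits (assumed reading)."""
    word = word.lower()
    mixed = any(c.isdigit() for c in word) and any(c.isalpha() for c in word)
    return word not in STOP_WORDS and not mixed

keywords = parse_query("Searching the Web for spiders, crawlers!")
print(keywords)
print([w for w in ["spider", "Web", "x86", "1995"] if is_indexable(w)])
```

A production indexer would use a proper stemming algorithm rather than this crude suffix list, which is kept short here only for illustration.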
The WebCrawler's search engine supports limited Boolean logic. Individual
keywords may be combined with the AND/OR Boolean operators. The user is
also allowed to limit the number of hits returned for each query. A
successful WebCrawler search returns a list of links. Each is characterized
by a rating for relevance, on a scale of zero to 1,000, where the higher
the number the greater the document's relevance to your search.
The WebCrawler page provides a form for the submission of URLs for
inclusion in the WebCrawler database. There are also hot links to a
completely random set of sites and the 25 most-accessed URLs. Both change
periodically.
Lycos
The term "Lycos" comes from the arachnid family Lycosidae, which are large
ground spiders that are very speedy and active at night. These predators
catch their prey by pursuit rather than in a web. Lycos, located at
http://lycos.cs.cmu.edu/, lives up to its namesake. Rather than catching
URLs on a server in a single massive sweep, Lycos uses an innovative,
probabilistic scheme to skip from server to server.
In building its database, Lycos starts with a given Web page and
systematically collects its title; headings and subheadings; the 100 most
"weighty" words; the first 20 lines; size in bytes; and number of words.
After adding all links mentioned in the Web page to its queue, Lycos
chooses another document to explore. It jumps randomly among the http, Gopher
and FTP links in its queue. Lycos prefers documents that are the object of
multiple links and that possess shorter URLs. This serves to focus its
database on popular pages at the top of each server's file hierarchy. By
the end of August 1995, Lycos had amassed over 634,000 references in its
database, making it one of the largest on the Net.
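The two steps described above, summarizing a page and then choosing the next link at random with a bias, might be sketched like this. Everything here is a hedged illustration rather than Lycos's actual method. The function names are hypothetical, word "weight" is assumed to be plain frequency, and the bias formula (inbound links divided by URL length) is an invented stand-in for whatever scheme Lycos really uses.

```python
import random
import re
from collections import Counter

def build_record(title, headings, text, top_n=100, first_lines=20):
    """Collect the per-page fields listed above (word weight = frequency here)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        "title": title,
        "headings": list(headings),
        "weighty_words": [w for w, _ in Counter(words).most_common(top_n)],
        "first_lines": text.splitlines()[:first_lines],
        "size_bytes": len(text.encode("utf-8")),
        "word_count": len(words),
    }

def weight(url, inlinks):
    """Hypothetical preference: more inbound links and shorter URLs score higher."""
    return (1 + inlinks.get(url, 0)) / len(url)

def pick_next(queue, inlinks, rng=random):
    """Choose the next URL at random, biased by weight()."""
    urls = list(queue)
    return rng.choices(urls, weights=[weight(u, inlinks) for u in urls], k=1)[0]

record = build_record("Wolf Spiders", ["Habits"],
                      "Spiders hunt at night.\nSpiders do not spin webs.\n")
queue = ["http://a.example/", "http://a.example/deep/path/page.html"]
inlinks = {"http://a.example/": 12, "http://a.example/deep/path/page.html": 1}
print(record["weighty_words"][:3], record["word_count"])
print(pick_next(queue, inlinks, rng=random.Random(0)))
```

Because the choice is random rather than strictly ordered, a long or rarely linked URL can still be picked, only less often.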
The Lycos search interface accommodates keyword queries. Once your search
results are generated, you can examine document outlines, keyword lists and
excerpts. In this way, you can determine the value of a document without
having to retrieve it. The interface supports AND/OR operators. You are
also given the option to have either a verbose or terse set of search
results. The total number of links returned can also be limited. Search
hits are ranked on a score of 1,000 to zero, depending on their relevance
to the search terms.
Harvest Home Pages
The Home Pages Harvest Broker, located at
http://town.hall.org/Harvest/brokers/www-home-pages/, is an index of over
45,000 documents. This database has a flexible interface, providing search
queries based on author, keyword, title or URL. The search engine allows
the user to customize output for each result, including displaying WAIS
rankings, object descriptions and links to indexed content summary data.
The user may also specify the maximum number of results allowed. The
Harvest database, while currently small, bears watching as it is growing
rapidly.
Open Text
Open Text, found at http://www.opentext.com:8080/omw.html, claims to be
"the index of the Internet." Sponsored by UUNET of Canada, Open Text's Web
crawler scans the Internet, following links, moving from Web page to Web
page, noting each page's location. A proprietary program indexes every word
of each page, building a huge, dynamic map of the Web. As of early August
1995, nearly one million pages had been indexed. Each day tens of thousands
more are added. In addition to continually discovering new resources, Open
Text's Web crawler also revisits indexed pages to check changes in content
and the operability of links.
To give you an idea of the size of its database, Open Text's Web Index
contains about 765 million words of text and 14,638,581 hyperlinks. In the
most recent update, 22,426 pages were changed: 10,658 were deleted because
they were no longer on the Web, and 11,768 were replaced with changed
versions. Another 6,490 pages, when revisited, were found unchanged, and
21,760 new pages were added.
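The update figures above can be cross-checked with a little arithmetic. The deleted and replaced pages together account exactly for the pages reported as changed. The two derived totals in this sketch are our own reading of the figures, not numbers published by Open Text.

```python
deleted = 10_658    # pages no longer on the Web
replaced = 11_768   # pages replaced with changed versions
changed = 22_426    # pages reported as changed
unchanged = 6_490   # pages revisited and found unchanged
added = 21_760      # new pages

# Deleted plus replaced pages account for every changed page.
assert deleted + replaced == changed

# Derived totals (an assumed reading of the figures):
revisited = changed + unchanged  # pages revisited in this update
net_growth = added - deleted     # net increase in indexed pages
print(revisited, net_growth)
```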
The Open Text database interface allows you to search every word of each
indexed Web page for any particular word, phrase or combination thereof.
You can also search FTP directories and Gopher servers. Newsgroups will
soon be added. Open Text gives you a choice of simple, power or weighted
search modes. It can search either the text, summary, title, first heading
or hyperlinks for keywords. Open Text fully implements Boolean operators.
In addition, it supports "fuzzy" logic concepts such as "but not,"
"near" or "followed by." In sum, Open Text offers Web users both the most
comprehensive collection of Web resources and the most powerful search
engine.
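Operators such as "but not," "near" and "followed by" can be built on an index of word positions. The sketch below is a minimal illustration under stated assumptions, not Open Text's proprietary program: the function names are hypothetical, and the five-word default distance is an arbitrary choice.

```python
import re

def positions(text):
    """Map each lower-cased word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(word, []).append(pos)
    return index

def near(index, a, b, distance=5):
    """True if a and b occur within `distance` words of each other, in any order."""
    return any(abs(i - j) <= distance
               for i in index.get(a, []) for j in index.get(b, []))

def followed_by(index, a, b, distance=5):
    """True if b occurs after a, within `distance` words."""
    return any(0 < j - i <= distance
               for i in index.get(a, []) for j in index.get(b, []))

def but_not(index, a, b):
    """True if the document contains a and does not contain b."""
    return a in index and b not in index

page = positions("Open Text indexes every word of each Web page")
print(near(page, "page", "word"))         # within five words, either order
print(followed_by(page, "page", "word"))  # "word" does not come after "page"
print(but_not(page, "web", "gopher"))     # has "web", lacks "gopher"
```

Storing positions rather than mere word presence is what makes the proximity operators possible. A plain keyword index could support "but not" but neither "near" nor "followed by."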
Infoseek
At Infoseek, http://www.infoseek.com/, you can search and retrieve articles
from over 80 computer periodicals, over 10,000 USENET newsgroups, over
400,000 WWW pages, mailing list archives and a variety of other
publications. Infoseek gathers information in two ways. The bulk of its
information is derived from a very fast Web crawler. Infoseek also
negotiates aggressively with list owners, electronic magazines and other
online publications for the rights to their archives.
In addition to Net resources, the following publications are available
through Infoseek: Computer Reseller News, Computer Retail Week,
Computerworld, Communications Week, Communications Week International,
Electronic Buyer's News, Electronic Engineering Times, Home PC, Information
Week, Interactive Age, Network Computing, NetGuide, NewsBytes, OEM
Magazine, On & About AT&T, VAR Business, WINDOWS Magazine and Work-Group
Computing Report. Infoseek also provides access to Associated Press Online,
PR Newswire, Business Wire, Newsbytes News and The Reuters Business Report.
Hoover's Company Profiles, the MDX Health Digest, Cineman Entertainment
Reviews and FrameMaker Help Notes are also available. Infoseek provides
excellent access to Usenet news up to five weeks old.
Infoseek is a for-profit service. The standard monthly fee of $9.95
includes 100 free transactions. Each query and each document read counts as
a transaction. Additional transactions cost 10 cents apiece. Infoseek
allows a one-month free trial subscription.
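To estimate a monthly bill under the pricing above, the calculation can be sketched as follows. The function names are hypothetical, and amounts are kept in whole cents to avoid rounding surprises.

```python
def monthly_cost_cents(transactions, base_cents=995, included=100, extra_cents=10):
    """Monthly fee in cents: the base fee plus 10 cents per transaction over 100."""
    if transactions < 0:
        raise ValueError("transactions cannot be negative")
    extra = max(0, transactions - included)
    return base_cents + extra * extra_cents

def as_dollars(cents):
    """Format a whole number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"

print(as_dollars(monthly_cost_cents(60)))   # within the included 100
print(as_dollars(monthly_cost_cents(130)))  # 30 extra transactions
```

Because each query and each document read counts separately, a user who runs 65 queries and reads 65 documents has made 130 transactions, which comes to $12.95 for the month.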
NIKOS
NIKOS, at http://www.rns.com/cgi-bin/imagemap/common_bar?459,18, is a Web
database sponsored by Rockwell International. NIKOS wanders the Web
constantly to gather the latest information on Web pages. NIKOS indexes
both Web links and pages. It supports multiple keyword searches. Search
terms are treated conjunctively: increasing the number of words narrows the
search.
ALIWEB
ALIWEB, located at http://Web.nexor.co.uk/public/aliWeb/aliWeb.html, takes
a fundamentally different approach to indexing the World Wide Web than the
other search engines described above. In a manner similar to Archie, ALIWEB
encourages Web authors to post a description of their page in a file
accessible to the Net. (See Section 144.) After regularly retrieving these
files, ALIWEB combines the descriptions into a searchable database.
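The combining step can be sketched as below. This is an illustration under assumptions, not ALIWEB's actual code or file format: it assumes each description file holds simple "Field: value" records separated by blank lines, and the field names, sample files and function names are all hypothetical.

```python
# Hypothetical description files, keyed by the site that published them.
SITE_FILES = {
    "a.example": "Title: Widget Catalog\nURI: http://a.example/widgets.html\n"
                 "Description: Prices and photos of industrial widgets\n",
    "b.example": "Title: Trade Law Notes\nURI: http://b.example/law.html\n"
                 "Description: Summaries of export regulations\n\n"
                 "Title: Staff Directory\nURI: http://b.example/staff.html\n"
                 "Description: Contact details for the law office\n",
}

def parse_descriptions(text):
    """Parse blank-line-separated 'Field: value' records into dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():           # a blank line ends the current record
            if current:
                records.append(current)
                current = {}
            continue
        field, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            current[field.strip().lower()] = value.strip()
    if current:
        records.append(current)
    return records

def build_database(site_files):
    """Combine the records from every site's file into one list."""
    database = []
    for text in site_files.values():
        database.extend(parse_descriptions(text))
    return database

def search(database, keyword):
    """Return records whose fields mention the keyword, ignoring case."""
    keyword = keyword.lower()
    return [r for r in database if keyword in " ".join(r.values()).lower()]

database = build_database(SITE_FILES)
print(len(database))
print([r["title"] for r in search(database, "law")])
```

Only one small file per site is retrieved in this scheme, which is why the approach uses far less bandwidth than crawling every page.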
ALIWEB has many advantages over other search engines. It does not consume
as much bandwidth as even the most elementary Web crawler or spider. Its
unique organization allows ALIWEB to be updated daily. Because authors
write their own descriptions of their work, the information in ALIWEB is
accurate and informative. A prototype method of linking a Harvest-style
spider with the ALIWEB database is now under development. ALIWEB is a
public service provided by NEXOR.
------------------------------------------------------------------
SUBSCRIPTION INFORMATION
The above article is an excerpt from Thompson Publishing Group's unique
information service that never goes out of date, the Commercial User's
Guide to the Internet. The Guide is the kind of resource business
professionals need to help them understand how to take the best advantage
of the fast-growing use of the Internet and the World Wide Web for commerce
and communication. For any library, particularly one serving a business
community, the Commercial User's Guide is an essential source of
always-current and practical how-to guidance on incorporating the Internet
into everyday business practices. From connection options to building a Web
site, from gathering competitive intelligence to marketing a Web site, the
Guide is easily the only comprehensive reference for the commercial user.
SPECIAL OFFER TO LIBRARIANS
Your subscription to the Commercial User's Guide to the Internet includes
a loose-leaf manual with over 600 well-organized pages. Included in the
manual are directories of Internet resources, organized by topic, of
special interest to the business user. Every month, you also receive new
and replacement pages for the manual so it never goes out of date. Plus,
you get a 12-page monthly newsletter featuring case studies of how others
are using the Internet, new technologies, legal issues and much more.
You may order the Guide for a 30-day risk-free approval period. Use and
review the Guide for a full month. If you'd like to continue to receive the
monthly updates and newsletters, honor the invoice that will be enclosed.
Otherwise, return the materials with the invoice marked "no thanks" and be
under no further obligation. It's that simple. See for yourself! Order
today. Call toll-free, 1-800-677-3789. Refer to priority code CXT and get
10% off the regular subscription price of $298. You pay only $268 (plus
applicable sales tax).
For further information, send inquiries by e-mail to: INET@thompson.com or
call 1-800-677-3789.
Dr. Andrew Lightman
Email: lightman@cais.com
Copyright (c) 1995 Thompson Publishing Group
------------------------------------------------------------------
Editor, Commercial User's Guide to the Internet
Thompson Publishing Group
(202) 739-9541
alightman@thompson.com
URL: http://www.thompson.com