We at Thompson Publishing hope you find the enclosed helpful in your forays into cyberspace.

This is excerpted from our "Commercial User's Guide to the Internet." Subscription information for this periodical is posted at the end of this message.

--- Computer-Generated Indexes ---

Basic Definition: The use of a computer to systematically contact web servers (computers) and generate a catalog of the information retrieved. This information can then be searched by web users.

The most common search engines query databases collected by "spiders" and "Web crawlers." Spiders and Web crawlers are programs that systematically collect Web pages across the Net by repeatedly querying their servers. The process not only occupies an enormous amount of bandwidth but also frequently crashes the servers these programs hit. Some of these "robots" index only links, titles or summaries. Others analyze the full text of retrieved Web pages. Still others incorporate Gopher holes, newsgroup postings or FTP listings into their databases as well.
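To make the idea concrete, here is a minimal Python sketch of what a spider does: fetch a page, record its title, and queue every link it finds. It is an illustration only, not the code of any engine described here. The crawl function, the LinkAndTitleParser class and the three-page FAKE_WEB dictionary are hypothetical names, and the sketch reads pages from memory rather than contacting real servers.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: title} for every page reached."""
    catalog = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(catalog) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # dead link: nothing to index
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        catalog[url] = parser.title.strip()
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return catalog


# Hypothetical three-page "web" held in memory instead of on real servers.
FAKE_WEB = {
    "http://a.example/": '<title>Page A</title><a href="http://b.example/">b</a>',
    "http://b.example/": (
        '<title>Page B</title><a href="http://a.example/">a</a>'
        '<a href="http://c.example/">c</a>'
    ),
    "http://c.example/": "<title>Page C</title>",
}

catalog = crawl("http://a.example/", FAKE_WEB.get)
```

The max_pages cap and the seen set are the sketch's only safeguards against endless crawling. A real spider would also need to pace its requests, which is exactly the bandwidth and server-load problem noted above.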



World-Wide Web Worm

The World-Wide Web Worm (WWWW), found at http://www.cs.colorado.edu/home/mcbryan/WWWW.html, is the ancestor of the spiders and Web crawlers that currently roam the Net. In its day, the Worm collected a database of over 100,000 resources. It still provides the user with a search interface to this extensive database of Web pages, which is current to March 7, 1994.

The WWWW search engine locates Web pages or uniform resource locators (URLs) by means of keywords. You can search URL references, all URL addresses, document titles or document addresses. The latter two databases, which are much smaller, can be searched more rapidly.

The WWWW search engine allows you to limit the number of hits returned, from as few as five to as many as 5,000. It supports "AND" or "OR" Boolean logical operators. It does not distinguish between upper and lower case.
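The three search options just described (a hit limit, AND/OR logic and case-insensitive matching) can be sketched in a few lines of Python. This is not the Worm's actual code; the keyword_search function and its parameters are hypothetical, and the sketch assumes a simple scan over a dictionary of document titles.

```python
def keyword_search(titles, keywords, operator="AND", max_hits=5):
    """Return up to max_hits (url, title) pairs whose title matches the keywords.

    Matching ignores case. With "AND" every keyword must appear;
    with "OR" at least one must appear.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    wanted = [k.lower() for k in keywords]
    hits = []
    for url, title in titles.items():
        text = title.lower()
        if combine(word in text for word in wanted):
            hits.append((url, title))
            if len(hits) == max_hits:
                break
    return hits
```

Calling keyword_search(titles, ["BUSINESS", "guide"]) would return only the titles containing both words in any mix of upper and lower case, while passing operator="OR" widens the search to titles containing either one.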

WebCrawler

WebCrawler, located at http://webcrawler.com/, is a "spider" program. It roams the Net systematically indexing Web pages. WebCrawler is operated by America Online as a public service to the Internet.

Since its founding in 1994, the WebCrawler has indexed over 150,000 different Web pages. The remainder of the WebCrawler's database, containing tables of the remaining known Web pages amounting to nearly 1,500,000, comprises nearly 100 megabytes of data.

WebCrawler's indexer parses queries into keywords by splitting them at spaces and punctuation. Each word is reduced to lower case. Any endings are stripped. The terms are checked against a stop list, to see if any are so common as to be rendered irrelevant. The resulting keywords are fed to the index.

To keep its index small, the WebCrawler does not index certain words it finds in source files. These include common words, like "WWW" or "Web." Such terms, contained in nearly every Web document, are not useful words to query. It also throws out combinations of letters and numbers.
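The steps above (split on spaces and punctuation, lower-case each word, strip endings, check a stop list, discard letter-and-number combinations) can be sketched as follows. The stop list, the suffix list and the function names are hypothetical, chosen only for illustration; the source does not say which words or endings WebCrawler actually uses. The sketch also assumes that "combinations of letters and numbers" means tokens mixing both, such as "RS232".

```python
import re

# Hypothetical stop list and suffix list, chosen only for illustration.
STOP_WORDS = {"www", "web", "the", "a", "of", "and"}
SUFFIXES = ("ing", "ed", "es", "s")


def strip_ending(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_keywords(text):
    """Split on spaces and punctuation, lower-case, strip endings, and filter."""
    keywords = []
    for token in re.split(r"[^A-Za-z0-9]+", text):
        if not token:
            continue
        word = token.lower()
        # Discard tokens that mix letters and digits, such as "rs232".
        if re.search(r"[a-z]", word) and re.search(r"[0-9]", word):
            continue
        word = strip_ending(word)
        if word in STOP_WORDS:
            continue
        keywords.append(word)
    return keywords
```

Applying the same function to both queries and page text keeps the two in step: a query for "Spiders" and a page containing "spider" reduce to the same keyword.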

The WebCrawler's search engine supports limited Boolean logic. Individual keywords may be combined with the AND/OR Boolean operators. The user is also allowed to limit the number of hits returned for each query. A successful WebCrawler search returns a list of links. Each is characterized by a rating for relevance, on a scale of zero to 1,000, where the higher the number the greater the document's relevance to your search.
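The source does not say how WebCrawler computes its relevance rating, so the following Python sketch shows only one plausible way to produce a zero-to-1,000 scale. It assumes a rating based on how often the keywords occur per word of text, scaled so that the best hit scores 1,000. The rank_hits function and this formula are assumptions, not WebCrawler's method.

```python
def rank_hits(documents, keywords, max_hits=10):
    """Return (score, url) pairs, best first, with scores scaled to 0..1000."""
    wanted = [k.lower() for k in keywords]
    raw = {}
    for url, text in documents.items():
        words = text.lower().split()
        count = sum(words.count(k) for k in wanted)
        if count:
            # Keyword occurrences per word of text, so long pages gain no edge.
            raw[url] = count / len(words)
    if not raw:
        return []
    best = max(raw.values())
    scored = [(round(1000 * value / best), url) for url, value in raw.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:max_hits]
```

Pages that contain none of the keywords are left out of the list entirely, and max_hits plays the role of the user's limit on the number of hits returned.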

The WebCrawler page provides a form for the submission of URLs for inclusion in the WebCrawler database. There are also hot links to a completely random set of sites and the 25 most-accessed URLs. Both change periodically.

Lycos

The term "Lycos" comes from the arachnid family Lycosidae, which are large ground spiders that are very speedy and active at night. These predators catch their prey by pursuit rather than in a web. Lycos, located at http://lycos.cs.cmu.edu/, lives up to its namesake. Rather than catching URLs on a server in a massive single sweep, Lycos uses an innovative, probabilistic scheme to skip from server to server.

In building its database, Lycos starts with a given Web page and systematically collects its title; headings and subheadings; the 100 most "weighty" words; the first 20 lines; size in bytes; and number of words. After adding all links mentioned in the Web page to its queue, Lycos chooses another document to explore. It jumps randomly among the HTTP, Gopher and FTP links in its queue. Lycos prefers documents that are the object of multiple links and that possess shorter URLs. This serves to focus its database on popular pages at the top of each server's file hierarchy. By the end of August 1995, Lycos had amassed over 634,000 references in its database, making it one of the largest on the Net.
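A weighted random choice is one simple way to picture how a spider might jump through its queue while favoring popular documents with short URLs. The Python sketch below is an assumption made for illustration: the link_weight formula (inbound links divided by URL length) and the function names are hypothetical, since the source does not give the details of the Lycos scheme.

```python
import random


def link_weight(url, inbound_links):
    """Higher weight for URLs cited by many pages and for shorter URLs."""
    return inbound_links / len(url)


def choose_next(queue, rng=random):
    """Pick one queued URL at random, biased by link_weight.

    queue maps each URL to the number of pages seen linking to it.
    """
    urls = list(queue)
    weights = [link_weight(url, queue[url]) for url in urls]
    return rng.choices(urls, weights=weights, k=1)[0]
```

With a queue such as {"http://a.example/": 5, "http://a.example/deep/dir/page.html": 1}, the first URL is chosen far more often, yet the second still has a chance, so the spider keeps moving from server to server instead of sweeping one exhaustively.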

The Lycos search interface accommodates keyword queries. Once your search results are generated, you can examine document outlines, keyword lists and excerpts. In this way, you can determine the value of a document without having to retrieve it. The interface supports AND/OR operators. You are also given the option to have either a verbose or terse set of search results. The total number of links returned can also be limited. Search hits are ranked on a score of 1,000 to zero, depending on their relevance to the search terms.

Harvest Home Pages

The Home Pages Harvest Broker, located at http://town.hall.org/Harvest/brokers/www-home-pages/, is an index of over 45,000 documents. This database has a flexible interface that supports queries by author, keyword, title or URL. The search engine allows the user to customize output for each result, including displaying WAIS rankings, object descriptions and links to indexed content summary data. The user may also specify the maximum number of results allowed. The Harvest database, while currently small, bears watching as it is growing rapidly.

Open Text

Open Text, found at http://www.opentext.com:8080/omw.html, claims to be "the index of the Internet." Sponsored by UUNET of Canada, Open Text's Web crawler scans the Internet, following links, moving from Web page to Web page, noting each page's location. A proprietary program indexes every word of each page, building a huge, dynamic map of the Web. As of early August 1995, nearly one million pages had been indexed. Each day tens of thousands more are added. In addition to continually discovering new resources, Open Text's Web crawler also revisits indexed pages to check changes in content and the operability of links.

To give you an idea of the size of its database, Open Text's Web Index contains about 765 million words of text and 14,638,581 hyperlinks. In the most recent update, 22,426 pages were changed: 10,658 were deleted because they were no longer on the Web, and 11,768 were replaced with changed versions. Another 6,490 pages, when revisited, were found unchanged, and 21,760 new pages were added.
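The update tallies above (deleted, replaced, unchanged, new) suggest the kind of bookkeeping a crawler performs when it revisits pages. Open Text's program is proprietary, so the Python sketch below is only a guess at the general technique: it assumes the index stores a fingerprint of each page and compares it with a fingerprint of the freshly fetched text. The function names and the use of SHA-256 are assumptions.

```python
import hashlib


def digest(text):
    """Fingerprint of a page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_update(index, fetched):
    """Compare stored fingerprints with freshly fetched pages.

    index   maps url -> fingerprint recorded on the previous visit
    fetched maps url -> page text, or None if the page is gone
    Returns a dict of counts: deleted, replaced, unchanged, new.
    """
    counts = {"deleted": 0, "replaced": 0, "unchanged": 0, "new": 0}
    for url, text in fetched.items():
        if url not in index:
            if text is not None:
                counts["new"] += 1
        elif text is None:
            counts["deleted"] += 1
        elif digest(text) != index[url]:
            counts["replaced"] += 1
        else:
            counts["unchanged"] += 1
    return counts
```

Storing a fingerprint rather than the full text lets the crawler detect a change without keeping an old copy of every page.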

The Open Text database interface allows you to search every word of each indexed Web page for any particular word, phrase or combination thereof. You can also search FTP directories and Gopher servers. Newsgroups will soon be added. Open Text gives you a choice of simple, power or weighted search modes. It can search either the text, summary, title, first heading or hyperlinks for keywords. Open Text fully implements Boolean operators. In addition, it supports "fuzzy" logic concepts such as "but not," "near" and "followed by." In sum, Open Text offers Web users both the most comprehensive collection of Web resources and the most powerful search engine.

Infoseek

At Infoseek, http://www.infoseek.com/, you can search and retrieve articles from over 80 computer periodicals, over 10,000 USENET newsgroups, over 400,000 WWW pages, mailing list archives and a variety of other publications. Infoseek gathers information in two ways. The bulk of its information is derived from a very fast Web crawler. Infoseek also negotiates aggressively with list owners, electronic magazines and other online publications for the rights to their archives.

In addition to Net resources, the following publications are available through Infoseek: Computer Reseller News, Computer Retail Week, Computerworld, Communications Week, Communications Week International, Electronic Buyer's News, Electronic Engineering Times, Home PC, Information Week, Interactive Age, Network Computing, NetGuide, NewsBytes, OEM Magazine, On & About AT&T, VAR Business, WINDOWS Magazine and Work-Group Computing Report. Infoseek also provides access to Associated Press Online, PR Newswire, Business Wire, Newsbytes News and The Reuters Business Report. Hoover's Company Profiles, the MDX Health Digest, Cineman Entertainment Reviews and FrameMaker Help Notes are also available. Infoseek provides excellent access to Usenet news up to five weeks old.

Infoseek is a for-profit service. The standard monthly fee of $9.95 includes 100 free transactions. Each query and each document read counts as a transaction. Additional transactions cost 10 cents apiece. Infoseek allows a one-month free trial subscription.
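If you want to estimate a monthly bill, the pricing above reduces to a short calculation. The Python sketch below uses the figures given in this section; the function and constant names are hypothetical, and amounts are kept in whole cents to avoid rounding surprises.

```python
MONTHLY_FEE_CENTS = 995        # $9.95 standard monthly fee
INCLUDED_TRANSACTIONS = 100    # transactions covered by the fee
EXTRA_TRANSACTION_CENTS = 10   # each additional transaction


def monthly_bill_cents(queries, documents_read):
    """Total monthly charge in cents; each query or document read is one transaction."""
    transactions = queries + documents_read
    extra = max(0, transactions - INCLUDED_TRANSACTIONS)
    return MONTHLY_FEE_CENTS + extra * EXTRA_TRANSACTION_CENTS
```

For example, a month with 100 queries and 50 documents read comes to 150 transactions, 50 of them beyond the included 100, for a total of 1,495 cents, or $14.95.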

NIKOS

NIKOS, at http://www.rns.com/cgi-bin/imagemap/common_bar?459,18, is a Web database sponsored by Rockwell International. NIKOS wanders the Web constantly to gather the latest information on Web pages. NIKOS indexes both Web links and pages. It supports multiple keyword searches. Search terms are treated conjunctively: increasing the number of words narrows the search.

ALIWEB

ALIWEB, located at http://Web.nexor.co.uk/public/aliWeb/aliWeb.html, takes a fundamentally different approach to indexing the World Wide Web from the other search engines described above. In a manner similar to Archie, ALIWEB encourages Web authors to post a description of their page in a file accessible to the Net. (See Section 144.) After regularly retrieving these files, ALIWEB combines the descriptions into a searchable database.
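The approach can be pictured with a short Python sketch: each site publishes a file of page descriptions, and the indexer merges those files into one searchable list. The record layout used here (blank-line-separated records of "Field: value" lines, with Title, URL and Description fields) is a simplification assumed for illustration, and the function names are hypothetical; this section does not specify the file format ALIWEB expects.

```python
def parse_descriptions(text):
    """Parse blank-line-separated records of 'Field: value' lines."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # Split at the first colon only, so URLs keep their own colons.
            field, sep, value = line.partition(":")
            if sep:
                record[field.strip().lower()] = value.strip()
        if record:
            records.append(record)
    return records


def build_database(description_files):
    """Merge the records from every site's description file into one list."""
    database = []
    for text in description_files:
        database.extend(parse_descriptions(text))
    return database


def search(database, keyword):
    """Return URLs of records whose title or description mentions keyword."""
    needle = keyword.lower()
    return [
        record.get("url", "")
        for record in database
        if needle in record.get("title", "").lower()
        or needle in record.get("description", "").lower()
    ]
```

Because the indexer retrieves one small file per site instead of every page, the work grows with the number of participating sites rather than with the number of pages on the Web.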

ALIWEB has many advantages over other search engines. It does not consume as much bandwidth as even the most elementary Web crawler or spider. Its unique organization allows ALIWEB to be updated daily. Because authors write their own descriptions of their work, the information in ALIWEB is accurate and informative. A prototype method of linking a Harvest-style spider with the ALIWEB database is now under development. ALIWEB is a public service provided by NEXOR.

------------------------------------------------------------------
Copyright (c) 1995 Thompson Publishing Group
------------------------------------------------------------------

SUBSCRIPTION INFORMATION

The above article is an excerpt from Thompson Publishing Group's unique information service that never goes out of date, the Commercial User's Guide to the Internet. The Guide is the kind of resource business professionals need to help them understand how to take the best advantage of the fast-growing use of the Internet and the World Wide Web for commerce and communication. For any library, particularly one serving a business community, the Commercial User's Guide is an essential source of always-current and practical how-to guidance on incorporating the Internet into everyday business practices. From connection options to building a Web site, from gathering competitive intelligence to marketing a Web site, the Guide is easily the only comprehensive reference for the commercial user.

SPECIAL OFFER TO LIBRARIANS

Your subscription to the Commercial User's Guide to the Internet includes a loose-leaf manual with over 600 well-organized pages. Included in the manual are directories of Internet resources, organized by topic, of special interest to the business user. Every month, you also receive new and replacement pages for the manual so it never goes out of date. Plus, you get a 12-page monthly newsletter featuring case studies of how others are using the Internet, new technologies, legal issues and much more.

You may order the Guide for a 30-day risk-free approval period. Use and review the Guide for a full month. If you'd like to continue to receive the monthly updates and newsletters, honor the invoice that will be enclosed. Otherwise, return the materials with the invoice marked "no thanks" and be under no further obligation. It's that simple. See for yourself! Order today. Call toll-free, 1-800-677-3789. Refer to priority code CXT and get 10% off the regular subscription price of $298. You pay only $268 (plus applicable sales tax).

For further information, send inquiries by e-mail to: INET@thompson.com or call 1-800-677-3789.

Dr. Andrew Lightman
Editor, Commercial User's Guide to the Internet
Thompson Publishing Group
(202) 739-9541

Email: lightman@cais.com
alightman@thompson.com
URL: http://www.thompson.com

For more information about this Web Site contact sanderso@cc.usu.edu
(Dr. Steve Anderson at Utah State University).