5. The Art of Indexing & Searching

"When I use a word," Humpty-Dumpty said, "it means just what I choose it to mean -- neither more nor less."
--Lewis Carroll

Nothing is more crucial to the ultimate success of any application that creates information than the process whereby that information is classified for later retrieval and use. This process is known as the indexing of information. It is at the heart of just about all of the applications the interface designer will confront.

Indexing is a set of methods whereby we classify information in some structured, abstract, and compact form in order to find the information at some later period of time. It is analogous to the process whereby libraries classify their materials. In fact, much of our understanding of the problem of indexing derives from the earlier efforts of library and information scientists.

In developing any classification scheme, the difficulties arise with the problem of handling the exceptions. The best example is the art and science of library cataloging. The oldest reference to a collection of catalog rules is in 1718 by Gaetano Giardina in Latin. Even if one goes back a hundred years (Dewey, 1894) one can easily see how complex the handling of exceptions is and the fundamental ambiguity in the language process. However, a hundred years ago the amount of information was not of the same magnitude as the body of knowledge that society deals with today. Therefore, the complications of handling exceptions are even harder today. Here are some example rules and procedures from Dewey's early guidelines that listed hundreds of such rules:

Classify married women, and other persons who have changed their names, under the last well-known form, with reference from other forms.

Talmud, Koran, Vedas, and other sacred books under those words on author line, making added entries under editor, translator, etc.

A society, under first word (not an article) of its corporate name, with reference from any other name by which it is known, specially the place if it has headquarters and is often called by that name, e.g., Royal Statistical society, with reference from London, Royal Statistical society.

Under remarks indicate: any other-bindings, sale, loss, exchange, withdrawal, damage, duplicate, binding in with other volume, or any change or disposition.

Another serious complication is the lateral connections, groupings, or nonlinear linkages that must exist in any extensive collection of material. Color was one of the mechanisms that were used to handle this complication:

Green represented biographies, white for complete works, works of criticism yellow. Also color in the bindings of books was used to represent different languages: American, light brown; English, dark brown; German, black; French, red; Italian, maroon; Spanish, olive; Latin, light green; Greek, dark green; Semitic, yellow, etc.

It is also true that the language itself encourages ambiguity and that a given object may be sought under a great many different names for it:

Bible, Holy Bible, La Sainte Bible, Biblia Sacra, Bible wordo, biblia, Scriptures, Holy Scriptures, The Scriptures, Sacra Scrittura, Saintes, New Testament, Old Testament, Testament, Nouveau Testament, Ancien, Vetus Testamentum, vetus novum, nuovo testament, gospels, evangelium secundum, mathhewm, gospel of St. Matthew, epistle to the romans, acts of the apostles, proverbs, psalms, ecclesiastes.

The real goal of an indexing approach is to be a "guidance system" to the body of information that it represents (Borko and Bernier, 1978). Societies invent indexing approaches as a normal part of their evolution. The telephone book is an index developed to support our telephone-based society. In wines, there is a multi-dimensional classification system that wine tasters use to characterize a given wine relative to others (body, color, browning, nose, etc.). The same holds true of most commodities in our society: e.g., leather, wool, oil, coal, metals, plastics, etc. In chemistry, the periodic table is an indexing system based upon the scientific structure that explains the behavior of the elements. The more the indexing system reflects the inherent structure or nature of the information or the application it supports, the more useful the index is.

In any application today, the process of searching and finding information is a standard, common, and growing problem. It behooves the designer to understand how humans go about classifying information into some sort of indexing schema. In many cases, this is very different from the ways that are used internally in the computer to index information. Too often the indexing and searching problem is given only superficial consideration. For example, when the hard disks on personal computers held 20 megabytes and were largely empty, the hierarchical DOS approach to indexing seemed fine. For those of us who now have gigabytes of disk space almost full of compressed files, it becomes very difficult to find something we worked on a year ago. Searching problems do not become a concern to users until they have built up a reservoir of material that requires more powerful search capabilities. However, it is the obligation of the designer to anticipate the needs of the users rather than to react only after the problem has already occurred.

5.1. Index Types

Surprisingly, there are only a limited number of different approaches to indexing. Furthermore, there is a logical structuring of the indexing schemes used in our society and this presentation will follow that ordering. These go from very precise and compact indexing to very equivocal and expressive approaches.

5.1.1. Hierarchical

The most compact of all the indexing schema is the "hierarchical" approach. This is a tree structured classification system. An object is allowed to be at only one place in the tree. This approach requires a minimum of internal coding and therefore makes the least demands on internal memory. For this reason, it was chosen to be the original DOS file system structure. Usually a single term or phrase represents the meaning of a node within the tree. The resulting classification for an object is then the sequence of node names through which one must pass to reach the object.

It was the original approach to library classification (Dewey Decimal Classification) so that every book or document would have only one unique location in the classification structure. The following example (Meadow, 1973) illustrates the compactness of this approach:

600. Technology (applied science)
620. Engineering
629. Other branches of engineering
629.13 Aeronautics
629.138 Uses of aircraft
629.138 8 Space flight (inertial navigation)
629.138 82 Artificial earth satellite
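
The idea that a classification is "the sequence of node names through which one must pass" can be made concrete with a minimal sketch. It assumes the Dewey fragment above is held in a small Python dictionary keyed by class number; the dictionary layout and the function name classification_path are illustrative choices and are not taken from any library system.

```python
# Hypothetical in-memory copy of the Dewey fragment shown above.
# Each class number maps to (parent class number or None, node name).
DEWEY = {
    "600": (None, "Technology (applied science)"),
    "620": ("600", "Engineering"),
    "629": ("620", "Other branches of engineering"),
    "629.13": ("629", "Aeronautics"),
    "629.138": ("629.13", "Uses of aircraft"),
    "629.138 8": ("629.138", "Space flight (inertial navigation)"),
    "629.138 82": ("629.138 8", "Artificial earth satellite"),
}


def classification_path(code, tree=DEWEY):
    """Return the node names from the root of the tree down to `code`."""
    names = []
    while code is not None:
        parent, name = tree[code]
        names.append(name)
        code = parent          # climb one level toward the root
    return list(reversed(names))


path = classification_path("629.138 82")
print(" > ".join(path))
```

Because each node stores exactly one parent, an object can sit at only one place in the tree, which is both the source of the compactness and the source of the classification difficulties discussed next.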

While hierarchical classification is compact and straightforward, it is the most difficult system for humans to use accurately for classifying objects. For example, if a person writes a book on the use of integral equations in chemistry, should it be classified under mathematics or under chemistry? In the hierarchical approach we cannot do both. Therefore a human has to make a difficult decision about where it will go. Future users will then have to reconstruct the thought process of the original classifier. In fact, a year later even one's own thought process might not be easily recalled.

Another inherent problem is that the established hierarchy can become inconsistent with age. The way you classified your DOS files a year ago may not be the way you wish to do it today, and that makes use of the index to recall items even more difficult. In library systems, the classification of "atomic weapon" is a subclassification of "ammunition." Hardly anyone unfamiliar with this particular index would think to look for atomic weapons under the tree node of ammunition. Clearly, the concept of ammunition was understood and recognized before that of atomic weapons, and since it would be very costly, on a worldwide basis, to reestablish a new card catalog index in every library, one had to add atomic weapons to the existing hierarchical structure. The only place that seemed logical to add it was under the category of ammunition.

The outline structure or table of contents of a document is another form of hierarchical indexing. That this becomes inadequate as a document grows is evident from the introduction of subject indexing and cross referencing at the back of large documents such as books. Hierarchical indexes are best created and utilized by experts in classification. This is the reason for the profession of reference librarian. Anyone who uses a book index quickly realizes whether the indexer for that book did a good or bad job, depending on how well they constructed the two to three level index that goes with a professional book.

Hierarchical indexing is very easy to summarize and is internally very compact. It is, however, the most difficult for humans to utilize accurately. It tends to become obsolete with age and is highly rigid and difficult to adapt to regular changes in the body of knowledge it is supposed to represent. Over time, its survival becomes dependent upon a growing list of rules, procedures, and the expertise of those who perform the classification. Certainly it is the worst choice for people who are not already experts in the subject matter that they are indexing. Furthermore, the experts would need added functionality designed to aid in the updating of the hierarchy and the location of the indexed objects. This was never provided in the DOS file system.

5.1.2. Networked

A networked index is a hierarchical index that allows an object to occur in more than one place in the tree. This is, in effect, the addition of cross referencing. We can accomplish this by either duplicating the object at different locations, or by referring to another location at the first location (cross referencing). It also illustrates the flexibility introduced by hypertext-type relationships. The yellow pages of a telephone directory utilize cross referencing to allow multiple locations in the tree to end up at one place. The telephone yellow pages, however, begin with a large set of independent nodes that are more reflective of subject headings, with only one or two levels below the starting points.
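
A minimal sketch of the cross-referencing option, in the style of a yellow-pages "see" entry, is given here. The headings, the listed names, and the lookup function are all invented for illustration; the only assumption is that a heading either holds entries or points at the heading that does.

```python
# Hypothetical yellow-pages style index: a heading either lists its
# entries directly or carries a "see" reference to another heading.
index = {
    "Physicians": {"entries": ["Dr. A. Example", "Dr. B. Sample"]},
    "Doctors": {"see": "Physicians"},
    "Attorneys": {"entries": ["C. Placeholder, Esq."]},
    "Lawyers": {"see": "Attorneys"},
}


def lookup(heading, index, max_hops=10):
    """Follow "see" references until a heading that holds entries is reached."""
    for _ in range(max_hops):          # guard against circular references
        node = index[heading]
        if "see" not in node:
            return heading, node["entries"]
        heading = node["see"]
    raise ValueError("too many cross references; possible cycle")


print(lookup("Doctors", index))
```

Storing a reference rather than a duplicate keeps a single authoritative location for each object, at the cost of one extra step for the searcher.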

In most real world applications, the relative amount of cross referencing is kept to a minimum when compared to the richness or diversity of the original hierarchy. Once again the use of rules and procedures is desirable for controlling and understanding what the network relationships are. When the network itself becomes the dominant consideration for determining the position of the object we begin to have the situation of what is commonly called Hypertext. Hypertext has some distinct problems we will discuss later in this book.

Hypertext is not an index in the classical sense. Indexes have the objective of providing a high level map to guide the user in the information domain. Hypertext is much more like the signs at a point on the road. It is localized guidance that tells users what they can reach from where they currently are.

5.1.3. Subject Headings

The next form of indexing is a fixed set of subject headings that are not necessarily related to one another. Subject headings may have a hierarchical structure; however, it is rarely coded and it is possible to have an object classified under more than one subject heading. For example, the book about integral equations in chemistry could be classified under the subject of mathematics and also under the subject of chemistry.

This approach requires very precise definitions of the subject headings so that it can be absolutely clear where the object that is being classified belongs. Subject headings are usually presented in alphabetical order and the user must scan for the subject headings in which he or she might be interested. Since there can be a large number of objects indexed under a single subject heading, subject headings are usually employed at a macro level of description. Usually retrieval is accomplished by picking the single most appropriate subject heading. There are subject heading indexes where an object can only be placed under one subject heading and there are those where the object can be placed under many different subject headings.

An example of placing objects under large numbers of different subject headings is taxonomy. In biology, plants and animals are classified by specifying their characteristics under a large number of semi-independent subject headings and/or attributes: the number of veins on the leaf, the number of branching twigs at a branching point, the color of the bark, the color of the leaf, etc. The formal approach for this can utilize information theory to derive relative weights for the importance of each of the attributes in characterizing the object. Biological classifications have become so complex that they are really a combination of subject headings and faceted indexes (below).
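
One way to read the information theory remark is as a calculation of information gain: how much the uncertainty about an object's class drops once a given attribute is known. The sketch below is only one possible interpretation, chosen here for illustration; the specimen records, attribute names, and species labels are invented.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, label="species"):
    """Reduction in uncertainty about `label` once `attribute` is known."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec[label])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([rec[label] for rec in records]) - remainder


# Invented specimens: leaf veins separate the two species, bark color does not.
specimens = [
    {"bark": "grey", "leaf_veins": 5, "species": "A"},
    {"bark": "grey", "leaf_veins": 7, "species": "B"},
    {"bark": "brown", "leaf_veins": 5, "species": "A"},
    {"bark": "brown", "leaf_veins": 7, "species": "B"},
]

for attr in ("bark", "leaf_veins"):
    print(attr, round(information_gain(specimens, attr), 3))
```

An attribute with a higher gain deserves a higher weight in the classification, since knowing it narrows the possible classes more.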

The ACM uses a set of subject headings to classify the field of computer and information science and all publications reviewed in the Computing Reviews journal. This subject classification is reviewed and modified from time to time by one of the ACM working committees. In this use of subject headings, an object can only be placed under one subject heading when it is entered in Computing Reviews. However, authors can utilize these same headings as a set of keys to express the topic of their paper. It is much easier to invent new subject headings to accommodate a rapidly changing situation than it is to modify a hierarchy.

Developing a good list of subject headings is also a difficult human task, and the ultimate success of this type of index depends very much on this list. Anyone who has tried to utilize some of the subject oriented help indexes found in various software application packages will realize how little careful thought goes into the creation of some of these indexes. The collection of subject headings defined by one individual to represent a domain of knowledge may not be the one perceived by a particular end user with a particular need. The ideal application for subject headings is where they serve to divide up a set of objects into fairly large subgroups. Whereas each branch in a hierarchy has a relatively small number of entries (tens), each level in a subject index usually has a very large number (hundreds). Clearly the distinction can become fuzzy in some cases.

5.1.4. Key Words & Key Phrases

In this type of index, any combination of key words or phrases may be used to categorize a particular object. There may only be a few objects that have the same combination of keys. In principle, sufficient keys can be used to make every object distinct from every other object by making every set of keys distinct for every object. Usually, one does not go this far. It is usually more desirable that a reasonable sized set of objects have a distinct and meaningful classification that semantically distinguishes that set from all others.

There is no need for precise definitions for each term. It is assumed that the combination of terms used to categorize the object are sufficient to give it a distinctive place in the classification system. The question is not what term to place the object under, but what combination of terms are applied to the object in order to best describe it.
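
The point that an object is located by a combination of terms, rather than by a single term, can be sketched with a small inverted index. The recipe names and key words below are invented, and the search function is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# Hypothetical recipe collection: each object carries a free set of keys.
recipes = {
    "minestrone": {"soup", "vegetarian", "italian"},
    "carbonara": {"pasta", "italian", "pork"},
    "gazpacho": {"soup", "vegetarian", "spanish", "cold"},
}

# Inverted index: key word -> set of objects that carry it.
inverted = defaultdict(set)
for name, keys in recipes.items():
    for key in keys:
        inverted[key].add(name)


def search(*keys):
    """Return the objects that carry every one of the given keys."""
    sets = [inverted.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()


print(sorted(search("soup", "vegetarian")))
print(sorted(search("soup", "italian")))
```

No single key here identifies an object, yet adding one more key to the query narrows the result, which is the sense in which the combination gives each object its place.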

Key words or phrases may be free, chosen from all possible semantic terms, or fixed, drawn from a fixed list that is used for classification. Key words are considered to be more adaptable than the previous indexing methods, but they are far less compact in nature. One expects a key word index to undergo more frequent change and have more added terms as time passes. In situations where the underlying morphology is not evident or clear, this is one of the most popular approaches. Hence we see it has been added to most word processing applications and other applications that store the user's work in files.

In terms of a person rendering their own indexing, it is probably one of the most desirable approaches. A single individual can be consistent in the establishment and utilization of the keys. It is also fairly easy for a single individual to adapt the index by the introduction of new terms without having to spend a great deal of effort cleaning up the old terms. Consistency for a group creating and sharing a key word index is more difficult unless one person is given authority over the evolution of the index. In practice, the approach for a group should probably be a set of fixed keys with a procedure for the proposal, evaluation, and addition of new keys on an as-needed basis.

5.1.4.1. Coordinate Keys

A specialized form of key words utilizes keys to define coordinate factors. For example, the keys "tall, medium, short" could be systematically applied to a set of objects to denote their relative height. In the example of a recipe system we could have, "high, medium, and low" fat dishes. A great many nutritional categories utilize fairly common coordinate type keys.

Coordinate keys are quite common in society to deal with commodities, e.g., the color, aroma, body, and taste of wine, the relative softness of leather. In many such applications they reflect the fact that human subjective judgment is involved in deriving a classification. Usually the range of the scale is three to nine distinctions which is consistent with the accuracy of the human subjective judgment process and the limitations of short term memory.

Coordinate keys are usually not used alone but in conjunction with free keys or with subject headings. While coordinate keys are subjective in nature, one need only watch professional wine tasters in action to realize how precisely and accurately they can be utilized in the hands of an expert. Another example is the classification for cuts of beef (e.g., choice, prime, etc.).

5.1.5. Syntactic Languages

This is the concept of introducing grammatical relationships between the terms that make up a classification. This allows us to get far more expressive and precise about the characteristics of an object within a collection. There are many specific forms that such grammatical relationships can take. The principle of inverting a name to the form "last, first middle" is a simple form of this. The most common approaches taken are in the use of tagged descriptors and faceted indexes or some combination of these.

5.1.5.1. Tagged Descriptors

A typical tagged descriptor might be: "object, modifier" such as in "tank, petroleum" and "tank, weapon." The additional tags of petroleum and weapon make a significant difference in the meaning of the classification term "tank." Tagged descriptors are the typical professional approach for book indexes, providing a natural two or three level hierarchy.

Very often tagged descriptors result from the "inversion" of the normal English form. For example, organic and inorganic chemistry become: "chemistry, organic," and "chemistry, inorganic." In a reference the last name is tagged with the first and middle names (e.g., Hiltz, Starr Roxanne).

5.1.5.2. Faceted Indexes

This type of index is composed of an ordered set of dimensions or properties, called facets, that characterize the object. Those dimensions are specified by terms or quantifiable variables. This index is commonly used by professionals dealing with commodities in the society. For example, leather might be characterized by the factors: degree of softness, color, shear strength. Other things that are indexed in this way are metals and plastics. For metals the melting point is an example of a very precise quantified facet for specification of the particular metal. The classification of a professional paper reference involves authors, title, journal, volume, issue, date, and pages, and represents a structured faceted index. Note that some of these facets can be quite precise (e.g., date of publication, melting or vaporization point of a plastic, etc.).
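
A faceted entry maps naturally onto a record with one field per facet. The sketch below assumes a tiny metals catalog with one quantified facet (melting point) and one qualitative facet (color); the class name, field names, and select function are illustrative, and the melting points are rounded approximations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetalEntry:
    name: str
    melting_point_c: float   # precise, quantified facet (approximate values)
    color: str               # qualitative facet


catalog = [
    MetalEntry("tin", 232.0, "silver-white"),
    MetalEntry("lead", 327.5, "grey"),
    MetalEntry("copper", 1085.0, "reddish"),
]


def select(entries, min_melting_c=None, max_melting_c=None, color=None):
    """Filter on any combination of facets; None leaves a facet unconstrained."""
    result = []
    for e in entries:
        if min_melting_c is not None and e.melting_point_c < min_melting_c:
            continue
        if max_melting_c is not None and e.melting_point_c > max_melting_c:
            continue
        if color is not None and e.color != color:
            continue
        result.append(e)
    return result


print([e.name for e in select(catalog, max_melting_c=400)])
```

Because every object answers the same ordered set of questions, a search can constrain any subset of the facets, and quantified facets support range queries that free key words cannot express.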

The facets can be both highly factual and highly subjective in nature. Consider a brokerage firm that issues qualitative recommendations about stocks (e.g., desirable, very desirable) based upon a set of factors such as profit. Consider the ratings for bonds: AAA, AA, A, etc. While these classifications are done by experts, it may be rather difficult for a non-expert to understand the fine differences in a given facet.

It is quite common to take many of the other indexing methods and put together some combination of them as a faceted index. The Universal Decimal Classification is a good example of this and allows special faceted codes called auxiliaries. This allows libraries to incorporate many other forms of documents such as music, treaties, maps, artwork, rare books, etc. In natural language the parts of speech can be considered as facets of a sentence (e.g., nouns, subjects, verbs, etc.).

Faceted indexes are a very powerful approach to indexing when the application has some underlying theoretical or logical model that guides the choice of the facets. The most obvious example is the periodic table of the elements, which orders the elements by atomic number (the number of protons) and groups them by recurring chemical properties.

The periodic table of the elements also illustrates that the "right" classification scheme can do far more than merely organize knowledge. It can become a framework for showing the theoretical structure that underlies the knowledge. This is what the periodic table of the elements does; until its invention it was impossible to grasp the many different, apparently unrelated chemical properties of the elements. Biologists, anthropologists, and paleontologists, for example, are all seeking improvements in the classification systems that underlie the knowledge in their fields. In many professional fields, one has to view the classification scheme and process as an evolving one. In our recipe example, can we provide a capability that will allow people to set up classification schemes that will be more useful to them than the existing ones? If the answer is yes, we have created a valuable system that will allow some users to advance the art of cooking.

What is termed visualization is really the use of multi-dimensional faceted indexes that are mapped to a spatial set of dimensions or scales. The "art" of finding the appropriate transformation to find the proper scaling properties is the design challenge in visualization efforts.

The more rapidly a field is changing, the more difficult is the challenge of maintaining an indexing system. This includes businesses that are changing their products and markets. Consider a classification system for microcomputers and all their capabilities that might be used inside the manufacturer of such a product.

5.1.6. Phrases

The title of a paper represents a phrase that characterizes the paper. The use of KWIC (Key Word In Context) indexing for titles is an example of using phrase oriented indexing. The KWIC takes and alphabetizes each word in the title so you can see together all titles that use the same word. In the faceted index for an article reference we require one of the fields to be the title of the paper. So we are then using a phrase as one of the facets of an article classification.
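
The KWIC procedure described above is simple enough to sketch directly. The version below is a simplified illustration: the stop-word list, the column width, and the two titles are assumptions made for this example, and production KWIC indexes typically also wrap the title around the keyword column.

```python
# Words too common to be worth an index line (illustrative list only).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic(titles, width=30):
    """Return one line per significant title word, sorted by that word.

    Each line shows the keyword at a fixed column, with the words that
    precede it in the title right-aligned to its left.
    """
    rows = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])     # context before the keyword
            right = " ".join(words[i:])    # keyword plus following context
            rows.append((word.lower(), left, right))
    rows.sort()                            # alphabetize by keyword
    return [f"{left[-width:]:>{width}}  {right}" for _, left, right in rows]


# Invented titles, used only to show the layout.
titles = ["The Art of Indexing", "Searching Large Indexes"]
lines = kwic(titles)
for line in lines:
    print(line)
```

Titles that share a keyword end up on adjacent lines with the keyword in the same column, which is what lets a reader scan all titles using the same word together.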

Notice that we are increasingly making it easier to classify an object. In this one we have no pre-structure constraint and can merely write down phrases ad hoc if we wish. It is far more expressive than the prior approaches. However, it is far more difficult to utilize this approach when it comes to precise retrieval of a given object. For a large body of material the number of unique phrases could be excessively large. An author can easily give a title to a paper he or she wrote which is what is often done. However, many times a reader may discover that the title is not a very accurate description of the paper. Also the reader has to view a large number of titles to be able to select the items he or she is interested in.

In many systems we use phrases without being aware that it is what we are doing. We often display a list of items in the shortest possible form with just enough information to allow the user to distinguish uniquely between each item on the list. This allows the user to quickly scan a list and perform a mental discrimination to pick out the item or items they are interested in. A phrase can be a grammatical structure that includes both quantitative and qualitative information.

5.1.7. Natural Language

The final approach to indexing is to write a description of the object in terms of actual text. This is commonly called an "abstract." It is the least concise approach to indexing and the most difficult to utilize for finding objects. It is the easiest to create. There is no need for any consistent process to exist among different abstracts. An abstract is only some sort of short summary of what the writer feels is the important results or essence of the document it refers to.

Note that there is usually only one abstract for each object that one wishes to be informed about. These abstracts need to be designed so that they are an accurate and expressive description of the object and not a misleading statement. When abstracts are written by the authors of the documents, for example, this often does not appear to be the result. In many cases authors tend to overvalue the document they are describing. Automatic abstracting is one of those elusive but desirable goals of natural language processing that the field continues to seek.

5.2. Index Performance

There are a number of different properties that we can utilize for evaluating indexing approaches. These are:

Ambiguity: The degree to which objects can become confused as to their location within the classification schema. The degree of difficulty in deciding where to place an object within the classification. Natural language has the greatest degree of ambiguity.

Expressiveness: The degree to which the indexing terms express information and/or unique characteristics about the object they are classifying. This is sometimes referred to as specificity. Natural language is the most expressive of all the indexing choices.

Conciseness: The degree to which the specification of the classification of this object is compact in representation and/or nature. Natural language is the least concise.

Retrieval Effort: The amount of effort needed on the part of a human and/or a computer to actually perform a search for a given object. Clearly a human or a computer must examine the greatest amount of information and do the maximum amount of analysis using natural language.

Classification Expertise: Establishing the location or locations of an object within an indexing schema. One must have expert knowledge of a hierarchy schema to be able to accurately place objects; however, anyone can write an abstract about an item he knows without knowing all other items in the classification schema.

Adaptation Effort: The effort to reformulate or update the indexing schema as it becomes obsolete with the passage of time. Free keys are the most adaptable because there is no structure to be modified for the index as a whole. Free keys have the highest effort for humans and natural language represents the highest effort for computers since they are still less able to deal effectively with ambiguity than humans.

In the following table we characterize the different types of indexes in terms of these properties.

Table of Index Performance

Index Type          Ambiguity  Expressiveness  Conciseness  Retrieval Effort  Classification Expertise  Adaptation Effort
Hierarchical        LOW        LOW             HIGH         LOW               HIGH                      HIGH
Networked
Subject Headings
Fixed Keys
Free Keys                                                                                               LOW
Tagged Descriptors
Faceted Indexes
Phrases
Natural Language    HIGH       HIGH            LOW          HIGH              LOW                       HIGH

Another approach to looking at the effectiveness or performance of an index is with the classical retrieval measures of precision and recall. Consider the numeric quantities A, B, C, and D which are defined by the following cells in the matrix. A + B is the total number of items retrieved as the result of a search and A + B + C + D is the total number of entries in the database.

Variables Defined    RELEVANT       NOT RELEVANT
RETRIEVED            A (hits)       B (noise)
NOT RETRIEVED        C (misses)     D (correctly rejected)

Given the above variables we define:

Precision = A / ( A + B )

Recall = A / ( A + C )

Specificity = D / ( B + D )

Search Efficiency = (Recall) x (Specificity)

Clearly, high precision means that what you retrieved had very little noise or non-relevant material, and high recall means you found almost everything that was relevant. Specificity deals with how much noise there is relative to the total size of the database; the closer it is to one, the better the rejection of non-relevant material.
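
The four measures defined above can be computed directly from the cell counts. The sketch below is a worked example with invented counts for a database of 1,000 entries; the function name is illustrative, and it assumes none of the denominators is zero.

```python
def retrieval_measures(a, b, c, d):
    """Compute the measures defined above from the four cell counts.

    a: relevant and retrieved (hits)      b: not relevant but retrieved (noise)
    c: relevant but not retrieved (misses) d: not relevant, not retrieved
    Assumes a + b, a + c, and b + d are all nonzero.
    """
    precision = a / (a + b)
    recall = a / (a + c)
    specificity = d / (b + d)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "search_efficiency": recall * specificity,
    }


# Invented counts: 50 items retrieved, 40 relevant items in the database.
measures = retrieval_measures(a=30, b=20, c=10, d=940)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

In this example specificity is close to one even though two of every five retrieved items are noise, because the count D of correctly rejected items dominates in a large database.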

Precision and recall may vary between zero and one. For databases containing textual information it is impossible to design indexing and retrieval procedures that result in high values (close to one) for both precision and recall. These two dimensions represent a fundamental tradeoff in the design of indexes. If one wishes to retrieve only relevant information (no noise), then it becomes very probable that the search will not find much of the relevant information (poor recall), and vice versa. This becomes more true as the size of the database increases. Even for databases that deal with quantitative information, if the searches are multidimensional and must rely on subjective weightings of the importance of factor combinations, we have exactly the same tradeoff problem.

The measure of specificity is the only one that incorporates the overall size of the database. The search efficiency is the only measure that utilizes all four variables. Note that for all the quantification in the above relationships, a human or a group of humans must judge how relevant the retrieved items were in order to make the above calculations. Measuring the above quantities is a classic sampling and human judgment problem that has received considerable attention by information scientists. It is a considerable effort to determine these quantities with some accuracy for a given database.

Other important measures are also considered when examining the utility of a database and the information it can provide. These are:

Timeliness: Is the data up to date and available when needed?

Accuracy: Is the information accurate?

Completeness: Is the information complete?

Associated with timeliness, accuracy, and/or completeness could be various status measures that indicate such things as when the data will be updated or completed. These three measures are the ones that most affect a user's feelings about the utility of a given database.

Objective/Subjective Nature: To what extent is the data the result of subjective human judgment or objective fact?

Certain types of subjective data or qualitative information may need associated measures of confidence and reliability that become properties of the indexing approach.

Form and Structure: To what degree is the information a summarization or a detailed raw data set? Are there relationships (semantic or otherwise) that relate the objects or are the objects independent?

Generalized semantic relationships among data objects are at the foundation of Hypertext and represent the future for the indexing of multimedia systems.

In terms of the classification and indexing approach some other considerations for evaluation purposes are:

Depth and Breadth: How deep down or how wide is the indexing structure?

Index structures of wide breadth imply considerable user expertise for making relative decisions among large sets of classification terms. Greater depth allows more use by novices but is unsatisfying to advanced users because of the added effort in finding something. These are the same tradeoffs that exist for menu design in an interface. The ultimate breadth of a menu is a command language with a command index.

Discrimination or Resolution: How well can the index method distinguish between different objects?

Can you look at the index representation of two objects and know how those two objects differ? Clearly to do this perfectly one would have to restrict the index to having only one object in any unique classification. However, this is an unworkable goal since the number of categories becomes too large to be useful. Even doing this it would probably be highly inaccurate (e.g., take "abstracts" as one such approach).

Density: How many objects, on the average, exist at a classification, pigeonhole, or place in the index?

There is a tradeoff between discrimination and density. One does not want hundreds of objects at a single point but neither does one want only one; this would usually require a very large index that was not very compact. The tradeoff here is a strong function of the expertise of the users and the nature of the application.
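The tradeoff can be made concrete by counting what sits under each index term. The following is a minimal sketch, assuming the index is held as a mapping from a classification term to the set of object identifiers filed under it. The function names and the sample data in the usage note are hypothetical and only illustrate the two measures.

```python
from statistics import mean

def index_density(index):
    """Return (average, maximum) number of objects per index term.

    `index` maps each classification term to the set of object ids
    filed under it (an assumed representation, not one from the text).
    """
    sizes = [len(objects) for objects in index.values()]
    if not sizes:
        return 0.0, 0
    return mean(sizes), max(sizes)

def indistinguishable_pairs(index):
    """Count pairs of objects that carry exactly the same set of index terms.

    A high count suggests poor discrimination: the index representation
    alone cannot tell those objects apart.
    """
    signature = {}
    for term, objects in index.items():
        for obj in objects:
            signature.setdefault(obj, set()).add(term)
    groups = {}
    for obj, terms in signature.items():
        groups.setdefault(frozenset(terms), []).append(obj)
    return sum(len(g) * (len(g) - 1) // 2 for g in groups.values())
```

For a small recipe index such as `{"soup": {1, 2, 3}, "dessert": {4}, "quick": {1, 4}}`, the first function reports how crowded the pigeonholes are and the second reports how many objects cannot be told apart by their index terms alone.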

There are many organizational applications where it might be desirable to have the index maintained by a single specialist supporting the entire population of users. This is after all the purpose of the reference librarian within a library. The larger the application and the more diverse the population of users, the more quickly an index will degrade in performance without a sizable maintenance effort. For organization-wide applications it is also desirable to make available a dictionary that allows people to discover what the intended meanings of terms are in each application.

5.3. Index Use

In any application that is made up of a large number of objects that we wish to classify, there is often the ability to conceptualize a structure for classification based upon high-low or macro-micro considerations of the application. If we consider as our set of objects human beings, then most of us have some sort of macro level set of dimensions by which we classify a person. A human being is male-female, tall-short, dumb-smart, wealthy-poor, young-old, friendly-unfriendly, attractive-ugly, etc. These are various macro or high level dimensions we can use to set up a classification system for a general population of users. However, a psychologist classifying humans uses a completely different set of dimensions at a much more micro or detailed level. We might think about a person having a good personality or a bad one; however, the psychologist thinks about such dimensions as feeling - thinking, concrete thinker - abstract thinker, introvert - extrovert, etc. Similarly, there could be a great many different macro or micro level classifications. An employer will use a very high level classification to designate the relative health of his or her employees in their personnel records; however, a doctor will have a much more detailed and elaborate classification system to signify the relative health of his or her patients.

If there are well-structured relationships among the objects that comprise the knowledge base of interest and these relationships are at a macro level then the choice of a hierarchical approach is probably best suited. This is why this approach was originally chosen to represent a whole book in a library.

If the relationships are well structured but the relationships are at a micro level (i.e., many different classifications possible within the object) then the use of a networked indexing is probably desirable.

Clearly the nature of the object and the level of complexity of the components that make up an object influence the desirable indexing approach.

When the relationships are at a macro level and still reasonably structured but multiple in nature, then we move on to subject headings as the desirable indexing schema.

When we begin to get into unstructured relationships at a micro level we move to the concept of free keys. When there are factors within the objects that can be structured at the micro level and lots of overlap between objects we are naturally forced into a syntactic language approach. Whenever you are dealing with professionals in a particular application area it is very likely that the experts are already working mentally with such an approach. Many different metals may have the same melting point and it is the combination of attributes about a metal that defines it distinctively.

In situations where there is no explicit structure and you are dealing at the macro object level, then phrases and natural language become the preferred approaches.

These observations are summarized as follows:

INDEX TYPE                           IDEAL USE SITUATION
Hierarchical                         Macro objects, well structured objects and relationships
Networked                            Micro sub objects, structured relationships
Subjects                             Macro, Structured Concepts
Fixed keys, Free keys                Micro or Macro, multiplicity of object characteristics, overlapping structure
Tagged descriptors, Faceted Indexes  Micro, Structured factors within objects
Phrases                              Macro, Semi Structured
Natural Language                     Macro, Unstructured

In any evolving application area, and even for the recipe system we started this book with, it is important to allow the users to evolve and update the indexing approach they are initially provided. Setting up general maintenance procedures for indexes used throughout the organization and across different user groups can also be a significant objective for improving overall usability of systems. The free and fixed key approach should be a standardized tool that is provided for most applications and it should be in a standardized form so that users do not have to learn a new search process. There is certainly enough structured knowledge about food to set up a rich index that combines many of the above techniques and gives the user a flying start, but clearly an active cook is going to need to modify this with new keys and associated relationships that pertain to his or her utilization of recipes.

5.4. A Little Intelligence

A great deal of very sophisticated work has taken place in the areas of automatic indexing and abstracting of material (Salton, ???). One can utilize fuzzy set relationships, correlations, and/or semantic meaning to try and incorporate content information within indexes. Certainly this area is wide open to Artificial Intelligence type approaches. However, there are some rather straightforward things that can be done that buy a significant improvement at little actual added effort. This is an area where a little intelligence can go a long way and make some significant improvements. Also it is far better to provide features that users can understand and utilize in formulating their searches and their classification of information. This is particularly critical where the user is creating and maintaining the index. Once again we have a situation where the internal model can easily become difficult for the user to understand (system opacity) and hence degrade his or her ability to both feel and exercise control over the search process.

5.4.1. Common Words

The first simple step that can be taken is to exclude a subset of words from use in any index scheme. These are the most frequently used words in the English language and convey very little selective information for the purposes of setting up an index. Actually enforcing that the user does not use them in either categorizing or searching is quite desirable.

The first table is for those that are the most frequent words, which would rarely be useful as a modifier or part of an index term. We refer to this as the "primary exclusion set." The second table is the "secondary exclusion set." These words can also be eliminated; however, in some syntactic language approaches some of these words could be of use.

A third set of exclusion is obtained by considering all common suffixes (endings). This means that the index terms are reduced to only the root part of words and endings are ignored in checking for any text term matching an index term. The third table is a list of suffixes grouped by their number of characters.

PRIMARY EXCLUSION WORD LIST

a

about

above

after

afterall

afterward

again

against

all

almost

already

also

although

always

am

an

and

another

any

anyone

anything

are

as

at

be

because

been

before

being

but

by

can

cannot

could

did

do

does

doing

done

don't

each

else

etc.

first

for

from

further

had

has

have

having

he

hence

her

here

hers

herself

him

himself

his

how

however

I

if

in

indeed

into

is

it

its

itself

like

me

mere

merely

more

moreover

most

mostly

must

my

myself

nearly

neither

never

nevertheless

no

non

none

nor

not

nothing

notwithstanding

now

of

often

OK

on

once

one

only

onto

or

other

others

otherwise

ought

our

ours

out

overly

per

perhaps

pre

rather

readily

same

seem

seems

shall

she

should

since

so

some

something

sometime

somewhere

such

than

that

the

their

them

themselves

then

there

thereby

therefore

therein

these

they

this

those

though

thus

to

too

unless

up

upon

us

very

was

we

were

what

whatever

whatsoever

when

whenever

where

whereat

whereby

wherefore

wherein

whereof

wherever

whether

which

whichever

while

who

whom

whose

why

will

with

within

without

wont

would

you

your

yours

yourself

SECONDARY EXCLUSION WORD SET

able

accomplish

accord

across

actual

affect

ago

allow

allowed

allowing

allows

alone

along

among

application

applied

apply

applying

around

ask

asked

asking

asks

away

based

became

become

began

begin

begins

behind

believe

below

beneath

beside

best

better

between

beyond

both

bring

bringing

brings

brought

came

cause

caused

causes

causing

certain

clearly

co-

come

comes

coming

consider

corp.

dept.

despite

down

due

during

easily

easy

eight

either

enough

especially

even

ever

every

everything

far

farther

few

fewer

finally

find

finding

finds

five

follow

forth

forthcoming

found

four

front

fulfill

full

fully

gave

get

getting

give

given

gives

giving

go

goes

going

gone

good

got

happen

inc.

inside

involve

just

keep

keeping

keeps

kept

knew

know

knowing

known

knows

largely

last

late

lately

later

least

leave

less

let

lets

letting

likely

likes

little

long

longs

look

looked

looking

low

ltd.

made

make

makes

making

many

may

might

much

need

needed

needs

next

nowhere

obtain

off

outside

over

overcome

own

owning

owns

partly

pending

possible

put

quick

quickly

quite

really

regard

relate

result

return

said

sat

save

saved

saves

saw

say

says

see

seen

sees

seven

several

shown

sit

sits

sitting

six

soon

still

stood

sure

take

taken

takes

taking

ten

three

through

times

together

took

toward

tried

try

two

under

underneath

until

use

used

using

usual

usually

various

vary

versus

via

vs.

want

went

whole

wholly

yes

yet

SUFFIX LISTS


1 letter

e

s

2 letter

ae

al

cy

ed

en

es

ic

iv

iz

ly

or


3 letter

age

als

ant

ary

ate

eds

ely

ent

est

ety

ful

ial

iar

ibe

ics

ied

ief

ier

ies

ify

ily

ing

ion

ior

ism

ist

ite

ity

ive

ize

izm

ors

ous


4 letter

abil

able

ably

ages

ally

ance

ated

ates

ativ

ator

edly

ence

ency

ents

hood

ials

iant

iate

ibed

ibes

ibil

ible

ibly

ical

icle

ient

ific

ings

ions

ious

isms

ison

ists

ites

itly

izms

less

ment

ness

ship

wise

ying


5 letter

acion

aging

ances

arily

ately

ating

ation

ative

ences

entry

ially

iated

iates

icals

ician

icles

iency

ified

ifies

isons

isty

itate

itely

ities

ition

itive

ments

6 letter

ations

atives

encies

encing

iently

iption

itiated

itates

itions

itives
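The three exclusion steps can be sketched in a few lines. This is only an illustration under stated assumptions: the word and suffix sets below are short excerpts from the tables above, the longest matching suffix is removed first, and the minimum root length of three characters is an assumption added here so that short words are not stripped to nothing. The names `root` and `index_terms` are hypothetical.

```python
# Short excerpts from the tables above; a real system would load the full lists.
PRIMARY_EXCLUSIONS = {"a", "about", "and", "for", "in", "is", "of", "the", "to"}
SUFFIXES = ["ations", "ation", "ments", "ment", "ness",
            "ing", "ies", "ed", "es", "ly", "s", "e"]

MIN_ROOT = 3  # assumption: never reduce a word below three characters

def root(word):
    """Strip the longest matching suffix, leaving only the root of the word."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_ROOT:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Drop excluded words, then reduce the remaining words to their roots."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [root(w) for w in words if w and w not in PRIMARY_EXCLUSIONS]
```

With these excerpts, "recipe" and "recipes" reduce to the same root, so a search on either form matches text containing the other. Because the same reduction is applied both when indexing and when searching, the user never has to see the truncated roots.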

5.4.2. Syndetic Systems

These are index systems that provide guidance within the index system. The usual form shows synonyms, related headings, antonyms, the overall structure of the index, information nomenclature, and exceptions. The major components of these systems are the cross-references, for example: "Corn sugar, see Glucose."

For many applications, and in particular for help systems, the incorporation of synonyms probably has the single biggest marginal value for improving the utility of the system. This is due to the lack of standardization in the terminology used to represent functionality across applications or even in the same application software supplied by different vendors. Thus the term "add," for example, means one thing on one system and something else but related on a different system.
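A "see" reference amounts to a mapping from a non-preferred term to the preferred heading. The sketch below assumes such a mapping is kept beside the index itself. Only the corn sugar entry comes from the text; the help-system synonyms, the document identifiers, and the function names are hypothetical.

```python
SEE = {
    "corn sugar": "glucose",  # cross-reference quoted in the text
    "insert": "add",          # hypothetical help-system synonyms
    "append": "add",
}

# Hypothetical index: preferred heading -> identifiers of the objects filed under it.
INDEX = {"glucose": ["doc-17"], "add": ["help-3", "help-9"]}

def resolve(term):
    """Follow "see" references until a preferred heading is reached."""
    term = term.strip().lower()
    seen = set()
    while term in SEE and term not in seen:  # `seen` guards against circular references
        seen.add(term)
        term = SEE[term]
    return term

def lookup(term):
    """Return the objects filed under the preferred heading for `term`."""
    return INDEX.get(resolve(term), [])
```

A user who types "append" in a help system built this way is taken to the material filed under "add" without having to know which word the vendor chose.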

5.4.3. Zipf's Law & Concordances

Zipf (1972) was a linguist who discovered, before computers, the startlingly simple relationship between the frequency of words (f) and their rank order (r):

r x f = C (C = a constant)

The plot of r versus f for this formula on log-log paper is a straight line. This was found to be statistically true by counting word occurrences in many bodies of text. It appears to be true in different languages and for different subject matter within the same language. Zipf pointed out that the words easiest to say (shortest) tended to be the most frequent and also that those with the most differing meanings tended to be more frequent. Somehow the use of language in a culture evolves so that people expend the minimum effort to communicate. This led Zipf to further investigations to support a very general hypothesis that humans attempt to minimize effort in anything they do. A relevant example that he studied was how craftspeople, such as shoemakers and carpenters, positioned their tools within their work area. While these tools often looked like they were placed without any logical ordering, it turned out they were placed to minimize the effort in getting them. The most frequently used tools were always closer to the craftsperson.

The following illustrates the behavior of the common rank order list (Hofland and Johansson, 1982). While there is statistical fluctuation in the value of the constant it is a statistically significant linear relationship.

SAMPLE ENTRIES FOR THE ZIPF LIST

Word       Rank     Frequency   Rank x Frequency
the        1        68,315      68,315
of         2        35,716      71,432
and        3        27,856      83,568
to         4        26,760      107,040
a          5        22,744      113,720
in         6        21,108      126,648
that       7        11,188      78,316
is         8        10,978      87,824
was        9        10,499      94,491
it         10       10,010      100,100
sir        195      452         88,140
it's       196      452         88,592
why        197      451         88,847
asked      198      448         88,704
give       199      446         88,754
once       200      443         88,600
usually    400      239         95,600
tax        500      167         83,500
ideas      800      128         102,400
proved     1,170    88          102,960
sections   2,146    49          105,154
flames     5,070    17          86,190
cultures   7,020    11          77,220
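A table of this kind can be produced for any body of text by counting words and ranking them. The following is a minimal sketch; the tokenizing rule (runs of letters and apostrophes) and the alphabetical tie-breaking between equally frequent words are assumptions made here, not part of the cited study.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return rows (word, rank, frequency, rank * frequency), most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Ties in frequency are broken alphabetically so the ranking is repeatable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]
```

If the relationship r x f = C holds for a body of text, the last column stays within the same order of magnitude across the whole list, as it does in the sample entries above.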

Comparing with the common rank order list of English (American English that is) a more intelligent formulation of index terms can be derived. We have already pointed out earlier the very frequent terms used in normal English should be eliminated. These terms are supplied in the list given above. With the complete rank list one can sample a body of literature and create a concordance list that is just the rank order list for that body of literature. Then, comparing the specific application concordance list to the general English list allows one to detect what words are used more in the subject area than in normal English. For example, if it is a database of physics papers we would expect the scientific terminology of Physics to be more common than in normal English. Terms like "nuclear" or "quantum" would stand out as potential terms for use in formulating an index. If one has to develop a database for all the documentation for a specific project it is very likely that this procedure would allow the formulation of a very useful set of index terms.
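The comparison of an application concordance with general English can be sketched as a ratio of relative frequencies. The threshold `ratio`, the add-one smoothing for words missing from the general sample, and the function name are assumptions made for this illustration.

```python
from collections import Counter

def distinctive_terms(domain_words, general_words, ratio=2.0):
    """Return (word, score) pairs for words relatively more frequent in the domain text.

    The score is the word's relative frequency in the domain sample divided
    by its (smoothed) relative frequency in the general sample.
    """
    domain = Counter(domain_words)
    general = Counter(general_words)
    d_total = sum(domain.values())
    g_total = sum(general.values())
    result = []
    for word, count in domain.items():
        d_rate = count / d_total
        # Add-one smoothing so words absent from the general sample do not divide by zero.
        g_rate = (general[word] + 1) / (g_total + 1)
        score = d_rate / g_rate
        if score >= ratio:
            result.append((word, score))
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))
```

Words that score well above 1 are candidates for index terms. Words such as "the" score near or below 1 because they are just as common in normal English.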

One would also reject for use in indexing the words that do not occur very often. In essence one chooses a high and low frequency occurrence cutoff. Performing a complete concordance analysis of the body of material and using it to update the index every year or so is a very desirable approach to keeping the index current. Knowing how often a word is used in the objects helps in determining classifications with reasonable discrimination capability. Ideally one would like a fairly even distribution with respect to the number of objects within a given classification node. This approach can be enhanced further by looking at two and three word combinations. With the computing power now available it is not unreasonable to also consider the frequency of phrases through at least three word phrases.
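Counting single words together with two and three word combinations, then applying the two cutoffs, might look like the sketch below. The function names and the inclusive treatment of the cutoffs are assumptions made for illustration.

```python
from collections import Counter

def phrase_counts(words, max_len=3):
    """Count every run of 1 to `max_len` consecutive words."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def candidate_terms(counts, low, high):
    """Keep terms whose frequency lies within the inclusive cutoffs [low, high]."""
    return sorted(term for term, n in counts.items() if low <= n <= high)
```

In practice the excluded common words would be removed before counting, and the cutoffs would be tuned until the number of objects per candidate term is reasonably even.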

Zipf's law allows individuals who are not experts in the application to make a reasonable start at formulating an index for a given application. The next most significant improvement is to consider words in combination (not phrases), but this requires being an expert in the application area. One is interested in the overall occurrence of words and phrases and in their occurrence on a per-object basis. Furthermore, some very powerful refinements are possible when word combinations are looked at with respect to how far apart they are in a document.

In general, allowing users to generate concordance lists for material they are creating or dealing with, and to compare them to normal English, can be useful for formulating document outlines and/or determining problems in writing style.

5.4.4. Spatial & Temporal Relationships

One of the largest text databases in existence is the one used by lawyers to search the actual text of law case decisions. The single most popular form of search functionality is not any of the standard database options. It is searching for two or more text terms within some spatial distance of one another. Do the terms occur in the same sentence, paragraph, or page, or within a certain number of words of one another? In any application there may be a need for correlated searches among the subparts of an object within the database. In any given situation there may well be correlated searches that are particular to the application.
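Two of the proximity tests just described can be sketched as follows. The tokenizing rule and the use of ".", "!" and "?" as sentence boundaries are simplifying assumptions, and the function names are hypothetical. This is not a description of how any particular legal database works.

```python
import re

def _words(text):
    """Lower-case word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within(text, term_a, term_b, max_words):
    """True if the two terms occur within `max_words` words of one another."""
    words = _words(text)
    a_pos = [i for i, w in enumerate(words) if w == term_a.lower()]
    b_pos = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= max_words for a in a_pos for b in b_pos)

def same_sentence(text, term_a, term_b):
    """True if both terms occur in at least one common sentence."""
    for sentence in re.split(r"[.!?]+", text):
        words = _words(sentence)
        if term_a.lower() in words and term_b.lower() in words:
            return True
    return False
```

The same pattern extends to paragraphs or pages by changing the boundary on which the text is split.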

An excellent example of temporal relationships and the use of inversion is the "citation index." This is an index that takes the references in a paper and creates an index out of them so that one can look up a given paper and discover what other later papers have referenced it. This is considered a critical way of evaluating the importance of a paper (how frequently it is cited) as well as locating new material by utilizing older related references that are considered classics because they are still cited many years after they were published.
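The inversion behind a citation index is a small operation. A minimal sketch, assuming each paper's reference list is available as a mapping from the paper to the papers it cites (the single-letter paper names in the usage note are hypothetical):

```python
from collections import defaultdict

def citation_index(references):
    """Invert {paper: [papers it cites]} into {paper: [papers that cite it]}."""
    cited_by = defaultdict(list)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].append(paper)
    return {target: sorted(citers) for target, citers in cited_by.items()}
```

Given `{"B": ["A"], "C": ["A", "B"]}`, looking up "A" in the inverted index yields the later papers B and C. The length of each list is the citation count used to judge a paper's importance.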

Many of those who work with the written word tend to have various piles of papers scattered around the desk, the shelves, and even the floor. Each pile is some appropriate grouping of material. The location of these piles and their spatial proximity to one another, as well as to the individual, are the cues by which we remember what they are. The information worker of today is often not too dissimilar from the carpenter or shoemaker of yesteryear. Information work is a craft in a very real sense. Many of us tend to remember things by association of some sort of spatial representation. This is why systems like "windows" are so popular. Clearly the computer provides the opportunity for unlimited spatial organization of material. We can have road maps that allow us to find the classification of what we want, point to it, and use that operation to generate the information. The use of graphics to depict the classification of information is an extremely useful approach. For certain problems such as engineering design, the graphic design of a physical object is a natural way to categorize information about different parts of the design. Any sort of drawing may be a classification schema where objects in the picture can be used to uncover piles of information that have been grouped under that object.

These forms of spatial indexing allow completely new ways to do things. For example, a picture of a murder scene could now become the index or "non linear table of contents" for a mystery novel. The reader obtains portions of the text by deciding to point to something in the crime scene. Pointing to the body might produce the autopsy report and pointing to a witness might produce background and interview material about that character in the story.

A classification schema is also a method for abstracting concepts. Therefore, it is related to the problem of people conceptualizing their considerations and their approach to dealing with a complex problem. Putting together the considerations and moving them about spatially to formulate relationships for further exploration is another utilization of spatial representation of indexing classifications. Both applications require the same tools and should be considered as a common cognitive application that people deal with. In fact, the concept of spatial relationships is one of the aspects that makes Hypertext so appealing to the user; it ties user satisfaction to the high degree of control that the direct manipulation process provides. Unfortunately the ability to manipulate Hypertext spatially is still rather rare. With the computer one can have a virtually infinite space upon which or in which to create piles of information in any meaningful juxtaposition. In fact, one can do this without upsetting one's spouse unless you both share the same computer.

5.4.5. Search Matching

In collaborative group operations, where many different people are contributing to the creation of objects and their classification within the database, there will be a growing problem with consistency unless there is some sort of formal support in the software for a human to be the final arbitrator of appropriate index terms.

Also, term consistency will be a major problem unless the search criterion allows the use of "letters in sequence" matches. Consider the alternatives for the term "Air Force Base":

Air Force Base, air force base, AFB, A.F.B., A. F. B.

Different people may create text utilizing any of these options. If people are to be able to find all occurrences of this, then searching for "afb" in sequence is the only type of match that will return all the occurrences of Air Force Base (in any form). Letters-in-sequence matches should be a standard approach in user oriented systems. It is an approach the user can understand, and one that lets him or her predict the consequences of a search choice. The search engines on the web that currently examine all words and form an inverted index still do not deal with this problem.
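The text does not define a letters-in-sequence match precisely, so the sketch below adopts one plausible reading and labels it as an assumption. The query letters must match either a single token once punctuation is removed ("AFB", "A.F.B.") or the initial letters of consecutive tokens ("Air Force Base", "A. F. B."). The function name is hypothetical.

```python
import re

def letters_in_sequence(query, text):
    """Return True if the letters of `query` occur in sequence in `text`.

    Assumed interpretation: the letters match one token with punctuation
    stripped, or the initial letters of consecutive tokens.
    """
    query = re.sub(r"[^a-z0-9]", "", query.lower())
    if not query:
        return False
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if query in tokens:
        return True
    # A substring of the joined initials corresponds to consecutive tokens.
    initials = "".join(t[0] for t in tokens)
    return query in initials
```

Under this reading all five written forms of "Air Force Base" are found by the single query "afb". The price is occasional false hits on unrelated word runs with the same initials, which the user can see and discard when reviewing the results.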

In general, if the user supplies all lower case for a search string, the match should be against any mix of upper and lower case which occurs in the actual index or text. Only if the user puts in explicit upper case characters, should the match condition be case dependent.
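This case rule is simple to state in code. A minimal sketch using plain substring matching; the function name is hypothetical.

```python
def matches(query, text):
    """Case-insensitive unless the query contains an explicit upper-case letter."""
    if any(ch.isupper() for ch in query):
        return query in text      # user typed upper case: match case exactly
    return query in text.lower()  # all lower case: match any mix of case
```

The rule is easy for users to predict: typing "afb" finds "AFB", "Afb", and "afb", while typing "AFB" finds only the upper-case form.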

No matter what the implemented heuristics are, they must be clear and understandable to the user. For example, if one is looking for equations and other special text forms, are blanks to be ignored in the match? Is there a way of detecting all possible misspellings? The correction of spelling is an important aspect of creating and maintaining an index. Can a thesaurus be utilized for suggesting synonyms?

In the process of designing indexes and the ways in which they can be searched it must be kept in mind that users can employ a multitude of search strategies:

Scanning: When users are scanning, they tend to cover a large area of the index system but without going into great detail or depth. Typically this is implemented by allowing users to be able to directly scan the raw index or combination of indexes and at least be informed of how many items are associated with each index term.

Browsing: Users in this mode wish to go wherever the data leads them. Users will pursue a path as long as it sustains their interest. Allowing users to do this usually means a fairly rich set of reactive functionality when they are viewing the results of their current search. This is a highly iterative form of searching.

Review Browsing: In this mode the user knows there is information that they wish to learn and to integrate with what they already know. They are looking for things that are new to them. They are looking for anything that is not familiar.

Target Searching: Users are seeking a particular target and it involves continued refinement of a given approach. The user has a very specific object that they are after and the main objective is to find it. The user needs a good understanding of the structure of the index and the rules or processes that are used to classify or place objects.

Exploring: Users are trying to understand or comprehend the nature of the database. This is usually a process that users have to go through before they can feel confident in undertaking any of the other alternatives above.

Wandering: This is random investigation and often results when there are no clues provided to help guide the user into an exploring mode. However, it is sometimes used as an approach to diversion and the stimulus of creative activity.

Users can be searching because they want to learn about the subject matter in the database without knowing ahead of time what is useful to learn. Also users can search as a form of enjoyment because of a natural interest in the material. Also users may find browsing type searching easier and more relaxing than target searching. The less focus the user has the less likely he or she wants to obtain the more detailed level of the data in the information structure. One might use an analogy of the difference in observing a forest by flying over it in a plane or walking on a trail.

Just as with any other part of an interface it is critically important to provide a level of comprehension of the structure of a database and the resulting index structure that is available to the user. There is always considerable difference in the search behavior of users who are experts or novices in the subject area of the information base being searched. Experts tend to browse fewer topics but go to much greater depth of detail than do novices. Novices tend to base their browsing on topics of special interest and common sense type of relationship considerations. Novices are always seeking summaries and higher level overviews of the information. Novices will range widely over the scope of the database. Experts tend to search in an evaluative mode and are quick to detect and distinguish between relevant and non-relevant associations that are present in the data structure. One might say that experts and novices both fly planes and walk along trails but the novice will employ his normal eyesight, while the expert will wish to utilize magnifying glasses, microscopes, and telescopes.

It is the experts that can be utilized to help improve the indexing of any database if the proper feedback mechanisms can be provided to make this an easy task for the expert users. One easy form of feedback that can be very useful is to keep a record of the unsuccessful choices that both novices and experts have attempted. This can be very informative regarding which information is missing or what relationships are missing between the index terms and the information.

5.5. Evaluation Considerations

A significant way to evaluate the ongoing effectiveness of an index is to keep an automated record of any searches made by users that result in no match occurring. This gives some indication of what users are looking for and not finding. The specifics of such a log may indicate the degree of obsolescence occurring with respect to the index. Also in an evolving database it may indicate the need for more new types of data or objects to be added to the database. If there is a person responsible for maintenance of the index they should be provided access to this log. An excellent example of where to apply this is any HELP database that supports users of the system, where the log informs the maintenance people what help users are seeking and not finding.
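Such a log needs very little machinery. The sketch below is one possible shape for it; the class name, the method names, and the choice to normalize queries to lower case are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

class SearchLog:
    """Record searches that returned no match, for review by the index maintainer."""

    def __init__(self):
        self.misses = []  # list of (timestamp, normalized query)

    def record(self, query, hits):
        """Call after every search; only searches with no hits are kept."""
        if not hits:
            self.misses.append((datetime.now(timezone.utc), query.strip().lower()))

    def most_sought(self, n=10):
        """Return the `n` most frequent unmatched queries with their counts."""
        return Counter(q for _, q in self.misses).most_common(n)
```

The maintainer reviews `most_sought()` periodically. A query that fails repeatedly points either to a missing index term or to missing material.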

Another desirable evaluation procedure is to keep a record of each object with respect to the date that it was last retrieved by any user. Even in a personal database this is extremely useful for the individual user who wishes to make an informed decision about which items to delete. Providing tools that will aid users in deciding what to delete is a neglected area of design. Users desperately need help to determine what they can forget. The purchase of bigger disks only puts off the day of reckoning and makes the problem more difficult.
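Given a last-retrieved date for each object, a list of deletion candidates follows directly. In this sketch the one-year threshold, the function name, and the sample object names in the usage note are hypothetical.

```python
from datetime import date, timedelta

def deletion_candidates(last_retrieved, today, unused_days=365):
    """Return ids of objects not retrieved within `unused_days`, oldest first.

    `last_retrieved` maps each object id to the date it was last retrieved.
    """
    cutoff = today - timedelta(days=unused_days)
    stale = [(when, obj) for obj, when in last_retrieved.items() if when < cutoff]
    return [obj for when, obj in sorted(stale)]
```

For example, with `{"memo": date(2020, 1, 5), "budget": date(2023, 6, 1)}` and a current date of 1 July 2023, only "memo" is proposed. The decision to delete stays with the user; the tool only supplies the evidence.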

5.6. Search Procedures

The user undertaking a search is only one step in a much broader task that he or she is going through. The steps in this task include:

1. Perceiving a need to find information. There is some occurrence or need that has triggered this perception.

2. Formulating the request for information. This is the process of creating a representation of the user's requirement that will allow him or her to make an intelligent decision on how to proceed.

3. Selection of the source for the information. Based upon the prior step the user now decides where to go to search for the information. In an integrated computer environment this implies a database of information sources and might in fact trigger a high level search to find the right data source.

4. Specifying the search strategy. This is where an actual determination is made of how to specify a specific search.

5. Carrying out the search.

6. Evaluating the results to determine if useful or relevant information was found and whether all the information that is needed was found.

Step 4 above, to be successful, is dependent upon the knowledge and facility the user has with the particular indexing schema in use. User satisfaction will be dependent on how well he or she feels they understand the indexing approach and how to formulate searches. Making the decision in step 6 as to whether they have achieved "total recall" will also be dependent upon this factor.

The actual steps for carrying out the process in 4 to 6 above are highly sensitive with respect to the degree of control that a user may feel they have over the process. We therefore have both the knowledge of the classification approach and the sense of control over it as the highly critical items in a good search procedure.

The following is a set of recommended procedures to provide the users for improving their ability to carry out searches in steps 4 to 6 above. These are:

4.1 Allow the user to browse the index providing information as to how many objects exist under a given classification in the index.

4.2 In the search formulation processes allow the normal user the ability to express "or" conditions and the more experienced user to add "and" and "not" conditions.

5.1 The result of a search should indicate to the user how many "hits" were collected and then provide a choice as to whether to display the results (in alternative forms) and/or whether to nest the search. Nesting means to undertake a new search that applies only to the set of items resulting from the first choice. Also, the user needs an option to discard the results of the current search and to step back to an earlier result.

6.1 As the user is reviewing the hits, he or she should be provided a way to indicate which of the items are really relevant or useful. Allowing the user to mark the items for incorporation in a marked list is one method of accomplishing this. Sometimes marking the whole list and unmarking items is preferred as a complementary option. If this has been a nested search then these hits are being added to a prior hit list.

6.2 If nesting is allowed then the user should be able to back up to an earlier search and start over. The user may need to compare the current list of hits with the one that he or she has been accumulating and functions such as automatically purging duplicates could be very useful for this process.

6.3 Being able to store and reuse a successful search at a later time is also a desirable feature.

6.4 Finally, a user needs to be able to copy, print, or file the relevant items found as a result of the search procedures.
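Procedures 4.1 to 6.4 can be sketched as a small in-memory search session. This is only an illustration under stated assumptions: the class name SearchSession, its method names, and the recipe data are all hypothetical, and the index is assumed to be a simple mapping from each classification term to the set of objects filed under it.

```python
from collections import defaultdict


class SearchSession:
    """Hypothetical sketch of procedures 4.1 to 6.4; every name is illustrative."""

    def __init__(self, documents):
        # documents: mapping of object id -> set of index terms (an assumption)
        self.index = defaultdict(set)
        for doc_id, terms in documents.items():
            for term in terms:
                self.index[term].add(doc_id)
        self.universe = set(documents)
        self.history = []   # stack of earlier result sets (5.1, 6.2)
        self.marked = []    # accumulated relevant items, in marking order (6.1)
        self.saved = {}     # named searches kept for reuse (6.3)

    def browse(self):
        # 4.1: each classification with the number of objects under it
        return {term: len(ids) for term, ids in sorted(self.index.items())}

    def search(self, any_of=(), all_of=(), none_of=(), nested=False):
        # 4.2: "or" terms (any_of), plus optional "and" (all_of) and "not" (none_of)
        # 5.1: a nested search applies only to the previous result set
        scope = self.history[-1] if nested and self.history else self.universe
        hits = set(scope)
        if any_of:
            hits &= set().union(*(self.index.get(t, set()) for t in any_of))
        for t in all_of:
            hits &= self.index.get(t, set())
        for t in none_of:
            hits -= self.index.get(t, set())
        self.history.append(hits)
        return len(hits)    # 5.1: report the hit count before displaying anything

    def back_up(self):
        # 5.1 / 6.2: discard the current result and step back to the earlier one
        if self.history:
            self.history.pop()
        return len(self.history[-1]) if self.history else 0

    def mark(self, doc_ids):
        # 6.1: add relevant hits to the marked list; 6.2: purge duplicates
        current = self.history[-1] if self.history else set()
        for doc_id in doc_ids:
            if doc_id in current and doc_id not in self.marked:
                self.marked.append(doc_id)

    def save(self, name, **criteria):
        # 6.3: store a successful search under a name
        self.saved[name] = criteria

    def rerun(self, name):
        # 6.3: reuse a stored search at a later time
        return self.search(**self.saved[name])

    def export(self):
        # 6.4: the marked items as text, ready to copy, print, or file
        return "\n".join(self.marked)


recipes = {
    "r1": {"soup", "vegetarian"},
    "r2": {"soup", "chicken"},
    "r3": {"salad", "vegetarian"},
    "r4": {"dessert"},
}
session = SearchSession(recipes)
print(session.browse())
print(session.search(any_of=["soup", "salad"]))             # 3 hits
session.mark(["r1"])
print(session.search(all_of=["vegetarian"], nested=True))   # 2 hits
session.mark(["r1", "r3"])                                  # r1 is not added twice
print(session.export())
```

Keeping the earlier result sets on a stack is one simple way to support both nesting and backing up; a real system would also need to decide how long such intermediate results are retained.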

The search index is in itself a list. Therefore all the functionality needed to allow a person to maintain and update a list is needed for maintenance of the index. In addition, one needs the added functionality to be able to deal with the set of objects that each index classification refers to. The user needs to be able to take the items under a given classification and split them into separate lists, and to take the items under different classifications and merge them under one classification.
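The split and merge operations described above can be sketched as two small functions. This is a minimal sketch, assuming the index is held as a dictionary from classification name to a set of object identifiers; the function names and the sample data are hypothetical.

```python
def merge_classifications(index, sources, target):
    """Move every object filed under the `sources` classifications into `target`."""
    merged = set(index.get(target, set()))
    for name in sources:
        # Remove each source classification and absorb its objects
        merged |= index.pop(name, set())
    index[target] = merged
    return index


def split_classification(index, source, assignments):
    """Split `source` into new classifications.

    `assignments` maps each new classification name to the objects it should
    take. Objects not assigned anywhere stay under `source`; an object named
    in two assignments goes only to the first one processed.
    """
    remaining = set(index.get(source, set()))
    for new_name, objects in assignments.items():
        taken = remaining & set(objects)
        index.setdefault(new_name, set()).update(taken)
        remaining -= taken
    if remaining:
        index[source] = remaining
    else:
        index.pop(source, None)   # drop the classification once it is empty
    return index


index = {"soup": {"r1", "r2"}, "stew": {"r3"}}
merge_classifications(index, ["soup", "stew"], "one-pot")
print(index)
split_classification(index, "one-pot", {"soup": ["r1", "r2"]})
print(index)
```

Both functions change the index in place, which matches the idea of maintaining a single shared list; an application that needs an undo facility would instead work on a copy.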

5.7. Observations

The area of Information Retrieval is one where a considerable amount of tailoring is possible for a particular application. Also, tailoring is likely to be required as a function of the expertise of the group utilizing the application. When dealing with experts there should already exist an accepted structure for organizing and categorizing information. It is up to the designer to seek out this information and to utilize it in the design of the system.

When a structure for indexing is not immediately clear it is probably best to focus on a key word and phrase approach. For an organization, it would be very desirable to have a standardized key word system that can be incorporated into any new application.

As we have seen with the recipe system, it is often appropriate to combine a variety of indexing approaches in a single application. In the recipe system we had a general key word ability, specific coordinate keys to handle different nutritional or taste dimensions, phrases in terms of recipe titles, and a restricted ingredients-only index. As an application evolves and the system becomes richer, it is very likely that a variety of indexing approaches will be needed.

Spatial indexing is very useful for the individual and his or her personal filing system; however, for a group of users nothing matches the richness of semantic type indexing for fostering mutual understanding of a body of knowledge.

There is a rich literature of statistical techniques for improving the effectiveness of indexing, such as dynamically evolving measures of association and fuzzy relationships among indexing terms and the objects they point to. These techniques basically allow users to get quantifiable measures of how relevant a given object is to the search criteria they have specified. The larger the database and the more expert the user, the greater the potential benefit of these approaches.
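As one illustration of a quantifiable relevance measure, the sketch below ranks objects by summing an inverse-document-frequency weight for each search term they are indexed under, so that rarer terms count for more. This weighting is a common simple choice substituted here for illustration; it is not the specific association or fuzzy measure referred to above, and the function name and sample data are hypothetical.

```python
import math


def relevance_scores(documents, query_terms):
    """Rank objects by the summed rarity weight of the query terms they carry.

    documents: mapping of object id -> set of index terms (an assumption).
    Each matching term contributes log(N / df), where N is the number of
    objects and df is the number of objects indexed under that term.
    """
    n = len(documents)
    df = {}
    for terms in documents.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1

    scores = {}
    for doc_id, terms in documents.items():
        scores[doc_id] = sum(
            math.log(n / df[t]) for t in set(query_terms) if t in terms
        )
    # Highest score first; ties are broken by object id for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


recipes = {
    "a": {"soup", "spicy"},
    "b": {"soup"},
    "c": {"salad"},
}
for doc_id, score in relevance_scores(recipes, ["soup", "spicy"]):
    print(doc_id, round(score, 3))
```

A term that appears under every object contributes nothing to the score, which reflects the intuition that such a term does not help discriminate among the objects.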

Perhaps the greatest challenge to the designer in the area of searching lies in Hypertext-oriented databases that are the collaborative product of many different contributors of information. The introduction of powerful analytical methods is probably crucial to the long-term solution of this problem.

5.8. Assignments

1. Devise an indexing schema for college courses that would allow students to find courses by content and by the specific prerequisite material utilized in one course and supplied by others. The goal is to produce a sizable improvement in the informativeness of standard college catalogs for the benefit of students. Rather than just producing an electronic version of the college catalog, we wish to see what new and useful ways information can now be collected.

2. What sort of metaphor and description would you provide a user for his or her own personal spatial system of creating piles of information? What is the list of key functions or actions that need to be made available?

3. For each index type, suggest what appears to be an ideal situation for utilizing that particular indexing method.

4. Give an example of a real application package you are familiar with where you feel there are some significant shortcomings in the current indexing methods provided. Explain what these shortcomings are and suggest how they could be eliminated by specific improvements or changes to the indexing scheme.

5. Devise and lay out the interface design for a general key word and phrase system that can serve as a standard index approach for any application in your organization that can make use of free and fixed key words. Make sure you design the functionality for doing regular maintenance on the index.

6. For each of the design strategies (browsing, target searching, etc.), what specific interface features or data structure functionality would you suggest to aid the user seeking to use that particular strategy?

7. What are some possible differences in interface features to support the novice or expert in the subject of the database?

8. Characterize the possible search strategies users employ on the World Wide Web. What sort of approaches could be taken to improve the search situation on the Web if users are allowed to index, in some manner, their own material?

5.9. References

Borko, Harold, and Charles L. Bernier, Indexing Concepts and Methods, Academic Press, 1978.

Dewey, Melvil, Library School Rules, 5th edition, Boston Library Bureau, 1894.

Hofland, Knut, and Stig Johansson, Word Frequencies in British and American English, The Norwegian Computing Center for the Humanities, Bergen, Norway, 1982.

Kochen, Manfred, Principles of Information Retrieval, Melville Publishing, Los Angeles, California, 1974.

Lancaster, F. Wilfrid, Information Retrieval Systems, John Wiley and Sons, 1979.

Meadow, Charles T., The Analysis of Information Systems, Melville Publishing, Los Angeles, California, 1973.

Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.

Zipf, George Kingsley, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Hafner Publishing Company, 1972.