Free text retrieval, or simply information retrieval (IR), concerns providing topic-based searches of document collections. Such systems typically operate with natural language documents and queries (often of the form "find me documents about ..."). These notes provide an explanation of the black box which carries out the matching.
Although a user of a free text retrieval system does not need to understand how free text retrieval works, it seems reasonable that users will perform better if they have some understanding of the black box; a rough understanding of how IR engines work can also satisfy the curiosity of expert users. The algorithms and techniques used in free text retrieval are essentially very simple, so a good understanding of the process is not difficult to attain. The difficulty in implementing a free text retrieval system comes not from the basic approaches, but from making them run fast enough to be useful.
For a full overview of free text retrieval the reader should consult the two main texts on information retrieval (van Rijsbergen 79 and Salton 71). For state of the art research in the area, the main source is the proceedings of the SIGIR conferences, the annual conference of the ACM's special interest group on IR (Croft and van Rijsbergen 94).
Once the documents are indexed, a matching routine is required to compare document indexes with user-submitted queries. This is the first stage of retrieval from the document collection.
Once retrieved, documents are displayed on the user's screen. Typically, a list of summary information is presented and the user can then look in more detail at any of the items which are of particular interest. The summary list is usually ordered by decreasing strength of match, so the documents most likely to be relevant appear at the top.
It is often not possible for a retrieval engine to take a single query and find all documents which the user would consider relevant to the query. There are two main reasons for this limitation: a query is rarely complete enough to specify all relevant documents, and the retrieval engine is not powerful enough to match all relevant documents. To overcome this problem the user is often allowed to provide relevance feedback by stating which retrieved documents are similar to what (s)he is looking for. This information is then used to form a new query, and the process repeats until the user is satisfied (s)he cannot find any more relevant documents.
The next section of this document will look at these stages in more detail. Throughout these notes the term matching document will be used to refer to a document which the system has matched to a query. The term relevant document will refer to a document which the user thinks matches his/her query. In an ideal world there would be no difference; in practice, however, some relevant documents will not be matched by a real-life retrieval engine and some matched documents will not be relevant.
Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it. "And what is the use of a book", thought Alice, "without pictures or conversations"?
would, through extremely simple indexing, be converted into the following table:
These words would be considered as the content of the document for retrieval purposes. In this simple indexing the words have simply been listed in alphabetical order (thus losing contextual information), reduced to lower case, and stripped of punctuation.
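As a sketch, this simple indexing step might be implemented as follows (lower-casing, stripping punctuation and listing the words alphabetically):

```python
import string

def simple_index(text):
    """Lower-case the text, strip punctuation and list the words
    in alphabetical order."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return sorted(cleaned.split())

words = simple_index("And what is the use of a book, thought Alice, "
                     "without pictures or conversations?")
print(words)
```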
Although much of the information has now been removed we are left with a tighter list of words which are likely to distinguish this document from others.
Stop word lists can either be taken from a standard source (see van Rijsbergen 79 for a sample) or can be made by analysing the most frequently used words in each document collection - if a word occurs in most documents then it is a very poor discriminator of relevant and irrelevant documents and can, thus, be listed as a stop word.
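The frequency-based approach to building a stop list can be sketched as follows; the 0.8 threshold is an illustrative assumption, not a standard value:

```python
def build_stop_list(documents, threshold=0.8):
    """Treat any word appearing in more than `threshold` of the
    documents as a poor discriminator, i.e. a stop word."""
    doc_count = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            doc_count[word] = doc_count.get(word, 0) + 1
    return {w for w, n in doc_count.items() if n / len(documents) > threshold}

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the bird flew over the hills"]
print(build_stop_list(docs))  # 'the' occurs in every document
```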
This set of terms will now match a wider set of queries (assuming queries are also stemmed). There are two common approaches to stemming: algorithmic and dictionary based.
Algorithmic approaches, typified by Porter (1980), make basic assumptions about how suffixes are added to words and then try to alter words according to these rules. For the vast majority of words this approach leads to a system which will correctly stem two variants of a word to the same index term. This index term is, however, often not a proper English term; as an example, pictured may be stemmed to pictur, but this is not too serious since picture and pictures would also be stemmed to pictur. The biggest problem this introduces is when two independent words are stemmed to the same term, for example Porter's algorithm stems both communism and community to commun. Even this is not very serious since it can simply be viewed as introducing an extra homonym into English. Since the other words in the query are likely to distinguish whether we mean communism or community there should be no problem in real retrievals, just as other words in queries distinguish between financial bank and river bank.
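As an illustration, a drastically simplified suffix stripper in the spirit of (but much cruder than) Porter's algorithm might look like this; the suffix list and the minimum-stem-length check are invented for the example and are not Porter's actual rules:

```python
def crude_stem(word):
    """A much-simplified suffix stripper; the real Porter algorithm
    applies its rules in ordered phases with measure conditions,
    all of which are omitted here."""
    for suffix in ("ational", "ement", "ing", "ed", "es", "s", "e"):
        # Strip the first matching suffix, provided a reasonable stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("pictured", "pictures", "picture"):
    print(w, "->", crude_stem(w))  # all three reduce to 'pictur'
```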
Dictionary based approaches have large translation tables which are looked up for each word and contain the base form of the word. These tables lack the error prone nature of algorithmic approaches but are reliant on their dictionary having an entry for every word in the document base: they have no method of coping with an unknown word.
The weights here are given as decimal fractions between 0 and 1.
Sparck Jones (1980) developed an automatic algorithm for calculating these scores known as inverse document frequency (IDF). It is based on comparing the number of documents in which the specific term occurs, n, with the total number of documents in the collection, N. Essentially the algorithm can be expressed as log(N/n)+1, which gives a high weight to very rare terms and a weight of almost one to terms which occur in almost every document.
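A sketch of the IDF calculation described above:

```python
import math

def idf(term, documents):
    """log(N/n) + 1, where N is the number of documents in the
    collection and n the number which contain the term."""
    n = sum(1 for doc in documents if term in doc.lower().split())
    N = len(documents)
    return math.log(N / n) + 1 if n else 0.0

docs = ["the cat sat", "the dog ran", "the cat ran"]
print(idf("the", docs))  # occurs everywhere -> weight of exactly 1
print(idf("dog", docs))  # rare term -> higher weight
```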
It can also be useful to take into account how many times each term occurs in a document, as this gives an indication, especially for large documents, of how important the term is within the document. This is usually done simply by dividing the number of occurrences of the specific term in the document by the total number of term occurrences in the document. In our example there were a total of 17 term occurrences (excluding stop words); the terms book, conversation and picture occurred twice with the rest occurring once. This leads to a revised weighting as follows:
Although the numbers are rather abstract the ordered list of terms gives a sensible listing of how important each term is for describing the sample document: conversation, book, picture, alice, peep, once, twice, sister, without, think, read, her, use, and what.
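The term frequency calculation for the sample paragraph can be sketched as follows, using the counts given above (17 term occurrences, three terms occurring twice):

```python
def term_frequency(counts):
    """Divide each term's occurrence count by the total number of
    term occurrences in the document."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# The sample paragraph: 17 term occurrences, of which book,
# conversation and picture each occur twice.
counts = {"book": 2, "conversation": 2, "picture": 2,
          "alice": 1, "peep": 1, "once": 1, "twice": 1, "sister": 1,
          "without": 1, "think": 1, "read": 1, "her": 1, "use": 1, "what": 1}
tf = term_frequency(counts)
print(round(tf["book"], 3), round(tf["alice"], 3))  # 2/17 and 1/17
```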
The length of a vector is given by l = sqrt(V1^2 + V2^2 + ... + VN^2), where N is the number of terms used in the document collection and V is the document's vector (of length l). For the sample paragraph the length is currently 0.158, so to normalise the vector we would divide it by 0.158 or multiply each term's weight by 6.33. This leads to a new unit length vector of:
This is the final representation that most systems would have of the sample paragraph. Some systems go further and introduce additional information, most commonly thesaurus style information, while others do not save IDF information in the document vectors but store it separately.
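The normalisation to unit length can be sketched as follows (the two-term vector here is invented for illustration):

```python
import math

def normalise(vector):
    """Scale a term-weight vector to unit length: divide every weight
    by sqrt(sum of squared weights)."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {term: w / length for term, w in vector.items()}

v = normalise({"alice": 0.3, "book": 0.4})  # length 0.5 before normalising
print(v)  # weights scaled to 0.6 and 0.8
print(math.sqrt(sum(w * w for w in v.values())))  # unit length afterwards
```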
While not universally implemented, thesaurus information can drastically increase the number of documents which are likely to match a query, since more individual terms are used to describe each document. Domain specific thesauri are often the most effective (e.g. in a computing document base bus can be treated as equivalent to car, while in a transport document base computer can be treated as equivalent to word processor). It can, however, also be beneficial to simply use a standard thesaurus. Any thesaurus can be augmented by analysing the document collection to establish if two terms tend to be used as synonyms within that collection.
Most of the work of the retrieval engine has already been done - that is why so much effort is put into indexing: it makes querying simpler and faster.
There are two main ways of providing a query: in boolean logic and free text. Boolean retrieval treats the query very differently from the documents and doesn't require as much indexing information as we have provided. Free text querying is further split into the two main ways of approaching the matching problem, probabilistic and vector space, both of which initially index queries in the same way as documents.
This level of querying requires very little indexing - essentially the list of words used in each document is all that is required. Stemming could be implemented to work with this approach but it is usually left to the user to indicate when stemming is appropriate, through the use of wild cards. Wild cards allow the user to roughly specify a word by saying that (s)he doesn't mind which letters occur, e.g. conversation* would match all words which start with conversation, or to allow for a single uncertain letter, e.g. summari?e would match summarise and summarize. This reduces the work of the retrieval engine but puts much more complexity in the hands of the user.
Typical operators in boolean matching systems include and (both terms must exist), or (one of the two terms must exist), and not (the term must not exist). We can, thus, look for systems which discuss computers (in the singular or plural) and image processing by issuing the query:
computer* and image and processing
If we consider image manipulation an equivalent term to image processing but do not want anything on virtual reality, then we can issue:
computer* and image and (processing or manipulation) and not (virtual and reality)
Unfortunately, most computing scientists are stretched by the complexity of logical expressions like this - never mind novice computer users in a library. Many boolean systems provide further constructs to improve the power of the query language, for example to find two words within the same sentence. Although these do improve the power of the language, they further increase the complexity of the task the user is faced with.
Boolean matching allows very precise queries to be formulated but imposes a lot of complexity on the user to create the query. This is further hindered by the difference between the logical constructs 'and' and 'or' and their day-to-day meanings - especially when combined with 'not'.
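As an illustration, the earlier boolean query could be evaluated directly against a document's word set; the document contents here are hypothetical:

```python
import fnmatch

def matches(doc_words, pattern):
    """True if any word in the document matches the (possibly
    wild-carded) query term."""
    return any(fnmatch.fnmatch(w, pattern) for w in doc_words)

doc = {"computers", "image", "processing", "graphics"}

# computer* and image and (processing or manipulation)
#   and not (virtual and reality)
hit = (matches(doc, "computer*") and matches(doc, "image")
       and (matches(doc, "processing") or matches(doc, "manipulation"))
       and not (matches(doc, "virtual") and matches(doc, "reality")))
print(hit)  # this document satisfies the query
```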
document .5 alice .5 discuss .5 conversation .5
The weight of .5 gives the four terms equal weight (since they all occur once) and a vector of unit length. Although this introduces some additional terms (document and discuss), users often perform better by thinking in full English sentences rather than making a list of keywords (this problem is reduced after relevance feedback - see later).
The main advantage of the approach taken in pure free text retrieval, whether vector space or probabilistic, is the shift of complexity from the user's task to the system. This makes issuing a query much simpler, but imposes major computational problems on the system. These complexities tend to result in error prone retrieval engines: unlike boolean retrieval, free text retrieval systems can make mistakes - either missing relevant documents or wrongly retrieving non-relevant ones. Although similar behaviour can occur in boolean retrieval, it can, in theory, be prevented by a better specified query. Free text querying reduces the power of the query language and so reduces the possibility of improving the query - as such the user has to be more aware of the errors, and the system's user interface must be designed to minimise their impact (e.g. through summary lists and user selection for printing). Despite this, the matching performance of free text engines has been shown to be no worse than that of boolean systems; they often outperform boolean systems and take much less time to use - since little time is spent composing a query.
The reduction in the strength of the language removes the user's ability to specify precise queries and also removes any semantic content in the words used; for example, consider the free text query:
I'm interested in computers and image processing but not virtual reality.
An IR system would automatically match documents which also mention computer in the singular and, through the weight of other words, documents discussing image manipulation (although a thesaurus would strengthen this link). The word not would, however, be treated as a stop word and, thus, be totally ignored - the user would be presented with documents discussing virtual reality. Documents which discuss images of computer processors for virtual reality would also be matched. These are the most common problems of presenting users with what looks like an intelligent language system but isn't; these problems are, however, very small compared to the benefit of quickly specified natural queries and the backup of stemming, weighting and non-definitive queries. So how does the matching work...
Now that the query and document are both represented as vectors, the vector space model considers their similarity to be based on the angle between the two vectors in space. Up until this point (and with the probabilistic model), the vector has simply been a convenient mathematical model for storing a list of terms and their weights. The vector space model then makes the jump to processing them as if they were real geometrical vectors in a space with thousands of dimensions. Although this seems rather strange initially, it is based on an extension of a simple matching routine for non-weighted indexes. Consider non-weighted indexes for the above query and the sample document: these are basically a list of four words for the query and 14 for the document. A rough measure of matching strength is the number of terms they have in common, in this case two. This doesn't take into account how large each document is and would have a tendency to match larger documents, so we can divide by the total number of terms in the query and document. This leads to Dice's coefficient:

m = 2|D ∩ Q| / (|D| + |Q|)
where |D ∩ Q| is the number of terms common to the document and query, |D| is the number of terms in the document, |Q| the number in the query, and m the matching value - the fraction is doubled to give a maximum value of 1, instead of 0.5, for matching a document with itself. For our query and document this leads to:
m = (2 × 2) / (14 + 4) = 4/18 ≈ 0.222
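The same calculation, sketched as code using the 14 index terms of the sample document and the four query terms:

```python
def dice(doc_terms, query_terms):
    """Dice's coefficient: twice the overlap divided by the total
    number of terms in document and query."""
    common = len(doc_terms & query_terms)
    return 2 * common / (len(doc_terms) + len(query_terms))

doc = {"conversation", "book", "picture", "alice", "peep", "once", "twice",
       "sister", "without", "think", "read", "her", "use", "what"}
query = {"document", "alice", "discuss", "conversation"}
print(round(dice(doc, query), 3))  # (2 * 2) / (14 + 4)
```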
When considering weighted terms, like those we indexed, it is not possible to simply count the number of terms in common. Instead the vector space model multiplies the term weights together. For the vast majority of terms either the document or the query will have a zero weight, hence the resulting weight is zero. These individual new weights are then summed to give the top line of the matching algorithm. For the query above and the last form of the sample document's index, this leads to:
m = (D1Q1 + D2Q2 + ... + DNQN) / (||D|| × ||Q||) ≈ 0.43
where ||V|| is the length of the document, Vi is the weight of term i in vector V, and N is the total number of individual terms (the dimensionality of V). In geometry this equation is used to calculate the cosine of the angle between the two vectors, hence this matching routine is known as the cosine coefficient.
Although quite simple to understand, this approach has no sound basis in information theory - there is no theoretical reason for it to be a good matching algorithm. The cosine coefficient does, however, perform well in practice, is reasonably easy to code and is used in many retrieval systems.
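A sketch of the cosine coefficient, using the query vector given earlier; the two-term document vector is hypothetical, chosen to have unit length:

```python
import math

def cosine(d, q):
    """Cosine coefficient: sum of products of matching term weights,
    divided by the product of the two vector lengths."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    len_d = math.sqrt(sum(w * w for w in d.values()))
    len_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (len_d * len_q)

q = {"document": 0.5, "alice": 0.5, "discuss": 0.5, "conversation": 0.5}
d = {"alice": 0.6, "conversation": 0.8}  # hypothetical unit-length document
print(round(cosine(d, q), 3))
```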
Probabilistic models are based on expanding known information to the unknown and assigning a probability that the expansion is accurate. In IR, this requires a set of documents already known to be relevant to the user. The system then estimates the probability that each other document is also relevant to the user. Croft and Harper's combination matching algorithm was designed to help the user find the first relevant documents.
The mathematics involved in the probabilistic model are too complex to develop here; however, from a user's point of view the complexity can be overcome by considering it as a more theoretically sound version of the vector space approach (the indexing and weighting schemes are often the same). Van Rijsbergen (1979) gives an extended discussion of the probabilistic model, which is developed further in a later research report (van Rijsbergen 1992).
In the vector space model, this information can be used to alter the query to more closely represent the user's information requirement and some of the particular aspects of the retrieval engine. A simple, but effective, approach is to add a fraction of each relevant document vector to the query vector. If we issued the initial query from above:
document .5 alice .5 discuss .5 conversation .5
which was deemed by the system as a match to the sample document:
it would be possible to mark this as relevant. The system would then scale the document vector by, say, 1/4 and add the two vectors to give a new query of:
To finalise the new query, it would then be normalised by dividing by the length of the new vector (1.13).
Note: this approach was taken for simplicity throughout these notes, although it is slightly odd since IDF information is included in both the new query and the documents it will match; recording IDF information separately is, thus, more accurate.
Although this has not altered the query considerably, it has reduced the relative weights of the two rogue terms and introduced a group of new, but low weighted, terms. With relevance feedback on a few documents, these effects will be compounded when new terms are found which are common to the set of marked relevant documents. By this means, the query can be adjusted to take into account the original textual query plus information about documents which were found to be relevant.
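The feedback step described above (add a fraction of each relevant document's vector to the query, then renormalise) can be sketched as follows; the relevant document's vector here is hypothetical:

```python
import math

def feedback(query, relevant_doc, fraction=0.25):
    """Add a fraction of a relevant document's vector to the query,
    then renormalise the result to unit length."""
    new = dict(query)
    for term, w in relevant_doc.items():
        new[term] = new.get(term, 0.0) + fraction * w
    length = math.sqrt(sum(w * w for w in new.values()))
    return {t: w / length for t, w in new.items()}

q = {"document": 0.5, "alice": 0.5, "discuss": 0.5, "conversation": 0.5}
d = {"alice": 0.6, "conversation": 0.8}  # hypothetical relevant document
new_q = feedback(q, d)
print({t: round(w, 3) for t, w in new_q.items()})
```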
Relevance feedback can prove to be an extremely powerful facility since users need not spend a long time formulating an initial query but simply enter a quick sentence and then expand their query by feedback.
So far we have discussed the processes behind the scenes of the black box of information retrieval. In this section we will consider the evaluation of IR systems. This is intended to give you some idea of the main testing approaches and of what issues are involved in improving IR engines.
The technical benchmark tests in information retrieval are usually conducted with standard test collections. The collections are composed of a set of documents, a set of queries and a list of answers. The answers are usually based upon a panel of experts stating which documents they think are relevant to which queries. Testing a system simply involves indexing the documents, running the queries and comparing the list of matched documents with the answers.
There are two standard measures of how well a particular system performs: recall and precision. Recall states the fraction of relevant documents which the system successfully matches (i.e. the number of correct answers retrieved divided by the total number of correct answers for this query). Precision states the fraction of the matched documents which are relevant. Recall essentially states how good the system is at including relevant documents in the list of matched documents, while precision states how good it is at excluding non-relevant ones. An ideal IR system would have 100% precision and 100% recall, but in practice precision is traded for recall: the more relevant documents a system retrieves, the more non-relevant documents also get included.
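The two measures can be sketched as follows; the document identifiers are invented:

```python
def recall_precision(matched, relevant):
    """Recall: fraction of the relevant documents that were retrieved.
    Precision: fraction of the retrieved documents that are relevant."""
    hits = len(matched & relevant)
    return hits / len(relevant), hits / len(matched)

matched = {"d1", "d2", "d3", "d4"}   # what the system retrieved
relevant = {"d1", "d2", "d5"}        # the expert panel's answers
r, p = recall_precision(matched, relevant)
print(r, p)  # 2/3 recall, 2/4 precision
```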
There are strong arguments against the test collection approach, in particular it assumes the query is a perfect statement of the information requirement and that the experts have a categorical understanding of this. The approach also fails to take into account more user oriented aspects such as relevance feedback and the user interface. The approach does, however, provide the only way of categorically comparing two retrieval engines and is, thus, often used to assess whether a new technique does improve retrieval.
While being much more time consuming and expensive to carry out, and not as definitive as benchmark tests, user tests can provide more information than a simple comparison of in-depth retrieval algorithms.
Croft, W.B., and van Rijsbergen, C.J. (editors), Proceedings of SIGIR '94, Springer-Verlag, London. (1994)
Dunlop, M.D., and van Rijsbergen, C.J., "Hypermedia and free text retrieval", Information Processing and Management, vol. 29 (3). (May 1993)
Porter, M.F., "An Algorithm for Suffix Stripping", Program, vol. 14 (3), pp. 130-137. (July 1980)
van Rijsbergen, C.J., Probabilistic Retrieval Revisited, Departmental Research Report 1992/R2, Computing Science, University of Glasgow. (1992)
van Rijsbergen, C.J., Information Retrieval (second edition), Butterworths, London. (1979)
Salton, G. (editor), The SMART Retrieval System, Prentice Hall, New Jersey. (1971)
Sanderson, M., "Word Sense Disambiguation and Information Retrieval", Proceedings of SIGIR '94 (edited by Croft and van Rijsbergen), Springer-Verlag, London. (1994)
Sparck Jones, K., and Webster, C.A., Research on Relevance Weighting, British Library Research and Development Report 5553, Computer Lab., University of Cambridge. (1980)