Free Text Retrieval:
The Magic Explained

8 September 1994, University of Hertfordshire
Mark D Dunlop, University of Paisley

Free text retrieval, or simply information retrieval, concerns providing topic based searches of document collections. Typically such systems operate with natural language documents and queries (typically of the form "find me documents about ..."). These notes provide an explanation of the magic box which carries out the matching.

Introduction

This document provides an introduction to free text information retrieval. It starts with an overview of the whole process and then works through each of the main aspects of this process.

Although a user of a free text retrieval system does not need to understand how free text retrieval works, it seems reasonable that they will perform better if they have some understanding of the black box. Understanding roughly how IR engines work can also satisfy the curiosity of expert users. The algorithms and techniques used in free text retrieval are, essentially, very simple, so a good understanding of the process is not difficult to attain. The difficulty in implementing a free text retrieval system comes not from the basic approaches, but from making them run fast enough to be useful.

For a full overview of free text retrieval the reader should consult the two main texts on information retrieval (van Rijsbergen 79 and Salton 71). For state of the art research in the area, the main source is the proceedings of the SIGIR conferences, the annual conference of the ACM's special interest group on IR (Croft and van Rijsbergen 94).

Overview of the process

To allow documents to be accessed by their content it is first necessary to index them. This process varies depending on the style of access which is intended, but it essentially summarises each document into the basic information which the matching routine will require.

Once indexed a matching routine is required to compare document indexes with user submitted queries. This is the first stage of retrieval from the document collection.

Once retrieved, documents are displayed on the user's screen. Typically, a list of summary information is presented and the user can then look in more detail at any of the items which are of particular interest. The summary list is usually listed in decreasing strength of matching, so the most likely to be relevant are listed at the top of the list.

It is often not possible for a retrieval engine to take a single query and find all documents which the user would consider relevant to the query. There are two main reasons for this limitation: a query is rarely complete enough to specify all relevant documents, and the retrieval engine is not powerful enough to match all relevant documents. To overcome this problem the user is often allowed to provide relevance feedback information by stating which documents are similar to what they are looking for. This information is then used in a new query and the process repeats until the user is satisfied (s)he cannot find any more relevant documents.

In summary, the process runs: index the documents; match the document indexes against the user's query; display the matched documents in decreasing order of matching strength; and, where necessary, gather relevance feedback and repeat.

The next section of this document will look at these stages in more detail. Throughout these notes the term matching document will be used to refer to a document which the system has matched to a query. The term relevant document will refer to a document which the user thinks matches his/her query. In an ideal world there should be no difference; however, some relevant documents will not be matched by a real life retrieval engine and some matched documents will not be relevant.

Indexing

Most free text retrieval systems work by using terms to describe the content of a document. Terms are based on words extracted from the content of the documents and are usually stored in a table with no duplicates. As an example the paragraph

Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it. "And what is the use of a book", thought Alice, "without pictures or conversations"?[1]

would, through extremely simple indexing, be converted into the following table:

a

alice

and

book

but

conversations

had

her

in

into

is

it

no

of

once

or

peeped

pictures

reading

she

sister

the

thought

twice

use

was

what

without

These words would be considered as the content of the document for retrieval purposes. In this simple indexing the words have simply been listed in alphabetical order (thus losing contextual information), reduced to lower case, and stripped of punctuation.

Stop word list

The first level of linguistic processing which typically occurs is the removal of stop words - these are words which have no meaning when taken out of context. In the above example words like a, and, in, it, of, and the would typically be removed. This would lead to an index as follows:

alice

book

conversations

her

once

peeped

pictures

reading

sister

thought

twice

use

what

without

Although much of the information has now been removed we are left with a tighter list of words which are likely to distinguish this document from others.

Stop word lists can either be taken from a standard source (see van Rijsbergen 79 for a sample) or can be made by analysing the most frequently used words in each document collection - if a word occurs in most documents then it should be considered as a very poor discriminator of relevant and irrelevant documents and, thus, listed as a stop word.
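The simple indexing and stop word removal described above can be sketched in a few lines of code. This is a minimal illustration only: the stop word list here is a tiny assumed sample, where a real system would use a much larger one.

```python
import re

# Assumed miniature stop word list, for illustration only.
STOP_WORDS = {"a", "and", "but", "had", "in", "into", "is", "it",
              "no", "of", "or", "she", "the", "was"}

def index(text):
    """Lowercase, strip punctuation, drop stop words, remove duplicates."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words) - STOP_WORDS)

paragraph = ('Once or twice she had peeped into the book her sister was '
             'reading, but it had no pictures or conversations in it. '
             '"And what is the use of a book", thought Alice, '
             '"without pictures or conversations"?')
terms = index(paragraph)
```

Running this over the Alice paragraph produces exactly the fourteen-word list shown above.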

Stemming

To try and increase the number of documents which match a query it would be useful to convert words into their base form. The simplest form of this would be to remove plurals, so that a query which uses the term conversation would match our document. It is also useful to remove more general suffixes from words, for example those indicating tense (e.g. peeped -> peep). This is not, however, an easy problem, in general, since many words in English do not have simple suffixes (e.g. think). With an ideal stemming algorithm we would get the following list of base terms which would describe our paragraph:

alice

book

conversation

her

once

peep

picture

read

sister

think

twice

use

what

without

This set of terms will now match a wider set of queries (assuming queries are also stemmed). There are two common approaches to stemming: algorithmic and dictionary based.

Algorithmic approaches, typified by Porter (1980), make basic assumptions about how suffixes are added to words and then try to alter words according to these rules. For the vast majority of words this approach leads to a system which will correctly stem two variants of a word to the same index term. This index term is, however, often not a proper English term; as an example, pictured may be stemmed to pictur, but this is not too serious since picture and pictures would also be stemmed to pictur. The biggest problem this introduces is when two independent words are stemmed to the same term, for example Porter's algorithm stems both communism and community to commun. Even this is not very serious since it can simply be viewed as introducing an extra homonym into English. Since the other words in the query are likely to distinguish whether we mean communism or community there should be no problem in real retrievals, just as other words in queries distinguish between financial bank and river bank[2].
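A toy suffix-stripper in the spirit of the algorithmic approach can be sketched as below. Porter's real algorithm has many more rules and context conditions; the four suffixes here are illustrative assumptions only, and (unlike Porter's algorithm) this sketch leaves words such as communism untouched.

```python
def stem(word):
    """Strip one of a few common suffixes, if enough of a root remains.
    A toy illustration of algorithmic stemming, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

Even these four rules correctly reduce peeped to peep and conversations to conversation, while (like Porter) producing the non-word pictur from pictures.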

Dictionary based approaches use large translation tables, looked up for each word, which contain the base form of the word. These tables lack the error prone nature of algorithmic approaches but are reliant on their dictionary having an entry for every word in the document base: they have no method of coping with an unknown word.

Word frequencies

Once we have removed the stop words and converted words into stemmed terms it is possible to take into account how often words are used in the document collection as a whole. The motivation for this is that words which are used very often in the document base are poorer at separating relevant from non-relevant documents than rarely used words. As an example, consider a free text retrieval system operating on the Highway Code: the term ice occurs in many sections so is a poor query term, whereas cream only occurs in two sections. It would be useful if a retrieval system posed with the query ice cream vans automatically weighted cream as more important than ice. Assuming that our paragraph is an extract from general English literature, the weights might look like:

alice 0.9

book 0.5

conversation 0.7

her 0.3

once 0.7

peep 0.7

picture 0.5

read 0.4

sister 0.6

think 0.4

twice 0.6

use 0.3

what 0.2

without 0.5

The weights here are given as decimal fractions between 0 and 1.

Sparck Jones (1980) developed an automatic algorithm for calculating these scores known as inverse document frequency (IDF). It is based on comparing the number of documents in which the specific term occurs, n, with the total number of documents in the collection, N. Essentially the algorithm can be expressed as log(N/n) + 1, which gives a high weight to very rare terms and almost unit weight to terms which occur in nearly every document.

It can also be useful to take into account how many times each term occurs in a document as this gives an indication, especially for large documents, of how important the term is within the document. This is usually done simply by dividing the occurrences of the specific term in the document by the total number of term occurrences in the document. In our example there were a total of 17 terms (excluding stop words) and the terms book, conversation and picture occurred twice with the rest occurring once. This leads to a revised weighting as follows:

alice .053

book .058

conversation .082

her .018

once .041

peep .041

picture .058

read .024

sister .035

think .024

twice .035

use .018

what .012

without .029

Although the numbers are rather abstract the ordered list of terms gives a sensible listing of how important each term is for describing the sample document: conversation, book, picture, alice, peep, once, twice, sister, without, think, read, her, use, and what.
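The revised weighting above can be reproduced by multiplying each term's collection weight by its within-document frequency. A minimal sketch, taking the illustrative collection weights given earlier as assumptions:

```python
# Assumed collection (IDF-style) weights from the earlier example.
idf = {"alice": 0.9, "book": 0.5, "conversation": 0.7, "her": 0.3,
       "once": 0.7, "peep": 0.7, "picture": 0.5, "read": 0.4,
       "sister": 0.6, "think": 0.4, "twice": 0.6, "use": 0.3,
       "what": 0.2, "without": 0.5}

# Terms occurring more than once in the paragraph; all others occur once.
counts = {"book": 2, "conversation": 2, "picture": 2}
total = 17  # total term occurrences in the paragraph, excluding stop words

# Revised weight = collection weight * (occurrences / total occurrences).
weights = {t: idf[t] * counts.get(t, 1) / total for t in idf}
```

This reproduces the figures in the list above to within rounding (e.g. conversation scores 0.7 × 2/17 ≈ 0.082).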

Adjustment

The list of words and weights can be viewed as part of a very large vector with one dimension for each term in the document collection - with most set to zero for a specific document. This vector can then be processed mathematically by the matching algorithm. For simplicity, most approaches to IR scale document vectors so that they all have unit length. The length of a vector is calculated by summing the squares of all the components and taking the square root of the answer (just like calculating the length of a diagonal on a rectangle), or

l = sqrt(V1² + V2² + ... + VN²)

where N is the number of terms used in the document collection and V is the document's vector (of length l). For the sample paragraph the length is currently 0.158, so to normalise the vector we would divide it by 0.158 or multiply each term's weight by 6.33. This leads to a new unit length vector of:

alice 0.34

book 0.37

conversation 0.52

her 0.11

once 0.26

peep 0.26

picture 0.37

read 0.15

sister 0.22

think 0.15

twice 0.22

use 0.11

what 0.08

without 0.18

This is the final representation that most systems would have of the sample paragraph. Some systems go further and introduce additional information, most commonly thesaurus style information, while others do not save IDF information in the document vectors but store it separately.
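The normalisation step can be sketched directly from the formula above, using the revised weights from the previous list:

```python
import math

def normalise(vec):
    """Scale a term-weight vector to unit length (divide by its length)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

# The revised (tf * collection weight) vector for the sample paragraph.
doc = {"alice": 0.053, "book": 0.058, "conversation": 0.082, "her": 0.018,
       "once": 0.041, "peep": 0.041, "picture": 0.058, "read": 0.024,
       "sister": 0.035, "think": 0.024, "twice": 0.035, "use": 0.018,
       "what": 0.012, "without": 0.029}
unit = normalise(doc)
```

The resulting vector has length 1 and matches the final representation above (e.g. conversation ≈ 0.52).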

Thesauri

Information about synonyms and other terms which are related to ones used in the document can also be included in the document vector (although, again, some systems choose to include this information when indexing the query vector). Some systems have a simplistic list of synonyms which are considered to be equivalent terms while others have a weighted list of relationships. For the sample paragraph it would be reasonable to add image as a synonym of picture and we could consider a related term sibling to sister (perhaps adding it with half of sister's weight: 0.11).

While not universally implemented thesaurus information can drastically increase the number of documents which are likely to match a query since more individual terms are used to describe each document. Domain specific thesauri are often the most effective (e.g. precise domain information such as bus = car in a computing document base while computer = word processor in a transport document base). It can, however, also be beneficial to simply use a standard thesaurus. Any thesaurus can be augmented by analysing the document collection to establish if two terms tend to be used as synonyms within that collection.

Matching

Now that the documents are in a form that minimises the information we need to consider when matching queries against them, we can carry out the matching itself.

Most of the work of the retrieval engine has already been done - that is why so much work is put into indexing which makes the querying simpler and faster.

There are two main ways of providing a query: in boolean logic and free text. Boolean retrieval treats the query very differently from the documents and doesn't require as much indexing information as we have provided. Free text querying is further split into the two main ways of approaching the matching problem, probabilistic and vector space, both of which initially index queries in the same way as documents.

Boolean

Although free text document collections have existed for a long time, access has usually been through non-free text, structured queries. Boolean queries are similar to database queries and state, precisely, the conditions that must apply for a document to be considered a match to the query. If we were interested in documents in which Alice discusses conversations then we would issue the query Alice and conversations.

This level of querying requires very little indexing - essentially the list of words used in each document is all that is required. Stemming could be implemented to work with this approach but it is usually left to the user to indicate when stemming is appropriate, through the use of wild cards. Wild cards allow the user to roughly specify a word by saying that (s)he doesn't mind which letters occur, e.g. conversation* would match all words which start with conversation, or to allow for a single uncertain letter, e.g. summari?e would match summarise and summarize. This reduces the work of the retrieval engine but puts much more complexity in the hands of the user.

Typical operators in boolean matching systems include and (both terms must exist), or (one of the two terms must exist), and not (the term must not exist). We can, thus, look for systems which discuss computers (in the singular or plural) and image processing by issuing the query:

computer* and image and processing

If we consider image manipulation an equivalent term to image processing but do not want anything on virtual reality, then we can issue:

computer* and image and (processing or manipulation) and not (virtual and reality)

Unfortunately, most computing scientists are stretched by the complexity of logical expressions like this - never mind novice computer users in a library. Many boolean systems provide further constructs to improve the power of the query language, for example to find two words within the same sentence. Although these do improve the power of the language, they further increase the complexity of the task the user is faced with.
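The boolean query above can be sketched with ordinary set operations (and = intersection, or = either present, not = absence), with the wildcard handled by pattern matching. The three sample documents here are assumptions for illustration:

```python
import fnmatch

# Assumed sample documents, each reduced to its set of index terms.
docs = {
    1: {"computer", "image", "processing"},
    2: {"computers", "image", "manipulation"},
    3: {"computer", "virtual", "reality", "image", "processing"},
}

def has(doc_terms, pattern):
    """True if any term in the document matches the wildcard pattern."""
    return any(fnmatch.fnmatch(t, pattern) for t in doc_terms)

# computer* and image and (processing or manipulation)
#   and not (virtual and reality)
matches = {d for d, ts in docs.items()
           if has(ts, "computer*") and "image" in ts
           and ("processing" in ts or "manipulation" in ts)
           and not ("virtual" in ts and "reality" in ts)}
```

Documents 1 and 2 match; document 3 is excluded by the not clause.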

Boolean matching allows very precise queries to be formulated but imposes a lot of complexity on the user to create this query. This is further hindered by a difference in the logical constructs and and or and their day-to-day equivalents - especially when combined with not.

Vector space

The first model of true free text retrieval which we will look at is the vector space model (Salton 1971). Under the vector and probabilistic models, the user's query is initially indexed in the same way as the documents - but is usually given in free form English. As an example, a query for documents in which Alice discusses conversations would be simply: Give me documents in which Alice discusses conversations. The indexer would then chop this down into a small query vector:

document .5 alice .5 discuss .5 conversation .5

The weight of .5 gives the four terms equal weight (since they all occur once[3]) and a vector of unit length. Although this introduces some additional terms (document and discuss), users often perform better by thinking in full English sentences rather than making a list of keywords (this problem is reduced after relevance feedback - see later).

The main advantage of the approach taken in pure free text retrieval, whether vector space or probabilistic, is the shift of complexity from the user's task to the system. This makes the complexity of issuing a query much lower, but it does impose major computational problems on the system. These complexities tend to result in error prone retrieval engines: unlike boolean retrieval, free text retrieval systems can make mistakes - either missing relevant documents or wrongly retrieving non-relevant documents. Although similar behaviour can happen in boolean retrieval this can, in theory, be prevented by a better specified query. Free text querying reduces the power of the query language and so reduces the possibility of improving the query - as such the user has to be more aware of the errors, and the system's user interface designed to minimise their impact (e.g. through summary lists and user selection for printing). Despite this, the matching performance of free text engines has been shown to be no worse than that of boolean systems; they often outperform them and take much less time to use, since little time is spent composing a query.

The reduction in the strength of the language removes the user's ability to specify precise queries and also removes any semantic content in the words used. For example, consider the free text query:

I'm interested in computers and image processing but not virtual reality.

An IR system would automatically match documents which also mention computer in the singular and, through weight of other words, documents discussing image manipulation (although a thesaurus would strengthen this link). The word not would, however, be treated as a stop word and, thus, be totally ignored - the user would be presented with documents discussing virtual reality. Documents which discuss images of computer processors for virtual reality would also be matched. These are the most common problems of presenting users with what looks like an intelligent language system which isn't. These problems are, however, very small compared to the benefit of quickly specified natural queries and the backup of stemming, weighting and non-definitive queries. So how does the matching work...

Now that the query and documents are both represented as vectors, the vector space model considers their similarity to be based on the angle between the two vectors in space. Up until this point (and with the probabilistic model), the vector has simply been a convenient mathematical model for storing a list of terms and their weights. The vector space model then makes the jump to processing them as if they were real geometrical vectors in a space with thousands of dimensions. Although this seems rather strange initially, it is based on an extension of a simple matching routine for non-weighted indexes. Consider non-weighted indexes for the above query and the sample document: these are basically a list of four words for the query and 14 for the document. A rough measure of matching strength is the number of terms they have in common, in this case two. This doesn't take into account how large each document is and would have a tendency to match larger documents, so we could divide by the total number of terms between the query and document. This leads to Dice's coefficient:

m = 2|D∩Q| / (|D| + |Q|)

where |D∩Q| is the number of terms common to the document and query, |D| is the number of terms in the document, |Q| the number in the query and m the matching value - the fraction is doubled to give a maximum value, for matching a document with itself, of 1 instead of 0.5. For our query and document this leads to:

m = (2 × 2) / (4 + 14) = 4/18 ~ 0.222
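Dice's coefficient is a one-line computation over the two term sets. A sketch, using the query and sample document terms from above:

```python
def dice(doc_terms, query_terms):
    """Dice's coefficient: 2|D intersect Q| / (|D| + |Q|)."""
    common = len(doc_terms & query_terms)
    return 2 * common / (len(doc_terms) + len(query_terms))

query = {"document", "alice", "discuss", "conversation"}
doc = {"alice", "book", "conversation", "her", "once", "peep", "picture",
       "read", "sister", "think", "twice", "use", "what", "without"}
m = dice(doc, query)
```

With two terms in common, four query terms and 14 document terms, this gives 4/18 ≈ 0.222 as above.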

When considering weighted terms, like those we indexed, it is not possible to simply count the number of terms in common. Instead the vector space model multiplies the term weights together. For the vast majority of terms either the document or the query will have a zero weight, hence the resulting weight is zero. These individual new weights are then summed to give the top line of the matching algorithm. For the query above and the last form of the sample document's index, this leads to:

m = (Σi Di × Qi) / (||D|| × ||Q||) = (0.34 × 0.5 + 0.52 × 0.5) / (1 × 1) = 0.43

where ||V|| is the length of vector V, Vi is the weight of term i in vector V, and N is the total number of individual terms (the dimensionality of V), the sum running over i = 1 to N. In geometry this equation is used to calculate the cosine of the angle between the two vectors, hence this matching routine is known as the cosine coefficient.

Although quite simple to understand, this approach has no sound basis in information theory - there is no theoretical reason for this to be a good matching algorithm. The cosine coefficient does, however, perform well in practice, is reasonably easy to code and is used in many retrieval systems.
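The cosine coefficient is indeed easy to code. A sketch over sparse term-weight vectors, using the indexed query and document from above:

```python
import math

def cosine(d, q):
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    len_d = math.sqrt(sum(w * w for w in d.values()))
    len_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (len_d * len_q)

query = {"document": 0.5, "alice": 0.5, "discuss": 0.5, "conversation": 0.5}
doc = {"alice": 0.34, "book": 0.37, "conversation": 0.52, "her": 0.11,
       "once": 0.26, "peep": 0.26, "picture": 0.37, "read": 0.15,
       "sister": 0.22, "think": 0.15, "twice": 0.22, "use": 0.11,
       "what": 0.08, "without": 0.18}
m = cosine(doc, query)
```

Only alice and conversation are shared, so the dot product is 0.34 × 0.5 + 0.52 × 0.5 = 0.43; since both vectors already have (roughly) unit length the match value is 0.43 as above.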

Probabilistic

Mathematically the cosine coefficient has little basis in information theory; the probabilistic approach to information retrieval attempts to solve this problem by basing its algorithms on results from probability theory. Although, in practice, the performance of systems based on the probabilistic model is not much better than the vector space model, the approach gives a theoretical grounding for understanding how different information can be combined to estimate a probability of a document being relevant to the user. Although large practical improvements are not widely seen, Croft and Harper (1979) have shown significant improvements when using their combination match.

Probabilistic models are based on expanding known information to the unknown and assigning a probability that the expansion is accurate. In IR, this requires a set of documents known to be relevant to the user. The system then estimates the probability that each other document is also relevant to the user. Croft and Harper's combination matching algorithm was designed to help the user find the first relevant documents.

The mathematics involved in the probabilistic model are too complex to develop here, however from a user's point of view the complexity can be overcome by considering it as a more sound version of the vector space approach (the indexing and weighting schemes are often the same). Van Rijsbergen (1979) gives a long discussion on the probabilistic model which is also discussed in a later research report (1992).

Inverted indexes

To make the whole process reasonably speedy, many programming techniques are required when implementing an information retrieval system. The most widespread technique is the use of inverted indexes. A standard index lists which terms occur in which documents and with what weights. An inverted index lists which documents each term is used in. When querying in a system with an inverted index the first stage is to make a list of documents which need a matching weight assigned; documents which do not contain any of the terms in the query automatically have zero weight. The lists of documents in which each query term occurs are simply combined to produce a list of all documents which have one, or more, of the query terms and are thus worth further effort.
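Building and querying an inverted index can be sketched as follows (the three documents are assumed samples):

```python
from collections import defaultdict

# Assumed forward index: document -> its index terms.
forward = {
    "doc1": ["alice", "book", "conversation"],
    "doc2": ["book", "picture"],
    "doc3": ["river", "bank"],
}

# Invert it: term -> set of documents using that term.
inverted = defaultdict(set)
for doc, terms in forward.items():
    for term in terms:
        inverted[term].add(doc)

# Only documents containing at least one query term need a weight;
# all other documents automatically score zero.
query = ["alice", "book"]
candidates = set().union(*(inverted.get(t, set()) for t in query))
```

Here only doc1 and doc2 are candidates; doc3 shares no terms with the query and is never examined by the matching routine.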

Relevance Feedback

The probabilistic model requires information on what documents the user considers relevant. This information can also be extremely useful in the vector space model. Relevance feedback provides a method for incorporating such information by asking the user to mark which matched documents are actually relevant.

In the vector space model, this information can be used to alter the query to more closely represent the user's information requirement and some of the particular aspects of the retrieval engine. A simple, but effective, approach is to add a fraction of each relevant document vector to the query vector. If we issued the initial query from above:

document .5 alice .5 discuss .5 conversation .5

which was deemed by the system as a match to the sample document:

alice 0.34

book 0.37

conversation 0.52

her 0.11

once 0.26

peep 0.26

picture 0.37

read 0.15

sister 0.22

think 0.15

twice 0.22

use 0.11

what 0.08

without 0.18

it would be possible to mark this as relevant. The system would then scale the document vector by, say, 1/4 and add the two vectors to give a new query of:

alice 0.59

book 0.09

conversation 0.63

discuss 0.50

document 0.50

her 0.03

once 0.07

peep 0.07

picture 0.09

read 0.04

sister 0.05

think 0.04

twice 0.06

use 0.03

what 0.02

without 0.05

To finalise the new query, the result would then be normalised by dividing by the length of the new vector (1.13).

Note: this approach was taken for simplicity throughout the document although it is now rather strange since IDF information is being used in both the new query and the documents it will match. Separate recording of IDF information is, thus, more accurate.

Although this has not altered the query considerably it has reduced the relative weights of the two rogue terms and introduced a group of new, but low weighted, terms. With relevance feedback on a few documents, these effects will be compounded when new terms are found which are common to the set of marked relevant documents. By this means, the query can be adjusted to take into account the original textual query plus information about documents which were found to be relevant.

Relevance feedback can prove to be an extremely powerful facility since users need not spend a long time formulating an initial query but simply enter a quick sentence and then expand their query by feedback.
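The feedback step above (add a fraction of each relevant document vector to the query, then rescale to unit length) can be sketched as:

```python
import math

def feedback(query, relevant_doc, fraction=0.25):
    """Add a fraction of a relevant document's vector to the query,
    then rescale the result to unit length."""
    new = dict(query)
    for t, w in relevant_doc.items():
        new[t] = new.get(t, 0.0) + fraction * w
    length = math.sqrt(sum(w * w for w in new.values()))
    return {t: w / length for t, w in new.items()}

query = {"document": 0.5, "alice": 0.5, "discuss": 0.5, "conversation": 0.5}
doc = {"alice": 0.34, "book": 0.37, "conversation": 0.52, "her": 0.11,
       "once": 0.26, "peep": 0.26, "picture": 0.37, "read": 0.15,
       "sister": 0.22, "think": 0.15, "twice": 0.22, "use": 0.11,
       "what": 0.08, "without": 0.18}
new_query = feedback(query, doc)
```

Before rescaling, the new vector matches the list above (alice 0.5 + 0.34/4 ≈ 0.59, conversation ≈ 0.63, and so on) and its length is 1.13; after rescaling, terms such as book and peep appear in the query with low weights.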

Testing systems

So far we have discussed the processes behind the scenes of the black box of information retrieval. In this section we will consider the evaluation of IR systems. This is intended to give you some idea of the main testing approaches and of what issues are involved in improving IR engines.

Benchmarking

The technical benchmark tests in information retrieval are usually conducted with standard test collections. The collections are composed of a set of documents, a set of queries and a list of answers. The answers are usually based upon a panel of experts stating which documents they think are relevant to which queries. Testing a system simply involves indexing the documents, running the queries and comparing the list of matched documents with the answers.

There are two standard measures for how well a particular system performs: recall and precision. Recall states the fraction of relevant documents which the system successfully matches (i.e. the number of correct answers retrieved divided by the total number of correct answers for this query). Precision states the fraction of the matched documents which are relevant. Recall essentially states how good the system is at including relevant documents in the list of matched documents, while precision states how good a system is at not including non-relevant documents in the list. An ideal IR system would have 100% precision and 100% recall, but in practice precision is traded for recall: the more relevant documents a system retrieves, the more non-relevant documents also get included.
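Both measures reduce to simple set arithmetic over document identifiers. A sketch, with assumed sample answer sets:

```python
def precision_recall(matched, relevant):
    """Precision = fraction of matched docs that are relevant;
    recall = fraction of relevant docs that were matched."""
    hits = len(matched & relevant)
    precision = hits / len(matched) if matched else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

matched = {1, 2, 3, 4}      # documents the system returned
relevant = {2, 4, 5, 6, 7}  # documents the expert panel marked relevant
p, r = precision_recall(matched, relevant)
```

Here two of the four matched documents are relevant (precision 0.5) but only two of the five relevant documents were found (recall 0.4).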

There are strong arguments against the test collection approach, in particular it assumes the query is a perfect statement of the information requirement and that the experts have a categorical understanding of this. The approach also fails to take into account more user oriented aspects such as relevance feedback and the user interface. The approach does, however, provide the only way of categorically comparing two retrieval engines and is, thus, often used to assess whether a new technique does improve retrieval.

User testing

There has been little work in information retrieval using user testing methods. Much work, however, could easily be imported from human-computer interaction research (including think alouds, questionnaires, user monitoring etc.). These techniques, while not providing exact figures on performance, can provide strong indications of problem areas with retrieval systems and be used to make well defined quantitative assessments. As an example, Dunlop (1993) used think alouds with user monitoring to assess user performance in an information retrieval engine with and without hypertext links.

While much more time consuming and expensive to carry out, and not as definitive as benchmark tests, user tests can provide more information than a simple comparison of in-depth retrieval algorithms.

Conclusions

It is hoped that this high level overview of the inner workings of an information retrieval system has given you a good understanding of how such systems work. While you may not be able to programme one (the speed requirements make this too complex), you should at least now see that there is no magic in the process and, in so seeing, better understand the behaviour of a free text retrieval system.

Bibliography

Croft, W.B., and Harper, D.J., "Using probabilistic models of document retrieval without relevance information", Journal of Documentation, vol. 35, pp. 285-295. (1979)

Croft, W.B., and van Rijsbergen, C.J., (editors), Proceedings of SIGIR '94, Springer-Verlag, London. (1994)

Dunlop, M.D., and van Rijsbergen, C.J., "Hypermedia and free text retrieval", Information Processing and Management, vol. 29 (3). (May 1993)

Porter, M.F. , "An Algorithm for Suffix Stripping", Program, vol. 14(3), pp. 130-137. (July 1980)

van Rijsbergen, C.J., Probabilistic Retrieval Revisited, Departmental Research Report 1992/R2, Computing Science, University of Glasgow. (1992)

van Rijsbergen, C.J. , Information Retrieval (second edition), Butterworths, London. (1979)

Salton, G. (editor), The SMART Retrieval System, Prentice Hall, New Jersey. (1971)

Sanderson, M, "Word Sense Disambiguation and Information Retrieval", Proceedings of SIGIR '94 (edited by Croft and van Rijsbergen), Springer-Verlag, London. (1994)

Sparck Jones, K. and Webster, C.A., Research on Relevance Weighting, British Library Research and Development Report 5553, Computer Lab., University of Cambridge. (1980)