Information Retrieval - Books Under Review

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than “I know it when I see it.” Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media -- such as instant messaging and the Web -- are addressed peripherally. In doing so we examine the definition of spam, the user’s information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures, and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations.
In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

Web Search: Multidisciplinary Perspectives
Amanda Spink and Michael Zimmer

Web search engines have emerged as one of the dominant technologies of modern life, leaving few aspects of our everyday activities untouched. Search engines are not just indispensable tools for finding and accessing information online, but have become a defining component of the human condition, and Web searching can be conceptualized as a complex behavior embedded within an individual's everyday social, cultural, political, and information-seeking activities. This book investigates Web search from a non-technical perspective, bringing together chapters that represent a range of multidisciplinary theories, models, and ideas about Web searching. The chapters examine the various roles and impacts of Web searching on the social, cultural, political, legal, and informational spheres of our lives, such as the impact on individuals, social groups, modern and postmodern ways of knowing, and public and private life. By critically examining the issues, theories, and formations arising from, and surrounding, Web searching, Web Search: Multidisciplinary Perspectives represents an important contribution to the emerging multidisciplinary body of research on Web search engines. The new ideas and novel perspectives on Web searching gathered in this volume will prove valuable for research and curricula in the fields of social sciences, communication studies, cultural studies, information science, and related disciplines.

Successes and New Directions in Data Mining
Poncelet, Pascal; Masseglia, Florent; Teisseire, Maguelonne

Pattern mining is becoming a very active research area, and efficient techniques have been widely applied to problems in industry, government, and science. Motivated by real applications, the problem has grown beyond its initial definition: it now covers not only the finding of itemsets but also increasingly complex patterns. Successes and New Directions in Data Mining addresses existing solutions for data mining, with particular emphasis on potential real-world applications. Capturing defining research on topics such as fuzzy set theory, clustering algorithms, semi-supervised clustering, modeling and managing data mining patterns, and sequence motif mining, this book is an indispensable resource for library collections.

Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, Second revised edition
Jackson, Peter; Moulinier, Isabelle

This text covers the technologies of document retrieval, information extraction, and text categorization in a way that highlights commonalities in terms of both general principles and practical concerns. It assumes some mathematical background on the part of the reader, but the chapters typically begin with a non-mathematical account of the key issues. Current research topics are covered only to the extent that they inform current applications; detailed coverage of longer-term research and more theoretical treatments should be sought elsewhere. There are many pointers at the ends of the chapters that the reader can follow to explore the literature. However, the book does maintain a strong emphasis on evaluation in every chapter, in terms of both methodology and the results of controlled experimentation.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
Markov, Zdravko; Larose, Daniel T.

This book introduces the reader to methods of data mining on the web, including uncovering patterns in web content (classification, clustering, language processing), structure (graphs, hubs, metrics), and usage (modeling, sequence analysis, performance).
