BOOK AND SOFTWARE REVIEWS
Büttcher, Stefan, Clarke, Charles L.A. and Cormack, Gordon V. Information retrieval: implementing and evaluating search engines. Cambridge, MA: The MIT Press, 2010. 632 p. ISBN: 978-0-262-02651-2. £37.95
Given the growing literature on information retrieval, Julian Warner’s (2010: 1) comment that 'IR is of high contemporary significance, diffusing into ordinary discourse and everyday practice' rings true. Warner (2010: 1) has also declared that 'recently IR has changed rapidly, particularly through the influence of Internet search engines (SEs)', and the authors of the book under review echo his opinion: 'IR forms the foundation for modern SEs' (p. xxi). It can therefore be concluded that information retrieval and search engines have a close reciprocal relationship, especially in the chaotic environment of the Internet. In short, since the emergence of the Internet, search engines have been one of the main and most commonly used channels for meeting users’ information needs, and they have, in turn, had an undeniable impact on information seeking and retrieval. This book has been written and published on the basis of this interconnection between the two fields.
On page 3, the authors highlight that the efficient implementation and evaluation of relevance ranking algorithms under a variety of contexts and requirements represent a core problem in information retrieval, and form the central topics of this book. The book consists of sixteen chapters divided into five parts. Part I, Foundations (three chapters), takes an encyclopedic approach and provides readers with some fundamental concepts and techniques. Chapter 1 (Introduction), Chapter 2 (Basic techniques), and Chapter 3 (Tokens and terms) together cover basics such as information retrieval, information retrieval systems and their architecture, Web search, e-text, text formats (such as Microsoft Word, HTML, PDF, and SGML), test collections like TREC (Text REtrieval Conference), open-source information retrieval systems including Lucene, Indri, and Wumpus, indexing (inverted and document-oriented indices), ranking, search operators (e.g., Boolean operators and others), performance evaluation including recall and precision, effectiveness and efficiency measures, tokenization and term matching for English text, term distribution (e.g. smoothing and Markov models), and language modeling. In effect, Part I takes a tour through the nuts and bolts of information retrieval, especially in the area of search engines, and facilitates the understanding of issues discussed in the remainder of the book, particularly those related to indexing, retrieval, and evaluation.
Part II, Indexing, includes four chapters. Chapter 4 (Static inverted indices) deals with the essential algorithms and data structures needed to build and access the static inverted indices of given text collections, from both structural and operational perspectives. Chapter 5 (Query processing) covers query processing and the realization of efficient search operations beyond the Boolean model, using two popular alternatives, namely ranked retrieval and lightweight structure. Chapter 6 (Index compression) focuses on general-purpose data compression, symbol-wise data compression (Huffman and arithmetic coding), compressing posting lists (nonparametric and parametric gap compression), and compressing the dictionary. Chapter 7 (Dynamic inverted indices) addresses the indices of dynamic text collections, including file systems, digital libraries, and the Web. This part concerns the heart, prerequisite, and infrastructure of the searching process from the beginning to retrieval, and reading it helps us to understand what happens behind the scenes of a search, particularly in the context of large-scale search engines dealing with plentiful information resources.
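To give a flavour of what compressing posting lists by gaps involves, the following is a minimal sketch in Python. It assumes a posting list of ascending document identifiers and uses a variable-byte code, one common nonparametric scheme; the function names are illustrative and are not taken from the book.

```python
def to_gaps(postings):
    """Turn ascending document ids into gaps between neighbours."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the original document ids by summing the gaps."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 payload bits per byte.

    The high bit of a byte is set when more bytes follow for the
    same number (an assumed convention; others exist).
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers


postings = [3, 7, 150, 100000]
encoded = vbyte_encode(to_gaps(postings))
restored = from_gaps(vbyte_decode(encoded))
```

Small gaps fit in a single byte, which is why storing differences rather than raw identifiers tends to pay off for frequent terms.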
Part III, Retrieval and ranking (four chapters), as its name implies, deals with retrieval methods and the ranking of retrieved results. In Chapters 8 (Probabilistic retrieval) and 9 (Language modeling and related methods), the authors discuss some substantial information retrieval models, such as the probabilistic model and language modeling. Chapter 10 then covers two important elements affecting the outcome of retrieval, namely 'categorization' as the process of labeling documents to satisfy some information need and 'filtering' as the process of evaluating documents on an ongoing basis according to some standing information need (p. 310). The last chapter of Part III, Fusion and metalearning (Chapter 11), concerns the output of information retrieval, that is, the ranking of hits. It poses the question 'which combination of choices works best?', and examines specific approaches or techniques for ranking: fusion or aggregation, stacking, bagging (bootstrap aggregation), boosting, multicategory ranking and categorization, and learning to rank. Since search engines' methods of query interpretation are often inadequately explained and their ranking algorithms are often opaque, the contents of Part III are instructive and interesting.
Part IV, Evaluation (two chapters), is devoted to one of the main pillars of information retrieval systems, and of search engines in particular. The evaluation or performance appraisal of search engines is discussed in terms of effectiveness (Chapter 12) and efficiency (Chapter 13). It can be treated as the C(heck) stage of the PDCA model (Plan-Do-Check-Act) developed by Deming (1986). Chapter 12 deals with methods of measuring the quality of the search results produced by search engines and refers to traditional effectiveness measures (e.g. recall and precision, average precision, reciprocal rank, and user satisfaction) and nontraditional effectiveness measures (i.e. graded relevance, incomplete and biased judgments, and novelty and diversity). Chapter 13 is about measuring the efficiency of search engines. Throughput (the number of queries a search engine processes within a given period of time) and latency, also known as response time (the amount of time that elapses from the moment a search engine receives the query until the results are sent to the user), are among the notable efficiency measures discussed here.
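For readers unfamiliar with the traditional effectiveness measures named above, a minimal sketch in Python may help. It uses the standard textbook definitions with binary relevance judgments; the function names and the toy ranking are assumptions made for illustration and do not come from the book.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a result list against a set of relevant ids."""
    found = sum(1 for doc in ranked if doc in relevant)
    precision = found / len(ranked) if ranked else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """One over the rank of the first relevant document, or 0 if none."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking and judgments, for illustration only.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
precision, recall = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
```

In this toy case two of the four retrieved documents are relevant and the first relevant one sits at rank 2, so precision is 0.5 and the reciprocal rank is 0.5.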
Finally, Part V, Applications and extensions, comprises three chapters (14 to 16) and begins with the topic of 'parallel information retrieval' (Chapter 14), which can help a search engine process queries faster by having multiple index servers handle incoming queries in parallel. It emphasizes that '[information retrieval] systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents… A single computer simply does not have the computational power or the storage capabilities required for indexing even a small fraction of the World Wide Web' (p. 488), and aims to examine various ways of making information retrieval systems scale to very large text collections such as the Web. MapReduce (a framework developed at Google that is designed for massively parallel computations on very large amounts of data) is also discussed. Information retrieval in the specific context of Web search is then considered in Chapter 15. In addition to discussing the static and dynamic ranking algorithms used by Web search engines, it covers three components: 'the structure of the Web', 'the scale of the Web', and 'the users of the Web'. Part V ends with Chapter 16, XML retrieval, which revolves around information retrieval in collections of XML documents. As the term 'extensions' in the title of Part V suggests, this part prepares the ground for a migration from a basic information retrieval system to a dynamic, large-scale search environment within which Web search engines are increasingly developed, used, changed, merged, or abandoned.
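The map-and-reduce style of computation mentioned above can be illustrated with a small sketch in Python. This is a single-process imitation of the idea, not Google's framework or its API, and every function name here is hypothetical; it builds a tiny inverted index from three made-up documents.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per token in the document."""
    for term in text.lower().split():
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Collapse the grouped ids into a sorted, duplicate-free posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence on a dict of documents."""
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(
        reduce_phase(term, ids) for term, ids in shuffle(pairs).items()
    )


# Made-up documents, for illustration only.
docs = {1: "web search engines", 2: "search the web", 3: "XML retrieval"}
index = build_index(docs)
```

In a real deployment the map and reduce calls would run on many machines at once; here they run one after another so that the data flow stays visible.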
Overall, the authors state that 'we aim for a balance between theory and practice that leans slightly toward the side of practice, emphasizing implementation and experimentation' (p. xxi), and they have successfully met this goal. With many related works cited in the text (an indication of the book’s depth and richness), it provides a broad foundation for future studies. However, the book suffers from the lack of a single bibliography compiling all references, which would help readers find related works in one place. The inclusion of details, examples, and exercises for each approach, together with chapter summaries and further readings, is a strong point of the work. There is also an index at the end of the book. Because the book also plays the role of a textbook, a glossary defining related terms and concepts would be useful for students and professors. It is notable that each part of the book is so comprehensively designed and written that it can be read selectively as a stand-alone text. Finally, this title is more comprehensive and deeper than similar ones such as 'An introduction to search engines and web navigation' by Levene (2010), which answers the need for an introductory, yet technical, text on search and navigation technologies and demystifies the technology underlying the tools that we use in our day-to-day interaction with the Web (p. xiv), and 'Search engines: information retrieval in practice' by Croft et al. (2010), which, as a textbook for undergraduate students in computing, information systems, and information science, deals with basic information retrieval techniques.
In my opinion, because of the twofold nature of the present work, which functions both as a textbook and a guidebook, students, experts, and professors in the fields of computer science, computer engineering, software engineering, and information science (search practitioners), as well as search engine optimizers and researchers, can benefit from reading it in sufficient depth. Information retrieval: implementing and evaluating search engines can be of value and interest to a much broader audience than it is likely to attract, and presents readers with good opportunities for understanding theory, research, and, above all, practice. It is important to note that prospective readers of this book should consider the following remark highlighted by its authors:
'We assume that the reader possesses basic background knowledge consistent with an undergraduate degree in Computer Science, Computer Engineering, Software Engineering, or a related discipline. This background should include familiarity with: (1) basic data structuring concepts, such as linked data structures, B-trees, and hash functions; (2) analysis of algorithms and time complexity; (3) operating systems, disk devices, memory management, and file systems. In addition, we assume some fluency with elementary probability theory and statistics, including such concepts as random variables, distributions, and probability mass functions.' (p. xxiv).
Hence, interested readers will need some patience to read and fully understand such a technical text.
Croft, W.B., Metzler, D. & Strohman, T. (2010). Search engines: information retrieval in practice. Harlow, UK; New York: Addison-Wesley.
Deming, W. E. (1986). Out of the crisis. Cambridge, MA: The MIT Press.
Levene, M. (2010). An introduction to search engines and web navigation. 2nd edition, Harlow, England; New York: Addison-Wesley.
Warner, J. (2010). Human information retrieval. Cambridge, MA: The MIT Press.