vol. 13 no. 4, December, 2008
Web information retrieval systems have been evaluated almost since the Web itself came into being. This should not surprise us, because the need for evaluation has always accompanied the development of information retrieval systems. A broad set of qualitative and quantitative aspects has been evaluated. We can group them into four fields of interest: (i) performance, usually based on effectiveness, as we can see in Oppenheim (2000) and Martínez & Rodríguez (2003), with a supplementary set of works that avoid using the precision ratio to measure effectiveness because it is affected by human subjectivity (Hersh et al. 1995; Landoni and Bell 2000); (ii) the proposal of Berners-Lee et al. (2001) and the World Wide Web Consortium for a semantic Web based upon an extensive use of metadata, which has been evaluated by Zhang and Dimitroff (2005a; 2005b); (iii) how users search the Web (Spink et al. 1998; Spink and Jansen 2004; Jansen and Spink 2006); and (iv) the system-user interaction (Schlichting and Nilsen 1997; Johnson et al. 2001; Marchionini 2004).
Leaving aside the nature of these engines, what could be the origin of this interest? We suppose that it may be due to questions like 'Dear Professor, what is the best Web retrieval system?' The professor might answer, 'Probably the answer depends on the point of view of the people who make the analysis'. It is logical that, if they rate effectiveness as more important, they will consider measures such as response time, precision, recall and overlap. To improve precision, they could measure the influence of metadata, a matter not sufficiently resolved today. Perhaps the researcher will prefer to know the trends of Web usage, and will then analyse Web queries, the use of Boolean operators, Web query reformulation, the use of relevance feedback and the viewing of results. Finally, if they consider the interaction between systems and people, they will focus their study on interface design and other usability aspects that could affect information retrieval.
While there is great diversity among these lines of work, we think that it is possible to integrate some of them in order to add a user's perspective to the traditional evaluation of performance. More concretely, we propose to incorporate data about the position of a document in the search output into the measure of overlap between Web retrieval systems, with the goal of transforming a simple measure of the coincidence of documents in a response into a measure of the overlap of documents in the response that is useful for users. The aim is to combine different viewpoints and so bring added value to the ratios that have been used so far.
Web retrieval systems index large collections of documents, and it is commonly held that the similarity between these indexes is very small, traditionally put at approximately 15%. If this is correct (claims of this kind are usually not supported by any study), it would mean an enormous diversity in the composition of these indexes, which most people would notice when performing the same search in different systems (a practice which, incidentally, is becoming less frequent). Nevertheless, another factor contributes to the diversity of the response: the ranking algorithm that each system uses to order it. Each system implements a different algorithm; consequently, the divergence in the response that users see is even greater.
Which of the levels of similarity -or divergence- is the most interesting for our analysis? From the point of view of the end user, we would choose that of the given response and, from the point of view of the administrator of a Web retrieval system, it would also be interesting to analyse the composition of the indexes, but for the users of the Web this is less important. We should not forget that it is very difficult to have access to all of these collections in order to analyse them. We can only find data on this concordance in the works of Losjland (2000a; 2000b) and Martínez (2002), besides the data on the overlap of Web retrieval systems which Notess (2002) took into account for the Search Engine Showdown blog (data not updated) and more recently Jansen and Spink (2006). So, it is difficult to find more sources of updated information on this subject.
For this reason, we find it interesting to develop a method that enables us to carry out a periodic analysis of this factor and to apply it to the responses of the most used Web retrieval systems in the current World Wide Web.
Our initial research belonged to the field of performance assessment. The most quoted projects are those related to effectiveness, particularly the work of Chu and Rosenthal (1996), which analyses more features. We should also mention the contributions of Leighton and Srivastava (1999) and, finally, the work of Gordon and Pathak (1999), who analysed more systems than anyone else. One of the first experiments developed in Martínez (2002) determined the level of similarity in the responses of six Web retrieval systems (Google, Altavista, All the Web, Terra, MSN, and Wisenut). Thirty queries (in Spanish) were carried out in these systems and the level of similarity was analysed among the first ten, twenty and thirty documents of the response. In order to determine this similarity, and inspired by Lojsland (2000a), we used the 'cosine' similarity function (Salton & Harman, 1983) with the necessary adjustments to the context of the experiment, since this function was created in order to determine the similarity of two one-dimensional vectors and, in this case, we had bi-dimensional vectors: the Web search engine and its reply to a given question.
Why the limit of thirty documents, which is obviously a very small portion of the possible response? Spink and Jansen (2004) give us the justification:
'…from 1996 to 1999, for more than 70 percent of the time, a user only viewed the top ten results. On average, users viewed 2.35 pages of results (where one page equals ten hits). Over half the users did not access results beyond the first page. Jansen et al. (1998) found that more than three in four users did not go beyond viewing two pages. By 2001, only roughly one-third of users looked beyond the second page of Web sites retrieved'.
We can find a similar approach in Tang and Sun (2003), where the authors, basing their work on that of Jansen et al. (1998):
'…decided to collect only the top 20 links among the thousands retrieved in light of previous studies showing that 80 percent of users view only the first two pages of results'.
Jansen and Spink (2006) affirm that “the first result page represents the top results that an engine found for a given query and therefore is a barometer for the most relevant results an engine has to offer”. It seems clear that thirty is the maximum number of documents that most users are willing to consult. This is the useful response; the rest of the documents are ignored. In practice, it is possible that we are setting this limit too high.
As discussed above, we introduced several changes in the 'cosine' similarity function to adapt it to the context of the experiment, since this function was created in order to determine the similarity of two one-dimensional vectors and, in this case, we had bi-dimensional vectors: the Web retrieval system and the reply to a given question. Obviously, it was necessary to resize the results before using this function, but this was not the only change introduced into the calculation. We also considered the position of documents in the response, favouring those that appeared in the top positions. Thus, each resulting document was given a weight depending on its position in the response vector of the Web retrieval system (known as the 'relevance factor'). This idea is based on the 'first 20 full precision' idea introduced by Leighton and Srivastava (1999), which gave an added value to the ability to place relevant documents within the first twenty delivered in response to the user. That measure captured, at the same time, accuracy and the capacity to show the relevant documents before the irrelevant ones (something very important for the user). If inactive and duplicate documents are regarded as irrelevant, the search engines that keep their indexes up to date are favoured. This factor enabled us to assess not only the coincidence of the documents in the response (that is, the overlap), but also the similarity of the useful response (the documents the user actually reads), taking into account the order of these documents in the response.
Before calculating the similarity, we have to resolve an additional question: the type of operators to be used in the query formulation. We used (1) Boolean operators (more concretely, the 'AND' operator) and (2) the 'AND' operator combined with the 'exact phrase'. The results obtained were very similar. Noting further that the use of Boolean operators is growing on the Web (Spink and Jansen 2004), we consider it appropriate to experiment only with such operators.
To exemplify the method, suppose that a search has been carried out in the Web retrieval system A and in the Web retrieval system B. The results are represented in the following Table 1 (the coinciding URLs are in bold). The column on the right shows the weight given to each URL (in bold) depending on its position (the relevance factor):
|Web retrieval system A||Web retrieval system B||Weight|
As previously mentioned, there is a distribution of elements with two characteristics: URL and 'weight' or relevance factor of the objective. In order to determine the similarity it is necessary to reduce the distribution achieved to a common n-dimensional space where 'n' is the number of coinciding URLs for each of the pair of vectors of the results plus the number of the relevant URLs found by each separate search engine. In order to do this, the initial vectors of the results become two individual vectors composed of the values of the relevance factor presented by each URL in the original search engines. This transformation of the vector space leads to Table 2 which represents the vector of global result and the vectors V (Web retrieval system A) and V (Web retrieval system B):
|Result vector||V(Web retrieval system A)||V(Web retrieval system B)|
Now we can calculate the cosine function with these two new vectors:
|V(A) = V(Web retrieval system A)||V(B) = V(Web retrieval system B)||V(A)•V(B) (scalar product)||[V(A)]²||[V(B)]²|
The result is obtained by dividing the scalar product, 4.70, by 7.44 (the product of the norms of the two vectors), which gives a value of 0.63. This means that the responses of systems A and B have a similarity of 63% for this search.
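The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `relevance_weights` and `weighted_cosine` are hypothetical, and the linear decay used for the relevance factor is an assumed weighting, since the weights actually used in Table 1 are not reproduced in this text.

```python
import math


def relevance_weights(urls):
    """Give each URL a weight that decreases with its position.

    Assumption: linear decay from 1.0 (first result) to 1/len(urls)
    (last result). The article's own relevance factor may differ.
    """
    n = len(urls)
    weights = {}
    for pos, url in enumerate(urls):
        # Keep the weight of the first occurrence if a URL is repeated.
        weights.setdefault(url, (n - pos) / n)
    return weights


def weighted_cosine(results_a, results_b):
    """Cosine similarity of two ranked URL lists in a common URL space."""
    w_a = relevance_weights(results_a)
    w_b = relevance_weights(results_b)
    # Common n-dimensional space: every URL returned by either system.
    space = sorted(set(w_a) | set(w_b))
    v_a = [w_a.get(url, 0.0) for url in space]
    v_b = [w_b.get(url, 0.0) for url in space]
    dot = sum(x * y for x, y in zip(v_a, v_b))
    norm_a = math.sqrt(sum(x * x for x in v_a))
    norm_b = math.sqrt(sum(y * y for y in v_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty response shares nothing with any other
    return dot / (norm_a * norm_b)


# Hypothetical responses: two shared URLs in swapped top positions.
system_a = ["url1", "url2", "url3", "url4"]
system_b = ["url2", "url1", "url5", "url6"]
similarity = weighted_cosine(system_a, system_b)
```

Because a URL missing from one response contributes zero to the scalar product, only the coinciding URLs raise the similarity, and those that coincide near the top of both lists raise it most.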
Applying this method, the obtained results of our original experiment are shown in Table 4.
|Pos: set of analysed documents. AW: All the Web. GO: Google. MS: Microsoft Network. TE: Terra Networks. WI: Wisenut. AV: Altavista.|
The greatest similarities, of about 30%, were obtained by two second-level Web retrieval systems (in terms of the number of documents indexed and the share of Web searchers): Terra and MSN. At that time, these systems shared the same search technology (Inktomi) and had a high proportion of indexed documents in common, hence this high level of similarity in comparison to the rest. The average similarity obtained for the first ten and twenty documents was 0.15, and 0.16 for the first thirty documents. These values confirmed the general idea of a coincidence level of 15%.
So many changes have happened in the Web in the last five years that many people, such as O'Reilly (2005), say that we are in the age of Web 2.0. In the field of Web retrieval systems, many developments have occurred also which we can summarise as follows:
Bearing this information in mind, we can extrapolate the results obtained in 2002 by allocating the results of MSN to live.com, and by allocating to Yahoo! the best similarity values reached by All the Web or Altavista with each engine, since Yahoo! is the system which, to a degree, has replaced them. Finally, the engines that people no longer use are removed, because they represent only ten percent of the searches routinely carried out by Web users. The result is the adapted Table 5.
|Pos/Sim||Yahoo! - Google||Yahoo! - live.com||Google - live.com|
|Pos: set of analysed documents. Sim: similarity|
Since it is necessary to have updated data and it is not possible to continue doing this type of analysis manually, we decided to develop a meta-searcher, which could perform the search equations of the systems that are the objective of our study and, also, which could automatically determine the similarity of the obtained results. Consequently, we simplify this kind of experiment and are able to update the results.
Our development takes the application programming interfaces (APIs) that each search engine provides and integrates them within one system. Through the analysis of each of these modules, we confirmed that, on the way towards supremacy in the market for information retrieval on the Web, the main retrieval systems establish their own rules. Some, such as Google, seriously restrict the number of results obtained (until the beginning of December 2007 its API returned eight documents per query; it now returns thirty-two), or they force the use of proprietary programming languages, as in the case of live.com. Currently, Yahoo! is the search engine that allows the least restricted use of its indexes. The limit of eight results in the Google API was a constraint in the first phase of our study, carried out in September 2007: it was only possible to take eight results from each engine for each query. Nevertheless, we do not think that this factor devalues the results obtained at the end of the study since, in any case, we achieved a comparison of almost all of the first page of results given by each engine. The next step in the process is to implement the code needed to interrogate each engine and to retrieve the result set returned by each, in order to create a database of samples to be studied and analysed.
In view of the diversity of the APIs given by each system, it is necessary, as far as possible, to establish a programming criterion in which to place the core of the meta-searcher and, from this, to lay out the rest of the parts of this tool. The end result will be a tool in which the code is partitioned depending on its functionality. In short, we have the following:
The database of our meta-searcher allows the results of each search to be stored. It compiles the URL, title, description, ranking, and query-engine relation of each result. It also has a table for storing the statistics of each query.
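As an illustration of the kind of storage described here, the following sketch uses Python's built-in `sqlite3` module. The table names, column names and helper functions (`open_store`, `save_results`) are all assumptions made for this example; the article does not specify the actual schema or database engine of the meta-searcher.

```python
import sqlite3

# Hypothetical schema: one row per query, one row per retrieved result,
# and one row per pairwise similarity statistic.
SCHEMA = """
CREATE TABLE IF NOT EXISTS search_query (
    id INTEGER PRIMARY KEY,
    query_text TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS search_result (
    id INTEGER PRIMARY KEY,
    query_id INTEGER NOT NULL REFERENCES search_query(id),
    engine TEXT NOT NULL,
    rank_pos INTEGER NOT NULL,
    url TEXT NOT NULL,
    title TEXT,
    description TEXT,
    UNIQUE (query_id, engine, rank_pos)
);
CREATE TABLE IF NOT EXISTS query_stats (
    query_id INTEGER NOT NULL REFERENCES search_query(id),
    engine_a TEXT NOT NULL,
    engine_b TEXT NOT NULL,
    cutoff INTEGER NOT NULL,
    similarity REAL NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the sample database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_results(conn, query_text, engine, hits):
    """Store one engine's ranked hits for a query.

    `hits` is a list of (url, title, description) tuples in ranked order.
    Returns the id of the query row.
    """
    row = conn.execute(
        "SELECT id FROM search_query WHERE query_text = ?", (query_text,)
    ).fetchone()
    if row is None:
        query_id = conn.execute(
            "INSERT INTO search_query (query_text) VALUES (?)", (query_text,)
        ).lastrowid
    else:
        query_id = row[0]
    conn.executemany(
        "INSERT INTO search_result "
        "(query_id, engine, rank_pos, url, title, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [
            (query_id, engine, rank, url, title, description)
            for rank, (url, title, description) in enumerate(hits, start=1)
        ],
    )
    conn.commit()
    return query_id
```

Storing the rank alongside each URL keeps the information needed to recompute position-weighted similarities later without querying the engines again.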
In order to test whether the current situation differs from 2002, the experiment was repeated by entering in the meta-searcher the same queries used at that time:
|Turismo rural en la Sierra del Segura||Alquiler de apartamentos en Málaga|
|Historia del Camino de Santiago||Curso a distancia de Programación en PHP|
|Principio de incertidumbre de Heisenberg||Diseño de sistemas multimedia para el aprendizaje|
|Academias de idiomas en Valencia||Discurso del Método de Descartes|
|Diseño accesible a páginas Web||Recetas de cocina y dieta mediterránea|
|Teoría de la Evolución de Darwin||Semana Santa en Murcia|
|Bibliografía de Miguel de Unamuno||Estrategias de Representación del Conocimiento|
|Galerías de Arte en Murcia||Empresas de fabricación de calzado en Alicante|
|Influencia de la televisión en los niños||Apuntes de Sistemas Digitales|
|Apuntes de Estadística Descriptiva||Modelos pedagógicos para la educación a distancia|
|Principio de Conservación de la Energía||Librerías de antiguo en España|
|Apuntes de Historia del Arte Barroco||Temario de Oposiciones de Matemáticas en Secundaria|
|Recopilación de Legislación en Derecho Civil||Historia de la ciudad de Ceuta|
|Compra-Venta de automóviles de segunda mano en Madrid||Evaluación de la calidad de la enseñanza universitaria|
|Literatura Española en el Siglo de Oro||Plan de Estudios de Licenciado en Comunicación Audiovisual|
The similarity results of the first eight documents for each system in this second experiment (September 2007) are shown in Table 7.
|Pos/Sim||Yahoo! - Google||Yahoo! - live.com||Google - live.com|
|Pos: set of analysed documents. Sim: similarity|
Allowing for the slight difference that might have arisen had we been able to determine the similarity of the first ten documents instead of the first eight, we can compare these results with those obtained in 2002, and we see that the average similarity in the response decreases very little (from 15% to 14%). This means that the similarity between the first results of each system is stable. This is surprising, since the size of the indexes has increased considerably and consequently there are many more documents on any subject, so coincidence should be much less likely (or so we thought). Even if a dozen documents is a very small sample with which to evaluate the similarity of the reply, we cannot ignore the large number of Web users who only read the first page of the system output. Within this context, the coincidence value gains much more significance and importance.
As noted above, at the end of 2007 Google announced that its API had enlarged the maximum number of documents obtained with each query from eight to thirty-two. This obviously helped our experiment and we introduced a set of changes in the meta-searcher. This was not as easy as we expected, because Google presented the results in the form of four pages of eight results. This made the analysis of the response more difficult, since the other APIs give the results in a single list. Nevertheless, we managed to make the necessary changes. Currently, the meta-searcher can determine the similarity values up to the first thirty-two documents obtained by the three systems under analysis. In this case, the searches were made in two languages, English and Spanish, in order to verify whether the similarity is influenced by the language. The following results were obtained by repeating the experiment with the new limit:
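The adjustment for paged output can be pictured with a minimal Python sketch. It assumes that each page arrives as a list of URLs in ranked order; the function names `flatten_pages` and `top_k`, the removal of repeated URLs and the chosen cut-off points are assumptions for this example, not details taken from the meta-searcher itself.

```python
from itertools import chain


def flatten_pages(pages):
    """Merge result pages, in page order, into one ranked list of URLs.

    Assumption: a URL repeated on a later page is dropped, so that each
    document keeps only its best position.
    """
    seen = set()
    ranked = []
    for url in chain.from_iterable(pages):
        if url not in seen:
            seen.add(url)
            ranked.append(url)
    return ranked


def top_k(ranked, cutoffs=(8, 16, 24, 32)):
    """Return the first k results for each cut-off (hypothetical cut-offs)."""
    return {k: ranked[:k] for k in cutoffs}


# Four hypothetical pages of eight results each, as a paged API might return.
pages = [[f"http://example.org/p{p}/r{i}" for i in range(8)] for p in range(4)]
single_list = flatten_pages(pages)
subsets = top_k(single_list)
```

Once every engine's response is a single ranked list, the same similarity calculation can be applied at any cut-off up to thirty-two documents.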
|Pos/Sim||Yahoo! - Google||Yahoo! - live.com||Google - live.com|
|Pos: set of analysed documents. Sim: similarity|
In comparison to the results obtained in 2002, there are variations in the similarity engine by engine but not in the average value, which is 0.18 in both cases. This value is slightly higher than when only the first eight documents of the response are analysed. The current experiments seem to indicate that Google and live.com offer the most different responses, although in fact the differences are still small. The queries in English behave very similarly, although the average values are lower (0.12). We think that this is logical, because the space in which documents can be located is broader (on the Web there are more documents written in English than in Spanish).
One of the experiments most often repeated by researchers on information searching is to determine the average length of searches. Spink and Jansen (2004) give more information about it, noting that the average is small (between one and two words per search). Our previous searches were intended to simulate the behaviour of students interested in some subject ('Heisenberg Principle', 'Einstein Theory of Relativity', 'History of Ceuta', etc.) or the behaviour of an ordinary person who wishes to spend a holiday in Málaga or to buy a second-hand car. The truth is that, as our average search length exceeds the usual one, it does not reflect the general user's behaviour.
It would be interesting to verify if the average values of similarity vary with shorter searches. In order to replicate the behaviour of current Web users, we extracted the terms for the new queries from the 2006 Top 10 US Search Ranking (Kopytoff 2007), choosing the ten most commonly used terms in each of the three Web retrieval systems, forming a new group of thirty queries closer to the Web context and their users.
|Wikipedia||American Idol||Song lirycs|
|Rebelde||Chris Brown||New York|
|Mininota||Pamela Anderson||Baby names|
Similarly, we calculated the similarity of only the first ten results of each Web retrieval system. As the next table shows, the average similarity obtained in this new experiment differs little from the previous ones (0.14 against 0.18). A novelty is that null similarity values appear in some cases (for example, the query 'American Idol' has no document in common between the Yahoo! and Windows Live responses); this did not occur frequently in the previous searches, which used longer search formulations.
|Query||Yahoo! - Google||Yahoo! - live.com||Google - live.com|
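A per-query table of this kind can be summarised with a short Python sketch that averages the similarity for each pair of systems and lists the null cases. The function name `summarise` and the input layout are assumptions, and the numbers in the usage example are invented placeholders, not values from the experiment.

```python
from statistics import mean


def summarise(similarities):
    """Summarise per-query pairwise similarities.

    `similarities` maps each query to {(system_a, system_b): value}.
    Returns (mean per pair, overall mean, list of (query, pair) with
    null similarity).
    """
    per_pair = {}
    null_cases = []
    for query, pairs in similarities.items():
        for pair, value in pairs.items():
            per_pair.setdefault(pair, []).append(value)
            if value == 0.0:
                null_cases.append((query, pair))
    pair_means = {pair: mean(values) for pair, values in per_pair.items()}
    overall = mean(v for values in per_pair.values() for v in values)
    return pair_means, overall, null_cases


# Placeholder values for two hypothetical queries and two pairs of systems.
example = {
    "query one": {("A", "B"): 0.2, ("A", "C"): 0.0},
    "query two": {("A", "B"): 0.4, ("A", "C"): 0.2},
}
pair_means, overall, null_cases = summarise(example)
```

Listing the null cases separately makes it easy to see whether they cluster around short queries, as observed in this experiment.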
We have conceived two lines of work for the immediate future. The first, more focused on measuring performance effectiveness, intends to overcome the limit of thirty-two documents in each analysed query, reaching a total of 100 documents. The second line is closer to the user's behaviour: we shall try to incorporate a set of parameters related to other emerging patterns in Web search, such as Web query reformulations, the distribution of search terms and the use of relevance feedback. If we are able to incorporate some of these aspects in our work, we will be able to show the real Web context. We can also repeat these experiments using only one language, instead of both Spanish and English as reported here: the results of the two experiments on search query length could be of interest when the two languages are compared.
Another objective is to extend the number of Web retrieval systems studied. Initially we considered incorporating the API of Ask (Antezeta 2007) or of any of its associated systems although, after a subsequent study, we have opted for a comprehensive redesign of our meta-searcher, because only a small number of APIs are available and their owners limit them severely. We are therefore working on the initial implementation of a new version of our meta-searcher, more complete and powerful, and more independent of the design of the Web retrieval systems.
This extension of the reach of our meta-searcher has another objective: to determine the distance between the reply of a Web retrieval system to an individual question and the ideal reply for that information need. This ideal reply would be based on the semantics of the replies given by each of the Web retrieval systems and would be obtained using singular value decomposition, with an approach similar to the automatic assignment of submitted manuscripts to reviewers proposed by Dumais and Nielsen (1992). We would need a wider and stronger information base, so we must increase the number of analysed documents. Increasing the number of sources will also be essential.
We think it is possible to approach the performance evaluation of Web retrieval systems from the perspective of information searching, incorporating several aspects of Web users' trends and habits into the design of an evaluation methodology. The indexes of the Web retrieval systems are very different. Each system seems to have indexed different spaces in the Web, with very little overlap, mainly in the first documents of the replies (which is unlikely to lead to user satisfaction). This overlap is slightly lower for documents written in English than for those written in Spanish. Certainly, the different criteria for implementing the ranking algorithms contribute to this. Our study confirms that there is little similarity between the responses of these systems. Of the Web retrieval systems analysed, Google and live.com have the least overlap, and for an exhaustive information search it is necessary to employ several Web retrieval systems at the same time.
The number of search terms does not introduce significant differences in the similarity of the reply; it is not a decisive factor for search engines. It may be that a set of thirty documents is too small for comparison, given the magnitude of the indexes of the Web retrieval systems. These conclusions need to be confirmed by extending the number of documents in the analysed sample. Having more information and a consistent basis for calculation (in the number of documents obtained and the sources of information analysed) can help us to create an ideal response for a given information request. If we include in the calculation factors that are close to the users, we will be calculating the ideal response for a user. Undoubtedly, this would be a significant achievement.
This work and our stay in Vilnius would not have been possible without the invaluable help of Tom Wilson and Elena Maceviciute. Our sincere acknowledgement to them.
© the authors, 2008.
Last updated: 9 December, 2008