WebART in 2015: 'research engines' and beyond

Post date: Dec 11, 2015 11:10:28 AM

The rapidly growing number of events on Web archiving* shows that interest in this important topic keeps increasing. This year, the researchers in the WebART team presented at a number of international conferences, including the 'Web Archives as Scholarly Sources' conference, and our work was also published in several journal papers. This post summarizes a few of the highlights.

First of all, we have extended last year's paper "Finding Pages On the Unarchived Web" to a full journal paper for the International Journal on Digital Libraries, which has been published as Open Access [1]. In this paper, using the link structure of the web pages in the archive, we uncovered and reconstructed unarchived pages referenced in the Dutch Web archive. The results of this work show that a substantial number of unarchived pages can be found, a set almost the same size as the actual archived contents. The journal paper further extends this work, and shows that creating site summaries (i.e. combinations of the 'anchor text' pointing to whole websites) can enhance the retrieval effectiveness of unarchived content.
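To make the idea concrete, here is a minimal sketch of how link evidence could be used in this way. It is not the code used in the paper: the function names, the (source, target, anchor text) link format and the choice to group site summaries by hostname are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_unarchived(links, archived_urls):
    """Collect anchor text for link targets that are missing from the archive.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples
    extracted from archived pages (an assumed input format).
    """
    page_anchors = defaultdict(list)
    for source, target, anchor in links:
        # Only links found on archived pages can serve as evidence.
        if source in archived_urls and target not in archived_urls:
            page_anchors[target].append(anchor)
    return dict(page_anchors)


def site_summaries(page_anchors):
    """Combine the anchor text of all unarchived pages per hostname."""
    sites = defaultdict(list)
    for url, anchors in page_anchors.items():
        host = urlsplit(url).hostname or ""
        sites[host].extend(anchors)
    return {host: " ".join(anchors) for host, anchors in sites.items()}


# Hypothetical example data
archived = {"http://a.example/", "http://a.example/about"}
links = [
    ("http://a.example/", "http://b.example/x", "sports news"),
    ("http://a.example/", "http://b.example/y", "weather"),
    ("http://a.example/", "http://a.example/about", "about us"),
]
unarchived = find_unarchived(links, archived)
summaries = site_summaries(unarchived)
```

In this toy example the two pages on b.example are not in the archive, but their anchor text still gives a searchable description of each page and, once combined, of the site as a whole.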

Furthermore, we have investigated whether temporal anchor text can be used as a proxy for user queries, and this work was presented at the Fifth International Workshop on Semantic Digital Archives [2].

As enabling access for research use is an important aim of the WebART project, we presented our work at the Web Archives as Scholarly Sources international conference in Denmark. This conference brought together a large number of researchers and practitioners, evidencing the increased interest in Web archives for research purposes. The conference presentation summarized our approach of moving beyond purely URL-based and keyword search in Web archives, towards 'research engines' that support the whole research process [3]. Based on a literature survey, we identified the different needs of researchers in the research phases of corpus creation, analysis and dissemination, which demonstrate the limitations of current Web archive access tools. We also presented the solutions developed in WebART for supporting these research phases. A paper presented at the European Conference on Information Literacy [4] further looked at the theoretical underpinnings of providing stage-based search support.

At the intersection of Information Retrieval, Contextual Suggestions and archived Web data, a paper was published in the Information Retrieval Journal [5]. The effectiveness of Information Retrieval systems is usually measured using test collections, which contain sets of web pages that are indexed by experimental systems. In the area of Contextual Suggestions, this paper looks at the balance between reproducibility and representativeness when building these test collections, a topic of key importance for Information Retrieval research.

Finally, in collaboration with Spinque, work has been carried out towards supporting search strategies in the Web archive. An experimental prototype has been created, which allows researchers to search Dutch news data, while being able to precisely customize their search engine via visual 'building blocks'. 

This makes it possible to answer novel research questions. To take a hypothetical example, a researcher looking at rivalry between neighbouring countries could define a corpus with all news items of a specific news website, and select only the articles in the category 'sports' which mention neighbouring countries, just by connecting a small number of visual building blocks. This opens up new, fluid ways of supporting the research process.
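For readers who think in code, the hypothetical example above can be approximated as a chain of small, composable filters. This is only an illustrative sketch and not how the Spinque prototype is implemented: the block names, the `connect` helper and the article fields (`site`, `category`, `text`) are all invented here.

```python
from functools import reduce


def from_site(site):
    """Block: keep only articles from one news website."""
    return lambda items: (a for a in items if a["site"] == site)


def in_category(category):
    """Block: keep only articles in the given category."""
    return lambda items: (a for a in items if a["category"] == category)


def mentions_any(terms):
    """Block: keep only articles whose text mentions at least one term."""
    lowered = [t.lower() for t in terms]
    return lambda items: (
        a for a in items if any(t in a["text"].lower() for t in lowered)
    )


def connect(*blocks):
    """Connect blocks into one search strategy, applied left to right."""
    def run(items):
        return list(reduce(lambda stream, block: block(stream), blocks, items))
    return run


# Hypothetical example data
articles = [
    {"site": "news.example", "category": "sports", "text": "Oranje beat Belgium."},
    {"site": "news.example", "category": "politics", "text": "Talks with Germany."},
    {"site": "other.example", "category": "sports", "text": "Germany wins again."},
    {"site": "news.example", "category": "sports", "text": "Local derby ends in a draw."},
]
strategy = connect(
    from_site("news.example"),
    in_category("sports"),
    mentions_any(["Belgium", "Germany"]),
)
corpus = strategy(articles)
```

Each block does one thing, so a researcher can add, remove or reorder blocks to refine the corpus without rewriting the whole query.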

* For instance the Web Archives as Scholarly Sources conference and Web Archiving 2015: Capture, Curate, Analyze, but also regular conferences including the topic of Web archiving, such as iPres 2015, TPDL 2015 and JCDL 2015.

[1] Huurdeman, H. C., Kamps, J., Samar, T., Vries, A. P. de, Ben-David, A., & Rogers, R. A. (2015). Lost but not forgotten: finding pages on the unarchived web. International Journal on Digital Libraries, 1–19.

[2] Thaer Samar and Arjen P. de Vries (2015). Temporal Anchor Text as Proxy for Real User Queries. Proceedings of the Fifth International Workshop on Semantic Digital Archives, co-located with TPDL 2015, Poznań, Poland, September 14-18, 2015.

[3] Hugo C. Huurdeman (2015). Towards Research Engines: Supporting Search Stages in Web archives. Paper presented at Web Archives as Scholarly Sources conference, Aarhus, Denmark.

[4] Hugo C. Huurdeman and Jaap Kamps (forthcoming). Supporting the Process: Adapting Search Systems to Search Stages. European Conference on Information Literacy (ECIL), Tallinn, Estonia, October 2015.

[5] Thaer Samar, Alejandro Bellogín and Arjen P. de Vries (forthcoming). The Strange Case of Reproducibility vs. Representativeness in Contextual Suggestion Test Collections. Information Retrieval Journal.