Duration: Aug - Nov, 2017
Status: Completed
Members: Simran Barnwal
Overview
In an age of exponential data growth, interlinking web pages helps classify information and give it structure. Often, however, incorrect links redirect users to content irrelevant to their primary interest. To make the web's large pool of information genuinely usable, the credibility and correctness of these interlinks is of utmost importance. This project focuses on improving the correctness of the content links present on Wikipedia in the form of annotations, thereby increasing the usability of the underlying data.
Approach
To filter erroneous links, our fundamental aim was to measure the closeness between the anchored text and the title it links to. To define a closeness metric for these two entities, we divided our approach into three parts:
- Firstly, we manually screened a small subset of the dataset and looked for erroneous alternate titles and surface names corresponding to a title. For declaring an anchored text correct, we restricted ourselves to texts that genuinely related to the title without admitting multiple interpretations. Irrelevant or only partially related texts were set aside for analysis, serving as a guideline for designing an approach that filters non-referential texts automatically.
- In the second stage of our study, we implemented the non-referential text filter by using the Levenshtein distance, combined with the observations made during manual screening, as the closeness metric. Only texts whose metric fell below a defined threshold were declared correct interlinks. Our aim was to choose a threshold that kept the false acceptance rate low.
- Lastly, we compared the performance of the devised filter in terms of precision and recall across different threshold values, and chose the threshold that gave the best result.
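The pipeline above can be sketched in Python. This is a minimal illustration, not the project's actual implementation: the Levenshtein distance is computed with the standard dynamic-programming recurrence and normalised by the longer string's length so one threshold applies to anchors of any size, and the small labelled set and the threshold grid are hypothetical placeholders, not project data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_distance(anchor: str, title: str) -> float:
    """Scale the distance into [0, 1] so one threshold fits all lengths."""
    a, b = anchor.lower(), title.lower()
    return levenshtein(a, b) / max(len(a), len(b), 1)

# Hypothetical labelled pairs: (anchor text, linked title, is the link correct?)
labeled = [
    ("Barack Obama", "Barack Obama", True),
    ("Obama", "Barack Obama", True),
    ("apple", "Apple Inc.", True),
    ("click here", "Barack Obama", False),
    ("the city", "New York City", False),
]

def evaluate(threshold: float) -> tuple[float, float]:
    """Precision and recall of 'accept when distance <= threshold'."""
    tp = fp = fn = 0
    for anchor, title, correct in labeled:
        accepted = normalized_distance(anchor, title) <= threshold
        if accepted and correct:
            tp += 1
        elif accepted and not correct:
            fp += 1
        elif not accepted and correct:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Sweep a grid of thresholds and keep the one with the best F1 score.
best_t = max((t / 20 for t in range(1, 20)), key=lambda t: f1(*evaluate(t)))
```

A low threshold accepts only near-exact anchor/title matches (high precision, low recall); raising it admits looser surface forms at the cost of more false acceptances, which is the trade-off the precision/recall comparison in the last step resolves.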