Advisors: Víctor Fresno and Raquel Martínez.
Keeping information organized makes it easier to access. Although the information we need is sometimes available on the Web, it is only useful if we are able to find it. For this reason, automatic techniques for grouping documents are increasingly common.
In this thesis we are interested in document clustering, that is, grouping documents based on the similarity of their contents. Document representation plays a very important role in web page clustering and is the central research topic of this dissertation. Web pages are commonly written in HTML, which offers explicit information (in the form of tags) about their visual presentation, the typography of the text, and its structure, among other things. It is also a widely used format on the Internet. The main goal of this thesis is to conduct an in-depth study of how to make the most of a fuzzy model for representing HTML documents in clustering tasks.
Our study explores whether any part of the system could be exploited in a different way to improve clustering results. We begin by analyzing the parts of the system where there is room for improvement, and then study different alternatives for improving them. Thus, we do not propose a document representation from the outset; instead, we build it step by step, trying to understand each of its parts along the way.
To evaluate our results and compare the different representation proposals, we use several web page collections that were previously gathered to serve as gold standards. Clustering is performed with state-of-the-art algorithms, and our proposals are validated in both flat and hierarchical clustering settings. Lastly, we also test the usefulness of our approaches in two languages: English and Spanish.
I successfully defended my PhD thesis on October 23rd, 2012 at UNED (Madrid). The examining committee consisted of the following members:
Felisa Verdejo (UNED), president.
Paolo Rosso (Universitat Politècnica de València)
Fernando Martínez (Universidad de Jaén)
Steven Schockaert (Cardiff University)
Manuel de Buenaga (Universidad Europea de Madrid), secretary.
The PhD thesis was unanimously awarded the highest grade (Apto Cum Laude) and received the Extraordinary Doctoral Award for dissertations defended at UNED in 2012/2013.
The slides I used for the defense, as well as the dissertation, can be downloaded from the following links:
An Improved Fuzzy System for Representing Web Pages in Clustering Tasks by Alberto Pérez García-Plaza is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.