Open Topics

I am always looking for good students to write bachelor's or master's theses.

Please have a look at the list of available titles here.

The thesis can be in one of the following broad areas:

  • Big Data

  • Data Profiling

  • Linked Data

  • Text Mining

  • Recommender Systems

  • Information Retrieval

  • Natural Language Processing

  • Open Data, Open Government

  • Visual Analytics

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact me to discuss further, to find new topics, or to suggest a topic of your own.

Big Linked Data tools for visualization

Visualization is an interdisciplinary imaging science devoted to making the invisible visible through the techniques of experimental visualization and computer-aided visualization.

Linked Data visualization systems are tools for navigating, browsing, and querying RDF datasets. They can be divided into generic systems and graphical exploration systems.
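
As a minimal sketch of the querying layer that such tools build on, the code below uses rdflib; the inline RDF data and the class-count query are illustrative, not tied to any specific system.

    # Load a small RDF dataset and count instances per class: the kind of
    # summary a visualization front-end might render as a bar chart.
    from rdflib import Graph

    TURTLE = """
    @prefix ex: <http://example.org/> .
    ex:alice a ex:Person . ex:bob a ex:Person . ex:acme a ex:Company .
    """

    g = Graph()
    g.parse(data=TURTLE, format="turtle")

    QUERY = """
    SELECT ?class (COUNT(?s) AS ?n)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?n)
    """
    for cls, n in g.query(QUERY):
        print(cls, n)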

Generic display systems (such as Rhizomer [Brunetti 2012], LODWheel [Stuhr 2011], SemLens [Heim 2011], Payola [Klímek 2013], LDVizWiz [Atemezing 2014], VisWizard [Tschinkel 2014], LinkDaViz [Thellmann 2015], ViCoMap [Ristoski 2015]) support different types of data (for example, numerical, temporal, graphical, spatial) and provide different types of visualization. Some systems offer recommendation mechanisms that suggest the most suitable form of visualization depending on the input data (LinkDaViz, VisWizard, LDVizWiz). With regard to visual scalability, most systems do not adopt approximation techniques such as sampling, filtering, or aggregation; existing approaches assume that all objects can be presented on the screen and managed through traditional visualization techniques, which limits their applicability to datasets of limited size. Exceptions are SynopsViz [Bikakis 2017] and VizBoard [Voigt 2012], which exploit external memory at runtime. A sketch of the aggregation idea follows below.
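
As a minimal illustration of the aggregation idea used by systems such as SynopsViz, the sketch below (with synthetic data) bins a large numeric property into a fixed number of buckets, so the chart stays readable regardless of dataset size.

    # Bin one million synthetic values into 50 buckets: the renderer then
    # draws 50 bars instead of one point per value.
    import numpy as np

    values = np.random.default_rng(0).lognormal(mean=3.0, sigma=1.0, size=1_000_000)
    counts, edges = np.histogram(values, bins=50)
    for lo, hi, c in zip(edges, edges[1:], counts):
        print(f"[{lo:10.1f}, {hi:10.1f}) {c}")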

Graphical exploration and visualization systems (such as FlexViz [Falconer 2010], RelFinder [Heim 2010], LODWheel [Stuhr 2011], LodLive [Camarda 2012], LODeX [Benedetti 2014, Benedetti 2016], VOWL 2 [Lohmann 2015], graphVizdb [Bikakis 2016]) are of great importance due to the graph structure of the RDF data model. Although several systems offer sampling or aggregation mechanisms, most of them load the entire graph into main memory. Because graph layout algorithms require a lot of memory to draw large graphs, current systems are limited to handling small graphs.

This thesis focuses on how these tools can handle large graphs. To do so, they should adopt more sophisticated techniques, such as hierarchical aggregation approaches, in which the graph is recursively decomposed into smaller subgraphs (using clustering and partitioning techniques) to form a hierarchy of abstraction levels [Archambault 2007, Auber 2004, Tong 2013, Li 2015], and edge bundling techniques, which aggregate the edges of the graph into bundles [Cui 2008, Gansner 2011]. The thesis should also treat scalability and performance as key requirements and investigate disk-based implementations, as in [Tong 2006, Sundara 2010].
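
As a minimal sketch of one level of such a hierarchy, the code below uses NetworkX with Louvain clustering (standing in for whatever clustering or partitioning technique the thesis adopts) and a small example graph as a stand-in for a large RDF graph. Each detected community is collapsed into a single super-node, yielding a coarser graph that is cheap to lay out; applying the step recursively produces the hierarchy of abstraction levels described above.

    # One coarsening step: cluster the graph and collapse each community
    # into a super-node. Edges between communities are merged by the
    # quotient construction.
    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    def coarsen(graph: nx.Graph) -> nx.Graph:
        communities = louvain_communities(graph, seed=42)
        return nx.quotient_graph(graph, communities, relabel=True)

    g = nx.les_miserables_graph()   # small stand-in for a large graph
    h = coarsen(g)
    print(g.number_of_nodes(), "nodes ->", h.number_of_nodes(), "super-nodes")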

Large Scale News Classification

The news published every day by a large number of newspapers constitutes a vast amount of information stored in electronic format. It has become necessary to interpret and analyse these data and to extract facts that can support decision-making. Data mining, which extracts hidden information from huge databases, is a powerful tool for this purpose. Until the beginning of the last decade, news was not easily and quickly available; today it is readily accessible through content providers such as online news services. A huge amount of information exists as text in diverse areas, and its analysis can be beneficial in many fields.

News classification is a challenging task, as it requires preprocessing steps to convert unstructured text into structured information. As the volume of news grows, it becomes difficult for users to find the articles they are interested in, which makes it necessary to categorize news so that it can be accessed easily.

The goal of this thesis is to review the state of the art in classification algorithms and to select the most appropriate one based on the structure and contents of the news. The experiments are based on a large dataset of roughly 2,000,000 news articles collected from 21 Italian newspapers over a time window of more than one year.
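
As a minimal baseline sketch for such a comparison, the code below uses scikit-learn, with the public 20 Newsgroups corpus standing in for the Italian news dataset: TF-IDF features combined with a linear SVM are a strong, scalable starting point.

    # TF-IDF + linear SVM baseline for news categorization.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

    # Preprocessing (tokenization, lowercasing, stop-word removal) is
    # folded into the vectorizer; a linear SVM scales to millions of
    # documents.
    model = make_pipeline(
        TfidfVectorizer(sublinear_tf=True, min_df=5, stop_words="english"),
        LinearSVC(),
    )
    model.fit(train.data, train.target)
    print(classification_report(test.target, model.predict(test.data),
                                target_names=test.target_names))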

Event Detection and Analysis

In event detection and analysis, we try to identify mentions of events in text documents and to classify them according to a set of predefined classes. The state of the art comprises a range of approaches to this task, from rule-based systems to a variety of machine-learning methods. However, most approaches require background-knowledge sources and hand-crafted rules, or training data.

In your thesis, you will review the state of the art in event detection, analysis, and tracking, and implement a flexible, scalable system for detecting events in news streams.
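
As a minimal sketch of one classic baseline from Topic Detection and Tracking, the code below performs single-pass clustering over a toy news stream; the articles and the similarity threshold are illustrative. Each incoming article joins the most similar existing event, or starts a new one.

    # Single-pass clustering over a toy news stream.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stream = [
        "Earthquake strikes central Italy, buildings damaged",
        "Strong quake hits Italy; rescue teams deployed",
        "Champions League final ends in penalty shootout",
    ]

    # For simplicity the vectorizer is fitted on the whole stream; a real
    # system would fit it incrementally or use hashing features.
    vectorizer = TfidfVectorizer().fit(stream)
    events = []   # each event is a list of article vectors

    for doc in stream:
        vec = vectorizer.transform([doc])
        best, best_sim = None, 0.0
        for event in events:
            sim = max(cosine_similarity(vec, v)[0, 0] for v in event)
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= 0.2:   # assumed threshold
            best.append(vec)
        else:
            events.append([vec])

    print(len(events), "events detected")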

Data Exploration with Profiling Results

With data profiling, we can efficiently detect numerous statistics and dependencies within a given dataset. Such profiling results help to explore the data, its content, and its inner logic. The amount of discovered metadata is, however, often so overwhelmingly large that interesting patterns and relevant statements are impossible to see. For this reason, this thesis aims to investigate visual and analytical methods that bring these insights to light. One core task for these methods is to separate random results from semantically meaningful ones, a classification task that could be well suited to machine learning algorithms. Another aspect of data exploration is to find patterns in the metadata, such as cliques, chains, hubs, and authorities, that help to assess the relevance and the connections of schema attributes; a sketch of this idea follows below. The overall goal of data exploration is to extract as many insights as possible about the data from its metadata.
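
As a minimal sketch of the hubs-and-authorities idea, the code below models discovered inclusion dependencies between schema attributes as a directed graph and scores them with HITS; the dependency list is illustrative.

    # Model inclusion dependencies as edges (dependent -> referenced)
    # and let HITS surface hub and authority attributes.
    import networkx as nx

    inds = [
        ("orders.customer_id", "customers.id"),
        ("invoices.customer_id", "customers.id"),
        ("invoices.order_id", "orders.id"),
    ]

    g = nx.DiGraph(inds)
    hubs, authorities = nx.hits(g)
    print(max(authorities, key=authorities.get))   # most referenced attribute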

Sentiment Analysis: from simple polarity to emotion analysis

The advent of the Web has driven the evolution of sentiment analysis towards analysing subjective text at a finer level of granularity. Opinion mining, or sentiment analysis, is a text classification task: it mines subjective expressions written in text and automatically summarizes opinions about an object of interest, or about one or more of its related features or aspects.

Despite the importance of facts, opinions and emotions also play a fundamental role. Politicians need to know what people think about their new rules and policies, which matters for their next elections. Individuals need a clear understanding of a particular object, or of some of its aspects, in order to make the right decision about whether to buy it or to go for a better choice. Manufacturers need to know why the sales of one of their product lines are low, so that they can decide on an improvement or produce a new line of products.
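
As a minimal sketch of the step from polarity to emotion, the code below uses NLTK's VADER for the polarity score; the tiny word-to-emotion lexicon is illustrative, where a real system would use a resource such as NRC EmoLex.

    # Polarity from VADER, emotions from a toy word -> emotion lexicon.
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    EMOTIONS = {   # toy lexicon, for illustration only
        "love": "joy", "great": "joy",
        "afraid": "fear", "terrible": "fear",
        "angry": "anger", "hate": "anger",
    }

    def analyse(text):
        polarity = sia.polarity_scores(text)["compound"]
        emotions = {EMOTIONS[w] for w in text.lower().split() if w in EMOTIONS}
        return polarity, emotions

    print(analyse("I love this phone but I hate its terrible battery"))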

Data Visualization

Providing visual analysis of a large set of data is an increasingly important aspect in Business Intelligence and Data Analytics. In many areas of science and industry, the amount of data is growing fast and often already exceeds the ability to evaluate it. On the other hand, the unprecedented amount of available data bears an enormous potential for supporting decision-making. Turning data into comprehensible knowledge is a key research challenge of the 21st century.

Data visualization tools are systems that present data in a graphical format. Graphs, unlike tabular representations, convey information in a universal way and encourage the sharing of ideas. The power of the human visual system makes data visualization an appropriate method for comprehending large datasets. Interactive visualization enables a discourse between the human brain and the data that can transform a cognitive problem into a perceptual one. However, the visual analysis of large and complex datasets involves both visual and computational challenges: visual limits concern the perceptual and cognitive limitations of the user and the restrictions of display devices, while computational limits relate to the computational complexity of the algorithms involved.

More and more tools are available for creating graphs from data. Work on this topic involves analysing the tools on the market, comparing them, and testing some of them; a small example with one such tool follows.
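
As a small taste of what such tools offer, the sketch below uses Plotly Express, one of many candidate libraries; the dataset is a sample shipped with the library.

    # An interactive scatter plot with tooltips, zooming, and panning.
    import plotly.express as px

    df = px.data.gapminder().query("year == 2007")
    fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop",
                     color="continent", log_x=True, hover_name="country")
    fig.show()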

Previous theses published on this topic.

The use of open data in education to foster critical thinking

Every day we receive a large amount of scientific and technological information from multiple sources: comments on social networks, news in the media, WhatsApp messages, and so on. This information conditions our conception of the world and our positioning as citizens. The danger lies in the fact that not all the information that reaches us has been verified: it is often superficial and unreliable content that we share uncritically, out of a need for immediacy or because we lack the knowledge to check its veracity. It is therefore essential to have critical scientific-technological knowledge and informed judgment.

Critical thinking promotes the ability to interpret and evaluate the information around us. Thanks to it, we can better understand our environment, which favours our participation in it. It also encourages reasoned decision-making, both professional and personal.

These skills are especially necessary for the citizens who will join working and democratic life in the coming years, so it is important to provide students with the right tools for that purpose. One option is to incorporate active learning techniques that promote the use of open data and extract value from it through intuitive, easy-to-use data analysis and visualization tools, as in the sketch below.
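
As a minimal classroom-style sketch, the code below uses pandas and matplotlib; the inline table is a stand-in for a real open-government CSV file. A few lines suffice to turn open data into a chart that students can question and discuss.

    # Load (stand-in) open data and plot it for discussion in class.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.DataFrame({
        "district": ["North", "South", "East", "West"],
        "air_quality_index": [42, 55, 61, 38],
    })

    data.plot.bar(x="district", y="air_quality_index", legend=False)
    plt.ylabel("Air quality index")
    plt.title("Which district breathes the cleanest air, and why?")
    plt.tight_layout()
    plt.show()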

More and more countries are aware of the potential of open data in the education sector and are starting to implement initiatives to introduce open data into the curriculum. The European Data Portal has created Open Data Schools; in Northern Ireland, a contest has been launched to promote innovative ideas on how to use open data to support teachers; Argentina runs the initiative "School in the cloud: open data in the classroom"; and in Germany a project was started to develop software applications that exploit the potential of open data. In Spain, initiatives of this type are also being developed, such as the Comciencia School.