Provenance describing data derivations in social networks is commonly referred to as social provenance; it helps in estimating data quality, tracking resources, and understanding how information diffuses through social networks. We observe several challenges related to provenance in the social networks domain. First, provenance collection systems capture provenance on the fly, but their collection mechanisms may be faulty and may drop provenance notifications; hence, social provenance records can be partial, partitioned, or simply inaccurate. Second, social provenance records can grow large quickly because of the high number of participating actors. While the number of services involved in e-Science workflows is on the order of hundreds, the number of social interactions happening in social media can reach the order of thousands or millions. Hence, an extensive experimental study is needed to determine whether the current state of the art in standalone and centralized provenance databases can handle large-scale provenance data. To address these challenges, this study introduces a large-scale, noisy, synthetic social provenance database that includes a high volume of large social provenance graphs.
We developed a tool that generates random tweet data based on the observation that any social scenario, regardless of how many users are engaged in it or how many social activities have been applied to it, can be visualized as a forked linear graph. First, the tool keeps track of the entities linked to the main workflow, created either by retweeting or replying. It considers only social activities that may be applied to a tweet: it takes into account “Tweet”, “Like”, “Retweet”, and “Reply”, and ignores other social activities such as “Follow” and “Unfollow”. It creates a new, unique agent for every social operation, and it treats every social operation as influenced by the last social operation applied to the same entity. The tool starts by creating an initial activity representing the “Tweet” operation, which leads to the creation of an entity representing the original tweet. From that point, the tool randomly invokes social operations until the desired number of social operations is reached.
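The generation procedure above can be sketched as follows. This is an illustrative reconstruction, not the tool's actual implementation: all identifiers (`generate_workflow`, the `agent-`/`tweet-` naming scheme, the record fields) are assumptions, and the tool's real output is PROV-O, not Python dicts.

```python
import random

# "Follow"/"Unfollow" are deliberately ignored, as described in the text.
SOCIAL_OPS = ["Like", "Retweet", "Reply"]

def generate_workflow(n_ops, seed=None):
    """Generate one forked linear social workflow of n_ops operations.

    Each record links an operation (activity), a fresh unique agent, the
    entity acted upon, and the id of the last operation on that entity
    (the operation that influenced it).
    """
    rng = random.Random(seed)
    # The initial "Tweet" activity creates the original tweet entity.
    ops = [{"id": 0, "type": "Tweet", "agent": "agent-0",
            "entity": "tweet-0", "influenced_by": None}]
    entities = ["tweet-0"]        # entities that can receive operations
    last_op_on = {"tweet-0": 0}   # last operation applied to each entity

    while len(ops) < n_ops:
        op_id = len(ops)
        op_type = rng.choice(SOCIAL_OPS)
        target = rng.choice(entities)
        record = {"id": op_id, "type": op_type,
                  "agent": f"agent-{op_id}",  # a new agent per operation
                  "entity": target,
                  "influenced_by": last_op_on[target]}
        # Retweets and replies derive new entities, forking the graph.
        if op_type in ("Retweet", "Reply"):
            new_entity = f"tweet-{op_id}"
            record["derives"] = new_entity
            entities.append(new_entity)
            last_op_on[new_entity] = op_id
        last_op_on[target] = op_id
        ops.append(record)
    return ops
```

Tracking only the last operation per entity keeps each branch of the graph linear, which is what yields the forked-linear shape described above.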
Throughout the generation process, we attach several social metrics as provenance attributes to the PROV-O nodes (entities, agents, and activities). We ingest one hundred random workflows of the following two types:
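Attaching metrics as attributes might look like the following sketch. The metric names (`social:likeCount`, `social:followerCount`) and their value ranges are assumptions for illustration; the study's actual attribute set is not specified here.

```python
import random

def attach_metrics(node, rng):
    """Attach example social metrics as PROV-style attributes to a node.

    `node` is a plain dict standing in for a PROV-O entity, agent, or
    activity; the attribute keys mimic prefixed PROV attribute names.
    """
    node["attributes"] = {
        "prov:type": node.get("type", "Entity"),
        "social:likeCount": rng.randint(0, 1000),       # assumed metric
        "social:followerCount": rng.randint(0, 10000),  # assumed metric
    }
    return node

rng = random.Random(42)
tweet = attach_metrics({"id": "tweet-0", "type": "Tweet"}, rng)
```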
Following this methodology, we ingest social workflows of 100, 1000, 2000, 3000, and 4000 social operations. The figure below shows an example social provenance graph generated with our provenance generation tool.
Thanks to the Komadu development team, we used the open-source Komadu 1.0 release they provided to ingest our synthetically generated provenance. It can be reached as a web service at http://95.183.194.97:8080/axis2/services/KomaduService
An SQL dump of the database can be downloaded from “Will be provided soon”.