Big data analytics is the process of generating knowledge from large datasets that contain a variety of data. Big data is collected from multiple sources, such as the public web, social media, Internet whereabouts, and sensors, all of which are highly prone to data linkage. Tools such as high-performance computing clusters, Hadoop, and Spark are used to process it. Big data analytics has created tremendous opportunities for researchers to process huge amounts of data; however, it has also created a serious threat to the privacy of individuals. The data processed by big data analytics platforms also contains personal information about the data owners, which must be protected while deriving useful results for research purposes. Because big data is gathered using different sources and tools, it may lead to privacy breaches. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are used to de-identify the data; however, the chance of re-identification always remains, because the data is collected from multiple sources and a record can easily be narrowed down by linking it with publicly available data. It is difficult to apply existing privacy models (privacy preserving approaches) to big data analytics because of the 3Vs of big data: Volume (large amounts of data), Velocity (fast generation and processing of data), and Variety (structured, semi-structured, or unstructured data).
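To make the privacy models named above concrete, the following minimal Python sketch checks whether a small table satisfies k-anonymity and (distinct) l-diversity. The function names, the toy table, and the choice of `age` and `zip` as quasi-identifiers are illustrative assumptions, not part of the proposed approaches.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class holds at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(g) >= k for g in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every class holds at least l distinct sensitive values
    (the simple 'distinct l-diversity' variant, assumed here)."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len({r[sensitive] for r in g}) >= l for g in groups.values())


# Hypothetical, already generalized toy table.
table = [
    {"age": "20-29", "zip": "481**", "disease": "flu"},
    {"age": "20-29", "zip": "481**", "disease": "asthma"},
    {"age": "30-39", "zip": "482**", "disease": "flu"},
    {"age": "30-39", "zip": "482**", "disease": "flu"},
]

k_ok = is_k_anonymous(table, ["age", "zip"], k=2)
l_ok = is_l_diverse(table, ["age", "zip"], "disease", l=2)
print(k_ok, l_ok)
```

In this toy table every class has two records, so it is 2-anonymous, but the second class contains a single disease value, so it is not 2-diverse. This is the kind of gap that motivates extending k-anonymization to l-diversity.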
Due to the large volume of data, fewer records need to be generalized, suppressed, or both to achieve the same level of privacy, an effect known as the ``large crowd effect''; however, anonymizing such large data remains challenging. MapReduce handles a large volume of data by distributing it in smaller chunks across multiple nodes; consequently, the full advantage of the large volume is not realized. Scalability of privacy preserving approaches therefore becomes a challenging area of research. In this work, we explore this area and propose an algorithm named Scalable k-Anonymization (SKA) that uses MapReduce for privacy preserving big data publishing. We compare our approach with existing approaches; the comparison shows a remarkable improvement in data utility and a significant reduction in running time. We further find that SKA can be extended to l-diversity, and we therefore propose Scalable l-Diversity (SLD) as an extension of SKA. SLD also shows significant improvement in terms of both Normalized Cardinality Penalty (NCP) and Running Time (RT) with respect to the existing approach. We also identify that the arrangement of columns plays a significant role in the anonymization process. A proper arrangement of columns in the input dataset produces a tighter arrangement of records for anonymization, which ultimately leads to lower information loss. We implement the concept of column arrangement for both of our proposed approaches, SKA and SLD. The resulting improved versions are called Improved Scalable k-Anonymization (ImSKA) and Improved Scalable l-Diversity (ImSLD). They show significant improvement in running time because they need fewer MapReduce iterations. They also incur lower information loss than the existing approaches at the same level of privacy, due to the tighter arrangement of records in the initial equivalence class.
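The link between a tighter arrangement of records and lower information loss can be illustrated with a small Python sketch of an NCP-style penalty. It assumes the common range-based definition for numeric attributes (range of a value within its equivalence class divided by the range of the whole domain, weighted by class size and averaged over records and attributes). The function name and the toy ages are hypothetical, and the exact formula used in this work may differ.

```python
def ncp_numeric(classes, attributes):
    """Average range-based penalty over all records and attributes.

    `classes` is a list of equivalence classes, each a list of dicts.
    Assumption: every attribute is numeric and is generalized to the
    [min, max] interval of its equivalence class.
    """
    records = [r for g in classes for r in g]
    if not records or not attributes:
        return 0.0
    # Domain range of each attribute over the whole dataset.
    domain = {
        a: (min(r[a] for r in records), max(r[a] for r in records))
        for a in attributes
    }
    total = 0.0
    for g in classes:
        for a in attributes:
            lo, hi = domain[a]
            if hi == lo:
                continue  # constant attribute: generalizing it loses nothing
            span = max(r[a] for r in g) - min(r[a] for r in g)
            total += len(g) * span / (hi - lo)
    return total / (len(records) * len(attributes))


# Same four records, grouped two different ways with k = 2.
tight = [[{"age": 20}, {"age": 22}], [{"age": 30}, {"age": 38}]]
loose = [[{"age": 20}, {"age": 30}], [{"age": 22}, {"age": 38}]]

ncp_tight = ncp_numeric(tight, ["age"])
ncp_loose = ncp_numeric(loose, ["age"])
print(ncp_tight < ncp_loose)
```

Both groupings are 2-anonymous, yet the grouping that keeps similar ages together has the smaller penalty. This is the effect that a good initial ordering of the records aims for.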
We further find that in some scenarios efficient anonymization is not enough and timely anonymization is needed. Hence, to incorporate the velocity of data into the SKA approach, we propose another novel approach called Scalable (\alpha, k)-Anonymization (SAKA), where \alpha is a delay constraint. Once \alpha records have arrived, SKA starts anonymizing that batch, and the process continues as long as data remains in the input stream. Our proposed approach outperforms the existing approaches in terms of information loss and running time. To the best of our knowledge, based on an extensive survey of past and current publications in this area, we are the first to propose scalable anonymization using the MapReduce framework for the velocity of data.
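The batching idea behind the delay constraint can be sketched in Python as follows. This is only an illustration under stated assumptions: `anonymize_batch` is a simple single-attribute stand-in (sort, cut into groups of at least k, publish each group as an interval), not the SKA algorithm, and all names as well as the handling of a final batch smaller than k are hypothetical.

```python
from itertools import islice


def batches(stream, alpha):
    """Yield consecutive batches of up to alpha records from a stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, alpha))
        if not batch:
            return
        yield batch


def anonymize_batch(values, k):
    """Stand-in anonymizer (NOT SKA): returns (low, high, count) per group."""
    if len(values) < k:
        return []  # assumption: too few records to publish, so suppress them
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups[-1]) < k:
        # Merge an undersized tail group into the previous one.
        tail = groups.pop()
        groups[-1].extend(tail)
    return [(g[0], g[-1], len(g)) for g in groups]


def anonymize_stream(stream, alpha, k):
    """Anonymize each batch as soon as alpha records are available."""
    for batch in batches(stream, alpha):
        yield anonymize_batch(batch, k)


# Hypothetical stream of ages, with alpha = 4 and k = 2.
ages = [34, 21, 45, 23, 39, 52, 28, 31, 47, 25]
published = list(anonymize_stream(ages, alpha=4, k=2))
for release in published:
    print(release)
```

A smaller \alpha releases data sooner but gives the anonymizer fewer records to group, so intervals tend to be wider. A larger \alpha trades delay for lower information loss.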
In the literature, k-anonymity is among the most popular data anonymization approaches and can also be used for large data. However, applying k-anonymization to a variety of data (especially unstructured data) in the traditional way is difficult, because it requires the data to be classified into personal data, quasi-identifiers, and sensitive data. We identify existing approaches from the Natural Language Processing (NLP) literature for converting unstructured data into a structured form, so that k-anonymization can be applied to the resulting structured records. We adopt a two-phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data in structured form. Further, we propose an Improved Scalable k-Anonymization (ImSKA) to anonymize the resulting structured representation, which achieves privacy preserving publishing of unstructured big data. We compare both proposed approaches, NER and ImSKA, with existing approaches, and the results show that our solutions outperform them in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since NER approaches are widely used for biomedical datasets, we also use a well-known Bio-NER dataset, the GENIA corpus, to measure performance.
In this work, we propose scalable anonymization for the 3Vs (Volume, Velocity, and Variety) of big data. The proposed approaches show significant improvement over the existing approaches. Privacy preserving big data publishing deals with the problem of publishing big data online while preserving the privacy of the data owners. Our proposed scalable anonymization approaches are well suited to such applications.