DATA ANALYTIC AND ARCHIVAL FRAMEWORK FOR BYTE STREAM FILE SYSTEMS

Recent advancements in sensors and other data acquisition systems have resulted in a tremendous surge in the amount of data generated. This influx of data presents both opportunities and challenges for organizations. To harness the full potential of this data and derive meaningful insights, robust data management practices are essential. Data management plays a vital role in connecting data acquisition components to data modeling components, encompassing aspects such as data integration, quality assurance, access control, and, crucially, storage and archiving. This research focuses on developing frameworks for efficiently handling large volumes of data stored in file systems, particularly raw or byte stream files.

DATA ANALYTIC FRAMEWORK USING CONTENT-BASED INDEXING

In this work, we address the challenges in efficiently retrieving relevant data from file systems and propose an approach for streamlined data retrieval using content-based indexing. We take ocean observation systems, where advanced platforms and sensors produce extensive data, as a use case to illustrate the challenges in data accessibility and retrieval. Based on this, we propose ATHARVA, an ocean acoustic data analytic framework. ATHARVA includes two key components: (i) a content-based indexing scheme for fast and efficient data discovery with minimal overhead, and (ii) a GUI-based software tool for interactive information retrieval and visual analysis of data.
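
To make the indexing idea concrete, the following Python sketch builds a small content-based index over raw byte stream files: a single pass extracts lightweight descriptors from each file's content and stores them in a queryable table, so later retrieval consults the index rather than rereading the files. The file format, feature set, and SQLite schema here are illustrative assumptions and do not reflect the actual ATHARVA implementation.

# Minimal illustrative sketch of content-based indexing for byte stream files.
# The 16-bit PCM layout, the descriptors, and the schema are assumptions made
# purely for illustration; they are not the ATHARVA design.
import os
import sqlite3
import numpy as np

def extract_features(path, sample_rate=8000):
    """Read a raw byte stream file and derive simple content descriptors."""
    data = np.fromfile(path, dtype=np.int16)            # assumed 16-bit samples
    if data.size == 0:
        return None
    duration = data.size / sample_rate                   # seconds of recording
    rms = float(np.sqrt(np.mean(data.astype(np.float64) ** 2)))
    peak = int(np.max(np.abs(data)))
    return duration, rms, peak

def build_index(root, db_path="index.db"):
    """Walk the file system once and store per-file content descriptors."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS files
                   (path TEXT PRIMARY KEY, duration REAL, rms REAL, peak INTEGER)""")
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            feats = extract_features(path)
            if feats is not None:
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                            (path, *feats))
    con.commit()
    return con

def query_loud_segments(con, min_rms):
    """Retrieve only the files whose indexed content matches the criterion."""
    return [row[0] for row in
            con.execute("SELECT path FROM files WHERE rms >= ?", (min_rms,))]

Because queries run against the small descriptor table instead of the raw files, data discovery stays fast even as the archive grows; the full files are opened only after the index has narrowed the candidates.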

DATA ARCHIVAL FRAMEWORK USING DEEP NEURAL NETWORKS

This work is devoted to lossless data compression techniques essential for storing voluminous data effectively. In particular, we analyze the possibilities and limitations of neural network-based lossless compression for byte streams. We propose ByteZip, an efficient lossless compressor for structured byte streams built on autoencoders and Gaussian mixture models. ByteZip leverages the byte-level structure of the underlying data to learn patterns and redundancies, which are then exploited to compress the data. The pipeline comprises data preprocessing, hierarchical probability modeling using autoencoders and Gaussian mixture models, and arithmetic coding of the byte stream. Our experimental evaluation shows that ByteZip balances the higher compression ratios achieved by autoregressive neural network models with the practicality of a reasonable compression speed.
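
The following sketch illustrates the coding principle this pipeline rests on: a probability model assigns each byte symbol a probability p, and an ideal arithmetic coder spends about -log2(p) bits on that symbol, so the better the model fits the data, the smaller the output. For brevity, the sketch uses a fixed, hand-set Gaussian mixture over byte values in place of the autoencoder-conditioned hierarchical model described above; the mixture parameters and the synthetic byte stream are illustrative assumptions, not ByteZip's actual model or data.

# Toy sketch: probability modeling of bytes plus the ideal arithmetic-coding
# cost. A fixed Gaussian mixture stands in for the learned, autoencoder-
# conditioned mixture used in a real neural compressor.
import numpy as np

def mixture_byte_pmf(means, stds, weights):
    """Discretize a Gaussian mixture over [0, 255] into a 256-symbol pmf."""
    xs = np.arange(256)
    pdf = np.zeros(256)
    for m, s, w in zip(means, stds, weights):
        pdf += w * np.exp(-0.5 * ((xs - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    pdf += 1e-9                        # keep every symbol codable
    return pdf / pdf.sum()

def ideal_code_length_bits(data, pmf):
    """Bits an arithmetic coder would need if it used this pmf for each byte."""
    return float(-np.log2(pmf[data]).sum())

# Usage: compare the model-based code length to the raw size of a byte stream.
rng = np.random.default_rng(0)
raw = rng.integers(60, 70, size=10_000, dtype=np.uint8)   # stand-in byte stream
pmf = mixture_byte_pmf(means=[64.0], stds=[3.0], weights=[1.0])
bits = ideal_code_length_bits(raw, pmf)
print(f"raw: {raw.size * 8} bits, model-coded: {bits:.0f} bits")

In the full pipeline, the pmf is not fixed: the autoencoder and Gaussian mixture components predict a distribution conditioned on the surrounding bytes, and the arithmetic coder consumes those per-byte distributions, which is where the compression gain over a static model comes from.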