Skyhook

Data Management

SkyhookDM - Tabular data management in object storage. SkyhookDM is now part of the Apache Arrow project.

Please see our Announcements page for latest news (last updated Oct 2021).

Skyhook is an open source project within the Center for Research on Open Source Software at the University of California Santa Cruz. Skyhook represents a "programmable storage" approach that uses Ceph object storage to provide data management functionality directly within the storage layer. Our implementation currently lives within Ceph but is not Ceph-specific: it applies to any object storage with similar extensibility features, such as user-defined object classes and partial read/write of objects.

Goals

The goal of Skyhook is to let users transparently scale their data storage and processing up and down as demands change. Skyhook offloads computation and other data management tasks to the storage layer, reducing the client-side CPU, memory, I/O, and network traffic needed for data processing. It uses Ceph's existing object class mechanism ("cls"), adding custom C++ object classes and methods that allow database operations such as SELECT, PROJECT, and AGGREGATE to be offloaded (i.e., pushed down) to the object storage layer. Data processing tasks execute directly within storage at the object level. Data management tasks such as local indexing and data transformations (e.g., row to column layout) also run there, to support dynamic data management in the cloud.
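To make the pushdown idea concrete, here is a minimal, purely illustrative sketch in Python. It is not Skyhook's implementation, which uses C++ Ceph object classes. The `StorageObject` class, its methods, and the sample rows are all hypothetical stand-ins that simulate a storage layer able to filter, project, and aggregate before returning data.

```python
from dataclasses import dataclass


@dataclass
class StorageObject:
    """Hypothetical stand-in for one object holding a partition of table rows."""
    rows: list  # list of dicts, one dict per row

    def read_all(self):
        # Baseline: no pushdown, every row is shipped to the client.
        return list(self.rows)

    def scan(self, predicate, columns):
        # Pushdown of SELECT (row filter) and PROJECT (column subset):
        # only matching rows and requested columns leave the object.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]

    def partial_sum(self, column, predicate):
        # Pushdown of AGGREGATE: only one number leaves the object.
        return sum(r[column] for r in self.rows if predicate(r))


# Two objects, each holding part of the same (made-up) table.
objects = [
    StorageObject([{"id": 1, "price": 5, "qty": 2}, {"id": 2, "price": 50, "qty": 1}]),
    StorageObject([{"id": 3, "price": 75, "qty": 4}, {"id": 4, "price": 1, "qty": 9}]),
]


def expensive(row):
    return row["price"] > 10


# Client-side processing: fetch everything, then filter and project locally.
fetched = [r for obj in objects for r in obj.read_all()]
rows_shipped_baseline = len(fetched)
client_side = [{"id": r["id"]} for r in fetched if expensive(r)]

# Pushdown: each object filters and projects before returning anything.
pushed = [r for obj in objects for r in obj.scan(expensive, ["id"])]
rows_shipped_pushdown = len(pushed)

# Pushdown aggregate: the client only combines per-object partial sums.
total_qty = sum(obj.partial_sum("qty", expensive) for obj in objects)
```

Both paths produce the same result, but the pushdown path ships fewer rows to the client, which is the resource saving described above.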

Data processing

SkyhookDM natively uses the fast Apache Arrow in-memory serialization format, which is rapidly becoming the standard for data exchange. SkyhookDM uses Arrow's new Dataset API together with the Arrow Compute library, which supports data processing tasks such as selection and projection as well as user-defined functions. We have extended the Arrow Dataset API with a "RADOS fragment" for SkyhookDM, which allows functions to be applied both within storage and on the client.

Contributing

Skyhook is in the development phase, with a rich set of features to work on. To help with the project, please see the Code tab, and take a look at our Announcements page for examples of completed Google Summer of Code projects (2019, 2020), IRIS-HEP projects, talks, and publications.