Skyhook at CROSS symposium
Skyhook and its methodologies were presented on October 4, 2018 during two workshop sessions of the Center for Research on Open Source Software (CROSS) symposium at UC Santa Cruz.
- Instore Compute: Interpreted Functions
- Skyhook: leveraging programmable object storage toward database elasticity
Invited talks on Skyhook
- Huawei analytic database group, Sept. 21, 2018, Sunnyvale.
- Huawei storage group, Aug. 30, 2018, via conference call.
We thank Huawei for the great opportunity to present Skyhook to several of their teams and for the helpful feedback received.
Flatbuffers data layout for Skyhook objects in Ceph
Based upon the results of our flatbuffers layout evaluation, this figure shows the data layout currently used for Skyhook objects in Ceph. Considerations include flatbuffers with a fixed schema, which require recompiling the layout definition for schema changes (slightly faster data access), versus flexbuffers with a dynamically defined schema (slightly slower data access). The layout consists of a sequence of flexbuffer rows (dynamic schema), each containing a flatbuffer (static schema) with the record information, wrapped by a single Skyhook flatbuffer containing the table metadata (static schema) and the row pointers. This enables in-place updates of table metadata and row data, as well as dynamic schema changes to table rows.
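As a rough illustration of this layout idea (not the actual FlatBuffers wire format; the function names, field names, and encoding below are simplified assumptions), the sketch packs each row into an independently encoded blob and keeps a root structure holding the table metadata plus per-row offsets, so an individual row can be located and replaced without re-encoding the whole object:

```python
import struct

def encode_row(row):
    # Each value is tagged with its key, mimicking a dynamic
    # (flexbuffer-style) row schema.
    return ";".join(f"{k}={v}" for k, v in row.items()).encode()

def build_object(table_name, schema, rows):
    """Encode rows as independent length-prefixed blobs (stand-in for
    flexbuffer rows) and record their offsets in a root metadata dict
    (stand-in for the wrapping Skyhook flatbuffer)."""
    body = bytearray()
    offsets = []
    for row in rows:
        blob = encode_row(row)
        offsets.append(len(body))
        body += struct.pack("<I", len(blob)) + blob
    meta = {"table": table_name, "schema": schema, "row_offsets": offsets}
    return meta, bytes(body)

def read_row(meta, body, i):
    # Jump straight to row i via its recorded offset; no need to scan
    # or decode any other row.
    off = meta["row_offsets"][i]
    (n,) = struct.unpack_from("<I", body, off)
    blob = body[off + 4 : off + 4 + n]
    return dict(kv.split("=", 1) for kv in blob.decode().split(";"))

meta, body = build_object(
    "lineitem",
    ["orderkey", "comment"],
    [{"orderkey": "1", "comment": "a"}, {"orderkey": "2", "comment": "b"}],
)
print(read_row(meta, body, 1))
```

Because each row blob is self-describing and addressed by offset, a row of the same size can be overwritten in place, and new rows with extra fields can be appended without touching the static metadata schema.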
Flatbuffers layout evaluation
We performed several experiments to consider potential data layout variants with flatbuffers/flexbuffers (by Billy Lai). Flatbuffers is an open-source serialization library that supports a table-like interface; more information can be found at https://google.github.io/flatbuffers/.
Initial comparison with Postgres
We compare the performance of our approach, using Ceph object storage with Skyhook extensions, to a single-node Postgres database. The results show our approach is comparable to single-node Postgres, and that performance improves further as we scale out the number of storage servers.
Leveraging programmable object storage toward database elasticity
Database elasticity is challenging primarily due to the tradeoffs involved when scaling either storage or processing or both. As processing needs change over time, the disruption involved in adding a new processing node often incurs significant downtime or performance impact while the new node joins the system and data is usually redistributed or assigned to the new node.
Server local decisions based on cache knowledge
Skyhook decouples data storage from processing in order to leverage the capabilities of object storage systems, but this introduces other challenges such as optimizer planning, data locality, and batching. Because the database application no longer has full knowledge of storage state, some of the typical optimizations are not possible once storage is decoupled from database processing. Our work on Ceph's programmability shows that some of these effects can be overcome by delegating certain optimizations to the storage system based on its runtime state.
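As a toy sketch of delegating work to storage (purely illustrative; the class and method names below are assumptions, not Skyhook's actual Ceph object-class interface), each storage node evaluates a filter predicate locally and ships back only matching rows, so the client never receives data it would discard:

```python
class StorageNode:
    """Toy stand-in for a Ceph OSD holding part of a table."""

    def __init__(self, rows):
        self.rows = rows  # the rows stored on this node

    def read_filtered(self, predicate):
        # Server-side evaluation: only matching rows cross the network.
        return [r for r in self.rows if predicate(r)]

def range_query(nodes, lo, hi):
    # The client fans the predicate out to every storage node and
    # merges the (already filtered) results.
    pred = lambda r: lo <= r["orderkey"] <= hi
    matches = []
    for node in nodes:
        matches.extend(node.read_filtered(pred))
    return matches

# Three nodes, each holding four rows of a partitioned table.
nodes = [StorageNode([{"orderkey": i} for i in range(s, s + 4)])
         for s in (0, 4, 8)]
print(range_query(nodes, 3, 6))
```

In this model the storage node can also consult its own runtime state (e.g., what it has cached) when deciding how to evaluate the request, which is the kind of server-local decision the paragraph above describes.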
We have tested our approach by scaling out the number of storage nodes to leverage the additional processing and indexing capabilities we have added. In this experiment, we execute range, point, and index queries with 1--16 Ceph storage nodes (OSDs) to test the performance of the data processing as we scale out.