Not HDF5?

Primary advantages

widely used,
portable to different systems,
stores metadata with the data,
format and tools are open source,
simple organization as datasets (rectangular numerical arrays) within groups,
scalable for big data,
flexible endianness,
POSIX-like hierarchy of datasets,
supports lossless data compression.

Salient disadvantages

opaque internal structure,
not well adapted for collaboration or parallel use,
risk of data corruption or even data loss,
inaccessible to standard tools of Windows or Unix,
complex specification and implementation,
buggy library and tools,
assumes real or integer base values, with no provision for handling uncertainty, tracking, or even units except as metadata, which means each "number of the future" would potentially have to be a full HDF5 dataset.

Wikipedia article https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Introductory video https://www.youtube.com/watch?v=S74Kc8QYDac

Critique https://cyrille.rossant.net/moving-away-hdf5/#:~:text=Corruption%20may%20happen%20if%20your,what%20HDF5%20is%20designed%20for.

Discussion

Scott: Thanks, Marco, for the suggestion about HDF5. Of course you are right that we need to be very aware of what's already out there. See the list of advantages and disadvantages above. I am really just guessing about the 7th disadvantage. Maybe I am totally wrong.

Marco: Thanks for sharing the disadvantages. They are much harder to find than the advantages. About the 7th: I presume it boils down to how one structures the metadata. One could build the metadata in such a way that every "number of the future" in the HDF5 file is linked to its provenance. Of course this process of linking metadata and datasets is error-prone and tedious, so it should be automated.

Scott: Would you know how to store an interval, pdf, p-box or fuzzy number in HDF5?

Marco: This is a central question. I am going to attempt an answer, but it is likely that I will not make any sense. I will give it a go anyway:

I would store an interval as a pair of floats (doubles). A vector of n intervals as an (n,2)-array. A matrix of n x m intervals as a (n,m,2)-array, and so on.
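
A minimal sketch of the array layout described above, assuming NumPy as the in-memory representation. The column order (lower bound first, upper bound second) and the helper names `is_valid` and `add` are illustrative assumptions, not part of any agreed specification.

```python
import numpy as np

# A single interval [lo, hi] stored as a pair of doubles.
x = np.array([1.0, 2.0])

# A vector of n intervals stored as an (n, 2) array:
# column 0 holds lower bounds, column 1 holds upper bounds (assumed order).
v = np.array([[1.0, 2.0],
              [0.5, 0.7],
              [-3.0, 4.0]])

# An n x m matrix of intervals stored as an (n, m, 2) array.
m = np.zeros((2, 3, 2))
m[..., 0] = -1.0  # all lower bounds
m[..., 1] = 1.0   # all upper bounds


def is_valid(a):
    """Check that every lower bound is <= its upper bound."""
    return bool(np.all(a[..., 0] <= a[..., 1]))


def add(a, b):
    """Interval addition: lower bounds add and upper bounds add."""
    return a + b


shifted = add(v, x)  # broadcasts the single interval over the vector
```

Keeping the bounds in the last dimension means the same validity check and the same addition work unchanged for a single interval, a vector, or a matrix of intervals.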

A fuzzy number and a p-box would have an equivalent structure if nonparametric: a (steps, 3)-array, where the last dimension stores the intervals (u, d for the p-box) and probability (or membership) values.
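
As a rough illustration of the (steps, 3) layout, the sketch below builds a toy nonparametric p-box with NumPy. The column order (left bound, right bound, probability level), the made-up bounds, and the lookup helper `quantile_interval` are all assumptions chosen for the example.

```python
import numpy as np

steps = 5
level = np.linspace(0.1, 0.9, steps)  # probability levels
left = level                           # toy left quantile bounds
right = level + 0.5                    # toy right quantile bounds

# One row per discretization step: (left, right, level).
pbox = np.column_stack([left, right, level])


def quantile_interval(pb, prob):
    """Return the stored [left, right] interval at the level nearest to prob."""
    i = int(np.argmin(np.abs(pb[:, 2] - prob)))
    return float(pb[i, 0]), float(pb[i, 1])


median_lo, median_hi = quantile_interval(pbox, 0.5)
```

The same array shape would hold a discretized fuzzy number if the third column is read as a membership value and each row as the corresponding cut.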

I don’t know yet what the best choice would be for a p-box defined by only moments and range: would we store only the moments and range and let the software do the rest? Similarly for a parametric p-box: do we just store the ‘family’, ‘parameters’ and number of discretization steps and then let the software build it?

This would be most efficient but inconvenient to users who do not know how to operate the PBA-like software.

A nonparametric pdf would be stored as a (steps, 2)-array, where the last dimension contains the p- and x-values. A parametric pdf would be stored using the few parameters that are needed to describe it, the number of steps and a string identifying the family. Again auxiliary software will be needed to build the pdf out of it.
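
The parametric case could look something like the sketch below, which uses only the Python standard library. The record keys (`family`, `parameters`, `steps`), the choice of the normal family, and the reading of "p- and x-values" as cumulative probability levels paired with quantiles are assumptions made for illustration; `build_table` stands in for the auxiliary software.

```python
import json
from statistics import NormalDist

# Hypothetical stored record: family name, parameters, discretization steps.
record = json.loads(
    '{"family": "normal", "parameters": {"mu": 0.0, "sigma": 1.0}, "steps": 5}'
)


def build_table(rec):
    """Expand a parametric record into a list of (p, x) pairs."""
    if rec["family"] != "normal":
        raise ValueError("only the normal family is sketched here")
    dist = NormalDist(rec["parameters"]["mu"], rec["parameters"]["sigma"])
    n = rec["steps"]
    # Interior levels i/(n+1) avoid the infinite quantiles at p = 0 and p = 1.
    levels = [i / (n + 1) for i in range(1, n + 1)]
    return [(p, dist.inv_cdf(p)) for p in levels]


table = build_table(record)  # the (steps, 2) table rebuilt from a few stored values
```

The stored record is a few dozen bytes regardless of `steps`, which is the efficiency gain mentioned above; the cost is that a reader without `build_table` or its equivalent sees only the parameters.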

Scott: Yeah, of course, all perfectly logical. But, as I understand it, I don't think your specifications are addressing how HDF5 is structured. Or, rather, you're leaving out the need to group files. I guess we'll need to just try to do it and see what we end up with.

Scott: But the opacity issue alone is rather worrisome too. Thoughts? I was assuming everything would be JSON.

Marco: It is. We need to do something about them. Unfortunately, JSON does not handle large arrays of floats and compression is not possible.

Scott: "It is"? What do you mean? I think that HDF5 is not in JSON. Isn't that right? Maybe your sentence "It is" simply meant that you were agreeing with me that the opacity is worrisome.

In any case, I should think it would be possible to create rectangular or at least regular arrays of naked integer or floating-point values inside a data set conforming to our specs--let's call it DAV format as a working name--if not those of HDF5. Those bits could be compressible, and with appropriate checksums and such, they could be elements within our DAV structure like images are part of larger documents. Indeed, I would expect images to be stored in exactly this way...textual stuff and a possibly large rectangular array of naked encoding numbers.
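
One way this idea might be sketched, using only the Python standard library: a readable JSON header followed by a compressed block of raw doubles, with a checksum recorded in the header. The byte layout, the header keys, and the function names `pack` and `unpack` are hypothetical and do not represent a proposed DAV specification.

```python
import hashlib
import json
import struct
import zlib


def pack(values, shape, meta):
    """Serialize a rectangular array of doubles as a JSON header plus a compressed payload."""
    raw = struct.pack(f"<{len(values)}d", *values)  # little-endian doubles
    payload = zlib.compress(raw)
    header = dict(
        meta,
        dtype="<f8",
        shape=list(shape),
        compression="zlib",
        sha256=hashlib.sha256(raw).hexdigest(),  # checksum of the uncompressed data
    )
    head = json.dumps(header).encode("utf-8")
    # Layout: 4-byte header length, then the JSON text, then the binary block.
    return struct.pack("<I", len(head)) + head + payload


def unpack(blob):
    """Read the header, decompress the payload, and verify the checksum."""
    (n,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + n].decode("utf-8"))
    raw = zlib.decompress(blob[4 + n:])
    if hashlib.sha256(raw).hexdigest() != header["sha256"]:
        raise ValueError("checksum mismatch")
    values = list(struct.unpack(f"<{len(raw) // 8}d", raw))
    return header, values


# Two intervals stored as a (2, 2) array, flattened row by row.
blob = pack([1.0, 2.0, 0.5, 0.7], (2, 2), {"kind": "interval", "units": "m"})
header, values = unpack(blob)
```

The header stays inspectable with ordinary text tools, while the numeric block is compressed and checked on read, much as an image is embedded in a larger document.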
