Backing Store

DataNav currently employs a simple file system-based strategy for storing portal content. Portal structural and state information is stored in JavaScript Object Notation (JSON) files; archived figures and hub navigation view template figures are persisted as conventional FypML figure files; and hub data sets and "metadata" for both data and figure archives are saved in custom-formatted binary files. All files specific to a particular archive are kept in a subdirectory assigned to that archive. The entire collection of files and per-archive directories is known as the archive vault backing store. Every DataNav portal has a vault backing store residing on the server's file system, along with some additional metadata files that persist the registered users list and the vault modification information. Furthermore, DataNav Builder uses the same backing store infrastructure to manage the user's private collection of archives within the DataNav workspace on the host machine's file system.

Backing store directory structure

The directory listing excerpt below illustrates the layout of an archive vault backing store. Let $STORE represent the backing store's root directory on the file system.

$STORE/

   contents.json

   store.lock

   transaction.txt

   hub_43891668/

      hub.dnc

      data.dhr

      view_551897439.fyp

      view_216998235.fyp

      ...

   figarch_378122956/

      figarch.dnf

      fig_20058230.fyp

      ...

Observe that there is a subdirectory for each data or figure archive defined on the portal. The subdirectory name has the form hub_{UID} or figarch_{UID}, where the prefix indicates the archive type and {UID} is a strictly positive 32-bit integer that uniquely identifies the archive. All backing store files specific to a given archive are kept in its archive directory. There are also three mandatory files in the backing store's root directory:

contents.json - The archive vault contents file. This file persists the list of archives defined on the portal. It contains a single JSON object with three fields: entity is a string identifying the file as a vault contents file; version is an integer specifying the file version number, for future migration purposes (current version = 4); and archives is a JSON array holding the vault's archive list. It is a mixed array of integers and strings, [uid1, type1, pub1, ..., uidN, typeN, pubN], where each triplet {uid, type, pub} gives the unique integer identifier (UID), archive type ("data" or "figure"), and private/public flag (zero = private, non-zero = public) for one archive in the vault. The order of the archives within the list is the order in which they are presented in DataNav Builder or Viewer. A minimal example appears below.

Each archive subdirectory represents the backing store for a single archive. For data hubs, this directory contains the hub content repository file (hub.dnc), the data repository file (data.dhr), and one view template file for each navigation view defined on the hub. Each template file view_{VUID}.fyp is a standard DataNav FypML file defining the figure template for the navigation view identified by the unique integer identifier {VUID}. These figure templates may be generated in Figure Composer and then attached to the hub view via DataNav Builder, or they may be designed using a Figure Composer-like component within DataNav Builder itself.
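
For concreteness, here is a minimal example of a contents.json file for a vault containing one public data hub and one private figure archive (the UIDs match the directory listing above; the exact entity string shown is illustrative, as the specification only requires that it identify the file as a vault contents file):

   {
      "entity" : "vault contents",
      "version" : 4,
      "archives" : [ 43891668, "data", 1, 378122956, "figure", 0 ]
   }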

The hub contents file persists all hub content except the raw data sets and the view templates. It is a binary file formatted for random access, to minimize the file I/O required to put, get, or remove information. Listed below are the different kinds of structural and metadata content currently persisted in the file:

Hub data sets referenced in the navigation view instance list are stored -- indexed by their DUIDs for fast retrieval -- in the data repository file, a custom-formatted binary file that can grow to accommodate an indefinite number of sets. The file format optimizes the speed with which raw data can be added to, retrieved from, and removed from the repository. As sets are added and removed, the file can become fragmented, but it can be compacted to minimize wasted space. Compaction, of course, is a slow operation, and it is performed only as part of some larger, long-running task -- such as a hub upload or export operation.

The backing store for a figure archive contains a single metadata file, plus a standard FypML markup file for each figure in the archive. The metadata file figarch.dnf is to a figure archive what hub.dnc is to a data hub. It stores the archive's title, author list, and HTML-formatted description, as well as the ordered list of UIDs identifying the figures comprising the archive and a legend (title plus description) for each figure. The name of each archived figure file takes the form fig_{FUID}.fyp, where {FUID} is the figure's UID.

What follows are detailed specifications of the binary file formats for figarch.dnf, hub.dnc, and data.dhr, for the benefit of interested developers and portal administrators. Other readers may prefer to skip these sections entirely!

File format description: hub.dnc

The hub contents file begins with an 8-byte header, followed by a 2KB index block. These two sections are present even if no information is stored in the file. This wastes some space in a near-empty file, but it dramatically improves random-access retrieval, revision, and removal of any content stored therein. The first allocated block in the file immediately follows the index block.

The header serves to identify the file as a DataNav hub contents repository. It is structured as follows:

The index block lists the type, size, and location of each allocated file block. One 2KB index block has room to store information on 170 file blocks. As the hub content grows and changes, additional index blocks are appended to the file as necessary, and the index blocks are joined in "linked list" fashion, with each index block including a "pointer" (byte offset from start of file) to the next block in the list. Each index block has an 8-byte header and 170 12-byte allocation slots. The header contains two 32-bit integers: the first is the file offset to the next index block (0 if there is none), and the second is reserved (0). Each allocation slot is structured as follows:

All hub content is stored in file blocks that are appended to the file as needed, with their size, location, and type recorded in the next available allocation slot within the linked list of index blocks in the file. As content changes, it may outgrow the block in which it is stored; in this case, a larger block is allocated to hold the content, and the old block is marked as unoccupied in the index, so that it can be reused when new content is added to the file. If content is removed from the file (e.g., when deleting a navigation view), the blocks in which that content was stored are again marked as unoccupied in the index.

When a block is marked as unoccupied, the index is checked to see whether the block physically before or after it is also unoccupied. If so, the two blocks are coalesced into one, freeing up an allocation slot in the index. Furthermore, if the just-unoccupied block is at EOF, then the file can be truncated to reclaim that space -- again freeing up an allocation slot in the index. To coalesce or truncate unoccupied blocks as quickly as possible, all index blocks in the content file are parsed when it is loaded, and the index is cached in memory.
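
For illustration, here is a sketch in Java of that load-and-cache step. The slot layout shown (three 32-bit integers: block type, block size, and file offset) and the treatment of unallocated slots as zero-filled are assumptions consistent with the index description above, not the exact specification:

   import java.io.IOException;
   import java.io.RandomAccessFile;
   import java.util.ArrayList;
   import java.util.List;

   class HubIndexSketch {
      // One allocation slot: block type, block size, offset from start of file.
      record BlockEntry(int type, int size, int offset) {}

      // Walk the linked list of index blocks and cache all allocation slots.
      static List<BlockEntry> loadIndex(RandomAccessFile raf) throws IOException {
         List<BlockEntry> index = new ArrayList<>();
         long indexOfs = 8;                        // first index block follows the 8-byte header
         while(indexOfs != 0) {
            raf.seek(indexOfs);
            int next = raf.readInt();              // file offset to next index block (0 if none)
            raf.readInt();                         // reserved field
            for(int i = 0; i < 170; i++) {         // 170 12-byte allocation slots per block
               int type = raf.readInt();
               int size = raf.readInt();
               int offset = raf.readInt();
               if(offset != 0) index.add(new BlockEntry(type, size, offset));  // skip unallocated slots
            }
            indexOfs = next;
         }
         return index;
      }
   }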

Because of the linked-list index block structure, and because adjacent unoccupied blocks are coalesced whenever possible, the order of the allocation slots in the index will NOT generally reflect the physical order of the allocated blocks in the file. Given a block at offset D with size S, the block after it will have offset D+S, while the block before it will have offset F and size K such that F+K = D. Additional index blocks will introduce "holes" in the file, effectively splitting it into different allocation sections. Blocks in different sections cannot be coalesced, since they can never be physically adjacent. However, if the number of allocated blocks decreases such that an index block can be reclaimed, then the index is consolidated over the remaining index blocks, and the reclaimed index block is marked as unoccupied. This is likely to be a very rare event: A typical hub is unlikely to need more than the 170 different content blocks that can be recorded in the single index block at the head of the file.
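
Using the cached index from the previous sketch, finding a coalescing candidate for a just-freed block is a simple scan over the in-memory entries. The freeType argument marking unoccupied blocks is an assumed convention:

   // Find an unoccupied block physically adjacent to the block at offset d, size s.
   static BlockEntry findFreeNeighbor(List<BlockEntry> index, int d, int s, int freeType) {
      for(BlockEntry e : index) {
         if(e.type() != freeType) continue;        // only unoccupied blocks can be coalesced
         if(e.offset() == d + s) return e;         // free block immediately after: offset D+S
         if(e.offset() + e.size() == d) return e;  // free block immediately before: F+K = D
      }
      return null;                                 // no physically adjacent free block
   }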

The content and structure of an occupied file block varies with the block type:

File format description: data.dhr

The hub data repository is a single binary file in which every data set in a DataNav hub is stored. It is formatted for random access to minimize the amount of file I/O required to add an individual data set to the file or retrieve one from it. In addition to storing hub data, the repository also maintains a "data-to-view lookup table" persisting associations between data sets and the hub navigation views in which those sets are displayed.

A hashing scheme is crucial to the data repository file design. A 32-bit integer hash code is computed for every data set added to the file, and this code is XOR-folded into a 12-bit "bucket" index. Assuming the hash is a good one, the idea is to disperse stored data sets uniformly among 4096 buckets. The implementation allows for up to 65536 data sets per bucket, so the file may theoretically store over 268 million data sets all told. We anticipate that a given DataNav hub will contain FAR fewer sets than this!

Each data set added to the repository is assigned a unique 32-bit integer ID, or "DUID". This DUID uniquely identifies the data set within the hub and, more importantly, provides the means to quickly retrieve the stored data from the repository. The data set's bucket index forms the low 12 bits of the DUID, while its index within the bucket forms the next 16 bits (the upper 4 bits are always 0).
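
In code, the hash fold and the DUID packing might look like the sketch below. The exact XOR-folding function DataNav uses is not spelled out above, so the fold shown is illustrative:

   // Fold a 32-bit hash code into a 12-bit bucket index (illustrative fold).
   static int bucketIndex(int hash) {
      return (hash ^ (hash >>> 12) ^ (hash >>> 24)) & 0xFFF;
   }

   // Pack bucket index (low 12 bits) and index-within-bucket (next 16 bits)
   // into a DUID; the upper 4 bits are always zero.
   static int makeDUID(int bucket, int slot) {
      return ((slot & 0xFFFF) << 12) | (bucket & 0xFFF);
   }

   static int bucketOf(int duid) { return duid & 0xFFF; }          // lower 12 bits
   static int slotOf(int duid) { return (duid >>> 12) & 0xFFFF; }  // next 16 bits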

For each non-empty bucket there are one or more file blocks called "bucket blocks" that store three pieces of information for each data set in that bucket: the set's 32-bit hash code, an "associated views" field that indicates the hub navigation views in which the data set appears, and the 8-byte file offset to the file block in which the actual data is stored. Storing the hash code provides a means of quickly detecting whether or not a new data set being added to the file is actually an exact duplicate of an existing set. Given the manner in which hub views are "populated" with instance data (the same data set could appear in multiple instances), avoiding duplicate data sets is fundamental to the implementation of the data repository.

The "associated views" field is a bit field of length 64 that makes it possible to associate each data set with up to 64 distinct hub navigation views. A fixed 64x4-byte "view table" near the beginning of the repository file stores the UID of the view corresponding to each bit flag in this field. There were several design considerations that led to this scheme:

The data repository file begins with a 16-byte header, which includes the file tag and some integer fields:

After the header comes the 64x4-byte "view table". Each 4-byte integer represents the UID of an existing hub navigation view; unoccupied entries in the table are set to 0, which is not a valid VUID. Index position in the table corresponds to the bit position in the 64-bit field in each data set's bucket slot. The table and the per-set "associated views" bit field allow one to attach each data set to any subset of 64 distinct views. An association is formed by storing the VUID in the view table (if it's not already there), then setting the bit at the corresponding position in the data set's bit field.
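
A sketch of forming such an association, assuming the 64-entry view table has been read into an int array and the data set's bit field is held in a long (method and variable names are illustrative):

   // Associate a data set with view VUID: find (or claim) the view's entry in
   // the view table, then set the corresponding bit in the set's bit field.
   static long associateView(int[] viewTable, long viewsField, int vuid) {
      int idx = -1;
      for(int i = 0; i < 64; i++) {
         if(viewTable[i] == vuid) { idx = i; break; }   // view already in table
         if(viewTable[i] == 0 && idx < 0) idx = i;      // remember first unoccupied entry
      }
      if(idx < 0) throw new IllegalStateException("view table is full");
      viewTable[idx] = vuid;                            // claim the entry (no-op if already present)
      return viewsField | (1L << idx);                  // set bit at the corresponding position
   }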

Immediately following the view table is the 4096x8-byte "bucket table", which is always present and fixed in size. It is essentially a list of file pointers -- more precisely, 64-bit integers specifying the byte offset from the beginning of the file -- to the start of the first bucket block for each data bucket. When the file is first created and contains no data sets, all of the buckets are empty (no bucket blocks exist) and all of the bucket table pointers are 0 to indicate this fact. When the first data set is added to a bucket, an initial bucket block is appended to the file and its offset is entered into the corresponding position in the bucket table. Observe that, once you know the bucket index N = 0..4095 for a data set (that is to be added or retrieved), you can locate the offset to the bucket file block directly by reading the 8-byte integer at file offset D = 264 + N*8.
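
In Java, the direct lookup might read as follows (a sketch; RandomAccessFile reads big-endian values, and the file's actual byte order is not specified above):

   // Read the bucket table entry for bucket N; a value of 0 means the bucket is empty.
   static long firstBucketBlockOffset(RandomAccessFile raf, int n) throws IOException {
      raf.seek(264L + n * 8L);   // D = 264 + N*8, per the formula above
      return raf.readLong();
   }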

Bucket blocks store the 32-bit hash code, the 64-bit "associated views" field, and an 8-byte file pointer for each data set that is stored in the file and is a member of that bucket. As mentioned above, the hash code is stored so that the repository can verify that a data set to be added to the bucket is not an exact duplicate of an existing set. If the new data set's hash code does not match any of the hash codes currently in the bucket, then it is safe to conclude that it's not a duplicate. If there is a match, then it is necessary to load the existing data set and compare it for equality with the data set being added.

The implementation permits up to 2^16 data sets per bucket -- indexed from 0 to 65535. However, in practical usage, it is expected that any given bucket will contain far fewer than the maximum. Thus, when the first data set is added to a bucket, the initial bucket file block is allocated to hold the hash code, bit field, and file pointer for up to 32 data sets. When the 33rd data set is added to the bucket, a second bucket block is allocated and appended at the end of the file, and its file offset is stored in the header of the preceding bucket block -- resulting in a linked list of bucket blocks. The advantage of this design is that we do not waste file space (if there were a single bucket block per bucket, we would have to reallocate a larger block and copy the old block's content over -- leaving the old block abandoned), at the expense of more file seeks as we traverse the linked list to reach a particular slot in the bucket.

Each bucket block begins with the 8-byte file offset to the next block in the bucket; this is always 0 for the last block in the bucket block list. The "next block offset" is followed immediately by 32 20-byte slots. The first 4 bytes of a slot store the data set's hash code, the next 8 hold the "associated views" bit field, and the remaining 8 bytes hold the file offset to the start of the file "data block" in which the actual data set is stored. If a slot is unused, all of its bytes are set to zero. If the repository file has never been compacted and the data bucket currently contains N sets, then the bucket will have a linked list of ceil(N/32) bucket blocks, and only the last block will have unused slots, all of which appear at the tail end of the block (no "holes"). Compaction removes data sets from the file, so it can introduce unused slots anywhere within a bucket block list.

Given this file structure, we can summarize the work involved in adding a data set S to the repository:
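
A minimal sketch of the add path, under the layout assumptions noted earlier (the hash function, the fold, the duplicate comparison, and the data block format are elided or illustrative):

   // Add data set S (raw bytes) to the repository; returns the assigned DUID,
   // or the DUID of an existing exact duplicate.
   static int addDataSet(RandomAccessFile raf, byte[] rawData, long viewsField) throws IOException {
      int hash = java.util.Arrays.hashCode(rawData);    // illustrative hash of the set's content
      int bucket = bucketIndex(hash);                   // XOR-fold to 12-bit bucket index

      // Locate the first bucket block, creating it if the bucket is empty.
      raf.seek(264L + bucket * 8L);
      long blockOfs = raf.readLong();
      if(blockOfs == 0) {
         blockOfs = appendZeroedBlock(raf, 8 + 32*20);
         raf.seek(264L + bucket * 8L);
         raf.writeLong(blockOfs);
      }

      // Scan every slot in the bucket: reject exact duplicates (matching hash
      // AND equal content) and remember the first unused slot.
      int freeSlot = -1; long freeSlotOfs = -1;
      int idx = 0;
      while(blockOfs != 0) {
         raf.seek(blockOfs);
         long next = raf.readLong();                    // offset of next bucket block (0 if last)
         for(int i = 0; i < 32; i++, idx++) {
            long slotOfs = blockOfs + 8 + i*20;
            raf.seek(slotOfs);
            int h = raf.readInt();
            raf.readLong();                             // "associated views" field (unused here)
            long dataOfs = raf.readLong();
            if(dataOfs == 0) {                          // unused slot (all bytes zero)
               if(freeSlot < 0) { freeSlot = idx; freeSlotOfs = slotOfs; }
            }
            else if(h == hash && isSameData(raf, dataOfs, rawData))
               return makeDUID(bucket, idx);            // duplicate: reuse existing DUID
         }
         if(next == 0 && freeSlot < 0) {                // bucket full: chain a new block
            next = appendZeroedBlock(raf, 8 + 32*20);
            raf.seek(blockOfs);
            raf.writeLong(next);
         }
         blockOfs = next;
      }

      // Append the data set block at EOF, then fill in the bucket slot.
      long dataOfs = raf.length();
      raf.seek(dataOfs);
      raf.write(rawData);                               // stand-in for the real data block format
      raf.seek(freeSlotOfs);
      raf.writeInt(hash);
      raf.writeLong(viewsField);
      raf.writeLong(dataOfs);
      return makeDUID(bucket, freeSlot);
   }

   static long appendZeroedBlock(RandomAccessFile raf, int size) throws IOException {
      long ofs = raf.length();                          // file size = new block's offset
      raf.seek(ofs);
      raf.write(new byte[size]);
      return ofs;
   }

   static boolean isSameData(RandomAccessFile raf, long ofs, byte[] raw) throws IOException {
      return false;  // placeholder: load the stored set at ofs and compare for equality
   }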

Retrieving a data set from the repository is easy. Masking out all but the lower 12 bits of the DUID yields the bucket index B. Read the 8-byte offset at file location 264 + 8*B to get the offset D to the first block in the bucket block list. The data set's slot index P within that bucket is given by (DUID >> 12) & 0xFFFF. We traverse the linked list of bucket blocks to find the P-th bucket slot and read the file offset to the data block. Once we have the data block offset, it is simply a matter of reading in and parsing that block.
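
The same recipe in Java (again a sketch, with byte order and helper names assumed):

   // Return the file offset of the data block for the data set with this DUID.
   static long findDataBlock(RandomAccessFile raf, int duid) throws IOException {
      int b = duid & 0xFFF;                   // bucket index B: lower 12 bits
      int p = (duid >>> 12) & 0xFFFF;         // slot index P within the bucket
      raf.seek(264L + 8L * b);                // bucket table entry for bucket B
      long blockOfs = raf.readLong();
      while(p >= 32) {                        // walk the linked list of bucket blocks
         raf.seek(blockOfs);
         blockOfs = raf.readLong();           // each block starts with the next-block offset
         p -= 32;
      }
      raf.seek(blockOfs + 8 + p*20 + 12);     // skip hash code (4) and views field (8)
      return raf.readLong();                  // offset to the data block
   }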

Observe that bucket blocks and data set blocks will be interspersed with each other as the repository is populated with data. New blocks -- either bucket blocks or data blocks -- are always appended to the end of the file, so the file's size prior to writing will be the new block's file offset.

Data set block format.

NOTE that the data set's ID string is NOT stored. Data sets in a hub are injected into view templates for display, so their string IDs have no significance; it is the set's DUID that matters.

File format description: figarch.dnf

The figure archive metadata repository file is structured for random file access much like the hub contents file, beginning with an 8-byte header, followed by a 2KB index block. These two sections are present even if no information is stored in the file. Their format is the same as for hub.dnc, except that the unique file tag constant for figarch.dnf is 0x464E4440 ("@DNF"), and the current version number is 1.

The header serves to identify the file as a DataNav figure archive metadata file. It is structured as follows:

The content and structure of an occupied file block in figarch.dnf varies with the block type: