Low Cost Re-sizable Array (C/C++)

__/~\_/~\__
»»»»»»»»»»»»»»»          (== | ^ ==)

The problem with re-sizing arrays

An Array is a very convenient data-structure when elements of a set is more likely to be accessed in a random fashion. Its power lies in the fact that it requires constant access time for all elements, disregarding factors beyond programmers control (such a page-faults and cache misses).

The main drawback with Arrays (in its pure form) is that the number of elements is fixed during its creation. To counter this limitation, most programming languages provide libraries / Abstract Data Types (ADT) that mimick Growable / Dynamic Arrays (such as C's realloc function, C++ STL's std::vector). The use of such variants although convenient from a programmers perspective is often associated with a high cost. This is attributed to the expensive memory copy (from original array to the re-sized array) operation performed when expanding or shrinking the array. The use of such alternatives is acceptable for applications where the memory-footprint is low and the penalty is acceptable. Some amount indiscriminate use of memory doesn't have a objectional negative impact on such systems :).

But in systems which deal which a huge number of objects and requires optimal use of memory the conventional Growable array is too expensive. Consequently, most implementations resort to other data-types such as linked lists and such sacrificing on the convenience of arrays.

Here I propose an ADT which resembles Growable array, but does not incur the huge cost of memory copy operations when resized. But the access time is slight more than that of a regular array. To be more specific the access time is that of a an element in a dynamically created 2-Dimentional array. That should give you some hint as to what we are upto in this Re-Sizable Array ;).

Although this ADT is less advantageous than an Array, it gives considerable leverage over other alternative for a Growable arrays.

How do we beat the memory copy operations when an array is resized?

Our objective, is to cut down the cost of memory copy operation incurred when an array is resized. In fact, a considerable reduction in memory copy operations can help bring down this cost. And this is exactly how we are going to gain on performance.

In C / C++ the following evaluations are performed while accessing an element of a dynamically created array. (Since all implementations of Growable array are based on dynamically created array, we use the same as baseline for comparison with our new approach.)

1-Dimensional array : * ( <array-base-address> + <Index 1st-D> )

2-Dimensional array : * ( *( <array-base-address> + <Index 1st-D> ) + <Index 2nd-D> )

The ADT I propose here, represents data in a dynamically created 2-Dimentional array but provides access to it as if it were a 1-Dimensional array. I shall henceforth refer to this ADT as rsArray (Re-Sizable Array in short).

A dynamically created 2-Dimensional array is by definition an array of pointers to arrays. In rsArray, the user has to define two parameters - the Block Count (B) and the Block Size (N). Each array of the 2nd dimension has 'N' elements. The 1st dimension initially consists of 'B' pointers to arrays. At any point in time the 1st dimension contains an integral multiple of 'B' pointers to arrays, while the total number of elements is an integral multiple of 'N' and not necessarity N*B. Thus the number of elements in the array grow / shrink in increments of 'N'. The following diagram illustrates the above.

The only different between rsArray and conventional Growable array, is in their behaviour when the array has to grow beyond their current capacity. In the case of rsArray, only elements of the 1st dimension are copied into the new array of pointers to array. Thus all of the blocks with N elements each remain as they are. With an appropriate choice granularity - Block count (B) and Block Size(N), the penalty can be reduced to a negligible amount.

Mapping between 1-D index to corresponding 2-D index

Per the definition of the quantities B (Block count) and N (Blocks size), the following expression illustrated the mapping of a given index i of a rsArray to those of the internal 2-D array created dynamically.

ppElements [ i / N ] [ i % N ]

Here, ppElements refers to the dynamically created 1st dimention (array of pointers to array). Although this mapping seems simple, the division and modulo operation make it quite expensive. This cost will be incurred for access to every element.

To mitigate this cost, we enforce the following constraint - the number of elements in each block will be a power of 2. That is - N = 2(BPB+1), where BPB (Bits Per Block) is the number bit required to represent the index in the 2nd dimension. With this constraint in place the mapping is much more simple, involving a bit-shift and a masking operation as shown below.

ppElements [ i >> BPB ] [ i & ((1<<BPB)-1) ]

Since BPB is a constant (and hence the underlined component as well), the cost is much less in comparison with division and modulo.

All the differences considered, accessing an element in the rsArray incurrs the following penalty over a convention 1-D array.

  1. A bit-shift (evaluation of index of the 1st Dimension),
  2. A mask operation (evaluation of index of the 2nd Dimension),
  3. A Memory dereference (identifying base address of 2nd dimension array) and
  4. An addition (identifying the element in the 2nd dimension array)

The ADT in C++

The following are the pseudo-operations that will be required of an ADT of a Re-Sizable Array of fixed size 'm' [0 .. m-1]

  • Read (i) : Return the element with index i, 0 < i < m
  • Write (i, x) : Sets the element with index i to x, 0 < i < m
  • Grow : Increment m, creating a new element with index m
  • Shrink : Decrement m, discarding the element with index m-1

We will limit our discussion to these operations for clarity. More specialized operations, such as growing / shrinking by 'n' elements etc., can be easily derived from these fundamental operations.

The following is the prototype for the rsArray ADT in C++.

template<typename TYPE, unsigned int B, unsigned char BPB>
class rsArray
{
public:
   rsArray();
  ~rsArray();

  TYPE  operator [](DWORD index) const; // Read (i)
  TYPE& operator [](DWORD index); // Write (i,x)

  DWORD grow(const TYPE&);
  TYPE& grow();

  bool shrink(TYPE&);
  bool shrink();

private:
  TYPE  **ppElements;
  DWORD   dwFreeSpot;
  DWORD   dwMaxBlock;
};

The implementation for the same is available here - rsArray.h.

Conclusion

The rsArray ADT permits resizing an array without the need for expensive memory copy operations. At the same time it provides constant access times close to that of conventional single-dimension arrays. Further, the growth of the rsArray is linear and not exponential like most implementations of dynamic array. These factors make the rsArray ADT a good alternative over conventional Growable array implementations.

The rsArray does come with its own pitfalls as well. The user has to be well aware of these when deciding on the appropriateness of this ADT.

  • It breaks the fundamental property in arrays that consecutive elements in the array are in adjoining memory locations. This is especially true at the boundary of each block.
  • Poorer Locality of Reference than conventional array. This is a direct consequence of the above drawback.
  • Inability to quickly compute the index of an element given a reference to it. This again is due to discontinuous blocks.

Although rsArray has its negatives, these may outweigh the leverage it offers over other alternatives for a system- conventional dynamic array ADT, use of linked-lists etc. to avoid expensive memory copy operations etc.

Considering conventional Growable array implementation, rsArray most certainly makes a lucrative low cost alternative!

Do leave your thoughts on this here - .