Homework 5

Deadline: 1st January 2017.

Submit: by web form

General rules: read them

The description of the problem

Implement two types of hash tables (cuckoo and linear probing) and three systems of hash functions (tabulation, multiply-shift, and a naive modulo). Then study their performance in two cases.

The families of hash functions

Let U = {0, ..., 2^u - 1} be the universe, where u is at least 31, and let m = 2^k be the size of the hash table. All hashing systems map the universe U to the set of table indices {0, ..., m - 1}.

Tabulation. Let x be an element of the universe U. If we treat x as a bit string, we can split it into c parts of equal length (except possibly one part if you choose 31-bit or 63-bit integers instead of 32-bit or 64-bit ones), i.e. x = x1x2...xc. For each part we randomly initialize a tabulation table. Such a table contains exactly 2^(u/c) bit strings of length k. Then h(x) = T1(x1) XOR T2(x2) XOR ... XOR Tc(xc). In other words, each part of the bit string x determines an index into its tabulation table; the entries at these indices are values from the set {0, ..., m - 1}, and the value of the hash function is the XOR of these values. Use a reasonable value of c that balances the time to compute the hash function against the memory consumed by all the tables.
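For illustration, tabulation hashing could be implemented along the lines of the following C++ sketch, which assumes 32-bit keys (u = 32), c = 4 parts of 8 bits each, and a table of size m = 2^k; these constants and the use of std::mt19937 are choices of this sketch, not requirements of the assignment.

    #include <cstdint>
    #include <random>

    // Sketch of tabulation hashing for 32-bit keys split into c = 4 parts
    // of 8 bits each; every entry of the tabulation tables is a random
    // value in {0, ..., m - 1} with m = 2^k.
    struct TabulationHash {
        static const int C = 4;             // number of parts
        static const int PART_BITS = 8;     // bits per part (u / c)
        uint32_t T[C][1 << PART_BITS];      // one tabulation table per part

        TabulationHash(int k, std::mt19937& rng) {
            std::uniform_int_distribution<uint32_t> dist(0, (1u << k) - 1);
            for (int i = 0; i < C; ++i)
                for (int j = 0; j < (1 << PART_BITS); ++j)
                    T[i][j] = dist(rng);
        }

        uint32_t operator()(uint32_t x) const {
            uint32_t h = 0;
            for (int i = 0; i < C; ++i)     // XOR the entries chosen by the parts of x
                h ^= T[i][(x >> (i * PART_BITS)) & 0xFF];
            return h;
        }
    };

A larger c makes the individual tables smaller but requires more lookups per hash value, which is the trade-off mentioned above.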

Multiply-shift function. Let x be an element of the universe U and let a be a random odd number chosen uniformly from U. Then h(x) = ((a * x) mod |U|) / (|U| / m), where the division rounds down. Implement this function efficiently; note that both |U| = 2^u and |U| / m = 2^(u - k) are powers of two.
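A minimal C++ sketch of this function, again assuming 32-bit keys (u = 32): the modulo by |U| = 2^32 is simply the overflow of unsigned 32-bit multiplication, and the division by |U| / m = 2^(32 - k) is a right shift by 32 - k bits.

    #include <cstdint>
    #include <random>

    // Sketch of the multiply-shift function for 32-bit keys:
    // h(x) = ((a * x) mod 2^32) >> (32 - k).
    struct MultiplyShiftHash {
        uint32_t a;   // random odd multiplier
        int shift;    // u - k

        MultiplyShiftHash(int k, std::mt19937& rng)
            : a(uint32_t(rng()) | 1u), shift(32 - k) {}

        uint32_t operator()(uint32_t x) const {
            return uint32_t(a * x) >> shift;   // overflow implements mod 2^32
        }
    };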

The naive modulo function. Let x be an element of the universe U. Then h(x) = x mod m.
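Since m = 2^k is a power of two, the modulo reduces to keeping the k lowest bits, as in this small sketch (again assuming 32-bit keys):

    #include <cstdint>

    // Sketch of the naive modulo function: x mod m == x & (m - 1) for m = 2^k.
    struct ModuloHash {
        uint32_t mask;                      // m - 1
        explicit ModuloHash(int k) : mask((1u << k) - 1) {}
        uint32_t operator()(uint32_t x) const { return x & mask; }
    };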

Random case study

Study how the running time and number of steps depend on the load factor for the following combinations of hash tables and functions.

  • Cuckoo hashing with tabulation

  • Cuckoo hashing with a multiply-shift function

  • Linear probing with tabulation

  • Linear probing with a multiply-shift function

  • Linear probing with the naive modulo function

We study the complexity (running time and number of steps) of the insert operation only. For linear probing, the number of steps is the number of cells accessed before an empty cell is found. For cuckoo hashing, the number of steps is the number of swaps needed to insert the new element, including the swaps performed during any subsequent rehash operations.
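For linear probing, counting the accessed cells might look like the following sketch, which reuses the hash functors above and ignores lookups, deletions, and duplicate keys, since only insertions are measured.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Sketch of a linear-probing table whose insert returns the number of
    // cells accessed before an empty cell was found.
    template <typename Hash>
    struct LinearProbingTable {
        std::vector<std::optional<uint32_t>> cells;
        Hash hash;

        LinearProbingTable(int k, Hash h) : cells(std::size_t(1) << k), hash(h) {}

        long insert(uint32_t x) {
            std::size_t i = hash(x);
            long steps = 1;                           // the cell at h(x) is accessed
            while (cells[i].has_value()) {            // probe until an empty cell
                i = (i + 1) & (cells.size() - 1);     // wrap around; m is a power of two
                ++steps;
            }
            cells[i] = x;
            return steps;
        }
    };

Counting steps for cuckoo hashing is analogous: increment a counter on every swap, including the swaps performed while rehashing.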

Plot two graphs, each having one curve for every combination of a hash table and a hash function. The first graph shows how the running time depends on the load factor and the second graph shows how the number of steps depends on the load factor. For the measurements, consider a table of fixed size at least m = 2^20 and use a random number generator to generate the input sequence of inserted elements.

When an element is inserted, the table has some load factor. The goal is to study how the complexity of insertion depends on this load factor. In order to plot both graphs, average the complexity of insertions performed at similar load factors. For example, you could split the insertion sequence into 100 subsequences and measure the average complexity within each subsequence.
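A possible shape of the measurement loop for one combination, building on the LinearProbingTable and TabulationHash sketches above; only the number of steps is shown here, and the running time can be recorded analogously (e.g. with std::chrono).

    #include <cstddef>
    #include <cstdio>
    #include <random>

    // Sketch: insert random elements, split the insertions into 100
    // subsequences, and print the average number of steps per subsequence
    // together with the load factor reached at its end.
    void measureRandomCase(int k, std::mt19937& rng) {
        const std::size_t m = std::size_t(1) << k;
        const std::size_t n = m * 99 / 100;     // stop just below load factor 1
        const std::size_t bucket = n / 100;

        LinearProbingTable<TabulationHash> table(k, TabulationHash(k, rng));

        long stepsInBucket = 0;
        for (std::size_t i = 0; i < n; ++i) {
            stepsInBucket += table.insert(uint32_t(rng()));   // random element
            if ((i + 1) % bucket == 0) {
                double loadFactor = double(i + 1) / double(m);
                std::printf("%f %f\n", loadFactor, double(stepsInBucket) / double(bucket));
                stepsInBucket = 0;
            }
        }
    }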

As always, the aim is to draw a graph from which the behaviour of the specified data structures can be read, so you can cut off each curve where the measured values become too large. Although it is theoretically possible to fill the table completely with linear probing, you can stop the experiment when the load factor is close to 1. For cuckoo hashing, you can stop when inserting a single element causes too many rehashes.
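For illustration, a cuckoo-hashing insert that counts swaps could look like the sketch below; the variant with two tables, the eviction bound maxLoop, and the give-up behaviour are choices of this sketch that the assignment leaves open.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    // Sketch of cuckoo hashing with two tables of size m = 2^k each and
    // two independent hash functions; swaps accumulates the step count.
    template <typename Hash>
    struct CuckooTable {
        std::vector<std::optional<uint32_t>> t[2];   // the two tables
        Hash h[2];                                   // one hash function per table
        int maxLoop;                                 // eviction bound, an assumed choice (~ log m)
        long swaps = 0;                              // steps counted over all inserts

        CuckooTable(int k, Hash h0, Hash h1) : h{h0, h1}, maxLoop(6 * k) {
            t[0].assign(std::size_t(1) << k, std::nullopt);
            t[1].assign(std::size_t(1) << k, std::nullopt);
        }

        // Returns false if the eviction chain got too long; the caller should
        // then rehash (choose new hash functions, reinsert everything, still
        // counting swaps) and retry, or give up after too many rehashes.
        bool insert(uint32_t x) {
            int side = 0;
            for (int round = 0; round < maxLoop; ++round) {
                std::size_t i = h[side](x);
                if (!t[side][i].has_value()) { t[side][i] = x; return true; }
                std::swap(x, *t[side][i]);   // evict the occupant
                ++swaps;
                side = 1 - side;             // place the evicted element on the other side
            }
            return false;
        }
    };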

Case study of sequential inserts into a table using linear probing

Here, we consider only linear probing with tabulation and multiply-shift, and we measure only the number of steps. The input sequence of inserted elements is 1, 2, 3, ... . The goal is to study how the number of steps needed to insert an element depends on the size of the table m when the table is almost full (the load factor is around 0.9). The sizes of the table m are powers of two. For every size m, run sufficiently many tests. In every test t, first insert all elements smaller than 0.89*m and then measure the average number of steps S_t needed to insert the elements from 0.89*m to 0.91*m (a sketch of one such test follows the list below). Determine the following five statistical values for every size m to obtain five curves for both tabulation and multiply-shift:

  • the smallest value of S_t over all tests t,

  • the largest value of S_t,

  • the average of S_t,

  • the median of S_t, and

  • the last decile of S_t, i.e. a value that is at least as large as 90% of the values S_t and at most as large as the remaining 10% of them.
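For illustration, one test t of this experiment could follow the sketch below, which builds on the LinearProbingTable and MultiplyShiftHash sketches above and returns S_t for one table size m = 2^k.

    #include <cstddef>
    #include <cstdint>
    #include <random>

    // Sketch of a single test t: fill the table with 1, 2, 3, ... up to a
    // load factor of 0.89 and return the average number of steps for the
    // inserts up to load factor 0.91.
    double runSequentialTest(int k, std::mt19937& rng) {
        const std::size_t m = std::size_t(1) << k;
        LinearProbingTable<MultiplyShiftHash> table(k, MultiplyShiftHash(k, rng));

        std::size_t x = 1;
        for (; x < std::size_t(0.89 * double(m)); ++x)   // unmeasured fill-up
            table.insert(uint32_t(x));

        long steps = 0, count = 0;
        for (; x < std::size_t(0.91 * double(m)); ++x) { // measured inserts
            steps += table.insert(uint32_t(x));
            ++count;
        }
        return double(steps) / double(count);            // this is S_t
    }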

Using one or more graphs, find a clear way to plot the resulting ten curves (five statistical values for each of the two hash functions) as functions of the table size m. Run sufficiently many tests to obtain statistically accurate results and choose a reasonable range of table sizes m.

You can use any appropriate tool or library to determine the required statistical values from the measured values.
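If you prefer to stay in C++, the five values can be obtained along these lines (assuming the vector of measured values S_t for one size m is non-empty):

    #include <algorithm>
    #include <vector>

    struct Stats { double min, max, mean, median, decile90; };

    // Sketch: sort the measured values S_t of one table size m and read off
    // the five required statistics.
    Stats summarize(std::vector<double> s) {
        std::sort(s.begin(), s.end());
        double sum = 0;
        for (double v : s) sum += v;
        return {
            s.front(),                      // smallest S_t
            s.back(),                       // largest S_t
            sum / double(s.size()),         // average
            s[s.size() / 2],                // median (upper middle element)
            s[(s.size() * 9) / 10]          // last decile (90th percentile)
        };
    }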