Why?
search in constant time (or close to it) most of the time
achieve scalability when db is huge
Limitation
Only do well with search and associated insert/delete functions
Limited functions, but do it well.
hard to do: findmin, findmax, printAll-order, sort
especially powerful when data is sparse
Approach
use a table -- hash table
map key to table entry -- hash function
ideal hash function
simple, fast, no collision (no collision achieves O(1) search complexity)
unfortunately, given any realistic data set, no collision is an illusion, unrealistic expectation
Collision resolution
linear probing -- next cell (+1, +2, +3, ...)
fixed size hash table is organized as "wrap-around"
table index is mod with hash table size %HTsize
quadratic probing -- next n^2 cell (+1, +4, +9..)
similar approach in ethernet collision retry (that's called exponential back-off)
separate chaining (linked list approach)
double hash -- 2nd hash function to determine next cell
a reasonable 2nd hash function h(x) = R - (x mod R) where R is a prime smaller than table size
Delete an entry
probing (linear or quadratic) will have issue with searches after deletion
what shall we do with the deleted item entry in hash table?
do nothing ... error could happen
mark it ….
each entry has 3 states: empty, occupied, deleted
for insertion operation, deleted is treated like empty
for search operation, deleted is treated like occupied
rehash ... kind of expensive
Hash Table size
prime number
rehash (why?)
(when) table half full
Discussion
the table size needs to be big to reduce collision-induced probing
the hash function needs to be "right" for the expected data to reduce the number of collisions
just for fun, can you devise a hashing scheme that has NO collision or one that has 100% collision
More Discussion
Can a typical database use only hash to store objects? Unlikely... why?
In addition to hash, if we have another data structure (array or link list), how would it work?
Other hash function applications
Hash function itself is a very useful tool for many computer applications. Searching with hash table is one of them. Other applications that uses hash functions are:
integrity validation
parity, checksum, digest
CRC, MD-5, SHA-1
message encryption -- confidentiality
identity authentication
password checking without store actual password
birthday problem
compiler - symbol table