Introduction to Hashing

· klm's blog


Original post is here: eklausmeier.goip.de

In continuation of Simple Exercises for a C Programming Language Course here is a short introduction to hashing ("Streuspeicher"). Original task at hand is: Count how many times a number occurs in a list of numbers? The numbers itself can can vary wildly, while the number of numbers is not usually quite small, say 30,000. That way we cannot just use a simple array and put the numbers as index into the array, as the number values can vastly exceed 30,000. One solution is to use a hash table. This hash table contains the number as key, and the value (of the hash) is the number of occurences. Below simple C program does not actually solve the task at hand, but shows the building block to do that.

Chosing a proper hash function can be a topic of its own. See, for example Hash functions: An empirical comparison -- article by Peter Kankowski. Here we just use the modulus of the number w.r.t. to the maximum size of the hash table. For simplicity, let's take seven slots for our hash table.

Hash lookup is "usually" quite fast, see Hashing Just Random Numbers, i.e., it is O(1). But this can change, if the hash table gets full. The number of occupied slots to free slots in a hash table is called load-factor. If this load-factor approaches one, performance gets worse and worse.

Clearing the hash table is just setting every value to zero.

1#define MAXHASH	7
2
3// Hash table for (key,value). key must be nonzero, otherwise thought to be unused.
4struct { int key, value; } H[MAXHASH];
5
6void hash_init() {
7	int i;
8	for (i=0; i<MAXHASH; ++i) { H[i].key = 0; H[i].value = 0; }
9}

Inserting and lookup using so called "open-addressing". This means, when a slot in the hash table is occupied, we just look at its neighbor. If the neighbor is occupied as well, then look at the neighbor of the neighbor, and so on.

 1int hash_insert(int key, int value) {
 2	int i = key % MAXHASH;
 3	int k = MAXHASH;
 4
 5	while (H[i].key != 0) {
 6		i = (i + 1) % MAXHASH;
 7		if (--k <= 0) {
 8			printf("Hash table is full.\n");
 9			return (-1);
10		}
11	}
12	H[i].key = key;
13	H[i].value = value;
14	return i;
15}
16
17
18int hash_lookup(int key, int *value) {
19	int i = key % MAXHASH;
20	int k = MAXHASH;
21
22	while (H[i].key != key) {
23		i = (i + 1) % MAXHASH;
24		if (--k <= 0) return (-1);
25	}
26	*value = H[i].value;
27	return i;
28}

The main program, using above functions, is now:

 1#include <stdio.h>
 2#include <string.h>
 3
 4int main(int argc, char *argv[]) {
 5	char inp[200], cmd;
 6	int i, key, value, ret;
 7
 8	for (;;) {
 9		fgets(inp,200,stdin);
10		if (strstr(inp,"stop") != NULL) break;
11		key = 0, value = 0;
12		sscanf(inp,"%c %d %d",&cmd,&key,&value);
13		if (cmd == 'i')	// insert into hash
14			printf("ret=%d\n",hash_insert(key,value));
15		else if (cmd == 's') {	// search in hash
16			ret = hash_lookup(key,&value);
17			printf("ret=%d, value=%d\n",ret,value);
18		} else if (cmd == 'p') {	// print entire hash
19			for (i=0; i<MAXHASH; ++i)
20				printf("\t%d\t%d\t%d\n",i,H[i].key,H[i].value);
21		}
22	}
23
24	return 0;
25}

Running the program:

 1i 14 14
 2ret=0
 3i 15 15
 4ret=1
 5i 16 16
 6ret=2
 7s 15
 8ret=1, value=15
 9p
10        0       14      14
11        1       15      15
12        2       16      16
13        3       0       0
14        4       0       0
15        5       0       0
16        6       0       0
17s 21
18ret=-1, value=0
19stop