List Info

Thread: cdb format: hash table slots




cdb format: hash table slots
country flaguser name
United Kingdom
2008-04-11 12:18:42
Hello,

I've been looking recently at the cdb format and come across
a 
discrepancy between the actual output of cdbmake and the
output I 
expected. Hopefully somebody will be able to tell me where I
have gone 
wrong.

If I create a small cdb file with cdbmake, using the
records

+13,3:thomas.mangin->foo
+11,3:markcowgill->bar

so that the hashes of both keys are stored in the same hash
table, I 
notice that this table is 4 slots long.

Parsing the cdb file, I can see that the four slots are

2417049413, pointer to record with key markcowgill,
empty slot
empty slot
8948549, pointer to record with key thomas.mangin

where the value preceeding each pointer is a stored hash of
the 
respective key.

Taken from http://cr.yp.to/cdb/cdb.t
xt,

"""
A record is located as follows. Compute the hash value of
the key in
the record. The hash value modulo 256 is the number of a
hash table.
The hash value divided by 256, modulo the length of that
table, is a
slot number. Probe that slot, the next higher slot, and so
on, until
you find the record or run into an empty slot.
"""

Reading this paragraph, I expected to be able able to find
the entry for 
'markcowgill' by seeking to the starting position of the
hash table + 
(2*4) * n bytes, where n = (hash / 256) % 4.

In this particular case, (hash / 256) % 4 = (2417049413 /
256) % 4 = 3, 
meaning that I skip past the value I'm looking for (jumping
straight to 
'thomas.mangin') and falsely report that the key does not
exist.

I can still retrieve a key by hardcoding the slot position
to 0 or even 
reading each record sequentially but this doesn't seem to be
the best 
way to read quickly and I would appreciate it if somebody
could put me 
ok the right track.

In case it helps, I have written a rough and ready python
cdb 
implementation to check the data and rewrite it in a format
that does 
what I expect. I'm not sure that the write function is
correct but I ran 
it on a fairly large dataset and it gave me the results I
wanted.

Setting self.broken on line 31 to False forces slot number =
0 when 
performing a hash lookup and is the only way I can parse
data generated 
by cdbmake


David

  
Re: cdb format: hash table slots
user name
2008-04-20 00:48:04
On Fri, Apr 11, 2008 at 06:18:42PM +0100, David Farrar
wrote:
> In this particular case, (hash / 256) % 4 = (2417049413
/ 256) % 4 = 3, 
> meaning that I skip past the value I'm looking for
(jumping straight to 
> 'thomas.mangin') and falsely report that the key does
not exist.

What is missing from the above paragraph is that the slots
are
essentially a circular buffer.  So, after probing slot 3,
you should
then probe the next slot, which is slot *0*.  There you find
your hash
key, and the referenced data is what you want.  This is
normally how
linearly probed hash tables work.

-- 
Bruce Guenter <bruceuntroubled.org>                http://untroubled.org/
[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )