Hello,
I've been looking recently at the cdb format and come across
a
discrepancy between the actual output of cdbmake and the
output I
expected. Hopefully somebody will be able to tell me where I
have gone
wrong.
If I create a small cdb file with cdbmake, using the
records
+13,3:thomas.mangin->foo
+11,3:markcowgill->bar
so that the hashes of both keys are stored in the same hash
table, I
notice that this table is 4 slots long.
Parsing the cdb file, I can see that the four slots are
2417049413, pointer to record with key markcowgill,
empty slot
empty slot
8948549, pointer to record with key thomas.mangin
where the value preceeding each pointer is a stored hash of
the
respective key.
Taken from http://cr.yp.to/cdb/cdb.t
xt,
"""
A record is located as follows. Compute the hash value of
the key in
the record. The hash value modulo 256 is the number of a
hash table.
The hash value divided by 256, modulo the length of that
table, is a
slot number. Probe that slot, the next higher slot, and so
on, until
you find the record or run into an empty slot.
"""
Reading this paragraph, I expected to be able able to find
the entry for
'markcowgill' by seeking to the starting position of the
hash table +
(2*4) * n bytes, where n = (hash / 256) % 4.
In this particular case, (hash / 256) % 4 = (2417049413 /
256) % 4 = 3,
meaning that I skip past the value I'm looking for (jumping
straight to
'thomas.mangin') and falsely report that the key does not
exist.
I can still retrieve a key by hardcoding the slot position
to 0 or even
reading each record sequentially but this doesn't seem to be
the best
way to read quickly and I would appreciate it if somebody
could put me
ok the right track.
In case it helps, I have written a rough and ready python
cdb
implementation to check the data and rewrite it in a format
that does
what I expect. I'm not sure that the write function is
correct but I ran
it on a fairly large dataset and it gave me the results I
wanted.
Setting self.broken on line 31 to False forces slot number =
0 when
performing a hash lookup and is the only way I can parse
data generated
by cdbmake
David
|