The crawldb is a serialized Hadoop org.apache.hadoop.io.MapFile. This
structure contains two SequenceFiles, one for the data and one for the
index. This is an excerpt from the javadoc of the MapFile class:
A file-based map from keys to values.
 *
 * <p>A map is a directory containing two files, the <code>data</code> file,
 * containing all keys and values in the map, and a smaller <code>index</code>
 * file, containing a fraction of the keys. The fraction is determined by
 * {@link Writer#getIndexInterval()}.
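To make the data/index split concrete, here is a toy sketch in plain Java. It is not Hadoop's implementation: the class name SparseIndexSketch, its methods and the interval of 4 are all made up for illustration. It keeps every entry in a sorted "data" list and records only every Nth position in a small "index", so a lookup searches the index first and then scans a short run of the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a map file: a sorted "data" list plus a sparse "index". Hypothetical, not Hadoop code. */
final class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();    // "data": all keys, sorted
    private final List<String> values = new ArrayList<>();  // "data": the matching values
    private final List<Integer> index = new ArrayList<>();  // "index": position of every Nth key
    private final int interval;

    SparseIndexSketch(int interval) {
        this.interval = interval;
    }

    /** Appends an entry; keys must arrive in ascending order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys must be appended in sorted order: " + key);
        }
        if (keys.size() % interval == 0) {
            index.add(keys.size()); // only a fraction of the keys is indexed
        }
        keys.add(key);
        values.add(value);
    }

    /** Binary-searches the small index, then scans at most `interval` data entries. */
    String get(String key) {
        int lo = 0, hi = index.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys.get(index.get(mid)).compareTo(key) <= 0) {
                start = index.get(mid); // last indexed key <= target seen so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // target sorts before the first key
        }
        int end = Math.min(start + interval, keys.size());
        for (int i = start; i < end; i++) {
            if (keys.get(i).equals(key)) {
                return values.get(i);
            }
        }
        return null;
    }

    int indexSize() {
        return index.size();
    }

    public static void main(String[] args) {
        SparseIndexSketch map = new SparseIndexSketch(4);
        for (int i = 0; i < 10; i++) {
            map.append("http://example.com/page" + i, "datum" + i);
        }
        System.out.println(map.get("http://example.com/page7"));
        System.out.println("indexed keys: " + map.indexSize() + " of 10");
    }
}
```

A larger interval gives a smaller index at the cost of a longer scan per lookup, which is the trade-off the index interval in the javadoc refers to.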
The MapFile.Reader class is for reading the contents of the map file. With
this class you can enumerate all the entries of the map file, and since the
keys of the crawldb are Text objects containing URLs, you can just dump the
keys one by one to another file. Try the following:
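    // fs is a FileSystem, seqFile the name of the map file directory,
    // conf a Configuration, and out a PrintStream or PrintWriter.
    MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
    Class keyC = reader.getKeyClass();
    Class valueC = reader.getValueClass();
    while (true) {
        WritableComparable key = null;
        Writable value = null;
        try {
            // create empty key/value objects for the reader to fill in
            key = (WritableComparable) keyC.newInstance();
            value = (Writable) valueC.newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(-1);
        }
        try {
            // next() returns false at the end of the map file
            if (!reader.next(key, value)) {
                break;
            }
            out.println(key);    // the URL
            out.println(value);  // the CrawlDatum; drop this line to get URLs only
        } catch (Exception e) {
            e.printStackTrace();
            out.println("Exception occurred. " + e);
            break;
        }
    }
    reader.close();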
This code is just for demonstration; of course you can customize it for
your needs, for example to print in XML format. You can check the javadocs
of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile classes for
further insight.
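As a rough idea of the XML variant, the sketch below takes URL strings (in the loop above that would be key.toString()) and writes them as a small XML document. The class name UrlXmlSketch and the element names urls and url are made up, so treat this as one possible layout. The one thing that does matter is escaping, since URLs with query strings usually contain '&'.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

/** Writes plain URL strings as a small XML document. Names here are hypothetical. */
final class UrlXmlSketch {

    /** Escapes the characters that may not appear verbatim in XML text. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes one <url> element per URL inside a <urls> root element. */
    static void writeUrls(Iterable<String> urls, PrintWriter out) {
        out.println("<?xml version=\"1.0\"?>");
        out.println("<urls>");
        for (String url : urls) {
            out.println("  <url>" + escapeXml(url) + "</url>");
        }
        out.println("</urls>");
        out.flush();
    }

    public static void main(String[] args) {
        // Sample input; in practice these would be the keys read from the crawldb.
        List<String> urls = Arrays.asList(
                "http://blog.cha.com/",
                "http://example.com/?a=1&b=2");
        writeUrls(urls, new PrintWriter(System.out));
    }
}
```

For plain URLs only, skip the XML wrapper and print each key on its own line.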
cha wrote:
> Hi Enis,
>
> I still can't figure out how it can be done. Can you explain in more
> detail, please?
>
> Regards,
> Chandresh
>
> Enis Soztutar wrote:
>
>> cha wrote:
>>
>>> hi sagar,
>>>
>>> Thanks for the reply.
>>>
>>> Actually I am trying to dig through the code in the same class, but I
>>> am not able to figure out where the URLs are read from.
>>>
>>> When you dump the database, the file contains:
>>>
>>> http://blog.cha.com/ Version: 4
>>> Status: 2 (DB_fetched)
>>> Fetch time: Fri Apr 13 15:58:28 IST 2007
>>> Modified time: Thu Jan 01 05:30:00 IST 1970
>>> Retries since fetch: 0
>>> Retry interval: 30.0 days
>>> Score: 0.062367838
>>> Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
>>> Metadata: null
>>>
>>> I figured out the rest of it, but I am not sure how the URL itself is
>>> read.
>>>
>>> I just want the plain URLs in the text file. Is it possible to write
>>> the URLs in some XML format? If yes, then how?
>>>
>>> Awaiting,
>>>
>>> Chandresh
>>>
>>>
>>>
>> Hi, the crawldb is actually a map file, which has URLs as keys (Text
>> class) and CrawlDatum objects as values. You can write a generic map
>> file reader which extracts the keys and dumps them to a file.
>>
>>
>>
>>
>>
>
>