Thread: Re: extracting urls into text files




Re: extracting urls into text files
2007-03-19 02:30:52
Hi Enis,

I still can't figure out how this can be done. Could you please explain
in more detail?

Regards,
Chandresh

Enis Soztutar wrote:
> 
> cha wrote:
>> hi sagar,
>>
>> Thanks for the reply.
>>
>> Actually, I am trying to dig through the code in the same class, but I am
>> not able to figure out where the URLs are read from.
>>
>> When you dump the database, the file contains:
>>
>> http://blog.cha.com/	Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Fri Apr 13 15:58:28 IST 2007
>> Modified time: Thu Jan 01 05:30:00 IST 1970
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.062367838
>> Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
>> Metadata: null
>>
>> I figured out the rest of the fields, but I am not sure how the URL itself
>> is read.
>>
>> I just want the plain URLs in a text file. Is it also possible to write
>> the URLs in some XML format? If yes, then how?
>>
>> Awaiting,
>>
>> Chandresh
>>
>>   
> Hi, the crawldb is actually a map file, which has URLs as keys (Text class)
> and CrawlDatum objects as values. You can write a generic map file
> reader that extracts the keys and dumps them to a file.
> 
> 
> 
> 
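If a text dump like the one quoted above already exists, the URLs can also be recovered from it directly. The following is only a sketch: it assumes every record starts with a line of the form URL, tab, "Version:" (as in the quoted dump), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: pulls the URL out of each record of a crawldb text dump. */
final class DumpUrlExtractor {

    // Text that follows the URL on the first line of a record in the quoted dump.
    // Assumption: every record in the dump starts this way.
    private static final String MARKER = "\tVersion:";

    /** Returns the URL part of every line that contains the marker. */
    static List<String> extractUrls(Iterable<String> dumpLines) {
        List<String> urls = new ArrayList<>();
        for (String line : dumpLines) {
            int markerAt = line.indexOf(MARKER);
            if (markerAt > 0) {
                urls.add(line.substring(0, markerAt));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample lines copied from the dump shown in this thread.
        List<String> sample = Arrays.asList(
                "http://blog.cha.com/\tVersion: 4",
                "Status: 2 (DB_fetched)",
                "Score: 0.062367838",
                "Metadata: null");
        for (String url : extractUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

For a real dump file, the lines could come from java.nio.file.Files.readAllLines instead of the sample list.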

-- 
View this message in context: http://www.nabble.com/extracting-urls-into-text-files-tf3409030.html#a9547522
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: extracting urls into text files
2007-03-19 04:21:19
The crawldb is stored as a Hadoop org.apache.hadoop.io.MapFile. This
structure contains two SequenceFiles, one for the data and one for the
index. This is an excerpt from the javadoc of the MapFile class:

A file-based map from keys to values.
 *
 * <p>A map is a directory containing two files, the <code>data</code> file,
 * containing all keys and values in the map, and a smaller <code>index</code>
 * file, containing a fraction of the keys.  The fraction is determined by
 * {@link Writer#getIndexInterval()}.
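
To make the data/index split in that javadoc concrete, here is a toy in-memory model in plain Java. It is not Hadoop's implementation and uses no Hadoop classes. The class name, the method names, and the interval used in main are invented for illustration. It only shows the idea of keeping every key in sorted "data" and every Nth key in a smaller "index".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of sorted data plus a sparse index (illustration only, not Hadoop code). */
final class SparseIndexDemo {
    private final List<String> keys = new ArrayList<>();             // "data": every key, sorted
    private final List<String> values = new ArrayList<>();           // "data": the matching values
    private final List<Integer> indexPositions = new ArrayList<>();  // "index": position of every Nth key
    private final int indexInterval;

    SparseIndexDemo(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    /** Appends an entry; keys must arrive in sorted order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) > 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            indexPositions.add(keys.size());  // only a fraction of the keys is indexed
        }
        keys.add(key);
        values.add(value);
    }

    /** Finds the last indexed key <= target, then scans the data forward from there. */
    String get(String target) {
        int start = 0;
        for (int pos : indexPositions) {
            if (keys.get(pos).compareTo(target) <= 0) {
                start = pos;
            } else {
                break;
            }
        }
        for (int i = start; i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(target);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break;  // passed the place where the key would be
            }
        }
        return null;
    }

    int indexSize() {
        return indexPositions.size();
    }

    /** Enumerating all keys only needs the data part, not the index. */
    List<String> allKeys() {
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        SparseIndexDemo map = new SparseIndexDemo(2);
        map.append("http://a.example/", "datum A");
        map.append("http://b.example/", "datum B");
        map.append("http://c.example/", "datum C");
        System.out.println(map.allKeys());
        System.out.println(map.get("http://b.example/"));
    }
}
```

Dumping all URLs corresponds to allKeys() here: it walks the data from start to end, and the index only matters for random lookups.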

The MapFile.Reader class reads the contents of a map file. Using this
class, you can enumerate all the entries of the map file. Since the keys
of the crawldb are Text objects containing URLs, you can just dump the
keys one by one to another file. Try the following:


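    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;

    // fs (FileSystem), seqFile (path of the map file directory), conf
    // (Configuration) and out (e.g. a PrintStream) must already be set up.
    MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);

    // The reader knows which classes the keys and values were written with.
    Class keyC = reader.getKeyClass();
    Class valueC = reader.getValueClass();

    while (true) {
        WritableComparable key = null;
        Writable value = null;
        try {
            // Create empty objects for the reader to fill in.
            key = (WritableComparable) keyC.newInstance();
            value = (Writable) valueC.newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(-1);
        }

        try {
            // next() returns false once the end of the map file is reached.
            if (!reader.next(key, value)) {
                break;
            }

            out.println(key);    // the URL
            out.println(value);  // the CrawlDatum; drop this line to print URLs only
        } catch (Exception e) {
            e.printStackTrace();
            out.println("Exception occurred. " + e);
            break;
        }
    }
    reader.close();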

This code is just for demonstration; of course you can customize it for
your needs, for example to print in XML format. You can check the
javadocs of the CrawlDatum, CrawlDb, Text, MapFile, and SequenceFile
classes for further insight.
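
As one possible take on the XML idea, here is a minimal sketch that writes a list of URLs as XML with the JDK's XMLStreamWriter, which takes care of escaping characters such as "&". The element names "urls" and "url" and the class name are assumptions made for illustration, not a format that Nutch defines. The URL list stands in for the keys read from the crawldb.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper: renders URLs as a small XML document. */
final class UrlXmlWriter {

    /** Writes one <url> element per entry inside a <urls> root element. */
    static String toXml(List<String> urls) {
        StringWriter buffer = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            xml.writeStartDocument();
            xml.writeStartElement("urls");
            for (String url : urls) {
                xml.writeStartElement("url");
                xml.writeCharacters(url);  // escapes &, < and > in the URL text
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // In real use, these strings would be the keys read from the crawldb.
        List<String> urls = Arrays.asList(
                "http://blog.cha.com/",
                "http://blog.cha.com/?page=2&sort=date");
        System.out.println(toXml(urls));
    }
}
```

To write to a file instead of a string, a FileWriter or other Writer can be passed to createXMLStreamWriter in place of the StringWriter.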

