Thread: Nutch/Lucene unique ID for every item crawled?




Nutch/Lucene unique ID for every item crawled?
2007-10-20 04:24:30
Hello,

Does Nutch/Lucene provide a unique ID for every item that it has
crawled?

I checked the Lucene docid, but from what I understood, the Lucene docid
is not unique for every item crawled. Is that so?

How can I get this unique ID, if it is available?

Thanks.

- Sagar
Re: Nutch/Lucene unique ID for every item crawled?
2007-10-20 23:48:13
Hey,
The Lucene document ID, an integer, may not be the same across two
different crawls.
I am not sure if this is what you are looking for, but you can store a
hash value of the crawled URL ;)

- Sagar
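
A minimal sketch of the URL-hash suggestion above, using only the Java
standard library. The class name UrlId and the method md5Hex are
hypothetical helpers for illustration, not part of Nutch or Lucene, and
MD5 is just one possible choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Nutch or Lucene class): derives an ID from
// the URL alone, so the same URL yields the same ID in every crawl.
final class UrlId {
    private UrlId() {}

    /** Returns the MD5 digest of the URL as 32 lowercase hex characters. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("http://foo.com"));
    }
}
```

Note that the hash is computed over the exact string, so
"http://foo.com" and "http://foo.com/" get different IDs unless the URL
is normalized first.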

Sagar Vibhute wrote:
> Hello,
>
> Does Nutch/Lucene provide a unique ID for every item that it has
> crawled?
>
> I checked the Lucene docid, but from what I understood, the Lucene
> docid is not unique for every item crawled. Is that so?
>
> How can I get this unique ID, if it is available?
>
> Thanks.
>
> - Sagar


-- 
This message has been scanned for viruses and
dangerous content and is believed to be clean.


Re: Nutch/Lucene unique ID for every item crawled?
2007-10-21 08:10:16
The hash value of the URL does sound useful. Thanks!

But is the segment ID different for every crawl? In that case the
segment ID + doc ID could become a unique mapping. The trouble is, I
don't know how to extract the doc ID of a particular document while it
is being crawled. I found a method which, given a doc ID, gives the
document, but that's not what I need; I kind of need the opposite.

Any leads?

- Sagar


On 10/21/07, Sagar Naik <sagarvisvo.com> wrote:
>
> Hey,
> The Lucene document ID, an integer, may not be the same across two
> different crawls.
> I am not sure if this is what you are looking for, but you can store
> a hash value of the crawled URL ;)
>
> - Sagar

Re: Nutch/Lucene unique ID for every item crawled?
2007-10-21 10:36:59
Hey,

CRAWL 1:
        url: http://foo.com
        doc id = X
CRAWL 2:
        url: http://foo.com
        doc id = Y
X may or may not be equal to Y.

And yes, the segment ID is different for different crawls. It is a
timestamp value: the time when the Generator is executed.

Maybe if you could tell us about your ultimate aim, we might be able to
help you appropriately.
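
Since the doc ID is not stable between crawls, one way to sketch the
"segment ID + something" idea is to pair the segment timestamp with the
URL rather than with the doc ID. This is an illustration only: the class
CrawlKey, its methods, the "yyyyMMddHHmmss" timestamp pattern, the UTC
time zone, and the "|" separator are all assumptions made for this
example, not confirmed Nutch behaviour.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper (not a Nutch class): builds a key that identifies
// one fetch of one URL, i.e. (segment timestamp, URL).
final class CrawlKey {
    private CrawlKey() {}

    /** Formats a time in milliseconds as a timestamp-style segment name.
     *  The pattern and the UTC zone are assumptions for this sketch. */
    static String segmentName(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    /** Joins segment name and URL; the URL stands in for the doc ID,
     *  which is not guaranteed to be the same across crawls. */
    static String key(String segment, String url) {
        return segment + "|" + url;
    }

    public static void main(String[] args) {
        String segment = segmentName(System.currentTimeMillis());
        System.out.println(key(segment, "http://foo.com"));
    }
}
```

The URL alone identifies a page across crawls; adding the segment name
distinguishes separate fetches of that same page.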




Sagar Vibhute wrote:
> The hash value of the URL does sound useful. Thanks!
>
> But is the segment ID different for every crawl? In that case the
> segment ID + doc ID could become a unique mapping. The trouble is, I
> don't know how to extract the doc ID of a particular document while
> it is being crawled. I found a method which, given a doc ID, gives
> the document, but that's not what I need; I kind of need the
> opposite.
>
> Any leads?
>
> - Sagar



