hey
CRAWL 1:
url: http://foo.com
doc id = X
CRAWL 2:
url: http://foo.com
doc id = Y
X may or may not be equal to Y.
And yes, the segment ID is different for different crawls. It is a
timestamp value: the time when the Generator was executed.
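
On the URL hash idea from earlier in the thread, here is a minimal sketch of
what I mean. It is plain Python using the standard hashlib module, not Nutch
or Lucene code, and the function name url_id is made up for illustration. The
same thing can be done in Java with java.security.MessageDigest. It assumes
the URL string is stored exactly as crawled, with no normalization.

```python
import hashlib


def url_id(url: str) -> str:
    """Return a hex MD5 digest of the URL string.

    Hypothetical helper: the digest depends only on the URL text, so it
    stays the same across crawls, unlike the Lucene doc id.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Same URL fetched in two different crawls gives the same ID.
crawl_1_id = url_id("http://foo.com")
crawl_2_id = url_id("http://foo.com")
print(crawl_1_id == crawl_2_id)  # True
```

Note that "http://foo.com" and "http://foo.com/" hash to different values, so
you would probably want to normalize URLs the same way before hashing.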
Maybe if you could tell us about your ultimate aim, we might be able
to help you appropriately.
Sagar Vibhute wrote:
> Hash value of the url does sound useful. Thanks!
>
> But well, is the segment ID different for every crawl? In which case the
> segment ID + Doc Id can become a unique mapping. Trouble is, I don't know
> how to extract the doc id of a particular document while it is being
> crawled. I found a method which, given a doc Id gives the document, but
> that's not what I need, I kinda need the opposite.
>
> Any leads?
>
> - Sagar
>
>
> On 10/21/07, Sagar Naik <sagar visvo.com> wrote:
>
>> Hey,
>> The Lucene document ID, an integer, may not be the same for 2 different
>> crawls.
>> I am not sure if this is what you are looking for, but you can store a hash
>> value of the crawled URL ;)
>>
>> - Sagar
>>
>> Sagar Vibhute wrote:
>>
>>> Hello,
>>>
>>> Does nutch/lucene provide for a unique ID for every item that it has
>>> crawled?
>>>
>>> I checked the Lucene docid but from what I understood, the lucene docid is
>>> not unique for every item crawled. Is that so?
>>>
>>> How can I get this unique ID, if it is available?
>>>
>>> Thanks.
>>>
>>> - Sagar
>>>
>>>
>>>
>> --
>> This message has been scanned for viruses and
>> dangerous content and is believed to be clean.
>>
>>
>>
>
>