Thread: snippets and stored field in nutch...




snippets and stored field in nutch...
2007-10-11 14:08:05
Hey All,

Am I right in believing that in Lucene/Nutch, to be able to return
content or a snippet for a search query, the field to be returned has
to be stored?

AFAIK, by default, Nutch does not store the document field, am I
right?  If so, how does it manage to return snippets?  Wouldn't the
index be quite huge if Nutch were storing the document field by default?

I would appreciate any help/comments as I'm a bit lost with this.

Ravi

Re: snippets and stored field in nutch...
2007-10-11 15:27:26
Hi Ravish.

You are correct that Nutch does not store document content in the
Lucene index. The content *is* stored in the Nutch segment, which is
where snippets come from.

Hope this helps.

-J


On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:

> Hey All,
>
> Am I right in believing that in Lucene/Nutch, to be able to return
> content or a snippet for a search query, the field to be returned has
> to be stored?
>
> AFAIK, by default, Nutch does not store the document field, am I
> right?  If so, how does it manage to return snippets?  Wouldn't the
> index be quite huge if Nutch were storing the document field by default?
>
> I would appreciate any help/comments as I'm a bit lost with this.
>
> Ravi


Re: snippets and stored field in nutch...
2007-10-11 16:13:35
Ah, I see, didn't know that, Thanks!

Interesting that Nutch stores it in a different structure (segments)
and doesn't reuse Lucene's strategy of storing it within the index.  Any
particular reason why?  Is there any other use of the "Segments" data
structure except to return snippets?

Cheers,
Ravish

On 10/11/07, John H. Lee <jleearchive.org> wrote:
> Hi Ravish.
>
> You are correct that Nutch does not store document content in the
> Lucene index. The content *is* stored in the Nutch segment, which is
> where snippets come from.
>
> Hope this helps.
>
> -J

Re: snippets and stored field in nutch...
2007-10-11 16:30:30
The segment is where all the content is stored.  It contains all the
HTML of the pages Nutch has crawled and the parsed content (content
without HTML tags) used by Lucene.  It can contain more or less data
depending on which plug-ins you choose to run.  Try this out on a small
segment:

    nutch readseg -dump <segment_dir> <output>

It will output the segment as a text file so you can browse through it
yourself and see what's in there.
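If you want to script that dump, here is a minimal sketch in Python. It only wraps the `nutch readseg -dump <segment_dir> <output>` command quoted above. The function names, the default launcher name `nutch`, and the example paths are my own assumptions for illustration, not part of Nutch.

```python
import shutil
import subprocess


def readseg_dump_command(segment_dir, output_dir, nutch="nutch"):
    """Build the argv list for `nutch readseg -dump <segment_dir> <output>`."""
    return [nutch, "readseg", "-dump", segment_dir, output_dir]


def dump_segment(segment_dir, output_dir, nutch="nutch"):
    """Run the dump if the launcher is on PATH; return True if it was run."""
    if shutil.which(nutch) is None:
        return False  # Nutch launcher not found here, so nothing is executed
    subprocess.run(readseg_dump_command(segment_dir, output_dir, nutch), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical paths; substitute a small segment from your own crawl.
    print(" ".join(readseg_dump_command("crawl/segments/20071011", "dump_out")))
```

Building the argument list separately from running it makes it easy to check the command before pointing it at a real segment.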

Basically, the segment is where data is stored and manipulated before
Lucene gets involved.  It does not necessarily have to be indexed to
be useful.  It all depends on what you're trying to accomplish.

On 10/11/07, Ravish Bhagdev <ravish.bhagdevgmail.com> wrote:
> Ah, I see, didn't know that, Thanks!
>
> Interesting that Nutch stores it in a different structure (segments)
> and doesn't reuse Lucene's strategy of storing it within the index.  Any
> particular reason why?  Is there any other use of the "Segments" data
> structure except to return snippets?
>
> Cheers,
> Ravish

Re: snippets and stored field in nutch...
2007-10-11 17:27:52
The reason it is stored in the segments instead of the index is to allow
summarizers to be run on the content of hits to produce the summaries
that appear in the search results.  Summarizers are pluggable, and the
actual content used to produce the summary can change.  Summaries can
also be changed without re-fetching or re-indexing.  If a summary were
stored in the index, re-indexing would have to occur to make changes.
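To make the pluggable-summarizer point concrete, here is a rough Python sketch. It is not Nutch's actual summarizer API. `SEGMENT_TEXT` is a hypothetical stand-in for the parsed text kept in a segment, and both summarizer functions are invented for illustration. The idea is only that the stored text stays the same while the summary logic is swapped at query time.

```python
# Parsed text kept outside the index, keyed by document id.
# Hypothetical stand-in for what a Nutch segment holds.
SEGMENT_TEXT = {
    "doc1": "Nutch stores parsed page text in segments. "
            "Snippets are built from it at query time.",
}


def leading_words_summarizer(text, query, max_words=6):
    """Summarize by taking the first few words, ignoring the query."""
    return " ".join(text.split()[:max_words])


def query_window_summarizer(text, query, radius=2):
    """Summarize with a small window of words around the first query match."""
    words = text.split()
    needle = query.lower()
    for i, word in enumerate(words):
        if needle in word.lower():
            return " ".join(words[max(0, i - radius): i + radius + 1])
    # No match: fall back to the simpler strategy.
    return leading_words_summarizer(text, query)


def summarize(doc_id, query, summarizer):
    """Produce a summary at query time from the stored text."""
    return summarizer(SEGMENT_TEXT[doc_id], query)


if __name__ == "__main__":
    print(summarize("doc1", "snippets", leading_words_summarizer))
    print(summarize("doc1", "snippets", query_window_summarizer))
```

Switching from one summarizer to the other changes what the user sees without touching the stored text or any index, which is the property described above.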

Also, the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped, and the best x
(usually 10) are returned.  Hit details (fields in the index) and
summaries are retrieved only for these 10 best hits.  So this also
keeps down the amount of data being pushed over the network.
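The flow in the paragraph above can be sketched roughly as follows. This is an illustration under my own assumptions, not Nutch's code: the `Hit` class, the `dedup_key` field, and the `fetch_details` / `fetch_summary` callbacks are hypothetical names. The point is that the expensive lookups happen only for the hits that survive sorting and deduplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float
    dedup_key: str  # hypothetical: whatever value duplicates share


def top_hits(hits, limit=10):
    """Sort by score, drop duplicates, and keep at most `limit` hits."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    seen = set()
    best = []
    for hit in ranked:
        if hit.dedup_key in seen:
            continue  # a better-scoring hit with this key was already kept
        seen.add(hit.dedup_key)
        best.append(hit)
        if len(best) == limit:
            break
    return best


def search(hits, fetch_details, fetch_summary, limit=10):
    """Fetch details and summaries only for the best `limit` hits."""
    return [
        {
            "doc_id": hit.doc_id,
            "details": fetch_details(hit.doc_id),
            "summary": fetch_summary(hit.doc_id),
        }
        for hit in top_hits(hits, limit)
    ]
```

With thousands of raw hits and a limit of 10, only 10 detail lookups and 10 summaries are requested, however large the candidate list was.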

Dennis Kubes

Ravish Bhagdev wrote:
> Ah, I see, didn't know that, Thanks!
> 
> Interesting that Nutch stores it in a different structure (segments)
> and doesn't reuse Lucene's strategy of storing it within the index.  Any
> particular reason why?  Is there any other use of the "Segments" data
> structure except to return snippets?
> 
> Cheers,
> Ravish
