Thread: snippets and stored field in nutch...
|
|
| snippets and stored field in nutch... |

|
2007-10-11 14:08:05 |
Hey All,
Am I right in believing that in Lucene/Nutch, to be able to return
content or a snippet for a search query, the field to be returned has
to be stored?

AFAIK, by default, Nutch does not store the document field, am I
right? If so, how does it manage to return snippets? Wouldn't the
index be quite huge if Nutch were storing the document field by
default?

I would appreciate any help/comments, as I'm a bit lost with this.
Ravi
|
|
| Re: snippets and stored field in
nutch... |
  United States |
2007-10-11 15:27:26 |
Hi Ravish.
You are correct that Nutch does not store document content in the
Lucene index. The content *is* stored in the Nutch segment, which is
where snippets come from.
Hope this helps.
-J
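
A minimal sketch of the split described above, written in Python and
not using any real Lucene or Nutch API. The inverted index keeps only
terms and document ids, while the page text lives in a separate store
that stands in for the segment. The names `TinyIndex`, `segment_store`,
and `snippet` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: terms map to doc ids; no document text is kept."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index the terms, then discard the text (an "unstored" field).
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def snippet(text, term, width=30):
    """Return a window of text around the first occurrence of term."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return text[:width]
    start = max(0, pos - width // 2)
    return text[start:start + width]


# Hypothetical stand-in for the segment: doc id -> parsed text.
segment_store = {
    1: "Nutch is a web crawler built on top of Lucene",
    2: "Lucene is a full text search library",
}

index = TinyIndex()
for doc_id, text in segment_store.items():
    index.add(doc_id, text)

# The index answers "which documents match"; the snippet text has to
# come from the separate store, because the index never kept it.
hits = index.search("crawler")
snippets = {doc_id: snippet(segment_store[doc_id], "crawler") for doc_id in hits}
```

The index stays small because it never holds the page text; only the
documents that actually match are looked up in the content store.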
|
|
| Re: snippets and stored field in
nutch... |

|
2007-10-11 16:13:35 |
Ah, I see, I didn't know that. Thanks!

Interesting that Nutch stores it in a different structure (segments)
and doesn't reuse Lucene's strategy of storing it within the index. Is
there a particular reason why? Is there any other use of the
"Segments" data structure besides returning snippets?
Cheers,
Ravish
|
|
| Re: snippets and stored field in
nutch... |

|
2007-10-11 16:30:30 |
The segment is where all the content is stored. It contains all the
HTML of the pages Nutch has crawled and the parsed content (content
without HTML tags) used by Lucene. It can contain more or less data
depending on which plug-ins you choose to run. Try this out on a small
segment:

    nutch readseg -dump <segment_dir> <output>

It will output the segment as a text file so you can browse through it
yourself and see what's in there.

Basically, the segment is where data is stored and manipulated before
Lucene gets involved. It does not necessarily have to be indexed to be
useful. It all depends on what you're trying to accomplish.
|
|
| Re: snippets and stored field in
nutch... |

|
2007-10-11 17:27:52 |
The reason it is stored in the segments instead of the index is to
allow summarizers to be run on the content of hits to produce the
summaries that appear in the search results. Summarizers are pluggable
and the actual content used to produce the summary can change.
Summaries can also be changed without re-fetching or re-indexing. If a
summary were stored in the index, re-indexing would have to occur to
make changes.

Also, the way the search process works, Nutch returns hits (basically
document ids). These hits are then sorted and deduped, and the best x
(usually 10) are returned. Only for these 10 best hits are the hit
details (fields in the index) and summaries retrieved. So there is
also something to be said for keeping down the amount of data being
pushed over the network.
Dennis Kubes
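
The flow described above can be sketched roughly as follows. This is a
toy Python illustration, not Nutch code: `Hit`, `top_hits`,
`first_words_summarizer`, `keyword_summarizer`, and the in-memory
`segment_content` dict are hypothetical names, and deduplicating by a
`site` key is an assumption made only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    doc_id: int
    score: float
    site: str  # assumed dedup key (hypothetical)


def top_hits(hits: List[Hit], n: int = 10) -> List[Hit]:
    """Sort by score, keep the best hit per site, return the best n."""
    seen = set()
    best = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.site in seen:
            continue
        seen.add(hit.site)
        best.append(hit)
        if len(best) == n:
            break
    return best


# A summarizer is any function from (content, query) to a summary
# string, so it can be swapped without touching the index.
Summarizer = Callable[[str, str], str]


def first_words_summarizer(content: str, query: str) -> str:
    return " ".join(content.split()[:5])


def keyword_summarizer(content: str, query: str) -> str:
    words = content.split()
    lowered = [w.lower() for w in words]
    if query.lower() not in lowered:
        return first_words_summarizer(content, query)
    i = lowered.index(query.lower())
    return " ".join(words[max(0, i - 2):i + 3])


def summarize(hits: List[Hit], content_by_id: Dict[int, str],
              query: str, summarizer: Summarizer) -> Dict[int, str]:
    # Content is looked up only for the hits that survived top_hits().
    return {h.doc_id: summarizer(content_by_id[h.doc_id], query) for h in hits}


# Hypothetical stand-in for segment data: doc id -> parsed text.
segment_content = {
    1: "Nutch stores fetched page content in segments on disk",
    2: "Segments hold raw and parsed content for each crawl",
    3: "An unrelated page about something else entirely",
}
all_hits = [
    Hit(1, 0.9, "a.example"),
    Hit(2, 0.7, "a.example"),
    Hit(3, 0.4, "b.example"),
]

best = top_hits(all_hits, n=2)
summaries = summarize(best, segment_content, "content", keyword_summarizer)
```

Swapping `keyword_summarizer` for `first_words_summarizer` changes the
summaries without rebuilding anything on the index side, which is the
point made above about pluggable summarizers.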
|
|