Thread: RE: Does Index have a Tokenizer Built into it

| RE: Does Index have a Tokenizer Built into it | Netherlands | 2007-07-16 03:26:31 |
Hello,
> Ard,
>
> I do have access to the URLs of the documents, but because I
> will be making short snippets for many pages (suppose there are
> about 20 hits per page and I need to make a Snippet for each of
> them), I was worried it would be inefficient to open each "hit",
> tokenize it, and then make the Snippet. Of
Yes, fetching all the documents over HTTP just to build the
snippet (for example, the first 2 lines) is really bad for the
performance of your search overviews.
Logically, whatever you want to show must be stored in your
index. For example, if search hits need to show the title and
subtitle, just store those two fields in the index. If you want
a Google-like highlighter that shows text snippets where the
term occurred, you need to store the entire text, IIRC (see
HighlighterTest in Lucene).
How many docs are you talking about, that you cannot store the
entire content?
You could also index the content without storing it and, in
another Lucene field, store the first 2 or 3 lines of the
document to serve as the text snippet. Making correct extracts
for text snippets is very hard (see LingPipe, for example).
Regards Ard
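
Ard's last suggestion (index the full content, but store only a short lead as the snippet) can be sketched as follows. This is a minimal illustration in plain Java, not Lucene code: `LeadingLines` and `leadingLines` are hypothetical names, and the returned string is what one would put into a separate stored field at indexing time.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: picks the text that would go
// into a separate stored "snippet" field at indexing time.
final class LeadingLines {
    // Returns the first maxLines non-blank lines of text, joined by spaces.
    static String leadingLines(String text, int maxLines) {
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .limit(maxLines)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String body = "Title line\n\nFirst paragraph.\nSecond paragraph.\nThird.";
        System.out.println(leadingLines(body, 2));
    }
}
```

The trade-off is the one discussed in the thread: the stored lead is cheap, but it will not necessarily contain the query terms.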
> course, the price of this may be worth the price of the
> increased index size. I have been looking into storing "Field
> Vectors with positions" in the index. It seems that by doing this
> I will have access to everything that the Tokenizer is giving me,
> correct? Will I need to store "term text" in order to be able to
> access the actual term instead of stemmed words?
>
> Thanks for all your help,
>
> --JP
>
> On 7/13/07, Ard Schrijvers <a.schrijvers hippo.nl> wrote:
> >
> > Hello,
> >
> > > I'm wondering if, after opening the index, I can retrieve the
> > > Tokens (not the terms) of a document, something akin to
> > > IndexReader.Document(n).getTokenizer().
> >
> > It is obviously not possible to get the original tokens of the
> > document back when you haven't stored the document, because:
> >
> > 1) the analyzer might have removed stop words in the first place
> > 2) the terms in the Lucene index are perhaps stemmed words,
> >    synonyms, etc.
> > 3) how would you expect things like spaces, commas, dots, etc.
> >    to be restored?
> >
> > And I think what you want does not fit an inverted index. When
> > you do not store the document, you always lose information about
> > the document during indexing/analyzing.
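
The three points above can be illustrated with a toy analyzer. This is a sketch only: `LossyAnalysis`, its stop-word list, and the crude trailing-"s" stemmer are assumptions made for the example and are not Lucene's StandardAnalyzer. It shows two different inputs producing identical indexed terms, so the original tokens cannot be recovered from the terms alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer for illustration only; it is NOT Lucene's StandardAnalyzer.
final class LossyAnalysis {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    // Lowercases, splits on anything that is not a letter or digit, drops
    // stop words, and strips a trailing "s" as a crude stand-in for stemming.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            boolean plural = raw.length() > 3 && raw.endsWith("s");
            terms.add(plural ? raw.substring(0, raw.length() - 1) : raw);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two different inputs collapse to the same list of indexed terms.
        System.out.println(analyze("The cats, and the dogs."));
        System.out.println(analyze("Cat of a dog"));
    }
}
```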
> >
> > How many documents are you talking about? They must be either
> > somewhere on the FS or accessible over HTTP... when you need the
> > document, why not just provide a link to the original location?
> >
> > Regards Ard
> >
> > >
> > > In summary:
> > >
> > > My current (too wasteful) implementation is this:
> > >
> > > StandardTokenizer(BufferedReader(
> > >     IndexReader.Document(n).getField("text")))
> > >
> > > I'm wondering if Lucene has a more efficient way to retrieve
> > > the tokens of a document from an index, because it seems like
> > > it already has information about every "term", since you can
> > > retrieve a TermPositions object.
> > >
> > > Thanks,
> > >
> > >
> > > --JP
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
> > For additional commands, e-mail: java-user-help lucene.apache.org
| Re: Does Index have a Tokenizer Built into it | 2007-07-16 11:39:35 |
Some of the data sets that I will be using have about 2 TB of data
(90 million web pages). I would like the Snippet I generate to
include the words that are being queried, so I don't want to simply
store the first 2 or 3 lines. I have looked at HighlighterTest, and
I do believe that it requires the entire text of the document.
However, unlike the highlighter, I know where the termOffset is in
the document.
The input to my Snippet will be a vector of query words and their
offsets in the document (not their positions in the document). I'm
reading about the "term vectors" option that I can store while
indexing my data. It seems to be much more efficient than storing
the entire document; I'm just not sure if the "term offset" is the
same as a "token offset". Here's what I'm reading, in case I'm
totally off base here and this is useless to me:
http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors
It seems like this has all the information that I would have if I
tokenized the document anyway, or am I missing something?
Thanks again for all the help!
--JP
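
On the position/offset distinction raised above: in Lucene terminology a position is the token's ordinal within the field, while an offset is a start/end character index into the original text. The sketch below shows both for the same tokens. It uses a plain whitespace tokenizer written for this example (`PositionsVsOffsets` and `Tok` are hypothetical names, not Lucene classes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a whitespace tokenizer, not a Lucene Tokenizer.
final class PositionsVsOffsets {
    // One token with its ordinal position and its character offsets.
    static final class Tok {
        final String text;
        final int position; // 0-based ordinal of the token
        final int start;    // index of the first character in the source text
        final int end;      // index one past the last character

        Tok(String text, int position, int start, int end) {
            this.text = text;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            out.add(new Tok(m.group(), position++, m.start(), m.end()));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("quick brown fox")) {
            System.out.println(t.text + " position=" + t.position
                    + " offsets=[" + t.start + "," + t.end + ")");
        }
    }
}
```

Note that offsets only say where to cut in the original text, so a snippet built from offsets still needs that text to be available from somewhere.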
| Re: Does Index have a Tokenizer Built into it | 2007-07-17 12:21:06 |
Hi,
I've been looking into indexing documents with term vectors that
include positions to solve my problem. However, I've run into a bit
of a snag. After indexing, I have been able to retrieve the
TermPositionVector from the index, and it has all of the data, but I
cannot find a way, given a position, to retrieve the term at that
position, which is how I was hoping to create my contextual snippets.
There are methods where, given a term, you can get its positions,
but I see no method to achieve the reverse. Is there another class I
need to use for this?
--JP
| Re: Does Index have a Tokenizer Built into it | United States | 2007-07-18 00:07:18 |
: After indexing I have been able to retrieve the TermPositionVector
: from the index and it has all of the data, but I cannot find a way
: where given a position I can retrieve the term at that position.
: Which is how I was hoping to create my contextual snippets.
there is no easy way to go from a position to a term --
coincidentally, there is a very recent thread on this on java-dev...
http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-tf4084187.html
...a new API may come out of it, but in the meantime you may be
interested in taking the approach the current highlighter uses (as
mentioned in that thread): using the TermPositionVector to rebuild
the original token stream, then skipping ahead to the positions you
are interested in.
-Hoss
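
The rebuild-then-skip approach described above can be sketched in plain Java. The `Map<String, int[]>` here is an assumed stand-in for the term-to-positions data that a term vector with positions exposes; it is not a Lucene type, and `RebuildFromPositions`, `rebuild`, and `window` are hypothetical names. The rebuilt array holds indexed terms (possibly stemmed, with gaps where stop words were dropped), not the original text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Plain-Java stand-in for the data a term vector with positions exposes
// (term text -> positions). The map is an assumption, not a Lucene type.
final class RebuildFromPositions {
    // Inverts term -> positions into a position-ordered array. Slots with no
    // term (for example, where a stop word was removed) stay null.
    static String[] rebuild(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        String[] byPosition = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                byPosition[p] = e.getKey();
            }
        }
        return byPosition;
    }

    // Joins the terms within `radius` positions of `hit`, skipping empty slots.
    static String window(String[] byPosition, int hit, int radius) {
        StringJoiner joined = new StringJoiner(" ");
        int from = Math.max(0, hit - radius);
        int to = Math.min(byPosition.length - 1, hit + radius);
        for (int i = from; i <= to; i++) {
            if (byPosition[i] != null) {
                joined.add(byPosition[i]);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, int[]> vector = new LinkedHashMap<>();
        vector.put("quick", new int[] {1});
        vector.put("brown", new int[] {2});
        vector.put("fox", new int[] {3, 6});
        vector.put("jump", new int[] {4});
        String[] tokens = rebuild(vector);
        System.out.println(window(tokens, 3, 1));
    }
}
```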
| Re: Does Index have a Tokenizer Built into it | 2007-07-18 11:14:19 |
Is there a way to know how big to make the array beforehand (how
many terms are in the topic in total)? I'm worried about the
efficiency of this, since I'd have to rebuild every document that is
a "hit" on the fly to make a snippet for each "hit" on the page
(say 10 per page).
Now I have to wonder how storing the term position vectors in the
index and sorting them by position compares to storing the location
of the document and running a tokenizer on the document. Both give
me the result I want in the end.
Any opinions?
--JP
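
One possible way to answer the sizing question, sketched under the assumption that each term's positions are at hand as an `int[]` (the map below is a hypothetical stand-in, not a Lucene type): the number of token occurrences is the sum of the per-term frequencies, while an array indexed by position needs the largest position plus one, which can be larger when positions have gaps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper working on term -> positions data; not a Lucene class.
final class VectorSizing {
    // Number of token occurrences: the sum of each term's frequency.
    static int tokenCount(Map<String, int[]> termPositions) {
        int total = 0;
        for (int[] positions : termPositions.values()) {
            total += positions.length;
        }
        return total;
    }

    // Array length needed to index by position: largest position plus one.
    static int requiredLength(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        return max + 1;
    }

    public static void main(String[] args) {
        Map<String, int[]> vector = new HashMap<>();
        vector.put("quick", new int[] {1});
        vector.put("fox", new int[] {3, 6});
        System.out.println(tokenCount(vector) + " tokens, array length "
                + requiredLength(vector));
    }
}
```

Both passes touch every position once, so the cost per hit grows with the document's token count; that is the per-hit work to weigh against re-tokenizing the source document.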