Thread: highlighting

2006-04-04 13:51:16
I would like to have highlighting of selected field(s) in Solr search
results. Certainly a custom request handler can do this, but I'm
curious if the standard handler and configuration should evolve to
handle the common need for search term highlighting, and if so how
would that ideally look in the configuration and search request?

I am game for developing the highlighting piece in some way in the
next few days, and would gladly contribute that feature back provided
it was done in a way that fits with Solr's architecture.

Thanks,
Erik

2006-04-04 15:02:25
On 4/4/06, Erik Hatcher <erik ehatchersolutions.com> wrote:
> I would like to have highlighting of selected field(s) in Solr search
> results. Certainly a custom request handler can do this, but I'm
> curious if the standard handler and configuration should evolve to
> handle the common need for search term highlighting,

Absolutely!

> and if so how would that ideally look in the configuration and
> search request?

Great question... and how would it look in the search results as well.
I haven't used highlighting yet in Lucene, so I'm not sure what the
best way to fit it into Solr would be.
I guess it's time to go read that part in LIA.

One thing right off the bat: I think highlighting probably needs the
stored fields...
To support streaming of large result sets, I don't retrieve all the
documents up front - it's actually done in the XML serializer. That
may make things slightly more difficult.

It's probably best to focus on the ideal interface first (query
parameters as input format, and desired XML output format).

For the XML output format, we need to decide if the highlight info goes
in or after each <field>, in or after each <doc>, or in a separate
section altogether. We also need to consider multivalued fields.

The current format for fields looks like this for single-valued fields:

   <field name="title">How now brown cow</field>

And this for multi-valued fields:

   <arr name="title"><str>This is the first title</str> <str>This is the second</str> </arr>
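
To make the two field formats above concrete from a client's point of view, here is a minimal Python sketch that reads both into plain values. The `<doc>` wrapper around each field and the `read_fields` helper are assumptions made for this illustration; the message only shows the `<field>` and `<arr>` elements themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical <doc> wrappers around the two field formats quoted above.
SINGLE = '<doc><field name="title">How now brown cow</field></doc>'
MULTI = ('<doc><arr name="title"><str>This is the first title</str> '
         '<str>This is the second</str> </arr></doc>')


def read_fields(doc_xml):
    """Map each field name to a string (single-valued) or a list (multi-valued)."""
    fields = {}
    for child in ET.fromstring(doc_xml):
        name = child.get("name")
        if child.tag == "arr":
            # Multi-valued: one <str> child per value.
            fields[name] = [s.text for s in child.findall("str")]
        else:
            # Single-valued: the element text is the value.
            fields[name] = child.text
    return fields


print(read_fields(SINGLE))
print(read_fields(MULTI))
```

Any highlight information placed inside `<field>` or `<str>` would turn these text nodes into mixed content, so a client written like this one would need changes; a separate section would leave it untouched.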
-Yonik

2006-04-04 16:02:48
> It's probably best to focus on the ideal interface first (query
> parameters as input format, and desired XML output format).

We might also want to keep termvectors in mind when thinking about
this stuff... seems like they are related (per-field optional/extra
data).

-Yonik

2006-04-05 01:18:22
For the record, i know next to nothing about highlighting in Lucene. i
can't remember if i read that chapter in LIA or not

: curious if the standard handler and configuration should evolve to
: handle the common need for search term highlighting, and if so how

+1

: would that ideally look in the configuration and search request?

one of the things i've been doing in my custom plugins (one of which is
really generic and i'm hoping to get permission to commit it back to solr
real soon now) is to make every possible query param have a corresponding
identically named init param (in the solr config) which it uses as the
default. That way you can have...

   <str name="highlightFields">title description</str>

...in your solrconfig.xml, and clients that want different behavior can
override it with...

   highlightFields=title+description+body

...in the URL.
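
The default-plus-override pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `INIT_PARAMS` stands in for whatever is read from solrconfig.xml, and `effective_params` and `highlight_fields` are hypothetical helper names, not part of Solr.

```python
from urllib.parse import parse_qs

# Stand-in for init params read from solrconfig.xml (hypothetical).
INIT_PARAMS = {"highlightFields": "title description"}


def effective_params(query_string, init_params=INIT_PARAMS):
    """Start from the configured defaults, then let URL params override them."""
    params = dict(init_params)
    for name, values in parse_qs(query_string).items():
        # parse_qs decodes '+' as a space and returns a list per name;
        # in this sketch the last occurrence wins.
        params[name] = values[-1]
    return params


def highlight_fields(query_string):
    """Return the space separated highlightFields value as a list of names."""
    return effective_params(query_string)["highlightFields"].split()


print(highlight_fields("q=cow"))
print(highlight_fields("q=cow&highlightFields=title+description+body"))
```

The first call falls back to the configured default and the second picks up the three fields named in the URL.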

: I am game for developing the highlighting piece in some way in the
: next few days, and would gladly contribute that feature back provided
: it was done in a way that fits with Solr's architecture.

from a usage standpoint, i think adding both a URL param and init param
to StandardRequestHandler that takes in a space separated list of
fieldNames to highlight makes a lot of sense ... the question is what do
we do with it?

Modifying XMLWriter and SolrQueryResponse to have "defaultHighlightFields"
in the same way they currently have "defaultReturnFields" seems like it
makes the most sense, (especially since that way other plugins can use it
as well). Then the XMLWriter can include a new <hi>word</hi> in its
output anytime it wants to highlight something.

(NOTE: Adding XML markup for highlighting probably means the default
"Protocol Version" should be rev'ed to 2.2, and highlighting should be
flat out disabled if the version is less than that so older clients
aren't suddenly surprised to find xml markup in their strings if the
server configuration changes)
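
A rough Python sketch of the version gate in the note above: `<hi>` markup is emitted only when the client asks for protocol version 2.2 or later, and older clients get the plain escaped string. The function names, the `client_version` argument, and the whole-word regex matching are assumptions for illustration; this is neither XMLWriter code nor the Lucene highlighter.

```python
import re
from xml.sax.saxutils import escape

MIN_HIGHLIGHT_VERSION = (2, 2)  # the version proposed in the note above


def parse_version(text):
    """Turn '2.2' into (2, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def render_value(value, terms, client_version):
    """Return XML text for a field value, adding <hi> markup only for new clients."""
    if not terms or parse_version(client_version) < MIN_HIGHLIGHT_VERSION:
        return escape(value)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    pieces, last = [], 0
    for match in pattern.finditer(value):
        # Escape the text between matches, then wrap the match itself.
        pieces.append(escape(value[last:match.start()]))
        pieces.append("<hi>" + escape(match.group(0)) + "</hi>")
        last = match.end()
    pieces.append(escape(value[last:]))
    return "".join(pieces)


print(render_value("How now brown cow", ["cow"], "2.1"))
print(render_value("How now brown cow", ["cow"], "2.2"))
```

Comparing versions as integer tuples rather than strings keeps a later "2.10" from sorting below "2.2".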
-Hoss

2006-04-17 12:46:10
I managed to hack some highlighting into a request handler last night
for a quick and dirty application demo, but it is less than ideal.
The current situation with XMLWriter actually pulling the Document
from the index coupled with the lack of access to the Query causes
this to currently be a tricky situation. My hack is just within the
handleRequest method of the request handler and makes a second pass
over the DocList and re-retrieves the Document objects to highlight
them, and adds the highlighted text to additional XML elements in the
response, not to the <doc>'s. So my current hack is not worth
contributing.
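
For readers who want a picture of the two-pass approach just described, here is a small Python simulation. Everything in it is a stand-in: `INDEX` plays the role of the index, `DOC_LIST` holds ids only, and the `<hi>` tag is borrowed from the earlier suggestion in this thread. It is not the actual hack and does not use the Lucene highlighter.

```python
import re

# Stand-ins for the index and the DocList (ids only); both are hypothetical.
INDEX = {
    1: {"title": "How now brown cow"},
    2: {"title": "The quick brown fox"},
}
DOC_LIST = [1, 2]


def highlight(text, term):
    """Wrap whole-word, case-insensitive matches of term in <hi> tags."""
    return re.sub(r"\b(%s)\b" % re.escape(term), r"<hi>\1</hi>", text,
                  flags=re.IGNORECASE)


def second_pass(doc_ids, term, fields):
    """Re-fetch each document and collect highlighted text in its own section."""
    section = {}
    for doc_id in doc_ids:
        doc = INDEX[doc_id]  # the second retrieval the message describes
        section[doc_id] = {f: highlight(doc[f], term) for f in fields if f in doc}
    return section


print(second_pass(DOC_LIST, "brown", ["title"]))
```

The result is keyed by document id and kept apart from the documents themselves, which mirrors the "additional XML elements" layout and also shows why each document ends up being read twice.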

Yonik additionally brought up some other very good points regarding
term vectors and stored fields. Stored fields would be necessary for
highlighting in the general sense, certainly, but I envision some
applications wanting to store the original text elsewhere and a
custom highlighting hook used to retrieve the original text through
other means.

I'm not quite sure where to go with this highlighting issue from here
given what seems to be a bit of an overhaul in where the Document
objects are accessed, or in being able to get the full context of the
Query (and filters, etc) down to the XMLWriter.

Thoughts?

Erik

2006-04-18 14:29:26
On 4/17/06, Erik Hatcher <erik ehatchersolutions.com> wrote:
> The current situation with XMLWriter actually pulling the Document
> from the index

Yeah, but seeing people ask for *all* matching documents (or sometimes
even all documents in the index), makes me think that we need to keep
streamability.

> coupled with the lack of access to the Query causes
> this to currently be a tricky situation.
> My hack is just within the
> handleRequest method of the request handler and makes a second pass
> over the DocList and re-retrieves the Document objects to highlight
> them,

There are a number of ways this could be handled, I think.

1) Preventing documents from being retrieved more than once:
  a) may not be a big deal with the document cache enabled, since they
     should still be there
  b) could create a subclass of DocList or another class that contains
     Document objects, not just the ids. XMLWriter would need to be
     changed to handle this type of class.

2) Access to the query for highlighting:
  a) I don't think streamability of results is important for
     highlighting (I assume no one will ask for a million documents and
     have them all highlighted), so it could be done ahead of time for
     all the documents.
  b) More context (or even user-specified context) could be added to
     the SolrRequest, and the Query(s) could go there.
  c) If we had a custom DocList object from 1.b then it could also
     have a custom one for highlighting that carried this extra info.
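
Options 1.b and 2.c above could look roughly like the following Python sketch: a list of ids that also keeps any document already fetched, plus the query as extra context. The class name `DocListWithDocs`, its fields, and the `fetch` callback are hypothetical and only illustrate the idea; they are not Solr classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DocListWithDocs:
    """Hypothetical DocList variant that keeps documents, not just ids."""
    ids: list
    query: Optional[str] = None               # extra context for highlighting (2.c)
    docs: dict = field(default_factory=dict)  # id -> already-fetched document (1.b)
    fetches: int = 0                          # how many real retrievals happened

    def get(self, doc_id, fetch: Callable):
        """Return the document, calling fetch(doc_id) only on first access."""
        if doc_id not in self.docs:
            self.docs[doc_id] = fetch(doc_id)
            self.fetches += 1
        return self.docs[doc_id]


# Stand-in for the index; a real handler would read from the searcher instead.
store = {1: {"title": "How now brown cow"}}
doc_list = DocListWithDocs(ids=[1], query="brown")

for_highlighting = doc_list.get(1, store.__getitem__)  # handler's pass
for_writing = doc_list.get(1, store.__getitem__)       # writer's pass, no refetch
print(doc_list.fetches)
```

Documents that are never highlighted are still fetched lazily by the writer, so streaming of large unhighlighted result sets is not ruled out by this shape.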

> and adds the highlighted text to additional XML elements in the
> response, not to the <doc>'s. So my current hack is not worth
> contributing.

I'm not even sure what the ideal highlighter syntax would look like...
Do you have an example of what you would consider ideal?
Highlighting seems important and universal enough that I wouldn't be
opposed to adding special syntax for it if it's really needed. We
would want to make it flexible/powerful enough to handle whatever Mark
Harwood is cooking up for future highlighting as well.

> Yonik additionally brought up some other very good points regarding
> term vectors and stored fields. Stored fields would be necessary for
> highlighting in the general sense, certainly, but I envision some
> applications wanting to store the original text elsewhere and a
> custom highlighting hook used to retrieve the original text through
> other means.

Hmmm, some sort of callback interface for XMLWriter for classes it
doesn't know about?

> I'm not quite sure where to go with this highlighting issue from here
> given what seems to be a bit of an overhaul in where the Document
> objects are accessed, or in being able to get the full context of the
> Query (and filters, etc) down to the XMLWriter.

Ahh, just details... nothing that can't be fixed.

> Thoughts?

Focus on the interface:
- how clients will specify what extra info they want
- how clients typically parse and use the XML (extra bonus if we can
  make it semi-friendly to stylesheets/XSLT), and the ideal syntax for
  representing the extra info

Then it's just a small matter of implementing it.

-Yonik