Thread: Created: (SOLR-236) Field collapsing




Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 09:10:59
On 6/11/07, Will Johnson <wjohnsongetconnected.com> wrote:
> Having worked on a number of customer implementations regarding this
> feature I can say that the number one requirement is for the facet
> counts to be accurate post collapsing.  It all comes down to the user
> experience.  For example, if I run a query that gets collapsed and has a
> facet count for the non-collapsed value then when I click on that facet
> for refinement the number of hits in my subsequent query will not match
> the number of hits displayed by that facet count.

I assumed they would... I think our signals might be crossed w.r.t.
the meaning of pre or post collapsing.  Faceting "post collapsing" I
took to mean that the base docset would be restricted to the top "n"
of each category.

circuitcity does it how I would expect... field collapsing does not
affect the facets on the left.
For example, if I search for memory, a facet tells me that there are
70 under "Digital Cameras".  If I look down the collapsed results,
"Digital Cameras" only shows the top match, but has a link to "View
all 70 matches".

I don't know what bestbuy is doing, but when I search for memory, I
get a brand facet with "Sony (244)"... if I click that, it finds 95
items I can page through (but some facets still display counts higher
than 95).

>  Ie if it says there
> are 10 docs in my result set of type x then when I click on type x I
> expect to get back 10 hits.

Agree.

> Further, I could easily end up with a
> result set with 15 total hits but a facet count that says there are 200
> results of type x which is very disconcerting from a user perspective.

15 documents displayed to the user, or 15 total documents that matched
the query?
If the latter, I don't see how you could get greater than 15 for any
facet count.

-Yonik

RE: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 10:05:05
>I assumed they would... I think our signals might be crossed w.r.t.
>the meaning of pre or post collapsing.  Faceting "post collapsing" I
>took to mean that the base docset would be restricted to the top "n"
>of each category.

In my view, faceting should occur on the full collapsed result set.  Ie
break down 100 hits to 50 unique ones, then compute facets on those 50
even though you may only return 10 to the user.
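
As an illustration of the two readings (a minimal Python sketch, not taken from the SOLR-236 patch; the sample hits and the helper names `collapse` and `facet_counts` are hypothetical), facet counts can be computed either over every matching document or over the collapsed set that keeps one document per site:

```python
from collections import Counter

# Hypothetical matching documents in score order: (doc_id, site, doc_type).
hits = [
    (1, "a.com", "x"),
    (2, "a.com", "x"),
    (3, "b.com", "x"),
    (4, "b.com", "y"),
    (5, "c.com", "y"),
    (6, "a.com", "y"),
]

def collapse(hits):
    """Keep only the first (highest-ranked) hit for each site."""
    seen = set()
    kept = []
    for doc_id, site, doc_type in hits:
        if site not in seen:
            seen.add(site)
            kept.append((doc_id, site, doc_type))
    return kept

def facet_counts(docs):
    """Count documents per doc_type."""
    return Counter(doc_type for _, _, doc_type in docs)

pre_collapse = facet_counts(hits)             # facets over every match
post_collapse = facet_counts(collapse(hits))  # facets over one doc per site
```

With this sample data the pre-collapse counts are x=3, y=3, while the post-collapse counts are x=2, y=1.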

>circuitcity does it how I would expect... field collapsing does not
>affect the facets on the left.
>For example, if I search for memory, a facet tells me that there are
>70 under "Digital Cameras".  If I look down the collapsed results,
>"Digital Cameras" only shows the top match, but has a link to "View
>all 70 matches".

I agree, circuit city is a use case where you want pre-faceting.  If you
think about site collapsing though I may see that there are 57 documents
in my result set of type x, then clicking on type x should show me 57
docs.

>15 documents displayed to the user, or 15 total documents that matched
>the query?
>If the latter, I don't see how you could get greater than 15 for any
>facet count.

If I see that there are 15 of type x and click on it then 'total result
found' on the next page should say 15, not any higher.


-Yonik

RE: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 10:10:36
And one other point, one of the reasons why it's hard to find an example
of post-faceting is that many of the major engines can't do it.

- will

Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 17:57:31
On 11-Jun-07, at 8:10 AM, Will Johnson wrote:

> And one other point, one of the reasons why it's hard to find an
> example of post-faceting is that many of the major engines can't do it.

It seems that the only way to do it would be to collapse the entire
result set first, which entails loading the stored fields of the
whole docset.

That doesn't seem particularly feasible to do exactly.

-Mike

Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 18:26:02
: It seems that the only way to do it would be to collapse the entire
: result set first, which entails loading the stored fields of the
: whole docset.
:
: That doesn't seem particularly feasible to do exactly.

I haven't really been following this conversation that closely, but
assuming what you guys are talking about is desirable, it seems like one
way to accomplish it might be to make it operate on the *indexed* values
for a field ... wouldn't iterating over each doc, and using the
FieldCache+TermDocs make it very efficient to find all the docs that have
the same indexed value as the current one?
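
A rough sketch of that idea in Python, with a plain list standing in for the FieldCache and a dict standing in for TermDocs (both are stand-ins for illustration, not the Lucene API, and `same_group` is a hypothetical helper):

```python
# Stand-in for a FieldCache: indexed value of the collapse field per doc id.
field_cache = ["a.com", "b.com", "a.com", "c.com", "b.com", "a.com"]

# Stand-in for TermDocs: term -> ids of the docs that contain it.
postings = {}
for doc_id, term in enumerate(field_cache):
    postings.setdefault(term, []).append(doc_id)

def same_group(doc_id, docset):
    """Return the docs in docset that share doc_id's indexed collapse value."""
    term = field_cache[doc_id]  # per-doc lookup, no stored fields loaded
    return [d for d in postings[term] if d in docset]

docset = {0, 2, 3, 4}           # hypothetical result set for some query
group = same_group(0, docset)   # docs collapsed together with doc 0
```

Here doc 0 has the value "a.com", whose postings are docs 0, 2 and 5; only 0 and 2 are in the result set, so those two form the group.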


-Hoss


Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 19:16:04
On 6/11/07, Chris Hostetter <hossman_lucenefucit.org> wrote:
>
> : It seems that the only way to do it would be to collapse the entire
> : result set first, which entails loading the stored fields of the
> : whole docset.
> :
> : That doesn't seem particularly feasible to do exactly.
>
> I haven't really been following this conversation that closely, but
> assuming what you guys are talking about is desirable, it seems like one
> way to accomplish it might be to make it operate on the *indexed* values
> for a field

Yes, the current JIRA patch uses the FieldCache.

>... wouldn't iterating over each doc, and using the
> FieldCache+TermDocs make it very efficient to find all the docs that have
> the same indexed value as the current one?

The most efficient way will heavily depend on the nature of the
collapse field (few terms or many).  I can't currently think of a way
to do it efficiently for both.
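
One possible reading of that trade-off, sketched in Python under the assumption that collapsing needs the size of each group within the result docset: a doc-driven pass does one cache lookup per matching document, while a term-driven pass walks the postings of every term. The data structures and function names here are hypothetical stand-ins, not the patch's code.

```python
from collections import Counter

# Hypothetical per-doc collapse values (index = doc id), FieldCache-style.
field_cache = ["a.com", "b.com", "a.com", "c.com", "b.com", "a.com"]

# Hypothetical postings (term -> doc ids), TermDocs-style.
postings = {}
for doc_id, term in enumerate(field_cache):
    postings.setdefault(term, []).append(doc_id)

def group_sizes_by_doc(docset):
    """Doc-driven: one cache lookup per document in the docset."""
    return dict(Counter(field_cache[d] for d in docset))

def group_sizes_by_term(docset):
    """Term-driven: intersect each term's postings with the docset."""
    sizes = {}
    for term, docs in postings.items():
        n = sum(1 for d in docs if d in docset)
        if n:
            sizes[term] = n
    return sizes

docset = {0, 2, 3, 4}
by_doc = group_sizes_by_doc(docset)
by_term = group_sizes_by_term(docset)
```

Both passes return the same sizes. In this sketch the doc-driven pass scales with the size of the docset, and the term-driven pass scales with the number of terms and the length of their postings.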

-Yonik

Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-11 19:48:31
: Yes, the current JIRA patch uses the FieldCache.

I just meant in contrast with Mike's comment about iterating over all the
stored fields to support the "post-faceting" situation (but frankly i'm
not sure that i understand what the "post-faceting" situation is, so feel
free to ignore me)

: >... wouldn't iterating over each doc, and using the
: > FieldCache+TermDocs make it very efficient to find all the docs that have
: > the same indexed value as the current one?
:
: The most efficient way will heavily depend on the nature of the
: collapse field (few terms or many).  I can't currently think of a way
: to do it efficiently for both.

this sounds a lot like the faceting problem (to term enum or not to term
enum) and the discussion about building a "facet field cache" at server
startup if we know faceting is important on certain fields ... by default
we can do our best, but with added configuration hints telling us what you
expect, we can make more informed guesses.


-Hoss


Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-12 16:13:07
On 11-Jun-07, at 5:48 PM, Chris Hostetter wrote:

>
> : Yes, the current JIRA patch uses the FieldCache.
>
> I just meant in contrast with Mike's comment about iterating over all the
> stored fields to support the "post-faceting" situation (but frankly i'm
> not sure that i understand what the "post-faceting" situation is, so feel
> free to ignore me)

I'm not sure either--I assume that it means facet on a DocSet that is
limited to the representative doc in each collapsed group.  Or is
it faceting within each group?

If so, then all documents in the result set need to be collapsed to
determine this list of docs (which perhaps is not too inefficient?).
The way I do field collapsing is simply gathering documents and
collapsing them until I've gathered X groups for user display (which
usually involves looking at a few tens of documents more, rather than
the entire 3,000,000+ result set).
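
The approach described above could be sketched as follows (a minimal Python illustration, assuming documents arrive in score order; `collapse_top_groups`, `group_of`, and the sample data are hypothetical and not from any patch):

```python
def collapse_top_groups(ranked_docs, group_of, max_groups):
    """Walk docs in score order, keep the first doc of each group,
    and stop as soon as max_groups groups have been gathered."""
    seen = set()
    kept = []
    examined = 0
    for doc in ranked_docs:
        examined += 1
        group = group_of(doc)
        if group in seen:
            continue  # a higher-scoring doc already represents this group
        seen.add(group)
        kept.append(doc)
        if len(kept) == max_groups:
            break     # the rest of the result set is never visited
    return kept, examined

# Hypothetical doc ids in score order and their collapse-field values.
ranked = [10, 11, 12, 13, 14, 15, 16]
groups = {10: "a", 11: "a", 12: "b", 13: "a", 14: "c", 15: "d", 16: "b"}

kept, examined = collapse_top_groups(ranked, groups.get, max_groups=3)
```

Here only 5 of the 7 ranked documents are examined before three groups are filled; lower-scoring members of an already seen group are simply skipped.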

I'm going to bow out now, as I don't think I understand what exactly
we're talking about <g>

-Mike

Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-12 16:36:50
On 6/12/07, Mike Klaas <mike.klaasgmail.com> wrote:
> The way I do field collapsing is simply gathering documents and
> collapsing them until I've gathered X groups for user display (which
> usually involves looking at a few tens of documents more, rather than
> the entire 3,000,000+ result set).

Isn't this then dependent on the order of the documents in the index?
Or it sounds like you don't "promote" lower scoring documents into a
higher scoring group unless they both happen to be in the top docs
requested?

-Yonik

Re: Commented: (SOLR-236) Field collapsing
user name
2007-06-12 16:53:13
On 12-Jun-07, at 2:36 PM, Yonik Seeley wrote:

> On 6/12/07, Mike Klaas <mike.klaasgmail.com> wrote:
>> The way I do field collapsing is simply gathering documents and
>> collapsing them until I've gathered X groups for user display (which
>> usually involves looking at a few tens of documents more, rather than
>> the entire 3,000,000+ result set).
>
> Isn't this then dependent on the order of the documents in the index?
> Or it sounds like you don't "promote" lower scoring documents into a
> higher scoring group unless they both happen to be in the top docs
> requested?

Precisely.  I don't care how many docs are in a group, just avoiding
displaying two documents in the same group.  That way you can process
the docs in score order for essentially zero cost.

-Mike
