|
Thread: Color search
|
|
| Color search |

|
2007-09-28 03:00:26 |
Hi,
We're running an e-commerce site that provides product search. We've been able to extract colors from product images, and we think it'd be cool and useful to search products by color. A product image can have up to 5 colors (from a color space of about 100 colors), so we can implement it easily with Solr's facet search (thanks to all who've developed Solr).
The problem arises when we try to sort the results by color relevancy. What's different from a normal facet search is that colors are weighted. For example, a black dress can be 70% black, 20% gray, and 10% brown. A search query "color:black" should return results in which the black dress ranks higher than other products with a lower percentage of black.
My question is: how do I configure and index the color field so that products with a higher percentage of color X rank higher for the query "color:X"?
Thanks for your help!
- Guangwei
|
|
| Re: Color search |

|
2007-09-28 08:23:44 |
If it were just a couple of colors, you could have a separate field for each color and then index the percent in that field.
black:70
grey:20
and then you could use a function query to influence the score (or you could sort by the color percent).
However, this doesn't scale well to a large index with a large number of colors. Each field used like that will take up 4 bytes per document in the index.
so if you have 1M documents, that's 1M docs * 100 colors * 4 bytes = 400MB
Doable depending on your index size (use "int" or "float" and not "sint" or "sfloat" type for this... it will be better on the memory).
If you needed to be better on the memory, you could encode all of the colors into a single value (perhaps into a compact string... one percentage per byte or something) and then have a custom function that extracts the value for a particular color (this involves some Java development).
-Yonik
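The compact encoding suggested above could look roughly like the following. This is only a minimal sketch under stated assumptions: the class `PackedColors`, its method names, and the idea of addressing colors by a numeric palette id are all hypothetical, and the custom Solr function that would read the value at query time is not shown.

```java
// Hypothetical helper, not part of Solr or Lucene: packs one percentage
// (0-100) per byte, indexed by a color id in a fixed palette.
class PackedColors {
    static final int PALETTE_SIZE = 100; // "about 100 colors" from the thread

    /** Builds the packed value from parallel arrays of color ids and percentages. */
    static byte[] encode(int[] colorIds, int[] percents) {
        byte[] packed = new byte[PALETTE_SIZE]; // colors not present stay at 0
        for (int i = 0; i < colorIds.length; i++) {
            packed[colorIds[i]] = (byte) percents[i];
        }
        return packed;
    }

    /** Extracts the percentage stored for one color id. */
    static int percentOf(byte[] packed, int colorId) {
        return packed[colorId] & 0xff;
    }

    /** Memory needed by the one-field-per-color approach, in bytes. */
    static long perFieldBytes(long docs, int colors, int bytesPerValue) {
        return docs * colors * bytesPerValue;
    }

    public static void main(String[] args) {
        // Assumed palette ids: 0 = black, 1 = gray, 2 = brown.
        byte[] dress = encode(new int[] {0, 1, 2}, new int[] {70, 20, 10});
        System.out.println(percentOf(dress, 0));                // prints 70
        System.out.println(perFieldBytes(1_000_000L, 100, 4));  // prints 400000000
    }
}
```

With one byte per palette color, each document carries 100 bytes, so 1M documents would need about 100MB rather than the 400MB computed above.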
|
|
| Re: Color search |
  United States |
2007-09-28 08:31:58 |
Another option would be to extend Solr (and donate back) to incorporate Lucene's payload functionality, in which case you could associate the percentage of the color as a payload and use the BoostingTermQuery... If you're interested in this, a discussion on solr-dev is probably warranted to figure out the best way to do this.
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
|
|
| Re: Color search |
  United States |
2007-09-28 09:13:56 |
Hi Guangwei,
When you index your products, you could have a single color field, and include duplicates of each color component proportional to its weight. For example, if you decide to use 10% increments, for your black dress with 70% of black, 20% of gray, 10% of brown, you would index the following terms for the color field:
black black black black black black black
gray gray
brown
This works because Lucene natively interprets document term frequencies as weights.
Steve
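The duplicated-term field value described above could be generated at index time along these lines. This is a minimal sketch, and the class and method names are hypothetical. Note that Lucene's default similarity takes the square root of the term frequency, so the ranking follows the weights but the scores are not linearly proportional to them.

```java
// Hypothetical helper: builds the text for a single "color" field by
// repeating each color name once per full `step` percent of its weight.
class WeightedColorField {
    static String build(String[] colors, int[] percents, int step) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < colors.length; i++) {
            int copies = percents[i] / step; // integer division drops any remainder
            for (int c = 0; c < copies; c++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(colors[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = build(new String[] {"black", "gray", "brown"},
                             new int[] {70, 20, 10}, 10);
        // prints: black black black black black black black gray gray brown
        System.out.println(value);
    }
}
```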
|
|
| Re: Color search |
  United States |
2007-09-28 13:23:45 |
This discussion is incredibly interesting to me! We solved this simply by indexing the color names, and faceting on that. Not a very elegant solution, to be sure - but it works. If people search for a "green running shoe" they get -green- running shoes.
I would be very very interested in having a color picker ajax app which then went out and found the products with colors most like the one you chose.
+--------------------------------------------------------+
| Matthew Runo
| Zappos Development
| mruno zappos.com
| 702-943-7833
+--------------------------------------------------------+
|
|
| Re: Color search |

|
2007-09-28 14:57:42 |
: useful to search products by color. A product image can have up to 5 colors
: (from a color space of about 100 colors), so we can implement it easily with
: Solr's facet search (thanks all who've developed Solr).
:
: The problem arises when we try to sort the results by the color relevancy.
: What's different from a normal facet search is that colors are weighted. For
: example, a black dress can have 70% of black, 20% of gray, 10% of brown. A
if 5 is a hard max on the number of colors that you support, then you can always use 5 separate fields to store the colors in order of "dominance" and then query on those 5 fields with varying boosts...
color_1:black^10 color_2:black^7 color_3:black^4 color_4:black color_5:black^0.1
...something like this will lose the % granularity info that you have (so a 60% black skirt and an 80% black dress would both score the same against black since it's the dominant color)
alternately: i'm assuming your percentage data only has so much confidence -- maybe on the order of 10%? you can have a separate field for each "bucket" of color percentages and index the name of the color in the corresponding bucket. with 10% granularity that's only 10 fields -- a 10 clause boolean query for the color is no big deal ... even going to 5% would be trivial.
Incidentally: people interested in the general topic of color faceting at a finer granularity than just color names may want to check out this thread from last...
http://www.nabble.com/faceting-and-categorizing-on-color--tf1801106.html
-Hoss
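The bucket idea above might be sketched as follows. Everything here is an assumption made for illustration: the field names c0 to c9, the rule mapping 1-10% to bucket 0 and 91-100% to bucket 9, and the linear boosts; the thread itself does not fix any of these.

```java
// Hypothetical helper for the "one field per percentage bucket" approach.
class ColorBuckets {
    static final int BUCKETS = 10; // 10% granularity

    /** Maps a percentage in 1..100 to a bucket 0..9 (1-10% -> 0, 91-100% -> 9). */
    static int bucketOf(int percent) {
        return (percent - 1) / (100 / BUCKETS);
    }

    /** Name of the field a color with this percentage would be indexed into. */
    static String fieldFor(int percent) {
        return "c" + bucketOf(percent);
    }

    /** Builds a 10-clause query string, boosting higher buckets more (assumed scheme). */
    static String query(String color) {
        StringBuilder sb = new StringBuilder();
        for (int b = 0; b < BUCKETS; b++) {
            if (b > 0) {
                sb.append(' ');
            }
            sb.append('c').append(b).append(':').append(color).append('^').append(b + 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fieldFor(70));   // prints c6
        System.out.println(query("black")); // c0:black^1 c1:black^2 ... c9:black^10
    }
}
```

Since each color of a product lands in exactly one bucket field, at most one clause matches per color, and the boost of that clause determines how strongly the percentage influences the score.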
|
|
| Re: Color search |
  Canada |
2007-09-28 16:27:06 |
On 28-Sep-07, at 6:31 AM, Grant Ingersoll wrote:
> Another option would be to extend Solr (and donate back) to
> incorporate Lucene's payload functionality, in which case you could
> associate the percentile of the color as a payload and use the
> BoostingTermQuery... If you're interested in this, a
> discussion on solr-dev is probably warranted to figure out the best
> way to do this.
For reference, here is a summary of the changes needed:
1. A payload analyzer (here is an example that tokenizes strings of <token>:<whatever>:<score> into <token> with payload <score>):
    // Lives in a TokenFilter subclass; needs java.io.IOException,
    // org.apache.lucene.analysis.Token and org.apache.lucene.index.Payload.
    /** Returns the next token in the stream, or null at EOS. */
    public final Token next() throws IOException {
        Token t = input.next();
        if (null == t)
            return null;
        String s = t.termText();
        if (s.indexOf(":") > -1) {
            String[] parts = s.split(":");
            assert parts.length == 3;
            String colour = parts[0];
            // The score is the third field of <token>:<whatever>:<score>.
            int bits = Float.floatToIntBits(Float.parseFloat(parts[2]));
            // Serialize the float's bits, least significant byte first.
            byte[] buf = new byte[4];
            for (int shift = 0, i = 0; shift < 32; shift += 8, i++) {
                buf[i] = (byte) ((bits >> shift) & 0xff);
            }
            Token gen = new Token(colour, t.startOffset(), t.endOffset());
            gen.setPayload(new Payload(buf));
            t = gen;
        }
        return t;
    }
2. A payload deserializer. Add this method to your custom Similarity class:
    // Reassembles the four bytes (least significant first) into the float score.
    public float scorePayload(byte[] payload, int offset, int length) {
        assert length == 4;
        int accum = ((payload[0 + offset] & 0xff))
                  | ((payload[1 + offset] & 0xff) << 8)
                  | ((payload[2 + offset] & 0xff) << 16)
                  | ((payload[3 + offset] & 0xff) << 24);
        return Float.intBitsToFloat(accum);
    }
3. Add a relevant query clause. In a custom request handler, you could have a parameter to add BoostingTermQueries:
    // query is presumably the enclosing BooleanQuery; colour is the requested colour term.
    q = new BoostingTermQuery(new Term("colourPayload", colour));
    query.add(q, Occur.SHOULD);
How to add this generically is an interesting question. There are many possibilities, especially on the request handler and tokenizer side of things. If there is a consensus on a sensible way of doing this, I could contribute the bits of code that I have.
HTH,
-Mike
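The byte layout used in steps 1 and 2 can be checked in isolation, without Lucene on the classpath. The sketch below uses a hypothetical class, `PayloadCodec`, that copies the same shift-and-mask logic, least significant byte first, to show that a score survives the round trip unchanged.

```java
// Hypothetical standalone codec mirroring the payload serialization above.
class PayloadCodec {
    /** Writes the float's IEEE 754 bits into 4 bytes, least significant byte first. */
    static byte[] encode(float score) {
        int bits = Float.floatToIntBits(score);
        byte[] buf = new byte[4];
        for (int shift = 0, i = 0; shift < 32; shift += 8, i++) {
            buf[i] = (byte) ((bits >> shift) & 0xff);
        }
        return buf;
    }

    /** Reads 4 bytes starting at offset and reassembles the float. */
    static float decode(byte[] payload, int offset) {
        int accum = (payload[offset] & 0xff)
                  | ((payload[offset + 1] & 0xff) << 8)
                  | ((payload[offset + 2] & 0xff) << 16)
                  | ((payload[offset + 3] & 0xff) << 24);
        return Float.intBitsToFloat(accum);
    }

    public static void main(String[] args) {
        byte[] payload = encode(0.7f);
        System.out.println(decode(payload, 0) == 0.7f); // prints true
    }
}
```

The `& 0xff` mask in the decoder matters because Java bytes are signed; without it, any byte with the high bit set would sign-extend and corrupt the upper bits of the result.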
|
|
| Re: Color search |

|
2007-09-28 21:46:37 |
Thanks for all the replies. I think creating 10 fields and feeding each field with a color's value for each 10% of that color is a reasonable approach, and easy to implement too. One problem, though, is that not all products have colors totaling 100% (due to various reasons including our color extraction algorithm, etc.). So, for a product with 50% of #000000 and 20% of #999999, I'll have to fill the remaining three fields with some dummy values. Otherwise, Lucene seems to score it higher than products that also have 50% of #000000, but more than 20% of some other colors. Since I also need a way to exclude the dummy value when faceting, is there a neater solution?
I'll certainly look at the payload functionality, which is new to me.
- Guangwei
|
|
| Re: Color search |

|
2007-09-29 02:37:04 |
: extraction algorithm, etc.) So, for a product with 50% of #000000, and 20%
: of #999999, I'll have to fill the remaining three fields with some dummy
: values. Otherwise, Lucene seems to score it higher than products that also
: have 50% of #000000, but more than 20% of some other colors. Since I also
that doesn't really make sense to me ... your input is colors to search for, and you query each of those colors against every field right? so if i said i want grey and red dresses, you query for...
+(c0:grey c1:grey c2:grey c3:grey c4:grey c5:grey c6:grey c7:grey c8:grey c9:grey)
+(c0:red c1:red c2:red c3:red c4:red c5:red c6:red c7:red c8:red c9:red)
...right? a document that doesn't have any value in c6, c7 or c8 shouldn't score higher than any other documents ... if anything it should score lower because of the coord factor.
can you explain exactly how you are indexing the data and what your query looks like?
-Hoss
|
|
| Re: Color search |

|
2007-09-29 03:22:21 |
>
> can you explain exactly how you are indexing the data and what your
> query looks like?
>
I used the same field name (color), not 10 different names (c0 - c9).
So the index fields look like (50% #000000, 20% #999999):
color: #000000
color: #000000
color: #000000
color: #000000
color: #000000
color: #999999
color: #999999
The query for black dresses will be:
color:#000000
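One possible explanation for the ranking described earlier, offered as an assumption rather than something confirmed in the thread: with a single multi-valued field, Lucene's default similarity computes tf as sqrt(freq) and the field length norm as 1/sqrt(number of terms in the field). A product with fewer color terms in total then gets a larger norm. The sketch below (hypothetical class name) compares only those two factors; idf and the query norm are the same for both documents, and the one-byte quantization of stored norms is ignored.

```java
// Hypothetical illustration of tf * lengthNorm under the classic default formula.
class NormEffect {
    /** sqrt(termFreq) / sqrt(fieldLength), ignoring idf, boosts and norm quantization. */
    static double tfTimesNorm(int termFreq, int fieldLength) {
        return Math.sqrt(termFreq) / Math.sqrt(fieldLength);
    }

    public static void main(String[] args) {
        // 50% black + 20% gray: 5 black terms out of 7 terms in total.
        System.out.println(tfTimesNorm(5, 7));   // roughly 0.845
        // 50% black + 50% other colors: 5 black terms out of 10 terms in total.
        System.out.println(tfTimesNorm(5, 10));  // roughly 0.707
    }
}
```

If that is indeed the cause, disabling norms on the field (Solr's omitNorms="true" schema attribute) would be one thing to try instead of padding with dummy values.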
|
|
|
|