Thread: solr instances for different content?
|
|
| solr instances for different content? |

|
2007-11-05 08:34:32 |
Typical newspaper site with news, jobs, homes, autos, classifieds, and
community-generated content; my rough estimate is 0.5 million documents.

Do I really need to create a different Solr index for each vertical? How
inefficient is it to add a few additional fields for each content type?

I'm thinking of having a string field named "vertical" that would be used
to segment by the verticals above.

My intuition is that most of the additional fields would be numbers:
integers, prices, decimals.

Thanks,

Tim
--
True innovation is not just about changing a product, a service or even a
marketplace; it's also about recognizing and relishing the need to change
yourself.
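
A minimal sketch of the single-index layout Tim describes: every document carries a "vertical" string, plus a few numeric fields that only some content types use. Only the "vertical" field comes from the thread; every other field name and value here (id, title, salary, price, bedrooms) is a hypothetical illustration.

```python
# Hypothetical documents for one shared index. Only "vertical" is from the
# thread; all other field names and values are illustrative assumptions.
docs = [
    {"id": "news-1", "vertical": "news", "title": "Council approves budget"},
    {"id": "jobs-1", "vertical": "jobs", "title": "Line cook", "salary": 32000},
    {"id": "homes-1", "vertical": "homes", "title": "3BR colonial",
     "price": 249900.0, "bedrooms": 3},
    {"id": "autos-1", "vertical": "autos", "title": "2004 sedan", "price": 8500.0},
]


def fields_by_vertical(documents):
    """Collect the set of field names used by each vertical."""
    result = {}
    for doc in documents:
        result.setdefault(doc["vertical"], set()).update(doc.keys())
    return result


def shared_fields(documents):
    """Fields present in every vertical: candidates for common schema fields."""
    per_vertical = fields_by_vertical(documents)
    return set.intersection(*per_vertical.values())
```

Running `shared_fields(docs)` on real sample data is a quick way to see how many fields are truly common and how many are the "few additional fields" per content type.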
|
|
| Re: solr instances for different content? |

|
2007-11-05 08:39:23 |
500K is definitely doable with good hardware, but it also depends on
what your queries look like, how many fields you are faceting on, and
so on. Look at your performance now to try to judge how much headroom
you have.
-Yonik
|
|
| Re: solr instances for different content? |

|
2007-11-05 09:19:47 |
One reason to consider separate indexes is relevance. Do you want
content from classifieds affecting the rankings of your news searches?
It may not be an issue for you, depending on your term distributions,
but it might be something to consider. As you suspect, though, having
multiple indexes will require more management of the various instances.
Perhaps you can logically group things so that you only have a couple of
indexes? For instance, maybe homes, autos, and classifieds are similar in
content and structure, and news and community-generated content are
similar?
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
|
|
| Re: solr instances for different content? |

|
2007-11-05 09:27:29 |
Good points, Grant. I'm envisioning my front end working so that a user
would never be able to search across all the verticals at once.

EVERY query would inject "vertical:jobs" or "vertical:news" or
"vertical:Autos", etc.

This may detrimentally affect my faceted result sets, so I'll have to
think about this more.

Wouldn't this approach overcome my relevancy and scoring issues?
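
One way the "inject a vertical into every query" idea could look on the client side is sketched below. It uses Solr's `fq` (filter query) parameter rather than appending the clause to the main query string, which is a swapped-in choice and not something Tim specified. The base URL, the set of vertical values, and the function name are all assumptions.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to the actual Solr deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Assumed vertical values. String fields are case-sensitive, so the values
# used at query time must match what was indexed exactly.
ALLOWED_VERTICALS = {"news", "jobs", "homes", "autos", "classifieds", "community"}


def build_search_url(user_query, vertical, base_url=SOLR_SELECT_URL):
    """Return a select URL that restricts the search to a single vertical."""
    if vertical not in ALLOWED_VERTICALS:
        raise ValueError(f"unknown vertical: {vertical!r}")
    params = [
        ("q", user_query),              # the user's own search terms
        ("fq", f"vertical:{vertical}"),  # restricts matches to one vertical
    ]
    return f"{base_url}?{urlencode(params)}"
```

A filter like this only narrows which documents can match. It does not change the index-wide term statistics that scoring is based on.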
|
|
| Re: solr instances for different content? |

|
2007-11-05 11:00:34 |
I don't think that will solve the relevance issues, given that the IDF
(described at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html)
is per document, not per field. In the end, though, it may be
negligible. Can you test it out fairly quickly?

One other thing to think about with multiple indexes is whether keeping
them separate affords you some extra flexibility at the cost of some
more up-front work. For instance, news is probably updated much more
frequently than classifieds, so you may want to tune it for frequent
updates and possibly even give it more hardware, whereas classifieds may
not be as critical (or vice versa, I'm not in the news biz). Naturally,
the tradeoff is that you need to develop tools to manage these various
indexes, whereas the single-index approach is already pretty well
understood.

I would expect that as the multicore
(https://issues.apache.org/jira/browse/SOLR-350) patch evolves, it is
going to bring in more management tools for working with various indexes
(perhaps you can donate your expertise if you go this route?).
-Grant
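
The concern about shared term statistics can be illustrated with a small calculation. This sketch assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)) and assumes the term lives in the same field for both content types. The document counts and frequencies are made up for illustration, not measured from any index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up counts for one term, e.g. a car brand name.
news_docs, news_df = 100_000, 50                     # rare in news
classifieds_docs, classifieds_df = 400_000, 40_000   # common in listings

# IDF as seen by a news-only index.
idf_news_only = classic_idf(news_df, news_docs)

# IDF as seen by one combined index of 500K documents. A "vertical" filter
# at query time does not change these index-wide counts.
idf_combined = classic_idf(news_df + classifieds_df,
                           news_docs + classifieds_docs)
```

With these numbers the term looks far less distinctive in the combined index than in a news-only index. Whether the difference matters in practice depends on the real term distributions, which is why testing on actual data is the sensible next step.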
|
|
| Re: solr instances for different content? |

|
2007-11-06 00:20:58 |
: I don't think that will solve the relevance issues, given that the IDF
: (described at
: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html)
: is per document, not per field. In the end, though, it may be negligible.
well ... yes, but every "Term" has a specific field, so if each
different type of document used completely different fields there would
be no "idf poisoning" across types.
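
A toy illustration of that point, counting document frequency per (field, token) pair the way a term is scoped to its field. The tokenizer is a naive lowercase split, and the field names (body, news_body, classified_body) and sample texts are hypothetical.

```python
from collections import Counter


def doc_freqs(documents):
    """Count, for each (field, token) pair, how many documents contain it."""
    df = Counter()
    for doc in documents:
        seen = set()
        for field, text in doc.items():
            for token in text.lower().split():
                seen.add((field, token))
        df.update(seen)  # each pair counted once per document
    return df


# Same three documents, indexed two ways.
shared = [
    {"body": "ford recalls trucks"},      # news
    {"body": "ford truck for sale"},      # classified
    {"body": "used ford sedan"},          # classified
]
separate = [
    {"news_body": "ford recalls trucks"},
    {"classified_body": "ford truck for sale"},
    {"classified_body": "used ford sedan"},
]

shared_df = doc_freqs(shared)
separate_df = doc_freqs(separate)
```

With a shared field, the classifieds inflate the document frequency that news queries see. With per-type fields, each type keeps its own count.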
you would still have the sorting issue ... but:
a) given enough RAM and few enough sort fields it's not a big deal.
b) sorting fields may actually be generic enough that they could be
common to more than one type (i.e. "datePosted", "price", etc.)
-Hoss
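
A back-of-envelope estimate for point (a), assuming sorting caches one fixed-width value per document per sort field. The 4-byte width and the count of three sort fields are assumptions. Actual memory use depends on the field types and the Solr/Lucene version, and string-based sort fields typically need more.

```python
def sort_cache_bytes(num_docs, num_sort_fields, bytes_per_value=4):
    """Rough estimate: one cached value per document per sort field."""
    return num_docs * num_sort_fields * bytes_per_value


# 500K documents (the estimate from the thread), 3 sort fields (assumed).
estimate = sort_cache_bytes(500_000, 3)
estimate_mib = estimate / (1024 * 1024)
```

Under these assumptions the total is a few megabytes, which is consistent with the "not a big deal" assessment for an index of this size.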
|
|
|
|