Thread: Re: Distributed index




Re: Distributed index
2007-06-22 08:36:37

Karol Rybak wrote:
>> Karol Rybak wrote:
>> > Hello, i have some questions about nutch in general. I need to create a
>> > simple web crawler, however we want to index a lot of documents it'll
>> > probably be about 100 million in future. I have a couple of servers i
>> > can
>>
>> 100 million pages = 50-100 servers and 20-40T of space distributed.
>> Ideally the setup would be processing machines and search servers.  You
>> would have say 50 or so processing machines that would handle the
>> crawling, indexing, mapreduce, and dfs.  Then you would have 50 more
>> somewhat less powered (possibly) servers that handle just serving the
>> search.  You can get away with having the processing and search servers
>> on the same machines but the search will slow down considerably while
>> running large jobs.
>
> Hello, thanks for your answer, 20-40T of space seems large, the question is
> do you store fetched files, or just indexes ? I don't want to maintain
> local storage, i need only indexing...
>

You need space to store the fetched documents (segments).  Even when
compressed, 100M documents take a lot of space.  You are going to have
crawldb, linkdb, and indexes, which effectively doubles the amount of
space you need.  This will have to be on a DFS because there is no
single machine that can handle this load and because RAID at this level
is prohibitively expensive.  On the DFS you are going to replicate your
data blocks at a minimum 3 times for redundancy, so you just tripled
your space.

You will still need space on the machines for processing the next jobs,
unless you plan to delete all of the databases and start from scratch
every time, which isn't advised.  So for sorts and other MapReduce job
processing you will want to leave approximately 30% of the space open
on each box.  Depending on the jobs you are running you may need more.

If you are using the same boxes for search servers you will then have
to copy the indexes from the DFS to local, which again doubles the
space needed.  The estimate that we use is 100-200G for every 1M pages
indexed.  You probably can get away with 50G per 1M pages but we have
large computational jobs that are running and we don't want to run out
of space.

A rough calculation would be ~4G compressed content per 1M pages
fetched initially or 4K compressed per fetched page. So 4G * 2 for
crawl, link, indexing = 8G * 3 for DFS replication = 36G * 1.3 for
processing space = 46.8G + 4G for local indexes = 50.8G.
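
The chain of factors above can be sketched as a small calculator. This
is only an illustration: the function name and parameter names are
hypothetical, and the defaults are the rough figures from this message.
Note that multiplying the stated factors straight through (8G * 3 = 24G)
gives about 35.2G per 1M pages rather than the 50.8G quoted, so treat
either number as a loose lower bound, not a measurement.

```python
def estimate_storage_gb(
    pages_millions: float,
    content_gb_per_million: float = 4.0,      # ~4G compressed content per 1M pages
    db_index_factor: float = 2.0,             # crawldb, linkdb, indexes roughly double it
    dfs_replication: int = 3,                 # minimum DFS block replication
    processing_headroom: float = 0.30,        # ~30% extra for sorts and MapReduce jobs
    local_index_gb_per_million: float = 4.0,  # local index copy on search servers
) -> float:
    """Rough total disk space in GB, following the factors in the message.

    All names and defaults here are assumptions for illustration.
    """
    on_dfs = content_gb_per_million * db_index_factor * dfs_replication
    with_headroom = on_dfs * (1.0 + processing_headroom)
    per_million = with_headroom + local_index_gb_per_million
    return pages_millions * per_million


if __name__ == "__main__":
    for millions in (1, 10, 100):
        total = estimate_storage_gb(millions)
        print(f"{millions}M pages: about {total:.1f} GB")
```

Raising `processing_headroom` or `dfs_replication` shows how quickly
the total moves toward the more conservative 100-200G per 1M pages
figure mentioned earlier.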

You said above that you don't want local storage.  Search has to be on
local file systems.  While you may technically be able to pull a search
result from the DFS you will almost certainly run out of memory and the
search will take an excessively long time (minutes, not subsecond) if
it returns.  Search is a hardware-intensive business in part because of
the number of servers that are needed to handle serving large indexes.

If anybody knows of a better way to set up a search architecture than
2-4M pages per index per search server I would love to hear about it.
The suggestions above on space and architecture are what we have
experienced.
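
As a quick illustration of the 2-4M pages per search server rule of
thumb, the following sketch (the function name is hypothetical)
computes how many search servers a given crawl size would imply.

```python
import math


def search_servers_needed(total_pages_millions: float,
                          pages_per_server_millions: float) -> int:
    """Number of search servers if each one serves a single index of
    the given size (in millions of pages). Illustrative only."""
    return math.ceil(total_pages_millions / pages_per_server_millions)


if __name__ == "__main__":
    # 100M pages at 4M and at 2M pages per server
    print(search_servers_needed(100, 4))  # prints 25
    print(search_servers_needed(100, 2))  # prints 50
```

For 100M pages this gives 25 to 50 search servers, in line with the
"50 more" search servers suggested earlier in the thread.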

Dennis Kubes

Re: Distributed index
2007-06-22 08:46:52
On 6/22/07, Dennis Kubes <kubesapache.org> wrote:
> You said above that you don't want local storage.  Search has to be on
> local file systems.  While you may technically be able to pull a search
> result from the DFS you will almost certainly run out of memory and the
> search will take an excessively long time (minutes, not subsecond) if it
> returns.  Search is a hardware intensive business in part because of the
> number of servers that are needed to handle serving large indexes.

Actually, as long as indexes are on local machines, fetching summaries
from DFS is not that slow (probably less than 5 seconds). Obviously,
also storing them locally improves performance (to subsecond levels).


-- 
Doğacan Güney