On 6/22/07, Dennis Kubes <kubes apache.org> wrote:
>
>
> Karol Rybak wrote:
> >>
> >>
> >> Karol Rybak wrote:
> >> > Hello, i have some questions about nutch in general. I need to create a
> >> > simple web crawler, however we want to index a lot of documents it'll
> >> > probably be about 100 million in future. I have a couple of servers i
> >> > can
> >>
> >> 100 million pages = 50-100 servers and 20-40T of space distributed.
> >> Ideally the setup would be processing machines and search servers. You
> >> would have say 50 or so processing machines that would handle the
> >> crawling, indexing, mapreduce, and dfs. Then you would have 50 more
> >> somewhat less powered (possibly) servers that handle just serving the
> >> search. You can get away with having the processing and search servers
> >> on the same machines but the search will slow down considerably while
> >> running large jobs.
> >
> >
> >
> > Hello, thanks for your answer, 20-40T of space seems large, the question is
> > do you store fetched files, or just indexes ? I don't want to maintain
> > local storage, i need only indexing...
> >
>
> You need space to store the fetched documents (segments). Even when
> compressed, 100M documents take a lot of space. You are going to have
> crawldb, linkdb, and indexes which effectively doubles the amount of
> space you need. This will have to be on a DFS because there is no
> single machine that can handle this load and because RAID at this level
> is prohibitively expensive. On the DFS you are going to replicate your
> data blocks at a minimum 3 times for redundancy so you just tripled your
> space.
>
> You will still need space on the machines for processing the next jobs,
> unless you plan to delete all of the databases and start from scratch
> every time which isn't advised. So for sorts and other map reduce job
> processing you will want to leave approximately 30% of the space open on
> each box. Depending on the jobs you are running you may need more.
>
> If you are using the same boxes for search servers you will then have to
> copy the indexes from the DFS to local which again doubles the space
> needed. The estimate that we use is 100-200G for every 1M pages
> indexed. You probably can get away with 50G per 1M pages but we have
> large computational jobs that are running and we don't want to run out
> of space.
>
> A rough calculation would be ~4G compressed content per 1M pages fetched
> initially or 4K compressed per fetched page. So 4G * 2 for crawl, link,
> indexing = 8G * 3 for DFS replication = 36G * 1.3 for processing space =
> 46.8G + 4G for local indexes = 50.8G.
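
The chain of factors above can be checked with a few lines of Python. This is only a sketch: the function and parameter names are made up for illustration, and the factors are the ones quoted in the message. Note that applying the quoted factors literally (4G * 2 * 3) gives 24G on the DFS and about 35.2G in total; the 36G and 50.8G figures correspond to multiplying by 3 rather than 2 for crawl, link, and indexing.

```python
# Per-1M-page storage estimate using the factors quoted in the message.
# Names are illustrative only, not part of Nutch or Hadoop.

def space_per_million_pages(
    content_gb=4.0,            # ~4G compressed content per 1M fetched pages
    db_index_factor=2.0,       # crawldb, linkdb, and indexes on top of segments
    dfs_replication=3,         # DFS block replication
    processing_headroom=1.3,   # the "* 1.3 for processing space" step
    local_index_gb=4.0,        # index copy on the search servers' local disks
):
    on_dfs = content_gb * db_index_factor * dfs_replication
    with_headroom = on_dfs * processing_headroom
    return with_headroom + local_index_gb

literal = space_per_million_pages()                      # factors as written
as_quoted = space_per_million_pages(db_index_factor=3.0)  # reproduces 50.8G

print(f"factors as written: {literal:.1f}G per 1M pages")
print(f"matching the quoted total: {as_quoted:.1f}G per 1M pages")
```

Either way the result is in the same ballpark as the "50G per 1M pages" lower bound mentioned earlier, so the conclusion about order of magnitude does not change.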
>
> You said above that you don't want local storage. Search has to be on
> local file systems. While you may technically be able to pull a search
> result from the DFS you will almost certainly run out of memory and the
> search will take an excessively long time (minutes, not subsecond) if it
> returns. Search is a hardware intensive business in part because of the
> number of servers that are needed to handle serving large indexes.
Actually, as long as indexes are on local machines, fetching summaries
from DFS is not that slow (probably less than 5 seconds). Obviously,
also storing them locally improves performance (to subsecond levels).
>
> If anybody knows of a better way to set up a search architecture than
> 2-4M pages per index per search server I would love to hear about it.
> The former suggestions of space and architecture are what we have
> experienced.
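
For the 100 million pages discussed in this thread, the 2-4M pages per search server figure translates into a server count as follows. This is a minimal sketch with a hypothetical helper name, assuming one index per search server and no extra replicas for failover or query load.

```python
import math

def search_servers_needed(total_pages, pages_per_server):
    # One index per search server, rounded up to a whole machine.
    return math.ceil(total_pages / pages_per_server)

total_pages = 100_000_000
fewest = search_servers_needed(total_pages, 4_000_000)  # 4M pages per server
most = search_servers_needed(total_pages, 2_000_000)    # 2M pages per server
print(f"{fewest} to {most} search servers")
```

That range of 25 to 50 machines lines up with the "50 more" search servers suggested earlier in the thread.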
>
> Dennis Kubes
>
--
Doğacan Güney