Look at how nutch distributes search.
Each search engine has an independent index that contains a
fraction of the
documents. Each one does every search and the results are
merged. Each
engine has an index of 5-10 million documents so a web scale
index takes
100-1000 machines.
This is described in the Lucene book.
On 8/23/07 8:56 AM, "Samuel LEMOINE"
<samuel.lemoine lingway.com> wrote:
> Thanks so much, it helps me a lot. I'm actually quite
lost with Hadoop's
> mechanisms.
> The point of my study is to distribute the Lucene
searching phase with
> Hadoop...
> According to what I'v understood, a way to distribute
the search over a
> big Lucene's index would be to put this index on HDFS,
and to implement
> the Lucene search job under the Mapper interface, am I
right ?
> But I'm stuck because of Lucene searchable
architecture... the
> IndexReader takes the whole path where's located the
index as argument,
> I don't see how to distribute it...
> Well I guess this issue is quite different of the
original subject of
> this thread, maybe should I post a new message about
this issue.
>
>
> Arun C Murthy a écrit :
>> Samuel,
>>
>> Samuel LEMOINE wrote:
>>> Well, I don't get it... when you pass arguments
to a map job, you
>>> just give a key and a value, how can hadoop
make the link between
>>> those arguments and the data's concerned?
Really, your answer don't
>>> help me at all, sorry ^^
>>>
>>
>> The input of a map-reduce job is a file or a bunch
of files. These
>> files are usually stored on HDFS, which splits up a
logical file into
>> physical blocks of fixed size (configurable with
default size of
>> 128MB). Each block is replicated for reliability.
>>
>> The important point to note is that both the HDFS
and Map-Reduce
>> clusters run on the same hardware i.e. a combined
data and compute
>> cluster.
>>
>> Now when you launch a job (with lots of maps and
reduces) the inputs
>> file-sets are split into FileSplits (logical
splits, user can control
>> the splitting). Now the framework schedules as many
maps as there are
>> splits i.e. there is a one-to-one correspondence
between maps and
>> splits and each map processes one input split.
>>
>> The key idea is to try and *schedule* each map on
the _datanode_ (i.e.
>> one among the set of datanodes) which contains the
actual block for
>> the logical input-split that the map is supposed to
process. This is
>> what we refer to as 'data-locality. Hence we move
the computation (the
>> actual map) to the data (input split).
>>
>> This is feasible due to:
>> a) HDFS & Map-Reduce share the same physical
cluster.
>> b) HDFS exposes (via relevant apis) the underlying
block-locations
>> where a file is physically stored on the
file-system.
>>
>> hth,
>> Arun
>>
>>
>> Essentially what Hadoop's map-reduce tries to do is
to schedule *maps* on
>>> Devaraj Das a écrit :
>>>
>>>> That's the paradigm of Hadoop's
Map-Reduce.
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel LEMOINE
[mailto:samuel.lemoine lingway.com] Sent:
>>>>> Thursday, August 23, 2007 2:48 PM
>>>>> To: hadoop-user lucene.apache.org
>>>>> Subject: "Moving Computation is
Cheaper than Moving Data"
>>>>>
>>>>> When I read the Hadoop documentation:
>>>>> The Hadoop Distributed File System:
Architecture and Design
>>>>> (http
://lucene.apache.org/hadoop/hdfs_design.html)
>>>>>
>>>>> a paragraph hold my attention:
>>>>>
>>>>>
>>>>> "Moving Computation is
Cheaper than Moving Data"
>>>>>
>>>>> A computation requested by an
application is much more efficient if
>>>>> it is executed near the data it
operates on. This is especially
>>>>> true when the size of the data set is
huge. This minimizes network
>>>>> congestion and increases the overall
throughput of the system. The
>>>>> assumption is that it is often better
to migrate the computation
>>>>> closer to where the data is located
rather than moving the data to
>>>>> where the application is running. HDFS
provides interfaces for
>>>>> applications to move themselves closer
to where the data is located.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I'd like to know how to perform that,
espacially with the aim of
>>>>> distributed Lucene search ? Which
Hadoop classes should I use to do
>>>>> that ?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Samuel
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
|