Thread: Re: "Moving Computation is Cheaper than Moving Data"




2007-08-23 13:07:37
Look at how Nutch distributes search.

Each search engine has an independent index that contains a fraction of
the documents.  Each one does every search and the results are merged.
Each engine has an index of 5-10 million documents so a web scale index
takes 100-1000 machines.

This is described in the Lucene book.
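
The scatter-and-merge pattern described above can be sketched roughly as follows. This is only an illustration, not Nutch or Lucene code: the shards are plain in-memory dictionaries, the score is a simple term count, and the names `SHARDS`, `search_shard` and `distributed_search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy shards: each maps doc id -> text and stands in for one
# search engine holding an independent index over a fraction of the documents.
SHARDS = [
    {"a": "hadoop hadoop lucene", "b": "nutch"},
    {"c": "hadoop", "d": "lucene lucene"},
]


def search_shard(shard, term, k):
    """Run the query against one shard and return its top-k (score, doc_id) hits."""
    term = term.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Send the same query to every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    # Each shard only needs to return its own top k for the global top k
    # to be correct after merging.
    return heapq.nlargest(k, (hit for part in partials for hit in part))


if __name__ == "__main__":
    print(distributed_search(SHARDS, "hadoop", k=2))
```

In a real deployment each shard would be a separate machine queried over the network; the thread pool here only mimics the fan-out.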


On 8/23/07 8:56 AM, "Samuel LEMOINE" <samuel.lemoinelingway.com> wrote:

> Thanks so much, it helps me a lot. I'm actually quite lost with Hadoop's
> mechanisms.
> The point of my study is to distribute the Lucene searching phase with
> Hadoop...
> According to what I've understood, a way to distribute the search over a
> big Lucene index would be to put this index on HDFS, and to implement
> the Lucene search job under the Mapper interface, am I right?
> But I'm stuck because of Lucene's searchable architecture... the
> IndexReader takes the whole path where the index is located as argument,
> I don't see how to distribute it...
> Well, I guess this issue is quite different from the original subject of
> this thread, maybe I should post a new message about this issue.
> 
> 
> Arun C Murthy wrote:
>> Samuel,
>> 
>> Samuel LEMOINE wrote:
>>> Well, I don't get it... when you pass arguments to a map job, you
>>> just give a key and a value, how can hadoop make the link between
>>> those arguments and the data concerned? Really, your answer doesn't
>>> help me at all, sorry ^^
>>> 
>> 
>> The input of a map-reduce job is a file or a bunch of files. These
>> files are usually stored on HDFS, which splits up a logical file into
>> physical blocks of fixed size (configurable with default size of
>> 128MB). Each block is replicated for reliability.
>> 
>> The important point to note is that both the HDFS and Map-Reduce
>> clusters run on the same hardware i.e. a combined data and compute
>> cluster.
>> 
>> Now when you launch a job (with lots of maps and reduces) the input
>> file-sets are split into FileSplits (logical splits, user can control
>> the splitting). Now the framework schedules as many maps as there are
>> splits i.e. there is a one-to-one correspondence between maps and
>> splits and each map processes one input split.
>> 
>> The key idea is to try and *schedule* each map on the _datanode_ (i.e.
>> one among the set of datanodes) which contains the actual block for
>> the logical input-split that the map is supposed to process. This is
>> what we refer to as 'data-locality'. Hence we move the computation (the
>> actual map) to the data (input split).
>> 
>> This is feasible due to:
>> a) HDFS & Map-Reduce share the same physical cluster.
>> b) HDFS exposes (via relevant apis) the underlying block-locations
>>    where a file is physically stored on the file-system.
>> 
>> hth,
>> Arun
>> 
>> 
>> Essentially what Hadoop's map-reduce tries to do is to schedule *maps* on
>>> Devaraj Das wrote:
>>> 
>>>> That's the paradigm of Hadoop's Map-Reduce.
>>>>  
>>>> 
>>>>> -----Original Message-----
>>>>> From: Samuel LEMOINE [mailto:samuel.lemoinelingway.com]
>>>>> Sent: Thursday, August 23, 2007 2:48 PM
>>>>> To: hadoop-userlucene.apache.org
>>>>> Subject: "Moving Computation is Cheaper than Moving Data"
>>>>> 
>>>>> When I read the Hadoop documentation:
>>>>> The Hadoop Distributed File System: Architecture and Design
>>>>> (http://lucene.apache.org/hadoop/hdfs_design.html)
>>>>> 
>>>>> a paragraph held my attention:
>>>>> 
>>>>> 
>>>>>       "Moving Computation is Cheaper than Moving Data"
>>>>> 
>>>>> A computation requested by an application is much more efficient if
>>>>> it is executed near the data it operates on. This is especially
>>>>> true when the size of the data set is huge. This minimizes network
>>>>> congestion and increases the overall throughput of the system. The
>>>>> assumption is that it is often better to migrate the computation
>>>>> closer to where the data is located rather than moving the data to
>>>>> where the application is running. HDFS provides interfaces for
>>>>> applications to move themselves closer to where the data is located.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I'd like to know how to perform that, especially with the aim of
>>>>> distributed Lucene search? Which Hadoop classes should I use to do
>>>>> that?
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> Samuel
>>>>> 
>>>>>     
>>>> 
>>>> 
>>>> 
>>>>   
>>> 
>>> 
>> 
>> 
> 
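
For readers following the quoted explanation of splits and data-locality, here is a minimal sketch of the idea. It is not Hadoop's scheduler (which hands out tasks as worker nodes report in); it assumes a simplified model where block locations and free map slots are known up front. The function names and host names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Cut a logical file into fixed-size (offset, length) blocks; the last may be short."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def schedule_maps(block_locations, free_slots):
    """Assign one map per split, preferring a node that stores the split's block.

    block_locations: one list of host names per split (the replicas of its block).
    free_slots: dict of host name -> number of free map slots.
    Returns a list of (split_index, host, is_local) tuples.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignments = []
    for index, replicas in enumerate(block_locations):
        # First choice: a node that already holds the data (data-locality).
        host = next((h for h in replicas if slots.get(h, 0) > 0), None)
        is_local = host is not None
        if host is None:
            # Fallback: any node with a free slot; the block must cross the network.
            host = next((h for h, n in slots.items() if n > 0), None)
            if host is None:
                raise RuntimeError("no free map slots left")
        slots[host] -= 1
        assignments.append((index, host, is_local))
    return assignments


if __name__ == "__main__":
    print(split_into_blocks(300, 128))
    print(schedule_maps([["n1", "n2"], ["n2", "n3"], ["n1", "n3"]],
                        {"n1": 1, "n2": 1, "n3": 1}))
```

The one-to-one mapping between splits and maps is kept; only the placement changes depending on where replicas and free slots happen to be.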

