I've been using the Hadoop project off and on for the last
year in some
ongoing work studying Wikipedia. One of the tasks I
developed computes the
revision-to-revision diff across all edits in the Wikipedia
history. From
the time I first developed the job (last summer) to the
latest operation
(last week, running on the 0.13.0 release), I've seen a
pretty remarkable
increase in performance. Even though the the input size has
more than
doubled, the time to run the job on Hadoop has dropped by
half, for a
roughly 4x overall improvement in performance. Thanks
everyone!
--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
|