File open time is an issue if you have lots and lots of little files.
If you are doing this analysis once or a few times, then it isn't worth
reformatting into a few larger files.
If you are likely to do this analysis dozens of times, then reading a
few larger files instead of many small ones will probably give you a
significant benefit in terms of runtime.
If the runtime isn't terribly important, then the filename-per-line
approach (an input file that lists one HTML file path per line, with the
mapper opening each file itself) will work fine.
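As a rough sketch of what the filename-per-line mapper could look like,
here is a small Python function. The tab-separated "url<TAB>path" line
layout, the function name map_file_list, and the handle_page callback are
all assumptions made for illustration; it also assumes the paths can be
opened as ordinary files, whereas on a real cluster you would read them
from the DFS instead.

```python
def map_file_list(lines, handle_page):
    """For each 'url<TAB>path' line, open the file and hand its HTML on.

    `lines` is any iterable of text lines; `handle_page(url, html)` is a
    hypothetical callback standing in for whatever parsing you do.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumed layout: URL, one tab, then the path to the HTML file.
        url, path = line.split("\t", 1)
        # One open() per page: this is the per-file cost discussed above.
        with open(path, encoding="utf-8", errors="replace") as f:
            html = f.read()
        yield url, handle_page(url, html)
```

Feeding it the file list and a callback such as
lambda url, html: len(html) would yield one (url, result) pair per page.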
Note that the filename-per-line approach is also a great way to do the
pre-processing into a few large files, which can then be analyzed
faster.
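A minimal sketch of that pre-processing step, in Python: it walks a
filename-per-line list and writes one "url<TAB>html" record per page into
a single large file, along the lines of the format proposed in the
question below. The tab-separated layout and the name pack_pages are
assumptions for illustration, and the paths are assumed to be readable
as ordinary files. Collapsing whitespace keeps each page on one line, but
it also alters whitespace-sensitive content such as <pre> blocks.

```python
def pack_pages(lines, out):
    """Write one 'url<TAB>html' record per 'url<TAB>path' input line.

    `lines` is an iterable of text lines, `out` is a writable text file.
    Returns the number of pages written.
    """
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumed layout: URL, one tab, then the path to the HTML file.
        url, path = line.split("\t", 1)
        with open(path, encoding="utf-8", errors="replace") as f:
            html = f.read()
        # Collapse every run of whitespace (newlines and tabs included) to
        # a single space, so each page fits on one line and the only tab
        # in the record is the separator.
        flat = " ".join(html.split())
        out.write(f"{url}\t{flat}\n")
        count += 1
    return count
```

Running this once over the whole list gives you a file you can analyze
repeatedly without paying the per-file open cost each time.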
On 10/24/07 5:09 PM, "David Balatero" <ezwelty u.washington.edu> wrote:
> I have a corpus of 300,000 raw HTML files that I want to read in and
> parse using Hadoop. What is the best input file format to use in this
> case? I want to have access to each page's raw HTML in the mapper, so
> I can parse from there.
>
> I was thinking of preprocessing all the files, removing the new
> lines, and putting them in a big <key, value> file:
>
> url1, html with stripped new lines
> url2, ....
> url3, ....
> ...
> urlN, ....
>
> I'd rather not do all this preprocessing, just to wrangle the text
> into Hadoop. Any other suggestions? What if I just stored the path to
> the HTML file in a <key, value> type
>
> url1, path_to_file1
> url2, path_to_file2
> ...
> urlN, path_to_fileN
>
> Then in the mapper, I could read each file in from the DFS on the
> fly. Anyone have any other good ideas? I feel like there's some key
> function that I'm just stupidly overlooking...
>
> Thanks!
> David Balatero