Thread: Parsing a directory of 300,000 HTML files?




Parsing a directory of 300,000 HTML files?
2007-10-24 19:09:49
I have a corpus of 300,000 raw HTML files that I want to read in and
parse using Hadoop. What is the best input file format to use in this
case? I want to have access to each page's raw HTML in the mapper, so
I can parse from there.

I was thinking of preprocessing all the files, removing the new
lines, and putting them in a big <key, value> file:

url1, html with stripped new lines
url2, ....
url3, ....
...
urlN, ....
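
A minimal sketch of that record format, assuming a tab as the key/value
separator (the layout above uses a comma, which can also occur inside
the HTML). The helper names to_record and from_record are hypothetical,
not part of Hadoop. Note that collapsing whitespace also alters text
inside <pre> blocks, which may or may not matter for the parse.

```python
import re


def to_record(url: str, html: str) -> str:
    """Collapse one page into a single 'url<TAB>html' line."""
    # Every run of whitespace (newlines, tabs, spaces) becomes one space,
    # so the record holds exactly one tab and no line breaks.
    flat = re.sub(r"\s+", " ", html).strip()
    return f"{url}\t{flat}"


def from_record(line: str) -> tuple[str, str]:
    """Split a record line back into (url, html)."""
    url, html = line.rstrip("\n").split("\t", 1)
    return url, html
```

A mapper reading such a file line by line could call from_record on
each line to get the URL and the flattened HTML.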

I'd rather not do all this preprocessing, just to wrangle the text
into Hadoop. Any other suggestions? What if I just stored the path to
the HTML file in a <key, value> type

url1, path_to_file1
url2, path_to_file2
...
urlN, path_to_fileN

Then in the mapper, I could read each file in from the DFS on the
fly. Anyone have any other good ideas? I feel like there's some key
function that I'm just stupidly overlooking...
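
A rough sketch of the path-per-line idea, written as plain Python rather
than against the Hadoop API. It assumes each index line is "url, path"
and that paths contain no commas. map_index and read_local are
hypothetical names; read_local opens a local file and stands in for
whatever DFS client the real mapper would use.

```python
from typing import Callable, Iterable, Iterator


def read_local(path: str) -> str:
    """Stand-in for a DFS read: load one file from the local disk."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def map_index(
    lines: Iterable[str],
    read_file: Callable[[str], str] = read_local,
) -> Iterator[tuple[str, str]]:
    """For each 'url, path' line, yield (url, raw HTML of that file)."""
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or malformed index lines
        # Split on the last comma, since URLs may themselves contain commas.
        url, path = (part.strip() for part in line.rsplit(",", 1))
        yield url, read_file(path)
```

Each call to read_file opens one small file, so the per-file open cost
is paid once per page on every run.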

Thanks!
David Balatero

Re: Parsing a directory of 300,000 HTML files?
2007-10-24 19:29:18

File open time is an issue if you have lots and lots of little files.

If you are doing this analysis once or a few times, then it isn't worth
reformatting into a few larger files.

If you are likely to do this analysis dozens of times, then opening larger
files will probably give you a significant benefit in terms of runtime.

If the runtime isn't terribly important, then the filename per line approach
will work fine.

Note that the filename per line approach is a great way to do the
pre-processing into a few large files which will then be analyzed faster.
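
As a local illustration of packing many small pages into a few large
files, here is a sketch that spreads (url, html) pairs over a fixed
number of output files. It is not Hadoop code: in a real job the
framework's own output files would play this role. The names pack and
shard_for, the part-NNNNN.tsv file names, and the tab-separated record
layout are all assumptions made for the example.

```python
import os
import zlib
from typing import Iterable


def shard_for(url: str, num_shards: int) -> int:
    """Pick an output file for a URL using a stable checksum."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


def pack(
    pages: Iterable[tuple[str, str]],
    out_dir: str,
    num_shards: int = 4,
) -> list[str]:
    """Write each page as one 'url<TAB>html' line into one of a few files."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part-{i:05d}.tsv") for i in range(num_shards)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for url, html in pages:
            # Collapse all whitespace so every record fits on one line.
            flat = " ".join(html.split())
            files[shard_for(url, num_shards)].write(f"{url}\t{flat}\n")
    finally:
        for f in files:
            f.close()
    return paths
```

Later runs then open num_shards files instead of one file per page.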

On 10/24/07 5:09 PM, "David Balatero" <ezweltyu.washington.edu> wrote:

> I have a corpus of 300,000 raw HTML files that I want to read in and
> parse using Hadoop. What is the best input file format to use in this
> case? I want to have access to each page's raw HTML in the mapper, so
> I can parse from there.
>
> I was thinking of preprocessing all the files, removing the new
> lines, and putting them in a big <key, value> file:
>
> url1, html with stripped new lines
> url2, ....
> url3, ....
> ...
> urlN, ....
>
> I'd rather not do all this preprocessing, just to wrangle the text
> into Hadoop. Any other suggestions? What if I just stored the path to
> the HTML file in a <key, value> type
>
> url1, path_to_file1
> url2, path_to_file2
> ...
> urlN, path_to_fileN
>
> Then in the mapper, I could read each file in from the DFS on the
> fly. Anyone have any other good ideas? I feel like there's some key
> function that I'm just stupidly overlooking...
>
> Thanks!
> David Balatero

