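I've found the solution.
It's quite simple, actually, and purely Java-related: write the byte array
with a plain FileOutputStream instead of wrapping it in an ObjectOutputStream.
Code: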
    // content is the Content object obtained in the fetcher.
    byte[] data = content.getContent();
    String file = application.getRealPath("/") + "file.dat";
    FileOutputStream fileoutputstream = new FileOutputStream(file);
    try {
        // Write the raw bytes in one call; FileOutputStream adds no framing.
        fileoutputstream.write(data);
    } finally {
        fileoutputstream.close();
    }
That solved the issue.
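
For anyone who hits the same problem: the extra bytes in the earlier attempt
most likely came from ObjectOutputStream itself, which writes a serialization
stream header and wraps data passed to write(byte[]) in block-data records.
The sketch below is only an illustration. The class name StreamOverheadDemo,
its helper methods, and the 3000-byte sample array are made up for the
example, and it writes to memory rather than to a file. It compares how many
bytes each kind of stream produces for the same input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Nutch.
public class StreamOverheadDemo {

    // Pushes the data through an ObjectOutputStream and returns
    // everything that reached the underlying stream.
    public static byte[] viaObjectStream(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream obj = new ObjectOutputStream(sink)) {
            obj.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Writes the data directly to the underlying stream, the same way
    // a plain FileOutputStream would write it to a file.
    public static byte[] viaPlainStream(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = sink) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[3000]; // stand-in for content.getContent()
        byte[] plain = viaPlainStream(data);
        byte[] wrapped = viaObjectStream(data);

        System.out.println("plain stream:  " + plain.length + " bytes");
        System.out.println("object stream: " + wrapped.length + " bytes");
        // The object stream output starts with the serialization magic number.
        System.out.printf("object stream starts with: %02X %02X%n",
                wrapped[0] & 0xFF, wrapped[1] & 0xFF);
    }
}
```

The plain stream output is byte-for-byte identical to the input, while the
object stream output is somewhat longer and no longer starts with the file's
own first bytes. That matches the "slightly bigger than the wget copy"
symptom described below.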
On 10/9/07, eyal edri <eyal.edri gmail.com> wrote:
>
> Can anyone help with this?
> Is there another Java I/O class I can use for saving the byte array?
>
> Eyal.
>
> On 9/22/07, eyal edri <eyal.edri gmail.com> wrote:
> >
> > Am I catching this content byte array too late (in the code)?
> >
> > Is there an earlier data field that holds the page content before the
> > content byte array?
> >
> > thanks,
> >
> >
> > On 9/20/07, eyal edri <eyal.edri gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I've made some progress with downloading files (EXE/ZIP).
> > > I'm not using the plugin system yet; for now I've just injected code
> > > into "fetcher.java" to test it.
> > > I've written the following code (after this line:
> > > Content content = output.getContent(); ):
> > >
> > >
> > > // - save the file to fs
> > > // define regex to capture domain name & filename
> > > Pattern regex = Pattern.compile("http://([^/]*).*/([^/]*)$");
> > > Matcher urlMatcher = regex.matcher(content.getUrl());
> > >
> > > String domain = null;
> > > String fileLast = null;
> > > // get $1 & $2 backreferences from regex
> > > while ( urlMatcher.find() ) {
> > >     domain = urlMatcher.group(1);
> > >     fileLast = urlMatcher.group(2);
> > > }
> > > LOG.info("filename " + fileLast);
> > > LOG.info("domain " + domain);
> > > File downloadDir = new File("/home/eyale/nutch/DOWNLOADS/" + domain);
> > > // CHECK IF DIR EXISTS
> > > if ( !downloadDir.exists() )
> > >     downloadDir.mkdir();
> > > String filename = downloadDir + "/" + fileLast;
> > >
> > > FileOutputStream out = new FileOutputStream(new File(filename));
> > > ObjectOutputStream obj = new ObjectOutputStream(out);
> > >
> > > // the content.getContent() returns a byte array
> > > obj.write(content.getContent());
> > > obj.close();
> > >
> > > After downloading this file, I found that it is slightly bigger
> > > than the original file (compared with the file retrieved with wget).
> > > Why is that? Does this byte array contain more information/data?
> > > How can I get the real file data only?
> > >
> > > thanks,
> > >
> > >
> > > On 9/11/07, Martin Kuen <martin.kuen gmail.com> wrote:
> > > >
> > > > hi,
> > > >
> > > > I don't think that Nutch can be configured to store each downloaded
> > > > file as a file (one file downloaded, one file on your local disk).
> > > > The byte array called "content" can be stored directly, I think.
> > > > That's worth a try. The fetcher uses (binary) streams to handle
> > > > the downloaded content, so I think it *should* be okay.
> > > >
> > > > Another approach (my two cents):
> > > > 1. Run the fetcher with the -noParse option (most likely not even
> > > >    necessary)
> > > > 2. Check whether the fetcher is set to store the content (there is a
> > > >    property in nutch-default.xml)
> > > > 3. Create a dump with the "readseg" command and the "-dump" option
> > > > 4. Process the dump file and cut out what is necessary
> > > >
> > > > Just interested if that could work . . . however:
> > > > I had a look at the class implementing the readseg command and found
> > > > that the dump file is created with a "PrintWriter". That will cause
> > > > trouble for binary content, I think. Maybe you can modify the
> > > > SegmentReader (use an OutputStream).
> > > >
> > > > Regarding the fetcher: it's using a binary stream to store the
> > > > content (FSDataOutputStream).
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Martin
> > > >
> > > >
> > > > On 9/11/07, eyal edri <eyal.edri gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've asked this question before on a different mailing list, with no
> > > > > real response.
> > > > > I hope someone has seen the need for this and can help.
> > > > >
> > > > > I'm trying to configure Nutch to download certain file types
> > > > > (exe/zip) to the file system while crawling.
> > > > > I know Nutch doesn't have a parse-exe plugin, so I'll focus on ZIP
> > > > > (once I understand the logic, I will write a parse-exe plugin).
> > > > >
> > > > > I want to know whether Nutch supports downloading files inherently
> > > > > (using only conf files) or, if not, how I can alter the parse-zip
> > > > > plugin in order to download the file.
> > > > > (I saw that the parser gets a byte array called "content"; can I
> > > > > save this to the fs?)
> > > > >
> > > > > thanks,
> > > > >
> > > > >
> > > > > --
> > > > > Eyal Edri
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Eyal Edri
> >
> >
> >
> >
> > --
> > Eyal Edri
>
>
>
>
> --
> Eyal Edri
--
Eyal Edri