Thread: Apache Droids - standalone crawl framework
Apache Droids - standalone crawl framework | Spain | 2007-02-20 10:26:49
Hi all,

I have finished a first version of an Apache Labs project called Apache
Droids and checked it in. All Apache committers have write access there,
so feel free to enhance the code; everybody is welcome to join the
effort.

Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
started.

What is this?
-------------
Droids aims to be an intelligent standalone crawl framework that
automatically seeks out relevant online information based on the user's
specifications. The core is a simple crawler that can be easily extended
by plugins: if a project or application needs special processing for a
crawled URL, one can write a plugin to implement that functionality.

Why was it created?
-------------------
Mainly out of personal curiosity.
The background of this work is that Cocoon trunk no longer provides a
crawler, and Forrest is based on it, meaning we cannot update until we
have found a crawler replacement. Getting more involved in Solr and
Nutch, I also see requests for a generic standalone crawler.

What does the first implementation, crawler-x-m02y07, look like?
----------------------------------------------------------------
I took Nutch, ripped out its awesome plugin/extension framework, and
modified it to create the Droids core. Now I can implement all
functionality in plugins, which should make Droids very easy to extend.
I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
- crawl a URL (CrawlerImpl)
- extract links (only <a/> ATM) via a parse-html plugin
- merge them with the queue
- save or print out the crawled pages.
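
The four steps above can be sketched as a small loop. This is only an
illustration under assumptions, not the Droids code: the class
CrawlSketch, its methods, and the in-memory fetcher are hypothetical
names, and a regular expression stands in for the parse-html plugin.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the Droids code base.
class CrawlSketch {

    // Matches the href value of <a> tags only (double-quoted values).
    private static final Pattern A_HREF = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Step 2: extract links from the page content.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Runs the four steps until the queue is empty.
    // Returns the URLs in the order they were crawled.
    static List<String> crawl(String seed, Function<String, String> fetcher) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        List<String> crawled = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = fetcher.apply(url);        // step 1: crawl a URL
            if (html == null) {
                continue;                            // nothing fetched, skip
            }
            for (String link : extractLinks(html)) { // step 2: extract links
                if (seen.add(link)) {
                    queue.add(link);                 // step 3: merge with the queue
                }
            }
            crawled.add(url);
            System.out.println("crawled " + url);    // step 4: print the result
        }
        return crawled;
    }

    public static void main(String[] args) {
        // In-memory "web" so the sketch runs without network access.
        Map<String, String> pages = new HashMap<>();
        pages.put("a", "<a href=\"b\">b</a> <a href=\"c\">c</a>");
        pages.put("b", "<a href=\"a\">back</a>");
        pages.put("c", "no links here");
        crawl("a", pages::get);
    }
}
```

The fetcher is passed in as a function so that fetching, like the other
steps, could be swapped out, which is roughly the role plugins play in
the description above.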
Why crawler-x-m02y07?
---------------------
Droids tries to be a framework for different droids.
The first implementation is a "crawler" with the name "x",
first archived in the second "m"onth of the "y"ear 20"07".

Next steps
----------
I still need to write a droids factory so that one can write an
implementation other than Xm02y07 as a crawler plugin and invoke it via
the CLI. Another to-do is to implement a dependency system like Apache
Ivy instead of copying the Nutch approach.

Open questions for nutch
------------------------
Is there interest in a Nutch crawler plugin that uses native Nutch
plugins and imitates the Nutch crawler in Droids? Is Nutch interested in
such a plugin? Does it make sense?

Please test and report feedback to labs labs.apache.org. I will happily
answer all mails there.
salu2
--
Thorsten Scherler
thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions

Re: Apache Droids - standalone crawl framework | United States | 2007-02-20 15:10:26
Hi Thorsten,

I have quickly looked at the Droids code and was wondering why you don't
want to completely reuse the Nutch plugin API in Droids. This way, you
could reuse the Nutch parse-* plugins without modifications. Just trying
to understand...

Thanks,
Renaud
--
Renaud Richardet +1 617 230 9112
my email is my first name at apache.org http://www.oslutions.com

Re: Apache Droids - standalone crawl framework | 2007-02-20 16:04:39
On 2/20/07, Renaud Richardet <renaud apache.org> wrote:
> Hi Thorsten,
>
> I have quickly looked at the Droid code, and was wondering why you don't
> want to completely reuse the Nutch plugin API in Droid. This way, you
> could reuse the Nutch parse-* plugins without modifications. Just trying
> to understand...

Hmm, interesting. I am not fully on board with Nutch, but what would the
end output of such a crawl be? For example:

HTML file 1 crawl --> parse-html --> parsing rule, say "pick up <a
href=> tags" --> dump it to a predefined text file (e.g. one "a href"
tag per line, or whatever, based on a template or something)

Or something else? The current Nutch saves its output in a binary
format, so I am trying to understand here as well.

Regards

Re: Apache Droids - standalone crawl framework | United States | 2007-02-20 16:53:38
rubdabadub wrote:
> On 2/20/07, Renaud Richardet <renaud apache.org> wrote:
>> Hi Thorsten,
>>
>> I have quickly looked at the Droid code, and was wondering why you don't
>> want to completely reuse the Nutch plugin API in Droid. This way, you
>> could reuse the Nutch parse-* plugins without modifications. Just trying
>> to understand...
>
> hmm.. interesting .. I am not fully on board with Nutch. But how would
> the end output of such crawl be.. as i.e ..
>
> HTML file 1 crawl --> parse-html --> parsing rule say.. "pick up <a
> href=> tags" --> dump it to a predefined text file (i.e. 1 "ahref
> tag" per line or whatever based on template or something) ... or?? cos
> the current Nutch saves it in bin format.. so I am trying to
> understand here as well...

Errr, you're right: the parsers return an object of type Parse, not a
file... Does anybody see a way to integrate this?
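
One possible direction, as a minimal sketch only: hide whatever the
parser returns behind a small interface and let a separate step write
the links out, one per line. ParsedPage and LinkDump are hypothetical
names; ParsedPage is a stand-in for illustration and is not the Nutch
Parse API.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parser result; NOT the Nutch Parse API.
interface ParsedPage {
    List<String> outlinks();
}

// Hypothetical output step: turns a parse result into a text dump.
class LinkDump {

    // Writes one link per line and returns the number of links written.
    static int writeLinks(ParsedPage page, PrintWriter out) {
        int count = 0;
        for (String link : page.outlinks()) {
            out.print(link);
            out.print('\n');
            count++;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) {
        // Example links are made up for the demonstration.
        ParsedPage page = () -> Arrays.asList(
            "http://example.org/a", "http://example.org/b");
        StringWriter sink = new StringWriter();
        writeLinks(page, new PrintWriter(sink));
        System.out.print(sink);
    }
}
```

An adapter from the real parser result to such an interface would still
have to be written; the sketch only shows that the output format can be
decided after parsing rather than inside the parser.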
Thanks,
Renaud