Thread: Reviving Nutch 0.7




Reviving Nutch 0.7
user name
2007-01-22 00:47:38
Hi,

I've been meaning to write this message for a while, and
Andrzej's StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once
Hadoop stabilizes, it will be even more valuable than it is
today.  However, I think there is still a need for something
much simpler, something like what Nutch 0.7 used to be. 
Fairly regular nutch-user inquiries confirm this.  Nutch has
too few developers to maintain and further develop both of
these concepts, and the main Nutch developers need the more
powerful version - 0.8 and beyond.  So, what is going to
happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it
might be worth at least considering and discussing the
possibility of somehow branching that version into a
parallel project that's not just in a maintenance mode, but
has its own group of developers (not me, no time :( ) that
pushes it forward.

Thoughts?

Otis




Re: Reviving Nutch 0.7
user name
2007-01-22 02:05:22
Otis,
Some time ago people on the list said that they are willing
to at least maintain the Nutch 0.7 branch. As a committer
(not very active recently) I volunteered to commit patches
when they appear; I do not have enough time at the moment to
do active coding. I have created a 0.7.3 release in JIRA so
we can start looking at it. So we are ready and willing to
move Nutch 0.7 forward, but it looks like there is no
interest at the moment.
Regards
Piotr


Re: Reviving Nutch 0.7
user name
2007-01-22 03:13:11
On 1/22/07, Otis Gospodnetic <otis_gospodneticyahoo.com> wrote:
> Hi,
>
> I've been meaning to write this message for a while, and
> Andrzej's StrategicGoals made me compose it, finally.
>
> Nutch 0.8 and beyond is very cool, very powerful, and once
> Hadoop stabilizes, it will be even more valuable than it is
> today.  However, I think there is still a need for something
> much simpler, something like what Nutch 0.7 used to be.
> Fairly regular nutch-user inquiries confirm this.  Nutch has
> too few developers to maintain and further develop both of
> these concepts, and the main Nutch developers need the more
> powerful version - 0.8 and beyond.  So, what is going to
> happen to 0.7?  Maintenance mode?
>
> I feel that there is enough need for 0.7-style Nutch that it
> might be worth at least considering and discussing the
> possibility of somehow branching that version into a
> parallel project that's not just in a maintenance mode, but
> has its own group of developers (not me, no time :( ) that
> pushes it forward.
>
> Thoughts?

I agree with you that there is a need for 0.7-style Nutch. I
wouldn't say reviving, but more "dissecting and
re-directing". Here you go; my focus here is 0.7 style, i.e.
mid-size, enterprise need.

Solr could use a good crawler because it has everything else
(AFAIK). This is probably not technically "plug and pray",
and I am not sure the Solr community wants a crawler, but it
could benefit from such a Solr add-on/snap-on crawler.
Furthermore, I am sure some of the 0.7 plugins could be
refactored to fit into Solr.

I will forward the mail to the Solr community to see if
there is any interest.

Cheers

RE: Reviving Nutch 0.7
user name
2007-01-22 04:37:59
Hello,

I'm writing this on behalf of both Armel Nene and myself. 

We think that you and those who have responded have a point.
We've been experiencing quite a number of problems with
getting Nutch 0.8 adapted for our needs, and with making
changes to support evolving business requirements as they
come up.

So much so, that we've considered replacing the "spine" of
Nutch with our own programs, which would still be compatible
with the Nutch plugins (same parameters etc.), but which
would allow us more ease in making changes and debugging.
We've decided to lay out some of our challenges for you to
consider.
 
Our major need is the ability to deploy on large enterprise
file systems (1-10 terabytes: large compared to average file
systems, but small compared to the WWW).  We also need to
support HTTP, but only for specific web sites, subscription
web sites and so on.  We don't need to replicate a generic
Google implementation.

The main features we are currently working on relate
primarily to near-real-time crawling, specifically:
- Incremental crawling, where changes are monitored at the
  folder level, which is much faster than fetching every URL
  and checking for a change.  Note that this is similar to
  adaptive crawling, but will be even more efficient.
- Special handling for parsing of large files (possibly
  farming those out to dedicated processors à la Amazon).
  Hadoop would be useful here, but we would consider
  re-adding this at a later stage.
- Incremental indexing, where documents are added to or
  removed from a live index, instead of rebuilding a new
  index each time.
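
As an illustration only (not the implementation described in
this message), the incremental crawling and incremental
indexing ideas above could be sketched roughly as follows,
assuming a local file system and using file metadata
(modification time and size) as the change signal. The
function names snapshot and diff_snapshots are hypothetical
and are not part of Nutch.

```python
import os


def snapshot(root):
    """Record (mtime_ns, size) for every file under root.

    Only file metadata is read; file contents are never opened,
    which is what makes this cheaper than re-fetching each document.
    """
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(old, new):
    """Compare two snapshots and return (added, changed, removed) paths.

    'added' and 'changed' would be re-parsed and written to the live
    index; 'removed' would be deleted from it. No full rebuild needed.
    """
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    removed = sorted(p for p in old if p not in new)
    return added, changed, removed
```

A crawler built this way would keep the previous snapshot
between runs and pass only the three resulting lists to the
indexer. Relying on modification time and size is an
assumption; a file rewritten with the same size within the
timestamp resolution would be missed.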

We would be happy to join a group of 0.7 developers, if that
would enable us to pursue this enterprise-based direction,
which clearly has different challenges than those facing
WWW crawling.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com


Re: Reviving Nutch 0.7
user name
2007-01-22 09:24:15
2007/1/22, Otis Gospodnetic <otis_gospodneticyahoo.com>:
>
> Hi,
>
> I've been meaning to write this message for a while, and
> Andrzej's StrategicGoals made me compose it, finally.
>
> Nutch 0.8 and beyond is very cool, very powerful, and once
> Hadoop stabilizes, it will be even more valuable than it is
> today.  However, I think there is still a need for something
> much simpler, something like what Nutch 0.7 used to be.
> Fairly regular nutch-user inquiries confirm this.  Nutch has
> too few developers to maintain and further develop both of
> these concepts, and the main Nutch developers need the more
> powerful version - 0.8 and beyond.  So, what is going to
> happen to 0.7?  Maintenance mode?
>
> I feel that there is enough need for 0.7-style Nutch that it
> might be worth at least considering and discussing the
> possibility of somehow branching that version into a
> parallel project that's not just in a maintenance mode, but
> has its own group of developers (not me, no time :( ) that
> pushes it forward.
>
> Thoughts?
>
>
Before doubling (or, after 0.9.0, tripling?) the
maintenance/development work, please consider the following:

One option would be refactoring the code so that the parts
that are usable by other projects, like protocols(?),
parsers (this was actually proposed by Jukka Zitting some
time last year) and so on, would be modified to be
independent of Nutch (and Hadoop) code. Yes, this is easy to
say, but it would require a significant amount of work.

The "more focused", smaller chunks of Nutch would probably
also get a bigger audience (perhaps also outside Nutch land)
and, that way, perhaps more people willing to work on them.

I don't know about others, but at least I would be more
willing to work towards this goal than towards one where
there would be practically many separate projects, each
sharing common functionality but with a different code base.

--
 Sami Siren