Thread: Is there PyNutch?




Is there PyNutch?
2007-02-14 11:27:39
Hello all,

Lucene, the core of Nutch, has a Python port, PyLucene. I wonder
whether there is a Python port of Nutch as well? We have some
distributed Nutch searchers running. I'm thinking it would be nice to
have the merger/front end available in Python and take advantage of
the powerful Python web frameworks.

-- 
Best regards,
Jack

_______________________________________________
pylucene-dev mailing list
pylucene-devosafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev

Re: Is there PyNutch?
2007-02-14 11:34:34
On Feb 14, 2007, at 12:27 PM, Jack L wrote:
> The core of Nutch - Lucene has a Python port PyLucene. I wonder
> if there is a Python port for Nutch? We have some distributed
> Nutch searchers running. I'm thinking, it would be nice to
> have the merger/frontend available to Python and take advantage of
> the powerful Python web frameworks.


There is a Python frontend to Nutch built by Dennis Kubes:
http://wiki.apache.org/nutch/Automating_Fetches_with_Python

And in our setup we mix Nutch's Java parsers and crawlers with our
own homebuilt Python ones. We use Solr via a Python class to inject
data into the main Nutch index. You have to be very careful with
index and segment merging, but otherwise it works well.

I was initially using PyLucene for this task, but I found that Solr
does a great job of abstracting the index files from the application,
and we can run multiple crawl processes on many machines, all feeding
the same Solr-led index. With PyLucene/Lucene you need to worry
about locks and the IndexWriter/IndexReader.

For more on Nutch->Solr, see
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
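
The "Solr via a Python class" idea above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual class: the function names, the field names, and the update URL (an assumed local default) are all hypothetical. Only the `<add><doc><field name="...">` message shape is Solr's XML update format.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize dicts of field name -> value into a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # Serialize to text first, then encode, so no XML declaration is emitted.
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def post_to_solr(payload, url="http://localhost:8983/solr/update"):
    """POST an update message. The URL is an assumption; adjust as needed."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Build a payload locally; post_to_solr(payload) would send it to a server.
payload = build_add_xml([{"id": "doc-1", "title": "Hello"}])
```

Because every crawler only ever sends HTTP requests, none of them touches the index files directly. Documents become searchable only after a `<commit/>` message is posted to the same update handler.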





Re: Is there PyNutch?
2007-02-14 13:06:34
Hello Brian,

Thanks for the reply. (I'm not sure whether this discussion is of
interest to the PyLucene dev list. If it's considered OT, I shall take
the next email offline.)

I looked at the first link you sent. It's not actually what I'm
looking for. In our setup, we have multiple crawler/indexer/searcher
boxes talking to one merger/web server front end using Nutch IPC.
The front-end box sends queries to multiple back-end searchers,
merges the results it receives, and presents them in a web page.
I'm hoping to find a way to replace the front-end Java implementation
with Python. So the piece I'm looking for does not touch the
segments. Instead, it speaks Nutch IPC, parses the query strings,
issues queries to the back end, merges the results, and puts
them in a web page.
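
The fan-out-and-merge part of such a front end might look roughly like this in Python. It is only a sketch under loud assumptions: the back ends are in-memory stubs rather than anything that speaks Nutch IPC, hits are plain (score, url) tuples, and all function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def make_stub_backend(hits):
    """Return a callable that mimics one back-end searcher (hypothetical)."""
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)

    def search(query, limit):
        # A real back end would evaluate the query; this stub ignores it.
        return ranked[:limit]

    return search


def federated_search(backends, query, limit=10):
    """Query every back end in parallel, then merge by descending score."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        per_backend = list(pool.map(lambda b: b(query, limit), backends))
    # Each list is already sorted by descending score, so a k-way merge works.
    merged = heapq.merge(*per_backend, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


backends = [
    make_stub_backend([(0.9, "http://a/1"), (0.4, "http://a/2")]),
    make_stub_backend([(0.7, "http://b/1"), (0.1, "http://b/2")]),
]
results = federated_search(backends, "python", limit=3)
```

Merging by raw score assumes the scores from different back ends are comparable, which is not guaranteed when each back end scores against its own index statistics.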

Thanks for mentioning your experience with Solr. I haven't tried it
with a large amount of data. My concern is that inserting via HTTP POST
is much less efficient than local file access (the Nutch approach).
I'm not sure whether it can handle millions of daily submits.
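
As rough arithmetic for that concern, "millions of daily submits" can be converted into an average request rate. The sketch below only does the division; the volumes and the batch size of 100 are illustrative assumptions, and it says nothing about what a given Solr installation can actually sustain.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(docs_per_day, batch_size=1):
    """Average HTTP requests per second needed to sustain a daily volume."""
    return docs_per_day / batch_size / SECONDS_PER_DAY


for volume in (1_000_000, 5_000_000):
    single = required_rate(volume)       # one document per POST
    batched = required_rate(volume, 100)  # 100 documents per POST (assumed)
    print(f"{volume} docs/day: {single:.1f} req/s, {batched:.2f} req/s batched")
```

One million documents a day averages out to roughly a dozen POSTs per second when sent one at a time. Since a single `<add>` message may carry several `<doc>` elements, batching reduces the request count proportionally.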

-- 
Best regards,
Jack



Distributed Indexes, Pycon, was Re: Is there PyNutch?
2007-02-19 17:45:24
On Wednesday February 14 2007 1:06 pm, Jack L wrote:
> Hello Brian,
>
> Thanks for the reply. (I'm not sure if this discussion is interesting
> to PyLucene dev list. If it's considered OT, I shall take the next
> email offline.)

I consider this on-topic, largely because I'm interested and there's
nowhere else to discuss it. ;)

> I looked at the first link you sent. It's not actually what I'm
> looking for. In our set up, we have multiple crawler/indexer/searcher
> boxes talking to one merger/web server front-end using Nutch IPC.
> The front-end box sends queries to multiple back-end searchers and
> merges the results it has received, and presents them in a web page.
> I'm hoping to find a way to replace the front-end Java implementation
> with Python. So, the piece I'm looking for does not touch the
> segments. Instead, it speaks Nutch IPC and parses the query
> strings, issues queries to the back-end, and merges results and puts
> them in a web page.

I've been kicking this sort of idea around with my coworkers recently.
While I haven't used Nutch or Solr, we've used techniques from the latter.

Some background:
We're a Python shop [0]. In general, we're working with relatively small
data sets, but running rather complex queries and pre-indexing analysis.
We use an in-house spider and a Python web server. The front end runs on
a local copy of a PyLucene index, updated via the Solr in-process
technique [1].

We're starting to push up against the capacity limits of querying on a
single server and are thinking about how to partition the index across
multiple boxes. In Java, this appears to be done using a MultiSearcher
and RemoteSearchable. The latter is implemented on Java RMI [2], which
is not in PyLucene [3].
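
One simple way to decide which box holds which document when partitioning an index is to hash a stable document identifier. The following is a minimal sketch of that idea only: it is not how MultiSearcher assigns documents (it does no assignment at all), and the function names and the use of URLs as identifiers are assumptions.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index in a stable, repeatable way."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document ids into one list per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = [f"http://example.com/page/{n}" for n in range(1000)]
shards = partition(doc_ids, 4)
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so the spider and the indexer would disagree about where a document lives. Changing the number of shards reshuffles most documents, which is the main drawback of plain modulo hashing.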

Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable
require multiple calls to slave machines per query. The general thought
would be to re-implement such a thing in Python, using something like
Perspective Broker [4].
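
One plausible reason for the multiple calls per query is that relevance scoring depends on corpus-wide statistics, so document frequencies have to be collected from every partition before any partition can score its hits consistently. The sketch below models that two-round-trip pattern with in-memory stand-ins. It is a simplification, not Lucene's implementation: the `Shard` class is hypothetical, and the score is reduced to sqrt(term frequency) times idf, using the classic idf form log(numDocs / (docFreq + 1)) + 1.

```python
import math


class Shard:
    """In-memory stand-in for one remote index partition (hypothetical)."""

    def __init__(self, docs):
        self.docs = docs  # doc id -> list of terms

    def stats(self, term):
        """Round trip 1: local document frequency and document count."""
        doc_freq = sum(1 for terms in self.docs.values() if term in terms)
        return doc_freq, len(self.docs)

    def search(self, term, idf):
        """Round trip 2: score local matches using the shared idf."""
        return [
            (math.sqrt(terms.count(term)) * idf, doc_id)
            for doc_id, terms in self.docs.items()
            if term in terms
        ]


def global_idf(shards, term):
    """Combine per-shard statistics into a single idf value."""
    stats = [shard.stats(term) for shard in shards]
    doc_freq = sum(df for df, _ in stats)
    num_docs = sum(n for _, n in stats)
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def search_all(shards, term):
    """Two passes over the shards: gather statistics, then search."""
    idf = global_idf(shards, term)
    hits = [hit for shard in shards for hit in shard.search(term, idf)]
    return sorted(hits, reverse=True)


shards = [
    Shard({"a1": ["python", "lucene"], "a2": ["java"]}),
    Shard({"b1": ["python", "python"], "b2": ["nutch"]}),
]
hits = search_all(shards, "python")
```

Fetching the stored fields of the top hits for display would typically add a third round trip, to whichever shards own those documents.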

I don't really want to do this, however, as it just doesn't sound like
my idea of a good time. I'm starting to formulate some thoughts on
alternative approaches, but haven't totally sorted them out. So, the
question on everyone's mind is:

For all you folks using PyLucene for *queries*, how do you scale beyond
a single machine?

Anyone going to PyCon? Want to have a Birds of a Feather on Lucene/Text
Search/Distributed Computing? [5]

--Pete

[0] Interested? We're hiring. This message only hints at the sorts of
problems we're trying to solve. Contact me off-list.

[1] http://wiki.apache.org/solr/CollectionDistribution . This was not
the easiest thing in the world to get working acceptably with PyLucene,
though that appears to have more to do with Boehm GC. It also requires
about 2x the RAM during the switchover period and beats on the disk.

[2]
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RemoteSearchable.html
http://java.sun.com/j2se/1.4.2/docs/api/java/rmi/server/UnicastRemoteObject.html

[3] http://www.archivesat.com/pylucene_developers/thread323504.htm

[4] http://twistedmatrix.com/projects/core/documentation/howto/pb-intro.html

[5] http://us.pycon.org/TX2007/BoF

Re: Distributed Indexes, Pycon, was Re: Is there PyNutch?
2007-02-19 18:02:16
On Mon, 19 Feb 2007, Pete wrote:

> Here, the fun begins.  As best I can tell, MultiSearcher/RemoteSearchable
> require multiple calls to slave machines per query. The general thought
> would be to re-implement such a thing in Python, using something like
> Perspective Broker [4].

MultiSearcher and ParallelMultiSearcher are supported by PyLucene.
RemoteSearchable is not (just thinking of RMI with gcj gives me a headache).

Andi..

Re: Distributed Indexes, Pycon, was Re: Is there PyNutch?
2007-02-21 15:53:45
Hello Pete,

Solr was suggested in earlier emails as a dedicated search server.
From the feedback on the Solr list, it actually works pretty well,
and you can interface with it from Python code. But it is not really
distributed: it does not currently span multiple servers.

While PyLucene covers local search very well, Python indeed still lacks
a way to query distributed searchers and merge the results, as
Nutch does. From my (limited) understanding, PyLucene + Nutch IPC would
do something similar to what the Nutch back end does, and some merger
logic + IPC would be similar to the Nutch front end. So the missing
pieces are a Python implementation of Nutch IPC and the merging code?

-- 
Best regards,
Jack



