Thread: Is there PyNutch?
|
|
| Is there PyNutch? |
  United States |
2007-02-14 11:27:39 |
Hello all,
The core of Nutch, Lucene, has a Python port: PyLucene. I wonder if
there is a Python port for Nutch? We have some distributed Nutch
searchers running. I'm thinking it would be nice to have the
merger/frontend available to Python and take advantage of the powerful
Python web frameworks.
--
Best regards,
Jack
_______________________________________________
pylucene-dev mailing list
pylucene-dev osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
|
|
| Re: Is there PyNutch? |

|
2007-02-14 11:34:34 |
On Feb 14, 2007, at 12:27 PM, Jack L wrote:
> The core of Nutch, Lucene, has a Python port: PyLucene. I wonder if
> there is a Python port for Nutch? We have some distributed Nutch
> searchers running. I'm thinking it would be nice to have the
> merger/frontend available to Python and take advantage of the
> powerful Python web frameworks.
There is a Python frontend to Nutch built by Dennis Kubes:
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
And in our setup we mix Nutch's Java parsers and crawlers with our own
homebuilt Python ones. We use Solr via a Python class to inject data
into the main Nutch index. You have to be very careful with index and
segment merging, but otherwise it works well.
I was initially using PyLucene for this task, but I found that Solr
does a great job of abstracting the index files from the application,
and we can run multiple crawl processes on many machines, all feeding
the same Solr-led index. With PyLucene/Lucene you need to worry about
locks and the IndexWriter/IndexReader.
For more on Nutch->Solr, see
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
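
For anyone who has not fed Solr from Python before, here is a minimal
sketch of what such an injection class boils down to. It is not the
class described above, just an illustration: the update URL, the field
names and the helper names build_add_xml/post_update are all
assumptions. It builds Solr's XML <add> message and POSTs it to the
update handler.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Render a list of {field: value} dicts as a Solr <add> message.

    A list or tuple value is emitted as a repeated (multi-valued) field.
    """
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                parts.append(
                    "<field name=%s>%s</field>" % (quoteattr(name), escape(str(v)))
                )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post_update(xml, url="http://localhost:8983/solr/update"):
    """POST an update message to Solr. The URL is an assumed default."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical documents; the field names must exist in your Solr schema.
    message = build_add_xml([{"id": "doc-1", "title": "Nutch & Solr"}])
    print(message)
    # post_update(message)      # needs a running Solr instance
    # post_update("<commit/>")  # makes the added documents searchable
```

Because every crawl process only speaks HTTP to Solr, none of them
touches the index files or the write lock directly, which is the
abstraction mentioned above.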
|
|
| Re: Is there PyNutch? |
  United States |
2007-02-14 13:06:34 |
Hello Brian,
Thanks for the reply. (I'm not sure if this discussion is interesting
to the PyLucene dev list. If it's considered OT, I shall take the next
email offline.)
I looked at the first link you sent. It's not actually what I'm looking
for. In our setup, we have multiple crawler/indexer/searcher boxes
talking to one merger/web server front-end using Nutch IPC. The
front-end box sends queries to multiple back-end searchers, merges the
results it has received, and presents them in a web page.
I'm hoping to find a way to replace the front-end Java implementation
with Python. So, the piece I'm looking for does not touch the segments.
Instead, it speaks Nutch IPC, parses the query strings, issues queries
to the back-end, merges the results and puts them in a web page.
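
To make the merger role concrete, here is a rough sketch of the
scatter/gather step only. It does not speak Nutch IPC: each back-end is
modelled as a plain Python callable taking (query, limit) and returning
(score, url) tuples sorted by descending score, and the names
search_all and merge_hits are made up for illustration.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor


def merge_hits(per_backend_hits, limit=10):
    """Merge hit lists, each already sorted by descending score.

    Returns the overall top `limit` hits as (score, url) tuples.
    """
    merged = heapq.merge(*per_backend_hits, key=lambda hit: hit[0], reverse=True)
    return list(itertools.islice(merged, limit))


def search_all(backends, query, limit=10):
    """Send the query to every back-end concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        results = list(pool.map(lambda backend: backend(query, limit), backends))
    return merge_hits(results, limit)


if __name__ == "__main__":
    # Hypothetical stand-ins for remote searchers.
    def box_a(query, limit):
        return [(0.9, "http://a.example/1"), (0.4, "http://a.example/2")][:limit]

    def box_b(query, limit):
        return [(0.7, "http://b.example/1"), (0.1, "http://b.example/2")][:limit]

    for score, url in search_all([box_a, box_b], "pylucene", limit=3):
        print(score, url)
```

Merging by raw score is only meaningful if the back-ends score
documents on a comparable scale; a real front-end would also need the
IPC client and the result-page rendering around this.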
Thanks for mentioning your experience with Solr. I haven't tried it
with a large amount of data. My concern is that inserting via HTTP POST
is much less efficient than local file access (the Nutch approach). I'm
not sure if it's able to handle millions of daily submits.
--
Best regards,
Jack
|
|
| Distributed Indexes, Pycon, was Re: Is there PyNutch? |
  United States |
2007-02-19 17:45:24 |
On Wednesday February 14 2007 1:06 pm, Jack L wrote:
> Hello Brian,
>
> Thanks for the reply. (I'm not sure if this discussion is interesting
> to the PyLucene dev list. If it's considered OT, I shall take the next
> email offline.)
I consider this on-topic, largely because I'm interested and there's
nowhere else to discuss it. ;)
> I looked at the first link you sent. It's not actually what I'm
> looking for. In our setup, we have multiple crawler/indexer/searcher
> boxes talking to one merger/web server front-end using Nutch IPC.
> The front-end box sends queries to multiple back-end searchers, merges
> the results it has received, and presents them in a web page.
> I'm hoping to find a way to replace the front-end Java implementation
> with Python. So, the piece I'm looking for does not touch the
> segments. Instead, it speaks Nutch IPC, parses the query strings,
> issues queries to the back-end, merges the results and puts them in a
> web page.
I've been kicking this sort of idea around with my coworkers recently.
While I haven't used Nutch/Solr, we've used techniques from the latter.
Some background:
We're a Python shop [0]. In general, we're working with relatively small
data sets, but running rather complex queries and pre-indexing analysis.
We use an in-house spider and a Python webserver. The front-end runs on
a local copy of a PyLucene index updated via the Solr in-process
technique [1].
We're starting to push up against the capacity limits of querying on a
single server and are thinking about how to partition the index across
multiple boxes. In Java, this appears to be done using a MultiSearcher
and RemoteSearchable. The latter is implemented on Java RMI [2], which
is not in PyLucene [3].
Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable
require multiple calls to slave machines per query. The general thought
would be to re-implement such a thing in Python, using something like
Perspective Broker [4].
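
One plausible reason for the multiple calls is that scores from
different partitions are only comparable if every partition uses the
same collection-wide term statistics. The sketch below illustrates that
two-round-trip pattern with in-memory stand-ins. The Shard class and
distributed_search are hypothetical, nothing here is PyLucene or
Perspective Broker API, and the scoring is a simplified tf-idf loosely
modelled on Lucene's classic similarity, not the real formula.

```python
import math


class Shard:
    """Hypothetical in-memory stand-in for one remote index partition."""

    def __init__(self, docs):
        self.docs = docs  # {doc_id: list of terms}

    def stats(self, term):
        """Round trip 1: local document frequency and local doc count."""
        doc_freq = sum(1 for terms in self.docs.values() if term in terms)
        return doc_freq, len(self.docs)

    def search(self, term, idf):
        """Round trip 2: score local docs with the globally agreed idf."""
        hits = []
        for doc_id, terms in self.docs.items():
            tf = terms.count(term)
            if tf:
                hits.append((math.sqrt(tf) * idf, doc_id))
        return hits


def distributed_search(shards, term, limit=10):
    """Gather global statistics first, then search every shard with them."""
    stats = [shard.stats(term) for shard in shards]
    doc_freq = sum(df for df, _ in stats)
    num_docs = sum(n for _, n in stats)
    if num_docs == 0:
        return []
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    hits = [hit for shard in shards for hit in shard.search(term, idf)]
    return sorted(hits, reverse=True)[:limit]


if __name__ == "__main__":
    shards = [
        Shard({"a": ["nutch", "lucene"], "b": ["solr"]}),
        Shard({"c": ["nutch", "nutch"]}),
    ]
    for score, doc_id in distributed_search(shards, "nutch"):
        print(doc_id, round(score, 3))
```

In a real deployment each method call on a shard would be a network
round trip, which is where the per-query cost comes from.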
I don't really want to do this, however, as it just doesn't sound like
my idea of a good time. I'm starting to formulate some thoughts for
alternate approaches, but haven't totally sorted them out. So, the
question on everyone's mind is:
For all you folks using PyLucene for *queries*, how do you scale beyond
a single machine?
Anyone going to PyCon? Want to have a Birds of a Feather on Lucene/Text
Search/Distributed Computing? [5]
--Pete
[0] Interested? We're hiring. This message only hints at the sorts of
problems we're trying to solve. Contact me off-list.
[1] http://wiki.apache.org/solr/CollectionDistribution . This was not
the easiest thing in the world to get working acceptably with PyLucene,
though that appears to have more to do with Boehm GC. It also requires
about 2x the RAM during the switchover period and beats on the disk.
[2] http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RemoteSearchable.html
http://java.sun.com/j2se/1.4.2/docs/api/java/rmi/server/UnicastRemoteObject.html
[3] http://www.archivesat.com/pylucene_developers/thread323504.htm
[4] http://twistedmatrix.com/projects/core/documentation/howto/pb-intro.html
[5] http://us.pycon.org/TX2007/BoF
|
|
| Re: Distributed Indexes, Pycon, was Re: Is there PyNutch? |
  United States |
2007-02-19 18:02:16 |
On Mon, 19 Feb 2007, Pete wrote:
> Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable
> require multiple calls to slave machines per query. The general thought
> would be to re-implement such a thing in Python, using something like
> Perspective Broker [4].
MultiSearcher and ParallelMultiSearcher are supported by PyLucene.
RemoteSearchable is not (just thinking of RMI with gcj gives me a
headache).
Andi..
|
|
| Re: Distributed Indexes, Pycon, was Re: Is there PyNutch? |
  United States |
2007-02-21 15:53:45 |
Hello Pete,
Solr was suggested in earlier emails as a dedicated search server.
From the feedback on the Solr list, it actually works pretty well, and
you can interface with it from Python code. But it is not really
distributed: it does not currently span multiple servers.
While PyLucene covers local search very well, Python indeed still lacks
a way to query distributed searchers and merge the results, like what
Nutch does. From my (limited) understanding, PyLucene + Nutch IPC would
do something similar to what the Nutch back-end does, and some merger
logic + IPC would be similar to the Nutch front-end. So the missing
pieces are a Python implementation of Nutch IPC and the merging code?
--
Best regards,
Jack
|
|