Thread: site text search
|
|
| site text search |

|
2006-02-08 21:58:29 |
On 2/8/06, Dirk Koopman <djk tobit.co.uk> wrote:
> What do people use these days for providing local text searches on
> smallish websites?
Plucene can do small-ish. About 20x slower than Lucene, but hey...
There are even a few docs kicking about: http://plucene.minty.org/
Haven't really used the stuff below, so cannot comment, but if you are
looking for things to play with ...
http://www.rectangular.com/kinosearch/
http://www.xapian.org/
http://meta.wikimedia.org/wiki/FulltextSearchEngines
I think, but can't be sure, Wikipedia ended up with PyLucene, in part
due to being able to compile with gcj.
Personally, I reckon fulltext indexing in MySQL could be the way to
go: anyone got any stats on index sizes and the effect on query speed?
|
|
|
2006-02-09 09:45:21 |
On Feb 08, 2006, at 22:58, Minty wrote:
> http://www.xapian.org/
I haven't used it myself yet, but I've heard many strong
recommendations for Xapian and its Perl binding, Search::Xapian. It's
been on my pile of things to test for a while.
--
Robin Berjon
Senior Research Scientist
Expway, http://expway.com/
|
|
|
2006-02-09 10:30:16 |
On Thu, 2006-02-09 at 10:45 +0100, Robin Berjon wrote:
> On Feb 08, 2006, at 22:58, Minty wrote:
> > http://www.xapian.org/
>
> I haven't used it myself yet, but I've heard many strong
> recommendations for Xapian and its Perl binding, Search::Xapian. It's
> been on my pile of things to test for a while.
>
I am, just for now, using Swish-e (http://swish-e.org). One of the
reasons for this is that I have used it (long ago) in the past, and I
also find it (much) faster and much easier to run several instances of
than htdig. The fact that it does not use a whole 512MB (plus all the
swap, and then core dump and fail) on a website of about 30MB is also a
point in its favour. I have used it on much larger ones.
However, I see that Xapian is very similar and may be a better search
engine as such; it is certainly more modern. So I shall have a look at
that in the next few days.
Java, however good, is not an option I am afraid.
Dirk
|
|
|
2006-02-09 10:13:31 |
On Wed, Feb 08, 2006 at 09:58:29PM +0000, Minty wrote:
> http://www.xapian.org/
We use Locayta, which is the commercial version of this, for our search
stuff. It works reasonably well, and is quite featureful, but you do
need to do some translation to get things into a format that it will
deal with. It's also slightly weird getting your head round its input
and output formats.
I believe www.cam.ac.uk used to (and may still) use Xapian as its main
site search, among several other reasonably high profile sites.
--
Lusercop.net - LARTing Lusers everywhere since 2002
|
|
|
2006-02-09 11:12:52 |
Dirk Koopman writes:
> I am, just for now, using Swish-e (http://swish-e.org). One of the
> reasons for this is that I have used it (long ago) in the past, and I
> also find it (much) faster and much easier to run several instances of
> than htdig. The fact that it does not use a whole 512MB (plus all the
> swap, and then core dump and fail) on a website of about 30MB is also
> a point in its favour. I have used it on much larger ones.
I have some familiarity with Swish-e, and, yes, it seems to be pretty
fast at both indexing and searching. And the Perl interface is entirely
good enough.
But I'd be wary of trying to use it on a new project any time soon:
- It has no Unicode support (though you can use any 8-bit character
  set). Apparently this is due to be fixed in version 3.0, but I've
  no idea of how soon that's expected.
- It doesn't support deleting documents from an index, or reindexing
  changed documents that are already in the index. The current stable
  version can be built with ./configure --enable-incremental to do
  that, but it's described as an experimental feature. Again, the
  plan seems to be for version 3.0 to do this properly.
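
To make the second point concrete, here is a minimal Python sketch of
what deleting and reindexing require of an index. It is a toy in-memory
inverted index, not Swish-e's index format or API; every name in it
(ToyIndex, add, delete, search) is made up for illustration. The key
assumption is that the index remembers which terms each document
contributed, so its postings can be taken out again.

```python
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index; hypothetical, not Swish-e's design."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.doc_terms = {}               # doc id -> terms it contributed

    def add(self, doc_id, text):
        # Reindexing a changed document means dropping its old postings first.
        if doc_id in self.doc_terms:
            self.delete(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        # Without the doc_terms map, the only option left is a full rebuild.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = ToyIndex()
index.add("a.html", "perl search engine")
index.add("b.html", "java search engine")
index.add("a.html", "perl indexing tool")  # changed document, reindexed
index.delete("b.html")                     # removed document
```

An index that keeps only the term-to-document postings has nothing to
consult when a document changes, which is why the fallback is rebuilding
the whole index from scratch.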
--
Aaron Crane
|
|
|
2006-02-09 11:21:03 |
On Thu, Feb 09, 2006 at 10:13:31AM +0000, Lusercop wrote:
> I believe www.cam.ac.uk used to (and may still) use Xapian as its
> main site search, among several other reasonably high profile sites.
web-search.cam.ac.uk uses some commercial product from Inktomi now.
--
Benjamin Smith <bsmith cpan.org, bs338 srcf.ucam.org>
|
|
|
2006-02-09 12:10:36 |
On Thu, 2006-02-09 at 11:12 +0000, Aaron Crane wrote:
> Dirk Koopman writes:
> > I am, just for now, using Swish-e (http://swish-e.org). One of the
> > reasons for this is that I have used it (long ago) in the past, and
> > I also find it (much) faster and much easier to run several
> > instances of than htdig. The fact that it does not use a whole 512MB
> > (plus all the swap, and then core dump and fail) on a website of
> > about 30MB is also a point in its favour. I have used it on much
> > larger ones.
>
> I have some familiarity with Swish-e, and, yes, it seems to be pretty
> fast at both indexing and searching. And the Perl interface is
> entirely good enough.
>
> But I'd be wary of trying to use it on a new project any time soon:
>
> - It has no Unicode support (though you can use any 8-bit character
>   set). Apparently this is due to be fixed in version 3.0, but I've
>   no idea of how soon that's expected.
I don't think it is under active development and, in any case, if I
were doing it, it would be UTF8 rather than Unicode. Much easier to
deal with and more generalised, as I think others connected with this
list have found. Since I only speak English and Dutch, and my users
have to communicate in one of those to get any support, the
UTF8/Unicode aspect doesn't affect me.
> - It doesn't support deleting documents from an index, or reindexing
>   changed documents that are already in the index. The current stable
>   version can be built with ./configure --enable-incremental to do
>   that, but it's described as an experimental feature. Again, the
>   plan seems to be for version 3.0 to do this properly.
>
This isn't an issue with the amounts of data I am trying to index.
I have, in the past, used it for databases of several hundred MB, and
it wasn't much of an issue then either.
It indexes the whole 30-odd MB in less than a minute on a 600MHz Via
C3, once a day. Can't complain really.
Dirk
|
|
|
2006-02-09 21:53:58 |
On 2/9/06, Benjamin Smith <bsmith vtrl.co.uk> wrote:
> web-search.cam.ac.uk uses some commercial product from Inktomi now.
... who are actually now Yahoo! in all but name.
|
|
|
2006-02-08 22:13:47 |
> On 2/8/06, Dirk Koopman <djk tobit.co.uk> wrote:
> > What do people use these days for providing local text searches on
> > smallish websites?
Oh, and there is nutch.org, which is basically Lucene under the hood
plus loads of handy stuff that makes for a (relatively) easy, large
scale (if you need it), fairly customised site search tool that would
cost you a small fortune if you bought it commercially.
On the down side, it's still written in Java, and their own website
search is powered by Google. Doh.
http://lucene.apache.org/nutch/
|
|
|
2006-02-08 22:23:10 |
On 8 Feb 2006, at 21:58, Minty wrote:
> Personally, I reckon fulltext indexing in mysql could be the way to
> go: anyone got any stats on index sizes and effect on query speed
> performance?
Two nice things about searching against the MySQL database:
* you're only searching the content - not any boilerplate that
  appears on every page
* if you have a site that presents multiple views of the same
  data (e.g. articles sorted by date, by subject, by keyword)
  then a crawler based indexer will index each item many times
  - once for each view in which it appears; MySQL will only have
  a single copy of the data.
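
Both points can be shown with a small Python sketch. It uses a toy
in-memory index rather than MySQL's fulltext index or any real crawler,
and the articles, URL prefixes and boilerplate text are all invented
for the example.

```python
from collections import defaultdict

# Hypothetical data: two articles as they might sit in a database table.
articles = {
    1: "plucene is slow",
    2: "xapian is modern",
}

BOILERPLATE = "home about contact"  # navigation text repeated on every page

# Each "view" lists the same articles under a different URL prefix.
views = {
    "/by-date/": [1, 2],
    "/by-subject/": [2, 1],
}


def build_index(docs):
    """Map each term to the set of document keys containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for term in text.split():
            index[term].add(key)
    return index


# Database-backed index: one entry per article, content only.
db_index = build_index(articles)

# Crawler-style index: one entry per rendered page, boilerplate included.
pages = {
    f"{prefix}{article_id}": f"{BOILERPLATE} {articles[article_id]}"
    for prefix, ids in views.items()
    for article_id in ids
}
crawl_index = build_index(pages)

print(sorted(db_index["xapian"]))     # the article itself, once
print(sorted(crawl_index["xapian"]))  # one hit per view it appears in
print(len(crawl_index["home"]), len(db_index["home"]))
```

In the crawler-style index the boilerplate word "home" matches every
page and each article turns up once per view; the database-backed index
has one entry per article and no boilerplate at all.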
--
Andy Armstrong, hexten.net
|
|
|
2006-02-08 22:29:35 |
On 8 Feb 2006, at 22:13, Minty wrote:
> On the down side, it's still written in Java, and their own website
> search is powered by google. Doh.
use Socks::Fireproof;
Java's OK
--
Andy Armstrong, hexten.net
|
|