Thread: Proposed changes to omindex

2006-08-11 05:52:59
Proposed changes to omindex
Currently Available Items
=========================
1) Have the Q prefix contain the 16 byte MD5 of the full file name used for
   document lookup during indexing.

2) Add the document’s last modified time to the value table (ID 0). This
   would allow incremental indexing based on the timestamp and also sorting
   by date in omega (SORT=0)
   a. Currently I store the timestamp as a 10 byte string (left zero padded
      UNIX time string) i.e. 0969492426
   b. However, for maximum space savings it could be stored as a 4 byte
      string in big endian format with a get/set utility function to handle
      the conversion if necessary.

3) Add the document’s MD5 to the value table as a 16 byte string (binary
   representation of the digest) (ID 1). This could be used as a secondary
   check for incremental indexing (i.e. if the file was touched but not
   changed don’t replace it) and also to collapse duplicates (COLLAPSE=1).
   The md5 source code is from the GNU testutils-2.1 package.

4) For files that require command line utility processing (i.e. pdftotext)
   I have added a --copylocal option. This allows the file to be digested
   while being copied to the local drive and then the command line utility
   processes the local file saving multiple reads across the network. If we
   want to expand this it could be used to build a local
   cache/backup/repository. For my use I was thinking of putting the files
   under source control (svn) but that is another discussion thread.

5) I would also recommend storing the full filename in the document data.
   file=/mnt/vol1/www/sample.html. I have a purge utility that cleans out
   documents that are no longer found on the file system using this
   information. FYI: I am currently migrating to a MySQL metadata
   repository that will move information like this out of the search index;
   it also preserves metadata on complete index rebuilds and allows users
   to add additional information that may not be contained in the actual
   document.
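
As a rough illustration of the single-pass idea in item 4 (digest the file while copying it), here is a minimal C++ sketch. It is not the --copylocal patch: `copy_and_digest` is a hypothetical name, and a 64-bit FNV-1a checksum stands in for MD5 purely so the example needs nothing beyond the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copy everything from `in` to `out` in fixed-size blocks, updating a
// checksum as each block passes through, so the source is read only once.
// FNV-1a (64-bit) is a stand-in for MD5 here; it is not a cryptographic hash.
inline std::uint64_t copy_and_digest(std::istream& in, std::ostream& out) {
    assert(out.good());  // the destination must be writable on entry
    std::uint64_t hash = 14695981039346656037ULL;  // FNV-1a offset basis
    const std::uint64_t prime = 1099511628211ULL;  // FNV-1a prime
    char buf[4096];
    // read() fails on the final short block, so also check gcount().
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            hash ^= static_cast<unsigned char>(buf[i]);
            hash *= prime;
        }
        out.write(buf, n);
    }
    return hash;
}
```

A caller would presumably open the remote file as `in` and the local copy as `out`, then hand the local path to the filter program once the copy is complete.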
Future Items
============
6) Stream indexer. Instead of reading the entire file into memory, process
   it line by line. This should make indexing large files more efficient.

7) Clean up the fixme’s in mime type handlers i.e. // FIXME: run pdfinfo
   once and parse the output ourselves. I would use pcre to extract the
   desired text.

8) Change the way stemmed terms are added to the database. Remove the R
   prefix from raw terms and only write stemmed terms to the DB if they
   differ from the original term, prefixing them with Z?. If stemming was
   set to none this would reduce the current term tables (termlist,
   postlist, and position) by about 50%. The query parser would have to be
   modified to use the same rules.
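
The rule in item 8 can be sketched in a few lines of C++. This is only an illustration of the proposal, not code from omindex: `terms_for_word` and `toy_stem` are hypothetical names, the stemmer is a deliberately crude stand-in for a real one, and the prefix is passed in as a parameter.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Terms to index for one word under the proposed scheme: the raw word with
// no prefix, plus a prefixed stemmed form only when stemming changes it.
inline std::vector<std::string> terms_for_word(
    const std::string& word,
    const std::function<std::string(const std::string&)>& stem,
    const std::string& stem_prefix) {
    assert(stem);  // a stemming function must be supplied
    std::vector<std::string> terms{word};
    const std::string stemmed = stem(word);
    if (stemmed != word) terms.push_back(stem_prefix + stemmed);
    return terms;
}

// Toy stemmer for illustration only: strips one trailing "s" from words
// longer than three characters.
inline std::string toy_stem(const std::string& w) {
    if (w.size() > 3 && w.back() == 's') return w.substr(0, w.size() - 1);
    return w;
}
```

With the toy stemmer and a prefix of "Z", the word "words" yields two terms (the raw word and the prefixed stem), while "the" yields only itself.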
Let me know if you are interested in including any of these
changes in Xapian.
Thanks,
Trink
_______________________________________________
Xapian-devel mailing list
Xapian-devel lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

2006-08-11 06:45:02
Michael Trinkala wrote:
> Proposed changes to omindex
>
> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name used
> for document lookup during indexing.
>
> 2) Add the document’s last modified time to the value table (ID 0). This
> would allow incremental indexing based on the timestamp and also sorting
> by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10 byte string (left zero padded
> UNIX time string) i.e. 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte
> string in big endian format with a get/set utility function to handle
> the conversion if necessary.
>
> 3) Add the document’s MD5 to the value table as a 16 byte string (binary
> representation of the digest) (ID 1). This could be used as a secondary
> check for incremental indexing (i.e. if the file was touched but not
> changed don’t replace it) and also to collapse duplicates (COLLAPSE=1).
> The md5 source code is from the GNU testutils-2.1 package.
>
> 4) For files that require command line utility processing (i.e.
> pdftotext) I have added a --copylocal option. This allows the file to be
> digested while being copied to the local drive and then the command line
> utility processes the local file saving multiple reads across the
> network. If we want to expand this it could be used to build a local
> cache/backup/repository. For my use I was thinking of putting the files
> under source control (svn) but that is another discussion thread.

I already have a cache_dir option in my omega.conf and successfully use
it in omindex for recursive local zip/rar/msg/pst "virtual directories",
last_mod checked. MSVC not supported, sorry.
I'll clean it up and post it here.

Your idea to cache the output of costly extractors like xls2csv and
pdftotext also seems promising, but with the implemented last_mod check
it is not really needed IMHO.

> 5) I would also recommend storing the full filename in the document data.
> file=/mnt/vol1/www/sample.html. I have a purge utility that cleans out
> documents that are no longer found on the file system using this
> information. FYI: I am currently migrating to a MySQL metadata repository
> that will move information like this out of the search index; it also
> preserves metadata on complete index rebuilds and allows users to add
> additional information that may not be contained in the actual document.
>
> Future Items
> ============
> 6) Stream indexer. Instead of reading the entire file into memory,
> process it line by line. This should make indexing large files more
> efficient.
>
> 7) Clean up the fixme’s in mime type handlers i.e. // FIXME: run pdfinfo
> once and parse the output ourselves. I would use pcre to extract the
> desired text.
>
> 8) Change the way stemmed terms are added to the database. Remove the R
> prefix from raw terms and only write stemmed terms to the DB if they
> differ from the original term, prefixing them with Z?. If stemming was
> set to none this would reduce the current term tables (termlist,
> postlist, and position) by about 50%. The query parser would have to be
> modified to use the same rules.
>
> Let me know if you are interested in including any of these changes in
> Xapian.
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/
http://helsinki.at/ http://spacemovie.mur.at/

2006-08-19 18:22:10
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.
I don't think this is generally useful, for reasons previously given:
omega/omindex are really targeted at indexing and searching web sites,
where the URI is the identifier. A filename used to provide a
representation of that resource isn't at all interesting to omega, and
is only partly interesting to omindex (i.e. there are other ways of
doing it). omindex is pretty limited in any case, and if you're doing
anything funky you'll be using scriptindex or your own indexer. Within
that, how you generate Q-terms and manage your documents is of course
entirely up to you.
> 4) For files that require command line utility processing
> (i.e. pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then
> the command line utility processes the local file saving multiple
> reads across the network. If we want to expand this it could be used
> to build a local cache/backup/repository. For my use I was thinking
> of putting the files under source control (svn) but that is another
> discussion thread.
This is neat. I agree that for anything more complex it's
not actually
going to solve all the requirements, but for remote files it
can
work. (Although any decent network fs has built-in caching,
and in any
case you could rely on the OS buffers - if you open() first,
then dup
the filedes, then use fdopen() to turn it into a FILE* -
twice -
there's very little reason you'll have to hit the network
twice, even
on a lame net fs. Do you have any timing data on how much
this
improves things for you?)
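
A minimal sketch of the open()/dup()/fdopen() idea, assuming a POSIX system. `read_twice` and `slurp` are names made up for illustration and are not omindex code. One detail worth noting: descriptors created with dup() share a single file offset, so the second reader has to seek back to the start before it reads.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Read an entire stdio stream from its current position.
inline std::string slurp(FILE* f) {
    std::string data;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) data.append(buf, n);
    return data;
}

// Open `path` once, duplicate the descriptor, and read the file through two
// separate FILE* streams. Returns false if any step fails.
inline bool read_twice(const char* path, std::string& first, std::string& second) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    int fd2 = dup(fd);
    if (fd2 < 0) { close(fd); return false; }
    FILE* a = fdopen(fd, "rb");
    FILE* b = fdopen(fd2, "rb");
    if (!a || !b) {
        if (a) fclose(a); else close(fd);
        if (b) fclose(b); else close(fd2);
        return false;
    }
    first = slurp(a);
    fclose(a);
    // dup()ed descriptors share one file offset, so rewind before the
    // second stream has done any I/O of its own.
    bool ok = lseek(fd2, 0, SEEK_SET) == 0;
    if (ok) second = slurp(b);
    fclose(b);
    return ok;
}
```

Whether the second pass is actually served from the OS cache depends on the file system and the kernel, which is why the timing question above matters.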
> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system
> using this information. FYI: I am currently migrating to a MySQL
> metadata repository that will move information like this out of the
> search index; it also preserves metadata on complete index rebuilds
> and allows users to add additional information that may not be
> contained in the actual document.
omindex has its own mechanism for purging documents that no
longer
exist. Again, the separation from logical URI to physical
storage
pushes me in the direction of not wanting this in omindex.
One idea I've talked to someone about is separating omindex
into
something that drives scriptindex, which in theory would
allow you to
use the file spider in omindex with whatever indexing
strategy you
wanted.
Speaking of metadata, what I'd really like is a
Xapian-indexable RDF
store. I doubt anyone else wants one of those though
> 8) Change the way stemmed terms are added to the database. Remove
> the R prefix from raw terms and only write stemmed terms to the DB
> if they differ from the original term, prefixing them with Z?. If
> stemming was set to none this would reduce the current term tables
> (termlist, postlist, and position) by about 50%. The query parser
> would have to be modified to use the same rules.
Currently, you only get dual terms if the initial letter is
a
capital. On a sample database I have here of an old blog, I
have:
24535 terms in total
8157 R-terms
1718 other prefixed terms
So we'd get a saving of 33% by dropping R-terms when
stemming; however
we'd then lose much if not all of that saving (which I
can't calculate
without passing over the original data again) by having to
put stemmed
versions back in again, whether an R-term would have been
generated or
not. Mind you, a *very* quick test suggests that on some of
my data,
no more than 25% of words actually stem to something
different. I
suspect this is because there are lots of short words in
everyday
English. So there could be some saving here.
If you're not using stemming, and are content to force
everything into
lowercase (modulo the excitement that causes with Unicode),
dropping
R-terms seems a good strategy. I'd certainly favour having
a way of
running the query parser that didn't need R-terms, and then
perhaps a
way of driving omindex/scriptindex to not generate them in
the first
place. It's a pretty easy change, in
index_text.cc:index_text().
I think this all comes down to whether you think stemming is
a good
default or not. If you're more concerned about stemmed
forms, you want
them to be obvious and probably unprefixed. (It's certainly
easier to
debug this way.)
> Let me know if you are interested in including any of these changes
> in Xapian.
I think the best thing is to wait until Olly's back and has
a chance
to digest all these and comment on them himself. It's
really up to him
what goes in anyway
James
--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james tartarus.org                               uncertaintydivision.org

2006-08-26 21:56:47
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:
> Proposed changes to omindex
One suggestion before I go into details - even if some of
these patches
may not be things we'd want to include in the mainstream
releases right
now, they may still be of interest to some other users. So
I'd
encourage you to offer them for download, or just post them
here if they
aren't too big. The same goes for other people with
patches they're
happy to share.
> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.
There are two issues here really.
The first is if the unique id should be based on the file
path or the
URL. Currently omindex uses the URL, but the file path
could be used
instead. The main difference I can see is that it would
allow the URL
mappings to be changed without a reindex (providing the
omega CGI
applied the mappings at search time) but I'm not sure how
useful that
really is - I can't remember the last time I reconfigured
the url to
file mappings on any webserver I maintain.
On the flip side, currently you can move the physical
locations of
files around and change the URL mappings in the web server
so the URLs
remain the same, and omindex won't have to reindex a thing.
That
actually seems a more likely scenario to me (though again I
can't
remember the last time I've actually done this).
As James has said, we've discussed this before and ended up
staying with
how things are, mostly because there didn't seem to be any
particular
advantage to changing.
The other issue is that terms have an upper length limit, so you need a
way to cope with overly long URLs/file paths when building UID terms.
Currently we only hash if the URL is over about 240
characters. The
problem with always using only a hash is that you can get
collisions
even with modest length URLs/paths.
While this might seem a bit of a theoretical risk, there are
"only"
256^16 MD5 sums, but more than 255^240 file names which
would easily fit
in a term - that's more than 10^539 file names per MD5 sum,
so really a
very large number of possible collisions! Even if you only consider
filenames including alphanumerics, "_", "-", and "/", it's still more
than 10^395 file names per MD5 sum.
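
The orders of magnitude quoted above can be checked with a few lines of C++. This is just the arithmetic from the paragraph, worked in logarithms; `log10_names_per_digest` is a hypothetical helper, and 65 is the assumed alphabet size for the restricted case (26 + 26 letters, 10 digits, and the three punctuation characters).

```cpp
#include <cassert>
#include <cmath>

// log10 of (alphabet_size ^ max_len) / (256 ^ 16): the order of magnitude
// of the number of distinct names of length max_len per possible 16-byte
// digest.
inline double log10_names_per_digest(double alphabet_size, int max_len) {
    assert(alphabet_size > 1.0 && max_len > 0);
    const double log10_names = max_len * std::log10(alphabet_size);
    const double log10_digests = 16 * std::log10(256.0);
    return log10_names - log10_digests;
}
```

Calling it with (255, 240) gives roughly 539.0 and with (65, 240) roughly 396.6, consistent with "more than 10^539" and "more than 10^395" above.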
> 2) Add the document’s last modified time to the value table (ID 0).
> This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10 byte string (left zero
> padded UNIX time string) i.e. 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte
> string in big endian format with a get/set utility function to handle
> the conversion if necessary.
I think this would be very useful. I tend to think storing
the number
in 4 bytes (or perhaps 5 to take us past 2038...) is worth
the effort
since you have to convert the number when storing and
retrieving as a
string anyway. The functions needed are available already
(on Unix
at least) as htonl and ntohl.
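
A small sketch of such a get/set pair, written with plain shifts rather than htonl/ntohl so that the width is not tied to 32 bits. The names `encode_be` and `decode_be` are made up for illustration. The useful property is that fixed-width big-endian strings compare bytewise in the same order as the numbers they encode, which is what makes the value usable as a sort key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode `value` as a fixed-width big-endian byte string (most significant
// byte first). `width` would be 4 for a 32-bit time, or 5 to go past 2038.
inline std::string encode_be(std::uint64_t value, std::size_t width) {
    std::string out(width, '\0');
    for (std::size_t i = width; i-- > 0;) {
        out[i] = static_cast<char>(value & 0xff);
        value >>= 8;
    }
    assert(value == 0);  // the value must fit in `width` bytes
    return out;
}

// Decode a big-endian byte string of up to 8 bytes back into a number.
inline std::uint64_t decode_be(const std::string& bytes) {
    assert(bytes.size() <= 8);
    std::uint64_t value = 0;
    for (unsigned char c : bytes) value = (value << 8) | c;
    return value;
}
```

All values in one slot need the same width for the ordering to hold, so a move from 4 to 5 bytes would mean rewriting existing values.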
> 3) Add the document’s MD5 to the value table as a 16 byte string
> (binary representation of the digest) (ID 1). This could be used as a
> secondary check for incremental indexing (i.e. if the file was touched
> but not changed don’t replace it) and also to collapse duplicates
> (COLLAPSE=1).
> The md5 source code is from the GNU testutils-2.1 package.
I think this would be useful too.
It'd be marginally better to use a non-GPL md5
implementation (we're
trying to eliminate unrelicensable GPL code from the core
library, but
it'd be nice to be able to relicense Omega too). A quick
Google
reveals at least a couple of candidates, though I've not
looked at
either in any detail:
http://sourceforge.net/projects/libmd5-rfc/ zlib/libpng License (BSD-ish)
http://www.fourmilab.ch/md5/ public domain
But unless the md5 api is complex, I imagine it'd be easy
enough to drop
one of these in instead at a later date. The GNU version
should be very
well tested at least, whereas the above implementations may
be less so.
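
How the timestamp (value 0) and the digest (value 1) might work together during incremental indexing can be sketched as below. This is one reading of the proposal and not the patch itself: `StoredInfo`, `FileState` and `decide` are hypothetical names, and the digest is passed as a callable so that a file whose timestamp has not changed is never read.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// What the index already holds for a file (values 0 and 1 in the proposal).
struct StoredInfo {
    std::time_t last_mod;
    std::string digest;  // 16-byte binary digest
};

enum class FileState { Unchanged, TouchedOnly, Changed };

// Classify a file that is already in the index. `current_digest` is only
// invoked when the timestamp differs.
template <typename DigestFn>
FileState decide(const StoredInfo& stored, std::time_t mtime_now,
                 DigestFn current_digest) {
    assert(stored.digest.size() == 16);
    if (mtime_now == stored.last_mod) return FileState::Unchanged;
    if (current_digest() == stored.digest) return FileState::TouchedOnly;
    return FileState::Changed;
}
```

Only `Changed` would require replacing the document. What to do for `TouchedOnly` (for example, whether to refresh the stored timestamp) is left to the caller.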
> 4) For files that require command line utility processing (i.e.
> pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then the
> command line utility processes the local file saving multiple reads
> across the network.
Have you actually benchmarked this?
A decent OS should cache the file's contents and avoid the
multiple
reads across the network, so this could end up being slower
than just
reading the remote file twice (because the file needs to be
written and
flushed to local disk before the filter program gets run).
If it really does help, it seems a useful addition.
> If we want to expand this it could be used to
> build a local cache/backup/repository. For my use I was thinking of
> putting the files under source control (svn) but that is another
> discussion thread.
I think backup and source control are really outside of the
scope of
omindex, unless I misunderstand what you're suggesting
here.
> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system using
> this information.
As James says, we have a different approach to purging removed files
during indexing which doesn't require this field. I don't object
strongly to adding this if it's actually useful though.
> FYI: I am currently migrating to a MySQL metadata repository that will
> move information like this out of the search index; it also preserves
> metadata on complete index rebuilds and allows users to add additional
> information that may not be contained in the actual document.
There's certainly something to be said for keeping
information useful
for (re)indexing but not for search in a separate place.
The downside
is that it's hard to flush the Xapian index and metadata
store
atomically so you need a robust strategy to cope with
indexing being
interrupted when the two aren't in sync.
> Future Items
> ============
> 6) Stream indexer. Instead of reading the entire file into memory,
> process it line by line. This should make indexing large files
> more efficient.
Line-by-line isn't much better - it's not unusual to find
long HTML
documents which are all on one line (e.g. those produced on
an old Mac
where the end of line character is different, or those
generated by
a script).
But some sort of chunked reading isn't a bad idea. The
HTML parser
currently relies on indefinite lookahead which may be
awkward to do
while dealing with chunks, but that can probably be fixed
without
changing how HTML documents parse in cases which actually
matter.
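
To illustrate the chunked approach for plain text (not for the HTML parser, whose lookahead is the harder problem), here is a minimal sketch. `words_chunked` is a hypothetical name. It carries a partially read word over from one chunk to the next, so the result does not depend on where chunk boundaries fall or on which end-of-line convention the file uses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read `in` in chunks of `chunk_size` bytes and split it into alphanumeric
// words, keeping a word in progress across chunk boundaries.
inline std::vector<std::string> words_chunked(std::istream& in,
                                              std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<std::string> words;
    std::string pending;  // word in progress, possibly spanning chunks
    std::string buf(chunk_size, '\0');
    while (in.read(&buf[0], static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        const std::size_t n = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < n; ++i) {
            const unsigned char c = static_cast<unsigned char>(buf[i]);
            if (std::isalnum(c)) {
                pending.push_back(static_cast<char>(c));
            } else if (!pending.empty()) {
                words.push_back(pending);
                pending.clear();
            }
        }
    }
    if (!pending.empty()) words.push_back(pending);
    return words;
}
```

Memory use is bounded by the chunk size plus the longest single word, rather than by the size of the file or of its longest line.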
> 7) Clean up the fixme’s in mime type handlers i.e. // FIXME: run
> pdfinfo once and parse the output ourselves. I would use pcre to
> extract the desired text.
Even PCRE is really overkill as you're looking for a
constant string
in every case. It's just sheer laziness that I didn't do
it right to
start with. Sorry.
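
A constant-string lookup of this kind might look like the sketch below. It assumes pdfinfo output has already been captured into a string and that each field sits on its own line as a key followed by a colon and padding, for example "Title:" or "Pages:". `field_value` is a hypothetical helper, not an existing omindex function.

```cpp
#include <cassert>
#include <string>

// Return the value on the line that starts with `key` (e.g. "Title:"),
// with surrounding spaces trimmed; an empty string if there is no such
// line or the line has no value.
inline std::string field_value(const std::string& output,
                               const std::string& key) {
    assert(!key.empty());
    std::string::size_type pos = 0;
    while (pos < output.size()) {
        std::string::size_type eol = output.find('\n', pos);
        if (eol == std::string::npos) eol = output.size();
        if (output.compare(pos, key.size(), key) == 0) {
            const std::string::size_type b =
                output.find_first_not_of(" \t\r", pos + key.size());
            if (b == std::string::npos || b >= eol) return std::string();
            const std::string::size_type e =
                output.find_last_not_of(" \t\r", eol - 1);
            return output.substr(b, e - b + 1);
        }
        pos = eol + 1;
    }
    return std::string();
}
```

Running pdfinfo once and calling this for each field of interest would address the FIXME without pulling in a regular expression library.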
> 8) Change the way stemmed terms are added to the database. Remove the
> R prefix from raw terms and only write stemmed terms to the DB if they
> differ from the original term, prefixing them with Z?.
"Z?" doesn't match our existing conventions for
prefixes, but the choice
of prefix is just cosmetic.
This would mean that a search for "words which stem to 'foo'" would
become foo OR Z?foo, which will be slower and give less accurate
statistics (though there'll probably be some speed gain from reduced
VM file cache pressure in many cases).
So are you suggesting we should generate the non-stemmed terms from
every word? Currently R terms are only generated for capitalised
words, which is really done to allow searches for proper nouns
without problems caused by stemming. However, this feature is
sometimes problematic itself - people type in capitalised words
in queries without knowing about the feature and sometimes the
results returned aren't great.
> If stemming
> was set to none this would reduce the current term tables (termlist,
> postlist, and position) by about 50%. The query parser would
> have to be modified to use the same rules.
Is it really as much as 50% for all of them? We only generate R terms
for capitalised words, so this surprises me.
I've actually been thinking of reworking how we handle
indexing of
stemmed and unstemmed forms myself. No firm conclusions,
but I've
been wondering about indexing all words unstemmed with
positional
information and all stemmed forms without. This would mean
that
we could still support phrase searching as we currently
implement it,
and NEAR for unstemmed words. A capitalised word in a query
could
search for an unstemmed form, and a non-capitalised word for
a
stemmed form. Also stemming could be turned off at query
time.
This would save slightly more space in the position table than your
approach, but not as much in the termlist or postlist table.
Perhaps some combination of our ideas would work. I think I
need to
mull it over more.
Cheers,
Olly

2006-08-26 22:13:37
On Sat, Aug 19, 2006 at 07:22:10PM +0100, James Aylett wrote:
> (Although any decent network fs has built-in caching, and in any
> case you could rely on the OS buffers - if you open() first, then dup
> the filedes, then use fdopen() to turn it into a FILE* - twice -
> there's very little reason you'll have to hit the network twice, even
> on a lame net fs.
Some of the format conversion filters want a filename for
the input, so
you can't open the file once and dup the file descriptor
(pdftotext for
example). Those that can read from stdin (e.g. antiword)
could be
handled this way if it actually helps.
> One idea I've talked to someone about is separating omindex into
> something that drives scriptindex, which in theory would allow you to
> use the file spider in omindex with whatever indexing strategy you
> wanted.
Perhaps that was me, or possibly we've both discussed it
with Richard
separately?
Anyway, it's an interesting idea, though it might add
measurable
overhead. A step towards it is that I've recently added a
"load"
command to scriptindex which allows you to write an index
script which
takes a filename to read and index the contents of.
> I'd certainly favour having a way of running the query parser that
> didn't need R-terms, [...]
There already is: QueryParser::set_stemming_strategy() can
be called
with STEM_NONE or STEM_ALL (the default is STEM_SOME).
Cheers,
Olly

2006-08-27 08:24:42
The tar file can be found here:
https://www.trinkala.com/xapian/sort_collapse.tgz
Change summary for omega
------------------------
- Added the document’s last modified time to the value table
(ID 0). It is stored as a 4 byte
string in big endian format
- Added the document’s MD5 to the value table (ID 1) as a 16
byte string and C term prefix to
allow collapsed documents to be easily expanded/searched
Tar file contents
-----------------
diff.txt - SVN diff of all the changes against revision 7156

Added the following files from the GNU testutils-2.1 package:
  md5.c
  md5.h
  unlocked-io.h

utils.h
utils.cc
  - added an enum for the value id constants
  - added md5 and big endian date conversion functions
omindex.cc
  - add the two new value items and the new C prefix term during indexing
docs/omegascript.txt
query.cc
  - added $md5 and $valuedate commands/documentation
docs/termprefixes.txt
  - added C prefix documentation
configure.ac
  - added a check for endianness
Makefile.am
  - added the md5.c file to the SOURCES list
Thanks,
Trink

2006-08-27 14:27:08
On Sat, Aug 26, 2006 at 11:13:37PM +0100, Olly Betts wrote:
> Some of the format conversion filters want a filename for the input, so
> you can't open the file once and dup the file descriptor (pdftotext for
> example). Those that can read from stdin (e.g. antiword) could be
> handled this way if it actually helps.
Well, strictly speaking we can LD_PRELOAD filters that
can't act as
stream filters to death, although that only works on modern
Unices. We
shouldn't really rely on that, though
Most filters would accept a patch to work from stdin if they
don't
already, and it wouldn't be too difficult to do. That would
benefit
everyone, if we run into some common ones.
I've no idea whether it actually will help, in practice. I
suspect
that in most cases, it's not actually going to win you much
because
the file buffering will do the right thing already.
> > One idea I've talked to someone about is separating omindex into
> > something that drives scriptindex, which in theory would allow you to
> > use the file spider in omindex with whatever indexing strategy you
> > wanted.
>
> Perhaps that was me, or possibly we've both discussed it with Richard
> separately?
I've no idea
> Anyway, it's an interesting idea, though it might add measurable
> overhead. A step towards it is that I've recently added a "load"
> command to scriptindex which allows you to write an index script which
> takes a filename to read and index the contents of.
If we retain omindex's approach for HTML (which it
understands
natively) and anything that filters to plain text, and just
allow
people to write filters that generate scriptindex input
files (with
the filter being associated with an index script), then we
get more
flexibility in omindex without having to sacrifice
efficiency of
indexing in the common case.
That would also allow decent indexing of anything that
embedded XMP,
incidentally. This is considered A Good Thing, at least by
me.
> > I'd certainly favour having a way of running the query parser that
> > didn't need R-terms, [...]
>
> There already is: QueryParser::set_stemming_strategy() can be called
> with STEM_NONE or STEM_ALL (the default is STEM_SOME).
Ah, excellent. Is this documented anywhere? Can't remember
seeing it...
James
--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james tartarus.org                               uncertaintydivision.org

2006-08-27 15:05:22
On Sat, Aug 26, 2006 at 10:56:47PM +0100, Olly Betts wrote:
> > Proposed changes to omindex
>
> One suggestion before I go into details - even if some of these patches
> may not be things we'd want to include in the mainstream releases right
> now, they may still be of interest to some other users. So I'd
> encourage you to offer them for download, or just post them here if they
> aren't too big. The same goes for other people with patches they're
> happy to share.
Michael and I discussed briefly having a bit more detailed
"outreach"
links on the xapian website. The only reason we don't have
more at the
moment is that we haven't really started tracking all the
extensions
and uses that people have done (and it's only very recently
that
xapian use has really started snowballing, if you'll excuse
the
unintentional pun).
I'm thinking of a kind of directory, with links categorised: "useful
patches", and "useful libraries", "useful helper programs" (like
filters for indexing), "systems that integrate xapian", howtos,
whatever. If this seems a good idea, I'm happy to be the contact for
submissions and updates for this. (Does xapian.org auto-update like
snowball does?)
> The first is if the unique id should be based on the file path or the
> URL. Currently omindex uses the URL, but the file path could be used
> instead. The main difference I can see is that it would allow the URL
> mappings to be changed without a reindex (providing the omega CGI
> applied the mappings at search time) but I'm not sure how useful that
> really is - I can't remember the last time I reconfigured the url to
> file mappings on any webserver I maintain.
But: Cool URIs Don't Change. So you might radically
rearrange the way
you serve your website (moving from static serve to
rendering-driven
XML, or to a CMS), but it would be nice if you didn't have
to reindex
the whole lot.
I'm aware that I'm unusual in insisting on this sort of
thing; I have
to wage small wars at work to get people to believe me. On
my side of
the argument are some fairly hefty WebArch names, though
> > 2) Add the document’s last modified time to the value table (ID 0).
>
> I think this would be very useful. I tend to think storing the number
> in 4 bytes (or perhaps 5 to take us past 2038...) is worth the effort
> since you have to convert the number when storing and retrieving as a
> string anyway. The functions needed are available already (on Unix
> at least) as htonl and ntohl.
htonl / ntohl won't work with 5 bytes, and indeed I'd
recommend we
either use 4 bytes or 8. (htonll / ntohll exist on Solaris,
and there
should be equivalents lying around somewhere on other 64 bit
platforms.)
We *could* start with 4 bytes and then auto-upgrade. Not
sure if the
space saving over 8 bytes is actually worth the hassle of
maintaining
BC code after 2038 though.
> It'd be marginally better to use a non-GPL md5 implementation (we're
> trying to eliminate unrelicensable GPL code from the core library, but
> it'd be nice to be able to relicense Omega too).
>
> But unless the md5 api is complex, I imagine it'd be easy enough to drop
> one of these in instead at a later date. The GNU version should be very
> well tested at least, whereas the above implementations may be less so.
Is md5 the right hash for us? I suspect it is, because we
don't
actually need strong cryptographic hash properties, but
it's worth
thinking about.
> > 4) For files that require command line utility processing (i.e.
> > pdftotext) I have added a --copylocal option.
>
> If it really does help, it seems a useful addition.
I'd like an option to turn it off, if we do include it.
I'm not 100%
certain why I think this, though.
[filename in data field]
> As James says, we have a different approach to purging removed files
> during indexing which doesn't require this field. I don't object
> strongly to adding this if it's actually useful though.
I think it has definite advantages. More generally, it's a
source
identifier, which could be:
* filename of source file
* SQL database table primary key
* Object database lookup key
* URI of resource with metadata in RDF database
It would be nice to *either* have a separate source type
field, *or*
just agree that if you need it, you should probably always
stuff
fully-qualified URIs in the field, so you can create your
own
private-use URI schemes as needed.
> > FYI: I am currently migrating to a MySQL metadata repository that will
> > move information like this out of the search index; it also preserves
> > metadata on complete index rebuilds and allows users to add additional
> > information that may not be contained in the actual document.
>
> There's certainly something to be said for keeping information useful
> for (re)indexing but not for search in a separate place. The downside
> is that it's hard to flush the Xapian index and metadata store
> atomically so you need a robust strategy to cope with indexing being
> interrupted when the two aren't in sync.
If you have a really sophisticated setup and really, really
need this
kind of thing, with some work to tidy things up on rollback
you can
use a distributed transaction mechanism such as JTA.
My feeling is that omega out of the box should just neatly
work out of
the Xapian db (with some sort of config file that describes
your
setup), and then if you want to do something much more
interesting we
should provide a bit of guidance on how to approach it. In
complex
systems, having multiple EIS is almost always going to be
the right
practical approach, at least at the moment.
> So are you suggesting we should generate the non-stemmed terms from
> every word? Currently R terms are only generated for capitalised
> words, which is really done to allow searches for proper nouns
> without problems caused by stemming. However, this feature is
> sometimes problematic itself - people type in capitalised words
> in queries without knowing about the feature and sometimes the
> results returned aren't great.

I think the problem here is more to do with that. Could we have an
option to lowercase the query string beforehand, just a CGI param you
can punt into omega?
James
--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james tartarus.org                               uncertaintydivision.org
_______________________________________________
Xapian-devel mailing list
Xapian-devel lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
2006-08-27 18:00:44
On Sun, Aug 27, 2006 at 04:05:22PM +0100, James Aylett wrote:
> On Sat, Aug 26, 2006 at 10:56:47PM +0100, Olly Betts wrote:
>
> > One suggestion before I go into details - even if some of these patches
> > may not be things we'd want to include in the mainstream releases right
> > now, they may still be of interest to some other users. So I'd
> > encourage you to offer them for download, or just post them here if they
> > aren't too big. The same goes for other people with patches they're
> > happy to share.
>
> Michael and I discussed briefly having a bit more detailed "outreach"
> links on the xapian website. The only reason we don't have more at the
> moment is that we haven't really started tracking all the extensions
> and uses that people have done (and it's only very recently that
> xapian use has really started snowballing, if you'll excuse the
> unintentional pun).

I pick up new uses of Xapian and note them in a file which I go through
periodically and add to users.php, but it's a surprisingly time-consuming
job.
> I'm thinking of a kind of directory, with links categorised: "useful
> patches", and "useful libraries", "useful helper programs" (like
> filters for indexing), "systems that integrate xapian", howtos,
> whatever. If this seems a good idea, I'm happy to be the contact for
> submissions and updates for this. (Does xapian.org auto-update like
> snowball does?)

It's supposed to be automatic, but actually you have to press this
button...

In reality, it's a pain for others to update as things currently are,
mostly because the search index is updated by the script and that's
owned by my userid. Nobody else has shown any desire to update pages,
so I've not worried about it so far. You can just copy new files
directly to the web tree though (I often do for trivial changes), but
make sure the permissions are sane if you do (and check the changes in
too or they'll get lost!)

It might be better to put this directory on the wiki anyway - it's the
sort of thing we created the wiki for, and it would allow people to just
add their own entries. Then your job would just be to make sure that
things stay tidy and sort out links which go dead.
> > The first is if the unique id should be based on the file path or the
> > URL. Currently omindex uses the URL, but the file path could be used
> > instead. [...]
>
> But: Cool URIs Don't Change. So you might radically rearrange the way
> you serve your website (moving from static serve to rendering-driven
> XML, or to a CMS), but it would be nice if you didn't have to reindex
> the whole lot.

FWIW, this argues for the status quo.
> > > 2) Add the document’s last modified time to the value table (ID 0).
> >
> > I think this would be very useful. I tend to think storing the number
> > in 4 bytes (or perhaps 5 to take us past 2038...) is worth the effort
> > since you have to convert the number when storing and retrieving as a
> > string anyway. The functions needed are available already (on Unix
> > at least) as htonl and ntohl.
>
> htonl / ntohl won't work with 5 bytes, and indeed I'd recommend we
> either use 4 bytes or 8.

No, but it's easy to handle the extra byte yourself (and it can just be
a zero right now anyway).

> We *could* start with 4 bytes and then auto-upgrade. Not sure if the
> space saving over 8 bytes is actually worth the hassle of maintaining
> BC code after 2038 though.

The auto-upgrade would be rather painful for a large database (though to
be honest I'd be astonished if we don't have an incompatible database
format change in the next 32 years anyway), which is why I suggested we
might want to put the extra byte in ahead of time.

8 really is overkill - we are considering dates on files here, so it's
only dates which have happened which are relevant, and 5 bytes takes you
to 36443!

Anyway, I think it's sanest just to go with 4 bytes for now.
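
For illustration, here is a minimal Python sketch of the fixed-width
big-endian encoding discussed above. It is not code from omindex; the
function names are hypothetical, and Python's struct module stands in
for htonl/ntohl. A fixed-width big-endian string compares bytewise in
the same order as the numbers it encodes, which is what makes it usable
as a sort key. The 5-byte variant shows the "extra byte" idea: for any
timestamp that fits in 32 bits the leading byte is simply zero.

```python
import struct


def encode_mtime(mtime: int) -> bytes:
    """Pack a UNIX timestamp into 4 bytes, big-endian (network order)."""
    return struct.pack(">I", mtime)


def decode_mtime(value: bytes) -> int:
    """Inverse of encode_mtime."""
    return struct.unpack(">I", value)[0]


def encode_mtime5(mtime: int) -> bytes:
    """Hypothetical 5-byte variant: one extra high-order byte up front."""
    return mtime.to_bytes(5, "big")


def decode_mtime5(value: bytes) -> int:
    """Inverse of encode_mtime5."""
    return int.from_bytes(value, "big")
```

With these helpers, decode_mtime(encode_mtime(969492426)) gives back
969492426, and the encoded form is 4 bytes rather than the 10-byte
zero-padded string.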
> > It'd be marginally better to use a non-GPL md5 implementation (we're
> > trying to eliminate unrelicensable GPL code from the core library, but
> > it'd be nice to be able to relicense Omega too).
> >
> > But unless the md5 api is complex, I imagine it'd be easy enough to drop
> > one of these in instead at a later date. The GNU version should be very
> > well tested at least, whereas the above implementations may be less so.
>
> Is md5 the right hash for us? I suspect it is, because we don't
> actually need strong cryptographic hash properties, but it's worth
> thinking about.

I had already considered this - the only concern I can see is that
somebody malicious might create a document with an identical MD5
checksum to one that they don't want you to find. This seems a very
artificial situation though.

The problem is that any cryptographic hash gets less secure as computing
power increases (and as researchers discover shortcuts to attack it with
less complexity than brute-force requires). But we don't want to pick
something fantastically secure for decades to come but currently insanely
computationally intensive as we need to run it on every file we index
or consider for reindexing.

A quick test suggests that SHA-1 is a bit more than 50% slower than MD5,
and SHA-1 isn't looking particularly future-proof.

So I think MD5 is probably an appropriate choice currently.
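
As a rough sketch of how the digest could serve as the secondary check
for incremental indexing (a file that was touched but not changed is
left alone), the following Python uses the standard hashlib module. The
function names and the way the stored values are passed in are
assumptions made for illustration; they are not omindex's actual
interface.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 65536) -> bytes:
    """Return the 16-byte binary MD5 digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.digest()


def needs_reindex(path: str, current_mtime: int,
                  stored_mtime: int, stored_md5: bytes) -> bool:
    """Decide whether a file should be reindexed.

    The cheap timestamp comparison runs first; the file is only read
    and hashed when the timestamp differs from the stored one.
    """
    if current_mtime == stored_mtime:
        return False
    return file_md5(path) != stored_md5
```

The hash is only computed for files whose timestamp has changed, so its
cost matters mainly for those files and for first-time indexing.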
> > > 4) For files that require command line utility processing (i.e.
> > > pdftotext) I have added a --copylocal option.
> >
> > If it really does help, it seems a useful addition.
>
> I'd like an option to turn it off, if we do include it. I'm not 100%
> certain why I think this, though.

For the case where you're indexing from local disk already! I suspect
this is much more common than indexing from a network drive.
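
To make the --copylocal idea concrete: the original proposal describes
digesting the file while copying it to the local drive, so the remote
file is read only once before the external filter runs on the local
copy. The Python below is a sketch of that single pass under that
reading; copy_and_digest is a hypothetical name, not the patch's code.

```python
import hashlib


def copy_and_digest(src_path: str, dst_path: str,
                    chunk_size: int = 65536) -> bytes:
    """Copy src_path to dst_path and return the MD5 digest of its content.

    Each chunk is read from the source once, then both hashed and
    written to the destination.
    """
    md5 = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            md5.update(chunk)
            dst.write(chunk)
    return md5.digest()
```

When the source is already on local disk the copy is pure overhead,
which is the case for making the behaviour optional.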
> [filename in data field]
> > As James says, we have a different approach to purging removed files
> > during indexing which doesn't require this field. I don't object
> > strongly to adding this if it's actually useful though.
>
> I think it has definite advantages. More generally, it's a source
> identifier, which could be:
>
> * filename of source file
> * SQL database table primary key
> * Object database lookup key
> * URI of resource with metadata in RDF database

But omindex isn't this general - it indexes files forming a website.
There's nothing to stop people who are indexing from other sources
(whether with scriptindex or a custom indexer) adding a source
identifier if they find it useful, but let's consider whether it's
generally useful for omindex to do it rather than looking at other
situations.

> It would be nice to *either* have a separate source type field, *or*
> just agree that if you need it, you should probably always stuff
> fully-qualified URIs in the field, so you can create your own
> private-use URI schemes as needed.

If you're taking the "every URI is sacred" view, the source identifier
can change while the URI doesn't and the document doesn't get reindexed.
So it could be stale information anyway.
> > However, this feature is sometimes problematic itself - people type
> > in capitalised words in queries without knowing about the feature
> > and sometimes the results returned aren't great.
>
> I think the problem here is more to do with that. Could we have an
> option to lowercase the query string beforehand, just a CGI param you
> can punt into omega?

You can already achieve the same end result rather less crudely by
using $set{stem_all,true} in the query template. If you want it
conditional on a CGI parameter, just use:

$if{$eq{$cgi,yes},$set{stem_all,true}}

Cheers,
Olly
2006-08-27 18:12:45
On Sun, Aug 27, 2006 at 03:27:08PM +0100, James Aylett wrote:
> Most filters would accept a patch to work from stdin if they don't
> already, and it wouldn't be too difficult to do. That would benefit
> everyone, if we run into some common ones.

Not all file formats can be sanely decoded without seeking though (and
some are more efficient to decode if you can seek).

> I've no idea whether it actually will help, in practice. I suspect
> that in most cases, it's not actually going to win you much because
> the file buffering will do the right thing already.

Indeed.

> If we retain omindex's approach for HTML (which it understands
> natively) and anything that filters to plain text, and just allow
> people to write filters that generate scriptindex input files (with
> the filter being associated with an index script), then we get more
> flexibility in omindex without having to sacrifice efficiency of
> indexing in the common case.

I'm not sure I can visualise how a merged indexer would look right
now, but I think this isn't something for the short term anyway -
sorting out utf-8 and flint are more important currently.
> > > I'd certainly favour having a way of running the query parser that
> > > didn't need R-terms, [...]
> >
> > There already is: QueryParser::set_stemming_strategy() can be called
> > with STEM_NONE or STEM_ALL (the default is STEM_SOME).
>
> Ah, excellent. Is this documented anywhere? Can't remember seeing it...

Hmm, only rather tersely:

http://www.xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#c7dc3b55b6083bd3ff98fc8b2726c8fd

I'll try to flesh that out.

Cheers,
Olly