|
List Info
Thread: Incremental indexing limitations
|
|
| Incremental indexing limitations |
  Israel |
2007-10-11 11:51:50 |
Assuming we have a server with 2GB memory and 500GB disk.
We want to use it for indexing of an always updating
database of documents.
And lets say each document is 2KB of text.
We have a process that constantly indexes data into a Xapian
database.
This process flushes updated every 10K documents (to make
sure they are
searchable and successfully stored) and after such an
update, marks the
documents as indexed.
Few observations until now.
1. Size of such a database will actually be about 3K per
document. That
is bigger than the text of the documents themselves. This is
quite
surprising really, as what Xapian is supposed to be storing
is just the
term IDs and the positions. Any ideas why its bigger than
the original
size as opposed to 1/3 of the size lets say?
2. Such an indexing process will start very fast, but when
reaching a
database size of 2M or 3M documents, each flush will take
2-4 minutes.
This is already very slow for a flush of 10K small
documents. Changing
flushes to every 1K document doesn't help. It seems the time
a flush
takes is not related so directly to the size of the flush
itself but
does strongly relate to the size of the database itself. How
come? What
happens during a flush?
3. If the time it takes to flush 10K documents when the DB
is 3M
documents, is 2.5 minutes, does it actually mean that when
the database
is 100M documents, each flush will take over an hour? If so,
that is
extremely "painful', isn't it?
So.. thinking about how to maybe optimize this process, the
concept of
using a "live" small database for updates and then
merge it into a big
database comes to mind.
However, there are two issues here:
1. Such a merge is slow. It will take quite a lot of time
(many hours)
to compact/merge each such live database into a main one. If
this
process has to be done hourly lets say, and the process
takes more than
an hour, we are faced with a critical problem.
2. Such a merge process seems to take quite a lot of
resources from the
system, limiting CPU, I/O and memory for the more urgent
task.. indexing
and searching.
3. It also means that we can never use more than 50% of the
diskspace on
the server. In fact less than 40% or 45% to be safe. This is
because the
compact process is merging the big database with the small
one, into a
new database. So the new database will be bigger than the
original big
one. So just because of this process, the server diskspace
can not be
utilized effectively.
Any thoughts and insights about the matter are greatly
appreciated.
Ron
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  United Kingdom |
2007-10-11 20:43:34 |
On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass wrote:
> 1. Size of such a database will actually be about 3K
per document. That
> is bigger than the text of the documents themselves.
This is quite
> surprising really, as what Xapian is supposed to be
storing is just the
> term IDs and the positions. Any ideas why its bigger
than the original
> size as opposed to 1/3 of the size lets say?
It's usually smaller, but it does depend on the nature of
the source
data. If you're indexing with positional information, I
wouldn't expect
to achieve 1/3 the size. It might be interesting to see the
output of
"ls -l" on the database directory.
Part of the database space will actually be unused (what
xapian-compact
does is to minimise this unused space).
The Btree implementation has two update modes - linear and
non-linear.
In linear mode, each block is filled as full as it will go
before the
next block is started, so blocks end up nearly 100% used
(apart for the
last block updated in a run of linear updates). In
non-linear mode,
blocks fill until an update won't fit in the remaining space
in the
block where it needs to go, at which point that block is
split into 2
each ~50% full, so on average non-linear update results in
~75% use of
blocks. This is actually how a B-tree achieves its
amortised update
cost.
If you're just appending documents with add_document(), then
for most
tables you'll only get linear update, and they end up pretty
compact.
The exception is the postlist table - the first flush will
be entirely
linear, and you may get some linear update during subsequent
flushes
but there's inevitably going to be non-linear update. If
the flush
threshold is larger, you'll increase the scope for (and size
of) spells
of linear updating.
There's certainly scope for improvement here though - for
example, we
currently store the document lengths in every posting list
entry, but it
would save quite a bit of space to store them just once, and
it's likely
to be faster overall when you consider the extra I/O and
caching
overhead.
I'm also working on a replacement Btree manager which
reduces the
amount of disk space which the "wood" of the Btree
takes. That's
particularly an issue for the positions table which contains
a lot of
items, many of which are very small.
> 2. Such an indexing process will start very fast, but
when reaching a
> database size of 2M or 3M documents, each flush will
take 2-4 minutes.
> This is already very slow for a flush of 10K small
documents. Changing
> flushes to every 1K document doesn't help.
For greater effiency, you'll want to make the flush
threshold larger,
not smaller.
> It seems the time a flush
> takes is not related so directly to the size of the
flush itself but
> does strongly relate to the size of the database
itself. How come? What
> happens during a flush?
Most of the work is updating the inverted file. This is
done by making
a pass in sorted order over the postlist table, applying
changes.
Because we make a pass over the table, this is much more
efficient per
document if you use a larger flush threshold (but not so
large that the
buffered changes are fighting with the OS cache of blocks
from the
Xapian database).
Ideally this would self-tune, but at present if you want to
maximise
throughput you need to set XAPIAN_FLUSH_THRESHOLD, and
experiment a bit
with the value you set it to.
> 3. If the time it takes to flush 10K documents when the
DB is 3M
> documents, is 2.5 minutes, does it actually mean that
when the database
> is 100M documents, each flush will take over an hour?
It's not a linear relationship.
> So.. thinking about how to maybe optimize this process,
the concept of
> using a "live" small database for updates and
then merge it into a big
> database comes to mind.
That's certainly an approach that can work for some
situations.
> However, there are two issues here:
> 1. Such a merge is slow. It will take quite a lot of
time (many hours)
> to compact/merge each such live database into a main
one. If this
> process has to be done hourly lets say, and the process
takes more than
> an hour, we are faced with a critical problem.
> 2. Such a merge process seems to take quite a lot of
resources from the
> system, limiting CPU, I/O and memory for the more
urgent task.. indexing
> and searching.
Note that if your system has a quiet time of day (as most
do) you can
perform the merge then, and still allow searching of new
documents
between merges by using Xapian's ability to search over
multiple
databases together.
> 3. It also means that we can never use more than 50% of
the diskspace on
> the server. In fact less than 40% or 45% to be safe.
This is because the
> compact process is merging the big database with the
small one, into a
> new database. So the new database will be bigger than
the original big
> one. So just because of this process, the server
diskspace can not be
> utilized effectively.
Disk is pretty cheap these days though (until you run out of
drive
bays!) But you're right that this is a drawback of this
approach,
although it can be useful to have that sort of amount of
spare space for
other reasons - for example you're a bit stuck otherwise if
you want to
change how you index and need to rebuild the index from
scratch while
still allowing searches of the existing database while the
rebuild
happens.
Cheers,
Olly
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  Israel |
2007-10-12 07:34:37 |
Olly Betts wrote:
> On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass
wrote:
>
>> 1. Size of such a database will actually be about
3K per document. That
>> is bigger than the text of the documents
themselves. This is quite
>> surprising really, as what Xapian is supposed to be
storing is just the
>> term IDs and the positions. Any ideas why its
bigger than the original
>> size as opposed to 1/3 of the size lets say?
>>
>
> It's usually smaller, but it does depend on the nature
of the source
> data. If you're indexing with positional information,
I wouldn't expect
> to achieve 1/3 the size. It might be interesting to
see the output of
> "ls -l" on the database directory.
>
-rw------- 1 apache apache 0 Oct 12 00:09
flintlock
-rw-r--r-- 1 apache apache 12 Oct 12 10:11 iamflint
-rw-r--r-- 1 apache apache 103204 Oct 12 10:02
position.baseA
-rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11
position.DB
-rw-r--r-- 1 apache apache 114206 Oct 12 10:02
postlist.baseA
-rw-r--r-- 1 apache apache 113034 Oct 12 09:48
postlist.baseB
-rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02
postlist.DB
-rw-r--r-- 1 apache apache 2095 Oct 12 10:02
record.baseA
-rw-r--r-- 1 apache apache 138035200 Oct 12 10:11
record.DB
-rw-r--r-- 1 apache apache 46899 Oct 12 10:02
termlist.baseA
-rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11
termlist.DB
-rw-r--r-- 1 apache apache 16588 Oct 12 10:02
value.baseA
-rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11 value.DB
This is for:
number of documents = 6499500
average document length = 202.329
> Part of the database space will actually be unused
(what xapian-compact
> does is to minimise this unused space).
>
When running compact we save about 25%, so our below
estimate is right
(about 75% of the space is used)
here is a sample compress (from an earlier stage of the
data):
postlist: Reduced by 56.0786% 2661936K (4746792K ->
2084856K)
record: Reduced by 2.94789% 2272K (77072K -> 74800K)
termlist: Reduced by 5.68267% 97560K (1716800K ->
1619240K)
position: Reduced by 2.5952% 104056K (4009552K ->
3905496K)
value: Reduced by 50.6533% 300840K (593920K -> 293080K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)
Note however, that such a compress takes a lot of time and
would doubtly
be able to run daily, as it would probably take more than a
day to run
when its big (i.e. when its 100M documents long).
> The Btree implementation has two update modes - linear
and non-linear.
> In linear mode, each block is filled as full as it will
go before the
> next block is started, so blocks end up nearly 100%
used (apart for the
> last block updated in a run of linear updates). In
non-linear mode,
> blocks fill until an update won't fit in the remaining
space in the
> block where it needs to go, at which point that block
is split into 2
> each ~50% full, so on average non-linear update results
in ~75% use of
> blocks. This is actually how a B-tree achieves its
amortised update
> cost.
>
> If you're just appending documents with add_document(),
then for most
> tables you'll only get linear update, and they end up
pretty compact.
> The exception is the postlist table - the first flush
will be entirely
> linear, and you may get some linear update during
subsequent flushes
> but there's inevitably going to be non-linear update.
If the flush
> threshold is larger, you'll increase the scope for (and
size of) spells
> of linear updating.
>
Documents are actually added with replace_document_by_term.
This is done
so that unique external document IDs are preserved.
Before, we used replace_document and used to specify docIDs,
but since
merge doesn't preserve the docIDs, we shifted to the above
stated approach.
> There's certainly scope for improvement here though -
for example, we
> currently store the document lengths in every posting
list entry, but it
> would save quite a bit of space to store them just
once, and it's likely
> to be faster overall when you consider the extra I/O
and caching
> overhead.
>
> I'm also working on a replacement Btree manager which
reduces the
> amount of disk space which the "wood" of the
Btree takes. That's
> particularly an issue for the positions table which
contains a lot of
> items, many of which are very small.
>
It would be interesting to test it and see the effect on
data like ours.
>
>> 2. Such an indexing process will start very fast,
but when reaching a
>> database size of 2M or 3M documents, each flush
will take 2-4 minutes.
>> This is already very slow for a flush of 10K small
documents. Changing
>> flushes to every 1K document doesn't help.
>>
>
> For greater effiency, you'll want to make the flush
threshold larger,
> not smaller.
>
>
>> It seems the time a flush
>> takes is not related so directly to the size of the
flush itself but
>> does strongly relate to the size of the database
itself. How come? What
>> happens during a flush?
>>
>
> Most of the work is updating the inverted file. This
is done by making
> a pass in sorted order over the postlist table,
applying changes.
> Because we make a pass over the table, this is much
more efficient per
> document if you use a larger flush threshold (but not
so large that the
> buffered changes are fighting with the OS cache of
blocks from the
> Xapian database).
>
> Ideally this would self-tune, but at present if you
want to maximise
> throughput you need to set XAPIAN_FLUSH_THRESHOLD, and
experiment a bit
> with the value you set it to.
>
We found that setting the auto threshold to a very high
number and then
deciding on the flush time in-code is somewhat smarter.
This is mainly so we can flush data not only on number of
documents
added, but also after a specific time (so that documents
don't wait too
long to show up in the index)
>> 3. If the time it takes to flush 10K documents when
the DB is 3M
>> documents, is 2.5 minutes, does it actually mean
that when the database
>> is 100M documents, each flush will take over an
hour?
>>
>
> It's not a linear relationship.
>
While not exactly linear, we do see a constant rise.. here
is a sample
of the last 2 million documents flushes. The database is not
7.5 million
documents.
Note that when the database was 3 million documents, flush
tool around
130 seconds. So, so far relatively linear. Certainly growing
all the time..
(note: at this point the database is 5.4M docs)
...
Flushing 100000 documents ...[Done: 294 sec]
Flushing 100000 documents ...[Done: 296 sec]
Flushing 100000 documents ...[Done: 241 sec]
Flushing 100000 documents ...[Done: 329 sec]
Flushing 100000 documents ...[Done: 354 sec]
Flushing 100000 documents ...[Done: 317 sec]
Flushing 100000 documents ...[Done: 322 sec]
Flushing 100000 documents ...[Done: 309 sec]
Flushing 100000 documents ...[Done: 344 sec]
Flushing 100000 documents ...[Done: 337 sec]
Flushing 100000 documents ...[Done: 335 sec]
Flushing 100000 documents ...[Done: 318 sec]
Flushing 100000 documents ...[Done: 346 sec]
Flushing 100000 documents ...[Done: 343 sec]
Flushing 100000 documents ...[Done: 341 sec]
Flushing 100000 documents ...[Done: 355 sec]
Flushing 100000 documents ...[Done: 364 sec]
Flushing 100000 documents ...[Done: 369 sec]
Flushing 100000 documents ...[Done: 384 sec]
Flushing 100000 documents ...[Done: 378 sec]
...
(note: at this point the database is 7.4M docs)
On the other hand, it seems that the actual indexing of
documents is
getting speedier recently, maybe due to the fact that the
more data in
the database, the less new terms are presented and needs to
be added to
the 'dictionary'?
>
>> So.. thinking about how to maybe optimize this
process, the concept of
>> using a "live" small database for updates
and then merge it into a big
>> database comes to mind.
>>
>
> That's certainly an approach that can work for some
situations.
>
>
>> However, there are two issues here:
>> 1. Such a merge is slow. It will take quite a lot
of time (many hours)
>> to compact/merge each such live database into a
main one. If this
>> process has to be done hourly lets say, and the
process takes more than
>> an hour, we are faced with a critical problem.
>> 2. Such a merge process seems to take quite a lot
of resources from the
>> system, limiting CPU, I/O and memory for the more
urgent task.. indexing
>> and searching.
>>
>
> Note that if your system has a quiet time of day (as
most do) you can
> perform the merge then, and still allow searching of
new documents
> between merges by using Xapian's ability to search over
multiple
> databases together.
>
>
>> 3. It also means that we can never use more than
50% of the diskspace on
>> the server. In fact less than 40% or 45% to be
safe. This is because the
>> compact process is merging the big database with
the small one, into a
>> new database. So the new database will be bigger
than the original big
>> one. So just because of this process, the server
diskspace can not be
>> utilized effectively.
>>
>
> Disk is pretty cheap these days though (until you run
out of drive
> bays!) But you're right that this is a drawback of
this approach,
> although it can be useful to have that sort of amount
of spare space for
> other reasons - for example you're a bit stuck
otherwise if you want to
> change how you index and need to rebuild the index from
scratch while
> still allowing searches of the existing database while
the rebuild
> happens.
>
> Cheers,
> Olly
>
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |

|
2007-10-13 08:54:22 |
If you read my previous posts I brought this issue several
months ago
about slow indexing using flint database. The indexing used
to run so
fast when using quartz database that I was indexing 20
million records
in 1 hour or less. However developers did not like the large
size of
database and without testing large amount of data we moved
to flint
database that from my experience is the worst that happen to
Xapian
recently.
We must make a priority what to do here.
- fast indexing
- fast searches
- anything else has low priority
I am placing highest priority on indexing than searching
size of index
is a low priority tomorrow you will be able to by 1 Terabyte
hard
drive for good lunch in Ritz Carlton. Why to focus on small
index when
we want to have fastest indexing. However developers tend to
complain
about the size of index but never complaint that Xapian was
indexing
approx 10 times faster than Lucene, MySQL or MSSQL. We got
ridd of the
fastest indexing I have ever seen in my life. Why?
Developers
complaint about the size of Quartz database that we use to
have as
default.
You cannot make happy everyone therefore we must put
priority. What is
the highest priority?
I have installed new Xapian 1.0.3 and my search engine has
been
broken, tried to install back 1.0.2 search engine still
broken. When I
watch index each user is trying to modify iamflit file. Why?
Because
someone have complaint that cannot search while indexing or
something
similar. Now the whole release is broken.
--
Cheers
Kevin Duraj
http://MyHealthcare.com
Los Angeles, California
On 10/12/07, Ron Kass <ron pidgintech.com> wrote:
> Olly Betts wrote:
>
> > On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass
wrote:
> >
> >> 1. Size of such a database will actually be
about 3K per document. That
> >> is bigger than the text of the documents
themselves. This is quite
> >> surprising really, as what Xapian is supposed
to be storing is just the
> >> term IDs and the positions. Any ideas why its
bigger than the original
> >> size as opposed to 1/3 of the size lets say?
> >>
> >
> > It's usually smaller, but it does depend on the
nature of the source
> > data. If you're indexing with positional
information, I wouldn't expect
> > to achieve 1/3 the size. It might be interesting
to see the output of
> > "ls -l" on the database directory.
> >
> -rw------- 1 apache apache 0 Oct 12 00:09
flintlock
> -rw-r--r-- 1 apache apache 12 Oct 12 10:11
iamflint
> -rw-r--r-- 1 apache apache 103204 Oct 12 10:02
position.baseA
> -rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11
position.DB
> -rw-r--r-- 1 apache apache 114206 Oct 12 10:02
postlist.baseA
> -rw-r--r-- 1 apache apache 113034 Oct 12 09:48
postlist.baseB
> -rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02
postlist.DB
> -rw-r--r-- 1 apache apache 2095 Oct 12 10:02
record.baseA
> -rw-r--r-- 1 apache apache 138035200 Oct 12 10:11
record.DB
> -rw-r--r-- 1 apache apache 46899 Oct 12 10:02
termlist.baseA
> -rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11
termlist.DB
> -rw-r--r-- 1 apache apache 16588 Oct 12 10:02
value.baseA
> -rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11
value.DB
>
> This is for:
> number of documents = 6499500
> average document length = 202.329
> > Part of the database space will actually be unused
(what xapian-compact
> > does is to minimise this unused space).
> >
> When running compact we save about 25%, so our below
estimate is right
> (about 75% of the space is used)
> here is a sample compress (from an earlier stage of the
data):
> postlist: Reduced by 56.0786% 2661936K (4746792K ->
2084856K)
> record: Reduced by 2.94789% 2272K (77072K ->
74800K)
> termlist: Reduced by 5.68267% 97560K (1716800K ->
1619240K)
> position: Reduced by 2.5952% 104056K (4009552K ->
3905496K)
> value: Reduced by 50.6533% 300840K (593920K ->
293080K)
> spelling: Size unchanged (0K)
> synonym: Size unchanged (0K)
>
> Note however, that such a compress takes a lot of time
and would doubtly
> be able to run daily, as it would probably take more
than a day to run
> when its big (i.e. when its 100M documents long).
> > The Btree implementation has two update modes -
linear and non-linear.
> > In linear mode, each block is filled as full as it
will go before the
> > next block is started, so blocks end up nearly
100% used (apart for the
> > last block updated in a run of linear updates).
In non-linear mode,
> > blocks fill until an update won't fit in the
remaining space in the
> > block where it needs to go, at which point that
block is split into 2
> > each ~50% full, so on average non-linear update
results in ~75% use of
> > blocks. This is actually how a B-tree achieves
its amortised update
> > cost.
> >
> > If you're just appending documents with
add_document(), then for most
> > tables you'll only get linear update, and they end
up pretty compact.
> > The exception is the postlist table - the first
flush will be entirely
> > linear, and you may get some linear update during
subsequent flushes
> > but there's inevitably going to be non-linear
update. If the flush
> > threshold is larger, you'll increase the scope for
(and size of) spells
> > of linear updating.
> >
> Documents are actually added with
replace_document_by_term. This is done
> so that unique external document IDs are preserved.
> Before, we used replace_document and used to specify
docIDs, but since
> merge doesn't preserve the docIDs, we shifted to the
above stated approach.
> > There's certainly scope for improvement here
though - for example, we
> > currently store the document lengths in every
posting list entry, but it
> > would save quite a bit of space to store them just
once, and it's likely
> > to be faster overall when you consider the extra
I/O and caching
> > overhead.
> >
> > I'm also working on a replacement Btree manager
which reduces the
> > amount of disk space which the "wood" of
the Btree takes. That's
> > particularly an issue for the positions table
which contains a lot of
> > items, many of which are very small.
> >
> It would be interesting to test it and see the effect
on data like ours.
> >
> >> 2. Such an indexing process will start very
fast, but when reaching a
> >> database size of 2M or 3M documents, each
flush will take 2-4 minutes.
> >> This is already very slow for a flush of 10K
small documents. Changing
> >> flushes to every 1K document doesn't help.
> >>
> >
> > For greater effiency, you'll want to make the
flush threshold larger,
> > not smaller.
> >
> >
> >> It seems the time a flush
> >> takes is not related so directly to the size
of the flush itself but
> >> does strongly relate to the size of the
database itself. How come? What
> >> happens during a flush?
> >>
> >
> > Most of the work is updating the inverted file.
This is done by making
> > a pass in sorted order over the postlist table,
applying changes.
> > Because we make a pass over the table, this is
much more efficient per
> > document if you use a larger flush threshold (but
not so large that the
> > buffered changes are fighting with the OS cache of
blocks from the
> > Xapian database).
> >
> > Ideally this would self-tune, but at present if
you want to maximise
> > throughput you need to set XAPIAN_FLUSH_THRESHOLD,
and experiment a bit
> > with the value you set it to.
> >
> We found that setting the auto threshold to a very high
number and then
> deciding on the flush time in-code is somewhat
smarter.
> This is mainly so we can flush data not only on number
of documents
> added, but also after a specific time (so that
documents don't wait too
> long to show up in the index)
> >> 3. If the time it takes to flush 10K documents
when the DB is 3M
> >> documents, is 2.5 minutes, does it actually
mean that when the database
> >> is 100M documents, each flush will take over
an hour?
> >>
> >
> > It's not a linear relationship.
> >
> While not exactly linear, we do see a constant rise..
here is a sample
> of the last 2 million documents flushes. The database
is not 7.5 million
> documents.
> Note that when the database was 3 million documents,
flush tool around
> 130 seconds. So, so far relatively linear. Certainly
growing all the time..
>
> (note: at this point the database is 5.4M docs)
>
> ...
> Flushing 100000 documents ...[Done: 294 sec]
> Flushing 100000 documents ...[Done: 296 sec]
> Flushing 100000 documents ...[Done: 241 sec]
> Flushing 100000 documents ...[Done: 329 sec]
> Flushing 100000 documents ...[Done: 354 sec]
> Flushing 100000 documents ...[Done: 317 sec]
> Flushing 100000 documents ...[Done: 322 sec]
> Flushing 100000 documents ...[Done: 309 sec]
> Flushing 100000 documents ...[Done: 344 sec]
> Flushing 100000 documents ...[Done: 337 sec]
> Flushing 100000 documents ...[Done: 335 sec]
> Flushing 100000 documents ...[Done: 318 sec]
> Flushing 100000 documents ...[Done: 346 sec]
> Flushing 100000 documents ...[Done: 343 sec]
> Flushing 100000 documents ...[Done: 341 sec]
> Flushing 100000 documents ...[Done: 355 sec]
> Flushing 100000 documents ...[Done: 364 sec]
> Flushing 100000 documents ...[Done: 369 sec]
> Flushing 100000 documents ...[Done: 384 sec]
> Flushing 100000 documents ...[Done: 378 sec]
> ...
>
> (note: at this point the database is 7.4M docs)
>
> On the other hand, it seems that the actual indexing of
documents is
> getting speedier recently, maybe due to the fact that
the more data in
> the database, the less new terms are presented and
needs to be added to
> the 'dictionary'?
>
> >
> >> So.. thinking about how to maybe optimize this
process, the concept of
> >> using a "live" small database for
updates and then merge it into a big
> >> database comes to mind.
> >>
> >
> > That's certainly an approach that can work for
some situations.
> >
> >
> >> However, there are two issues here:
> >> 1. Such a merge is slow. It will take quite a
lot of time (many hours)
> >> to compact/merge each such live database into
a main one. If this
> >> process has to be done hourly lets say, and
the process takes more than
> >> an hour, we are faced with a critical
problem.
> >> 2. Such a merge process seems to take quite a
lot of resources from the
> >> system, limiting CPU, I/O and memory for the
more urgent task.. indexing
> >> and searching.
> >>
> >
> > Note that if your system has a quiet time of day
(as most do) you can
> > perform the merge then, and still allow searching
of new documents
> > between merges by using Xapian's ability to search
over multiple
> > databases together.
> >
> >
> >> 3. It also means that we can never use more
than 50% of the diskspace on
> >> the server. In fact less than 40% or 45% to be
safe. This is because the
> >> compact process is merging the big database
with the small one, into a
> >> new database. So the new database will be
bigger than the original big
> >> one. So just because of this process, the
server diskspace can not be
> >> utilized effectively.
> >>
> >
> > Disk is pretty cheap these days though (until you
run out of drive
> > bays!) But you're right that this is a drawback
of this approach,
> > although it can be useful to have that sort of
amount of spare space for
> > other reasons - for example you're a bit stuck
otherwise if you want to
> > change how you index and need to rebuild the index
from scratch while
> > still allowing searches of the existing database
while the rebuild
> > happens.
> >
> > Cheers,
> > Olly
> >
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
>
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  United Kingdom |
2007-10-13 22:45:07 |
On Sat, Oct 13, 2007 at 06:54:22AM -0700, Kevin Duraj
wrote:
> If you read my previous posts I brought this issue
several months ago
> about slow indexing using flint database. The indexing
used to run so
> fast when using quartz database that I was indexing 20
million records
> in 1 hour or less. However developers did not like the
large size of
> database and without testing large amount of data we
moved to flint
> database that from my experience is the worst that
happen to Xapian
> recently.
Kevin, please stop spreading this misinformation.
Both the developers and several users have tested flint with
large
amounts of data, and have found it to perform at least as
well as
quartz, and generally much better.
To my knowledge, there's one exception. One person who has
reported
that flint doesn't perform as well for them as quartz does.
And that
person is you! Clearly something is different about what
you are
doing, so the way forward is to identify what that is, and
work out
how we can make flint perform better that quartz for
everyone.
The problem is that each time you report a problem, you only
complain
in very general terms. The lack of detail, coupled with the
fact that
we simply don't see this ourselves and don't have access to
the data you
are indexing (nor indeed the machine you are indexing on),
means we
simply aren't able to reproduce what you report.
We'd love to help you resolve your issues, but you need to
help us to
do this. But if we ask you for more information, or suggest
things you
might try in order to identify what is happening, you either
don't
respond, or you reply doesn't contain any useful extra
information.
As an example, here I suggest you could try oprofile to find
where
time is spent, and also suggest you could try tuning the
compression
threshold to see what the optimum value is for your
situation:
http://thread.gmane.org/gmane.comp.searc
h.xapian.general/4602/focus=4606
However, your reply doesn't address either of these points:
http://thread.gmane.org/gmane.comp.searc
h.xapian.general/4602/focus=4642
And this isn't the only case like this, just the first I
found in
the archives.
I actually thought you'd solved your problem and it turned
out to be
that you weren't exporting XAPIAN_FLUSH_THRESHOLD correctly,
but that
was only a guess because you didn't reply when I explicitly
asked about
it.
> We must make a priority what to do here.
> - fast indexing
> - fast searches
> - anything else has low priority
These are your priorities. Other people have different
priorities.
For example, at the start of this thread Ron was suprised by
the size of
the index, so it would seem that he's worried about index
size to at
least some extent. Simply stating boldly that your
priorities must be
ours won't change what other people's priorities are.
Overall, I think fast indexing and searching are important,
but so is
index size, quality of results, useful features, and ease of
use (both
to end users, and of the API). This list probably isn't
exhaustive.
We need to consider everyone's needs, or else you'd be the
only Xapian
user.
> I am placing highest priority on indexing than
searching size of index
> is a low priority tomorrow you will be able to by 1
Terabyte hard
> drive for good lunch in Ritz Carlton. Why to focus on
small index when
> we want to have fastest indexing. However developers
tend to complain
> about the size of index but never complaint that Xapian
was indexing
> approx 10 times faster than Lucene, MySQL or MSSQL. We
got ridd of the
> fastest indexing I have ever seen in my life. Why?
Developers
> complaint about the size of Quartz database that we use
to have as
> default.
Also factor in the cost of backup for that 1TB, and the
extra RAM you'll
need to cache enough of it for searches to run fast. Also,
a disk that
size is likely to run hotter and draw more power (as will
the extra
RAM), so your may need a beefier power supply and better
fans. The
extra electricity will tend to increase your hosting costs
too. And
being cutting edge technology, a 1TB drive will probably be
more liable
to failure.
And feel free to buy me lunch at the Ritz Carlton sometime!
> You cannot make happy everyone therefore we must put
priority. What is
> the highest priority?
Speed of indexing is important, as is speed of searching.
To most
people, the size of the index is an issue too (but I know it
isn't to
you). But as I've tried to explain before, reducing the
size of the
index on disk means reduced I/O and reduced VM pressure, so
a smaller
index will often be faster. Hurrah! Sometimes we can have
the best of
all worlds.
In both my tests and those of other people, flint indexes at
least as
fast as quartz, runs searches faster than quartz, and
produces databases
significantly smaller than quartz.
And yes, we know it doesn't seem to for you. But if you
want to have a
hope of addressing this situation, you need to actually help
me
understand *why* this is different for you. Index exactly
the same data
with a recent version of Xapian, once with quartz and once
with flint,
so we have some scientifically valid numbers to go on. Run
it again
with profiling so I can see where the extra time is spent.
> I have installed new Xapian 1.0.3 and my search engine
has been
> broken, tried to install back 1.0.2 search engine still
broken. When I
> watch index each user is trying to modify iamflit file.
Why? Because
> someone have complaint that cannot search while
indexing or something
> similar. Now the whole release is broken.
This is a bug in 1.0.3. I'm sorry that this bug slipped
through, and
I've produced a patch to fix it which someone else has
confirmed fixes
identical symptoms, and which I pointed you to several days
ago.
Xapian has supported searching during indexing for many
years, and it
has nothing to do with this bug. The change which caused
the bug was
actually adding support for user metadata.
Cheers,
Olly
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  United Kingdom |
2007-10-13 23:27:24 |
On Fri, Oct 12, 2007 at 02:34:37PM +0200, Ron Kass wrote:
> Olly Betts wrote:
> >It might be interesting to see the output of
> >"ls -l" on the database directory.
> >
> -rw------- 1 apache apache 0 Oct 12 00:09
flintlock
> -rw-r--r-- 1 apache apache 12 Oct 12 10:11
iamflint
> -rw-r--r-- 1 apache apache 103204 Oct 12 10:02
position.baseA
> -rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11
position.DB
> -rw-r--r-- 1 apache apache 114206 Oct 12 10:02
postlist.baseA
> -rw-r--r-- 1 apache apache 113034 Oct 12 09:48
postlist.baseB
> -rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02
postlist.DB
> -rw-r--r-- 1 apache apache 2095 Oct 12 10:02
record.baseA
> -rw-r--r-- 1 apache apache 138035200 Oct 12 10:11
record.DB
> -rw-r--r-- 1 apache apache 46899 Oct 12 10:02
termlist.baseA
> -rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11
termlist.DB
> -rw-r--r-- 1 apache apache 16588 Oct 12 10:02
value.baseA
> -rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11
value.DB
>
> This is for:
> number of documents = 6499500
> average document length = 202.329
Nothing very suprising there. The postlist.DB seems larger
than I'd
expect, but that may be down to hebrew terms and UTF-8
encoding. Also
from what you report below it compacts a lot (assuming the
above are
uncompacted numbers).
> here is a sample compress (from an earlier stage of the
data):
> postlist: Reduced by 56.0786% 2661936K (4746792K ->
2084856K)
> record: Reduced by 2.94789% 2272K (77072K ->
74800K)
> termlist: Reduced by 5.68267% 97560K (1716800K ->
1619240K)
> position: Reduced by 2.5952% 104056K (4009552K ->
3905496K)
> value: Reduced by 50.6533% 300840K (593920K ->
293080K)
I'm suprised the value table has much slack in it, but it's
comparatively small anyway.
> Note however, that such a compress takes a lot of time
and would doubtly
> be able to run daily, as it would probably take more
than a day to run
> when its big (i.e. when its 100M documents long).
I'd be suprised if it took that long. The gmane database of
~50M
documents is 116GB and takes ~2h20 to compact. That machine
has 3GB
of RAM and an Athlon 64 3000+, so it's not ridiculously
beefy.
> Documents are actually added with
replace_document_by_term. This is done
> so that unique external document IDs are preserved.
Are there document duplicates during the initial build of
the database?
If not, you may find it faster to have a special "build
from scratch"
mode which just uses add_document() and then switches to
replace_document_by_term() once things are running
incrementally. I've
not profiled this though - it's possible the overhead of
checking is
too small to matter.
> Before, we used replace_document and used to specify
docIDs, but since
> merge doesn't preserve the docIDs, we shifted to the
above stated approach.
Specifying docids out of sequence will also be slower. The
case of
sequentially appending new documents (which you'll be mostly
doing even
with replace_document_by_term() I imagine) is well
optimised, and likely
to improve further in the future.
> >I'm also working on a replacement Btree manager
which reduces the
> >amount of disk space which the "wood" of
the Btree takes. That's
> >particularly an issue for the positions table which
contains a lot of
> >items, many of which are very small.
>
> It would be interesting to test it and see the effect
on data like ours.
Unfortunately it's not currently a drop-in replacement.
I've been
developing it as a standalone bit of code so far so I can
easily test it
as I go along. All the basic stuff works, but it doesn't
yet support
splitting up large tags or versioning the Btree.
> We found that setting the auto threshold to a very high
number and then
> deciding on the flush time in-code is somewhat
smarter.
> This is mainly so we can flush data not only on number
of documents
> added, but also after a specific time (so that
documents don't wait too
> long to show up in the index)
Note that you can just set XAPIAN_FLUSH_THRESHOLD, and still
call
flush() explicitly after a specific time. The explicit
flush() will
reset the counter for the next automatic one.
> >>3. If the time it takes to flush 10K documents
when the DB is 3M
> >>documents, is 2.5 minutes, does it actually
mean that when the database
> >>is 100M documents, each flush will take over an
hour?
> >
> >It's not a linear relationship.
>
> While not exactly linear, we do see a constant rise..
It'll be slower with more data, but in general it won't be
twice as slow
with twice as much data, so you can't simply extrapolate
linearly.
You'll also probably see some cliffs and plateaus in the
graph as the
working set stops fitting in the CPU cache, the disk cache,
the OS VM
cache, etc.
> On the other hand, it seems that the actual indexing of
documents is
> getting speedier recently, maybe due to the fact that
the more data in
> the database, the less new terms are presented and
needs to be added to
> the 'dictionary'?
It may be slightly more expensive to add a term for the
first time,
though I'd be suprised if it was noticable like this. I'm
not sure I
can think of a better explanation though.
Cheers,
Olly
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  United Kingdom |
2007-10-14 07:22:37 |
On Sun, Oct 14, 2007 at 04:45:07AM +0100, Olly Betts wrote:
[Impact of disk size of db]
> Also factor in the cost of backup for that 1TB, and the
extra RAM you'll
> need to cache enough of it for searches to run fast.
Also, a disk that
> size is likely to run hotter and draw more power (as
will the extra
> RAM), so your may need a beefier power supply and
better fans. The
> extra electricity will tend to increase your hosting
costs too. And
> being cutting edge technology, a 1TB drive will
probably be more liable
> to failure.
The electricity is a bigger thing than anyone realised just
a year or
so ago. A recent survey (I think done by Sun) found that
electricity
costs were higher than hardware costs in some enterprises.
Many major
data centres with good peering are in places with poor
electricity
supply compared to demand: it can cost real money to do high
density
computing with more than minimal power requirements.
Everything Olly
says would raise warning signals on a data centre manager
[1]. And
cooling costs electricity as well: vicious circle :-(
[1] I think of them as human-shaped robots with
blinkenlights. But
then I rarely meet them face to face
J
--
/-----------------------------------------------------------
---------------
James Aylett
xapian.org
james tartarus.org
uncertaintydivision.org
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  Israel |
2007-10-14 08:12:19 |
>
>> We must make a priority what to do here.
>> - fast indexing
>> - fast searches
>> - anything else has low priority
>>
>
> These are your priorities. Other people have different
priorities.
>
> For example, at the start of this thread Ron was
suprised by the size of
> the index, so it would seem that he's worried about
index size to at
> least some extent. Simply stating boldly that your
priorities must be
> ours won't change what other people's priorities are.
>
> Overall, I think fast indexing and searching are
important, but so is
> index size, quality of results, useful features, and
ease of use (both
> to end users, and of the API). This list probably
isn't exhaustive.
>
> We need to consider everyone's needs, or else you'd be
the only Xapian
> user.
>
>
>> I am placing highest priority on indexing than
searching size of index
>> is a low priority tomorrow you will be able to by 1
Terabyte hard
>> drive for good lunch in Ritz Carlton. Why to focus
on small index when
>> we want to have fastest indexing. However
developers tend to complain
>> about the size of index but never complaint that
Xapian was indexing
>> approx 10 times faster than Lucene, MySQL or MSSQL.
We got ridd of the
>> fastest indexing I have ever seen in my life. Why?
Developers
>> complaint about the size of Quartz database that we
use to have as
>> default.
>>
>
> Also factor in the cost of backup for that 1TB, and the
extra RAM you'll
> need to cache enough of it for searches to run fast.
Also, a disk that
> size is likely to run hotter and draw more power (as
will the extra
> RAM), so your may need a beefier power supply and
better fans. The
> extra electricity will tend to increase your hosting
costs too. And
> being cutting edge technology, a 1TB drive will
probably be more liable
> to failure.
>
>
Hi Olly And Kevin
It would actually be interesting to run a poll of a sort to
see how
people prioritize the importance of different features in
Xapian.
I would say the most important for us is speed of search. It
might have
implication on the size of index, or might not but thats the
most important.
Third is the size of index, indeed it is important to us.
Mainly because
we are interesting in testing Xapian on a huge dataset. And
by huge I
mean immensely, mind bogglingly gigantic. I guess if we
could anyway fit
the entire index on one big raid, we wouldn't care if its
3TB or 5TB.
But when you have to take it few leaps ahead, "size
does matter".
Third will be features actually. Which we feed Xapian covers
quite
extensively if taking out speed factor. But maybe others
feel there are
missing fundamental ones, so would be interesting to ask.
Only then speed of indexing. But let me explain this one..
there are two
speeds here.. first-time indexing speed and ongoing indexing
speed. Both
actually are less critical (unless we are talking about a
huge
difference). If it took twice as much to do the first
indexing, that
would be acceptable. I would say the same about ongoing
indexing, BUT
the most important factor here is that the indexing will be
faster than
incoming data AND that it wont take new data to appear in
the index much
longer than people expect it to. For example, if it takes a
day (after
initial crawling) to show a page google's index, thats no
problem (even
a week). But if it took even 8 hours, to show a news piece
in google's
news search, that would be way too much.
Our main concern right now is search speed. Reading some
exchanges here
about people who deal with >minute searches, that would
be an issue.
Now, of course you can throw more RAM on the thing and get
better
results, but thats not really the right criteria for speed.
What is
important here is to get better speed than alternative with
the SAME
hardware. This is the winning card here.
Now, for this we have two ideas we are happy to share and
hear feedback on..
1. Auto warm-up: In many discussions it is stated that a
warmed up
database is faster. Of course it is. One nice thing would be
to have a
warmup (or auto-warmup) mechanism that will automatically
load
critical/popular parts of the BTREEs into memory even before
first
searches are executed. It is better to let the machine do
that than
letting the first users do that and see slow searches for a
while.
2. Graceful Time-out on searches: To allow a search to run a
MAXIMUM of
X seconds, and then return whatever results it has EVEN if
not
complete/perfect. In many applications, it is better to show
a partial
result after 5 seconds than a complete/accurate one after
50. Users
don't wait 50 seconds for a result. (most don't wait even 5,
so maybe a
timeout of 0.5 should be used by some. Anyway, it should a
parameter per
query).
3. I know that Xapian's model is to trust the OS's cache.
But, was this
assumption tested to prove that it really is logical? I am
sure it saves
effort, but I am just wondering at what expense. If indeed
the speed
difference is marginal, then its a smart choice. But its a
question still.
Any feedback on those 3 ideas/issues will be greatly
appreciated.
(Last note, maybe we will test Quartz in the future, just so
we have
something (data) to contribute to this 'argument'. If we do,
we will
share our test results here.)
Ron
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  United Kingdom |
2007-10-17 18:37:07 |
On Sun, Oct 14, 2007 at 03:12:19PM +0200, Ron Kass wrote:
> Our main concern right now is search speed. Reading
some exchanges here
> about people who deal with >minute searches, that
would be an issue.
I guess you must mean the "xapian performance"
thread from last
November:
http://thread.gmane.org/gmane.comp.search.xapian.ge
neral/3542
The result of that thread was that we now handle
OP_PHRASE/OP_NEAR with
three or more operands more efficiently, we now pack and
unpack
positional data much more efficiently (138 times faster,
though that's
a microbenchmark).
Searches taking more than a minute is truly exceptional, and
the likely
explanation is either that the hardware is being
overwhelmed, or that
there's a bottleneck somewhere in the code.
If you think you've found a bottleneck in the code and
you're on Linux,
running it under oprofile and producing a callgraph will
usually show
where the problem is, and generally it's easy to make
dramatic
improvements in such cases.
Any decent profiling tool will give some idea, but oprofile
looks at the
whole system, so we see time spent in the kernel, and doing
I/O. Also,
it has a low overhead.
I ought to write up an "oprofile how to" for the
wiki, but meanwhile I
believe a 2.6 kernel is needed for best results - just run
your indexer
under oprofile with the callgraph enabled (call opcontrol
with
--callgraph=12 or some suitable stack depth, then run
opreport with
--callgraph). The reports are big, so please don't post
them to the
list as attachments. The best thing to do is open a bug
report so they
don't get lost.
Incidentally, the next release will speed up three or more
operand
OP_AND, OP_PHRASE, and OP_NEAR. Tests on wikipedia data
with real
queries shows a 16-17% reduction in time for OP_AND.
Searches from
tweakers.net's "slow search" log were sped up by
about 8% on average.
Also, queries with a ValueRangeProcessor filter are now
about 3 times
faster. And I'm currently testing another patch which
should improve
OP_FILTER performance when used on an OP_AND subexpression.
> Now, of course you can throw more RAM on the thing and
get better
> results, but thats not really the right criteria for
speed. What is
> important here is to get better speed than alternative
with the SAME
> hardware. This is the winning card here.
There is an absolute criterion too - whether the performance
from any
alternative is good enough from the hardware that you can
afford for
a particular application.
> Now, for this we have two ideas we are happy to share
and hear feedback on..
> 1. Auto warm-up: In many discussions it is stated that
a warmed up
> database is faster. Of course it is. One nice thing
would be to have a
> warmup (or auto-warmup) mechanism that will
automatically load
> critical/popular parts of the BTREEs into memory even
before first
> searches are executed. It is better to let the machine
do that than
> letting the first users do that and see slow searches
for a while.
It's pretty simple to do this anyway - just pick out some of
the top
queries from your search logs and run them.
I'm not sure we could do better than that. Having to track
the popular
parts of the Btrees would incur overhead on every search,
and require
searches to be able to write to the database directory. The
other
approach would be to just read the skeleton of each Btree,
but for a
large database that may not all be cacheable, so we'd end up
flushing
the start of the "warm up" read blocks from the
cache. To make it work
I think you'd need to look at how much cache space there was
and
> 2. Graceful Time-out on searches: To allow a search to
run a MAXIMUM of
> X seconds, and then return whatever results it has EVEN
if not
> complete/perfect. In many applications, it is better to
show a partial
> result after 5 seconds than a complete/accurate one
after 50. Users
> don't wait 50 seconds for a result. (most don't wait
even 5, so maybe a
> timeout of 0.5 should be used by some. Anyway, it
should a parameter per
> query).
The problem is that this messes up paging through results,
because the
next query will probably time out differently (or not at
all, given
we'll probably have most of the blocks we need cached).
> 3. I know that Xapian's model is to trust the OS's
cache. But, was this
> assumption tested to prove that it really is logical? I
am sure it saves
> effort, but I am just wondering at what expense. If
indeed the speed
> difference is marginal, then its a smart choice. But
its a question still.
All the evidence I've seen suggests it's actually faster
than trying to
manage caching yourself (and under Linux at least, you may
find the VM
system sometimes actually pages little used parts of your
cache in
favour of its own!)
I've not tried to implement such a cache for Xapian though,
so I can
only go on indirect evidence (e.g. comparing with Xapian's
proprietary
precursor, which did have a block cache).
It's not just saved effort either - if you maintain your own
cache, you
need something persistent to manage it, and some way to get
blocks from
the cache. The lack of a "xapian daemon" makes
Xapian easier to use in
many situations.
But if you (or anyone else) thinks a block cache would be
faster, feel
free to hack up a prototype.
Cheers,
Olly
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
| Re: Incremental indexing limitations |
  Israel |
2007-10-18 01:46:44 |
>> 2. Graceful Time-out on searches: To allow a search
to run a MAXIMUM of
>> X seconds, and then return whatever results it has
EVEN if not
>> complete/perfect. In many applications, it is
better to show a partial
>> result after 5 seconds than a complete/accurate one
after 50. Users
>> don't wait 50 seconds for a result. (most don't
wait even 5, so maybe a
>> timeout of 0.5 should be used by some. Anyway, it
should a parameter per
>> query).
>>
>
> The problem is that this messes up paging through
results, because the
> next query will probably time out differently (or not
at all, given
> we'll probably have most of the blocks we need
cached).
Since almost 100% of surfers to a search page will press the
refresh
button if a search took more than 15 seconds, even if timing
out a
search after 10 or 15 seconds messes up results, I have a
strong feeling
that its better to show some results, even inaccurate ones,
than no
results.
Your thoughts?
Ron
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
a>
|
|
|
|