Thread: PHP Fatal error while indexing Wikipedia




PHP Fatal error while indexing Wikipedia
user name
2007-12-31 18:37:09
Hi,

I'm indexing a Wikipedia dump as a way of getting to grips with how
Xapian works but I'm hitting a problem. Indexing fails with the error
pasted below. Although I haven't managed to nail down exactly which
Wikipedia article is causing the error, I am pretty sure it is the
same one each time. I will try to find out exactly which one it is
causing the problem but I was wondering if anyone has come across this
problem before. The only thing I can think it may be is a dodgy
character, is this something which might make Xapian stumble?

PHP Fatal error:  No matching function for overloaded
'TermGenerator_index_text' in /usr/local/lib/php/xapian.php on line
1482

Thanks
Rob

_______________________________________________
Xapian-discuss mailing list
Xapian-discusslists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss

Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-01 04:35:17
On Tue, Jan 01, 2008 at 12:37:09AM +0000, Robert Young wrote:
> I'm indexing a Wikipedia dump as a way of getting to grips with how
> Xapian works but I'm hitting a problem. Indexing fails with the error
> pasted below. Although I haven't managed to nail down exactly which
> Wikipedia article is causing the error, I am pretty sure it is the
> same one each time. I will try to find out exactly which one it is
> causing the problem but I was wondering if anyone has come across this
> problem before. The only thing I can think it may be is a dodgy
> character, is this something which might make Xapian stumble?

No, Xapian should handle arbitrary data.  The UTF-8 parsing copes with
broken UTF-8 too.

> PHP Fatal error:  No matching function for overloaded
> 'TermGenerator_index_text' in /usr/local/lib/php/xapian.php on line
> 1482

Which xapian-bindings version is this?  Line 1482 doesn't seem to match
up with my tree.

It sounds like either you're passing in parameters with the wrong type,
or it's a bug in the wrappers SWIG is generating, but it's hard to know
which without seeing your indexer code.  The generated code looks OK to
me at least.

My best guess is that maybe you are passing a string for the weight
parameter in some case - as the documentation says:

http://www.xapian.org/docs/bindings/php/

    One thing to be aware of though is that SWIG implements dispatch
    functions for overloaded methods based on the types of the
    parameters, so you can't always pass in a string containing a number
    (e.g. "42") where a number is expected as you usually can in PHP.
    You need to explicitly convert to the type required - e.g. use (int)
    to convert to an integer, (string) to string, (double) to a floating
    point number.
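
A minimal sketch of the explicit conversion described in the quoted
documentation. normalise_index_args() is a hypothetical helper, not part
of the Xapian bindings; it only coerces loosely typed values (for example
a field read from a dump, which may be null or a numeric string) into the
string and integer that index_text() is assumed to expect here.

```php
<?php
// Hypothetical helper (NOT part of xapian-bindings): coerce loosely typed
// input into a string for the text and an integer for the weight.
function normalise_index_args($text, $weight = 1)
{
    // null (e.g. an empty field from the dump) becomes the empty string.
    $text = (string) $text;
    // A numeric string such as "42" becomes the integer 42.
    $weight = (int) $weight;
    return array($text, $weight);
}

list($text, $weight) = normalise_index_args(null, "2");
var_dump($text, $weight);

// With a real TermGenerator object (assumed to be in $indexer) the values
// could then be used like this, skipping empty text altogether:
//   if ($text !== '') { $indexer->index_text($text, $weight); }
```

Returning the converted values instead of calling the bindings directly
keeps the sketch runnable without Xapian installed.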

Cheers,
    Olly


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-01 17:50:35
> Which xapian-bindings version is this?  Line 1482 doesn't seem to match
> up with my tree.
1.0.4
The function is
  function index_text($text,$weight=1,$prefix=null) {
    switch (func_num_args()) {
    case 1: case 2: TermGenerator_index_text($this->_cPtr,$text,$weight); break; // <-- line 1482
    default: TermGenerator_index_text($this->_cPtr,$text,$weight,$prefix);
    }
  }

> It sounds like either you're passing in parameters with the wrong type,
> or it's a bug in the wrappers SWIG is generating, but it's hard to know
> which without seeing your indexer code.  The generated code looks OK to
> me at least.
>
> My best guess is that maybe you are passing a string for the weight
> parameter in some case - as the documentation says:
>
> http://www.xapian.org/docs/bindings/php/
As you can see from the code above, it doesn't look like it can be
getting anything other than 1 or 2 for the weight. Also, about 150,000
documents go in without a problem before this one in exactly the same way.

>     One thing to be aware of though is that SWIG implements dispatch
>     functions for overloaded methods based on the types of the
>     parameters, so you can't always pass in a string containing a number
>     (e.g. "42") where a number is expected as you usually can in PHP.
>     You need to explicitly convert to the type required - e.g. use (int)
>     to convert to an integer, (string) to string, (double) to a floating
>     point number.
Ahh, that may be something. If the field value is empty it may be null
rather than an empty string. I guess I shouldn't be indexing it either
way really. I won't get another chance to run this until tomorrow but
I'll let you know how it goes. Thanks for the pointer.

Cheers
Rob


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-01 18:22:21
On Tue, Jan 01, 2008 at 11:50:35PM +0000, Robert Young wrote:
> > Which xapian-bindings version is this?  Line 1482 doesn't seem to match
> > up with my tree.
> 1.0.4
> The function is
>   function index_text($text,$weight=1,$prefix=null) {
>     switch (func_num_args()) {
>     case 1: case 2: TermGenerator_index_text($this->_cPtr,$text,$weight); break; // <-- line 1482
>     default: TermGenerator_index_text($this->_cPtr,$text,$weight,$prefix);
>     }
>   }
>
> > It sounds like either you're passing in parameters with the wrong type,
> > or it's a bug in the wrappers SWIG is generating, but it's hard to know
> > which without seeing your indexer code.  The generated code looks OK to
> > me at least.
> >
> > My best guess is that maybe you are passing a string for the weight
> > parameter in some case - as the documentation says:
>
> As you can see from the code above, it doesn't look like it can be
> getting anything other than 1 or 2 for the weight.

I think you must be misreading the code - the switch is on
"func_num_args()", which in PHP returns the number of parameters which
were passed to the current function/method.  So $weight is either 1 or
whatever was passed to the index_text method.
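
To make the point about func_num_args() concrete, here is a stand-alone
imitation of the wrapper's dispatch. describe_call() is a made-up name
for illustration only; it mirrors the shape of the generated index_text()
wrapper quoted above but calls nothing in Xapian.

```php
<?php
// Stand-alone imitation of the generated wrapper's dispatch logic.
// describe_call() is a hypothetical name used only for this illustration.
function describe_call($text, $weight = 1, $prefix = null)
{
    // func_num_args() counts the arguments the caller actually supplied;
    // it says nothing about the value held in $weight.
    switch (func_num_args()) {
    case 1: case 2:
        return "short form, weight=" . var_export($weight, true);
    default:
        return "long form, weight=" . var_export($weight, true);
    }
}

echo describe_call("some text"), "\n";          // one argument: default weight
echo describe_call("some text", "5"), "\n";     // weight arrives as a string
echo describe_call("some text", 5, "S"), "\n";  // three arguments: long form
```

In the second call the weight reaches the underlying function as the
string "5", which is the kind of value the type-based dispatch can reject.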

> >     One thing to be aware of though is that SWIG implements dispatch
> >     functions for overloaded methods based on the types of the
> >     parameters, so you can't always pass in a string containing a number
> >     (e.g. "42") where a number is expected as you usually can in PHP.
> >     You need to explicitly convert to the type required - e.g. use (int)
> >     to convert to an integer, (string) to string, (double) to a floating
> >     point number.
> Ahh, that may be something. If the field value is empty it may be null
> rather than an empty string. I guess I shouldn't be indexing it either
> way really. I won't get another chance to run this until tomorrow but
> I'll let you know how it goes. Thanks for the pointer.

That sounds plausible.

It would be nice to give a better error for such cases, but it's not
easy to do because the handling code is generated by SWIG from the
parameter lists of the overloaded forms of the function or method.

Cheers,
    Olly


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 02:04:36
On Jan 2, 2008 12:22 AM, Olly Betts <ollysurvex.com> wrote:
> On Tue, Jan 01, 2008 at 11:50:35PM +0000, Robert Young wrote:
> > > Which xapian-bindings version is this?  Line 1482 doesn't seem to match
> > > up with my tree.
> > 1.0.4
> > The function is
> >   function index_text($text,$weight=1,$prefix=null) {
> >     switch (func_num_args()) {
> >     case 1: case 2: TermGenerator_index_text($this->_cPtr,$text,$weight); break; // <-- line 1482
> >     default: TermGenerator_index_text($this->_cPtr,$text,$weight,$prefix);
> >     }
> >   }
> >
> > > It sounds like either you're passing in parameters with the wrong type,
> > > or it's a bug in the wrappers SWIG is generating, but it's hard to know
> > > which without seeing your indexer code.  The generated code looks OK to
> > > me at least.
> > >
> > > My best guess is that maybe you are passing a string for the weight
> > > parameter in some case - as the documentation says:
> >
> > As you can see from the code above, it doesn't look like it can be
> > getting anything other than 1 or 2 for the weight.
>
> I think you must be misreading the code - the switch is on
> "func_num_args()", which in PHP returns the number of parameters which
> were passed to the current function/method.  So $weight is either 1 or
> whatever was passed to the index_text method.
Ooops! Sorry, yes, hrm... I'll put that down to a late night response.
What I should have noticed however, is that I'm only passing one
parameter into index_text anyway.

> > >     One thing to be aware of though is that SWIG implements dispatch
> > >     functions for overloaded methods based on the types of the
> > >     parameters, so you can't always pass in a string containing a number
> > >     (e.g. "42") where a number is expected as you usually can in PHP.
> > >     You need to explicitly convert to the type required - e.g. use (int)
> > >     to convert to an integer, (string) to string, (double) to a floating
> > >     point number.
> > Ahh, that may be something. If the field value is empty it may be null
> > rather than an empty string. I guess I shouldn't be indexing it either
> > way really. I won't get another chance to run this until tomorrow but
> > I'll let you know how it goes. Thanks for the pointer.
>
> That sounds plausible.
>
> It would be nice to give a better error for such cases, but it's not
> easy to do because the handling code is generated by SWIG from the
> parameter lists of the overloaded forms of the function or method.
I was going to ask actually, is it possible to configure SWIG to throw
exceptions rather than fatal errors in PHP5+? The main problem is that
PHP fatal errors cannot be caught and cannot even be handled by custom
error handling so they're very difficult to debug.

Cheers
Rob


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 13:13:09
Hi,

Excellent, thanks Olly, that seems to get rid of the fatal error.
However, things are still not quite right. First, I think a bit of
background may be helpful. As I mentioned before, I'm indexing a
Wikipedia dump (approx 13Gb); I'm skipping all redirect entries, which
cuts out quite a lot. For the entries which I am indexing (the
articles), I am indexing the page id, the title, and the article text,
then I am setting as data a serialized object containing just the id
and title. The Wikipedia dump is being processed in 2Gb chunks due to
limitations in PHP.

There are a number of things I'm noticing which I'm not sure are normal:
- The position and postlist files seem to be growing at a tremendous
rate. The indexer hasn't even got past the first 2.0Gb chunk and
already both the position.DB and postlist.DB are each over 1.2Gb. I
have tried to find out exactly what each of the files does but haven't
had much luck. A brief addition to each of the table pages on the wiki
on what the table actually does would be really helpful and
gratefully received.
- As the index gets bigger the disk gets hammered. Now, obviously this
is to be expected to an extent but things are getting really bad,
looking at 90-95% cpu waiting on IO. I'm guessing this is in part due
to the fact that I'm doing this on my laptop with its crappy laptop
disk and partly due to using replace_document so that it has to do a
query on each update. Is there any way of making queries optimized for
querying uids? Would having an auxiliary index just for uid to docid
lookups help so that I only need call replace_document on documents I
know are in the index?
- Indexing performance really really drops off as the index grows.
It's not great at any rate as it's running on my laptop but it's been
running for over 12 hours now and it's still not indexed the first 2Gb
chunk. I'm guessing this is related to the second point.

Cheers
Rob


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 14:15:40
On Wed, Jan 02, 2008 at 07:13:09PM +0000, Robert Young wrote:

> - The position and postlist files seem to be growing at a tremendous
> rate. The indexer hasn't even got past the first 2.0Gb chunk and
> already both the position.DB and postlist.DB are each over 1.2Gb. I
> have tried to find out exactly what each of the files does but haven't
> had much luck. A brief addition to each of the table pages on the wiki
> on what the table actually does would be really helpful and
> gratefully received.

Post list is the index that goes from a term to the list of documents
that term indexes. Position list is the list of positions within a
given document that a term appears at.

Disabling positions in your index will remove the need for the
position list. You can't avoid the post list, as it's the main thing
you need.

If you have a large number of unique terms being generated, you'll get
a large database. There may be something to do with your term
generation that's unexpected here - you can dump a list of terms with
a little PHP script to find out what's going on, perhaps. (Maybe index
only a couple of documents and see if you're getting an expected list
of terms.)

> - As the index gets bigger the disk gets hammered. Now, obviously
> this is to be expected to an extent but things are getting really
> bad, looking at 90-95% cpu waiting on IO. I'm guessing this is in
> part due to the fact that I'm doing this on my laptop with its
> crappy laptop disk

Partly, of course, it's that you're probably using one fairly slow
spindle for both reading the Wikipedia data and writing the database
(with four or so tables). You probably don't have enough main memory to
let write-behind take care of the database tables efficiently -
explicitly flushing more often (or lowering the default flush
threshold) may help here.

I assume you aren't actually swapping out the indexing process? If you
are waiting for too many documents before flushing, there's a danger
that the index process code will be fighting with its data (and the
kernel buffers) for memory. If you're in that situation, again lowering
the default flush threshold may work, but other than that or buying more
memory you may simply be stuck.
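
A rough sketch of "explicitly flushing more often". index_in_batches()
and its callbacks are hypothetical; in a real indexer the two callbacks
would wrap whatever adds a document (for example replace_document) and
the writable database's flush() call. The batch size of 1000 is an
arbitrary illustration, not a recommendation.

```php
<?php
// Sketch only: $addDocument and $flush are placeholders for whatever the
// real indexer does with the database; nothing here touches Xapian.
function index_in_batches($documents, callable $addDocument, callable $flush, $batchSize = 1000)
{
    $pending = 0;
    foreach ($documents as $doc) {
        $addDocument($doc);
        if (++$pending >= $batchSize) {
            $flush();       // write out buffered changes before they grow too large
            $pending = 0;
        }
    }
    if ($pending > 0) {
        $flush();           // flush the final, partial batch
    }
}

// Dry run with stand-in callbacks so the control flow can be checked
// without a database.
$flushes = 0;
index_in_batches(
    range(1, 2500),
    function ($doc) { /* would add the document here */ },
    function () use (&$flushes) { $flushes++; },
    1000
);
echo $flushes, "\n";
```

The other route mentioned above, lowering the default threshold, is
normally done by setting the XAPIAN_FLUSH_THRESHOLD environment variable
before the indexer starts, which needs no code changes.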

> and partly due to using replace_document so that it has to do a
> query on each update. Is there any way of making queries optimized for
> querying uids? Would having an auxiliary index just for uid to docid
> lookups help so that I only need call replace_document on documents I
> know are in the index?

I don't actually know how replace_document works precisely when given
a unique identifying term (which is what I assume you mean by
UID). What it'll do under those circumstances is to check the posting
list for that term; it should be pretty fast at *finding* the entry in
the posting list (because that's kind of the point of Xapian's backend),
but will slow down dramatically if you can't get all the relevant
btree blocks into memory. Specifically, if you can't keep all the
'trunk' blocks that govern the 'U'-prefixed area (assuming you're
using 'U' as your unique term prefix) in memory, this is going to be
horrendously slow.

Note that even if you avoid the replace_document() call in some way
that's memory efficient, you still aren't going to index fast if you
can't keep the trunk blocks of the posting list in memory, because
you're going to need a lot of them in order to write a new document to
disk. (On write, some of them may then become unused, but that's fine
- again providing you have enough memory.)

> - Indexing performance really really drops off as the index grows.
> It's not great at any rate as it's running on my laptop but it's been
> running for over 12 hours now and it's still not indexed the first 2Gb
> chunk. I'm guessing this is related to the second point.

It may be. If you're sitting in iowait 90-95% of the time, you're
basically not doing anything. iostat(1) (invoked as say `iostat -x 2`)
on most Unixoids will verify that it's your disk getting thrashed (and
will give you an idea of svctm and await or similar, the average time to
service an IO request and the average time spent from entering the
wait queue to completion of IO service, which probably won't help in
this case but is often useful to know).

Something like slabtop(1) will let you look at the usage of various
memory caches and buffers, if you're on Linux (sorry, can't remember
if you've said this). If you've run out of core for the OS buffering
to work efficiently, that may help you track down where it's gone and
come up with a way round it.

At the end of the day, though, indexing large quantities of data on a
laptop is ambitious.

J

-- 
/--------------------------------------------------------------------------
  James Aylett                                                  xapian.org
  jamestartarus.org                               uncertaintydivision.org


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 17:53:58
Hey,

Thanks, lots of really interesting information.

On Jan 2, 2008 8:15 PM, James Aylett <james-xapiantartarus.org> wrote:
> Post list is the index that goes from a term to the list of documents
> that term indexes. Position list is the list of positions within a
> given document that a term appears at.
>
> Disabling positions in your index will remove the need for the
> position list. You can't avoid the post list, as it's the main thing
> you need.
>
> If you have a large number of unique terms being generated, you'll get
> a large database. There may be something to do with your term
> generation that's unexpected here - you can dump a list of terms with
> a little PHP script to find out what's going on, perhaps. (Maybe index
> only a couple of documents and see if you're getting an expected list
> of terms.)
Yes, this may be an issue, I'm seeing a few strange things happening:
- It doesn't look like the stemmer is doing anything, just as one
example of many, surely woman and women should have the same stem?
- How can I have 's removed from the end of terms?
- Wikipedia has lots of words in other languages (completely different
character sets) is there a way of getting the indexer to ignore terms
with characters outside a given range?
- There are lots of things getting indexed which I would not have
expected to be indexed such as numbers and number string combinations
- All terms which start with a letter seem to be duplicated in
Z-prefixed terms with the same frequency as the unprefixed term,
what's this for?

I've had a read of the rest of your comments and they are very
interesting and informative. I'm not, however, going to take another
look at the other problems and possible solutions until I've managed
to reduce the number of terms being generated. Does that sound like a
sensible order?

Cheers
Rob


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 18:12:38
On Wed, Jan 02, 2008 at 08:04:36AM +0000, Robert Young wrote:
> I was going to ask actually, is it possible to configure SWIG to throw
> exceptions rather than fatal errors in PHP5+? The main problem is that
> PHP fatal errors cannot be caught and cannot even be handled by custom
> error handling so they're very difficult to debug.

We already throw PHP exceptions for Xapian exceptions for PHP5, so it's
certainly possible to implement.

SWIG doesn't currently directly support throwing exceptions for errors
like "incorrect parameters".  The best (and perhaps only) way to
implement this would be to modify SWIG.  Perhaps it should always throw
exceptions for such cases rather than it being an option.

I doubt I'll have time to look at this for a while, but if someone comes
up with a suitable patch for SWIG, I can apply it (since I moonlight as
SWIG's PHP maintainer!)

Cheers,
    Olly


Re: PHP Fatal error while indexing Wikipedia
user name
2008-01-02 18:24:21
On Wed, Jan 02, 2008 at 11:53:58PM +0000, Robert Young wrote:

> Yes, this may be an issue, I'm seeing a few strange things happening:
> - It doesn't look like the stemmer is doing anything, just as one
> example of many, surely woman and women should have the same stem?

Ideally, but actually the English stemmer doesn't do this at the
moment (in the most convenient online wordlist I have to hand, only
about one in six -men words should stem to the same as the equivalent
word ending -man). Try something like 'happiness', which should stem
to 'happi'.

> - How can I have 's removed from the end of terms?

How are you generating your terms? (Again, you may have mentioned this
already - sorry if so.) From later comments, I assume you're using
TermGenerator, probably directly from scriptindex or omindex. If this
is the case, then at the moment you can't: we're including single
apostrophes in terms because otherwise you end up with a lot of 'junk'
words (eg "didn't" => "didn" and "t", which isn't helpful). It also
enabled a change in the way searches including an apostrophised word
were managed, which improved the speed of them.

If you just want to kill them always, you probably need a custom
stemmer. It shouldn't be too hard, but you're putting yourself in for
more work doing that.
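
As an alternative to a custom stemmer (a different technique from the
one suggested above), the text could be pre-processed before it is handed
to the term generator. This is only a sketch: strip_possessive() is a
hypothetical helper, it assumes valid UTF-8 input, and it will also turn
contractions such as "it's" into "it".

```php
<?php
// Hypothetical pre-processing step (not a Xapian feature): drop an
// apostrophe-s, written with a straight or curly apostrophe, from the end
// of each word before the text is indexed.
function strip_possessive($text)
{
    // (?<=\p{L})  the apostrophe must directly follow a letter
    // \b          the "s" must be the last character of the word
    return preg_replace("/(?<=\\p{L})(?:'|\u{2019})s\\b/u", '', $text);
}

echo strip_possessive("Wikipedia's articles and didn't"), "\n";
```

Words such as "didn't" are left alone, so the reason given above for
keeping apostrophes inside terms still holds.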

> - Wikipedia has lots of words in other languages (completely different
> character sets) is there a way of getting the indexer to ignore terms
> with characters outside a given range?

You'll probably need a custom indexer for this. You need to think
carefully about your index plan as well at this point - do you
genuinely want to just drop those words? Are they marked up correctly
in some way to indicate the source language (Wikipedia's output is
HTML, so they really should be, but I wouldn't be surprised if they
aren't)?
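
One possible shape for such a filter, as a sketch under stated
assumptions rather than a custom indexer in the Xapian sense: drop every
whitespace-separated token that contains a letter outside the Latin
script, and pass the rest on to index_text(). keep_latin_tokens() is a
hypothetical name, and "Latin script only" is just one way to read
"characters outside a given range".

```php
<?php
// Hypothetical filter (not part of Xapian): keep only tokens whose letters
// all belong to the Latin script. Digits and punctuation are not letters,
// so they never cause a token to be dropped.
function keep_latin_tokens($text)
{
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return '';  // preg_split fails on invalid UTF-8; index nothing
    }
    $kept = array();
    foreach ($tokens as $token) {
        // The class matches any character that is a letter (\P{L} excluded)
        // and not Latin (\p{Latin} excluded), i.e. a non-Latin letter.
        if (!preg_match('/[^\p{Latin}\P{L}]/u', $token)) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

echo keep_latin_tokens("Moscow Москва founded 1147"), "\n";
```

Whether silently dropping these words is what you want is the index
planning question raised above; the sketch only shows the mechanics.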

> - There are lots of things getting indexed which I would not have
> expected to be indexed such as numbers and number string combinations

The default term generator indexes a lot of things which have proven
useful in the past. We'd like to make it more flexible in the future,
so if you have a particular way you'd like it to work, let us know.

> - All terms which start with a letter seem to be duplicated in
> Z-prefixed terms with the same frequency as the unprefixed term,
> what's this for?

This is because of the way we do phrase matching (and some other
things). The Z-prefixed terms should be the stemmed variants. There's
more detail at <http://www.xapian.org/docs/termgenerator.html>.

> I've had a read of the rest of your comments and they are very
> interesting and informative. I'm not, however, going to take another
> look at the other problems and possible solutions until I've managed
> to reduce the number of terms being generated. Does that sound like a
> sensible order?

Yes, that sounds reasonable. It's generally a good idea to get the
results you want before trying to optimise.

J

-- 
/--------------------------------------------------------------------------
  James Aylett                                                  xapian.org
  jamestartarus.org                               uncertaintydivision.org
