Thread: ICU
|
|
| ICU |

|
2006-04-10 03:44:21 |
I've just been looking at ICU with an eye to reworking the unicode
queryparser patch to use it. A few things have jumped out so far which
make me wonder if it's the best option. I don't really know what the
alternatives are though (currently QueryParser uses glib's unicode
routines).

The first is that there seems to be bad version skew. Ubuntu breezy
(the latest release) has ICU 2.1 and 2.8 packaged, as does Debian sarge
(the latest stable release). The latest ICU version is 3.4.1 (and
debian unstable only has this version). I can't seem to find what's
changed between ICU versions (except for release notes for 3.2 and 3.4
versions), so I worry this is going to be a hassle.

The second is that all the multi-statement macro definitions in their
headers are just enclosed in a block "{...}" instead of using the
familiar "do {...} while (0)" trick to avoid surprise when used in
places where an extra ";" matters.

This doesn't seem to rate a mention in the user guide, but e.g.
/usr/include/unicode/utf8.h says:

 * <em>Usage:</em>
 * ICU coding guidelines for if() statements should be followed when using these macros.
 * Compound statements (curly braces {}) must be used for if-else-while...
 * bodies and all macro statements should be terminated with semicolon.

I don't really like the attitude that *I* have to follow *their* coding
guidelines in my own code! If I'm contributing code to their project
then I agree it's reasonable to expect adherence to their coding
standards, but not just to use their library.

By eschewing the standard idiom for wrapping multiline macro calls,
they're forcing the risk of silent miscompilation on their users.

Finally, they use UTF-16 as their internal representation whereas we
want to use UTF-8. For the queryparser, this isn't an issue as there
are macros for decoding UTF-8 characters and for saying if a unicode
code point is upper case, etc. But in omindex we want to be able to
convert between encodings, and it looks like we have to go via UTF-16.
I suspect we'd end up writing our own ISO-8859-1 to UTF-8 converter
(that's probably the most common conversion we'd need).
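
For a sense of scale, an ISO-8859-1 to UTF-8 converter is only a few
lines, since every Latin-1 byte maps directly to the code point with the
same value. The following is a minimal sketch, not code from Xapian or
ICU; the function name latin1_to_utf8 and its buffer-based interface are
assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Xapian or ICU): convert in_len bytes of
 * ISO-8859-1 text to UTF-8.  Returns the number of bytes written to out,
 * or (size_t)-1 if out_cap is too small.  No NUL terminator is added. */
size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap)
{
    size_t i, o = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < in_len; ++i) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII is encoded as itself. */
            if (o + 1 > out_cap)
                return (size_t)-1;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF become a two-byte sequence: 110xxxxx 10xxxxxx. */
            if (o + 2 > out_cap)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

In the worst case the output is twice the size of the input, so a caller
can size the buffer as 2 * in_len up front.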
Cheers,
Olly
_______________________________________________
Xapian-devel mailing list
Xapian-devel lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
|
|
| ICU |

|
2006-04-11 01:55:08 |
On Mon, Apr 10, 2006 at 04:44:21AM +0100, Olly Betts wrote:
> By eschewing the standard idiom for wrapping multiline macro calls,
> they're forcing the risk of silent miscompilation on their users.

I've been thinking about this, and I think there's actually no silent
miscompilation risk here (I was thinking of the "dangling else" issue
but this is a rather different situation).

I think the extra semicolon can only stop code which looks like it
should work from compiling (and on the flip side code which looks like
it shouldn't compile can - e.g. if you accidentally omit the semicolon).

But I still think it's a bloody-minded attitude. A macro which looks
like a function should work as one.
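
To make the difference concrete, here is a small sketch of the two macro
styles. The macros SWAP_BRACES and SWAP_DO_WHILE and the two functions
are hypothetical and written only for illustration; they are not ICU's
definitions.

```c
#include <assert.h>

/* Brace-only form: "SWAP_BRACES(a, b);" expands to a block followed by an
 * empty statement. */
#define SWAP_BRACES(a, b) { int tmp_ = (a); (a) = (b); (b) = tmp_; }

/* do/while(0) form: the expansion is one statement that is only complete
 * once the caller adds the trailing ';'. */
#define SWAP_DO_WHILE(a, b) \
    do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Orders the pair so that *lo <= *hi; returns 1 if a swap happened.
 * The unbraced if/else works because the macro behaves like a function
 * call followed by ';'. */
int sort_pair(int *lo, int *hi)
{
    int swapped = 1;

    assert(lo != 0 && hi != 0);
    if (*lo > *hi)
        SWAP_DO_WHILE(*lo, *hi);
    else
        swapped = 0;
    return swapped;
}

/*
 * With SWAP_BRACES the same unbraced code is rejected by the compiler,
 * because the ';' after the macro call ends the if statement and leaves
 * the else with nothing to attach to:
 *
 *     if (*lo > *hi)
 *         SWAP_BRACES(*lo, *hi);
 *     else
 *         swapped = 0;
 *
 * Wrapping the bodies in braces, as the ICU header comment asks, works.
 */
int sort_pair_braces(int *lo, int *hi)
{
    int swapped = 1;

    assert(lo != 0 && hi != 0);
    if (*lo > *hi) {
        SWAP_BRACES(*lo, *hi);
    } else {
        swapped = 0;
    }
    return swapped;
}
```

The failure mode is a compile error rather than silently wrong code.
The flip side is that "SWAP_BRACES(a, b)" with the semicolon omitted
compiles, whereas the do/while(0) form would reject it.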

Michael Schlenker pointed me at the unicode handling code in Tcl. It
looks very well done - the source file which provides utf8 handling and
unicode codepoint identification for the BMP (i.e. codepoints below
0x10000) compiles to a 28K object file (x86-64) which I think is all
the unicode support which the QueryParser class needs. I think here
the evils of cut-and-paste code reuse are less than the annoyance of
adding a large library dependency to the core library.

For Omega we also need encoding conversion, which I think inevitably
needs a large bit of code or data. Tcl's code for this is compact, but
has 1.3MB of data files. I don't see so much an issue with adding a
large library dependency to omega, be it ICU, glib, using Tcl's code, or
using an installed version of Tcl. Or something else.

What's a good option partly comes down to "what are people likely to
have installed anyway". Looking at the debian "popcon" results, the
answer seems to be glib, then Tcl, then ICU. But the spread isn't
great and ICU is pretty common (openoffice uses it I believe). Not sure
how representative the numbers are though, and they may be rather
different for non-Linux platforms.
Cheers,
Olly
|
|
| ICU |

|
2006-04-12 06:38:42 |
Olly Betts wrote:
> For Omega we also need encoding conversion, which I think inevitably
> needs a large bit of code or data. Tcl's code for this is compact, but
> has 1.3MB of data files. I don't see so much an issue with adding a large
> library dependency to omega, be it ICU, glib, using Tcl's code, or using
> an installed version of Tcl. Or something else.

What's wrong with iconv for encoding conversion ?

jf
|
|
| ICU |

|
2006-04-12 12:56:39 |
On Wed, Apr 12, 2006 at 08:38:42AM +0200, Jean-Francois Dockes wrote:
> What's wrong with iconv for encoding conversion ?

The main problem is iconv_open. As the Linux iconv_open man page puts it:

    The values permitted for fromcode and tocode and the supported
    combinations are system dependent.

The problem is that there's no standard accompanying API for discovering
what values are supported or which combinations. So perhaps on some
platform I can't convert from encoding X to utf-8, but I could convert
from encoding X to Y and then Y to utf-8. Or utf-8 may not be supported
at all. I've read before that these are genuine problems with trying to
use iconv.

It's also not portably documented how to spell any particular encoding -
for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
that name will work on another implementation even if utf-8 is supported.

The GNU implementation seems pretty decent - it supports a lot of
encodings and can convert between any given pair. So one option is to
use iconv where it's known to be decent, but use other code elsewhere.
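
As a rough illustration of what coping with this looks like in practice,
the sketch below tries several spellings of UTF-8 with iconv_open and
reports failure so that a caller could fall back to other code. The
function names open_to_utf8 and convert_to_utf8 are hypothetical, and
the list of spellings is a guess for illustration, not a documented set
of names for any platform.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: open a converter from `from` to UTF-8, trying a few
 * candidate spellings since the accepted names are system dependent. */
static iconv_t open_to_utf8(const char *from)
{
    /* Assumed candidate names; not an exhaustive or authoritative list. */
    static const char *const utf8_names[] = { "UTF-8", "UTF8", "utf-8", "utf8" };
    size_t i;

    for (i = 0; i < sizeof utf8_names / sizeof utf8_names[0]; ++i) {
        iconv_t cd = iconv_open(utf8_names[i], from);
        if (cd != (iconv_t)-1)
            return cd;
    }
    return (iconv_t)-1;
}

/* Hypothetical wrapper: convert in_len bytes from encoding `from` to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * conversion is unsupported, the input is invalid, or out is too small. */
size_t convert_to_utf8(const char *from, const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv takes char ** but only reads the input */
    char *outp = out;
    size_t in_left = in_len, out_left = out_cap;
    size_t rc;

    assert(from != NULL && in != NULL && out != NULL);
    cd = open_to_utf8(from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;    /* caller can fall back to other code */

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &out_left);  /* flush any state */
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0)
        return (size_t)-1;
    return out_cap - out_left;
}
```

A failure return only says that this iconv could not do the job under
the names tried. It cannot distinguish "encoding unsupported" from
"supported under a different spelling", which is the gap in the API
described above.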
Cheers,
Olly
|
|
| ICU |

|
2006-04-12 15:00:00 |
Olly Betts wrote:
> For Omega we also need encoding conversion, which I think inevitably
> needs a large bit of code or data. Tcl's code for this is compact, but
> has 1.3MB of data files. I don't see so much an issue with adding a large
> library dependency to omega, be it ICU, glib, using Tcl's code, or using
> an installed version of Tcl. Or something else.

Another potential option is Simon Tatham's "libcharset". I mainly
mention this for completeness: I'm not sure how actively he's developing
/ supporting this code, and it's unlikely to be installed already on
someone's system. On the plus side, it's easy to contact the author, I
think it more than provides the encoding conversion routines we'd need,
it's portable, and its utf8 conversion code is very compact. (The full
library compiles to about 650K on my machine, but most of that is
compiled data tables - the utf8 code compiles to about a 2K object file).

There isn't a webpage for it: the subversion repository is at
svn://tartarus.org/main/charset/
and is web-viewable at
http://www.tartarus.org/~simon-anonsvn/viewcvs.cgi/charset/

--
Richard
|
|
| ICU |

|
2006-04-13 07:36:39 |
Olly Betts writes:
> On Wed, Apr 12, 2006 at 08:38:42AM +0200, Jean-Francois Dockes wrote:
> > What's wrong with iconv for encoding conversion ?
>
> The main problem is iconv_open. As the Linux iconv_open man page puts it:
>
>     The values permitted for fromcode and tocode and the supported
>     combinations are system dependent.

True, I think it's part of a more general problem with locale/charsets
naming (for example, would you believe that on Solaris, the charset name
returned by nl_langinfo(CODESET) in the C locale is "646" ...)

> The problem is that there's no standard accompanying API for discovering
> what values are supported or which combinations. So perhaps on some
> platform I can't convert from encoding X to utf-8, but I could convert
> from encoding X to Y and then Y to utf-8. Or utf-8 may not be supported
> at all. I've read before that these are genuine problems with trying to
> use iconv.

Is there really a reasonably current platform with no support for
conversion to utf-8 ? What do you want to support beyond
Linux/xBSD/Solaris/AIX/HP-UX ?

> It's also not portably documented how to spell any particular encoding -
> for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
> that name will work on another implementation even if utf-8 is supported.

It's also probably true that the encoding names that you retrieve from
the source documents will be quite variable too.

> The GNU implementation seems pretty decent - it supports a lot of
> encodings and can convert between any given pair. So one option is to
> use iconv where it's known to be decent, but use other code elsewhere.

Another option might be to always use iconv, but carry GNU libiconv as a
dependency on systems where the native implementation proves to be
really deficient ? In any case, encoding conversion can be wrapped in a
few method calls, so it might not be a big issue to switch to ICU if
really needed.

I think that glib relies on libiconv, so it's only a candidate as a
wrapper (at least libiconv is required for building/installing glib on
FreeBSD).

After having a look at the ICU documentation, it does appear to be much
more complete than anything else, but also quite a large dependency to
carry.

Do you know how the different web browsers handle this issue ? I think
that openoffice uses ICU, and Mozilla uses all plus internal code

Jf
|
|
| ICU |

|
2006-04-13 14:14:15 |
On Thu, Apr 13, 2006 at 09:36:39AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
> > The problem is that there's no standard accompanying API for discovering
> > what values are supported or which combinations. So perhaps on some
> > platform I can't convert from encoding X to utf-8, but I could convert
> > from encoding X to Y and then Y to utf-8. Or utf-8 may not be supported
> > at all. I've read before that these are genuine problems with trying to
> > use iconv.
>
> Is there really a reasonably current platform with no support for
> conversion to utf-8 ?

I don't know. I'd have to try every reasonably current platform to find
out, and (thanks to the way iconv is specified) I need to try converting
every supported encoding to utf-8 on each platform to be sure, except
there's no API to discover the names of every supported encoding. Or
what "utf-8" is called.

It's not insurmountable (and I'd hope that each iconv implementation has
documentation to say what the supported encodings are and perhaps even
which pairs of conversions are supported), but it really ought to have
been standardised.

> What do you want to support beyond Linux/xBSD/Solaris/AIX/HP-UX ?

Incidentally I've never heard from anyone who's tried Xapian on AIX (and
IBM don't seem to have any sort of "developer access" program, which is
a little surprising given they seem very supportive of Open Source in
other ways).

But there's also Darwin/OS X (which I guess might be covered by xBSD),
DEC OSF1, IRIX, and MS Windows.

I do have access to most of these except AIX, IRIX, and MS Windows. And
SourceForge's OS X boxes have been offline for months now.

> > It's also not portably documented how to spell any particular encoding -
> > for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
> > that name will work on another implementation even if utf-8 is supported.
>
> It's also probably true that the encoding names that you retrieve from the
> source documents will be quite variable too.

But at least there are standards which specify most of those, and it's
not a potentially different problem on every platform.

> > The GNU implementation seems pretty decent - it supports a lot of
> > encodings and can convert between any given pair. So one option is to
> > use iconv where it's known to be decent, but use other code elsewhere.
>
> Another option might be to always use iconv, but carry GNU libiconv as a
> dependency on systems where the native implementation proves to be really
> deficient ?

That's definitely worth considering.

> In any case, encoding conversion can be wrapped in a few method
> calls, so it might not be a big issue to switch to ICU if really needed.

Yeah, that was my plan. I wonder if ultimately we might want to have
wrappers to support several different alternatives as they probably each
have good and bad points.

> After having a look at the ICU documentation, it does appear to be much
> more complete than anything else, but also quite a large dependency to
> carry.

Yes, it does seem very comprehensive.

> Do you know how the different web browsers handle this issue ? I think that
> openoffice uses ICU, and Mozilla uses all plus internal code

OpenOffice definitely uses ICU. I don't know about anything else.

Cheers,
Olly
|
|
| Xapian-core on AIX (was: ICU) |

|
2006-04-24 14:15:31 |
Olly Betts writes:
> Incidentally I've never heard from anyone who's tried Xapian on AIX (and
> IBM don't seem to have any sort of "developer access" program, which is
> a little surprising given they seem very supportive of Open Source in
> other ways).

I have an old IBM workstation around and, just for the "fun" of it, I
built xapian-core 0.9.2 on AIX 5.2 with gcc 3.3.2 (from the IBM AIX
support site).

The software builds without problems, except for the link of the shared
library which does not work at all. Libtool fails to build the .so file
out of the .a archives. I tried both with libtool 1.3 and 1.5.22.

I was able to build the shared library "by hand" (listing the object
files on the command line), and the result seemed to work normally (I
only performed minimal testing).

Jf
|
|
| Xapian-core on AIX |

|
2006-04-24 15:28:05 |
On Mon, Apr 24, 2006 at 04:15:31PM +0200, Jean-Francois Dockes wrote:
> I have an old IBM workstation around and, just for the "fun" of it, I built
> xapian-core 0.9.2 on AIX 5.2 with gcc 3.3.2 (from the IBM AIX support site).

Thanks for trying.

> The software builds without problems, except for the link of the shared
> library which does not work at all. Libtool fails to build the .so file
> out of the .a archives. I tried both with libtool 1.3 and 1.5.22.

Hmm, Xapian 0.9.2 was bootstrapped with libtool 1.5.18. Did you force
it to use libtool 1.5.22, or did you just install libtool 1.5.22? If
the latter, it'll still be using libtool 1.5.18. I can't imagine it
would even slightly work with libtool 1.3.

The thing to check is "./libtool --version" from the unpacked
xapian-core source directory.

> I was able to build the shared library "by hand" (listing the object files
> on the command line), and the result seemed to work normally (I only
> performed minimal testing).

Did you try "make check"?

Cheers,
Olly
|
|
| Xapian-core on AIX |

|
2006-04-24 22:04:12 |
Olly Betts writes:
> The thing to check is "./libtool --version" from the unpacked
> xapian-core source directory.

For 0.9.2, ./libtool --version says 1.5.18

> Did you try "make check"?

Just did it. It doesn't want to link 'apitest'. Here is what I get:

ld: 0711-317 ERROR: Undefined symbol: vtable for Xapian::TradWeight
ld: 0711-317 ERROR: Undefined symbol: vtable for Xapian::BM25Weight

This is quite strange because recollindex and delve do link, as do some
of the other xxxtest programs ...

Anyway, this is probably not interesting because I redid the build with
xapian-core 0.9.5 and this goes through normally (this seems to use
libtool 1.5.22, and it apparently fixes the problem).

However, make check doesn't pass. There is a core file in the tests
directory.

I'm appending the output from make check. I'm not sure if it's worth
pursuing the issue until someone actually asks for xapian on AIX ?

By the way, large file support does not work (cstdio does not compile),
this appears to be a known problem of gnu STL on AIX, I don't know if
there is a workaround. I configured xapian with --disable-largefile.

jf

gmake[2]: Entering directory `/home/softs/xapian-core-0.9.5/tests'
Running test: table1...FAIL: btreetest
Running test: open1...FAIL: quartztest
/home/softs/xapian-core-0.9.5/tests/.libs/lt-remotetest completed test run: All 4 tests passed.
PASS: remotetest
Running nodb tests with void backend...
Running test: emptyquery1...FAIL: apitest
/home/softs/xapian-core-0.9.5/tests/.libs/lt-internaltest completed test run: All 6 tests passed.
PASS: internaltest
The random seed is 42
Please report the seed when reporting a test failure.
Running tests with none stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with danish stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with dutch stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with english stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with finnish stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with french stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with german stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with italian stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with norwegian stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with portuguese stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with russian stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with spanish stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with swedish stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with english_lovins stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
Running tests with english_porter stemmer...
Running test: stemdict... SKIPPED
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest completed test run: All 2 tests passed, 1 skipped.
/home/softs/xapian-core-0.9.5/tests/.libs/lt-stemtest total: All 30 tests passed, 15 skipped.
PASS: stemtest
Running test: queryparser1...FAIL: queryparsertest
===================
4 of 7 tests failed
===================
gmake[2]: *** [check-TESTS] Error 1
gmake[2]: Leaving directory `/home/softs/xapian-core-0.9.5/tests'
gmake[1]: *** [check-am] Error 2
gmake[1]: Leaving directory `/home/softs/xapian-core-0.9.5/tests'
gmake: *** [check-recursive] Error 1
|
|
|
|