Thread: Character position differences




user name
2006-02-08 02:29:57
Hello,

I noticed the following behaviour when working with documents.

We have a document with SGML tags, named ABC.sgml. We load it into
GATE, where it appears as the GATE document ABC.sgml_XXXXX (XXXXX is
some id). We process it with a pipeline and then evaluate the result
using AnnotationDiff.

If we run the diff on a single GATE document (so both key and response
are ABC.sgml_XXXXX, with the key taken from the "Original markups" set
and the response from the default set), everything is fine.

But if we compare ABC.sgml_XXXXX with another GATE document,
ABC.sgml_XXXYZ (created from the same physical file ABC.sgml!), some
character positions change. For example, an entity identified by the
JAPE rules runs from 1935 to 1957 in the key document, but from 1927
to 1949 in the response. The lengths match (22 characters each); the
span is just shifted by 8 characters. The text displayed is the same,
so the JAPE rules work fine. As a result, AnnotationDiff reports the
entity as "Partially Correct" rather than "Correct". However, the
shift affects only some of the entities: I recognize only two entities
in the document at the moment, and the first one, at position 39, is
fine (no offset between the key and response documents).
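
To make the arithmetic above concrete, here is a minimal sketch of how a comparison based purely on character offsets would judge the two spans. It is not GATE code: the class SpanCompare, its methods, and the Match enum are hypothetical names invented for illustration, and the end offset 45 used for the entity at position 39 is made up, since only its start position is known.

```java
// Hypothetical helper, NOT part of the GATE API. It only mimics how a
// diff that looks at character offsets would judge two annotation spans.
final class SpanCompare {

    enum Match { CORRECT, PARTIALLY_CORRECT, MISSING }

    // Spans are given as start and end character offsets, end exclusive.
    static Match classify(long keyStart, long keyEnd,
                          long respStart, long respEnd) {
        if (keyStart == respStart && keyEnd == respEnd) {
            return Match.CORRECT;                 // identical offsets
        }
        // Two spans overlap when each one starts before the other ends.
        boolean overlaps = respStart < keyEnd && keyStart < respEnd;
        return overlaps ? Match.PARTIALLY_CORRECT : Match.MISSING;
    }

    // Difference between the start offsets (positive: key starts later).
    static long shift(long keyStart, long respStart) {
        return keyStart - respStart;
    }

    public static void main(String[] args) {
        // Offsets reported in the post: key 1935-1957, response 1927-1949.
        System.out.println(classify(1935, 1957, 1927, 1949));
        System.out.println(shift(1935, 1927));
        // Entity at position 39; the end offset 45 is an assumed value.
        System.out.println(classify(39, 45, 39, 45));
    }
}
```

Because the shifted spans still overlap, an offset-based comparison cannot call them identical even though the covered text is the same in both documents.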

This also makes it impossible to run the pipeline on a document
without annotations (SGML/XML) and evaluate it against the file with
those annotations.

Because this also happens when the same physical file is loaded as two
GATE documents, it looks as if one of the processing resources in the
pipeline changes something in the document. My pipeline is:

gate.creole.tokeniser.DefaultTokeniser
gate.creole.gazetteer.DefaultGazetteer
gate.creole.splitter.SentenceSplitter
JAPE Transducer like in NER
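
One way to test the suspicion that the document text itself differs between the two GATE documents is to compare the two texts character by character and look at the first position where they diverge. The sketch below is plain Java with hypothetical names (TextDiff, firstDifference); in GATE the two strings would presumably come from something like doc.getContent().toString(), which is an assumption here. The sample strings, which differ only in their line ending, are purely illustrative and not a diagnosis of the actual cause.

```java
// Hypothetical diagnostic, NOT part of the GATE API: finds where two
// versions of a document's text first diverge.
final class TextDiff {

    // Returns the index of the first differing character,
    // or -1 if the two strings are identical.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Equal up to the shorter length: differ only if lengths differ.
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Illustrative texts only: same words, different line endings.
        String keyText = "line one\r\nline two";
        String responseText = "line one\nline two";

        System.out.println(firstDifference(keyText, responseText));
        // Every annotation after the first difference is shifted by
        // the accumulated length difference.
        System.out.println(keyText.length() - responseText.length());
    }
}
```

If the first difference lies between position 39 and position 1927, that would be consistent with the first entity matching and the second one being shifted.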

Has anybody else run into this problem? Is there a way to solve it?
I am using GATE 3/1846, Java 1.5.04, on Windows.

rgds,
Pawel
