Hello,
I observed the following behaviour when working with documents.
We have a document with SGML tags, named ABC.sgml. We read it into
GATE, where it is identified as the GATE document ABC.sgml_XXXXX
(XXXXX is some id). We process it with a pipeline and then evaluate
the result using AnnotationDiff.
If we do this on the same GATE document (so both key and response are
ABC.sgml_XXXXX, but the key is set to use "original markup" and the
response the "default set"), everything is fine.
But if we compare this ABC.sgml_XXXXX with another GATE document,
ABC.sgml_XXXYZ (created from the same physical file ABC.sgml!), some
character positions change!
For example, an entity identified by the JAPE rules runs from 1935 to
1957 in the key document, but from 1927 to 1949 in the response.
So the lengths are the same, just shifted by 8 characters (and the
lexical form displayed is the same, so the JAPE rules work fine).
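
To make the arithmetic explicit, here is a minimal sketch in plain Java. The class and method names (SpanShift, length, shift) are made up for illustration and are not part of the GATE API; only the four offsets come from the example above.

```java
// Hypothetical helper, not part of GATE: compares two annotation spans
// given only their start and end offsets.
final class SpanShift {
    /** Number of characters covered by a span. */
    static long length(long start, long end) {
        return end - start;
    }

    /** Signed distance between the key start and the response start. */
    static long shift(long keyStart, long responseStart) {
        return keyStart - responseStart;
    }

    public static void main(String[] args) {
        // Offsets reported above for the same entity in the two documents.
        long keyStart = 1935, keyEnd = 1957;
        long respStart = 1927, respEnd = 1949;

        System.out.println("key length      = " + length(keyStart, keyEnd));   // 22
        System.out.println("response length = " + length(respStart, respEnd)); // 22
        System.out.println("shift           = " + shift(keyStart, respStart)); // 8
    }
}
```

Both spans cover 22 characters, and the key span starts 8 characters later than the response span.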
As a result, AnnotationDiff reports the entity as "Partially Correct"
rather than "Correct". However, the difference in character positions
affects only some of the entities: I recognize only two entities in
the document at the moment, and the first one, at position 39, is fine
(no misplacement between the key and response documents).
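
The following sketch shows why a pure offset shift would turn "Correct" into "Partially Correct". It is a simplified model based only on offsets, not GATE's actual AnnotationDiff code (which, as far as I know, also takes annotation type and features into account). The class SpanMatch, the method classify, and the "No match" label are assumptions made up for this example.

```java
// Simplified, hypothetical model of span comparison.
// This is NOT GATE's AnnotationDiff implementation.
final class SpanMatch {
    /** Classifies a response span against a key span by offsets alone. */
    static String classify(long keyStart, long keyEnd,
                           long respStart, long respEnd) {
        if (keyStart == respStart && keyEnd == respEnd) {
            return "Correct";            // identical offsets
        }
        if (respStart < keyEnd && keyStart < respEnd) {
            return "Partially Correct";  // spans overlap but do not coincide
        }
        return "No match";               // spans are disjoint
    }

    public static void main(String[] args) {
        // Shifted entity from the example: key 1935-1957, response 1927-1949.
        System.out.println(classify(1935, 1957, 1927, 1949));
        // The same entity without any shift.
        System.out.println(classify(1935, 1957, 1935, 1957));
    }
}
```

Under this model the shifted pair overlaps by 14 characters (1935 to 1949), so it counts as a partial match, while an unshifted pair would be an exact match.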
This also makes it impossible to run the pipeline on a document
without annotations (SGML/XML) and evaluate it against the file with
those annotations.
Because it also happens when reading the same physical file as two
GATE documents, it looks as if one of the PRs in the pipeline changes
something in the document. My pipeline is:
gate.creole.tokeniser.DefaultTokeniser
gate.creole.gazetteer.DefaultGazetteer
gate.creole.splitter.SentenceSplitter
JAPE Transducer like in NER
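
One way to test the suspicion that the document text itself differs would be to compare the text of the two documents and find where they first diverge. Below is a minimal sketch in plain Java; FirstDifference and firstDifference are made-up names, and the sample strings (with a CRLF versus LF line ending) are invented purely to have something to compare, not a diagnosis of the actual cause. I assume the text of a GATE document can be obtained as a String, for example via doc.getContent().toString(), but that should be checked against the GATE version in use.

```java
// Hypothetical diagnostic helper, not part of GATE.
final class FirstDifference {
    /**
     * Returns the index of the first character at which the two texts
     * differ, or -1 if they are identical. If one text is a prefix of
     * the other, the length of the shorter text is returned.
     */
    static int firstDifference(String a, String b) {
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : common;
    }

    public static void main(String[] args) {
        // Invented sample texts; in practice these would be the contents
        // of the key and response documents.
        String key = "line one\r\nline two";
        String response = "line one\nline two";
        System.out.println("first difference at " + firstDifference(key, response));
    }
}
```

If the two contents turn out to be identical, the shift must come from the annotations rather than from the text; if they differ, the reported index shows where to start looking.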
Has anybody else faced this problem? Is there a way of solving it?
I am using GATE 3/1846, Java 1.5.04, Windows.
rgds,
Pawel