Hi,
The University of Amsterdam has released a downloadable language-modelling
version of Lucene. Their language model is not BM25, but it is quite
similar in nature. The version is at:
http://ilps.science.uva.nl/Resources/#lm-lucen
I have worked with their version a bit. They have created new classes:
TermQueryLanguageModel, TermScorerLanguageModel,
IndexSearcherLanguageModel, LanguageModelIndexReader, etc. I think their
work could be useful to you.
If you have a successful implementation of BM25, would you be happy to
share it with us?
Jianhan
-----Original Message-----
From: beatriz ramos [mailto:beatriz.ramos.moreno gmail.com]
Sent: 25 October 2006 16:01
To: java-dev
Subject: wrong BM25 implementation in Lucene
Hello, this is the BM25 algorithm I implemented in Lucene.
It doesn't work: I compared my results with the results of MG4J (on the
same document set) and they differ.
I don't know whether my formula is wrong or there is another mistake.
Could you help me?
------------------------------------------------------------------------
public class BM25Scorer extends Scorer {
    private final static double EPSILON_SCORE = 1.000000082240371E-9;
    private final static double DEFAULT_K1 = 0.75d;
    private final static double DEFAULT_B = 0.95d;

    private double b = DEFAULT_B;
    private double k1 = DEFAULT_K1;
    private IndexReader reader;
    private Term term;
    private Hits hits;
    private int position; // document position in hits
    private IndexSearcher searcher;
    private int cooc = 0; // how many times the term appears in the document
    private float idf;

    public float score() throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(hits.id(position), term.field());
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equalsIgnoreCase(term.text())) {
                cooc = freqs[i];
            }
        }
        idf = searcher.getSimilarity().idf(term, searcher);
        Document document = (Document) hits.doc(position);
        // document length is a field of my index
        String[] values = document.getValues("DOCUMENT_LENGTH");
        // document length (number of words)
        long docLength = Long.valueOf(values[0]).longValue();
        long averageLength = 200;
        double loga = Math.max(EPSILON_SCORE, new Float(idf).doubleValue());
        double score = (loga * (k1 + 1) * cooc)
                / (cooc + k1 * ((1 - b) + (b * docLength / averageLength)));
        return new Float(score).floatValue();
    }
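For comparison, here is a minimal, Lucene-independent sketch of the per-term BM25 score in its commonly cited form, using the Robertson/Sparck Jones IDF ln((N - n + 0.5) / (n + 0.5)). It is not taken from MG4J or from any Lucene release. The class name Bm25Sketch, its method names, and the 1e-9 floor are assumptions made for illustration. It may help to check three things against the posted code: Lucene's Similarity.idf() is, as far as I know, defined differently from the IDF below; the average document length is computed from the collection here rather than fixed at 200; and the term frequency defaults to 0 when the term is absent instead of keeping a value from an earlier call. The values of k1 and b also need to match whatever the reference system was run with.

```java
public final class Bm25Sketch {
    // Assumed floor so that very common terms never get a negative weight.
    static final double EPSILON_SCORE = 1e-9;

    /** Robertson/Sparck Jones IDF, floored at EPSILON_SCORE. */
    static double idf(long numDocs, long docFreq) {
        double raw = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        return Math.max(EPSILON_SCORE, raw);
    }

    /** Average document length computed from the actual collection. */
    static double averageLength(long[] docLengths) {
        long total = 0;
        for (long len : docLengths) {
            total += len;
        }
        return (double) total / docLengths.length;
    }

    /** Looks up a term in parallel term/frequency arrays; returns 0 if absent. */
    static int termFrequency(String[] terms, int[] freqs, String term) {
        for (int i = 0; i < terms.length; i++) {
            // Exact match: assumes the query term went through the same analysis as the index.
            if (terms[i].equals(term)) {
                return freqs[i];
            }
        }
        return 0;
    }

    /** BM25 contribution of one term to one document's score. */
    static double score(int tf, long docLength, double avgLength,
                        long numDocs, long docFreq, double k1, double b) {
        if (tf == 0) {
            return 0.0;
        }
        // Length normalisation: equals k1 when the document has average length.
        double norm = k1 * ((1 - b) + b * docLength / avgLength);
        return idf(numDocs, docFreq) * (k1 + 1) * tf / (tf + norm);
    }

    public static void main(String[] args) {
        long[] lengths = {100, 200, 300};
        double avg = averageLength(lengths);
        int tf = termFrequency(new String[] {"bm25", "lucene"}, new int[] {3, 1}, "bm25");
        System.out.println(score(tf, 100, avg, 100, 10, 1.2, 0.75));
    }
}
```

Running both implementations on a tiny hand-built collection, where N, n, the document lengths and the term frequencies are known, makes it easier to see which factor is responsible for a difference.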