
Thread: wrong BM25 implementation in Lucene




2006-10-26 09:40:23
Hi,

The University of Amsterdam has provided a downloadable
language-modelling version of Lucene. Their language model is not BM25,
but it is quite similar in nature. The version is at:
http://ilps.science.uva.nl/Resources/#lm-lucen

I have worked with their version a bit. They have created new classes:
TermQueryLanguageModel, TermScorerLanguageModel,
IndexSearcherLanguageModel, LanguageModelIndexReader, etc. I think
their work could be useful to you.

If you end up with a successful implementation of BM25, would you be
happy to share it with us?

Jianhan

-----Original Message-----
From: beatriz ramos [mailto:beatriz.ramos.morenogmail.com] 
Sent: 25 October 2006 16:01
To: java-dev
Subject: wrong BM25 implementation in Lucene

Hello, this is the BM25 algorithm I implemented in Lucene.

It doesn't work: I compared my results with the results of MG4J (on
the same document set) and they differ.

I don't know whether my formula is wrong or there is another mistake.

Could you help me?

------------------------------------------------------------------------

public class BM25Scorer extends Scorer {

    private final static double EPSILON_SCORE = 1.000000082240371E-9;
    private final static double DEFAULT_K1 = 0.75d;
    private final static double DEFAULT_B = 0.95d;
    private double b = DEFAULT_B;
    private double k1 = DEFAULT_K1;

    private IndexReader reader;
    private Term term;
    private Hits hits;
    private int position;   // document position in hits
    private IndexSearcher searcher;

    private int cooc = 0;    // how many times the term appears in the document
    private float idf;


    public float score() throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(hits.id(position), term.field());

        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0 ; i < terms.length ; i++) {
            if( terms[i].equalsIgnoreCase(term.text()) ){
                cooc = freqs[i];
            }
        }

        idf = searcher.getSimilarity().idf(term, searcher);

        Document document = (Document)hits.doc(position);
        // document length is a field of my index
        String[] values = document.getValues("DOCUMENT_LENGTH");

        // document length (number of words)
        long docLength = Long.valueOf(values[0]).longValue();
        long averageLength = 200;

        double loga = Math.max(EPSILON_SCORE, new Float(idf).doubleValue());
        double score = (loga * (k1 + 1) * cooc)
                / (cooc + k1 * ((1 - b) + (b * docLength / averageLength)));

        return new Float(score).floatValue();
    }
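
For comparison, here is a minimal sketch of a BM25 term score in plain Java with no Lucene dependency. It is not a diagnosis of the code above, only a reference to check it against. It makes three assumptions that differ from the posted code. First, the idf is the Robertson/Sparck Jones form log((N - df + 0.5) / (df + 0.5)), whereas the posted code uses the value from searcher.getSimilarity().idf(...), which may follow a different formula. Second, the average document length is computed from the collection instead of being fixed at 200. Third, k1 = 1.2 and b = 0.75 are used as commonly cited starting values, where the posted code has 0.75 and 0.95. The class and method names (Bm25Sketch, idf, termScore, averageLength) are hypothetical and chosen only for illustration. Any of these three differences could change the ranking relative to another engine, so both sides of a comparison should use the same choices.

```java
// Minimal BM25 term-score sketch in plain Java (no Lucene dependency).
// All names are hypothetical; parameter values are assumptions, not MG4J's or Lucene's.
final class Bm25Sketch {

    // Robertson/Sparck Jones idf (assumed variant).
    // numDocs = documents in the collection, docFreq = documents containing the term.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Average document length, computed from the collection instead of hard-coded.
    static double averageLength(long[] docLengths) {
        long total = 0;
        for (long len : docLengths) {
            total += len;
        }
        return (double) total / docLengths.length;
    }

    // Score contribution of one query term for one document.
    // tf = occurrences of the term in the document (called cooc in the posted code).
    static double termScore(double idf, int tf, long docLength,
                            double avgDocLength, double k1, double b) {
        if (tf == 0) {
            return 0.0; // a term that does not occur contributes nothing
        }
        double lengthNorm = k1 * ((1 - b) + b * docLength / avgDocLength);
        return idf * (k1 + 1) * tf / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        long[] lengths = {100, 200, 300};          // toy collection of three documents
        double avg = averageLength(lengths);
        double termIdf = idf(lengths.length, 1);   // term occurs in one document
        // Assumed starting values: k1 = 1.2, b = 0.75.
        System.out.println(termScore(termIdf, 3, 100, avg, 1.2, 0.75));
    }
}
```

Note that this idf variant goes negative for a term that occurs in more than half of the documents, which is one reason implementations clamp it to a small positive value, as the EPSILON_SCORE floor in the posted code does.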

