
Thread: Re: Xapian and research in IR: a few suggestions from experience




Switzerland
2007-09-05 11:45:01
> Do you have some pointers to the models you have in mind so I can get
> an idea what sort of data we might be talking about?  (I can see this
> could be useful for storing a "reputation" score for each document,
> derived from link analysis, user clicks, etc.)

I am now working with PLSA (Probabilistic Latent Semantic Analysis), which
assumes that documents (d) and terms (w) are associated with categories (z),
and represents the data as a mixture model with parameters P(z), P(d|z) and
P(w|z). I have implemented this model using objects built around Xapian.
There is also some work being done on naive Bayesian models and on Latent
Dirichlet Allocation.
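
To make the kind of data concrete, here is a minimal sketch of the PLSA mixture, P(d, w) = sum over z of P(z) P(d|z) P(w|z), in plain Python. It does not use Xapian and is not the implementation mentioned above; the function name and the toy parameter values are made up for illustration.

```python
# Minimal sketch of the PLSA joint probability; not tied to Xapian.
# All names and numbers here are illustrative assumptions.

def plsa_joint(p_z, p_d_given_z, p_w_given_z, d, w):
    """Return P(d, w) = sum_z P(z) * P(d|z) * P(w|z)."""
    return sum(
        p_z[z] * p_d_given_z[z][d] * p_w_given_z[z][w]
        for z in range(len(p_z))
    )

# Toy model: 2 categories, 2 documents, 3 terms.
p_z = [0.5, 0.5]                                   # P(z)
p_d_given_z = [[0.75, 0.25], [0.25, 0.75]]         # P(d|z), one row per z
p_w_given_z = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]   # P(w|z), one row per z

# The joint distribution over all (document, term) pairs.
joint = [
    [plsa_joint(p_z, p_d_given_z, p_w_given_z, d, w) for w in range(3)]
    for d in range(2)
]
```

Each document thus carries one double per category (its column of P(d|z)), and so does each term, which is why vectors of doubles come up.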

All these models would call for doubles, or vectors of doubles, to be
associated with Documents, TermIterators and Databases.

> > For a less important and fundamental suggestion, I'd like to mention that
> > in research, it is often important to have unique and determined
> > identifiers (strings) for documents. I have seen this done by using
> > prefixed terms (which is not very clean) or by using the "data" field of
> > documents (which lacks an iterator: one cannot jump to one particular
> > document easily this way). It might be interesting to do something on
> > this level (maybe simply by wrapping the "prefixed term" way into
> > something cleaner).
>
> What do you have in mind?  You can already add/replace or delete a
> document by term.  An overloaded version of get_document() which could
> retrieve the first document matching a particular term would be fairly
> easy to add and might save some internal work over creating a
> PostingIterator.

I was thinking of the toolchain of a scientist working on TREC, for instance:
documents identified by string docIds are indexed, retrieval is applied, then
the programme outputs a codified list of documents (the string docIds), which
is used for evaluation with trec_eval.
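
As an illustration of that last step, here is a minimal sketch that turns a ranked list of (string docId, score) pairs into lines in the usual trec_eval results layout: query id, the literal "Q0", docId, rank, score, run tag. The function name, the default run tag and the sample docIds are hypothetical, and the retrieval step itself is assumed to have happened elsewhere.

```python
# Minimal sketch: format ranked results for trec_eval.
# The function name, run tag and sample docIds are illustrative assumptions.

def format_trec_run(query_id, ranked, run_tag="xapian-run"):
    """Format (docno, score) pairs, already sorted best first, as run lines."""
    lines = []
    for rank, (docno, score) in enumerate(ranked, start=1):
        lines.append(f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines

# Example with made-up string docIds and scores.
run_lines = format_trec_run("301", [("DOC-A", 2.5), ("DOC-B", 1.25)], run_tag="demo")
for line in run_lines:
    print(line)
```

The string docId is the only document identifier that appears in this output, which is why it has to be recoverable from each retrieved document.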

Presently I store the string docIds in the "data" field, so there would be no
elegant way for me to retrieve a document given its string docId. But I have
not felt the need for this yet, so it's a nicety, really.

Cheers !

-- 
Emmanuel Eckard                              
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse                     
+41 21 693 66 97       

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
