> Do you have some pointers to the models you have in mind so I can get
> an idea what sort of data we might be talking about? (I can see this
> could be useful for storing a "reputation" score for each document,
> derived from link analysis, user clicks, etc.)

I am now working with PLSA (Probabilistic Latent Semantic Analysis), which
assumes that documents (d) and terms (w) are associated with categories (z),
and represents the data as a mixture model parameterised by P(z), P(d|z)
and P(w|z). I have implemented this model using objects built around Xapian.
Some work has also been done on naive Bayes models and Latent Dirichlet
Allocation.

All these models would call for doubles, or vectors of doubles, to be
associated with Documents, TermIterators and Databases.

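To give an idea of the kind of data involved, here is a minimal sketch of the
PLSA mixture, P(d, w) = sum over z of P(z) P(d|z) P(w|z), in plain Python.
The probability tables are invented toy values and the names (p_z,
p_d_given_z, p_w_given_z, joint_probability, document_vector) are purely
illustrative; none of this is Xapian API, nor the implementation mentioned
above.

```python
# Toy PLSA parameters: two latent categories, two documents, three terms.
# All numbers are invented for illustration only.
p_z = [0.6, 0.4]                      # P(z)
p_d_given_z = [                       # P(d|z), one row per category z
    {"doc1": 0.7, "doc2": 0.3},
    {"doc1": 0.2, "doc2": 0.8},
]
p_w_given_z = [                       # P(w|z), one row per category z
    {"index": 0.5, "query": 0.4, "latent": 0.1},
    {"index": 0.1, "query": 0.2, "latent": 0.7},
]


def joint_probability(doc, term):
    """Return P(d, w) = sum_z P(z) * P(d|z) * P(w|z)."""
    return sum(
        pz * pd[doc] * pw[term]
        for pz, pd, pw in zip(p_z, p_d_given_z, p_w_given_z)
    )


def document_vector(doc):
    """Return the per-category weights P(z) * P(d|z) for one document."""
    return [pz * pd[doc] for pz, pd in zip(p_z, p_d_given_z)]
```

The list returned by document_vector() is the sort of vector of doubles that
would have to be attached to each Document; a similar vector built from
P(w|z) would go with each term.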
> > For a less important and fundamental suggestion, I'd like to mention that
> > in research, it is often important to have unique and determined
> > identifiers (strings) for documents. I have seen this done by using
> > prefixed terms (which is not very clean) or by using the "data" field of
> > documents (which lacks an iterator: one cannot jump to one particular
> > document easily this way). It might be interesting to do something on
> > this level (maybe simply by wrapping the "prefixed term" way into
> > something cleaner).
>
> What do you have in mind? You can already add/replace or delete a
> document by term. An overloaded version of get_document() which could
> retrieve the first document matching a particular term would be fairly
> easy to add and might save some internal work over creating a
> PostingIterator.

I was thinking of the toolchain of a scientist working on TREC, for instance:
documents identified by string docIds are indexed, retrieval is applied, then
the programme outputs a codified list of documents (the string docIds), which
is used for evaluation with trec_eval.

Presently I store the string docIds in the "data" field, so there would be no
elegant way for me to retrieve a document given its string docId. But I have
not felt the need for this yet, so it's a nicety, really.

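As a rough sketch of that last step, the following Python turns a ranked list
into lines in the usual trec_eval run layout (query id, the literal Q0,
docId, rank, score, run tag). The data_field dictionary merely stands in for
the "data" field of an index; its contents, the run tag and the function
names are all made up for illustration and are not part of Xapian.

```python
# Stand-in for an index: numeric document id -> string docId kept in the
# "data" field. These ids are invented for illustration.
data_field = {1: "DOC-A", 2: "DOC-B", 3: "DOC-C"}


def format_trec_run(query_id, hits, run_tag="sketch"):
    """Turn (numeric id, score) hits, best first, into trec_eval run lines."""
    lines = []
    for rank, (num_id, score) in enumerate(hits, start=1):
        doc_id = data_field[num_id]
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


def find_by_doc_id(doc_id):
    """Reverse lookup by scanning every document: the inelegant part."""
    for num_id, stored in data_field.items():
        if stored == doc_id:
            return num_id
    return None
```

Going from a hit to its string docId is a direct lookup, whereas
find_by_doc_id() has to walk through every document, which is the gap
described above.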
Cheers !
--
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97
()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss