Hi, Xin, in my understanding , the document in Lucene is a
term of
collection of fields, while a field is pair of keyword and
value, tough it
can be indexed or stored or both. That is plain structure.
if you wanna
index a deep tree structure such as complex objects and keep
those
relationship inside, i guess we need do some tricky of that.
so in my
mentioned solution, i will do something on the keyword of a
document(here, a
document represent a object...) . the score problem you
mentioned in your
question is similar, i mean, score is actually an attribute
of mesh object,
so u wanna index the information which has a tree-like
structure (i met the
similar problem when i indexing xml-based pages. esp. for
those have lots of
deep element nodes, deep index needed for deep searching).
correct me if i was wrong or there are some better
solutions...
On 8/25/06, Zhao, Xin <xzhao9 jhmi.edu> wrote:
>
> now. i have a second thought about one meah term per
document. the scoring
> formula(hits too) is based on document, right? does it
mean that we
> shouldn't have more than one document for each object
indexed?
> for example, i try to index a publication, for some of
the information,
> like title, abstract i would like to store and index
them using default
> similarity, while the other information i would like to
use customized
> similarity. i probably should use a different indexing
directory and
> writer
> instead of two documents in the same index, right?
> thank you for helping me. you could see that i am in
the early learning
> stage now.
> xin
>
>
>
> ----- Original Message -----
> From: "Zhao, Xin" <xzhao9 jhmi.edu>
> To: <java-user lucene.apache.org>
> Sent: Friday, August 25, 2006 10:21 AM
> Subject: Re: controlled vocabulary
>
>
> > Hi,
> > Thank you for your reply. I had thought about the
first two solutions
> > before. If we apply one doc for each MeSH term, it
would be 26 docs for
> > each item digested(we actually need the top 25
MeSH terms generated),
> > would it be any problem if there are too many
documents? If we apply
> field
> > name like "mesh_1",
"mesh_2"..., when it comes to search, we will
have
> to
> > generate a loop for each single one of the query
terms( there will be
> more
> > than 20-30 terms on average, since we are using
sematic web to implement
> > concept search), do you think it would affect the
performance in a very
> > bad way?
> > Regards,
> > Xin
> >
> >
> > ----- Original Message -----
> > From: "Dedian Guo" <gdedian gmail.com>
> > To: <java-user lucene.apache.org>;
"Zhao, Xin" <xzhao jhu.edu>
> > Sent: Thursday, August 24, 2006 4:22 PM
> > Subject: Re: controlled library
> >
> >
> >> in my solution, you can apply one doc for each
mesh term, or apply
> >> different
> >> keyword such as
"mesh_1"...."mesh_10" for your top
10 terms...or u can
> >> group
> >> your mesh terms as one string then add into a
field, which requires a
> >> simple
> >> string parser for the group string when you
wanna read the terms...
> >>
> >> not sure if that works or answers your
question...
> >>
> >> On 8/24/06, Zhao, Xin <xzhao9 jhmi.edu> wrote:
> >>>
> >>> Hi,
> >>> I have a design question. Here is what we
try to do for indexing:
> >>> We designed an indexing tool to generate
standard MeSH terms from
> >>> medical
> >>> citations, and then use Lucene to save the
terms and citations for
> >>> future
> >>> search. The information we need to save
are:
> >>> a) the exact mesh terms (top 10)
> >>> b) the score for each term
> >>> so the codings are like
> >>> -----------------------------------
> >>> for the top 10 MeSH Terms
> >>> myField=Field.Keyword("mesh",
mesh.toLowerCase());
> >>> myField.setBoost(score);
> >>> doc.add(myFiled);
> >>> end for
> >>> ------------------------------------
> >>> as you could see we generate all the terms
under named field "mesh".
> If
> >>> I
> >>> understand correctly, all the fields under
the same name would
> >>> eventually save into one field, with all
the scores be normalized
> into
> >>> filed boost. In this case, we wouldn't be
able to save separate score,
> >>> so
> >>> the information is lost. Am I correct? Is
there anyway we could change
> >>> it? I
> >>> understand Lucene is for keyword search,
and what we try to do is
> >>> Controlled
> >>> Vocabulary search, Any other tool we could
use?
> >>>
> >>> Thank you,
> >>> Xin
> >>>
> >>>
> >>>
> >>
> >
> >
> >
------------------------------------------------------------
---------
> > To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
> > For additional commands, e-mail:
java-user-help lucene.apache.org
> >
>
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
> For additional commands, e-mail: java-user-help lucene.apache.org
>
>
|