Queries spanning paragraphs
Ireland
2007-10-22 06:31:29
Hi all,

I need the ability to match documents that have two terms that occur
within n paragraphs of each other. I had a look through the archives,
and although many people have explained ways to implement per-sentence
or per-paragraph indexing & searching, no one seems to have tackled
this one yet.

The only idea I can come up with is this:

I will index the entire document, as normal, but also index the
paragraphs separately, numbering them according to the order they
occur in. (Storage space isn't an issue.) When searching, I will first
find all documents that have both terms, using the full-content field.

Then I can get all the paragraphs that are part of that doc and contain
either of the search terms. I would still have to implement a bit of
logic to check which paragraphs have which term, and check the distance
between them (from the order info I kept when indexing).
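
A minimal sketch of that distance check, assuming the paragraph numbers
for each term have already been fetched from the index as sorted lists.
The class and method names here are hypothetical, and nothing in it
touches Lucene:

```java
import java.util.List;

class ParagraphDistance {

    /**
     * Returns true if some paragraph containing term A and some paragraph
     * containing term B are at most maxGap paragraphs apart.
     * Both lists hold paragraph numbers in ascending order.
     */
    static boolean withinParagraphs(List<Integer> parasA, List<Integer> parasB, int maxGap) {
        int i = 0;
        int j = 0;
        while (i < parasA.size() && j < parasB.size()) {
            int a = parasA.get(i);
            int b = parasB.get(j);
            if (Math.abs(a - b) <= maxGap) {
                return true;
            }
            // Advance whichever list is behind; that is the only move
            // that can shrink the gap.
            if (a < b) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Term A in paragraphs 1 and 9, term B in paragraph 7: 9 and 7 are 2 apart.
        System.out.println(withinParagraphs(List.of(1, 9), List.of(7), 2)); // true
        // Term B only in paragraph 5: the closest pair is 4 apart.
        System.out.println(withinParagraphs(List.of(1, 9), List.of(5), 2)); // false
    }
}
```

The check itself is linear in the number of matching paragraphs, so the
cost of this approach would mostly come from the extra per-document
paragraph lookups rather than from the comparison.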

I'm sure this would work, but it would be very slow. I can't help
feeling there's a better solution, that might involve inserting
paragraph tags into the content in a special field in my index, and
somehow using SpanQueries to find matches that have a given number of
paragraph marks in between... but I don't know if that's possible.

Does anyone have any ideas?

Thanks!
John B.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Queries spanning paragraphs
United States
2007-10-22 06:42:31
I implemented this for my qsol query parser: myhardshadow.com/qsol

Uses a modified SpanNotQuery that takes another parameter saying how
many times the span can cross the specified marker. Index a special
paragraph marker with your text to delimit paragraphs and then the rest
is easy.

- Mark
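
To illustrate the indexing side of this, here is a rough plain-Java
sketch of putting a marker token between paragraphs. It is not taken
from qsol: the marker string, the blank-line paragraph rule and the
class name are all assumptions, and in a real index this step would
live in the analyzer so that the marker gets a position in the same
field as the text.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphMarkers {

    // Hypothetical marker token; pick something that cannot occur in real text.
    static final String MARKER = "paragraphmarker";

    /** Splits text into whitespace tokens, adding MARKER between paragraphs. */
    static List<String> tokensWithMarkers(String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: paragraphs are separated by one or more blank lines.
        String[] paragraphs = text.trim().split("\\n\\s*\\n");
        for (int p = 0; p < paragraphs.length; p++) {
            if (p > 0) {
                tokens.add(MARKER); // one marker per paragraph boundary
            }
            for (String word : paragraphs[p].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [first, paragraph, here, paragraphmarker, second, one]
        System.out.println(tokensWithMarkers("first paragraph here\n\nsecond one"));
    }
}
```

With the marker in the token stream, "within n paragraphs" becomes
"a span that crosses at most n markers".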

// Imports assume the Lucene 2.x package layout.
import java.io.IOException;
import java.util.Collection;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.util.ToStringUtils;

public class SpanWithinQuery extends SpanQuery {

    private SpanQuery include;
    private SpanQuery exclude;
    private int proximity;

    /**
     * Construct a SpanWithinQuery matching spans from <code>include</code>
     * which overlap with spans from <code>exclude</code> up to
     * <code>proximity</code> times.
     */
    public SpanWithinQuery(SpanQuery include, SpanQuery exclude, int proximity) {
        this.include = include;
        this.exclude = exclude;
        this.proximity = proximity;

        if (!include.getField().equals(exclude.getField())) {
            throw new IllegalArgumentException("Clauses must have same field.");
        }
    }

    /** Return the SpanQuery whose matches are filtered. */
    public SpanQuery getInclude() {
        return include;
    }

    /** Return the SpanQuery whose matches must not overlap those returned. */
    public SpanQuery getExclude() {
        return exclude;
    }

    public String getField() {
        return include.getField();
    }

    /**
     * Returns a collection of all terms matched by this query.
     * @deprecated use extractTerms instead
     * @see #extractTerms(Set)
     */
    public Collection getTerms() {
        return include.getTerms();
    }

    public void extractTerms(Set terms) {
        include.extractTerms(terms);
    }

    public String toString(String field) {
        StringBuffer buffer = new StringBuffer();
        buffer.append("spanWithin(");
        buffer.append(include.toString(field));
        buffer.append(", ");
        buffer.append(proximity + " ,");
        buffer.append(exclude.toString(field));
        buffer.append(")");
        buffer.append(ToStringUtils.boost(getBoost()));

        return buffer.toString();
    }

    public Spans getSpans(final IndexReader reader) throws IOException {
        return new Spans() {
            private Spans includeSpans = include.getSpans(reader);
            private boolean moreInclude = true;
            private Spans excludeSpans = exclude.getSpans(reader);
            private boolean moreExclude = true;

            public boolean next() throws IOException {
                if (moreInclude) { // move to next include
                    moreInclude = includeSpans.next();
                }

                while (moreInclude && moreExclude) {
                    if (includeSpans.doc() > excludeSpans.doc()) { // skip exclude
                        moreExclude = excludeSpans.skipTo(includeSpans.doc());
                    }

                    // number of exclude spans counted against this include span
                    int count = 0;

                    while (moreExclude // while exclude is before
                            && (includeSpans.doc() == excludeSpans.doc())) {
                        if (!(excludeSpans.end() <= includeSpans.start())) {
                            count += 1;

                            if (count > proximity) {
                                break;
                            }
                        }

                        moreExclude = excludeSpans.next(); // increment exclude
                    }

                    if (!moreExclude // if no intersection
                            || (includeSpans.doc() != excludeSpans.doc())
                            || (includeSpans.end() <= excludeSpans.start())) {
                        break; // we found a match
                    }

                    moreInclude = includeSpans.next(); // intersected: keep scanning
                }

                return moreInclude;
            }

            public boolean skipTo(int target) throws IOException {
                if (moreInclude) { // skip include
                    moreInclude = includeSpans.skipTo(target);
                }

                if (!moreInclude) {
                    return false;
                }

                if (moreExclude // skip exclude
                        && (includeSpans.doc() > excludeSpans.doc())) {
                    moreExclude = excludeSpans.skipTo(includeSpans.doc());
                }

                int count = 0;

                while (moreExclude // while exclude is before
                        && (includeSpans.doc() == excludeSpans.doc())) {
                    if (!(excludeSpans.end() <= includeSpans.start())) {
                        count += 1;

                        if (count > proximity) {
                            break;
                        }
                    }

                    moreExclude = excludeSpans.next(); // increment exclude
                }

                if (!moreExclude // if no intersection
                        || (includeSpans.doc() != excludeSpans.doc())
                        || (includeSpans.end() <= excludeSpans.start())) {
                    return true; // we found a match
                }

                return next(); // scan to next match
            }

            public int doc() {
                return includeSpans.doc();
            }

            public int start() {
                return includeSpans.start();
            }

            public int end() {
                return includeSpans.end();
            }

            public String toString() {
                return "spans(" + SpanWithinQuery.this.toString() + ")";
            }
        };
    }

    public Query rewrite(IndexReader reader) throws IOException {
        SpanWithinQuery clone = null;

        SpanQuery rewrittenInclude = (SpanQuery) include.rewrite(reader);

        if (rewrittenInclude != include) {
            clone = (SpanWithinQuery) this.clone();
            clone.include = rewrittenInclude;
        }

        SpanQuery rewrittenExclude = (SpanQuery) exclude.rewrite(reader);

        if (rewrittenExclude != exclude) {
            if (clone == null) {
                clone = (SpanWithinQuery) this.clone();
            }

            clone.exclude = rewrittenExclude;
        }

        if (clone != null) {
            return clone; // some clauses rewrote
        } else {
            return this; // no clauses rewrote
        }
    }

    /** Returns true iff <code>o</code> is equal to this. */
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }

        if (!(o instanceof SpanWithinQuery)) {
            return false;
        }

        SpanWithinQuery other = (SpanWithinQuery) o;

        return this.include.equals(other.include)
                && this.exclude.equals(other.exclude)
                && (this.getBoost() == other.getBoost())
                && (proximity == other.proximity);
    }

    public int hashCode() {
        int h = include.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= exclude.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left
        h ^= Float.floatToRawIntBits(getBoost());
        h ^= proximity;

        return h;
    }
}
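
The rule described in the constructor's javadoc (accept an include span
as long as it overlaps marker spans at most proximity times) can be
restated with plain integers. This is only a simplified sketch of that
rule, not the class above: it ignores documents and the streaming Spans
interface, and the class and method names are hypothetical. Positions
follow the Lucene span convention of an inclusive start and an
exclusive end.

```java
import java.util.List;

class MarkerCrossings {

    /** Counts markers at positions p with start <= p < end. */
    static int crossings(int start, int end, List<Integer> markerPositions) {
        int count = 0;
        for (int p : markerPositions) {
            if (p >= start && p < end) {
                count++;
            }
        }
        return count;
    }

    /** Accepts a candidate span that crosses at most proximity markers. */
    static boolean accept(int start, int end, List<Integer> markerPositions, int proximity) {
        return crossings(start, end, markerPositions) <= proximity;
    }

    public static void main(String[] args) {
        // Token positions: first(0) paragraph(1) here(2) MARKER(3) second(4) one(5)
        List<Integer> markers = List.of(3);
        // Candidate span covering "paragraph" .. "one" is [1, 6): it crosses one marker.
        System.out.println(accept(1, 6, markers, 1)); // true
        System.out.println(accept(1, 6, markers, 0)); // false
    }
}
```

In query terms, the include clause would presumably be something like a
SpanNearQuery over the two search terms and the exclude clause a
SpanTermQuery on the marker token, with proximity set to the number of
paragraph boundaries allowed between the terms.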




Re: Queries spanning paragraphs
Ireland
2007-10-22 08:05:21
Thanks for that, that's exactly what I needed.

Actually, I hadn't heard of qsol, but it seems to solve a few other
problems I have as well - correct highlighting, configurable operators,
sentence recognition. Is it distributed under the Apache license? And
is it currently stable enough to use out-of-the-box?

Cheers,
John


Mark Miller wrote:
> I implemented this for my qsol query parser: myhardshadow.com/qsol
>
> Uses a modified SpanNotQuery that takes another parameter saying how
> many times the span can cross the specified marker. Index a special
> paragraph marker with your text to delimit paragraphs and then the
> rest is easy.
>
> - Mark




Re: Queries spanning paragraphs
United States
2007-10-22 08:32:46
It is stable...give it a whirl. I use it at about 5 or 6 different
heavily used installs at the moment and know of about a dozen others
that use it (many others have downloaded it, but who knows what for).
If you notice anything off with it, I will fix it immediately as I use
it heavily in production environments (newspaper business). Just send
me the query that does not work as you would expect.

It is under the Apache License. If you have any comments and/or
requests, I am currently working on a second version and am happy to
receive any feedback.

Qsol was started a little over a year ago and the first beta release
was in January. 1.0 was released in June. An early version was even
translated to C# by a guy that needed it. Because of some heavy changes
I plan, I am now moving on to 2.0.

Do keep in mind though...the Highlighter extension I wrote does not
require Qsol: https://issues.apache.org/jira/browse/LUCENE-794, though
it does complement the span support nicely.

Also, the default Lucene QueryParser does have limited configurable
operators, I believe.

- Mark

John Byrne wrote:
> Thanks for that, that's exactly what I needed.
>
> Actually, I hadn't heard of qsol, but it seems to solve a few other
> problems I have as well - correct highlighting, configurable
> operators, sentence recognition. Is it distributed under the Apache
> license? and is it currently stable enough to use out-of-the-box?
>
> Cheers,
> John
