|
List Info
Thread: Created: (SOLR-380) The problem: If we index a monograph in Solr, there's no way to convert
|
|
| Created: (SOLR-380) The problem: If we
index a monograph in Solr, there's no
way to convert |
  United States |
2007-10-15 22:51:50 |
The problem: If we index a monograph in Solr, there's no way
to convert search results into page-level hits. The
solution: have a "paged-text" fieldtype which
keeps track of page divisions as it indexes, and reports
page-level hits in the search results.
------------------------------------------------------------
Key: SOLR-380
URL: https://issues.apache.org/jira/browse/SOLR-380
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Tricia Williams
Priority: Minor
"Paged-Text" FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a
monograph in Solr, there's no way to convert search results into
page-level hits. The solution: have a "paged-text" fieldtype which
keeps track of page divisions as it indexes, and reports page-level
hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr
processed the tokens (using its standard tokenizers and filters), it
would concurrently build a structural map of the item, indicating
which term position marked the beginning of which page:
<page id="234" firstterm="14324"/>. This map would be stored in an
unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that
are returned in the current request, and use the stored map to
determine page ids for each term position. The results would imitate
the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
  ...
</lst>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
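The page map proposed in the issue description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Solr code: whitespace splitting stands in for Solr's tokenizers and filters, term positions are counted from 0, and the names build_page_map, page_for_position and page_hits are hypothetical.

```python
import re
from bisect import bisect_right

# Matches a page milestone such as <page id="234"/> and captures the id.
PAGE_TAG = re.compile(r'<page\s+id="([^"]+)"\s*/>')


def build_page_map(text):
    """Index-time step: return (tokens, page_map).

    page_map is a list of (first_term_position, page_id) pairs in
    document order, i.e. the "structural map" from the description.
    """
    # With one capture group, split() yields: text, id, text, id, text, ...
    parts = PAGE_TAG.split(text)
    tokens = parts[0].split()  # any text before the first milestone
    page_map = []
    for page_id, chunk in zip(parts[1::2], parts[2::2]):
        page_map.append((len(tokens), page_id))
        tokens.extend(chunk.split())
    return tokens, page_map


def page_for_position(page_map, position):
    """Search-time step: find the page whose range contains a term position."""
    starts = [first for first, _ in page_map]
    i = bisect_right(starts, position) - 1
    return page_map[i][1] if i >= 0 else None


def page_hits(page_map, hit_positions):
    """Group hit positions by page id, like the pages/hitpos sample above."""
    hits = {}
    for pos in sorted(hit_positions):
        hits.setdefault(page_for_position(page_map, pos), []).append(pos)
    return hits
```

Because the map is sorted by first term position, each lookup is a binary search, so resolving a hit costs O(log pages) regardless of how long the monograph is.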
|
|
| Updated: (SOLR-380) There's no way to
convert search results into page-level
hits of a "str |
  United States |
2007-10-15 22:55:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tricia Williams updated SOLR-380:
---------------------------------
Summary: There's no way to convert search results
into page-level hits of a "structured document".
(was: The problem: If we index a monograph in Solr, there's
no way to convert search results into page-level hits. The
solution: have a "paged-text" fieldtype which
keeps track of page divisions as it indexes, and reports
page-level hits in the search results.)
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-16 11:25:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535288 ]
Peter Binkley commented on SOLR-380:
------------------------------------
I've been wondering about what's required to get this output
added to the response. It appears that a response writer
isn't the answer: those are for different formats (xml,
json, etc.). Is everything we need included in the FieldType
methods (write(), etc.)? The highlighting functionality is
probably a good model for what we want to do.
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-16 11:33:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535290 ]
Ryan McKinley commented on SOLR-380:
------------------------------------
I don't totally understand how a field type solves your
problem (I'm sure it can... I just don't quite follow).
But if you want your search results to return pages, why
not just index each page as a new SolrDocument?
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-16 11:44:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535296 ]
Peter Binkley commented on SOLR-380:
------------------------------------
The problem with the page-as-SolrDocument approach is that
you then have to group the pages back under their container
documents to present a unified result to the user (like
this: http://tinyurl.com/yt2a25 ). I want the primary unit of
granularity in search results to be the book, and the pages
to be only a secondary layer. I also want to be able to do
proximity searches that bridge page boundaries, have
relevance ranking consider the whole book text and not just
that page, etc.; i.e., treat the text as continuous for
searching purposes. So I gain a lot by treating the book as
the SolrDocument; I just need that extra bit of work to
resolve the page positions to have it all.
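To make the regrouping cost of the page-as-SolrDocument approach concrete, here is a rough Python sketch of the client-side step it would need. The hit dictionaries and their keys (book_id, page, score) are hypothetical, and ranking a book by its best-scoring page is just one possible choice, not something the thread prescribes.

```python
from collections import defaultdict


def group_pages_by_book(page_hits):
    """Fold page-level hits back under their container book.

    page_hits: iterable of dicts with the assumed keys
    "book_id", "page" and "score".
    """
    books = defaultdict(list)
    for hit in page_hits:
        books[hit["book_id"]].append(hit)

    grouped = []
    for book_id, hits in books.items():
        hits.sort(key=lambda h: h["page"])
        grouped.append({
            "book_id": book_id,
            # Assumption: a book ranks as high as its best page.
            "best_score": max(h["score"] for h in hits),
            "pages": [h["page"] for h in hits],
        })
    grouped.sort(key=lambda b: b["best_score"], reverse=True)
    return grouped
```

Note that every score here was computed from a single page's text, which is exactly the limitation raised above: relevance ranking never sees the whole book, and a phrase split across two pages matches neither page document.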
|
|
| Updated: (SOLR-380) There's no way to
convert search results into page-level
hits of a "str |
  United States |
2007-10-16 12:43:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Binkley updated SOLR-380:
-------------------------------
formatted the xml for clarity
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-17 01:47:51 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535426 ]
Pieter Berkel commented on SOLR-380:
------------------------------------
There was a recent discussion surrounding a similar problem
on solr-user:
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390
The idea was to use dynamic fields (e.g. page_1, page_2,
page_3... page_N) to store the text of each page in a single
document. The problem is that currently Solr does not
support "glob" style field expansion in query
parameters (e.g. qf=page_* ) so you would end up having to
specify the entire list of page fields in your query, which
is impractical. There is already an open issue related to
this particular problem (SOLR-247) but nobody has had time
to look into it.
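A small Python sketch shows why listing every page field is impractical without glob expansion. It only builds a query string; the page_ prefix and the helper names are illustrative, and it assumes a handler that takes a whitespace-separated qf list, as dismax does.

```python
from urllib.parse import urlencode


def page_fields(prefix, page_count):
    """Expand a dynamic-field pattern by hand: page_1 ... page_N."""
    return [f"{prefix}{n}" for n in range(1, page_count + 1)]


def query_string(q, prefix, page_count):
    """Build the query parameters with every page field spelled out in qf."""
    params = {"q": q, "qf": " ".join(page_fields(prefix, page_count))}
    return urlencode(params)
```

The qf value grows linearly with the number of pages, and the client has to know each document's page count before it can even form the query.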
In terms of returning term position information, this seems
somehow (albeit loosely) related to highlighting, is there
any way you could use the existing functionality to achieve
your goal? (definitely would be a hack though)
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-17 04:34:51 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535489 ]
Erik Hatcher commented on SOLR-380:
-----------------------------------
> The idea was to use dynamic fields (e.g. page_1, page_2,
> page_3... page_N) to store the text of each page in a single
> document. The problem is that currently Solr does not support
> "glob" style field expansion in query parameters (e.g.
> qf=page_* ) so you would end up having to specify the entire
> list of page fields in your query, which is impractical. There
> is already an open issue related to this particular problem
> (SOLR-247) but nobody has had time to look into it.
In this case, a copyField from page_* into an unstored
"contents" field would do the trick, which would also
facilitate querying across pages. Optionally, a position
increment gap could prohibit phrase queries across "pages".
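The effect of a position increment gap can be illustrated with a toy Python model. This is not Lucene or Solr code: it only mimics how term positions are assigned when several page values are copied into one field, with whitespace splitting standing in for real analysis and hypothetical function names.

```python
def positions_with_gap(pages, gap):
    """Assign a term position to every token of every page value.

    `gap` extra positions are skipped between consecutive pages,
    mimicking a position increment gap on a multi-valued field.
    """
    indexed = []
    pos = 0
    for i, page in enumerate(pages):
        if i > 0:
            pos += gap
        for token in page.split():
            indexed.append((pos, token))
            pos += 1
    return indexed


def phrase_matches(indexed, phrase):
    """Exact phrase match: the words must sit at consecutive positions."""
    words = phrase.split()
    by_pos = dict(indexed)
    return any(
        all(by_pos.get(start + k) == word for k, word in enumerate(words))
        for start in by_pos
    )
```

With pages ["the quick brown", "fox jumps"], a gap of 0 lets the phrase "brown fox" match across the page break, while a large gap keeps the two words far apart so only phrases within a single page match.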
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-17 12:51:50 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535648 ]
Peter Binkley commented on SOLR-380:
------------------------------------
Both these methods (page_* fields or unstored
"contents" field) would make it difficult to
discover from the search results which pages matched the
query, though, wouldn't they? They would both need extra
work to populate a structure like the "pages" and
"hitpos" elements in the sample xml above. Would
that extra work be more efficient than the document-map
approach we've proposed above?
The highlighting functionality is definitely the model to
follow for handling term positions.
|
|
| Commented: (SOLR-380) There's no way to
convert search results into page-level
hits of a "s |
  United States |
2007-10-17 16:16:51 |
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ]
Tricia Williams commented on SOLR-380:
--------------------------------------
The discussion from
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390
gives one solution (which is more of a workaround in my
opinion), but I don't think it is practical. The number of
pages in the monographs we index varies greatly (10s to 1000s
of pages). So while specifying each page_*
(page_1,page_2,page_3,...,page_N) as a field to highlight
will work, I don't think it is the cleanest solution because
you have to infer page numbers from the highlighted samples.
Furthermore, in order to get the highlighted samples you need
to know the values of the * in a dynamic field, which sort of
defeats the purpose of the dynamic field. If you wanted to
use the position numbers themselves (for example using
positions and OCR information to create highlighting on an
original image), they are not available in the results.
In answer to your question, Peter, one must enable
highlighting and list all the page_* fields for highlighter
snippets. In the following example I have a dynamic field
fulltext_*, copyField fulltext, and
defaultSearchField=fulltext:
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:
<lst name="highlighting">
 <lst name="NEWS.EFP.186500">
  <arr name="fulltext_1">
   <str>
    was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
    ^-f 6r-ke.w-¥eaf!fl': Mr.-Bradv whb is <em>employed</em> in Windsor, was also at his borne for Jsew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
    <em>employed</em> at the Walkerville Brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
    . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th>e Detroit River between
   </str>
  </arr>
 </lst>
</lst>
You will notice that only the pages with hits on them appear
in the highlight section. From this point it would take a
little work to parse the /arr[@name] to get the * from
fulltext_* for each document match.
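The "little work" of pulling page numbers out of the highlighting section might look roughly like this in Python, using only the standard library. It assumes a well-formed XML response shaped like the sample above and a numeric suffix on each fulltext_* field name; the function name is made up for the example.

```python
import re
import xml.etree.ElementTree as ET


def pages_with_hits(response_xml, prefix="fulltext_"):
    """Map each document id to the sorted page numbers that have snippets."""
    root = ET.fromstring(response_xml)
    suffix = re.compile(re.escape(prefix) + r"(\d+)$")
    result = {}
    for section in root.iter("lst"):
        if section.get("name") != "highlighting":
            continue
        # One child <lst> per matching document, one <arr> per page field.
        for doc in section.findall("lst"):
            pages = []
            for arr in doc.findall("arr"):
                match = suffix.match(arr.get("name", ""))
                if match:
                    pages.append(int(match.group(1)))
            result[doc.get("name")] = sorted(pages)
    return result
```

This recovers which pages matched, but not the term positions within them, so it would not help with the image-highlighting use case mentioned earlier.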
I agree that the highlighter is a good model of what we want
to do. But the difficulty I'm finding is the upfront part
where we need to store the position-to-page mapping in a
field while at the same time we need to analyze the full
page text into another field for searching.
I don't think defining a FieldType will allow us to do this.
The FieldType looks like it is useful in controlling what
the output of your defined field is (write()), and how it is
sorted, but not how fields with your FieldType will be
indexed or queried.
Would someone more familiar with the innards of Solr
recommend I pursue the SOLR-247 problem, or continue hunting
for a solution in the manner that I've been pursuing in this
issue?
|
|
|
|