Thread: Created: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-a




Created: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-a
2007-07-31 16:07:52
Create enwiki indexable data as line-per-article rather than file-per-article
-----------------------------------------------------------------------------

                 Key: LUCENE-971
                 URL: https://issues.apache.org/jira/browse/LUCENE-971
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Steven Parkes


Create a line per article rather than a file. Consume with
indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-a
2007-07-31 16:09:53
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt


Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-01 10:21:52
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516996 ]

Michael McCandless commented on LUCENE-971:
-------------------------------------------

This looks great!

One alternative approach here would be to create a WikipediaDocMaker
(implementing the DocMaker interface) that pulls directly from the XML
file and feeds documents into the alg.

Then, to make a line file, one could create an alg that pulls docs
from WikipediaDocMaker and uses the WriteLineDoc task to create the
line-by-line file.

One benefit of this approach is that creating docs of a certain size
(10 tokens, 100 tokens, etc.) would become a one-step process (a single
alg) instead of what I think is a two-step process now (make the first
line file, then reprocess it into a second line file).  Another benefit
would be that you could make Wikipedia tasks that pull directly from the
XML file and not even use a line file as an intermediary.

Steve, do you think this would be a hard change?  I think it should be
easy, except I'm not sure how to do this with SAX, since SAX is "in
control".  You sort of need coroutines.  Or maybe one thread runs SAX
and puts doc data into a shared queue, and then the other thread (the
normal "main" thread that benchmark runs) pulls from this queue?
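
The two-thread idea in the last paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name SaxQueueSketch, the PageHandler helper, the queue capacity, and the <title>/<text> element names are all hypothetical choices made for the example. A parser thread runs SAX and puts each finished document on a bounded queue, while the calling thread pulls documents off it one at a time.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: SAX stays "in control" on its own thread and hands
// documents to the consumer through a bounded blocking queue.
class SaxQueueSketch {
    // Sentinel object that tells the consumer there are no more documents.
    private static final String[] END = new String[0];

    // Collects character data and emits {title, body} when a <text> element ends.
    static class PageHandler extends DefaultHandler {
        private final BlockingQueue<String[]> queue;
        private final StringBuilder chars = new StringBuilder();
        private String title;

        PageHandler(BlockingQueue<String[]> queue) { this.queue = queue; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            chars.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            chars.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (qName.equals("title")) {
                title = chars.toString();
            } else if (qName.equals("text")) {
                try {
                    queue.put(new String[] {title, chars.toString()}); // blocks when the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new SAXException(e);
                }
            }
        }
    }

    // Parses on a background thread; the calling thread pulls docs until END arrives.
    static List<String[]> readAll(byte[] xml) throws Exception {
        BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(16);
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new ByteArrayInputStream(xml), new PageHandler(queue));
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    queue.put(END); // always signal the end, even after a parse error
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        parser.setDaemon(true);
        parser.start();

        List<String[]> docs = new ArrayList<>();
        for (String[] doc = queue.take(); doc != END; doc = queue.take()) {
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title><revision><text>alpha</text>"
                + "</revision></page></mediawiki>";
        for (String[] doc : readAll(xml.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(doc[0] + ": " + doc[1]);
        }
    }
}
```

The bounded queue keeps the parser from running far ahead of the consumer, which matters for a dump the size of enwiki. A doc maker built this way would replace the loop in readAll with one queue.take() per requested document.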


Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-01 10:29:52
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516997 ]

Steven Parkes commented on LUCENE-971:
--------------------------------------

I can look at what it would take to avoid the line file ...
but ... what about the overhead of the XML parser? I don't
tend to think of XML parsers as "light". Would bundling that
into the test be a concern?

I guess it's not an issue if you're just using this to
create an index and then are going to do your performance
measurements on the queries of the index. But for measuring
index performance, I would probably be cautious of bundling
in the XML processing (until proven insignificant).


Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-01 11:33:52
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517007 ]

Michael McCandless commented on LUCENE-971:
-------------------------------------------


> I can look at what it would take to avoid the line file ... but
> ... what about the overhead of the XML parser? I don't tend to think
> of XML parsers as "light". Would bundling that into the test be a
> concern?

Right, I too would not consider XML parsing overhead "light".  So tests
that are sensitive to the XML parsing cost should first create a line
file.

But this is the case regardless of which approach we use (i.e., both
approaches allow you to use a line file -- the WriteLineDocTask writes a
line file from any DocMaker).  It's just that the new approach would
buy us more flexibility for those people who don't need (or want) to
use the line file as an intermediary.


Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-01 14:00:52
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517047 ]

Doron Cohen commented on LUCENE-971:
------------------------------------

> But, this is the case regardless of which approach we use (ie, both
> approaches allow you to use a line file -- the WriteLineDocTask writes a
> line file from any DocMaker).  It's just that the new approach would
> buy us more flexibility for those people who don't need (or want) to
> use the line file as an intermediary.

So there would now be two alternative ways to index wiki data:
(1) using the proposed WikiDocMaker directly to feed the AddDoc task.
(2) using a line file, after first running WriteLineDocTask with
WikiDocMaker as the doc maker.

I like this approach.

This means that WikiDocMaker would read the data straight from
temp/enwiki-20070527-pages-articles.xml. So the extract-enwiki
target in build.xml would no longer be needed, right?



Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-01 14:05:52
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517048 ]

Doron Cohen commented on LUCENE-971:
------------------------------------

Mmm... an additional advantage of this is not needing to extract
the entire enwiki collection in order to index it - setting the
repetition count to 100 for AddDocTask in alternative 1 or for
WriteLineDocTask in alternative 2 would mean that only 100
docs from the huge file are extracted.
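
The point about extracting only as many docs as the task asks for can be illustrated with a small pull-style source. This is a hypothetical sketch, not benchmark code: CountingSource and consume are invented names, and a line-per-record reader stands in for the doc maker reading the large XML file. Each call to next() reads exactly one more record, so a consumer that stops after 100 calls never touches the rest of the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical sketch of a pull-based source: input is read only on demand.
class LazyPullSketch {
    // Reads one record per call and counts how many records were actually read.
    static class CountingSource {
        private final BufferedReader in;
        int recordsRead = 0;

        CountingSource(BufferedReader in) { this.in = in; }

        String next() throws IOException {
            String line = in.readLine();
            if (line != null) {
                recordsRead++;
            }
            return line;
        }
    }

    // Stand-in for a task repeated `count` times; returns how many docs it consumed.
    static int consume(CountingSource source, int count) throws IOException {
        int consumed = 0;
        while (consumed < count && source.next() != null) {
            consumed++;
        }
        return consumed;
    }

    public static void main(String[] args) throws IOException {
        // Build a 10,000-record input to play the role of the huge file.
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            big.append("doc ").append(i).append('\n');
        }
        CountingSource source =
                new CountingSource(new BufferedReader(new StringReader(big.toString())));
        int consumed = consume(source, 100);
        // prints "100 consumed, 100 records read"
        System.out.println(consumed + " consumed, " + source.recordsRead + " records read");
    }
}
```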


Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per
2007-08-06 15:27:59
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518016 ]

Steven Parkes commented on LUCENE-971:
--------------------------------------

Sounds good. New patch soon.


Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-a
2007-08-06 19:53:59
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

         Assignee: Steven Parkes
    Lucene Fields: [Patch Available]  (was: [Patch Available, New])


Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-a
2007-08-06 19:53:59
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

Okay. Here's an update to the patch.

Changes:

1) EnwikiDocMaker replaces ExtractWikipedia

2) A sample algorithm is provided (and used by the build.xml
file, which could be removed if desired)

3) A bug in LineDocMaker is fixed: it was storing both the
title and the date in the title field (small enough that it
doesn't need its own JIRA(?))

4) LineDocMaker was made derivable-from

Much of the code in LineDocMaker is useful in EnwikiDocMaker,
so I made it so (it's inheritance for implementation, not
abstraction, so it could be changed, of course)

5) Made LineDocMaker and WriteLineDocTask multibyte-character
safe

Or at least I tried to. Wikipedia has non-ASCII characters
in it. To make LineDocMaker work as a base class, I made it
use an explicit FileInputStream, which is required so that
SAX can extract the encoding correctly. I made
WriteLineDocTask always write UTF-8 so that I can get
non-ASCII in the output file. Seems like UTF-8 is the best
encoding for line files? At the same time, I made
LineDocMaker assume UTF-8 (unless told otherwise by a
derived class like EnwikiDocMaker) so that the line files
created by EnwikiDocMaker/WriteLineDocTask can be read by
LineDocMaker w/o loss.
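
The UTF-8 round trip described in change 5 might look roughly like the sketch below. It is an illustration only, not the patch itself: the class Utf8LineFileSketch, the tab separator, the title/date/body field order, and the clean helper are assumptions made for the example. The point it shows is that both the writer and the reader name UTF-8 explicitly instead of relying on the platform default encoding, so non-ASCII text survives the trip through the line file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: one document per line, written and read back as UTF-8.
class Utf8LineFileSketch {
    static final char SEP = '\t'; // assumed field separator

    // Field text must not contain the separator or a line break.
    static String clean(String s) {
        return s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    // Writes one document as a single line: title, date, body.
    static void writeLine(Writer out, String title, String date, String body) throws IOException {
        out.write(clean(title));
        out.write(SEP);
        out.write(clean(date));
        out.write(SEP);
        out.write(clean(body));
        out.write('\n');
    }

    // Writes one doc to a temp file and reads it back, both with an explicit UTF-8 charset.
    static String[] roundTrip(String title, String date, String body) throws IOException {
        Path file = Files.createTempFile("linedocs", ".txt");
        try {
            try (Writer out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(file.toFile()), StandardCharsets.UTF_8))) {
                writeLine(out, title, date, body);
            }
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
                return in.readLine().split("\t", 3); // title, date, body
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u00fc" is u-umlaut, written as an escape so the source file stays ASCII.
        String[] fields = roundTrip("Z\u00fcrich", "2007-08-06", "Largest city in Switzerland");
        System.out.println(fields[0].equals("Z\u00fcrich")); // prints "true"
    }
}
```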
