Created: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-01 01:09:50
Problems with maxMergeDocs parameter
------------------------------------

                 Key: LUCENE-1012
                 URL: https://issues.apache.org/jira/browse/LUCENE-1012
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
            Reporter: Michael Busch
            Priority: Minor
             Fix For: 2.3


I found two possible problems regarding IndexWriter's
maxMergeDocs value. I'm using the following code to test
maxMergeDocs:

{code:java}
  public void testMaxMergeDocs() throws IOException {
    final int maxMergeDocs = 50;
    final int numSegments = 40;

    MockRAMDirectory dir = new MockRAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
    writer.setMergePolicy(new LogDocMergePolicy());
    writer.setMaxMergeDocs(maxMergeDocs);

    Document doc = new Document();
    doc.add(new Field("field", "aaa", Field.Store.YES, Field.Index.TOKENIZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    for (int i = 0; i < numSegments * maxMergeDocs; i++) {
      writer.addDocument(doc);
      //writer.flush();      // uncomment to avoid the DocumentsWriter bug
    }
    writer.close();

    new SegmentInfos.FindSegmentsFile(dir) {

      protected Object doBody(String segmentFileName)
          throws CorruptIndexException, IOException {

        SegmentInfos infos = new SegmentInfos();
        infos.read(directory, segmentFileName);
        for (int i = 0; i < infos.size(); i++) {
          assertTrue(infos.info(i).docCount <= maxMergeDocs);
        }
        return null;
      }
    }.run();
  }
{code}

- It seems that DocumentsWriter does not obey the
maxMergeDocs parameter. If I don't flush manually, then the
index only contains one segment at the end and the test
fails.

- If I flush manually after each addDocument() call, then
the index contains more segments. But still, there are
segments that contain more docs than maxMergeDocs, e.g. 55
vs. 50. The javadoc in IndexWriter says:
{code:java}
  /**
   * Returns the largest number of documents allowed in a
   * single segment.
   *
   * @see #setMaxMergeDocs
   */
  public int getMaxMergeDocs() {
    return getLogDocMergePolicy().getMaxMergeDocs();
  }
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Commented: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-01 03:52:50
    [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531440 ]

Michael McCandless commented on LUCENE-1012:
--------------------------------------------

> - It seems that DocumentsWriter does not obey the maxMergeDocs
>   parameter. If I don't flush manually, then the index only contains
>   one segment at the end and the test fails.

This bug actually predates DocumentsWriter: the flushing logic has
never respected maxMergeDocs.  I think normally maxMergeDocs is far
larger than maxBufferedDocs.

To fix this we could change the flushing logic to include "# buffered
docs > maxMergeDocs" as one of its flush criteria, if the current
merge policy is a LogMergePolicy.
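
A minimal sketch of what such an extra flush criterion might look like. The class, method, and parameter names are hypothetical and are not taken from IndexWriter or DocumentsWriter. The sketch also compares with >= rather than >, an assumption made here so that a flushed segment never ends up above the limit.

```java
// Hypothetical sketch: none of these names are Lucene APIs.
final class FlushCriteriaSketch {

    /**
     * Decides whether the in-memory buffer should be flushed to a new segment.
     *
     * @param bufferedDocs       number of documents currently buffered
     * @param maxBufferedDocs    the usual buffer limit
     * @param maxMergeDocs       the per-segment document limit of the merge policy
     * @param usesLogMergePolicy whether the current merge policy is a LogMergePolicy
     */
    static boolean shouldFlush(int bufferedDocs, int maxBufferedDocs,
                               int maxMergeDocs, boolean usesLogMergePolicy) {
        // Existing criterion: the buffer holds as many docs as it is allowed to.
        if (bufferedDocs >= maxBufferedDocs) {
            return true;
        }
        // Proposed extra criterion, applied only for a LogMergePolicy.
        // Assumption of this sketch: >= keeps flushed segments at or below the limit.
        return usesLogMergePolicy && bufferedDocs >= maxMergeDocs;
    }

    public static void main(String[] args) {
        // Values from the test in the issue: maxMergeDocs = 50, large buffer limit.
        System.out.println(shouldFlush(50, 1000, 50, true));
    }
}
```

With the values from the test case (maxMergeDocs = 50 and a much larger buffer limit), this would force a flush at 50 buffered documents instead of producing a single large segment.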

> - If I flush manually after each addDocument() call, then the index
>   contains more segments. But still, there are segments that contain
>   more docs than maxMergeDocs, e.g. 55 vs. 50.

This behavior also predates the recent changes (MergePolicy, etc.), e.g.
the test fails on 2.1 if you flush every 6 docs (whenever "0 == i%6").

Really the current approach is better described as "any segment with
doc count greater than maxMergeDocs will not be merged".
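
To illustrate why that rule still allows oversized segments, here is a toy model (not Lucene code; the class, the method names, and the segment sizes are made up for illustration). Each input segment is checked against the limit on its own, so two eligible segments can merge into one that exceeds it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the "will not be merged" rule; not Lucene code.
final class MergeEligibilitySketch {

    /** Keeps only the segments whose own doc count does not exceed maxMergeDocs. */
    static List<Integer> eligibleForMerge(List<Integer> segmentDocCounts, int maxMergeDocs) {
        List<Integer> eligible = new ArrayList<Integer>();
        for (int docCount : segmentDocCounts) {
            if (docCount <= maxMergeDocs) {
                eligible.add(docCount);
            }
        }
        return eligible;
    }

    /** Doc count of the segment produced by merging the inputs (ignoring deletes). */
    static int mergedDocCount(List<Integer> segmentDocCounts) {
        int total = 0;
        for (int docCount : segmentDocCounts) {
            total += docCount;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> segments = new ArrayList<Integer>();
        segments.add(30);  // hypothetical segment sizes
        segments.add(25);
        segments.add(70);  // already above the limit, so it is left alone
        List<Integer> eligible = eligibleForMerge(segments, 50);
        System.out.println(eligible + " -> " + mergedDocCount(eligible));
    }
}
```

In this model the 30-doc and 25-doc segments are both eligible under a limit of 50, yet merging them gives 55 documents, the same kind of overshoot reported in the issue.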

We could just fix the javadocs to match the current approach?

Or, we could change the code to actually work the way the current
javadoc says, i.e. "no segment with > maxMergeDocs will ever be
created".

Though, changing the code is somewhat tricky: in order to know whether
a segment will have > maxMergeDocs after the merge is done, you must
know the delete count against each of the segments, which is somewhat
costly to compute now (you have to read the current _X_N.del file for
that segment).
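
A rough sketch of that check, assuming the per-segment delete counts were already available. The class and method names are hypothetical. In the situation described above, obtaining delCounts is the expensive part, because it means reading each segment's deletions file.

```java
// Hypothetical sketch: names are illustrative, not Lucene APIs.
final class MergedDocCountSketch {

    /**
     * Doc count of the merged segment: deleted documents are dropped by the
     * merge, so only live documents (docCount - delCount) carry over.
     */
    static int liveDocsAfterMerge(int[] docCounts, int[] delCounts) {
        if (docCounts.length != delCounts.length) {
            throw new IllegalArgumentException("one delete count per segment is required");
        }
        int live = 0;
        for (int i = 0; i < docCounts.length; i++) {
            live += docCounts[i] - delCounts[i];
        }
        return live;
    }

    /** True if merging these segments would create a segment above maxMergeDocs. */
    static boolean wouldExceed(int[] docCounts, int[] delCounts, int maxMergeDocs) {
        return liveDocsAfterMerge(docCounts, delCounts) > maxMergeDocs;
    }

    public static void main(String[] args) {
        int[] docCounts = {30, 25};  // hypothetical segments
        int[] delCounts = {4, 3};
        System.out.println(wouldExceed(docCounts, delCounts, 50));
    }
}
```

Without the delete counts, the same two segments look like a 55-doc merge and would be rejected, even though only 48 live documents would survive it.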

Maybe we should store the deleteCount in the SegmentInfo (and save it
to segments_N); we've discussed this in the past, e.g., you would also
want to do this when making a merge policy that takes deletes into
account (favors merging segments that have many deletes).
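
As a sketch of how a delete-aware policy might use a stored delete count, the following picks the segment with the highest fraction of deleted documents. This is purely illustrative: the class and method are hypothetical, and ranking by delete ratio is only one possible way to "favor" such segments.

```java
// Hypothetical sketch: names are illustrative, not Lucene APIs.
final class DeleteAwareSelectionSketch {

    /**
     * Returns the index of the segment with the highest deleted/total ratio,
     * or -1 if there are no non-empty segments.
     */
    static int indexOfMostDeleted(int[] docCounts, int[] delCounts) {
        if (docCounts.length != delCounts.length) {
            throw new IllegalArgumentException("one delete count per segment is required");
        }
        int best = -1;
        double bestRatio = -1.0;
        for (int i = 0; i < docCounts.length; i++) {
            if (docCounts[i] == 0) {
                continue;  // skip empty segments to avoid dividing by zero
            }
            double ratio = delCounts[i] / (double) docCounts[i];
            if (ratio > bestRatio) {
                bestRatio = ratio;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] docCounts = {100, 40, 10};  // hypothetical segments
        int[] delCounts = {10, 20, 1};
        System.out.println(indexOfMostDeleted(docCounts, delCounts));
    }
}
```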

Note also that making the similar change for "maxMergeMB" is not
really feasible: you can't really compute how many MB a merged segment
will be from the input segments without just doing the merge and then
checking the resulting size.  Maybe we could make a coarse
approximation by summing input sizes of the segments (usually this is
an upper bound on final segment size), maybe doing proportional
reduction of this size based on delete count.  Still it would be
approximate and you could wind up with a segment larger than
maxMergeMB.
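
The coarse approximation could be sketched as below: sum the input sizes and scale each one by its fraction of live documents. The names are hypothetical, and, as noted above, the result is only an estimate and gives no guarantee about the real merged size.

```java
// Hypothetical sketch: names are illustrative, not Lucene APIs.
final class MergedSizeEstimateSketch {

    /**
     * Estimates the merged segment size in MB by scaling each input size by
     * the fraction of its documents that are still live.
     */
    static double estimateMergedMB(double[] sizesMB, int[] docCounts, int[] delCounts) {
        if (sizesMB.length != docCounts.length || sizesMB.length != delCounts.length) {
            throw new IllegalArgumentException("all arrays must have one entry per segment");
        }
        double total = 0.0;
        for (int i = 0; i < sizesMB.length; i++) {
            // An empty segment contributes nothing to the estimate.
            double liveFraction = docCounts[i] == 0
                ? 0.0
                : (docCounts[i] - delCounts[i]) / (double) docCounts[i];
            total += sizesMB[i] * liveFraction;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] sizesMB = {10.0, 20.0};  // hypothetical segments
        int[] docCounts = {100, 200};
        int[] delCounts = {50, 0};
        System.out.println(estimateMergedMB(sizesMB, docCounts, delCounts));
    }
}
```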



Commented: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-01 08:26:53
    [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531510 ]

Yonik Seeley commented on LUCENE-1012:
--------------------------------------

> We could just fix the javadocs to match the current approach?

That sounds like the right approach.


Commented: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-16 15:12:50
    [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535331 ]

Michael McCandless commented on LUCENE-1012:
--------------------------------------------

OK I will commit a fix to the javadocs.


Updated: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-16 15:14:50
     [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1012:
---------------------------------------

    Assignee: Michael McCandless


Resolved: (LUCENE-1012) Problems with maxMergeDocs parameter
2007-10-17 11:54:50
     [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1012.
----------------------------------------

    Resolution: Fixed

Corrected the javadocs.
