Thread: Created: (NUTCH-406) Metadata tries to write null values




Created: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 13:27:04
Metadata tries to write null values
-----------------------------------

                 Key: NUTCH-406
                 URL: http://issues.apache.org/jira/browse/NUTCH-406
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Doğacan Güney


During parsing, some URLs (especially PDFs, it seems) may
create <some_key, null> pairs in ParseData's parseMeta.
When Metadata.write() tries to write such a pair, it throws
an NPE.

Stack trace will be something like this:
        at org.apache.hadoop.io.Text.encode(Text.java:373)
        at org.apache.hadoop.io.Text.encode(Text.java:354)
        at org.apache.hadoop.io.Text.writeString(Text.java:394)
        at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)


I can consistently reproduce this using the following url:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
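
The failure mode can be illustrated without Hadoop on the classpath. The sketch below is only an analogy: it uses `java.io.DataOutputStream.writeUTF` as a stand-in for `Text.writeString`, on the assumption that both dereference the string they are given. The class and method names are hypothetical and not part of Nutch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class NullValueWriteDemo {

  /** Returns true if writing the value as a string raises a NullPointerException. */
  static boolean throwsNpe(String value) {
    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    try {
      // Stand-in for Text.writeString(out, value); an assumption, not the Hadoop call.
      out.writeUTF(value);
      return false;
    } catch (NullPointerException e) {
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("non-null value throws NPE: " + throwsNpe("title"));
    System.out.println("null value throws NPE: " + throwsNpe(null));
  }
}
```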

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Updated: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 13:29:03
     [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Doğacan Güney updated NUTCH-406:
--------------------------------

    Attachment: NUTCH-406.patch

A simple patch that writes nulls as empty strings.


       
Updated: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 15:45:03
     [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Chris A. Mattmann updated NUTCH-406:
------------------------------------

    Assignee: Chris A. Mattmann


       
Work started: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 15:45:04
     [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Work on NUTCH-406 started by Chris A. Mattmann.


       
Commented: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 15:59:03
    [ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452270 ]
            
Andrzej Bialecki  commented on NUTCH-406:
-----------------------------------------

Null value is not equivalent to an empty String - perhaps we
should simply skip such values.


       
Updated: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 16:18:08
     [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Doğacan Güney updated NUTCH-406:
--------------------------------

    Attachment: NUTCH-406.patch

How about something like this then?


       
Commented: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 16:26:03
    [ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452275 ]
            
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Andrzej, Doğacan,

 +1. I think it makes a lot of sense to just not include the
null key in the Met container. Doğacan, in the future, when
you attach a new version of a patch for a JIRA issue, please
indicate the change by renaming the patch. Not a big deal,
but good style points ;)

  I'll commit this patch shortly.

Cheers,
  Chris



       
Commented: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 16:44:04
    [ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452282 ]
            
Andrzej Bialecki  commented on NUTCH-406:
-----------------------------------------

Erhm, -1 from me. This code checks only if the first value
is null, and then discards all other values (which may be
non-null), thus we could lose valuable data if only the
first value happens to be null ...

I think we should indeed check if the first value is null,
but if it is, then loop over all other values, count the
non-nulls, and if the count > 0, write out the
<key, <non-null values>> set.
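
One way the proposal above could look is sketched below. It is not the committed patch. It assumes a wire layout of key count, then for each key its name, a value count, and the values, and it uses `DataOutputStream.writeUTF` in place of `Text.writeString` so that it runs without Hadoop. The class name, the `Map`-based container, and the method names are hypothetical. For simplicity it counts non-nulls for every key rather than only when the first value is null.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

class NonNullMetadataWriter {

  /** Counts the non-null entries in values. */
  static int countNonNull(String[] values) {
    int count = 0;
    for (String v : values) {
      if (v != null) {
        count++;
      }
    }
    return count;
  }

  /** Serializes only keys that have at least one non-null value. */
  static byte[] serialize(Map<String, String[]> meta) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);

      // The key count must exclude keys whose values are all null,
      // otherwise a reader would expect more entries than were written.
      int keys = 0;
      for (String[] values : meta.values()) {
        if (countNonNull(values) > 0) {
          keys++;
        }
      }
      out.writeInt(keys);

      for (Map.Entry<String, String[]> entry : meta.entrySet()) {
        int nonNull = countNonNull(entry.getValue());
        if (nonNull == 0) {
          continue; // nothing worth keeping for this key
        }
        out.writeUTF(entry.getKey());
        out.writeInt(nonNull);
        for (String v : entry.getValue()) {
          if (v != null) {
            out.writeUTF(v);
          }
        }
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this shape, a key whose first value is null but whose later values are not still keeps its non-null values, which is the data-loss case raised above.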


       
Commented: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 16:48:04
    [ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452285 ]
            
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Doğacan,

  Looking at your latest patch, I'm not sure that it
completely does the right thing. For example, what
happens if there are 3 met values for a key k, and one of
them is null, but the other 2 are not? Specifically, what if
the first value is null, but the other 2 are not? In that
case, your patch would skip writing all of the values for
that key. Wouldn't it be easier to do something like this?

Index: src/java/org/apache/nutch/metadata/Metadata.java
===================================================================
--- src/java/org/apache/nutch/metadata/Metadata.java    (revision 478613)
+++ src/java/org/apache/nutch/metadata/Metadata.java    (working copy)
@@ -211,7 +211,9 @@
       values = getValues(names[i]);
       out.writeInt(values.length);
       for (int j = 0; j < values.length; j++) {
-        Text.writeString(out, values[j]);
+        if(values[j] != null && !values[j].equals("")){
+               Text.writeString(out, values[j]);
+        }
       }
     }
   }


       
Commented: (NUTCH-406) Metadata tries to write null values
user name
2006-11-23 16:50:03
    [ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452286 ]
            
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Andrzej,

  Yup, you caught the same thing as me. +1 for your
solution. I will extend my above patch by writing
getNumNonNullValues(values) instead of values.length.

Cheers,
  Chris
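
The reason the written count matters is that a reader presumably reads the count and then exactly that many strings, so the count has to agree with whatever filter the write loop applies. The round-trip sketch below illustrates this under assumptions: `getNumNonNullValues` is named in the comment above but its body is not shown in this thread, so the version here is a guess; `writeUTF`/`readUTF` stand in for Hadoop's `Text` string methods; and the `keep` predicate is a hypothetical helper that filters nulls only. If empty strings were skipped as well, the same predicate would need to drive the count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ValueCountRoundTrip {

  /** Single filter shared by the count and the write loop (hypothetical helper). */
  static boolean keep(String value) {
    return value != null;
  }

  /** Guess at the helper named in the thread: counts the values that will be written. */
  static int getNumNonNullValues(String[] values) {
    int n = 0;
    for (String v : values) {
      if (keep(v)) {
        n++;
      }
    }
    return n;
  }

  /** Writes the filtered count followed by the filtered values. */
  static byte[] writeValues(String[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(getNumNonNullValues(values));
      for (String v : values) {
        if (keep(v)) {
          out.writeUTF(v);
        }
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads a count and then exactly that many strings. */
  static String[] readValues(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      String[] values = new String[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = in.readUTF();
      }
      return values;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Writing `values.length` while skipping some entries would leave the reader expecting more strings than the stream contains, so it would either run past the end of the record or consume bytes belonging to the next key.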



       