Thread: Created: (NUTCH-406) Metadata tries to write null values

Created: (NUTCH-406) Metadata tries to write null values
2006-11-23 13:27:04
Metadata tries to write null values
-----------------------------------
Key: NUTCH-406
URL: http://issues.apache.org/jira/browse/NUTCH-406
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Doğacan Güney
During parsing, some urls (especially pdfs, it seems) may
create <some_key, null> pairs in ParseData's
parseMeta.
When Metadata.write() tries to write such a pair, it causes
an NPE.
Stack trace will be something like this:
at org.apache.hadoop.io.Text.encode(Text.java:373)
at org.apache.hadoop.io.Text.encode(Text.java:354)
at org.apache.hadoop.io.Text.writeString(Text.java:394)
at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
I can consistently reproduce this using the following url:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
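
A minimal sketch of the failure mode described above, under stated assumptions: `writeString` here is a hypothetical stand-in for `Text.writeString` (it is not the Hadoop implementation and uses a simplified length prefix), and `writeValues` only mirrors the count-then-values shape that the stack trace suggests for `Metadata.write()`. It shows how a single null value in a key's value array is enough to trigger the NPE.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class NullValueWriteDemo {

    // Hypothetical stand-in for Text.writeString: encodes the string and
    // writes a length prefix followed by the bytes. It dereferences the
    // value without a null check, which is where the NPE comes from.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // throws NPE if s == null
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Assumed shape of the per-key loop: value count first, then each value.
    static void writeValues(DataOutputStream out, String[] values) throws IOException {
        out.writeInt(values.length);
        for (String v : values) {
            writeString(out, v);
        }
    }

    // Returns true if serializing the given values raises a NullPointerException.
    static boolean failsWithNpe(String[] values) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            writeValues(out, values);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsWithNpe(new String[] {"application/pdf"}));
        System.out.println(failsWithNpe(new String[] {null}));
    }
}
```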
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Updated: (NUTCH-406) Metadata tries to write null values
2006-11-23 13:29:03
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
Doğacan Güney updated NUTCH-406:
--------------------------------
Attachment: NUTCH-406.patch
A simple patch that writes nulls as empty strings.

Updated: (NUTCH-406) Metadata tries to write null values
2006-11-23 15:45:03
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
Chris A. Mattmann updated NUTCH-406:
------------------------------------
Assignee: Chris A. Mattmann

Work started: (NUTCH-406) Metadata tries to write null values
2006-11-23 15:45:04
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
Work on NUTCH-406 started by Chris A. Mattmann.

Commented: (NUTCH-406) Metadata tries to write null values
2006-11-23 15:59:03
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452270 ]
Andrzej Bialecki commented on NUTCH-406:
-----------------------------------------
Null value is not equivalent to an empty String - perhaps we
should simply skip such values.

Updated: (NUTCH-406) Metadata tries to write null values
2006-11-23 16:18:08
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
Doğacan Güney updated NUTCH-406:
--------------------------------
Attachment: NUTCH-406.patch
How about something like this then?

Commented: (NUTCH-406) Metadata tries to write null values
2006-11-23 16:26:03
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452275 ]
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------
Hi Andrzej, Doğacan,
+1. I think it makes a lot of sense to just not include the
null key in the Met container. Doğacan, in the future, when
you attach a new version of a patch for a JIRA issue, please
indicate the change by renaming the patch. Not a big deal,
but good style points ;)
I'll commit this patch shortly.
Cheers,
Chris

Commented: (NUTCH-406) Metadata tries to write null values
2006-11-23 16:44:04
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452282 ]
Andrzej Bialecki commented on NUTCH-406:
-----------------------------------------
Erhm, -1 from me. This code checks only if the first value
is null, and then discards all other values (which may be
non-null), thus we could lose valuable data if only the
first value happens to be null ...
I think we should indeed check if the first value is null,
but then if it is then loop over all other values, count
non-nulls, and if the count > 0 then write out the
<key, <non-null values>> set.
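
The approach proposed in this comment could be sketched roughly as follows. This is an illustration only, not the committed Nutch code: the class `NonNullMetadataWriter`, the map-based metadata model, and the use of `DataOutputStream.writeUTF` in place of `Text.writeString` are all assumptions made to keep the example self-contained. Each key's values are filtered for nulls first, the key is written only if at least one value remains, and both the key count and the value count reflect what is actually written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NonNullMetadataWriter {

    // Keep only the non-null values of one key.
    static List<String> nonNull(String[] values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (v != null) {
                kept.add(v);
            }
        }
        return kept;
    }

    // Write <key, <non-null values>> sets; keys with no non-null value are skipped.
    static byte[] write(Map<String, String[]> meta) throws IOException {
        Map<String, List<String>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : meta.entrySet()) {
            List<String> kept = nonNull(e.getValue());
            if (!kept.isEmpty()) {
                cleaned.put(e.getKey(), kept);
            }
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(cleaned.size());            // number of keys actually written
        for (Map.Entry<String, List<String>> e : cleaned.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());   // number of non-null values
            for (String v : e.getValue()) {
                out.writeUTF(v);
            }
        }
        out.flush();
        return buf.toByteArray();
    }

    // Matching reader, used here only to check the round trip.
    static Map<String, List<String>> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, List<String>> meta = new LinkedHashMap<>();
        int keys = in.readInt();
        for (int i = 0; i < keys; i++) {
            String key = in.readUTF();
            int n = in.readInt();
            List<String> values = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                values.add(in.readUTF());
            }
            meta.put(key, values);
        }
        return meta;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String[]> meta = new LinkedHashMap<>();
        meta.put("title", new String[] {null, "Merger Agreement", "Draft"});
        meta.put("author", new String[] {null});
        System.out.println(read(write(meta)));
    }
}
```

In this sketch the `title` key survives with its two non-null values even though its first value is null, while `author` is dropped because nothing non-null remains.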

Commented: (NUTCH-406) Metadata tries to write null values
2006-11-23 16:48:04
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452285 ]
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------
Hi Doğacan,
Looking at your latest patch, I'm not sure that it
does the right thing in every case. For example, what
happens if there are 3 met values for a key k, and one of
them is null, but the other 2 are not? Specifically, what if
the first value is null, but the other 2 are not? In that
case, your patch would skip writing the key and all of its values.
Wouldn't it just be easier to do something like this?
Index: src/java/org/apache/nutch/metadata/Metadata.java
===================================================================
--- src/java/org/apache/nutch/metadata/Metadata.java (revision 478613)
+++ src/java/org/apache/nutch/metadata/Metadata.java (working copy)
@@ -211,7 +211,9 @@
       values = getValues(names[i]);
       out.writeInt(values.length);
       for (int j = 0; j < values.length; j++) {
-        Text.writeString(out, values[j]);
+        if(values[j] != null && !values[j].equals("")){
+          Text.writeString(out, values[j]);
+        }
       }
     }
   }

Commented: (NUTCH-406) Metadata tries to write null values
2006-11-23 16:50:03
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452286 ]
Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------
Hi Andrzej,
Yup, you caught the same thing as me. +1 for your
solution. I will extend my above patch by writing
getNumNonNullValues(values) instead of values.length.
Cheers,
Chris
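
To illustrate why the written count matters, here is a small sketch of a `getNumNonNullValues` helper together with a reader-side check. Only the helper's name comes from the comment above; its body, the `ValueCountDemo` class, and the `writeUTF`/`readUTF` stand-ins for the `Text` string methods are assumptions. The sketch skips only nulls (not empty strings) and shows that writing `values.length` while skipping some values leaves the reader expecting more strings than the stream contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class ValueCountDemo {

    // Hypothetical body for the helper named in the comment: count non-null values.
    static int getNumNonNullValues(String[] values) {
        int count = 0;
        for (String v : values) {
            if (v != null) {
                count++;
            }
        }
        return count;
    }

    // Writes the values of one key, skipping nulls. The count written is either
    // the non-null count (consistent) or values.length (inconsistent with the data).
    static byte[] write(String[] values, boolean countNonNullOnly) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(countNonNullOnly ? getNumNonNullValues(values) : values.length);
        for (String v : values) {
            if (v != null) {
                out.writeUTF(v);
            }
        }
        out.flush();
        return buf.toByteArray();
    }

    // Returns true if the reader consumes exactly the bytes that were written.
    static boolean readsCleanly(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        try {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                in.readUTF();
            }
            return in.available() == 0;
        } catch (EOFException e) {
            return false; // the count promised more values than were written
        }
    }

    public static void main(String[] args) throws IOException {
        String[] values = {null, "a", "b"};
        System.out.println(readsCleanly(write(values, false)));
        System.out.println(readsCleanly(write(values, true)));
    }
}
```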