[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-493?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:ALL-TABPANEL ]
DOĞACAN GÜNEY CLOSED NUTCH-493.
-------------------------------
RESOLUTION: INVALID
ASSIGNEE: DOĞACAN GÜNEY
THIS IS NOT A BUG. WHEN THE FETCHER WAS UNABLE TO FETCH A PAGE,
IT CREATED EMPTY CONTENT FOR IT. SUCH EMPTY CONTENT IS NOT
PARSEABLE, HENCE THE WARNINGS YOU ARE SEEING IN YOUR LOG.
AFTER NUTCH-443, THE FETCHER WILL NOT CREATE EMPTY CONTENT FOR
SUCH PAGES, SO YOU SHOULD NOT SEE THEM IN YOUR LOG ANYMORE.
ALSO, PLEASE USE THE NUTCH-USER MAILING LIST TO ASK QUESTIONS.
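
AS A ROUGH ILLUSTRATION OF THE BEHAVIOR DESCRIBED ABOVE, THE JAVA SKETCH
BELOW FILTERS OUT PAGES WHOSE FETCH PRODUCED NO CONTENT BEFORE THEY REACH A
PARSER. IT IS NOT NUTCH CODE AND NOT THE NUTCH-443 CHANGE: THE CLASS
EmptyContentSketch, THE FetchedPage TYPE AND BOTH METHODS ARE HYPOTHETICAL
NAMES INVENTED FOR THIS EXAMPLE, AND THE EXAMPLE.COM URLS ARE PLACEHOLDERS.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical types, not part of Nutch. */
public class EmptyContentSketch {

    /** Hypothetical stand-in for the result of fetching one URL. */
    public static final class FetchedPage {
        public final String url;
        public final String contentType;
        public final byte[] content;

        public FetchedPage(String url, String contentType, byte[] content) {
            this.url = url;
            this.contentType = contentType;
            this.content = content;
        }
    }

    /** A page is worth parsing only if it has bytes and a content type. */
    public static boolean isParseable(FetchedPage page) {
        boolean hasBytes = page.content != null && page.content.length > 0;
        boolean hasType = page.contentType != null && !page.contentType.isEmpty();
        return hasBytes && hasType;
    }

    /** Drops pages with empty content (e.g. after a timeout) instead of parsing them. */
    public static List<FetchedPage> keepParseable(List<FetchedPage> pages) {
        List<FetchedPage> kept = new ArrayList<>();
        for (FetchedPage page : pages) {
            if (isParseable(page)) {
                kept.add(page);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FetchedPage> pages = new ArrayList<>();
        // Simulated timeout: no content type and no bytes.
        pages.add(new FetchedPage("http://example.com/timeout", "", new byte[0]));
        // Simulated successful fetch.
        pages.add(new FetchedPage("http://example.com/ok", "text/html",
                "<html></html>".getBytes()));
        for (FetchedPage page : keepParseable(pages)) {
            System.out.println("would parse: " + page.url);
        }
    }
}
```

IN THIS SKETCH A TIMED-OUT FETCH NEVER REACHES THE PARSER, SO NO
PARSER-NOT-FOUND WARNING WOULD BE LOGGED FOR IT.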
> CONTENTTYPE PARSE NOT CORRECTLY,,,,GOT EMPTY CONTENT USING READSEG -GET
> -----------------------------------------------------------------------
>
> KEY: NUTCH-493
> URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-493
> PROJECT: NUTCH
> ISSUE TYPE: BUG
> COMPONENTS: FETCHER
> AFFECTS VERSIONS: 0.9.0
> ENVIRONMENT: JAVA VERSION "1.5.0_04"
> LINUX LOCALHOST 2.6.8-2-386 #1 TUE AUG 16 12:46:35 UTC 2005 I686 GNU/LINUX
> REPORTER: WANGXU
> ASSIGNEE: DOĞACAN GÜNEY
>
> I AM USING NUTCH 0.9.
> I FOUND THAT THE CONTENTS OF MANY OF MY CRAWLED PAGES ARE EMPTY.
> I THEN CHECKED THE LOG AND FOUND THE CORRESPONDING WARNING: THE
> CONTENTTYPE IS REPORTED AS "URL=HTTP://......", AND NO SUITABLE
> PARSER CAN BE FOUND FOR THE PAGE:
> PARSER NOT FOUND FOR CONTENTTYPE= URL=HTTP://PRODUCT.DANGDANG.COM/PRODUCT.ASPX?PRODUCT_ID=490321
> THE CONTENTS OF MOST PAGES OF THIS KIND ARE EMPTY,
> BUT I DID NOT FIND ANY WARNING OR ERROR OTHER THAN "TIMEOUT" IN THE FETCHER LOG.
> CAN SOMEBODY EXPLAIN WHY?
> MANY THANKS!
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE ONLINE.