
Thread: Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get




Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get
2007-05-29 19:05:15
contentType parse not correctly,,,,got empty content using readseg -get
-----------------------------------------------------------------------

                 Key: NUTCH-493
                 URL: https://issues.apache.org/jira/browse/NUTCH-493
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: java version "1.5.0_04"
                      Linux localhost 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
            Reporter: wangxu


I am using Nutch 0.9.
I found that the contents of many of my crawled pages are empty.
I then checked the log and found the corresponding warning: the
ContentType is reported as "url=http://......", and no suitable
parser can be found for the page:


parser not found for contentType= url=http://product.dangdang.com/product.aspx?product_id=490321


The contents of most pages of this kind are empty,
but I did not find any warning or error other than
"timeout" in the fetcher log.

Can somebody explain why?
Many thanks!



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Closed: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get
2007-06-18 04:01:32
     [ https://issues.apache.org/jira/browse/NUTCH-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-493.
-------------------------------

    Resolution: Invalid
      Assignee: Doğacan Güney

This is not a bug. When the fetcher was unable to fetch pages,
it created empty content. Such empty content is not
parseable, hence what you are seeing in your log.

After NUTCH-443, the fetcher will not create empty content for
such pages, so you should not see them in your log anymore.

Also, please use the nutch-user mailing list to ask questions.
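
To see which pages are affected, one option is to pull the URLs out of the "parser not found" warnings and compare them with the fetch timeouts. The following is a minimal sketch, not part of Nutch. It assumes the log line has the shape quoted in the report (a possibly empty contentType value followed by a url= field); the function name and the WARN/INFO prefixes in the sample are hypothetical.

```python
import re

# Assumed line shape, taken from the excerpt quoted in the report:
#   parser not found for contentType=<type> url=<url>
# <type> may be empty, which is what happens for empty content.
PATTERN = re.compile(
    r"parser not found for contentType=(?P<ctype>\S*)\s+url=(?P<url>\S+)"
)


def unparsed_urls(log_lines):
    """Return (content_type, url) pairs for each 'parser not found' line."""
    hits = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group("ctype"), match.group("url")))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; only the middle part mirrors the report.
    sample = [
        "WARN parser not found for contentType= "
        "url=http://product.dangdang.com/product.aspx?product_id=490321",
        "INFO fetching http://example.com/",
    ]
    for ctype, url in unparsed_urls(sample):
        print(repr(ctype), url)
```

An empty first element in a pair indicates that no content type was recorded for that URL, which is consistent with the empty-content explanation above.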

> contentType parse not correctly,,,,got empty content using readseg -get
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-493
>                 URL: https://issues.apache.org/jira/browse/NUTCH-493
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: java version "1.5.0_04"
> Linux localhost 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
>            Reporter: wangxu
>            Assignee: Doğacan Güney
>
> I am using Nutch 0.9.
> I found that the contents of many of my crawled pages are empty.
> I then checked the log and found the corresponding warning: the
> ContentType is reported as "url=http://......", and no suitable
> parser can be found for the page:
> parser not found for contentType= url=http://product.dangdang.com/product.aspx?product_id=490321
> The contents of most pages of this kind are empty,
> but I did not find any warning or error other than
> "timeout" in the fetcher log.
> Can somebody explain why?
> Many thanks!


