Thread: Re: OOM error during parsing with nekohtml




Re: OOM error during parsing with nekohtml
2007-07-16 10:43:29
You could try looking at these two discussions:
http://www.mail-archive.com/nutch-devlucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-devlucene.apache.org/msg06571.html

--Kai

----- Original Message ----
From: Tsengtan A Shuy <ttashuysbcglobal.net>
To: nutch-devlucene.apache.org; nutch-userlucene.apache.org
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

I successfully ran the whole-web crawl on my new Ubuntu OS, and I am
ready to fix the bug.  I need someone to guide me to the most up-to-date
source code and the bug assignment.

Thank you in advance!! 

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-----Original Message-----
From: Shailendra Mudgal [mailto:mudgal.shailendragmail.com] 
Sent: Monday, July 16, 2007 3:05 AM
To: nutch-userlucene.apache.org; nutch-devlucene.apache.org
Subject: OOM error during parsing with nekohtml

Hi All,

We are getting an OOM exception while processing
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied
the Nutch-497 patch to our source code, but the error actually occurs in
the parse method.
Does anybody have any idea about this?  Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.toUpperCase(String.java:2637)
    at java.lang.String.toUpperCase(String.java:2660)
    at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


Regards,
Shailendra








       
RE: OOM error during parsing with nekohtml
2007-07-16 11:37:04
Thank you for the info.
The OOM exception in your previous email indicates that your system is
running out of heap memory.  Either too many objects have been
instantiated, or there is a memory leak in the source code.

Hope this helps!
Cheers!!

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
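
To put numbers on the heap diagnosis, a minimal sketch (not from the
thread) that prints the heap limits of the JVM it runs in, using only the
standard java.lang.Runtime API. The class name HeapReport is made up for
illustration. Since the stack trace shows the parse running inside a
Hadoop child task, the relevant limit would be that child JVM's -Xmx
setting, which is an assumption to check against your own Hadoop
configuration.

```java
public class HeapReport {
    /** Formats a byte count as whole mebibytes. */
    static String toMiB(long bytes) {
        return (bytes / (1024 * 1024)) + " MiB";
    }

    /** Returns a one-line summary of the current JVM heap figures. */
    static String summary() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();           // upper bound the heap may grow to
        long total = rt.totalMemory();       // heap currently reserved by the JVM
        long used = total - rt.freeMemory(); // part of the reserved heap in use
        return "max=" + toMiB(max) + " total=" + toMiB(total) + " used=" + toMiB(used);
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Logging this once at task start would show whether the failing parse is
simply running with a very small heap.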



Re: OOM error during parsing with nekohtml
2007-07-16 23:53:06
Hi all,

Thanks for your suggestions.

I am running parse on a single URL
(http://www.fotofinity.com/cgi-bin/homepages.cgi). For other URLs, parse
works perfectly. We are getting this error because of the HTML of the
page: it contains many anchor tags which are not closed properly, and
the neko html parser then throws this exception. The page can be parsed
successfully using tagsoup. We think this is a bug in the neko html
parser.
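
For anyone who wants to report this upstream, a small test input may be
handier than the live page. The following is only a sketch under the
assumption that unclosed anchor tags are the trigger: a hypothetical
generator (UnclosedAnchors is not part of Nutch or NekoHTML) that emits
a page with a chosen number of anchors and no closing </a>. Whether such
a page actually reproduces the OOM has not been verified here.

```java
public class UnclosedAnchors {
    /** Builds an HTML page containing `count` anchor tags, none of them closed. */
    static String build(int count) {
        StringBuilder sb = new StringBuilder("<html><body>\n");
        for (int i = 0; i < count; i++) {
            // Each anchor is opened but deliberately never closed.
            sb.append("<a href=\"page").append(i).append(".html\">link ")
              .append(i).append('\n');
        }
        return sb.append("</body></html>\n").toString();
    }

    public static void main(String[] args) {
        // Optional first argument: number of anchors (default 1000).
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        System.out.print(build(count));
    }
}
```

Redirecting the output to a file and feeding that file to both parsers
would give a self-contained comparison between neko and tagsoup.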


Regards,
Shailendra







Re: OOM error during parsing with nekohtml
2007-07-17 01:35:12
Hi,

On 7/17/07, Shailendra Mudgal <mudgal.shailendragmail.com> wrote:
> Hi all,
>
> Thanks for your suggestions.
>
> I am running parse on a single url
> (http://www.fotofinity.com/cgi-bin/homepages.cgi). For other urls, parse
> works perfectly. we are getting this error because of the html of the page.
> The page contains many anchor tags which are not closed properly. Hence neko
> html parser throws this exception. The page can be parsed successfully using
> tagsoup. We think this as a bug in neko html parser.

Since tagsoup works and neko doesn't, I agree with you that this is a
bug in neko.

If you want to skip over this page (the parser will not extract text
from it, but parsing will run successfully overall), you may try
changing the catch clause at ParseSegment.java:77 from Exception to
Throwable. This should catch the OOM and continue.
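
The idea can be sketched outside Nutch as follows. This is not the
actual ParseSegment code; the class SkipOnThrowable, the method
parseOrSkip and the example URLs are made up for illustration, and the
OutOfMemoryError is simulated. The point is only that OutOfMemoryError
is an Error, not an Exception, so a catch (Exception e) clause never
sees it, while catch (Throwable t) does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class SkipOnThrowable {
    /** Runs one parse task; returns its result, or null if it failed in any way. */
    static String parseOrSkip(String url, Callable<String> parser, List<String> skipped) {
        try {
            return parser.call();
        } catch (Throwable t) { // catch (Exception e) would let OutOfMemoryError escape
            skipped.add(url + ": " + t);
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        String ok = parseOrSkip("http://example.com/good", () -> "parsed text", skipped);
        String bad = parseOrSkip("http://example.com/bad", () -> {
            throw new OutOfMemoryError("simulated"); // stands in for the failing parse
        }, skipped);
        System.out.println(ok + " / " + bad + " / skipped=" + skipped.size());
        // prints "parsed text / null / skipped=1"
    }
}
```

Catching Throwable only lets the run continue past the page; it does not
address why the parser exhausted the heap in the first place.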



-- 
Doğacan Güney
no nutch script file under bin directory
2007-07-17 14:22:48
I followed msg06571.html to check out the trunk, but then found that
there is no nutch script file under the bin directory.
How do you crawl multiple websites without this nutch script file?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com


Re: OOM error during parsing with nekohtml
2007-07-19 08:26:39
Hi,

After replacing it with Throwable, the parse got past that page safely,
but we got the same OOM error while parsing
http://lcweb2.loc.gov/ndlpcoop/nicmoas/livn-2/livn0181.sgm. This time
the error seems to have occurred at line 78. Here is the stack trace.
(We can't parse this page using tagsoup either.)

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:78)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


Regards,
Shailendra
