Thread: help




help
Greece
2008-02-29 07:33:26
Hello,

I have some HTML files.

Is it possible to use a program to extract only the text of the
HTML, as if I opened the file in a browser, pressed Ctrl+A, and
then copied and pasted all the selected text?

I need to do this in batch.

Can Lynx do it?

Thanks



     



_______________________________________________
Lynx-dev mailing list
Lynx-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/lynx-dev

Re: help
United States
2008-02-29 11:02:24
> is it possible to use a program to get all the text only of
> the html? as if I open the html with a browser, then click
> ctrl+a and then copy paste all the selected text
>
> I need to do this in batch
>
> can lynx do it?

Sounds like you're reaching for the "-dump" parameter that Lynx
supports, as described in the man-page:

    lynx -dump http://www.example.com

This can then be automated via a script, or you may be able to
use the '-crawl' parameter in conjunction with -dump to walk a
site.  I didn't see anything in my man-page to limit link
recursion-depth as wget offers.

If you don't want the link-lists, you can use the -nolist
parameter as well.
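
As a rough sketch of the "automated via a script" part, assuming the
HTML files sit in one local directory and that lynx is on the PATH,
something like the following Python could write one text file per
HTML file.  The names lynx_command and dump_directory are made up for
illustration; only -dump and -nolist come from the man-page.

```python
import subprocess
from pathlib import Path


def lynx_command(html_path, with_links=False):
    """Build the lynx argument list for dumping one file as plain text."""
    cmd = ["lynx", "-dump"]
    if not with_links:
        # -nolist suppresses the numbered link list at the end of the dump.
        cmd.append("-nolist")
    cmd.append(str(html_path))
    return cmd


def dump_directory(src_dir, out_dir):
    """Run lynx on every *.html file in src_dir; write *.txt into out_dir.

    Assumes a flat directory of local files (no recursion into subdirs).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(Path(src_dir).glob("*.html")):
        # An absolute path avoids a file name being mistaken for an option.
        result = subprocess.run(
            lynx_command(html_path.resolve()),
            capture_output=True,
            check=True,  # raise if lynx exits with an error
        )
        target = out_dir / (html_path.stem + ".txt")
        # Keep the bytes exactly as lynx produced them.
        target.write_bytes(result.stdout)
        written.append(target)
    return written
```

Calling dump_directory("pages", "text") would then turn pages/foo.html
into text/foo.txt, where "pages" and "text" are placeholder directory
names.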

-tim







Extracting text from an HTML file (was: help)
United Kingdom
2008-03-01 06:03:12
Tim Chase wrote:
>> is it possible to use a program to get all the text only of
>> the html? as if I open the html with a browser, then click
>> ctrl+a and then copy paste all the selected text

You need to define what you mean by this much more precisely.  Do you
want title elements to be included?  Do you want initial values for
input elements (which are attributes, not text nodes) to be included?
Do you want text nodes in a pre element handled specially?  Can you
constrain the input to be valid HTML (most web sites aren't)?  If not,
what error recovery do you want?  Etc.

> 
> Sounds like you're reaching for the "-dump" parameter that Lynx
> supports, as described in the man-page:
> 
>    lynx -dump http://www.example.com

This will insert extra characters to achieve a rendering of the text.
If only the input characters are wanted, one might be better using a
Perl script to strip out all the tags, directives, etc., and resolve
entities.
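
To make the "strip the tags and resolve entities" idea concrete, here
is a minimal sketch.  It is written in Python with the standard
html.parser module rather than Perl, and the names TextNodeCollector
and extract_text are invented for illustration.  It makes several
arbitrary choices on the questions above: title text is kept,
attribute values are dropped, script and style content is skipped,
and whitespace is left exactly as in the input.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect text nodes, ignoring tags, comments and declarations."""

    # Elements whose content is code, not readable text (an assumption).
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; and &#233;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for each run of character data between tags.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the concatenated text nodes of an HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.parts)
```

Note that html.parser is lenient about invalid markup, so the error
recovery is whatever that module happens to do, and no line breaks
are added between block elements.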

You could also use the nsgmls tools, provided the input is valid, to get
the infoset representation and then strip out the tag and attribute
lines.  You still need to decide how you will deal with the resulting
newlines and any newlines in the original text nodes.

You could probably modify Lynx to dump the text nodes as it identifies
them, but Lynx is quite big and complex.

P.S. when posting to support lists, please use a subject that is a
precis of the complete question.
> 
> This can then be automated via a script, or you may be able to
> use the '-crawl' parameter in conjunction with -dump to walk a
> site.  I didn't see anything in my man-page to limit link
> recursion-depth as wget offers.
>
> If you don't want the link-lists, you can use the -nolist
> parameter as well.
> 
> -tim


-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.

