Tim Chase wrote:
>> is it possible to use a program to get all the text only of
>> the html? as if I open the html with a browser, then click
>> ctrl+a and then copy paste all the selected text
You need to define what you mean by this much more precisely. Do you
want title elements to be included? Do you want initial values for
input elements (which are attributes, not text nodes) to be included?
Do you want text nodes in a pre element handled specially? Can you
constrain the input to be valid HTML (most web sites aren't)? If not,
what error recovery do you want? Etc.
>
> Sounds like you're reaching for the "-dump" parameter that Lynx
> supports, as described in the man-page:
>
> lynx -dump http://www.example.com
This will insert extra characters to achieve a rendering of the text.
If only the input characters are wanted, one might be better using a
Perl script to strip out all the tags, directives, etc., and resolve
entities.
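
A minimal sketch of the tag-stripping approach, written in Python rather
than Perl and using only the standard library's html.parser. The names
TextNodeCollector and html_to_text are made up for illustration, and the
choice to drop script and style content is an assumption, not something
the question specified. Title text is kept, and attribute values (such
as input initial values) are ignored.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect text nodes verbatim, skipping script and style content."""

    # Assumption: the content of these elements is not wanted as text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser resolve entities
        # such as &amp; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags, comments and declarations never reach this method,
        # so only text nodes are accumulated.
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the concatenated text nodes of an HTML string."""
    parser = TextNodeCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Unlike "lynx -dump", this adds no characters of its own: whitespace comes
out exactly as it appears in the source, and the parser's tolerance of
invalid markup is the only error recovery.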

You could also use the nsgmls tools, provided the input is valid, to get
the infoset representation and then strip out the tag and attribute
lines. You still need to decide how you will deal with the resulting
newlines and any newlines in the original text nodes.
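
One possible newline policy, sketched in Python under the assumption
that no pre elements need preserving, is to collapse every run of
spaces, tabs and line breaks into a single space, which is roughly what
a browser does for ordinary text. The function name collapse_whitespace
is hypothetical.

```python
import re

# The ASCII whitespace characters that HTML treats as collapsible.
# A non-breaking space is deliberately left alone.
_HTML_WS = " \t\n\r\f"
_WS_RUN = re.compile("[" + re.escape(_HTML_WS) + "]+")


def collapse_whitespace(text):
    """Replace each whitespace run with one space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip(_HTML_WS)
```

Text from pre elements would have to be kept out of this step if its
line breaks matter.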

You could probably modify lynx to dump the text nodes as it identifies
them, but Lynx is quite big and complex.

P.S. when posting to support lists, please use a subject that is a
precis of the complete question.
>
> This can then be automated via a script, or you may be able to
> use the '-crawl' parameter in conjunction with -dump to walk a
> site. I didn't see anything in my man-page to limit link
> recursion-depth as wget offers.
>
> If you don't want the link-lists, you can use the -nolist
> parameter as well.
>
> -tim
>
> _______________________________________________
> Lynx-dev mailing list
> Lynx-dev nongnu.org
> http://lists.nongnu.org/mailman/listinfo/lynx-dev
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.