|
|
| stats on well formed XHTML |

|
2008-01-16 02:41:49 |
Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?
I'm thinking about deploying one in Spinn3r but I'd rather focus on
other tasks if this has already been done.
I'm curious about the assumptions one could make when assuming that
XHTML is well formed.
Specifically, the probability that a naive non-XML parser can make
while indexing the content.
Kevin
--
Founder/CEO Tailrank.com
Location: San Francisco, CA
AIM/YIM: sfburtonator
Skype: burtonator
Work: http://spinn3r.com
and http://tailrank.com
Blog: http://feedblog.org
Cell: 415-637-8078
Fax: 1-415-358-419 PIN: 0092
_______________________________________________
microformats-discuss mailing list
microformats-discuss microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
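A minimal sketch of the kind of audit asked about above, assuming the pages have already been fetched into strings. The function names and the three sample documents are made up for illustration; only Python's standard-library XML parser is used. Note that this check is strict in ways real XHTML may not survive: for instance, a named entity such as &nbsp; is expected to fail here because the parser does not load the external DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def well_formed_ratio(documents) -> float:
    """Fraction of documents that pass the well-formedness check."""
    documents = list(documents)
    if not documents:
        return 0.0
    return sum(is_well_formed(d) for d in documents) / len(documents)


# Hypothetical sample documents, not taken from any real crawl.
samples = [
    "<html><body><p>closed properly</p></body></html>",
    "<html><body><p>unclosed paragraph<br></body></html>",
    "<html><body><p>bare ampersand: fish & chips</p></body></html>",
]

print(well_formed_ratio(samples))
```

Only the first sample is well formed; the other two show the two most common failure types (an unclosed element and an unescaped ampersand).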
|
|
| Re: stats on well formed XHTML |
  United States |
2008-01-16 17:04:38 |
On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:
> Has anyone done any large scale audits of XHTML in the wild to
> determine the percentage that parse correctly?
Yes, Ian Hickson at Google did a survey of about 1B pages and found
that over 90% had *well-formedness* errors. I can't find a reference
offhand, but it may be buried somewhere in [#webstats].
> I'm thinking about deploying one in Spinn3r but I'd rather focus on
> other tasks if this has already been done.
I'd suggest working on other tasks.
> I'm curious about the assumptions one could make when assuming that
> XHTML is well formed.
You know what they say about assumptions.
> Specifically, the probability that a naive non-XML parser can make
> while indexing the content.
I'm not sure what you mean here, but I'd recommend against using an
XML parser on web content; instead, use something like the
HTML5 parsing algorithm [#html5-parsing].
-ryan
[webstats]: http://code.google.com/webstats/
[html5-parsing]: http://whatwg.org/specs/web-apps/current-work/#parsing
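To illustrate the difference between a strict XML parser and an error-tolerant HTML parser, here is a small sketch using only the Python standard library. Note the assumption: html.parser is a lenient tokenizer, not an implementation of the HTML5 parsing algorithm (a third-party library such as html5lib would be needed for that). The TAG_SOUP string and the class and function names are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical markup: unclosed <p> and <br>, plus a bare ampersand.
TAG_SOUP = "<html><body><p>first<p>second & third<br></body></html>"


class TextCollector(HTMLParser):
    """Collects non-empty text chunks; tolerates malformed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strict_parse_ok(markup: str) -> bool:
    """True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TextCollector()
collector.feed(TAG_SOUP)
collector.close()

print("strict XML parse succeeded:", strict_parse_ok(TAG_SOUP))
print("text recovered by lenient parser:", " ".join(collector.chunks))
```

The strict parser rejects the whole document at the first error, while the lenient one still recovers the text an indexer would want.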
|
|
| Re: stats on well formed XHTML |

|
2008-01-16 19:44:28 |
> > Specifically, the probability that a naive non-XML parser can make
> > while indexing the content.
>
> I'm not sure what you mean here, but I'd recommend against using an
> XML parser against web content and instead use something like the
> HTML5 parsing algorithm [#html5-parsing].
Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
to avoid garbage generation and processor overhead.
However, I think I'm losing that battle.
Kevin
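One way to limit garbage generation without giving up error tolerance is an event-based parser that is fed the page in chunks and never builds a tree. The following is only a sketch of that idea using Python's standard html.parser; the LinkCollector class, the extract_links helper, the 16-character chunk size, and the sample page are all assumptions made for illustration, not anything Spinn3r is known to use.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records href values from <a> tags as events arrive; no DOM is built."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(chunks):
    """Feed the parser incrementally; incomplete tags are buffered by feed()."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# Hypothetical page with an unquoted attribute and unclosed elements.
page = (
    '<p>See <a href="http://spinn3r.com">Spinn3r</a> and '
    "<a href=http://tailrank.com>Tailrank<p>unclosed"
)
chunks = [page[i:i + 16] for i in range(0, len(page), 16)]

print(extract_links(chunks))
```

Only the extracted values are retained, so memory use is bounded by the output plus the parser's small buffer rather than by the size of a document tree.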
|
|
| Re: stats on well formed XHTML |
  United Kingdom |
2008-01-17 04:22:06 |
On Wed, January 16, 2008 11:04 pm, ryan wrote:
> On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:
>
>> Has anyone done any large scale audits of XHTML in the wild to
>> determine the percentage that parse correctly?
>
> Yes, Ian Hickson at Google did a survey of about 1B pages and found
> that over 90% had *well-formedness* errors. I can't find a reference
> offhand, but it may be buried somewhere in [#webstats].
>
Ian Hickson's study at <http://code.google.com/webstats/index.html> links
to studies by Marko Karppinen (2002)
<http://www.markokarppinen.com/20020222.html> and Evan Goer (2003)
<http://www.goer.org/Journal/2003/04/the_xhtml_100.html>, both of which
suggest that anyone expecting to find much well-formed XHTML on the web is
doomed to disappointment.
I can't imagine that things have got any better since :-(
HTH,
Nick.
--
Nick Fitzsimons
http://www.nickfitz.co.uk/
|
|
| Re: stats on well formed XHTML |
  United States |
2008-01-17 05:16:19 |
On 17 Jan 2008, at 19:22, Nick Fitzsimons wrote:
> I can't imagine that things have got any better since :-(
To really evaluate this, there are two parameters to take into account:
number of XHTML pages
--------------------- [now]
number of total pages
But in my humble opinion, it would be more interesting to have this ratio
for each year with *only the new pages* created during that year.
Unfortunately, because there is no uniform way to sign the date of
pages, and because HTTP is in even worse shape than HTML, it is almost
impossible to evaluate.
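If a creation year could somehow be assigned to each page, the per-year ratio described above is simple to compute. A small sketch, assuming each page is reduced to a (creation year, is-XHTML) pair; the function name and the sample records are invented for illustration and are not real measurements.

```python
from collections import defaultdict


def yearly_xhtml_ratio(pages):
    """Map each creation year to (XHTML pages / total pages) for that year.

    pages: iterable of (creation_year, is_xhtml) pairs.
    """
    totals = defaultdict(int)
    xhtml = defaultdict(int)
    for year, is_xhtml in pages:
        totals[year] += 1
        if is_xhtml:
            xhtml[year] += 1
    return {year: xhtml[year] / totals[year] for year in sorted(totals)}


# Invented sample records, for illustration only.
sample_pages = [
    (2006, True), (2006, False),
    (2007, True), (2007, False), (2007, False), (2007, False),
]

print(yearly_xhtml_ratio(sample_pages))
```

The hard part remains the one named above: obtaining a trustworthy creation year for each page.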
--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool
|
|
| Re: stats on well formed XHTML |
  United Kingdom |
2008-01-17 09:08:06 |
On 17 Jan 2008, at 01:44, Kevin Burton wrote:
>>> Specifically, the probability that a naive non-XML parser can make
>>> while indexing the content.
>>
>> I'm not sure what you mean here, but I'd recommend against using an
>> XML parser against web content and instead use something like the
>> HTML5 parsing algorithm [#html5-parsing].
>
> Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
> to avoid garbage generation and processor overhead.
>
> However, I think I'm losing that battle.
Once you start dealing with the joy of DOCTYPEs and the like, it
becomes rather questionable whether XML parsers really are much
simpler than HTML ones.
--
Geoffrey Sneddon
|
|
| Re: stats on well formed XHTML |

|
2008-01-17 12:40:48 |
On 17/01/2008, Derrick Lyndon Pallas <derrick pallas.us> wrote:
> Not so. The Internet Archive knows the first time they've seen a URL,
> over the past ten years; they can also tell you when the content has
> significantly changed.
But can it tell you whether it's a new page or an old page at a new
address?
Rob
|
|
| Re: stats on well formed XHTML |

|
2008-01-17 18:03:03 |
> but in my humble opinion, more interesting would be to have this ratio
> for each year with *only the new pages* created during the year.
> Unfortunately because there is no uniform way to sign the date of
> pages, and because HTTP is in even worse shape than HTML, it is almost
> impossible to evaluate.
One could perform such an audit with hAtom published values. Either
that, or use the RSS timestamp or a timestamp in the URL.
This would suffer from sampling bias, though, because most blog hosts at
least PRETEND to care about standards.
Though it would still be interesting to compute the stats.
Kevin
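As a rough sketch of the "timestamp in the URL" idea: many blog permalinks embed /YYYY/MM/ or /YYYY/MM/DD/ in the path. The regular expression and the date_from_url helper below are assumptions for illustration; real permalink schemes vary, and a date in a URL is only a hint about when the page was created.

```python
import re
from datetime import date

# Matches /YYYY/MM/ or /YYYY/MM/DD/ path segments (an assumed convention).
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{1,2})(?:/(\d{1,2}))?(?:/|$)")


def date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none.

    When only year and month are present, the day defaults to 1.
    """
    match = DATE_IN_PATH.search(url)
    if not match:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    day = int(match.group(3)) if match.group(3) else 1
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


# One URL cited earlier in this thread, plus a URL with no date in it.
print(date_from_url("http://www.goer.org/Journal/2003/04/the_xhtml_100.html"))
print(date_from_url("http://microformats.org/mailman/listinfo/microformats-discuss"))
```

The sampling-bias caveat still applies: only sites that follow such a permalink convention would be counted.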
|
|
| Re: stats on well formed XHTML |

|
2008-01-17 18:05:25 |
> Not so. The Internet Archive knows the first time they've seen a URL,
> over the past ten years; they can also tell you when the content has
> significantly changed. Obviously, there is a bias towards pages (and
> sites) with higher traffic, but that seems reasonable if you're
> evaluating standard practices. ~ Derrick Pallas
Yes... but it would suffer from crawler priority bias.
If it was a low-ranked page, it might take a few months to get around to
crawling it.
Spinn3r would have better data here because we're real-time.
Observing the URL and hAtom timestamp, as I mentioned before, would be
nice but would again suffer from bias.
Kevin
|
|
| Re: stats on well formed XHTML |
  United States |
2008-01-17 22:29:08 |
On 18 Jan 2008, at 09:03, Kevin Burton wrote:
> One could perform such an audit with hAtom published values. Either
> that or use the RSS timestamp or timestamp in the URL.
Hmm, maybe an intermediate possibility: the timestamp of domain creation.
whois microformats.org
Created On:26-Jan-2005 04:13:04 UTC
Last Updated On:02-Nov-2007 05:19:18 UTC
Expiration Date:26-Jan-2008 04:13:04 UTC
Often (not always) Web sites use a common publishing system for the
whole site.
Domain creation then comes with a publishing system which generates a
kind of HTML which "should" be the same for all URLs of this domain.
This has a lot of bias too, but it is just another way to constrain the
data set.
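A small sketch of pulling the creation timestamp out of whois output in the format shown above. The parse_whois_created helper is hypothetical, and the "Created On:" label and date layout are assumptions based only on this one record; other registries format their whois output differently.

```python
from datetime import datetime, timezone

# The whois lines quoted in the message above.
WHOIS_TEXT = """\
Created On:26-Jan-2005 04:13:04 UTC
Last Updated On:02-Nov-2007 05:19:18 UTC
Expiration Date:26-Jan-2008 04:13:04 UTC
"""


def parse_whois_created(whois_text):
    """Return the 'Created On' timestamp as an aware datetime, or None."""
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "created on":
            parsed = datetime.strptime(value.strip(), "%d-%b-%Y %H:%M:%S UTC")
            return parsed.replace(tzinfo=timezone.utc)
    return None


created = parse_whois_created(WHOIS_TEXT)
print(created.year if created else "no creation date found")
```

The resulting year could serve as a coarse stand-in for the age of every page on that domain, with the biases already noted.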
--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool
|
|