|
|
| Re: Parsing XFN in PHP |

|
2008-04-10 11:38:30 |
All of this being true, on the simpler subject of just grokking XFN we
can let Tidy do all the heavy lifting (and return XML-serialised HTML
from standard HTML) and not worry about the complexities of anything
else. Other microformats, of course, are a completely different matter
(and, as you suggest, *MUCH* harder to deal with reliably).
Or am I really being naive here?
On 10/04/2008, Ryan Parman <ryan.lists.warpshare gmail.com> wrote:
> As someone with a background in parsing RSS/Atom, I can say from years
> of experience that RSS is only occasionally XML and that you typically
> find far more HTML in a feed than XML. And parsing HTML can be a bitch.
>
> 1) Parsing HTML is hard -- especially when the only tools available
> are for another language (XML). If you need to screw something in, but
> screwdrivers don't exist, do you use a hammer? An elegantly folded
> paperclip? A combination of both?
>
> 2) *Reliably* parsing microformats out of *most* (X)HTML with
> object-oriented PHP 5.x is going to be a big project. If you're
> diligent about commenting your code so that others can understand
> what's going on, I'd expect a PHP5 library to be at least 1 megabyte.
> You'll need to account for an unprecedented number of completely
> idiotic markup faults.
_______________________________________________
microformats-discuss mailing list
microformats-discuss microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
|
|
| Re: Re: Parsing XFN in PHP |
  United Kingdom |
2008-04-10 11:56:05 |
Ryan Parman <ryan.lists.warpshare gmail.com> Thu, 10 Apr 2008 08:25:06
>4) If this were an existing project with PHP 4.x support, then sure,
>maintain support if the cost is reasonable. But for any new project,
>I'd say to start on a 5.x codebase.
Yes, absolutely.
The thinking here is that it would be good to get uF parsing into
useful application modules in PHP projects like WordPress and Drupal.
Those projects put a lot of effort into being able to run on lowest
common denominator, cheap PHP hosting. Hence it seemed like a good idea
to try and match that, or at least not introduce new awkward
dependencies that restricted rollout.
But then maybe it is about time that we actually used a dependency on
PHP5 to encourage end administrators to demand PHP5 from their hosting
providers. So maybe it's all moot.
I have a feeling this same discussion is taking place in many different
mailing lists.
--
Julian Bond E&MSN: julian_bond at voidstar.com M: +44 (0)77 5907 2173
Webmaster: http://www.ecademy.com/
T: +44 (0)192 0412 433
Personal WebLog: http://www.voidstar.com/
skype:julian.bond?chat
Tastes Like Milk
|
|
| Re: Parsing XFN in PHP |
  United Kingdom |
2008-04-10 12:04:41 |
Ryan Parman <ryan.lists.warpshare gmail.com> Thu, 10 Apr 2008 09:05:47
>As someone with a background in parsing RSS/Atom, I can say from years
>of experience that RSS is only occasionally XML and that you typically
>find far more HTML in a feed than XML. And parsing HTML can be a bitch.
Big snip.
Woah! That's enough to put one off even starting on parsing and reading
uF. Which makes uF all a bit pointless. Oh dear. :(
I suspect though that this Gordian knot can be cut. It seems quite
likely that any page marked up with uF is good enough that HTML-Tidy
won't remove too many uF marked-up elements. If that's the case, then
Fetch HTML -> HTML-Tidy -> XML parsing is going to get 99% of the job
done and successfully extract the uF-marked data. But that HTML-Tidy
step is going to be indispensable. It just plain won't work without it.
And the shortcut that reduces even that step is
DOMDocument->loadHTML($html), which is effectively doing the same thing.
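As a rough illustration of that shortcut, here is a minimal sketch of
pulling XFN values out of sloppy HTML with DOMDocument->loadHTML(). The
function name extract_xfn() and the sample markup are made up for this
example and are not from the thread; the rel vocabulary is the XFN 1.1
value list. Treat it as a starting point under those assumptions, not a
tested parser.

```php
<?php
// Hypothetical helper (not from the thread): collect XFN rel values
// from the <a> elements of a possibly malformed HTML string.
function extract_xfn($html)
{
    // XFN 1.1 relationship values; other rel tokens (e.g. nofollow) are ignored.
    $xfn = array(
        'friend', 'acquaintance', 'contact', 'met', 'co-worker',
        'colleague', 'co-resident', 'neighbor', 'child', 'parent',
        'sibling', 'spouse', 'kin', 'muse', 'crush', 'date',
        'sweetheart', 'me',
    );

    $doc = new DOMDocument();
    // Collect libxml's warnings about broken markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('rel') || !$a->hasAttribute('href')) {
            continue;
        }
        // rel is a whitespace-separated list of tokens.
        $tokens = preg_split(
            '/\s+/',
            strtolower(trim($a->getAttribute('rel'))),
            -1,
            PREG_SPLIT_NO_EMPTY
        );
        $rels = array_values(array_intersect($tokens, $xfn));
        if ($rels) {
            $links[] = array('href' => $a->getAttribute('href'), 'rel' => $rels);
        }
    }
    return $links;
}

// Sample input with unclosed elements, as an assumption of "typical" tag soup.
$html = '<p><a href="http://example.com/jo" rel="friend met">Jo<p>unclosed';
print_r(extract_xfn($html));
```

Filtering against the XFN value list keeps unrelated rel tokens out of
the result; whether that is the right policy for a real parser is a
design choice this sketch does not settle.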
It would be interesting to do some interop testing and see just how bad
a web page has to be before the uF starts getting missed. And a uF
validator would come in handy there.
--
Julian Bond E&MSN: julian_bond at voidstar.com M: +44 (0)77 5907 2173
Webmaster: http://www.ecademy.com/
T: +44 (0)192 0412 433
Personal WebLog: http://www.voidstar.com/
skype:julian.bond?chat
Tastes Like Milk
|
|
| Re: Parsing XFN in PHP |
  United States |
2008-04-10 12:34:20 |
Ryan Parman wrote:
> "But we can do it in web browsers!" What do web browsers have that PHP
> developers don't? An HTML parser. As far as I know there are no HTML
> parsers written for PHP (or any other language that I'm aware of).
http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 15 days, 4:51.]
Tagliatelle with Fennel and Asparagus
http://tobyinkster.co.uk/blog/2008/04/06/tagliatelle-fennel-asparagus/
|
|
| Re: Re: Parsing XFN in PHP |
  United States |
2008-04-10 12:32:36 |
Sorry for not noticing this earlier, but this thread really belongs on
the -dev list (to keep the -discuss list relevant for those not doing
development). Please direct future replies to
microformats-dev microformats.org (and join first if you're doing
development and not already on the list).
http://microformats.org/mailman/listinfo/microformats-dev/
Peace,
Scott
|
|
| Re: Re: Parsing XFN in PHP |
  United States |
2008-04-10 13:34:23 |
On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> Ryan Parman <ryan.lists.warpshare gmail.com> Thu, 10 Apr 2008 09:05:47
>> As someone with a background in parsing RSS/Atom, I can say from
>> years of experience that RSS is only occasionally XML and that you
>> typically find far more HTML in a feed than XML. And parsing HTML
>> can be a bitch.
>
> Big snip.
>
> Woah! That's enough to put one off even starting on parsing and
> reading uF. Which makes uF all a bit pointless. Oh dear. :(
Sarcasm noted. ;)
> I suspect though that this Gordian knot can be cut. It seems quite
> likely that any page marked up with uF is good enough that HTML-Tidy
> won't remove too many uF marked up elements. If that's the case,
> then Fetch html -> HTML-Tidy -> XML parsing is going to get 99% of
> the job done and successfully extract the uF marked data. But that
> HTML-Tidy step is going to be indispensable. It just plain won't
> work without it. And the shortcut that reduces even that step is
> DOMDocument->loadHTML($html) which is effectively doing the same thing.
On Apr 10, 2008, at 10:34 AM, Toby A Inkster wrote:
> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
This is interesting -- especially if it works. However, the version
information is noted as CVS-only. Is this in a shipping version of PHP
yet?
Using HTML-Tidy is a fairly big gotcha for most people on shared
hosting. I don't know the stats, but I would guess that not many
hosting providers have this installed. I have access to dedicated
hardware, so I'm definitely interested in this (assuming it works as
expected, of course), but I'm concerned about the community at large.
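One way to hedge against the shared-hosting concern is to use Tidy only
when the extension happens to be loaded and fall back to
DOMDocument->loadHTML() otherwise. The sketch below assumes exactly
that; html_to_dom() is a hypothetical name, and the Tidy options shown
(output-xhtml, numeric-entities) are one plausible configuration rather
than a recommendation from anyone on this list.

```php
<?php
// Hypothetical helper (not from the thread): return a DOMDocument for
// possibly malformed HTML, preferring Tidy when the extension is loaded.
function html_to_dom($html)
{
    $doc = new DOMDocument();
    // Collect parse warnings quietly instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $loaded = false;

    if (extension_loaded('tidy')) {
        // Ask Tidy for XHTML with numeric entities so an XML parser can read it.
        $xhtml = tidy_repair_string(
            $html,
            array('output-xhtml' => true, 'numeric-entities' => true),
            'utf8'
        );
        if (is_string($xhtml) && $xhtml !== '') {
            $loaded = $doc->loadXML($xhtml);
        }
    }
    if (!$loaded) {
        // Fallback: libxml's forgiving HTML parser from the DOM extension.
        $loaded = $doc->loadHTML($html);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $loaded ? $doc : null;
}

// Works the same way whether or not Tidy is installed on the host.
$doc = html_to_dom('<p>Hi <a href="/me" rel="me">me</a>');
echo $doc->getElementsByTagName('a')->length, "\n";
```

Either branch hands back the same kind of object, so code that walks
the document does not need to know which parser produced it.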
On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> It would be interesting to do some interop testing and see just how
> bad a web page has to be before the uF starts getting missed.
I agree.
--
Ryan Parman
<http://ryanparman.com>
|
|
| Re: Re: Parsing XFN in PHP |
  United Kingdom |
2008-04-10 15:26:02 |
Scott Reynen <scott randomchaos.com> Thu, 10 Apr 2008 11:32:36
>Sorry for not noticing this earlier, but this thread really belongs on
>the -dev list (to keep the -discuss list relevant for those not doing
>development). Please direct future replies to
>microformats-dev microformats.org (and join first if you're doing
>development and not already on the list).
>
>http://microformats.org/mailman/listinfo/microformats-dev/
And just as I was about to post sample code as well. ;)
--
Julian Bond E&MSN: julian_bond at voidstar.com M: +44 (0)77 5907 2173
Webmaster: http://www.ecademy.com/
T: +44 (0)192 0412 433
Personal WebLog: http://www.voidstar.com/
skype:julian.bond?chat
Tastes Like Milk
|
|
| Re: Re: Parsing XFN in PHP |
  United Kingdom |
2008-04-11 05:54:51 |
Brian Suda <brian.suda gmail.com> Fri, 11 Apr 2008 08:47:51
>--- I have to echo what Scott said several messages back. Any parsing
>or development related discussions should be taken to the dev list. If
>anyone wants to discuss/dispute the merits of various languages and
>their versions, then that should be done on other mailing lists.
>
>Let's keep this discuss list to discussions and the dev list to
>developing.
I've been trying to sub to that list but without success. Perhaps I'm
stuck in a moderation queue or something.
To some extent, encouraging the development of parsers in common
languages is a necessary bit of evangelism, since a markup language
needs consumers to have value. So that does feel on topic here. But I
take the point that it can quickly slip into technical issues, as has
happened.
--
Julian Bond E&MSN: julian_bond at voidstar.com M: +44 (0)77 5907 2173
Webmaster: http://www.ecademy.com/
T: +44 (0)192 0412 433
Personal WebLog: http://www.voidstar.com/
skype:julian.bond?chat
Tastes Like Milk
|
|
| Re: Parsing XFN in PHP |
  United States |
2008-04-11 02:33:59 |
Ryan Parman wrote:
> On Apr 10, 2008, at 10:34 AM, Toby A Inkster wrote:
>> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
>
> This is interesting -- especially if it works. However the version
> information is noted as CVS-only. Is this in a shipping version of PHP
> yet?
I've never quite understood how they get that versioning information in
the documentation. It often seems wrong. DOMDocument has been part of
the PHP core since 5.0.
Another option is XML_HTMLSax3 from PEAR:
http://pear.php.net/package/XML_HTMLSax3
Here's an example use of the above PEAR module: it converts an HTML
table into a two-dimensional array, respecting rowspan and colspan (oh
yes!):
http://tobyinkster.co.uk/blog/2007/07/20/html-table-parsing/
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 15 days, 18:33.]
Tagliatelle with Fennel and Asparagus
http://tobyinkster.co.uk/blog/2008/04/06/tagliatelle-fennel-asparagus/
|
|
| Re: Re: Parsing XFN in PHP |

|
2008-04-11 03:47:51 |
2008/4/10 Scott Reynen <scott randomchaos.com>:
> Please direct future replies to
> microformats-dev microformats.org (and join first if you're doing
> development and not already on the list).
>
> http://microformats.org/mailman/listinfo/microformats-dev/
--- I have to echo what Scott said several messages back. Any parsing
or development related discussions should be taken to the dev list. If
anyone wants to discuss/dispute the merits of various languages and
their versions, then that should be done on other mailing lists.
Let's keep this discuss list to discussions and the dev list to
developing.
Thanks,
-brian
--
brian suda
http://suda.co.uk
|
|