
Thread: Parsing XFN in PHP




Re: Parsing XFN in PHP
2008-04-10 11:38:30
all of this being true - on the simpler subject of just grokking XFN,
we can let tidy do all the heavy lifting (and return XML-serialised
HTML from standard HTML) and not worry about the complexities of
anything else.  other microformats, of course, are a completely
different matter (and, as you suggest *MUCH* harder to deal with
reliably).
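
The tidy-first route described here can be sketched roughly as follows. This is an illustration under stated assumptions, not tested production code: it assumes PHP's tidy extension is installed (the helper returns null when it is not), and the function name xfn_rels_via_tidy is made up for the example.

```php
<?php
// Hypothetical helper (not from the thread): tidy the HTML into XHTML,
// parse the result as XML, and return the raw rel value of every link.
function xfn_rels_via_tidy($html)
{
    if (!extension_loaded('tidy')) {
        return null; // tidy is not available on this host
    }

    // Ask tidy for XHTML output with numeric entities, so the result
    // can be parsed as XML without loading a DTD.
    $xhtml = tidy_repair_string($html, array(
        'output-xhtml'     => true,
        'numeric-entities' => true,
    ), 'utf8');
    if (!is_string($xhtml) || $xhtml === '') {
        return null;
    }

    // Collect XML parse errors internally rather than printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xhtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null; // tidy output was still not well-formed XML
    }

    // local-name() sidesteps the XHTML namespace on the tidied document.
    $xpath = new DOMXPath($doc);
    $rels = array();
    foreach ($xpath->query('//*[local-name()="a"][@rel]') as $a) {
        $rels[] = $a->getAttribute('rel');
    }
    return $rels;
}
```

Returning the raw rel strings keeps the sketch short; splitting them on whitespace and matching against the XFN values would be the next step.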

Or am I really being naive here?

On 10/04/2008, Ryan Parman <ryan.lists.warpsharegmail.com> wrote:
> As someone with a background in parsing RSS/Atom, I can say from years of
> experience that RSS is only occasionally XML and that you typically find far
> more HTML in a feed than XML. And parsing HTML can be a bitch.
>  1) Parsing HTML is hard -- especially when the only tools available are for
> another language (XML). If you need to screw something in, but screw drivers
> don't exist, do you use a hammer? An elegantly folded paperclip? A
> combination of both?
>
>  2) *Reliably* parsing microformats out of *most* (X)HTML with
> object-oriented PHP 5.x is going to be a big project. If you're diligent
> about commenting your code so that others can understand what's going on,
> I'd expect a PHP5 library to be at least 1 megabyte. You'll need to account
> for an unprecedented number of completely idiotic markup faults.
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss

Re: Re: Parsing XFN in PHP
United Kingdom
2008-04-10 11:56:05
Ryan Parman <ryan.lists.warpsharegmail.com> Thu, 10 Apr 2008 08:25:06
>4) If this were an existing project with PHP 4.x support, then sure,
>maintain support if the cost is reasonable. But for any new project,
>I'd say to start on a 5.x codebase.

Yes, absolutely.

The thinking here is that it would be good to get uF parsing into useful
application modules in PHP projects like WordPress and Drupal. Those
projects put a lot of effort into being able to run on lowest common
denominator, cheap PHP hosting. Hence it seemed like a good idea to try
and match that, or at least not introduce new awkward dependencies that
restricted roll out.

But then maybe it is about time that we actually used a dependency on
PHP5 to encourage end administrators to demand PHP5 from their hosting
providers. So maybe it's all moot.

I have a feeling this same discussion is taking place in many different
mailing lists.

-- 
Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44
(0)77 5907 2173
Webmaster:          http://www.ecademy.com/  
   T: +44 (0)192 0412 433
Personal WebLog:    http://www.voidstar.com/
    skype:julian.bond?chat
                            Tastes Like Milk

Re: Parsing XFN in PHP
United Kingdom
2008-04-10 12:04:41
Ryan Parman <ryan.lists.warpsharegmail.com> Thu, 10 Apr 2008 09:05:47
>As someone with a background in parsing RSS/Atom, I can say from years
>of experience that RSS is only occasionally XML and that you typically
>find far more HTML in a feed than XML. And parsing HTML can be a bitch.

Big snip.

Woah! That's enough to put one off even starting on parsing and reading
uF. Which makes uF all a bit pointless. Oh dear. :(

I suspect though that this Gordian knot can be cut. It seems quite
likely that any page marked up with uF is good enough that HTML-Tidy
won't remove too many uF marked up elements. If that's the case, then
Fetch HTML -> HTML-Tidy -> XML parsing is going to get 99% of the job
done and successfully extract the uF marked data. But that HTML-Tidy
step is going to be indispensable. It just plain won't work without it.
And the shortcut that reduces even that step is
DOMDocument->loadHTML($html) which is effectively doing the same thing.
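
A minimal sketch of the loadHTML shortcut, assuming a PHP build with the DOM extension enabled. The function name, the sample markup, and the decision to keep only the XFN 1.1 relationship values are choices made for this illustration, not something taken from the thread.

```php
<?php
// Hypothetical helper (not from the thread): pull XFN relationships out of
// possibly sloppy HTML using DOMDocument::loadHTML() instead of HTML-Tidy.
function extract_xfn_links($html)
{
    // Relationship values defined by XFN 1.1.
    $xfn = array(
        'friend', 'acquaintance', 'contact', 'met', 'co-worker', 'colleague',
        'co-resident', 'neighbor', 'child', 'parent', 'sibling', 'spouse',
        'kin', 'muse', 'crush', 'date', 'sweetheart', 'me',
    );

    if (trim($html) === '') {
        return array(); // nothing to parse; loadHTML() rejects empty input
    }

    // Collect markup warnings internally rather than printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href][@rel]') as $a) {
        // rel is a space-separated list; keep only the XFN values.
        $rel = strtolower(trim($a->getAttribute('rel')));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        $rels = array_values(array_intersect($tokens, $xfn));
        if ($rels) {
            $links[] = array('href' => $a->getAttribute('href'), 'rel' => $rels);
        }
    }
    return $links;
}

// Usage: unclosed <li> elements and no <html>/<body> wrapper.
$sample = '<ul><li><a href="http://example.com/alice" rel="friend met">Alice</a>'
        . '<li><a href="http://example.org/" rel="nofollow">Elsewhere</a></ul>';
print_r(extract_xfn_links($sample));
```

Links whose rel carries no XFN value (such as a bare nofollow) are skipped, so the result lists only relationships a consumer would care about.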

It would be interesting to do some interop testing and see just how bad
a web page has to be before the uF starts getting missed.

And a uF validator would come in handy there.

-- 
Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44
(0)77 5907 2173
Webmaster:          http://www.ecademy.com/  
   T: +44 (0)192 0412 433
Personal WebLog:    http://www.voidstar.com/
    skype:julian.bond?chat
                            Tastes Like Milk

Re: Parsing XFN in PHP
United States
2008-04-10 12:34:20
Ryan Parman wrote:

> "But we can do it in web browsers!" What do web browsers have that PHP
> developers don't? An HTML parser. As far as I know there are no HTML
> parsers written for PHP (or any other language that I'm aware of).

http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php

-- 
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 15 days, 4:51.]

                   Tagliatelle with Fennel and Asparagus
   http://tobyinkster.co.uk/blog/2008/04/06/tagliatelle-fennel-asparagus/


Re: Re: Parsing XFN in PHP
United States
2008-04-10 12:32:36
Sorry for not noticing this earlier, but this thread really belongs on
the -dev list (to keep the -discuss list relevant for those not doing
development).  Please direct future replies to
microformats-dev@microformats.org (and join first if you're doing
development and not already on the list).

http://microformats.org/mailman/listinfo/microformats-dev/

Peace,
Scott


Re: Re: Parsing XFN in PHP
United States
2008-04-10 13:34:23
On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> Ryan Parman <ryan.lists.warpsharegmail.com> Thu, 10 Apr 2008 09:05:47
>> As someone with a background in parsing RSS/Atom, I can say from
>> years of experience that RSS is only occasionally XML and that you
>> typically find far more HTML in a feed than XML. And parsing HTML
>> can be a bitch.
>
> Big snip.
>
> Woah! That's enough to put one off even starting on parsing and
> reading uF. Which makes uF all a bit pointless. Oh dear. :(

Sarcasm noted. ;)

> I suspect though that this Gordian knot can be cut. It seems quite
> likely that any page marked up with uF is good enough that HTML-Tidy
> won't remove too many uF marked up elements. If that's the case,
> then Fetch html -> HTML-Tidy -> XML parsing is going to get 99% of
> the job done and successfully extract the uF marked data. But that
> HTML-Tidy step is going to be indispensable. It just plain won't
> work without it. And the shortcut that reduces even that step is
> DOMDocument->loadHTML($html) which is effectively doing the same thing.

On Apr 10, 2008, at 10:34 AM, Toby A Inkster wrote:
> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php

This is interesting -- especially if it works. However the version
information is noted as CVS-only. Is this in a shipping version of PHP
yet?

Using HTML-Tidy is a fairly big gotcha for most people on shared
hosting. I don't know the stats, but I would guess that not many
hosting providers have this installed. I have access to dedicated
hardware, so I'm definitely interested in this (assuming it works as
expected, of course), but I'm concerned about the community at-large.

On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> It would be interesting to do some interop testing and see just how
> bad a web page has to be before the uF starts getting missed.


I agree.

--
Ryan Parman
<http://ryanparman.com>




Re: Re: Parsing XFN in PHP
United Kingdom
2008-04-10 15:26:02
Scott Reynen <scottrandomchaos.com> Thu, 10 Apr 2008 11:32:36
>Sorry for not noticing this earlier, but this thread really belongs on
>the -dev list (to keep the -discuss list relevant for those not doing
>development).  Please direct future replies to
>microformats-dev@microformats.org (and join first if you're doing
>development and not already on the list).
>
>http://microformats.org/mailman/listinfo/microformats-dev/

And just as I was about to post sample code as well. ;)

-- 
Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44
(0)77 5907 2173
Webmaster:          http://www.ecademy.com/  
   T: +44 (0)192 0412 433
Personal WebLog:    http://www.voidstar.com/
    skype:julian.bond?chat
                            Tastes Like Milk

Re: Re: Parsing XFN in PHP
United Kingdom
2008-04-11 05:54:51
Brian Suda <brian.sudagmail.com> Fri, 11 Apr 2008 08:47:51
>--- i have to echo what Scott said several messages back. Any parsing
>or development related discussions should be taken to the dev list. If
>anyone want to discuss/dispute the merits of various languages and
>their versions, then that should be done on other mailing lists.
>
>Lets keep this discuss list to discussions and the dev list to developing.

I've been trying to sub to that list but without success. Perhaps I'm
stuck in a moderation queue or something.

To some extent, encouraging the development of parsers in common
languages is a necessary bit of evangelism, since a markup language
needs consumers to have value. So that does feel on topic here. But I
take the point that it can quickly slip into technical issues, as has
happened.

-- 
Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44
(0)77 5907 2173
Webmaster:          http://www.ecademy.com/  
   T: +44 (0)192 0412 433
Personal WebLog:    http://www.voidstar.com/
    skype:julian.bond?chat
                            Tastes Like Milk

Re: Parsing XFN in PHP
United States
2008-04-11 02:33:59
Ryan Parman wrote:

> On Apr 10, 2008, at 10:34 AM, Toby A Inkster wrote:
>> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
>
> This is interesting -- especially if it works. However the version
> information is noted as CVS-only. Is this in a shipping version of PHP
> yet?

I've never quite understood how they get that versioning information in
the documentation. It often seems wrong. DOMDocument has been part of the
PHP core since 5.0.

Another option is XML_HTMLSax3 from PEAR:
http://pear.php.net/package/XML_HTMLSax3

Here's an example use of the above PEAR module: it converts an HTML table
into a two-dimensional array, respecting rowspan and colspan (oh yes!):
http://tobyinkster.co.uk/blog/2007/07/20/html-table-parsing/

-- 
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 15 days, 18:33.]

                   Tagliatelle with Fennel and Asparagus
   http://tobyinkster.co.uk/blog/2008/04/06/tagliatelle-fennel-asparagus/


Re: Re: Parsing XFN in PHP
2008-04-11 03:47:51
2008/4/10 Scott Reynen <scottrandomchaos.com>:
> Please direct future replies to
> microformats-dev@microformats.org (and join first if you're doing
> development and not already on the list).
>
>  http://microformats.org/mailman/listinfo/microformats-dev/


--- I have to echo what Scott said several messages back. Any parsing
or development related discussions should be taken to the dev list. If
anyone wants to discuss/dispute the merits of various languages and
their versions, then that should be done on other mailing lists.

Let's keep this discuss list to discussions and the dev list to developing.

Thanks,
-brian


-- 
brian suda
http://suda.co.uk
