Thread: site text search

2006-02-09 10:31:59
On Wed, Feb 08, 2006 at 10:23:10PM +0000, Andy Armstrong wrote:
> * if you have a site that presents multiple views of the same
>   data (e.g. articles sorted by date, by subject, by keyword)
>   then a crawler based indexer will index each item many times
>   - once for each view in which it appears; MySQL will only have
>   a single copy of the data.

Then you've got a broken web site -- period. Content should have only
one, canonical URL.
-Dom

2006-02-09 11:08:16
On Thu, 2006-02-09 at 10:31 +0000, Dominic Mitchell wrote:
> On Wed, Feb 08, 2006 at 10:23:10PM +0000, Andy Armstrong wrote:
> > * if you have a site that presents multiple views of the same
> >   data (e.g. articles sorted by date, by subject, by keyword)
> >   then a crawler based indexer will index each item many times
> >   - once for each view in which it appears; MySQL will only have
> >   a single copy of the data.
>
> Then you've got a broken web site -- period. Content should have only
> one, canonical URL.
<rant>
Really? So we are only allowed to hyperlink with absolute names now are
we? Have you googled for stuff on this mailing list recently?

The real world really does not work like that. In fact, I would go
further: any site that implements the "one canonical URL" paradigm is
likely to be extremely difficult to use.
</rant>
Of course making sure that only one copy of the content goes into a
search engine's index is a different problem and is an ideal to be
striven for. The after effect of actually achieving this would be just
the one URI to the content from that search engine.
Dirk
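
A minimal sketch of the "only one copy in the index" idea from the message above, assuming Python and a crawler that hands over (url, text) pairs. The function names, the example.org URLs and the choice of a SHA-256 hash over whitespace-normalised text are all assumptions made for illustration; nothing in the thread specifies how the indexer should do this.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the whitespace-normalised, lower-cased text of a page body."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def build_index(pages):
    """Keep one index entry per distinct body; later URLs become aliases.

    `pages` is an iterable of (url, text) pairs (hypothetical crawler output).
    """
    index = {}
    for url, text in pages:
        key = fingerprint(text)
        # The first URL seen for a given body becomes the entry's URL.
        entry = index.setdefault(key, {"url": url, "aliases": []})
        if entry["url"] != url:
            entry["aliases"].append(url)
    return index


# Three views of the same article plus one unrelated page (made-up URLs).
pages = [
    ("http://example.org/by-date/2006/02/09/search", "Site  text search"),
    ("http://example.org/by-subject/indexing/search", "Site text search"),
    ("http://example.org/by-keyword/mysql/search", "site text search\n"),
    ("http://example.org/by-date/2006/02/08/other", "Something else"),
]
index = build_index(pages)
```

With this input the index holds two entries: the article is stored once under the first URL the crawler saw, and the other two views are recorded only as aliases. An exact hash only catches identical bodies; pages that wrap the same article in different navigation or branding would need the surrounding markup stripped first.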

2006-02-09 11:17:39
* at 09/02 11:08 +0000 Dirk Koopman said:
> On Thu, 2006-02-09 at 10:31 +0000, Dominic Mitchell wrote:
> > Then you've got a broken web site -- period. Content should have only
> > one, canonical URL.
>
> <rant>
>
> Really? So we are only allowed to hyperlink with absolute names now are
> we? Have you googled for stuff on this mailing list recently?

Aren't all the messages in the archive in one place only?

> The real world really does not work like that. In fact, I would go
> further: any site that implements the "one canonical URL" paradigm is
> likely to be extremely difficult to use.
>
> </rant>
I don't think so. Sure, you can have multiple ways to get to that one
URL but it makes perfect sense to me to have the content in one place
on the site. Tim Bray's 'blog[0] is a pretty good example of this in
that he has pages that let you drill down by date and by category but
each entry is in one place only. Heck, the mailman list archive for
this mailing list is an example of that.

I really don't see how it's hard to do this.

> Of course making sure that only one copy of the content goes into a
> search engine's index is a different problem and is an ideal to be
> striven for. The after effect of actually achieving this would be just
> the one URI to the content from that search engine.

But then why not just have the content in one place in the first
place and then you get round this?
s
[0] http://www.tbray.org/ongoing/

2006-02-09 12:04:02
Dominic Mitchell wrote:
> Then you've got a broken web site -- period. Content should have only
> one, canonical URL.

Not so. Remember that a URL is a Uniform Resource Locator. It simply
says that there is a resource that can be fetched from this location. It
doesn't say that this is the only location for the resource, or that the
resource will be there tomorrow or the day after.

For example, both these URLs point to the same resource:

  http://cpan.org/modules/by-module/Template/Template-Plugin-Colour-0.01.tar.gz
  http://cpan.org/modules/by-authors/id/ABW/Template-Plugin-Colour-0.01.tar.gz

Both are correct, both are equally valid.

Homotopic paths (different paths that get you to the same place) are
a fundamental part of any hyperspace such as the web, and a good thing too.
There is no strict hierarchy, there is no "One True Location". That's
why it's a web and not an index.

On the other hand, every resource should ideally have one, canonical URI.
A Uniform Resource Identifier does not say anything about the location
of the resource, but simply gives it an identifier so that it can be
uniquely referenced now and for the rest of time. You may not be able
to fetch the resource, now or ever, but at least you'll be able to talk
about it.

I suspect this is what you're really getting at. The URL/URI confusion
is a common one, made all the more perplexing by the fact that URLs are
written using the URI "syntax". Thus in common parlance, URL and URI are
interchangeable but actually relate to slightly different concepts.
A
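
The locator/identifier distinction in the message above can be sketched in a few lines of Python. The rule used here, that the distribution file name at the end of the path serves as the identifier, is an assumption made only for this illustration, and `identifier_for` is a hypothetical helper; neither CPAN nor the message defines identifiers this way.

```python
from urllib.parse import urlsplit


def identifier_for(url: str) -> str:
    """Return a stand-in identifier: the last path segment of the URL.

    Assumption for illustration only: the file name names the resource.
    """
    path = urlsplit(url).path
    return path.rsplit("/", 1)[-1]


# The two locators quoted in the message.
locators = [
    "http://cpan.org/modules/by-module/Template/Template-Plugin-Colour-0.01.tar.gz",
    "http://cpan.org/modules/by-authors/id/ABW/Template-Plugin-Colour-0.01.tar.gz",
]

# Group every locator under the identifier it resolves to.
by_identifier = {}
for url in locators:
    by_identifier.setdefault(identifier_for(url), []).append(url)
```

Both locators end up under the single key `Template-Plugin-Colour-0.01.tar.gz`: many places to fetch from, one name to talk about. A real site would more likely use a database key or a declared canonical URI than a file name.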

2006-02-09 12:02:05
On Thu, 2006-02-09 at 11:17 +0000, Struan Donald wrote:
> * at 09/02 11:08 +0000 Dirk Koopman said:
> > On Thu, 2006-02-09 at 10:31 +0000, Dominic Mitchell wrote:
> > > Then you've got a broken web site -- period. Content should have only
> > > one, canonical URL.
> >
> > <rant>
> >
> > Really? So we are only allowed to hyperlink with absolute names now are
> > we? Have you googled for stuff on this mailing list recently?
>
> Aren't all the messages in the archive in one place only?

But that isn't the point of the rant.

> > The real world really does not work like that. In fact, I would go
> > further: any site that implements the "one canonical URL" paradigm is
> > likely to be extremely difficult to use.
> >
> > </rant>
>
> I don't think so. Sure, you can have multiple ways to get to that one
> URL but it makes perfect sense to me to have the content in one place
> on the site.

And if this is true then you don't (necessarily) have "one canonical
URL". Multiple ways imply the possibility of multiple URLs.
As an illustration, a music site I once was concerned with had a single
canonical "database" of articles/data on musicians and their music. It
was a "find a starting point, but then follow your nose if you found
anything (more) interesting" site. This meant that the site had a)
multiple pathways to navigate around that data and [to pay for it all]
b) different branded skins to view it with. Editors were at liberty to
create/break "cross" hyperlinks as they saw fit.

The result of this was that, depending on how you got into the site and
also how you arrived at the data, you [wc]ould get a different URL.

Same data - many URLs.
Dirk

2006-02-09 15:58:51
On Thu, Feb 09, 2006 at 12:04:02PM +0000, Andy Wardley wrote:
> Dominic Mitchell wrote:
> > Then you've got a broken web site -- period. Content should have only
> > one, canonical URL.
>
> Not so. Remember that a URL is a Uniform Resource Locator. It simply
> says that there is a resource that can be fetched from this location. It
> doesn't say that this is the only location for the resource, or that the
> resource will be there tomorrow or the day after.
>
> For example, both these URLs point to the same resource:
>
>   http://cpan.org/modules/by-module/Template/Template-Plugin-Colour-0.01.tar.gz
>   http://cpan.org/modules/by-authors/id/ABW/Template-Plugin-Colour-0.01.tar.gz
>
> Both are correct, both are equally valid.
>
> Homotopic paths (different paths that get you to the same place) are
> a fundamental part of any hyperspace such as the web, and a good thing too.
> There is no strict hierarchy, there is no "One True Location". That's
> why it's a web and not an index.
>
> On the other hand, every resource should ideally have one, canonical URI.
> A Uniform Resource Identifier does not say anything about the location
> of the resource, but simply gives it an identifier so that it can be
> uniquely referenced now and for the rest of time. You may not be able
> to fetch the resource, now or ever, but at least you'll be able to talk
> about it.
>
> I suspect this is what you're really getting at. The URL/URI confusion
> is a common one, made all the more perplexing by the fact that URLs are
> written using the URI "syntax". Thus in common parlance, URL and URI are
> interchangeable but actually relate to slightly different concepts.

Spot on. Guess I've been listening to too many semantic web people
recently.
-Dom