Thread: How to treat # in URLs?




How to treat # in URLs?
2007-08-13 21:49:57
Hi,

I noticed that urls with a # in them are not handled any differently to
normal urls. See output of readdb:

http://127.0.0.1:8000/about.html        Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:41:55 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0

http://127.0.0.1:8000/about.html#top    Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:42:03 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0

I would have expected that, when doing an updatedb, the #foobar part of
the URL would be stripped.

Is there a sensible reason for the current behaviour? Or have I found a bug?

Cheers,
Carl.

Re: How to treat # in URLs?
2007-08-14 01:23:51
Technically, the fragment is part of the URL, but foo and foo#bar point to
the same location, so it should be stripped out. Are you using
url-normalizers? If not, could you please try them?
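
To illustrate the point about foo and foo#bar, here is a minimal sketch of fragment stripping. It uses only the Python standard library and is not the Nutch normalizer itself; the helper name strip_fragment is an assumption made for this example.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#fragment' part, if it has one."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


# The two URLs from the readdb output in the original post.
urls = [
    "http://127.0.0.1:8000/about.html",
    "http://127.0.0.1:8000/about.html#top",
]

# After stripping, both entries collapse to the same key.
unique = sorted({strip_fragment(u) for u in urls})
print(unique)
```

With the fragment removed, both entries map to a single URL, which is the behaviour expected from a normalizer that drops fragments.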

Carl Cerecke wrote:
> Hi,
>
> I noticed that urls with a # in them are not handled any differently
> to normal urls. See output of readdb:
>
> http://127.0.0.1:8000/about.html        Version: 5
> Status: 2 (db_fetched)
> Fetch time: Thu Sep 13 14:41:55 NZST 2007
> Modified time: Thu Jan 01 12:00:00 NZST 1970
> Retries since fetch: 0
> Retry interval: 2592000.0 seconds (30.0 days)
> Score: 4.0
> Signature: c79a4a20d6a19603120d1fdbaf19b0eb
> Metadata: _pst_:success(1), lastModified=0
>
> http://127.0.0.1:8000/about.html#top    Version: 5
> Status: 2 (db_fetched)
> Fetch time: Thu Sep 13 14:42:03 NZST 2007
> Modified time: Thu Jan 01 12:00:00 NZST 1970
> Retries since fetch: 0
> Retry interval: 2592000.0 seconds (30.0 days)
> Score: 4.0
> Signature: c79a4a20d6a19603120d1fdbaf19b0eb
> Metadata: _pst_:success(1), lastModified=0
>
> I would have expected that, when doing an updatedb, the #foobar part
> of the URL would be stripped.
>
> Is there a sensible reason for the current behaviour? Or have I found
> a bug?
>
> Cheers,
> Carl.
>

Re: How to treat # in URLs?
2007-08-15 16:33:28
Doh! I had copied the plugin.includes property from nutch-default.xml to
nutch-site.xml and somehow accidentally inserted a newline so it looked like
[...]urlno
rmalizer[...]
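
A stray newline like this is easy to miss by eye. The following is a rough sketch of a check for it, assuming the usual configuration/property/name/value layout of the site file. The SAMPLE value is a shortened, made-up stand-in for the real plugin.includes setting, and values_with_inner_whitespace is a hypothetical helper, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Illustrative only: an abbreviated value with an accidental newline
# in the middle of "urlnormalizer", as described above.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlno
rmalizer-basic</value>
  </property>
</configuration>"""


def values_with_inner_whitespace(xml_text: str) -> list:
    """Return names of properties whose value contains whitespace
    after leading and trailing whitespace has been trimmed."""
    root = ET.fromstring(xml_text)
    suspicious = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if any(ch.isspace() for ch in value):
            suspicious.append(name)
    return suspicious


print(values_with_inner_whitespace(SAMPLE))
```

Some properties may legitimately contain spaces in their values, so a hit from this check is only a hint to look at the value more closely.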

Oops,
Carl.

Enis Soztutar wrote:
> Technically, the fragment is part of the URL, but foo and foo#bar
> point to the same location, so it should be stripped out. Are you using
> url-normalizers? If not, could you please try them?

