
Thread: Nutch and GET




Nutch and GET
Poland
2007-03-23 08:20:32
Hi there,

Can Nutch index dynamic pages with multiple GET
parameters in the request?

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET
2007-03-23 08:47:49
Yes, Nutch can crawl and index any URLs that it can access.

You may have to tweak the max. URL length in the
configuration.

- Ravi Chintakunta


Re: Nutch and GET
Poland
2007-03-23 09:02:09
Ravi Chintakunta wrote:
> You may have to tweak the max. URL length in the
> configuration.

Which parameter is it? I cannot find it in nutch-default.xml.

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET
2007-03-23 09:25:25
Try this:

db.max.anchor.length

- Ravi Chintakunta

Re: Nutch and GET
Poland
2007-03-23 09:35:03
Ravi Chintakunta wrote:
> Try this:
>
> db.max.anchor.length

Well, it doesn't resolve my problem, but my problem may be
connected with the URL which I'm trying to index. It's like
http://some.example/dir/pre?sth=aa&sth2=bb
Maybe this "pre" (which doesn't have any extension) couldn't
be associated with any parser?

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
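
For readers following the thread, here is a minimal sketch in Python (not Nutch code) of how the example URL from the message breaks down. The host `some.example` and the parameter names `sth` and `sth2` come from the message; the rest is the Python standard library. It only illustrates that the last path segment `pre` has no file extension and that both GET parameters live in the query string.

```python
import posixpath
from urllib.parse import urlsplit, parse_qs

# Example URL as given in the message above.
url = "http://some.example/dir/pre?sth=aa&sth2=bb"

parts = urlsplit(url)

# The query string holds the GET parameters; parse_qs maps each name
# to a list of values because a name may repeat.
params = parse_qs(parts.query)

# splitext returns an empty extension when the last path segment has no dot.
extension = posixpath.splitext(parts.path)[1]

print("path:     ", parts.path)
print("params:   ", params)
print("extension:", repr(extension))
```

To my understanding, Nutch selects a parser mainly from the content type of the fetched page rather than from the file extension, so a missing extension alone would not normally block parsing. Treat that as an assumption to check against your Nutch version.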

Re: Nutch and GET
2007-03-23 10:27:18
Hi,

That shouldn't be an issue.

Are you sure that this line

-[?*!=]

is commented out in the crawl-urlfilter.txt file?

- Ravi Chintakunta

Re: Nutch and GET
Poland
2007-03-23 10:35:26
Ravi Chintakunta wrote:
> Are you sure that this line
>
> -[?*!=]
>
> is commented out in the crawl-urlfilter.txt file?

Yes, I'm sure. The URLs are more than 100 characters long.

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET
Finland
2007-03-23 10:30:20
Damian Florczyk wrote:
> Hi there,
>
> Can Nutch index dynamic pages with multiple GET
> parameters in the request?

Have you allowed them in the URL filter configuration? By
default the regex urlfilter filters those out:

# skip URLs containing certain characters as probable queries, etc.
-[?*!=]



--
 Sami Siren
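
The effect of that default rule can be sketched in a few lines of Python. This is an illustration only, not Nutch's implementation: it assumes the usual reading of the filter file, where each line starts with `+` (accept) or `-` (reject), the first rule whose regular expression is found anywhere in the URL decides, and a URL matching no rule is rejected. The catch-all `+.` line and the helper names `load_rules` and `accepts` are hypothetical, added for the example.

```python
import re

# Rules in the style of crawl-urlfilter.txt. The "+." catch-all is a
# hypothetical addition so that ordinary URLs are accepted in this sketch.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!=]
+.
"""

# The same rules with the query-character rule commented out.
COMMENTED_RULES = DEFAULT_RULES.replace("-[?*!=]", "# -[?*!=]")


def load_rules(text):
    """Turn filter text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            continue  # ignore lines that are not accept/reject rules
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule found in the URL decides; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


url = "http://some.example/dir/pre?sth=aa&sth2=bb"
print(accepts(load_rules(DEFAULT_RULES), url))    # rule active
print(accepts(load_rules(COMMENTED_RULES), url))  # rule commented out
```

With the rule active, the `?` and `=` in the example URL trigger the reject rule before the catch-all is reached; with the rule commented out, the URL falls through to the accept rule.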

Re: Nutch and GET
Poland
2007-03-23 13:12:12
Sami Siren wrote:
> Have you allowed them in the URL filter configuration? By
> default the regex urlfilter filters those out:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]

Yes, I did that. Anyway, I've crawled another application
which has shorter GET parameters and everything works fine.
Strange; maybe long URLs are not crawled properly or something?
