|
List Info
Thread: WhiteListBlackList
|
|
| WhiteListBlackList |

|
2006-05-22 11:50:56 |
Hi, I have a problem when using black/white list URL filtering.
I have two directories for filtering, called NegativeURLS and
PositiveURLS.
*****************************************************************
In NegativeURLS, I have
www.hurriyet.com.tr
In PositiveURLS, I have
www.milliyet.com.tr
*****************************************************************
In the input directory for the crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr
I run the following commands from the shell:
$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
Then I run inject, generate and fetch. After that I run the
following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
Finally I run GenericReader and print the output. It contains
the URLs that are in the blacklist.
What can be the problem?
|
|
| WhiteListBlackList |

|
2006-05-22 13:04:09 |
On 22.05.2006 at 13:50, Murat Ali Bayir wrote:
> Hi, I have a problem when using black/white list URL filtering.
> I have two directories for filtering, called NegativeURLS and
> PositiveURLS.
>
> *****************************************************************
> In NegativeURLS, I have
> www.hurriyet.com.tr
>
> In PositiveURLS, I have www.milliyet.com.tr
>
> *****************************************************************
> In the input directory for the crawl operation, I have
> www.hurriyet.com.tr
> www.milliyet.com.tr
>
> I run the following commands from the shell:
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
>
> Then I run inject, generate and fetch. After that I run the following:
> $ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
>
> Finally I run GenericReader and print the output. It contains the
> URLs that are in the blacklist.
> What can be the problem?
The black/white list works only in the update process (BWUpdateDb),
not during fetching or generating. Only the white URLs will be
updated into the crawldb.
Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
pages from this host as well? And what is the status of these URLs
(STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?
Marko
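To illustrate what "works only in the update process" means, here is a minimal sketch. It is not the actual BWUpdateDb code: the class name UpdateTimeFilter, the exact host matching, and the rule that a black entry wins over a white one are all assumptions made for illustration. The point it shows is that every URL can still be injected, generated and fetched, and the lists only decide what gets written back on update.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the BW plugin's real implementation. */
public class UpdateTimeFilter {
    private final Set<String> whiteHosts;
    private final Set<String> blackHosts;

    public UpdateTimeFilter(Set<String> whiteHosts, Set<String> blackHosts) {
        this.whiteHosts = whiteHosts;
        this.blackHosts = blackHosts;
    }

    /** Returns only the URLs that would be written to the crawldb on update. */
    public List<String> filterForUpdate(List<String> candidates) {
        List<String> accepted = new ArrayList<>();
        for (String url : candidates) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no host part, nothing to match against
            }
            if (blackHosts.contains(host)) {
                continue; // assumption: a black entry always wins
            }
            if (whiteHosts.contains(host)) {
                accepted.add(url); // only white hosts reach the crawldb
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        UpdateTimeFilter filter = new UpdateTimeFilter(
                Set.of("www.milliyet.com.tr"),
                Set.of("www.hurriyet.com.tr"));
        List<String> updated = filter.filterForUpdate(List.of(
                "http://www.milliyet.com.tr/index.html",
                "http://www.hurriyet.com.tr/index.html",
                "http://example.org/"));
        // prints [http://www.milliyet.com.tr/index.html]
        System.out.println(updated);
    }
}
```

With exact matching as in this sketch, "hurriyet.com.tr" and "www.hurriyet.com.tr" are different keys. Whether the real plugin normalizes hosts is not stated in this thread.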
|
|
| WhiteListBlackList |

|
2006-05-22 16:22:43 |
Marko Bauhardt wrote:
> The black/white list works only in the update process (BWUpdateDb),
> not during fetching or generating. Only the white URLs will be
> updated into the crawldb.
>
> Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
> pages from this host as well? And what is the status of these URLs
> (STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?
>
> Marko
The crawldb contains the following:
http://hurriyet.com.tr/
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
http://milliyet.com.tr/
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Both of them are DB_unfetched.
The positive URL is http://milliyet.com.tr
It is in ~/URL/PositiveURLS/Positive.txt
The negative URL is http://hurriyet.com.tr
It is in ~/URL/NegativeURLS/Negative.txt
I run the following inject commands:
./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
After the fetch command with the parsing option, I run the
following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
Any suggestions about the two DB_unfetched entries? I expected one
of them to be fetched.
|
|
| Run-Time Error |

|
2006-05-23 09:37:46 |
Hi everybody, I am running Nutch 0.8 under Windows using Eclipse
and I got the following error. I added the conf directory to my
classpath. I changed nutch-site.xml and added the regex-url filter
there. What can be the reason for the following error?
java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
    at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
    at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
    at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:130)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
|
|
| Changing db data |

|
2006-05-23 10:09:34 |
Hi,
I'm writing a small utility to amend the data in the Nutch
database. I managed to read the Nutch database, and I can also
delete a document from it, but is there a way to change the value
of a field in the Nutch db?
If you could just point me in the right direction: I have spent a
lot of time reading the Lucene and Nutch APIs. I can create a db
from scratch and add data, but I cannot change anything. Any ideas?
Thanks in advance
Bogdan
|
|
| Run-Time Error |

|
2006-05-26 09:56:12 |
Did you add the plugins directory to your classpath, and does it
contain all of your plugins?
Rgrds, Thomas
On 5/23/06, Murat Ali Bayir <murat.bayir agmlab.com> wrote:
> Hi everybody, I am running Nutch 0.8 under Windows using Eclipse
> and I got the following error. I added the conf directory to my
> classpath. I changed nutch-site.xml and added the regex-url filter
> there. What can be the reason for the following error?
>
> java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
>     at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
>     at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
>     at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>     at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>     at org.apache.nutch.crawl.Injector.inject(Injector.java:130)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
|
|
| Run-Time Error |

|
2006-05-26 13:09:31 |
In the launcher, under Classpath, you will need to add the
directory above plugins. Make sure this is done in the Eclipse
launcher, though. Setting it on the project won't help.
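A quick way to check this from inside the same Eclipse launch configuration is sketched below. It assumes the plugin folder is looked up by a relative name through the class loader, and that this name is "plugins"; both are assumptions based on the advice above, not something confirmed from the Nutch source. PluginFolderCheck is a hypothetical helper, not part of Nutch.

```java
import java.net.URL;
import java.util.Optional;

/** Hypothetical diagnostic helper; not part of Nutch. */
public class PluginFolderCheck {

    /** Looks up a relative resource name the way a class loader would. */
    public static Optional<URL> resolveOnClasspath(String name) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // getResource returns null when no classpath entry contains the name
        return Optional.ofNullable(loader.getResource(name));
    }

    public static void main(String[] args) {
        // "plugins" is an assumed default; pass another name as first argument
        String folder = args.length > 0 ? args[0] : "plugins";
        Optional<URL> location = resolveOnClasspath(folder);
        if (location.isPresent()) {
            System.out.println(folder + " resolves to " + location.get());
        } else {
            System.out.println(folder + " is not visible on the classpath; "
                    + "add its parent directory to the launch configuration");
        }
    }
}
```

Run it with the same launcher you use for the crawl. If it reports that the folder is not visible, the launcher classpath is the first thing to fix.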
TDLN wrote:
> Did you add the plugins directory to your classpath, and does it
> contain all of your plugins?
>
> Rgrds, Thomas
|
|