
Thread: Re: extracting urls into text files




Re: extracting urls into text files
2007-03-20 04:18:47
First of all thanks for your reply.

I am really confused, pardon me.
I don't know whether I need to put the given code in a new class in the
nutch directory. Do I have to import other classes or packages? Is there
anything else I need to take care of?

I have tried creating a new separate class in the nutch directory, but it
gives lots of errors about packages/classes not being found. I am still
trying to figure out what is wrong there.

Secondly, how will I be able to read the URLs from the crawldb once the
class is running? I have no idea how to figure that out.

How can I write the URLs out in some XML format, i.e.
<url>
    <loc>http://www.example.com/</loc>
  </url>
<url>
    <loc>http://www.example1.com/</loc>
  </url>
...........
So can you please explain how I should do this?

Thanks a lot for your time..

Cheers,
Cha

Enis Soztutar wrote:
> 
> cha wrote:
>> Thanks Enis,
>>
>> I am getting some idea from that.
>> Can you tell me in which class I should implement that?
>> I don't have Hadoop installed on my box.
>>
>>   
> Just make a new class in nutch and put the code there : ) As long as
> you have the hadoop jar in your classpath, you do not need to check out
> the hadoop codebase.
> 
> 
> 



Re: extracting urls into text files
2007-03-20 07:12:52
cha wrote:
> First of all thanks for your reply.
>   
You're welcome.

> I am really confused, pardon me.
> I don't know whether I need to put the given code in a new class in the
> nutch directory. Do I have to import other classes or packages? Is there
> anything else I need to take care of?
>   
I suggest you download Eclipse, then set up the project using the
tutorial on the Nutch wiki called running nutch on eclipse. Then, for
example, create a new class in the org.apache.nutch.tools package and
paste the previously mentioned code into it.

    // here fs is an instance of FileSystem, seqFile is a Path to the crawldb
    MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);

Then, in the loop, change the line

out.println(key);

to

out.println("<url><loc>" + key + "</loc></url>");
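
One caveat with plain string concatenation is that a URL containing a character such as & would produce output that is not well-formed XML. The sketch below shows one way to handle that using only the standard Java library. The class name UrlXmlWriter and its helper methods are hypothetical names made up for illustration, and a plain list of strings stands in for the keys read from the crawldb, so no Hadoop classes are involved here.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Nutch or Hadoop.
public class UrlXmlWriter {

    // Escape the characters that are special in XML text content.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap one key (printed via its string form) in <url><loc>...</loc></url>.
    static String toUrlElement(Object key) {
        return "<url><loc>" + escapeXml(String.valueOf(key)) + "</loc></url>";
    }

    // Write one element per key, one per line.
    static void writeAll(Iterable<?> keys, PrintWriter out) {
        for (Object key : keys) {
            out.println(toUrlElement(key));
        }
        out.flush();
    }

    public static void main(String[] args) {
        // Assumption: these strings stand in for the keys read from the crawldb.
        List<String> keys = Arrays.asList(
                "http://www.example.com/",
                "http://www.example1.com/?a=1&b=2");
        writeAll(keys, new PrintWriter(System.out));
    }
}
```

In the reader loop, the call would then become out.println(UrlXmlWriter.toUrlElement(key)); instead of the concatenation shown above.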

> I have tried creating a new separate class in the nutch directory, but it
> gives lots of errors about packages/classes not being found. I am still
> trying to figure out what is wrong there.
>
> Secondly, how will I be able to read the URLs from the crawldb once the
> class is running? I have no idea how to figure that out.
>
> How can I write the URLs out in some XML format, i.e.
> <url>
>     <loc>http://www.example.com/</loc>
>   </url>
> <url>
>     <loc>http://www.example1.com/</loc>
>   </url>
> ...........
> So can you please explain how I should do this?
>
> Thanks a lot for your time..
>   
Well, there is nothing more I can do except write the code myself : )
You can first try to become more familiar with Java programming if need
be.
Good luck.
