Thread: File system watching for intranets




File system watching for intranets
user name
2006-09-12 18:04:13
Hi all, our organization is using nutch on a documentation intranet that
changes every now and then. To keep the index up to date, we are recrawling
the whole thing every night. For an intranet this seems to be a workaround
at best. Our nutch crawler is on the same server as our content and a
simpler solution, IMO, would be to monitor file system events and just
recrawl the necessary pages each time something changes. That way our index
would always be up to date and there would be no reason to do a brute force
recrawl every night. I am willing to write this functionality and contribute
it to the community as I believe other organizations could benefit from this
as well, but since I am not as familiar with nutch as some of the folks
here, I have a few questions.
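
As a rough illustration of the "only recrawl what changed" idea, here is a
minimal Python sketch. It is an assumption-laden stand-in, not the proposed
tool: instead of receiving OS file system events (as JNotify does), it polls
the content tree and compares modification times and sizes between two
snapshots. The names snapshot and diff are hypothetical.

```python
import os


def snapshot(root):
    """Map every file under root to a (mtime_ns, size) signature."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff(old, new):
    """Return (changed_or_added, removed) as sorted lists of paths."""
    changed = sorted(p for p, sig in new.items() if old.get(p) != sig)
    removed = sorted(p for p in old if p not in new)
    return changed, removed
```

Taking a snapshot every few minutes and diffing it against the previous one
yields the files that need refetching and the ones that should drop out of
the index. Polling is simpler than event watching but costs a full directory
walk per pass.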

- Is this a solution to a nonexistent problem? I mean, is there a nice
solution using the tools already provided? I know each page is time stamped
in the database when it is fetched, but does this correspond to the last
modified date?

- Could this be done by using the existing generate/fetch/update cycle with
an index update? Is there a way to just fetch and index the pages necessary?
I suppose my tool could generate the fetch list(s) (I need to look into this
more closely).

- Are there any other libraries like JNotify to implement this functionality
that anyone knows about? I haven't found any others.

Any input/suggestions/additional questions/whatever on this subject is
appreciated as I would like to come up with a more optimal solution for us
intranet nutch users.

Ben
-- 
View this message in context: http://www.nabble.com/File-system-watching-for-intranets-tf2260463.html#a6271430
Sent from the Nutch - Dev forum at Nabble.com.

File system watching for intranets
user name
2006-09-13 07:51:52
Ben Ogle wrote:

>Hi all, our organization is using nutch on a documentation intranet that
>changes every now and then. To keep the index up to date, we are recrawling
>the whole thing every night. For an intranet this seems to be a workaround
>at best. Our nutch crawler is on the same server as our content and a
>simpler solution, IMO, would be to monitor file system events and just
>recrawl the necessary pages each time something changes. That way our index
>would always be up to date and there would be no reason to do a brute force
>recrawl every night. I am willing to write this functionality and contribute
>it to the community as I believe other organizations could benefit from this
>as well, but since I am not as familiar with nutch as some of the folks
>here, I have a few questions.
>
>- Is this a solution to a nonexistent problem?
>

I don't think there is any standardized way to do this yet, so every step
in this direction would be a great improvement.

> I mean, is there a nice
>solution using the tools already provided?
>

Not that I am aware of, but I guess other people have tackled this as well.

I think it would be nice to generate an RSS feed or something similar as a
fetchlist, which could also be accessed by other crawlers.

> I know each page is time stamped
>in the database when it is fetched, but does this correspond to the last
>modified date?
>

I am still not sure whether Nutch actually compares the last-modified
dates. I know there is something called "adddays", but that is more for
postponing re-crawling, e.g. for 30 days.

>- Could this be done by using the existing generate/fetch/update cycle with
>an index update? Is there a way to just fetch and index the pages necessary?
>I suppose my tool could generate the fetch list(s) (I need to look into this
>more closely).
>
>- Are there any other libraries like JNotify to implement this functionality
>that anyone knows about? I haven't found any others.
>

Does JNotify also implement protocols, e.g. HTTP, in order to notify across
networks, or does it only work locally?

Thanks

Michi

>Any input/suggestions/additional questions/whatever on this subject is
>appreciated as I would like to come up with a more optimal solution for us
>intranet nutch users.
>
>Ben
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                            http://lenya.apache.org
michael.wechnerwyona.com                        michiapache.org
+41 44 272 91 61

File system watching for intranets
user name
2006-09-13 20:53:58
JNotify is only local. A simple mapping of paths to http locations could be
provided in some config file to get around that. Also, I figure that in an
intranet situation, the admin setting up nutch owns all of the other servers
that will need to be fetched from, so (s)he could install nutch on all those
machines to run this tool.
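
The path-to-http mapping could look something like the Python sketch below.
Everything in it is an assumption for illustration: the PATH_TO_URL table
stands in for the config file, the directories and host names are made up,
and to_url is a hypothetical helper rather than anything Nutch provides.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Hypothetical config: local content roots and the URLs they are served under.
PATH_TO_URL = {
    "/var/www/docs": "http://intranet.example.com/docs",
    "/srv/wiki/export": "http://wiki.example.com",
}


def to_url(path, mapping=PATH_TO_URL):
    """Translate a local file path to its http location, or None if unmapped."""
    p = PurePosixPath(path)
    # Try the longest root first so nested roots win over their parents.
    for root in sorted(mapping, key=len, reverse=True):
        try:
            rel = p.relative_to(root)
        except ValueError:
            continue  # path is not under this root
        base = mapping[root].rstrip("/")
        if str(rel) == ".":
            return base
        # Percent-encode spaces and other unsafe characters, keep the slashes.
        return base + "/" + quote(rel.as_posix())
    return None
```

For example, to_url("/var/www/docs/guide/a b.html") gives
http://intranet.example.com/docs/guide/a%20b.html, while a path outside
every configured root gives None and can simply be ignored by the watcher.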

So the tool could be set up in a distributed intranet situation:

- admin sets up nutch similar to this:
  http://wiki.apache.org/nutch/NutchHadoopTutorial
- admin crawls and starts this file watcher tool on each machine that has
  searchable content

If I use a simple solution such as generating a fetch list when a file is
changed (or some amount of time after it's changed, to catch other changes),
then fetching and updating the db, my thought is that the tool would work as
follows:

- file changes on a slave node
  - slave node notifies the tool
  - tool starts a map/reduce job to generate fetch list, fetch, update, etc.
  - name node (master node?) would be notified of the change to the file
    system and index is updated
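
The "some amount of time after it's changed" part can be sketched in Python
as a small batcher that collects changed URLs and releases them only once
nothing new has arrived for a quiet period. This is just one possible shape,
with hypothetical names (ChangeBatcher, write_url_list) and an arbitrary
default of 30 seconds. How the resulting URL list is fed into
generate/fetch/update is deliberately left open.

```python
import time


class ChangeBatcher:
    """Collect changed URLs and release them after a quiet period."""

    def __init__(self, quiet_seconds=30.0, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self.clock = clock  # injectable so the logic can be tested
        self.pending = set()
        self.last_change = None

    def record(self, url):
        """Note a change; repeated changes to one URL are merged."""
        self.pending.add(url)
        self.last_change = self.clock()

    def take_ready(self):
        """Return the sorted batch if things have been quiet, else []."""
        if not self.pending:
            return []
        if self.clock() - self.last_change < self.quiet_seconds:
            return []  # changes are still arriving, keep waiting
        batch = sorted(self.pending)
        self.pending.clear()
        return batch


def write_url_list(urls, path):
    """Write one URL per line, as a plain-text list for a later fetch step."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

The watcher would call record() for every event and poll take_ready() on a
timer. Saving a file five times in a row then produces one fetch, not five.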
  
I don't really know how well that would work, though. Can slave nodes start
map/reduce jobs? Should they? Would the task be distributed among the other
nodes? Ideally, I suppose, the slave node should react in the following
manner:

- file changes on a slave node
  - slave node notifies the tool
  - tool notifies master node of update
  - master node starts map reduce job to do the update
    - this would properly distribute the task of doing the update, right?

With this scenario, I am not sure how (or if it's possible) to notify the
master node.
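
One conceivable way to do the notification, entirely outside of Hadoop, is
plain HTTP: the master runs a tiny listener and each slave POSTs its list of
changed URLs to it. The Python sketch below is only that, a sketch under
assumptions. The JSON-list payload, the function names and the in-memory
queue are all made up for illustration; nothing here is an existing Nutch or
Hadoop facility, and there is no authentication.

```python
import json
import queue
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# URLs reported by slaves, waiting for the master to schedule an update.
changed_urls = queue.Queue()


class ChangeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            urls = json.loads(self.rfile.read(length))
            if not isinstance(urls, list):
                raise ValueError("expected a JSON list of URLs")
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        for url in urls:
            changed_urls.put(str(url))
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet


def start_master(host="127.0.0.1", port=0):
    """Start the listener in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), ChangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def notify_master(master_url, urls):
    """Called on a slave: POST the changed URLs, return the HTTP status."""
    body = json.dumps(list(urls)).encode("utf-8")
    req = urllib.request.Request(
        master_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    # Talk to the master directly, ignoring any proxy configured in the env.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=10) as resp:
        return resp.status
```

The master would then drain changed_urls periodically and start the update
job itself, which keeps job submission in one place.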

So maybe it doesn't scale well, but for an intranet such as ours with one
machine doing it all (which is probably similar to a good majority of
intranets) it would provide a nice solution.

I hope there is more commentary on this topic, especially in a distributed
environment. I would like to come up with something that works in a good
range of intranet configurations.

Ben



