For a hard-to-find, not-yet-reproducible bug reported by
several users, you could log their HTTP traffic for a while
and ask them to tell you right away when it happens again.
Then you could look for clues in the HTTP transactions
around that time (something wrong, something extra
present, or something missing).
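
A rough sketch of that "look around the time of the report" step, in Python. The log entry fields (time, user, method, path, status), the five-minute window, and the sample data are all made up for illustration; a real log would need its own parsing first.

```python
from datetime import datetime, timedelta

def entries_near(entries, reported_at, window=timedelta(minutes=5)):
    """Return the log entries whose timestamp is within `window` of `reported_at`."""
    return [e for e in entries if abs(e["time"] - reported_at) <= window]

def error_responses(entries):
    """Pick out entries with a 4xx/5xx status as a first place to look."""
    return [e for e in entries if e["status"] >= 400]

# Hypothetical, already-parsed log entries for one user.
log = [
    {"time": datetime(2024, 1, 1, 12, 0, 0), "user": "alice",
     "method": "GET", "path": "/tags", "status": 200},
    {"time": datetime(2024, 1, 1, 12, 58, 30), "user": "alice",
     "method": "POST", "path": "/tags/save", "status": 500},
    {"time": datetime(2024, 1, 1, 13, 0, 10), "user": "alice",
     "method": "GET", "path": "/tags", "status": 200},
    {"time": datetime(2024, 1, 1, 14, 0, 0), "user": "alice",
     "method": "GET", "path": "/tags", "status": 200},
]

# The moment the user says the bug happened again.
reported_at = datetime(2024, 1, 1, 13, 0, 0)

nearby = entries_near(log, reported_at)
for entry in error_responses(nearby):
    print(entry["time"], entry["method"], entry["path"], entry["status"])
```

An error status only covers the "something wrong" case. Spotting "something extra" or "something missing" would probably mean comparing the nearby entries against the same user's traffic from a session where everything worked.
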
Of course, there might not be any clues there if it's
a database glitch. For that you need a stable of
continuous unit tests running against each server.
It wouldn't hurt to have automated continuous unit tests
testing through the web interface, too.
You probably already know/have these things, but I just
thought I'd share anyway. 
Do you guys do any high-level abstract visualization
of the servers and the flow of data? Heatmaps,
Geiger-counter ticks per connection, graphs, turning the
soundex() of each tag into a musical note, that sort
of thing? I think that would be fun.
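
To make the soundex-to-note idea a bit more concrete, here is a minimal Python sketch. The soundex() below is the classic American Soundex (one letter plus three digits), which may differ in detail from whatever soundex() your stack provides. The mapping onto a pentatonic scale and the name tag_to_midi_note are purely my own invention for illustration.

```python
# Classic American Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of `word`, or "" if it has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]

# MIDI note numbers for C4, D4, E4, G4, A4 (an arbitrary, pleasant-sounding choice).
PENTATONIC = [60, 62, 64, 67, 69]

def tag_to_midi_note(tag):
    """Map a tag to a MIDI note number via its Soundex code (hypothetical mapping)."""
    code = soundex(tag)
    if not code:
        return None
    step = int(code[1:]) % len(PENTATONIC)      # the digits pick the scale degree
    octave = (ord(code[0]) - ord("A")) % 3      # the first letter picks one of 3 octaves
    return PENTATONIC[step] + 12 * octave

for tag in ("Robert", "Rupert", "database", "server"):
    print(tag, soundex(tag), tag_to_midi_note(tag))
```

Tags that sound alike share a Soundex code, so they land on the same note, which is sort of the point: similar tags would blur into one tone while unusual ones stand out.
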
Regards,
Trip