Louis Guillaume writes:
> Greg Troxel wrote:
>
> > The other hypothesis is that raidframe is buggy
>
> I'm starting to believe that there is something funky going on inside of
> raidframe itself. I've seen this behaviour on different hardware,
> different disks different memory different power supplies. Perhaps there
> is a way to identify what types of hardware have these problems.
In the past 8.5 years, 99% of "raidframe bugs" have been hardware
issues or "something other than RAIDframe". I don't know how many
hours I've hunted for RAIDframe problems that weren't really there :-}

That said, if this is a RAIDframe issue, I'm more than happy to help
track it down and fix it...
> I've been using my SATA drives for a month or so now with one side of
> the RAID1 failed like this...
>
> Components:
> /dev/wd0a: optimal
> /dev/wd1a: failed
> No spares.
>
> ... and it is flawless.
>
> I am extremely confident and I promise you that if I reconstruct this
> array, I'll see the corruption.
>
> Another interesting manifestation of the corruption was when my wife
> started using Electric Sheep as her screen saver. Her home directory is
> on a NFS-exported filesystem on this same array (raid1).
>
> Before the last reconstruct I backed everything up. After
> reconstruction, each new sheep her machine downloaded showed strange
> artifacts and some had a kind of "scrambled" look. But everything
> worked, strangely enough.
>
> After failing wd1a and restoring from backup, all of her sheep work
> normally.
>
> I've had this same problem on a Pentium Pro, Pentium III and Athlon
> systems. Have swapped drives, cables, memory, power supplies. The
> constant is the way the system is used: a file server, sharing home
> directories and other stuff over NFS and netatalk to NetBSD, Linux and
> Mac systems.
>
> The next thing I will do is attempt to reproduce the problem on a
> completely different machine that hasn't been involved in any of this
> and see what happens.
>
> Any other ideas around where I can go from here would be great. Also if
> anyone is interested in trying to reproduce the problem it would
> certainly rule ME out as the problem
With the array in degraded mode, can you mount /dev/wd1a (or
equivalent) as a filesystem, and run a series of stress-tests on
that, at the same time that you stress the RAID set? Something like:
  foreach i (`jot 1000`)
    ( cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i ) &
    sleep 10
    ( dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i ) &
    sleep 10
    dd if=src.tar.gz of=/dev/null bs=10m &
  end
that end up running on both wd0a and wd1a at the same time. In an
ideal world, take RAIDframe out of the equation entirely, and push
the disks, both reads and writes... (If you have an area reserved for
swap on both, you could disable swap, and use that space). And then
once the disks are "busy", do something like extract src.tar.gz to
both wd0a and wd1a, and compare the bits as extracted and see if
there are differences. (You'll need to tune things so you don't run
out of space, of course)
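The extract-and-compare step could look roughly like the sketch below. It is
written in plain sh rather than csh, and the function name compare_extracts,
the "extract" subdirectory, and the two directory arguments are all
hypothetical; substitute whatever mount points you use for wd0a and wd1a. It
only defines a function, so nothing touches the disks until you call it.

```shell
#!/bin/sh
# Sketch only: extract one tarball onto two filesystems at the same time
# and compare the resulting trees.
# Usage: compare_extracts TARBALL DIR_A DIR_B
# DIR_A and DIR_B are assumed to be directories on the two disks under test
# (hypothetical mount points, not anything RAIDframe sets up for you).
compare_extracts() {
    tarball=$1
    dir_a=$2
    dir_b=$3

    mkdir -p "$dir_a/extract" "$dir_b/extract" || return 2

    # Run both extractions in the background so both disks are written
    # concurrently, then collect each exit status.
    tar -xzf "$tarball" -C "$dir_a/extract" &
    pid_a=$!
    tar -xzf "$tarball" -C "$dir_b/extract" &
    pid_b=$!
    wait "$pid_a"; rc_a=$?
    wait "$pid_b"; rc_b=$?
    if [ "$rc_a" -ne 0 ] || [ "$rc_b" -ne 0 ]; then
        echo "extraction failed"
        return 2
    fi

    # diff -r exits non-zero if any file differs or exists on one side only.
    if diff -r "$dir_a/extract" "$dir_b/extract" >/dev/null; then
        echo "trees match"
        return 0
    else
        echo "trees differ"
        return 1
    fi
}
```

Called as, say, compare_extracts src.tar.gz /mnt/a /mnt/b while the stress
loop is running. One caveat: right after extraction the comparison may be
served from the buffer cache rather than the platters, so unmounting and
remounting both filesystems before a second diff -r is probably worthwhile.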
I suspect it's a drive controller issue (or driver issue) that only
manifests itself when you push both channels really hard...
Later...
Greg Oster