
Thread: Re: raidframe problems (revisited)




Re: raidframe problems (revisited)
Canada
2007-05-28 17:50:03
Louis Guillaume writes:
> Greg Troxel wrote:
> 
> > The other hypothesis is that raidframe is buggy
> 
> I'm starting to believe that there is something funky going on inside of
> raidframe itself. I've seen this behaviour on different hardware,
> different disks different memory different power supplies. Perhaps there
> is a way to identify what types of hardware have these problems.

In the past 8.5 years, 99% of "raidframe bugs" have been hardware
issues or "something other than RAIDframe".  I don't know how many
hours I've hunted for RAIDframe problems that weren't really there :-}

That said, if this is a RAIDframe issue, I'm more than happy to help
track it down and fix it...

> I've been using my SATA drives for a month or so now with one side of
> the RAID1 failed like this...
> 
> Components:
>            /dev/wd0a: optimal
>            /dev/wd1a: failed
> No spares.
> 
> ... and it is flawless.
> 
> I am extremely confident and I promise you that if I reconstruct this
> array, I'll see the corruption.
> 
> Another interesting manifestation of the corruption was when my wife
> started using Electric Sheep as her screen saver. Her home directory is
> on a NFS-exported filesystem on this same array (raid1).
> 
> Before the last reconstruct I backed everything up. After
> reconstruction, each new sheep her machine downloaded showed strange
> artifacts and some had a kind of "scrambled" look. But everything
> worked, strangely enough.
> 
> After failing wd1a and restoring from backup, all of her sheep work
> normally.
> 
> I've had this same problem on a Pentium Pro, Pentium III and Athlon
> systems. Have swapped drives, cables, memory, power supplies. The
> constant is the way the system is used: a file server, sharing home
> directories and other stuff over NFS and netatalk to NetBSD, Linux and
> Mac systems.
> 
> The next thing I will do is attempt to reproduce the problem on a
> completely different machine that hasn't been involved in any of this
> and see what happens.
> 
> Any other ideas around where I can go from here would be great. Also if
> anyone is interested in trying to reproduce the problem it would
> certainly rule ME out as the problem

With the array in degraded mode, can you mount /dev/wd1a (or
equivalent) as a filesystem, and run a series of stress-tests on
that, at the same time that you stress the RAID set?  Something like:

  foreach i (`jot 1000`)
  cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i &
  sleep 10
  dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i &
  sleep 10
  dd if=src.tar.gz.$i of=/dev/null bs=10m &
  end

that end up running on both wd0a and wd1a at the same time.  In an
ideal world, take RAIDframe out of the equation entirely, and push
the disks, both reads and writes... (If you have an area reserved for
swap on both, you could disable swap, and use that space).  And then
once the disks are "busy", do something like extract src.tar.gz to
both wd0a and wd1a, and compare the bits as extracted and see if
there are differences.  (You'll need to tune things so you don't run
out of space, of course)
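
The extract-and-compare step could be sketched roughly as follows. This
is only an illustration, not a script from the thread: the function
names and the mount points in the usage line are assumptions, and it
presumes a tar that accepts -C (NetBSD's and GNU's both do).

```shell
#!/bin/sh
# Sketch: unpack one tarball onto two filesystems in parallel, then compare.
# All names here are hypothetical helpers, not part of any existing tool.

# extract_to TARBALL DIR: unpack TARBALL into DIR (created if missing).
extract_to() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# compare_trees DIR_A DIR_B: exit status 0 only if both trees are identical.
compare_trees() {
    diff -r "$1" "$2"
}

# extract_and_compare TARBALL DIR_A DIR_B
# Returns 2 if either extraction failed, otherwise the status of diff -r.
extract_and_compare() {
    extract_to "$1" "$2" &
    pid_a=$!
    extract_to "$1" "$3" &
    pid_b=$!
    wait "$pid_a"; rc_a=$?
    wait "$pid_b"; rc_b=$?
    [ "$rc_a" -eq 0 ] && [ "$rc_b" -eq 0 ] || return 2
    compare_trees "$2" "$3"
}
```

Usage would be something like
`extract_and_compare src.tar.gz /mnt/wd0a/x /mnt/wd1a/x`, with both
mount points being placeholders. A comparison run right after the
extraction may be answered from the buffer cache, so unmounting and
remounting both filesystems before calling compare_trees is more likely
to exercise the disks themselves.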

I suspect it's a drive controller issue (or driver issue) that only
manifests itself when you push both channels really hard...


Later...

Greg Oster



Re: raidframe problems (revisited)
United States
2007-05-29 07:24:50
Greg Oster wrote:

> With the array in degraded mode, can you mount /dev/wd1a (or
> equivalent) as a filesystem, and run a series of stress-tests on
> that, at the same time that you stress the RAID set?  Something like:
> 
>   foreach i (`jot 1000`)
>   cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i &
>   sleep 10
>   dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i &
>   sleep 10
>   dd if=src.tar.gz.$i of=/dev/null bs=10m &
>   end
> 
> that end up running on both wd0a and wd1a at the same time.  In an
> ideal world, take RAIDframe out of the equation entirely, and push
> the disks, both reads and writes... (If you have an area reserved for
> swap on both, you could disable swap, and use that space).  And then
> once the disks are "busy", do something like extract src.tar.gz to
> both wd0a and wd1a, and compare the bits as extracted and see if
> there are differences.  (You'll need to tune things so you don't run
> out of space, of course)

This is a great idea and I'll add it to my list of tests to try and
reproduce the problem.

> I suspect it's a drive controller issue (or driver issue) that only
> manifests itself when you push both channels really hard...
> 
Judging from your experience and what others have said about the
stability of raidframe I highly suspect the controller (or driver) too.
Especially since the RAID-1 set works fine with only one component! It's
not like the system doesn't have the right data in the buffers to write
out to disk. I don't believe the memory is the problem because it's been
replaced.

What hasn't been tested (by me) is maxing out the i/o on both channels
at the same time. So I will do this next...
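
One way to load both channels at once, without RAIDframe or a
filesystem in the path, might be to read from both disks in parallel.
The sketch below is an assumption-laden illustration: read_both is a
hypothetical helper, and the raw device names in the usage line are
guesses that will differ by port and setup. It only reads, so it does
not touch the data on either disk.

```shell
#!/bin/sh
# read_both SRC_A SRC_B [COUNT]
# Read COUNT blocks of 1 MiB (default 1024) from both sources at the same
# time and succeed only if both dd runs succeed. Hypothetical helper.
read_both() {
    count=${3:-1024}
    dd if="$1" of=/dev/null bs=1048576 count="$count" &
    pid_a=$!
    dd if="$2" of=/dev/null bs=1048576 count="$count" &
    pid_b=$!
    # Collect both exit statuses before deciding, so neither dd is orphaned.
    wait "$pid_a"; rc_a=$?
    wait "$pid_b"; rc_b=$?
    [ "$rc_a" -eq 0 ] && [ "$rc_b" -eq 0 ]
}
```

For example, `read_both /dev/rwd0d /dev/rwd1d 10000` would pull roughly
10 GB from each disk concurrently (device names assumed). Watching the
console and dmesg for controller or driver errors while it runs is the
point of the exercise; writes would need a scratch partition, such as
the disabled swap area mentioned earlier in the thread.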

Thanks!

Louis


Re: raidframe problems (revisited)
2007-05-29 10:13:26
On Tue, 29 May 2007 08:24:50 -0400
Louis Guillaume <lguillaumeberklee.edu> wrote:

> I don't believe the memory is the
> problem because it's been replaced.

Memory can be weird.  I had a situation where two different sticks of
memory each tested out just fine, alone -- but if both were in the
system, it didn't work.  (I used both formal memory tests -- memtest+
-- and informal -- building NetBSD to see if gcc SEGVed...)




		--Steve Bellovin, http://www.cs.columbia.edu/~smb

Re: raidframe problems (revisited)
United Kingdom
2007-06-01 11:25:42
On Mon, May 28, 2007 at 04:50:03PM -0600, Greg Oster wrote:
> With the array in degraded mode, can you mount /dev/wd1a (or
> equivalent) as a filesystem, and run a series of stress-tests on
> that, at the same time that you stress the RAID set?  Something like:
> 
>   foreach i (`jot 1000`)
>   cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i &
>   sleep 10
>   dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i &
>   sleep 10
>   dd if=src.tar.gz.$i of=/dev/null bs=10m &
>   end

I've modified the above like this:

#!/bin/sh
for i in `jot 1000`
do
	cp src.tar.gz src.tar.gz.$i &
	sleep 10
	(dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i) &
	sleep 10
	(dd if=src.tar.gz.$i of=/dev/null bs=10m && rm -f src.tar.gz.$i) &
	wait
done

I turned the unused RAID spare disk into a filesystem and ran the stress
test on the degraded RAID and the spare disk for over an hour. The machine
survived that stress test without any problems.

BTW: could this problem be related to the size of the disk? The RAID 1
     in question uses two 250GB IDE disks.

	Kind regards

-- 
Matthias Scheler                                  http://zhadum.org.uk/
