Thread: ptrace (PT_STEP) causes 2 instruction step???




ptrace (PT_STEP) causes 2 instruction step???
user name
2006-11-23 04:22:27
Hello,

I was wondering if anyone had seen something like this before.
We're running GDB on an x86_64 chip running Linux. I've seen this
behavior on SuSE and RedHat distributions. It's a bit weird because
the failures started appearing one day, even with our old releases
(i.e. tests that used to pass with a given release now fail on the
same machine, with no update done).

This reproduces with all versions of GDB that I have tested: GDB 6.4
built by us, GDB 6.4 built by SuSE, and GDB from today's CVS.

Here are the symptoms: I have a program where we're stopped at one
instruction of a function. This is the return address from a function
call, where we landed after doing a "finish". I simulated this part
by inserting a breakpoint at that address and running to it. After
that, I do a "next" and here is what I see:

    (gdb) b *0x401e41
    Breakpoint 1 at 0x401e41: file x.adb, line 9.
    (gdb) run
    Starting program: /[...]/x 
    
    Breakpoint 1, 0x0000000000401e41 in x () at x.adb:9
    9          Z : constant Num := F;
    (gdb) n
    0x0000000000401e4f in x () at x.adb:13
    13      end X;

The unexpected part is that GDB did not stop at the beginning of
a line, as evidenced by the address printed after the "next" has
completed.

Here is the assembly code:

        0x00401e31 <_ada_x+0>:  push   %rbp
        0x00401e32 <_ada_x+1>:  mov    %rsp,%rbp
        0x00401e35 <_ada_x+4>:  sub    $0x10,%rsp
        0x00401e39 <_ada_x+8>:  mov    %rbp,%r10
    [line 9 starts here]
        0x00401e3c <_ada_x+11>: callq  0x401e14 <x__f.0>
        0x00401e41 <_ada_x+16>: movsd  %xmm0,0xfffffffffffffff0(%rbp)
        0x00401e46 <_ada_x+21>: mov    0xfffffffffffffff0(%rbp),%rax
        0x00401e4a <_ada_x+25>: mov    %rax,0xfffffffffffffff8(%rbp)
    [line 13 starts here]
        0x00401e4e <_ada_x+29>: leaveq
        0x00401e4f <_ada_x+30>: retq

I expected GDB to stop at 0x00401e4e, which is the first instruction
of line 13.

At first sight, it looks like a malfunction of the kernel, because
"set debug infrun 1" allows us to see how we get there:

    infrun: proceed (addr=0xffffffffffffffff, signal=144, step=1)
    infrun: resume (step=1, signal=0)
    infrun: wait_for_inferior
    infrun: infwait_normal_state
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x401e4a
    infrun: trap expected
    infrun: stepping inside range [0x401e39-0x401e4e]
    infrun: resume (step=1, signal=0)
    infrun: prepare_to_wait
    infrun: infwait_normal_state
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x401e4f
    infrun: stepped to a different function
    infrun: stop_stepping
    0x0000000000401e4f in x () at x.adb:13

As you can see, each "resume (step=1,...)" causes the inferior
to step *two* instructions instead of one. I looked at the code
and traced it, and we seem to be doing everything right: the
resume operation is turned into a call to "ptrace (PT_STEP, ...)"
with the right arguments. It's then followed by a call to "wait".
After the inferior stops, we find that we're 2 instructions later.
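
The effect of a double step on range stepping can be modeled with a
small C sketch. It is only an illustration, not GDB's actual infrun
code: the helper names (index_of, step_out_of_range) and the
insns_per_step parameter are hypothetical, the instruction addresses
are copied from the disassembly above, and the range
[0x401e39-0x401e4e] is the one reported in the debug log, treated here
as excluding its end address (an assumption). The model only covers
straight-line code.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction start addresses of _ada_x, copied from the disassembly. */
static const unsigned long insn_addr[] = {
    0x401e31, 0x401e32, 0x401e35, 0x401e39, 0x401e3c,
    0x401e41, 0x401e46, 0x401e4a, 0x401e4e, 0x401e4f
};
static const size_t insn_count = sizeof insn_addr / sizeof insn_addr[0];

/* Hypothetical helper: position of `pc` in the instruction table. */
static size_t index_of(unsigned long pc)
{
    for (size_t i = 0; i < insn_count; i++)
        if (insn_addr[i] == pc)
            return i;
    assert(!"pc is not an instruction boundary");
    return 0;
}

/* Simplified model of "next" over straight-line code: keep stepping
   while the pc is inside [lo, hi).  insns_per_step = 1 models a
   correct single step; 2 models the misbehavior described above.
   Returns the pc at which stepping stops. */
static unsigned long step_out_of_range(unsigned long pc,
                                       unsigned long lo,
                                       unsigned long hi,
                                       size_t insns_per_step)
{
    size_t i = index_of(pc);

    while (insn_addr[i] >= lo && insn_addr[i] < hi) {
        /* The model has no notion of what follows the function. */
        assert(i + insns_per_step < insn_count);
        i += insns_per_step;
    }
    return insn_addr[i];
}
```

Starting from 0x401e41 with the range from the log, an insns_per_step
of 1 stops at 0x401e4e, the expected start of line 13. With 2, the
model passes through 0x401e4a and stops at 0x401e4f, which matches the
two stop_pc values in the log.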

The behavior is actually relatively unpredictable. Sometimes, it
works fine.
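
Since the problem is intermittent, one way to check the kernel
independently of GDB might be a small standalone tracer. The sketch
below assumes Linux, where PT_STEP is spelled PTRACE_SINGLESTEP. The
function names single_step_child and read_pc are made up for
illustration, and the program counter is only read on x86_64 (on other
architectures this sketch records 0). It forks a child, single-steps
it a few times, and records the PC after each step so the values can
be compared against a disassembly of the stepped code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: read the stopped child's program counter.
   Only x86_64 is handled; other targets yield 0 in this sketch. */
static unsigned long long read_pc(pid_t pid)
{
#if defined(__x86_64__)
    struct user_regs_struct regs;

    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
        return 0;
    return regs.rip;
#else
    (void)pid;
    return 0;
#endif
}

/* Fork a traced child, single-step it up to max_steps times and store
   the PC seen after each step in pcs[].  Returns the number of steps
   completed, or -1 if the child could not be set up. */
static int single_step_child(int max_steps, unsigned long long *pcs)
{
    int status, steps = 0, alive = 1;
    pid_t pid;

    assert(max_steps >= 0 && pcs != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: ask to be traced, then stop so the parent takes over. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    /* Parent: wait for the initial SIGSTOP. */
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    while (steps < max_steps) {
        if (ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL) == -1)
            break;
        if (waitpid(pid, &status, 0) == -1)
            break;
        if (!WIFSTOPPED(status)) {   /* child exited or was killed */
            alive = 0;
            break;
        }
        pcs[steps++] = read_pc(pid);
    }

    if (alive) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
    }
    return steps;
}
```

A caller would pass an array of at least max_steps entries and print
the recorded values. On a healthy kernel, consecutive PCs within
straight-line code should differ by exactly one instruction; a gap
covering two instructions would point at the kernel (or something
underneath it) rather than at GDB.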

I searched the internet a bit, and apparently this type of error
was reported a while ago. Unfortunately, I lost the link, but the
reports said that the problem only occurred in a very specific
case, which is not the case here...

Has anyone seen this before? Any clue? Surprisingly, all our
x86_64-linux machines started showing these symptoms on the same
day.  All except one, which keeps working fine.

-- 
Joel
ptrace (PT_STEP) causes 2 instruction step???
user name
2006-11-23 08:35:20
> Hello,
>
> I was wondering if anyone had seen something like this before.
> We're running GDB on an x86_64 chip running Linux. I've seen this
> behavior on SuSE and RedHat distributions. It's a bit weird because
> the failures started appearing one day, even with our old releases
> (i.e. tests that used to pass with a given release now fail on the
> same machine, with no update done).
>
> This reproduces with all versions of GDB that I have tested: GDB 6.4
> built by us, GDB 6.4 built by SuSE, and GDB from today's CVS.
>
> Here are the symptoms: I have a program where we're stopped at one
> instruction of a function. This is the return address from a function
> call, where we landed after doing a "finish". I simulated this part
> by inserting a breakpoint at that address and running to it. After
> that, I do a "next" and here is what I see:
>
>     (gdb) b *0x401e41
>     Breakpoint 1 at 0x401e41: file x.adb, line 9.
>     (gdb) run
>     Starting program: /[...]/x
>
>     Breakpoint 1, 0x0000000000401e41 in x () at x.adb:9
>     9          Z : constant Num := F;
>     (gdb) n
>     0x0000000000401e4f in x () at x.adb:13
>     13      end X;
>
> The unexpected part is that GDB did not stop at the beginning of
> a line, as evidenced by the address printed after the "next" has
> completed.
>
> Here is the assembly code:
>
>         0x00401e31 <_ada_x+0>:  push   %rbp
>         0x00401e32 <_ada_x+1>:  mov    %rsp,%rbp
>         0x00401e35 <_ada_x+4>:  sub    $0x10,%rsp
>         0x00401e39 <_ada_x+8>:  mov    %rbp,%r10
>     [line 9 starts here]
>         0x00401e3c <_ada_x+11>: callq  0x401e14 <x__f.0>
>         0x00401e41 <_ada_x+16>: movsd  %xmm0,0xfffffffffffffff0(%rbp)
>         0x00401e46 <_ada_x+21>: mov    0xfffffffffffffff0(%rbp),%rax
>         0x00401e4a <_ada_x+25>: mov    %rax,0xfffffffffffffff8(%rbp)
>     [line 13 starts here]
>         0x00401e4e <_ada_x+29>: leaveq
>         0x00401e4f <_ada_x+30>: retq
>
> I expected GDB to stop at 0x00401e4e, which is the first instruction
> of line 13.
>
> At first sight, it looks like a malfunction of the kernel, because
> "set debug infrun 1" allows us to see how we get there:
>
>     infrun: proceed (addr=0xffffffffffffffff, signal=144, step=1)
>     infrun: resume (step=1, signal=0)
>     infrun: wait_for_inferior
>     infrun: infwait_normal_state
>     infrun: TARGET_WAITKIND_STOPPED
>     infrun: stop_pc = 0x401e4a
>     infrun: trap expected
>     infrun: stepping inside range [0x401e39-0x401e4e]
>     infrun: resume (step=1, signal=0)
>     infrun: prepare_to_wait
>     infrun: infwait_normal_state
>     infrun: TARGET_WAITKIND_STOPPED
>     infrun: stop_pc = 0x401e4f
>     infrun: stepped to a different function
>     infrun: stop_stepping
>     0x0000000000401e4f in x () at x.adb:13
>
> As you can see, each "resume (step=1,...)" causes the inferior
> to step *two* instructions instead of one. I looked at the code
> and traced it, and we seem to be doing everything right: the
> resume operation is turned into a call to "ptrace (PT_STEP, ...)"
> with the right arguments. It's then followed by a call to "wait".
> After the inferior stops, we find that we're 2 instructions later.
>
> The behavior is actually relatively unpredictable. Sometimes, it
> works fine.
>
> I searched the internet a bit, and apparently this type of error
> was reported a while ago. Unfortunately, I lost the link, but the
> reports said that the problem only occurred in a very specific
> case, which is not the case here...
>
> Has anyone seen this before? Any clue? Surprisingly, all our
> x86_64-linux machines started showing these symptoms on the same
> day.  All except one, which keeps working fine.

This must be a kernel bug of some sort.  Was the kernel on those
machines updated?

Are you perhaps running vmware on those machines?  My amd64 box at
work seems to do something similar from time to time when I have it
running (random test failures), but everything seems normal if I
close vmware.

Anyway, it is almost certainly something we (GDB developers) can't do
anything about.

Mark

ptrace (PT_STEP) causes 2 instruction step???
user name
2006-11-23 16:13:20
On Wed, Nov 22, 2006 at 08:22:27PM -0800, Joel Brobecker wrote:
> I searched the internet a bit, and apparently this type of error
> has happened a while ago. Unfortunately, I lost the link, but the
> reports were saying that the problem they saw only occurred in a
> very specific case, which is not the case here...

That was over syscalls, I think.

> Has anyone seen this before? Any clue? Surprisingly, all our
> x86_64-linux machines started showing these symptoms on the same
> day.  All except one, which keeps working fine.

Like Mark, I'm suspicious of updates.  I've never seen this behavior
before.

-- 
Daniel Jacobowitz
CodeSourcery
ptrace (PT_STEP) causes 2 instruction step???
user name
2006-11-24 17:54:11
Hi Mark,

> This must be a kernel bug of some sort.  Was the kernel on those
> machines updated?

That's what I figured. The kernels have not been upgraded. The kernel
version varies with the distribution and with the release of that
distribution.

> Are you perhaps running vmware on those machines?  My amd64 box at
> work seems to do something similar from time to time when I have it
> running (random test failures), but everything seems normal if I
> close vmware.

That was a good suggestion, but we're not running vmware. I'm trying
to get one of our sysadmins to upgrade the kernel.

> Anyway, it is almost certainly something we (GDB developers) can't do
> anything about.

That's what I thought too, but I was wondering if someone else knew
more about this issue. I'll let everyone know if a kernel upgrade
helps.

Thanks to everyone for the feedback,
-- 
Joel