Thread: Apparent kernel bug with GDB on ppc405
|
|
| Apparent kernel bug with GDB on ppc405 |
  United States |
2007-10-24 14:46:40 |
I'm trying to debug a trivial statically-linked hello world program on
a Xilinx PPC 405 and I'm seeing the following behavior:

With direct gdb on target, I can set a breakpoint at main, run, and
the breakpoint is triggered.

With gdbserver and gdb with "target remote localhost:1234", the above
still works.

With gdb on target redirected to a PC and tunneled back
to the target, everything still works.

With gdb on a PC, execution continues past the breakpoint. The gdb
protocol streams here and in the previous (working) case are identical
up to the point of hitting the breakpoint (which never happens in this
case).

Raising the load on the PC to 4 and running gdb under nice -n 19 makes
things work again. So this begins to look like a kernel cache or
timing bug rather than a problem with the PC tool. It appears that the
breakpoint written to the executable at continue time is not visible
to the CPU at execute time.

My first suspicion was a dcache/icache coherency issue in
copy_to_user_page, so I added flush_dcache_icache_page(page) here to
no avail. On closer inspection, it looks like both icache and dcache
are already being flushed by flush_icache_user_range().

Adding printk(".") (or any printk) in this function fixes things
(serial console at 115k), while printk("") and udelay(100) do not.

Which still suggests an icache bug..?

Any suggestions?
--
"Love the dolphins," she advised him. "Write
by W.A.S.T.E.."
|
|
| Re: Apparent kernel bug with GDB on ppc405 |

|
2007-10-24 15:28:14 |
On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> I'm trying to debug a trivial statically-linked hello world program on
> a Xilinx PPC 405 and I'm seeing the following behavior:
>
<snip>
>
> Any suggestions?

http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202

I was fighting with a similar problem almost 2 years ago. Looks like
it might be related. At some point the problem seemed to go away and
I never determined what the root cause was. :-(

I haven't been using gdb lately, so I don't know if it's the same
problem. Nobody I had talked to had seen the issue on other 405
platforms. It could very well be something virtex-specific.

That's not much help, but maybe it will give you some clues.

Cheers,
g.
--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely secretlab.ca
(403) 399-0195
|
|
| Re: Apparent kernel bug with GDB on ppc405 |
  United States |
2007-10-24 15:34:16 |
Matt Mackall wrote:
> I'm trying to debug a trivial statically-linked hello world program on
> a Xilinx PPC 405 and I'm seeing the following behavior:
>
> With direct gdb on target, I can set a breakpoint at main, run, and
> the breakpoint is triggered.
>
> With gdbserver and gdb with "target remote localhost:1234", the above
> still works.
>
> With gdb on target redirected to a PC and tunneled back
> to the target, everything still works.
>
> With gdb on a PC, execution continues past the breakpoint. The gdb
> protocol streams here and in the previous (working) case are identical
> up to the point of hitting the breakpoint (which never happens in this
> case).
>
> Raising the load on the PC to 4 and running gdb under nice -n 19 makes
> things work again. So this begins to look like a kernel cache or
> timing bug rather than a problem with the PC tool. It appears that the
> breakpoint written to the executable at continue time is not visible
> to the CPU at execute time.
>
> My first suspicion was a dcache/icache coherency issue in
> copy_to_user_page, so I added flush_dcache_icache_page(page) here to
> no avail. On closer inspection, it looks like both icache and dcache
> are already being flushed by flush_icache_user_range().
>
First of all I have never used a similar configuration so this may be
totally off base. But...

If the icache is virtually indexed, then I think there are only two ways
to invalidate it. The first is from the context of the debugged process
where the page is mapped at the location the target program will see it.
If you try to invalidate from the context of the debugger, the page
will most likely not be mapped at the virtual address of the target
program so you might have to invalidate the *entire* icache.

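To make the virtual-indexing concern concrete, here is a toy sketch in plain user-space C (not kernel code) of how a virtually indexed cache picks a set. The geometry (16 KB, 2-way, 32-byte lines, hence 256 sets) and the helper names are assumptions chosen for illustration; none of them come from this thread. With 4 KB pages, that geometry puts one index bit above the page offset, so two virtual mappings of the same physical page can select different sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache geometry, for illustration only. */
#define TOY_LINE_BYTES 32u
#define TOY_NUM_SETS   256u   /* 16 KB / 2 ways / 32-byte lines */

/* Set selected by a virtual address in a virtually indexed cache. */
static unsigned toy_cache_set(uint32_t vaddr)
{
    return (vaddr / TOY_LINE_BYTES) % TOY_NUM_SETS;
}

/* True if invalidating through alias_va reaches the set target_va uses. */
static bool toy_invalidate_hits(uint32_t target_va, uint32_t alias_va)
{
    return toy_cache_set(target_va) == toy_cache_set(alias_va);
}
```

For example, toy_invalidate_hits(0x10000400u, 0xc0321400u) is false even though both addresses share the page offset 0x400, because bit 12 differs; an alias at 0xc0320400 would hit the same set. Both addresses are made up.
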
David Daney
|
|
| Re: Apparent kernel bug with GDB on ppc405 |
  United States |
2007-10-24 15:42:16 |
On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > I'm trying to debug a trivial statically-linked hello world program on
> > a Xilinx PPC 405 and I'm seeing the following behavior:
> >
> <snip>
> >
> > Any suggestions?
>
> http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
>
> I was fighting with a similar problem almost 2 years ago. Looks like
> it might be related. At some point the problem seemed to go away and
> I never determined what the root cause was. :-(
>
> I haven't been using gdb lately, so I don't know if it's the same
> problem. Nobody I had talked to had seen the issue on other 405
> platforms. It could very well be something virtex-specific.

Could be the same problem, but I'm seeing only your symptom 3 so far.

I've tried throwing some larger hammers at the problem. Flushing all
of the dcache and icache (flush_dcache_all and
flush_instruction_cache) isn't helping. But printk(".") does!

--
Mathematics is the supreme nostalgia of our time.
|
|
| Re: Apparent kernel bug with GDB on ppc405 |

|
2007-10-24 15:46:17 |
On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > I'm trying to debug a trivial statically-linked hello world program on
> > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > >
> > <snip>
> > >
> > > Any suggestions?
> >
> > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> >
> > I was fighting with a similar problem almost 2 years ago. Looks like
> > it might be related. At some point the problem seemed to go away and
> > I never determined what the root cause was. :-(
> >
> > I haven't been using gdb lately, so I don't know if it's the same
> > problem. Nobody I had talked to had seen the issue on other 405
> > platforms. It could very well be something virtex-specific.
>
> Could be the same problem, but I'm seeing only your symptom 3 so far.
>
> I've tried throwing some larger hammers at the problem. Flushing all
> of the dcache and icache (flush_dcache_all and
> flush_instruction_cache) isn't helping. But printk(".") does!

It's really true; printk *is* the most valuable tool kernel hackers
have for debugging.

g.
--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely secretlab.ca
(403) 399-0195
|
|
| Re: Apparent kernel bug with GDB on ppc405 |
  United States |
2007-10-24 16:54:22 |
On Wed, Oct 24, 2007 at 03:42:16PM -0500, Matt Mackall wrote:
> On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > I'm trying to debug a trivial statically-linked hello world program on
> > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > >
> > <snip>
> > >
> > > Any suggestions?
> >
> > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> >
> > I was fighting with a similar problem almost 2 years ago. Looks like
> > it might be related. At some point the problem seemed to go away and
> > I never determined what the root cause was. :-(
> >
> > I haven't been using gdb lately, so I don't know if it's the same
> > problem. Nobody I had talked to had seen the issue on other 405
> > platforms. It could very well be something virtex-specific.
>
> Could be the same problem, but I'm seeing only your symptom 3 so far.
>
> I've tried throwing some larger hammers at the problem. Flushing all
> of the dcache and icache (flush_dcache_all and
> flush_instruction_cache) isn't helping. But printk(".") does!

Well there was one remaining cache - the TLB. This patch seems to make
things work, but don't ask me why:

--- include/asm-ppc/cacheflush.h (revision 10439)
+++ include/asm-ppc/cacheflush.h (working copy)
@@ -11,6 +11,7 @@
 #define _PPC_CACHEFLUSH_H

 #include <linux/mm.h>
+#include <asm/tlbflush.h>

 /*
  * No cache flushing is required when address mappings are
@@ -35,10 +36,23 @@
 extern void flush_icache_user_range(struct vm_area_struct *vma,
                 struct page *page, unsigned long addr, int len);

 #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
 do { memcpy(dst, src, len); \
      flush_icache_user_range(vma, page, vaddr, len); \
+     _tlbia(); \
 } while (0)

--
Mathematics is the supreme nostalgia of our time.
|
|
| Re: Apparent kernel bug with GDB on ppc405 |

|
2007-10-24 17:27:52 |
On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> On Wed, Oct 24, 2007 at 03:42:16PM -0500, Matt Mackall wrote:
> > On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > > I'm trying to debug a trivial statically-linked hello world program on
> > > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > > >
> > > <snip>
> > > >
> > > > Any suggestions?
> > >
> > > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> > >
> > > I was fighting with a similar problem almost 2 years ago. Looks like
> > > it might be related. At some point the problem seemed to go away and
> > > I never determined what the root cause was. :-(
> > >
> > > I haven't been using gdb lately, so I don't know if it's the same
> > > problem. Nobody I had talked to had seen the issue on other 405
> > > platforms. It could very well be something virtex-specific.
> >
> > Could be the same problem, but I'm seeing only your symptom 3 so far.
> >
> > I've tried throwing some larger hammers at the problem. Flushing all
> > of the dcache and icache (flush_dcache_all and
> > flush_instruction_cache) isn't helping. But printk(".") does!
>
> Well there was one remaining cache - the TLB. This patch seems to make
> things work, but don't ask me why:
>
> --- include/asm-ppc/cacheflush.h (revision 10439)
> +++ include/asm-ppc/cacheflush.h (working copy)
> @@ -11,6 +11,7 @@
>  #define _PPC_CACHEFLUSH_H
>
>  #include <linux/mm.h>
> +#include <asm/tlbflush.h>
>
>  /*
>   * No cache flushing is required when address mappings are
> @@ -35,10 +36,23 @@
>  extern void flush_icache_user_range(struct vm_area_struct *vma,
>                  struct page *page, unsigned long addr, int len);
>
>  #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
>  do { memcpy(dst, src, len); \
>       flush_icache_user_range(vma, page, vaddr, len); \
> +     _tlbia(); \
>  } while (0)

Hmmm; thinking out loud here...

- so tlbia invalidates all TLB entries
- When gdb inserts a breakpoint the .text pages are marked as read
  only, so the kernel does a copy on write so that gdb can modify the
  instruction. The kernel also updates the page tables so that the test
  process now uses the new page.
- This means that there are now 2 pages for that one section of
  executable code; the original and the one with the breakpoint.
- However, the program is still in memory, and there is probably
  already a TLB entry pointing to the original page for that range of
  addresses.

Could it be that the kernel page tables are getting updated to the new
page, but the active set of TLB entries is not getting updated?

If so, then printk(".") probably solves the problem simply because it
touches enough pages in its execution path that the old TLB entry gets
overwritten? There are only 64 TLB entries after all.

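The hypothesis above can be sketched as a toy user-space model. This is not kernel code: every name here (toy_mmu, toy_fetch, toy_cow_poke, toy_tlbia) is hypothetical, each page frame holds a single instruction word, and the "forgot to invalidate the TLB entry" behaviour is the suspected failure being modelled, not something the kernel is known to do. The two instruction words are the PowerPC encodings of nop and an unconditional trap, used only as recognisable placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_TLB_ENTRIES 64           /* the 64 entries mentioned above */
#define TOY_PAGES       8            /* virtual pages in the toy space */
#define TOY_FRAMES      16           /* physical frames, one word each */
#define TOY_ORIG_INSN   0x60000000u  /* nop */
#define TOY_TRAP_INSN   0x7fe00008u  /* unconditional trap */

struct toy_tlb_entry { bool valid; unsigned vpn; unsigned pfn; };

struct toy_mmu {
    unsigned page_table[TOY_PAGES];             /* vpn -> pfn */
    struct toy_tlb_entry tlb[TOY_TLB_ENTRIES];
    unsigned next_victim;                       /* assumed round-robin */
    uint32_t frames[TOY_FRAMES];
};

/* Start with an identity mapping and an empty TLB. */
static void toy_init(struct toy_mmu *m)
{
    memset(m, 0, sizeof(*m));
    for (unsigned i = 0; i < TOY_PAGES; i++)
        m->page_table[i] = i;
}

/* Instruction fetch: use a matching TLB entry, else reload from the page table. */
static uint32_t toy_fetch(struct toy_mmu *m, unsigned vpn)
{
    for (unsigned i = 0; i < TOY_TLB_ENTRIES; i++)
        if (m->tlb[i].valid && m->tlb[i].vpn == vpn)
            return m->frames[m->tlb[i].pfn];

    struct toy_tlb_entry *e = &m->tlb[m->next_victim];
    m->next_victim = (m->next_victim + 1) % TOY_TLB_ENTRIES;
    e->valid = true;
    e->vpn = vpn;
    e->pfn = m->page_table[vpn];
    return m->frames[e->pfn];
}

/* Copy-on-write plus poke: updates the page table but leaves the TLB alone. */
static void toy_cow_poke(struct toy_mmu *m, unsigned vpn,
                         unsigned new_pfn, uint32_t insn)
{
    m->frames[new_pfn] = m->frames[m->page_table[vpn]]; /* copy the page */
    m->frames[new_pfn] = insn;                           /* write breakpoint */
    m->page_table[vpn] = new_pfn;                        /* remap */
}

/* Invalidate every TLB entry, as tlbia does. */
static void toy_tlbia(struct toy_mmu *m)
{
    for (unsigned i = 0; i < TOY_TLB_ENTRIES; i++)
        m->tlb[i].valid = false;
}
```

In this model, a fetch before the poke caches the old mapping, a fetch right after toy_cow_poke() still returns the original instruction, and only after toy_tlbia() does the fetch see the trap.
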
Thoughts?
g.
--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely secretlab.ca
(403) 399-0195
|
|
| Re: Apparent kernel bug with GDB on ppc405 |
  United States |
2007-10-24 17:32:50 |
On Wed, Oct 24, 2007 at 04:27:52PM -0600, Grant Likely wrote:
> On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > On Wed, Oct 24, 2007 at 03:42:16PM -0500, Matt Mackall wrote:
> > > On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > > > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > > > I'm trying to debug a trivial statically-linked hello world program on
> > > > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > > > >
> > > > <snip>
> > > > >
> > > > > Any suggestions?
> > > >
> > > > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> > > >
> > > > I was fighting with a similar problem almost 2 years ago. Looks like
> > > > it might be related. At some point the problem seemed to go away and
> > > > I never determined what the root cause was. :-(
> > > >
> > > > I haven't been using gdb lately, so I don't know if it's the same
> > > > problem. Nobody I had talked to had seen the issue on other 405
> > > > platforms. It could very well be something virtex-specific.
> > >
> > > Could be the same problem, but I'm seeing only your symptom 3 so far.
> > >
> > > I've tried throwing some larger hammers at the problem. Flushing all
> > > of the dcache and icache (flush_dcache_all and
> > > flush_instruction_cache) isn't helping. But printk(".") does!
> >
> > Well there was one remaining cache - the TLB. This patch seems to make
> > things work, but don't ask me why:
> >
> > --- include/asm-ppc/cacheflush.h (revision 10439)
> > +++ include/asm-ppc/cacheflush.h (working copy)
> > @@ -11,6 +11,7 @@
> >  #define _PPC_CACHEFLUSH_H
> >
> >  #include <linux/mm.h>
> > +#include <asm/tlbflush.h>
> >
> >  /*
> >   * No cache flushing is required when address mappings are
> > @@ -35,10 +36,23 @@
> >  extern void flush_icache_user_range(struct vm_area_struct *vma,
> >                  struct page *page, unsigned long addr, int len);
> >
> >  #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
> >  do { memcpy(dst, src, len); \
> >       flush_icache_user_range(vma, page, vaddr, len); \
> > +     _tlbia(); \
> >  } while (0)
>
> Hmmm; thinking out loud here...
>
> - so tlbia invalidates all TLB entries
> - When gdb inserts a breakpoint the .text pages are marked as read
>   only, so the kernel does a copy on write so that gdb can modify the
>   instruction. The kernel also updates the page tables so that the test
>   process now uses the new page.
> - This means that there are now 2 pages for that one section of
>   executable code; the original and the one with the breakpoint.
> - However, the program is still in memory, and there is probably
>   already a TLB entry pointing to the original page for that range of
>   addresses.
>
> Could it be that the kernel page tables are getting updated to the new
> page, but the active set of TLB entries is not getting updated?
>
> If so, then printk(".") probably solves the problem simply because it
> touches enough pages in its execution path that the old TLB entry gets
> overwritten? There are only 64 TLB entries after all.
>
> Thoughts?

Not completely implausible, but a) why isn't this seen on basically
every machine with software TLB? b) why does -local- GDB, which is
presumably doing much less work than gdbserver + network stack, not fail?

--
Mathematics is the supreme nostalgia of our time.
|
|
| Re: Apparent kernel bug with GDB on ppc405 |

|
2007-10-24 17:39:04 |
On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> On Wed, Oct 24, 2007 at 04:27:52PM -0600, Grant Likely wrote:
> > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > On Wed, Oct 24, 2007 at 03:42:16PM -0500, Matt Mackall wrote:
> > > > On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > > > > On 10/24/07, Matt Mackall <mpm selenic.com> wrote:
> > > > > > I'm trying to debug a trivial statically-linked hello world program on
> > > > > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > > > > >
> > > > > <snip>
> > > > > >
> > > > > > Any suggestions?
> > > > >
> > > > > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> > > > >
> > > > > I was fighting with a similar problem almost 2 years ago. Looks like
> > > > > it might be related. At some point the problem seemed to go away and
> > > > > I never determined what the root cause was. :-(
> > > > >
> > > > > I haven't been using gdb lately, so I don't know if it's the same
> > > > > problem. Nobody I had talked to had seen the issue on other 405
> > > > > platforms. It could very well be something virtex-specific.
> > > >
> > > > Could be the same problem, but I'm seeing only your symptom 3 so far.
> > > >
> > > > I've tried throwing some larger hammers at the problem. Flushing all
> > > > of the dcache and icache (flush_dcache_all and
> > > > flush_instruction_cache) isn't helping. But printk(".") does!
> > >
> > > Well there was one remaining cache - the TLB. This patch seems to make
> > > things work, but don't ask me why:
> > >
> > > --- include/asm-ppc/cacheflush.h (revision 10439)
> > > +++ include/asm-ppc/cacheflush.h (working copy)
> > > @@ -11,6 +11,7 @@
> > >  #define _PPC_CACHEFLUSH_H
> > >
> > >  #include <linux/mm.h>
> > > +#include <asm/tlbflush.h>
> > >
> > >  /*
> > >   * No cache flushing is required when address mappings are
> > > @@ -35,10 +36,23 @@
> > >  extern void flush_icache_user_range(struct vm_area_struct *vma,
> > >                  struct page *page, unsigned long addr, int len);
> > >
> > >  #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
> > >  do { memcpy(dst, src, len); \
> > >       flush_icache_user_range(vma, page, vaddr, len); \
> > > +     _tlbia(); \
> > >  } while (0)
> >
> > Hmmm; thinking out loud here...
> >
> > - so tlbia invalidates all TLB entries
> > - When gdb inserts a breakpoint the .text pages are marked as read
> >   only, so the kernel does a copy on write so that gdb can modify the
> >   instruction. The kernel also updates the page tables so that the test
> >   process now uses the new page.
> > - This means that there are now 2 pages for that one section of
> >   executable code; the original and the one with the breakpoint.
> > - However, the program is still in memory, and there is probably
> >   already a TLB entry pointing to the original page for that range of
> >   addresses.
> >
> > Could it be that the kernel page tables are getting updated to the new
> > page, but the active set of TLB entries is not getting updated?
> >
> > If so, then printk(".") probably solves the problem simply because it
> > touches enough pages in its execution path that the old TLB entry gets
> > overwritten? There are only 64 TLB entries after all.
> >
> > Thoughts?
>
> Not completely implausible, but a) why isn't this seen on basically
> every machine with software TLB? b) why does -local- GDB, which is
> presumably doing much less work than gdbserver + network stack, not fail?

a) I don't know.... very odd.

b) gdb is big. It probably touches far more pages (via library calls)
than gdbserver. The network stack is also big, but it's probably more
localized too.

Nicing down the host also makes sense, because if the PC is being slow
then the target may go off and run other things between setting
the breakpoint and getting the 'go' command.

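As a rough illustration of the "touches enough pages" argument, here is a toy model of a 64-entry TLB. It assumes round-robin replacement on a miss, and that assumption, along with every name in the block, is mine rather than something established in this thread. It asks whether one stale entry is still present after other code has touched a given number of distinct pages.

```c
#include <assert.h>
#include <stdbool.h>

#define RR_TLB_SLOTS 64      /* the 64 entries mentioned earlier */
#define RR_STALE_VPN 0       /* hypothetical page with the old mapping */

struct rr_tlb {
    int vpn[RR_TLB_SLOTS];   /* -1 marks an empty slot */
    unsigned next;           /* next slot to replace (round-robin) */
};

static void rr_init(struct rr_tlb *t)
{
    for (int i = 0; i < RR_TLB_SLOTS; i++)
        t->vpn[i] = -1;
    t->next = 0;
}

static bool rr_present(const struct rr_tlb *t, int vpn)
{
    for (int i = 0; i < RR_TLB_SLOTS; i++)
        if (t->vpn[i] == vpn)
            return true;
    return false;
}

/* Touch a page: a hit changes nothing, a miss replaces the next slot. */
static void rr_touch(struct rr_tlb *t, int vpn)
{
    if (rr_present(t, vpn))
        return;
    t->vpn[t->next] = vpn;
    t->next = (t->next + 1) % RR_TLB_SLOTS;
}

/* Is the stale entry still cached after n other distinct pages are touched? */
static bool rr_stale_survives(unsigned distinct_pages)
{
    struct rr_tlb t;

    rr_init(&t);
    rr_touch(&t, RR_STALE_VPN);
    for (unsigned i = 0; i < distinct_pages; i++)
        rr_touch(&t, 1000 + (int)i);
    return rr_present(&t, RR_STALE_VPN);
}
```

Under these assumptions the stale entry survives 63 intervening distinct pages and is gone after 64, which would fit a small-footprint path (gdbserver) failing while a large-footprint path (local gdb, or an added printk) happens to work. A real TLB with pinned entries or a different replacement policy would shift that threshold.
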
Can you grab a snapshot of the TLB before and after setting the breakpoint?

g.
--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely secretlab.ca
(403) 399-0195
|
|
| Re: Apparent kernel bug with GDB on ppc405 |

|
2007-10-24 17:40:23 |
On 10/24/07, Grant Likely <grant.likely secretlab.ca> wrote:
> > Not completely implausible, but a) why isn't this seen on basically
> > every machine with software TLB? b) why does -local- GDB, which is
> > presumably doing much less work than gdbserver + network stack, not fail?
>
> a) I don't know.... very odd.
>
> b) gdb is big. It probably touches far more pages (via library calls)
> than gdbserver. The network stack is also big, but it's probably more
> localized too.
>
> Nicing down the host also makes sense, because if the PC is being slow
> then the target may go off and run other things between setting
> the breakpoint and getting the 'go' command.
>
> Can you grab a snapshot of the TLB before and after setting the breakpoint?

Or; probably more relevant, before and after the page copy?
g.
--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely secretlab.ca
(403) 399-0195
|