Thread: Re: - Optional method to purge the TLB on SN systems




Re: - Optional method to purge the TLB on SN systems
user name
2007-03-27 19:46:44
On Wed, 2007-03-28 at 03:39, Jack Steiner wrote:

> This patch adds an optional method for purging the TLB on SN IA64 systems.
> The change should not affect any non-SN system.
> 
> 	Signed-off-by: Jack Steiner <steinersgi.com>
> 
> ---
> 
> +void
> +smp_flush_tlb_cpumask (cpumask_t xcpumask)
> +{
> +	unsigned short counts[NR_CPUS];
> +	cpumask_t cpumask = xcpumask;
> +	int count, mycpu, cpu, flush_mycpu = 0;
> +
> +	preempt_disable();
> +	mycpu = smp_processor_id();
> +
> +	for_each_cpu_mask(cpu, cpumask) {
> +		counts[cpu] = per_cpu(local_flush_count, cpu);
> +		mb();
> +		if (cpu == mycpu)
> +			flush_mycpu = 1;
> +		else
> +			smp_send_local_flush_tlb(cpu);
> +	}
> +
> +	if (flush_mycpu)
> +		smp_local_flush_tlb();
> +
> +	for_each_cpu_mask(cpu, cpumask) {
> +		count = 0;
> +		while(counts[cpu] == per_cpu(local_flush_count, cpu)) {

Due to the 64k offset of percpu data, the same percpu variable on
different CPUs is very likely to land on the same cacheline in some
levels of cache.

So I think the operations on local_flush_count may be very cache
unfriendly...


Zou Nan hai




-
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: - Optional method to purge the TLB on SN systems
user name
2007-03-27 20:53:04
On Wed, Mar 28, 2007 at 08:46:44AM +0800, Zou Nan hai wrote:
> On Wed, 2007-03-28 at 03:39, Jack Steiner wrote:
> 
> > This patch adds an optional method for purging the TLB on SN IA64 systems.
> > The change should not affect any non-SN system.
> > 
> > 	Signed-off-by: Jack Steiner <steinersgi.com>
> > 
> > ---
> > 
> > +void
> > +smp_flush_tlb_cpumask (cpumask_t xcpumask)
> > +{
> > +	unsigned short counts[NR_CPUS];
> > +	cpumask_t cpumask = xcpumask;
> > +	int count, mycpu, cpu, flush_mycpu = 0;
> > +
> > +	preempt_disable();
> > +	mycpu = smp_processor_id();
> > +
> > +	for_each_cpu_mask(cpu, cpumask) {
> > +		counts[cpu] = per_cpu(local_flush_count, cpu);
> > +		mb();
> > +		if (cpu == mycpu)
> > +			flush_mycpu = 1;
> > +		else
> > +			smp_send_local_flush_tlb(cpu);
> > +	}
> > +
> > +	if (flush_mycpu)
> > +		smp_local_flush_tlb();
> > +
> > +	for_each_cpu_mask(cpu, cpumask) {
> > +		count = 0;
> > +		while(counts[cpu] == per_cpu(local_flush_count, cpu)) {
> 
> Due to the 64k offset of percpu data, the same percpu variable on
> different CPUs is very likely to land on the same cacheline in some
> levels of cache.
> 
> So I think the operations on local_flush_count may be very cache
> unfriendly...

I was concerned about that, too, but testing finally convinced me that
it was not an issue. I think the reason is that it takes a few hundred
nanoseconds per cpu to send an IPI. So rather than a contended cache
line, we have a line that is read serially by multiple cpus. Although
contention can occur, typically multiple cpus are not trying to read
the same line at the same time.

For example (oversimplified), an IPI is sent to cpu 0 at time 0, to
cpu 1 at time ~100, to cpu 2 at time ~200, etc. The IPI requires a
chipset access that takes order-of-memory-access time. Assume it takes
N usec for a cpu to recognize the IPI & call the TLB flushing code.
Cpu 0 reads local_flush_count at time N, cpu 1 reads local_flush_count
at time 100+N, etc. Very little contention, just serial access.
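
The staggered timing in the example above can be written out as a small
sketch. This is only an illustration of the arithmetic, not code from
the patch: the function names, the uniform gap between IPIs, and the
uniform recognition latency are all assumptions, and times are in
arbitrary units.

```c
#include <assert.h>

/* Hypothetical model: the initiator sends one IPI every send_gap units. */
static unsigned ipi_send_time(unsigned k, unsigned send_gap)
{
	return k * send_gap;
}

/* Time at which the k-th target touches the shared line, assuming every
 * target needs the same latency to recognize the IPI. */
static unsigned line_access_time(unsigned k, unsigned send_gap,
				 unsigned latency)
{
	return ipi_send_time(k, send_gap) + latency;
}

/* Consecutive targets collide on the line only if one access (hold
 * units long) is still in progress when the next one starts. */
static int accesses_overlap(unsigned send_gap, unsigned hold)
{
	assert(send_gap > 0);
	return hold > send_gap;
}
```

With a gap of 100 and a latency of N, targets 0, 1, 2 touch the line at
N, 100+N, 200+N, so accesses collide only when a single access takes
longer than the gap between IPIs.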

--

I tried a second algorithm where the local_flush_count was kept in
node-local percpu data. That scheme was significantly slower, most
likely because the cpu that initiates the flush will take N (# of
cpus) cache misses to get an initial snapshot of the counts, then
another N cache misses to check for completion. This assumes that
a cpu doing a flush is not the most-recent cpu to do a flush.
I believe this is typical.

Keeping the counts in a single array (64 cpus/cache line)
significantly reduces the number of cache misses.
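
A back-of-the-envelope sketch of the miss counts being compared. The
128-byte cache line is an assumption inferred from 64 two-byte counts
per line; the helper names are hypothetical, and the model ignores any
extra misses caused by the target cpus updating their counts.

```c
#include <assert.h>
#include <stddef.h>

/* Cache lines covered by ncpus counters packed back to back,
 * starting on a line boundary. */
static size_t lines_packed(size_t ncpus, size_t counter_bytes,
			   size_t line_bytes)
{
	assert(line_bytes > 0);
	return (ncpus * counter_bytes + line_bytes - 1) / line_bytes;
}

/* One counter per line (per-cpu data, or a padded array). */
static size_t lines_one_per_cpu(size_t ncpus)
{
	return ncpus;
}

/* Worst case for a cold initiator: every line misses once for the
 * snapshot and once more for the completion check. */
static size_t scan_misses(size_t lines)
{
	return 2 * lines;
}
```

For 64 cpus this model gives 2 misses for the packed array against 128
when each count sits on its own line.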

Another disadvantage of keeping counts in per-cpu data is that
scanning the counts trashes the TLB for large NR_CPUS. The counts will
be located in different 16MB granules. Each reference to a cpu's percpu
data will require a different TLB entry to map the address used to
reference the count. To scan N cpus, there will be ~2*N TLB misses,
plus at the end of the flush, the contents of the TLB are useless
for most kernel or user use.
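
The TLB cost can be estimated the same way, by counting how many
distinct 16MB granules a scan touches. The sketch below assumes one TLB
entry per granule and evenly spaced counts; the function name and the
example strides are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SHIFT 24	/* 16MB granule, as in the text */

/* Number of distinct granules touched when reading one count per cpu,
 * with the counts laid out stride bytes apart starting at base.
 * Addresses only increase, so comparing with the previous granule
 * is enough to detect a new one. */
static size_t granules_touched(uint64_t base, uint64_t stride, size_t ncpus)
{
	size_t distinct = 0;
	uint64_t last = 0;

	assert(stride > 0);
	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		uint64_t g = (base + cpu * stride) >> GRANULE_SHIFT;

		if (cpu == 0 || g != last) {
			distinct++;
			last = g;
		}
	}
	return distinct;
}
```

With counts 16MB apart, 64 cpus need 64 entries per scan; a packed
array of 2-byte counts needs 1.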

--

I tried a third algorithm where the counts were kept in a single array
but each count was cacheline aligned to eliminate any possibility
of contention. This was better than the second method that trashed
the TLB: 1 TLB entry will cover the entire array. Unfortunately,
this algorithm still incurs 2*N cache misses & is slower than
the current algorithm.
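
One way to express the cacheline-aligned layout of the third algorithm
in C11 is shown below. It is a sketch, not the code that was tested:
the 128-byte line size, the EXAMPLE_NR_CPUS value, and the type name
are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define EXAMPLE_LINE_BYTES 128	/* assumed cache line size */
#define EXAMPLE_NR_CPUS    1024	/* assumed, stands in for NR_CPUS */

/* Each count gets a whole line to itself, so no two cpus share a line. */
struct padded_count {
	alignas(EXAMPLE_LINE_BYTES) unsigned short count;
};

static_assert(sizeof(struct padded_count) == EXAMPLE_LINE_BYTES,
	      "one count per cache line");

static struct padded_count padded_counts[EXAMPLE_NR_CPUS];

/* Total footprint of the array in bytes. */
static size_t padded_footprint(void)
{
	return sizeof(padded_counts);
}
```

Under these assumptions the array is 128KB, well inside one 16MB
granule, which is why a single TLB entry covers it, yet a scan still
touches one line per cpu.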


Does this explanation make sense? If anyone has an alternate
algorithm, I'd be glad to try it.


-- jack

