List Info

Thread: Unexpected threading performance result




Unexpected threading performance result
country flaguser name
Croatia
2007-10-07 09:52:03
Hi,

For an unrelated purpose, I'm benchmarking performance of
tree 
algorithms in SMP environments and my preliminary run has an
unexpected 
result. Here's the typical output from the (small) benchmark
program, 
run on a dual-core Athlon64 (i386 mode):

Running benchmarks on small_nonuniform, 1000000 samples
Step 1: Running 100 loops
** Step 1 benchmark completed 100 loops in 84.44 seconds.
Step 2: Running 2 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 2 threads in
167.46 seconds.

The interpretation is: running the same loop twice, in two
parallel 
threads doesn't result in a speedup, but it looks like the
execution is 
serialized. The problem is: the loops are completely
independent, no 
locking in their execution, and 'top' confirms that both
threads in the 
program are running at 100% CPU each.

I verified this behaviour on:

- 7-CURRENT, i386, ULE
- 7-CURRENT, i386, 4BSD
- 6-STABLE, amd64, 4BSD

I can't really explain this behaviour, but it might not be
related to 
FreeBSD - maybe I made a mistake in the program or there's a

hardware-related reason for it (maybe CPU cache trashing
from the tree 
traversal?). In any case, can someone shed some light on
this?

The main part of the (small) program is pasted below.


  47 double time_start, time_b1, time_b2;
  48 int n_data, n_samples;
  49 int *data, *samples;
  50
  51
  52 void bench_loop()
  53 {
  54         int i;
  55         struct node *nd, find;
  56         for (i = 0; i < n_samples; i++) {
  57                 find.data = samples[i];
  58                 nd = RB_FIND(node_tree, &head,
&find);
  59                 if (nd == NULL)
  60                         errx(1, "Sample %d was not
found", find.data);
  61         }
  62 }
  63
  64 void step1()
  65 {
  66         int n;
  67         /* step 1 - simple tree traversal */
  68         printf("Step 1: Running %d loopsn",
STEP1_ITER);
  69         for (n = 0; n < STEP1_ITER; n++)
  70                 bench_loop();
  71         time_b1 = gettime();
  72         printf("** Step 1 benchmark completed %d
loops in %0.2lf 
seconds.n", STEP1_ITER, time_b1 - time_start);
  73 }
  74
  75 void *step2_thread(void *arg) {
  76         int n;
  77         for (n = 0; n < STEP2_ITER; n++)
  78                 bench_loop();
  79         return NULL;
  80 }
  81
  82 void step2()
  83 {
  84         /* step 2 - run tree traversal in parallel
threads */
  85         int n;
  86         pthread_t threads[STEP2_THREADS];
  87
  88         printf("Step 2: Running %d threads with %d
loops eachn", 
STEP2_THREADS, STEP2_ITER);
  89         for (n = 0; n < STEP2_THREADS; n++) {
  90                 if (pthread_create(&threads[n],
NULL, step2_thread, 
NULL) != 0)
  91                         err(1, "Cannot spawn
thread");
  92         }
  93         for (n = 0; n < STEP2_THREADS; n++)
  94                 pthread_join(threads[n], NULL);
  95         time_b2 = gettime();
  96         printf("** Step 2 benchmark completed %d
loops in %d 
threads in %0.2lf seconds.n",
  97                         STEP2_ITER, STEP2_THREADS,
time_b2 - 
time_start);
  98 }

_______________________________________________
freebsd-threadsfreebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-threa
ds
To unsubscribe, send any mail to
"freebsd-threads-unsubscribefreebsd.org"

Re: Unexpected threading performance result
country flaguser name
United States
2007-10-07 11:21:53
* Ivan Voras <ivorasfreebsd.org> [071007 07:52] wrote:
> Hi,
> 
> For an unrelated purpose, I'm benchmarking performance
of tree 
> algorithms in SMP environments and my preliminary run
has an unexpected 
> result. Here's the typical output from the (small)
benchmark program, 
> run on a dual-core Athlon64 (i386 mode):
> 
> Running benchmarks on small_nonuniform, 1000000
samples
> Step 1: Running 100 loops
> ** Step 1 benchmark completed 100 loops in 84.44
seconds.
> Step 2: Running 2 threads with 100 loops each
> ** Step 2 benchmark completed 100 loops in 2 threads in
167.46 seconds.
> 
> The interpretation is: running the same loop twice, in
two parallel 
> threads doesn't result in a speedup, but it looks like
the execution is 
> serialized. The problem is: the loops are completely
independent, no 
> locking in their execution, and 'top' confirms that
both threads in the 
> program are running at 100% CPU each.

...

1)
Could you please make this example program compile/work,
it's a bit
difficult to diagnose the problem if we don't know if things
like
n_samples are initialised properly.

2)
Please try to pthread_attr_setscope() to
PTHREAD_SCOPE_SYSTEM
on a pthread_attr_t object to pass into pthread_create(). 
That
may help.

3) 
What's the deal with RB_FIND?  What does that do?  Is that
data structure locked with a an exclusive lock?





> 
> I verified this behaviour on:
> 
> - 7-CURRENT, i386, ULE
> - 7-CURRENT, i386, 4BSD
> - 6-STABLE, amd64, 4BSD
> 
> I can't really explain this behaviour, but it might not
be related to 
> FreeBSD - maybe I made a mistake in the program or
there's a 
> hardware-related reason for it (maybe CPU cache
trashing from the tree 
> traversal?). In any case, can someone shed some light
on this?
> 
> The main part of the (small) program is pasted below.
> 
> 
>  47 double time_start, time_b1, time_b2;
>  48 int n_data, n_samples;
>  49 int *data, *samples;
>  50
>  51
>  52 void bench_loop()
>  53 {
>  54         int i;
>  55         struct node *nd, find;
>  56         for (i = 0; i < n_samples; i++) {
>  57                 find.data = samples[i];
>  58                 nd = RB_FIND(node_tree, &head,
&find);
>  59                 if (nd == NULL)
>  60                         errx(1, "Sample %d was
not found", find.data);
>  61         }
>  62 }
>  63
>  64 void step1()
>  65 {
>  66         int n;
>  67         /* step 1 - simple tree traversal */
>  68         printf("Step 1: Running %d
loopsn", STEP1_ITER);
>  69         for (n = 0; n < STEP1_ITER; n++)
>  70                 bench_loop();
>  71         time_b1 = gettime();
>  72         printf("** Step 1 benchmark completed
%d loops in %0.2lf 
> seconds.n", STEP1_ITER, time_b1 - time_start);
>  73 }
>  74
>  75 void *step2_thread(void *arg) {
>  76         int n;
>  77         for (n = 0; n < STEP2_ITER; n++)
>  78                 bench_loop();
>  79         return NULL;
>  80 }
>  81
>  82 void step2()
>  83 {
>  84         /* step 2 - run tree traversal in parallel
threads */
>  85         int n;
>  86         pthread_t threads[STEP2_THREADS];
>  87
>  88         printf("Step 2: Running %d threads
with %d loops eachn", 
> STEP2_THREADS, STEP2_ITER);
>  89         for (n = 0; n < STEP2_THREADS; n++) {
>  90                 if (pthread_create(&threads[n],
NULL, step2_thread, 
> NULL) != 0)
>  91                         err(1, "Cannot spawn
thread");
>  92         }
>  93         for (n = 0; n < STEP2_THREADS; n++)
>  94                 pthread_join(threads[n], NULL);
>  95         time_b2 = gettime();
>  96         printf("** Step 2 benchmark completed
%d loops in %d 
> threads in %0.2lf seconds.n",
>  97                         STEP2_ITER, STEP2_THREADS,
time_b2 - 
> time_start);
>  98 }
> 
> _______________________________________________
> freebsd-threadsfreebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-threa
ds
> To unsubscribe, send any mail to
"freebsd-threads-unsubscribefreebsd.org"

-- 
- Alfred Perlstein
_______________________________________________
freebsd-threadsfreebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-threa
ds
To unsubscribe, send any mail to
"freebsd-threads-unsubscribefreebsd.org"

Re: Unexpected threading performance result
country flaguser name
Belgium
2007-10-07 11:05:38
On Sunday 07 October 2007 16:52:03 Ivan Voras wrote:
> For an unrelated purpose, I'm benchmarking performance
of tree 
> algorithms in SMP environments and my preliminary run
has an unexpected 
> result. Here's the typical output from the (small)
benchmark program, 
> run on a dual-core Athlon64 (i386 mode):
> 
> Running benchmarks on small_nonuniform, 1000000
samples
> Step 1: Running 100 loops
> ** Step 1 benchmark completed 100 loops in 84.44
seconds.
> Step 2: Running 2 threads with 100 loops each
> ** Step 2 benchmark completed 100 loops in 2 threads in
167.46 seconds.

My guess is, that in the beginning of step1() and step2()
you have to
add a line "time_start = gettime();".
_______________________________________________
freebsd-threadsfreebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-threa
ds
To unsubscribe, send any mail to
"freebsd-threads-unsubscribefreebsd.org"

Re: Unexpected threading performance result
user name
2007-10-07 15:09:28
Tijl Coosemans wrote:
> On Sunday 07 October 2007 16:52:03 Ivan Voras wrote:
>> For an unrelated purpose, I'm benchmarking
performance of tree 
>> algorithms in SMP environments and my preliminary
run has an unexpected 
>> result. Here's the typical output from the (small)
benchmark program, 
>> run on a dual-core Athlon64 (i386 mode):
>>
>> Running benchmarks on small_nonuniform, 1000000
samples
>> Step 1: Running 100 loops
>> ** Step 1 benchmark completed 100 loops in 84.44
seconds.
>> Step 2: Running 2 threads with 100 loops each
>> ** Step 2 benchmark completed 100 loops in 2
threads in 167.46 seconds.
> 
> My guess is, that in the beginning of step1() and
step2() you have to
> add a line "time_start = gettime();".

Of course I have. I was so focused on the low level stuff I
did 
something stupid to the effect of your suggestion. Thanks
for the help!

The results make sense now, and if anyone's interested, I'm
pasting them 
below. I did additional effort and run it under both 4BSD
and ULE 
schedulers non 7-CURRENT (SMP, dual-core).

-- 4BSD, nonuniform samples --
Running benchmarks on small_nonuniform, 1000000 samples
Step 1: Running 100 loops
** Step 1 benchmark completed 100 loops in 86.33 seconds.
Step 2: Running 2 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 2 threads in
82.79 seconds.
Step 2: Running 3 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 3 threads in
124.67 seconds.
Step 2: Running 4 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 4 threads in
166.32 seconds.
Step 2: Running 5 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 5 threads in
210.67 seconds.
Step 2: Running 6 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 6 threads in
251.83 seconds.
Step 2: Running 7 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 7 threads in
291.25 seconds.

-- ULE nonuniform samples --
Running benchmarks on small_nonuniform, 1000000 samples
Step 1: Running 100 loops
** Step 1 benchmark completed 100 loops in 84.09 seconds.
Step 2: Running 2 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 2 threads in
83.43 seconds.
Step 2: Running 3 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 3 threads in
126.21 seconds.
Step 2: Running 4 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 4 threads in
166.66 seconds.
Step 2: Running 5 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 5 threads in
209.40 seconds.
Step 2: Running 6 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 6 threads in
250.36 seconds.
Step 2: Running 7 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 7 threads in
291.92 seconds.
Step 2: Running 8 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 8 threads in
333.42 seconds.

-- 4BSD uniform samples --
Running benchmarks on small_uniform, 1000000 samples
Step 1: Running 100 loops
** Step 1 benchmark completed 100 loops in 93.33 seconds.
Step 2: Running 2 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 2 threads in
89.33 seconds.
Step 2: Running 3 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 3 threads in
135.20 seconds.
Step 2: Running 4 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 4 threads in
179.96 seconds.
Step 2: Running 5 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 5 threads in
226.40 seconds.
Step 2: Running 6 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 6 threads in
269.57 seconds.
Step 2: Running 7 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 7 threads in
314.06 seconds.
Step 2: Running 8 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 8 threads in
358.67 seconds.

-- ULE uniform samples --
Running benchmarks on small_uniform, 1000000 samples
Step 1: Running 100 loops
** Step 1 benchmark completed 100 loops in 89.76 seconds.
Step 2: Running 2 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 2 threads in
89.90 seconds.
Step 2: Running 3 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 3 threads in
135.75 seconds.
Step 2: Running 4 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 4 threads in
179.72 seconds.
Step 2: Running 5 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 5 threads in
226.10 seconds.
Step 2: Running 6 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 6 threads in
269.63 seconds.
Step 2: Running 7 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 7 threads in
314.76 seconds.
Step 2: Running 8 threads with 100 loops each
** Step 2 benchmark completed 100 loops in 8 threads in
359.44 seconds.

"uniform" / "nonuniform" describes the
distribution of the random number 
function.

_______________________________________________
freebsd-threadsfreebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-threa
ds
To unsubscribe, send any mail to
"freebsd-threads-unsubscribefreebsd.org"

[1-4]

about | contact  Other archives ( Real Estate discussion Medical topics )