List Info

Thread: gsoc & parallelization




gsoc & parallelization
country flaguser name
United States
2008-03-21 17:25:04
hi,

I have received more than one email expressing interest in
the ns-3
parallelization work and asking for guidance on how to
proceed further.
So, since this might be of use to more than one person, I
thought I
would send it here. Other mentors will probably be able to
provide
further guidance and input.

The first task is to decide what the scope of the project
should be.
There are really two possible options:
  - provide a parallel tool for clusters, and, more
generally, for
physically distributed systems
  - provide a parallel tool for multicore systems
(numa/smp)

Since I doubt very much that the same solution can be
applied with
useful performance to both options, I really would urge
interested
students to focus on one of the above use-cases.

Now, since I was asked for what what I would recommend, the
following is
really a summary of one of the many implementation options
you could
chose to pursue so, don't take this is as any kind of
mandatory path.
This is merely a hint as to what might be implementable
within the
requested timeframe and what might bring interesting and
useful output
to users.


1) The basic idea is to implement a conservative distributed
time
synchronization algorithm. The classic algorithm is
described in
"Parallel simulation: parallel and distributed
simulation systems" by
Richard M. Fujimoto. The fundamental assumption here is that
the
simulation models make no use of global variables and that
simulated
nodes are connected through channels which have a non-null
minimum
propagation delay. This allows you to parallelize two events
e1 and e2
generated on two simulated nodes (n1 and n2) if e2 > e1 +
min_delay
(n1,n2) or if e1 > e2 + min_delay (n1,n2).

2) We could, of course, implement the classic algorithm in
the classic
way to allow a user to build very large scale simulations
by
distributing the simulated nodes on a large set of
simulation machines.
This kind of parallelization is useful but it has a number
of drawbacks,
the main one being that it requires the user to describe a
static
partitioning of the simulated nodes: the total set of nodes
is split in
a set of partitions, each of which contains a subset of the
total set of
nodes. Then, each partition is scheduled to run on a single
machine.
Once you have made this decision, you cannot undo it since
you are not
using a language like java which allows code migration. In
practice,
finding a partitioning which does not generate slowdowns is
non-trivial.

3) So, from my perspective, a desirable solution would
require that the
user does not have to describe a partition set and it would
work really
hard to get automatically _some_ speedup from a multicore
system.

A simple solution would be to re-use a classic
synchronization algorithm
with a very fine-grained partitioning (i.e., each partition
contains
exactly one node). Then, you could schedule each partition
on one of the
cpus of the multicore machine until the synchronization
algorithm tells
you that this partition has no events to schedule anymore.
At this
point, you can pick another available partition to schedule
it on one of
the available cpus.

Pluses:
  - no user-visible partitioning on a single multicore
machine.
  - can be re-used if you have multiple multicore machines
to schedule
events within each multicore machine. So, you can reuse this
work if you
want to provide a more generic distributed simulation
solution.

Minuses:
  - have no idea if you will get any kind of measurable
speedup and you
will need to work hard to optimize local cache thrashing,
etc.

4) How do you implement this ?

The core synchronization algorithm should live in the
'simulator'
module: src/simulator/simulator.cc should basically be
replaced by a new
version which performs the synchronization/partition
scheduling. It is
trivial to make this code dynamically replaceable by
exporting something
like SimulatorImpl defined src/simulator/simulator.cc so, I
will skip
this detail.

The biggest problem from an API point of view is that you
need to be
able to tell which partition an event belongs to.
Traditionally, this is
not a problem because there is a single partition per
physical
machine/address-space. However, here, we have multiple
partitions within
a single address-space (physical machine). Since every
partition is made
of a single Node, we just need to tell which node an event
is coming
from. There are many options, but here are two:
  - Simulator::ScheduleWithObject (Ptr<Object> node,
Time delay,
function, ...);
  - Node::Schedule (Time delay, function, ...);

And, then, you need to make sure that all your models use
these
scheduling functions rather than the functions they are now
using.

5) A small tricky detail to get right from an algorithmic
perspective is
how you will handle events which expire at the same time.
There are two
issues to deal with here:
  - you need to ensure that the parallel simulations provide
a
deterministic ordering of these events.
  - you need to ensure that the deterministic ordering of
these events
is the same in both non-parallel and parallel simulations to
ensure that
models are trivially portable from a non-parallel simulation
to a
parallel one.

I suspect that there are multiple solutions to this problem,
but here is
an ordering algorithm which should solve this problem:
  - all same-time events within a single node should be
scheduled FIFO.
  - if you have two same-time events on node A and B, and
if
A->GetNodeId () < B->GetNodeId (), then, the event
on A should be
scheduled first.

This means that you need to add to the event ordering key
the nodeid as
a secondary ordering key (in fact, there are already 2 keys
in there,
so, this would be a third ordering key)

Mathieu



[1]

about | contact  Other archives ( Real Estate discussion Medical topics )