Thread: "success"




"success"
2006-10-21 06:51:24
"Dan Oetting" <dan_oettingqwest.net> on Fri, 20 Oct 2006 22:08:03 -0600 writes:
> You either need to compute the L[ ] values to feed into the first
> round of each key schedule stage or save the S[ ] values for each
> iteration between stages. You could generate the L[ ] values by
> running your 3 stages through 3 passes with the same key to generate
> and pass the required values. Alternatively, you could replicate the
> early key schedule stages and feed them with the next 2 keys to be
> processed. You would then have a total of 6 key schedule stages and 1
> decrypt stage but only need 1 pass per key and no S[ ] storage. I
> figure that's about a 40% savings.
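For reference, the S[ ] and L[ ] arrays in the quote above are the two tables of the RC5 key schedule. The sketch below is a plain software model, not anyone's FPGA design. It assumes standard RC5-32/12 as used by RC5-72 (t = 26 S[ ] words, a 9-byte key packed into c = 3 L[ ] words) and splits the 78 mixing steps into three 26-step passes, which is presumably what the three key-schedule "stages" implement. All function names are illustrative. Each pass consumes the previous pass's S[ ] and L[ ] together with the running A, B and j values, and that is exactly the state that must either be stored between stages or regenerated.

```python
P32 = 0xB7E15163  # RC5 magic constant P_w for w = 32
Q32 = 0x9E3779B9  # RC5 magic constant Q_w for w = 32
MASK = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n; only the low 5 bits of n are used."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def key_to_words(key: bytes) -> list[int]:
    """Pack key bytes little-endian into 32-bit L[] words."""
    words = [0] * max(1, (len(key) + 3) // 4)
    for i, byte in enumerate(key):
        words[i // 4] |= byte << (8 * (i % 4))
    return words


def initial_s(t: int) -> list[int]:
    """S[0] = P, S[i] = S[i-1] + Q (mod 2^32)."""
    s = [P32]
    for _ in range(1, t):
        s.append((s[-1] + Q32) & MASK)
    return s


def mix_pass(s, l, a, b, j):
    """One key-schedule pass over all t entries of S[].

    Returns the updated S[] and L[] plus the running A, B and j values,
    i.e. everything the next pass (stage) needs as input.
    """
    s, l = list(s), list(l)
    for i in range(len(s)):
        a = s[i] = rotl((s[i] + a + b) & MASK, 3)
        b = l[j] = rotl((l[j] + a + b) & MASK, a + b)
        j = (j + 1) % len(l)
    return s, l, a, b, j


def key_schedule(key: bytes, rounds: int = 12) -> list[list[int]]:
    """Run the three passes and return a snapshot of S[] after each one.

    Assumes t >= c, so the 3 * max(t, c) mixing steps are 3 full passes.
    """
    t = 2 * (rounds + 1)
    s, l = initial_s(t), key_to_words(key)
    a = b = j = 0
    snapshots = []
    for _ in range(3):
        s, l, a, b, j = mix_pass(s, l, a, b, j)
        snapshots.append(s)
    return snapshots
```

For example, `key_schedule(bytes(9))` returns three 26-word snapshots; the first is the S[ ] table a round-1 stage would have to hand to round 2.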

From an FPGA/VLSI perspective, I don't see how this is a 40% savings
for a fully unrolled solution.

A "stage" in an FPGA is roughly 5*32 4-LUTs, and SBox storage is 2*32
4-LUTs. Propagating the round 2 SBox to round 3 would take about 2080
4-LUTs to replace the 832 4-LUTs of LUT RAM used for the SBox, about
2.5 times the cost. The numbers are worse for the round 3 to round 4
SBox propagation: you need about 4160 4-LUTs to regenerate the round 3
SBox terms, and only 832 to store them.
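The comparison above comes down to two ratios. This snippet only redoes that arithmetic with the 4-LUT figures quoted in the message; they are the author's estimates, not synthesis results, and the dictionary labels are just descriptive.

```python
from fractions import Fraction

STORE_LUTS = 832  # 4-LUTs of LUT RAM to store the SBox terms (quoted estimate)
REGEN_LUTS = {    # 4-LUTs of logic to regenerate them instead (quoted estimates)
    "round 2 -> round 3": 2080,
    "round 3 -> round 4": 4160,
}


def cost_ratio(regen_luts: int, store_luts: int) -> Fraction:
    """How many times the storage area the regeneration logic costs."""
    return Fraction(regen_luts, store_luts)


for name, regen in REGEN_LUTS.items():
    r = cost_ratio(regen, STORE_LUTS)
    print(f"{name}: {regen} / {STORE_LUTS} = {float(r):g}x the storage cost")
# prints:
# round 2 -> round 3: 2080 / 832 = 2.5x the storage cost
# round 3 -> round 4: 4160 / 832 = 5x the storage cost
```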

The "trick" works for a processor solution when storage is more
expensive than cycles, such as on a small microprocessor. It is
expensive in nearly every other case.

However, it could be cheaper for a fully looped design, just as it is
for small processors, and it might be useful for running many small
looped engines in the FPGA rather than one large unrolled engine.
_______________________________________________
Hardware mailing list
Hardwarelists.distributed.net
http://lists.distributed.net/mailman/listinfo/hardware

"success"
2006-10-23 08:38:42
Hello,

Saturday, October 21, 2006, 8:51:24 AM, you wrote:

JLB> However, it could be cheaper for a fully looped design, just as it is for
JLB> small processors. And might be useful in running many small looped engines
JLB> in the FPGA, rather than one large unrolled engine.

If there is no room for a fully unrolled engine, small looped engines
are of course the best choice. But not too small! I'm sure the
area-time product is better with inner pipelining in the looped engine
(that is, processing several keys at the same time in the same engine).
Another point is how you feed the engines. A big bus connected to many
small looped engines causes extra delays and constrains the operating
frequency of the whole design (but it should be okay for rather small
FPGAs).
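A toy cycle-count model of the inner-pipelining argument, under assumptions that are not from the thread: the looped engine's step logic is cut into `depth` register stages, every key needs `steps_per_key` trips around the loop (78 by default, counting only the RC5-72 key-schedule steps), and a key's next step cannot start until its current step has left the pipeline. The function name is hypothetical.

```python
from fractions import Fraction


def keys_per_cycle(keys_in_flight: int, depth: int, steps_per_key: int = 78) -> Fraction:
    """Steady-state throughput of one looped engine, in keys per clock cycle.

    Each key spends `depth` cycles per step, so a lone key leaves most
    pipeline slots empty. Interleaving up to `depth` independent keys fills
    the slots; keys beyond `depth` just wait and add no throughput.
    """
    if keys_in_flight < 1 or depth < 1:
        raise ValueError("need at least one key and one pipeline stage")
    useful = min(keys_in_flight, depth)
    return Fraction(useful, steps_per_key * depth)
```

With `depth = 4`, one key alone yields 1/312 keys per cycle while four interleaved keys yield 1/78, the same per-cycle rate as an unpipelined loop. The gain from pipelining then comes from the shorter clock period of the deeper pipeline, which this model deliberately leaves out.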

-- 
Guerric

