List Info

Thread: RC5 algorithm .... Re: "success"




RC5 algorithm .... Re: "success"
user name
2006-10-19 07:52:36
david fleischer <cilantro_ilyahoo.com> on Thu, 19
Oct 2006 00:31:36 wrote:
> Subject: Re: [Hardware] "success"
> To: "John L. Bass" <jbassdmsd.com>
> 
> Martin,
> What is YOUR estimate?
> 
> Regards,
> David

Well David, since you sent this to me, I guess I'll reply
and Martin
can follow up ... but he vary clearly said a few hours ago:

Martin Klingensmith <martinnnytech.net> on Wed, 18
Oct 2006 18:54:02 -0400 wrote:
> Well, how is the data verified and sent back?
(conceptually, I'm not
> looking for details)
> There could always be some sort of software layer on a
host computer
> with any number of client FPGAs attached.
> 
> By the way, I made a simple top module and tried to
synthesize the code
> today. It started to work but I tried to re-run the
synthesis and
> Synplify Pro just sat there for hours running 100%. I'm
not sure what's
> going on with that - maybe it just needs a reboot. I
don't know how much
> space it will take up in an FPGA. It's a 155 stage 32
bit pipeline which
> seems large by my standards. It is for this reason that
I don't know
> which device it will fit in (I use Xilinx S3-200 right
now) or how fast
> it will run.
> 
> I will give out the code once I figure out if there are
any reasons not
> to do so. It's basically the code I posted to this list
10 months ago
> converted to Verilog and printed out with tac. It's
also part of my
> graduate thesis.

Which is pretty clear ... he doesn't know yet.

Since I helped him via email a bit last year, I do remember
a few more
details, including what he posted 10 months ago as his
design, and what
I posted as mine ... both 10 months ago, and the year before
that. Not
to make it too easy for this school project, and
purposefully didn't
give him exactly what he needed, but an older version.

John


	From jbassdmsd.com  Sun Dec 25 03:34:19 2005
	Date: Sun, 25 Dec 2005 03:34:17 -0700
	From: "John L. Bass" <jbassdmsd.com>
	To: hardwarelists.distributed.net
	Subject: Re: [Hardware] RC5 algorithm

	Martin Klingensmith <martinnnytech.net> writes:
	> I believe it /may/ be possible to pipeline this into
something that cranks
	> through one key each clock cycle but I'm not so much a
professional VLSI
	> designer, I don't know how that works in real designs.

	Actually Martin, the pipelining is a piece of cake with the
right tools,
	and it does work out to one key per clock. It takes a
pretty fair sized
	FPGA, and there are fit problems because of that. "All
you need to do,
	is just unroll the loop" (hehehe ... haven't we heard
that before ... lol).

	Actually, I did just that about a year ago, and it took
about a day,
	using a hacked version of TMCC.

	You will find Fpga C on sourceforge.net:

		http://sourcefo
rge.net/projects/fpgac

	A project I moved to sf.net about two months ago. It will
compile your
	unrolled c as a pure statement level pipeline with very
little change
	to your code or algorithm. Replace the multiply by 2's with
shift left
	by 1. Remove the array references (which are serial
bottlenecks), by
	enumerating the variables by their subscripts (IE s[0]
becomes s0, and
	s[13] becomes s13, etc ...). That will remove the need for
the mod
	function in the array subscript arithmetic (FpgaC doesn't
support %).

	There were a few minor issues with my hacked TMCC that
required a
	script to fix the net list, that actually burned a half
day, but it
	was VERY quick compared to doing this in VHDL.

	Next is a very cute little trick. Reverse all the unrolled
code blocks
	front to back to force pipelined registers. C is by
definition, sequential.
	So all C to netlist compilers have to preserve that
sequential operation,
	and frequently capitalize on it by building flattened
combinatorials for
	each block. That however, creates deep combinatorials,
which slow the
	clock rate. Reversing the blocks however, uses the
sequential semantics
	to our benifit by creating a pipeline:

		b += a;
		c += b;
		d += c;

	creates a combinatorial chain that ripples from a to d. Now
if we reverse
	the statements:

		d += c;
		c += b;
		b += a;

	we break that combinatorial chain, creating a pipeline. It
takes a little
	more work to initialize it, but after doing so, one result
per clock flows
	out the pipe (d in this case).

	So, my first pass at the RC5 core after unrolling the loops
was something like:

	                L0 = key0 = count;
	                L1 = key1 = count>>32;
	                S26 = ROTL3(0xb7e15163);
	                L2 = ROTL(L0 + S26, S26);
	                S27 = ROTL3(0x5618cb1c + S26 + L2);
	                L3 = ROTL(L1 + S27 + L2, S27 + L2);
	                S28 = ROTL3(0xf45044d5 + S27 + L3);
	                L4 = ROTL(L2 + S28 + L3, S28 + L3);
	                S29 = ROTL3(0x9287be8e + S28 + L4);
	                L5 = ROTL(L3 + S29 + L4, S29 + L4);
	                S30 = ROTL3(0x30bf3847 + S29 + L5);
	                L6 = ROTL(L4 + S30 + L5, S30 + L5);
	                S31 = ROTL3(0xcef6b200 + S30 + L6);
	                L7 = ROTL(L5 + S31 + L6, S31 + L6);
	                S32 = ROTL3(0x6d2e2bb9 + S31 + L7);
	                L8 = ROTL(L6 + S32 + L7, S32 + L7);
	                S33 = ROTL3(0x0b65a572 + S32 + L8);
	                L9 = ROTL(L7 + S33 + L8, S33 + L8);
	                S34 = ROTL3(0xa99d1f2b + S33 + L9);
	                L10 = ROTL(L8 + S34 + L9, S34 + L9);
	                S35 = ROTL3(0x47d498e4 + S34 + L10);
	                L11 = ROTL(L9 + S35 + L10, S35 + L10);
	                S36 = ROTL3(0xe60c129d + S35 + L11);
	                L12 = ROTL(L10 + S36 + L11, S36 + L11);

	           [...]
	                E0 = ROTL(E0 ^ E1, E1) + S96;
	                S97 = ROTL3(S71 + S96 + L72);
	                L73 = ROTL(L71 + S97 + L72, S97 + L72);
	                E1 = ROTL(E0 ^ E1, E0) + S97;
	                S98 = ROTL3(S72 + S97 + L73);
	                L74 = ROTL(L72 + S98 + L73, S98 + L73);
	                E0 = ROTL(E0 ^ E1, E1) + S98;
	                S99 = ROTL3(S73 + S98 + L74);
	                L75 = ROTL(L73 + S99 + L74, S99 + L74);
	                E1 = ROTL(E0 ^ E1, E0) + S99;
	                S100 = ROTL3(S74 + S99 + L75);
	                L76 = ROTL(L74 + S100 + L75, S100 + L75);
	                E0 = ROTL(E0 ^ E1, E1) + S100;
	                S101 = ROTL3(S75 + S100 + L76);
	                L77 = ROTL(L75 + S101 + L76, S101 + L76);
	                E1 = ROTL(E0 ^ E1, E0) + S101;
	                S102 = ROTL3(S76 + S101 + L77);
	                L78 = ROTL(L76 + S102 + L77, S102 + L77);
	                E0 = ROTL(E0 ^ E1, E1) + S102;
	                S103 = ROTL3(S77 + S102 + L78);
	                E1 = ROTL(E0 ^ E1, E0) + S103;

	which can be verified by wrapping a loop around it, and
executing it on
	a traditional processor. E0 and E1 must also be renumbered
so the
	var is not reused between pipeline stages for an FpgaC
version.

	For the current challenge, the slightly larger key width
changes
	the unrolled var numbering a bit, so what you see above is
just
	a conceptional model left over from the last challenge.

	Then invert the statements from top to bottom, and fix the
pipeline
	initialization issues (like the first 180 or so output are
garbage
	to be ignored). Wrap a test harness check around it, and
some communication
	with the external world, and you are done ... something
like this:

	    #define ROTL3(x) ((x << 3) | (x >> 29))
	    #define ROTL(x,n) (((x) << (n)) | ((x) >>
(32-(n))))

	        while (++count < max)
	        {
	                unsigned long E0, E1;
	                unsigned long L0, L1, L2, L3, L4, L5, L6,
L7, L8, L9;
	                unsigned long L10, L11, L12, L13, L14, L15,
L16, L17, L18, L19;
	                unsigned long L20, L21, L22, L23, L24, L25,
L26, L27, L28, L29;
	                unsigned long L30, L31, L32, L33, L34, L35,
L36, L37, L38, L39;
	                unsigned long L40, L41, L42, L43, L44, L45,
L46, L47, L48, L49;
	                unsigned long L50, L51, L52, L53, L54, L55,
L56, L57, L58, L59;
	                unsigned long L60, L61, L62, L63, L64, L65,
L66, L67, L68, L69;
	                unsigned long L70, L71, L72, L73, L74, L75,
L76, L77, L78, L79;
	                unsigned long S26, S27, S28, S29;
	                unsigned long S30, S31, S32, S33, S34, S35,
S36, S37, S38, S39;
	                unsigned long S40, S41, S42, S43, S44, S45,
S46, S47, S48, S49;
	                unsigned long S50, S51, S52, S53, S54, S55,
S56, S57, S58, S59;
	                unsigned long S60, S61, S62, S63, S64, S65,
S66, S67, S68, S69;
	                unsigned long S70, S71, S72, S73, S74, S75,
S76, S77, S78, S79;
	                unsigned long S80, S81, S82, S83, S84, S85,
S86, S87, S88, S89;
	                unsigned long S90, S91, S92, S93, S94, S95,
S96, S97, S98, S99;
	                unsigned long S100, S101, S102, S103;

	#include "RC5Core.c"

	                /* test for found key */
	                if (cypher0 == E0 && cypher1 == E1)
	                        break;
	        }

	I think this was posted to this list a little over a year
ago. I also did
	some serial verions which took A LOT more design energy - a
couple weeks each.
	There are some interesting space time tradeoff's.

	Have fun,
	John

		Date: Sun, 25 Dec 2005 00:00:46 -0500
		From: Martin Klingensmith <martinnnytech.net>
		To: Hardware <hardwarelists.distributed.net>
		Subject: Re: [Hardware] RC5 algorithm

		Dan Oetting wrote:

		> You are right that RC5 doesn't care what comes after
the current  
		> block. But it does care about what came before. The
RSA contests use  
		> RC5-CBC so you are given an IV that represents the
result from the  
		> previous block that needs to be XOR'd with the plain
text for the  
		> current block before applying the RC5 encryption. The
resulting  
		> cypher text is then used as the IV for the next
block.
		>
		> Here is a description with sample code: <http://www.ietf.org/rfc/

		> rfc2040.txt>
		>
		> -- Dan O. 


		Thanks for the tip, Dan.
		I did a lot of studying and reducing the rfc2040.txt into
something that 
		can fit on a page and is specifically for RC5-32/12/9.
There are a few 
		examples of this out on the 'net but I thought I'd send it
anyway in 
		case someone is interested in looking at it. The next step
is to figure 
		out what I need to do to make an FPGA run this as fast as
possible. I 
		believe it /may/ be possible to pipeline this into
something that cranks 
		through one key each clock cycle but I'm not so much a
professional VLSI 
		designer, I don't know how that works in real designs.

		Serially, the following happens:
		---------------------------
		The key expansion loop for: S[i] = S[i-1] + Qw; has to
iterate 25 times
		The S table creation loop iterates 78 times
		The actual block encryption routine only runs through 12
rounds
		---------------------------
		If you have any wise advice for how to pipeline this, I
would much 
		appreciate it.


		Here is the C code. It is not mine, I just hacked it up
from rfc2040.c 
		which started out at around 500 lines:
		////////////////
		#include <stdio.h>

		#define Pw  0xb7e15163
		#define Qw  0x9e3779b9
		#define RC5_WORD     unsigned int
		#define SHL(x,s)    ((RC5_WORD)((x)<<((s)&31)))
		#define SHR(x,s,w) 
((RC5_WORD)((x)>>((w)-((s)&31))))
		#define ROTL(x,s,w)
((RC5_WORD)(SHL((x),(s))|SHR((x),(s),(w))))

		int main(){
		    RC5_WORD    S[42];
		    RC5_WORD    L[3];  /* Based on max key size */
		    RC5_WORD    A, B, ivA, ivB;
		    int i, k;   
		    int        j=0;

		    // iv = 07 ce 59 1f 86 14 9a 41
		    ivA = 0x1f59ce07;
		    ivB = 0x419a1486;
		   
		    // key = c9 0c 03 53 c0 d4 e1 fe 85
		    L[0] = 0x53030cc9;
		    L[1] = 0xfee1d4c0;
		    L[2] = 0x85;


		    S[0] = Pw;
		    for(i=1;i<26;i++)
		        S[i] = S[i-1] + Qw;
		   
		    i = j = 0;
		    A = B = 0;

		    for (k = 0 ; k < 3*26 ; k++)  {
		        A = ROTL(S[i] + A + B, 3, 32);
		        S[i] = A;
		        B = ROTL(L[j] + A + B, A + B, 32);
		        L[j] = B;
		        i = (i + 1) % 26;
		        j = (j + 1) % 3;
		    }

		    // p = 54 68 65 20 75 6e 6b 6e = "The Unkn"
		    A = 0x20656854;
		    B = 0x6e6b6e75;
		   
		    // RC5-CBC requires the IV to be XORed with the input
block
		    A = (A ^ ivA);
		    B = (B ^ ivB);
		   
		    printf("nA: %xnB: %xn",A,B);

		    A = A + S[0];
		    B = B + S[1];

		    for(i=1;i<=12;i++){
		        A = A ^ B;
		        A = ROTL(A, B, 32) + S[2*i];
		        B = B ^ A;
		        B = ROTL(B, A, 32) + S[2*i+1];
		    }   
		   
		    printf("nA: %xnB: %xn",A,B);

		    return (0);
		}

		////////////////

		-- 
		---
		Martin Klingensmith
		nnytech.net
		infoarchive.net

_______________________________________________
Hardware mailing list
Hardwarelists.distributed.net
http://lists.distributed.net/mailman/listinfo/hardware

[1]

about | contact  Other archives ( Real Estate discussion Medical topics )