Byte Parallel Crc16

Snial · Post by **Snial** » Thu Oct 29, 2015 9:51 am

Hi folks,

Here's a nifty byte-parallel 16-bit CRC algorithm based on the algorithm here (works out as 9.75c/bit):

http://embdev.net/articles/CRC-16-CCITT ... _Assembler

Y=256-Length, BUFF^Actual buffer-256+Length.

Code: Select all

CRC16:
	LDA (BUFF),Y
	EOR CrcHi
	STA CrcHi	;really CrcLo
	LSRA
	LSRA
	LSRA
	LSRA	;(Crc&0xff)>>4.
	EOR CrcHi
	STA Temp0 ;proper CrcLo now.
	ASLA
	ASLA
	ASLA
	ASLA	;(Crc<<12)
	EOR CrcLo ;Actual CrcHi from swap.
	STA CrcHi	;this is the new CrcHi
	LDA Temp0	;CrcLo
	RORA
	RORA
	RORA ;so bottom 5 bits need EORing into CrcHi
	STA Temp1
	AND #31
	EOR CrcHi
	STA CrcHi
	LDA Temp1
	RORA ;top 3 bits need EORing into CRCLo
	AND #$E0
	EOR Temp0
	STA CrcLo
	INY
	BNE CRC16	
	RTS	;2*15+3*14+6 = 78c/byte.

teamtempest · Post by **teamtempest** » Fri Oct 30, 2015 5:23 pm

I'm having a little trouble this line in the 'C' version of the algorithm:

Code: Select all

crc ^= ((crc & 0xff) << 4) << 1;

To me it looks like the low byte is shifted left four times, then once more. Isn't that the same as:

Code: Select all

crc ^= (crc & 0xff) << 5;

Or am I missing something?

Snial · Post by **Snial** » Fri Oct 30, 2015 5:46 pm

Yep, it's the same.

However, the Avr has a nybble swapping instruction and the Gcc compiler could make better use of it if the code is expressed this way.

On the other hand it's in assembler- so what does it matter?

Note, the assembler code uses swap quite a bit! On an AtMega the speed can be improved further by using multiply instructions (*32 is faster than <<5).

It's worth comparing the code with the optimized one on the MDFS website. That Crc16 takes 20+8*(15+14/2)=196c/byte= about 24 cycles per bit, over 2x skewer than the algorithm presented here.

http://beebwiki.mdfs.net/CRC-16

The reason why I'm a little obsessed with byte parallel CRCs, is because a highly efficient 6800 implementation was published as a BSI standard in the 1970s, but somehow byte parallel versions have been obscured by bit and (sometimes erroneous) table-driven ones.

cbscpe · Post by **cbscpe** » Mon Nov 02, 2015 7:53 pm

And why not consider an algorithm with look up tables?

Snial · Post by **Snial** » Mon Nov 02, 2015 8:59 pm

Quote:

And why not consider an algorithm with look up tables?

About a decade ago, it was hard to find the byte-parallel version, but recently it's become more common.

It turns out there are good reasons to prefer non-table versions, a 512b table is a fairly significant size for an 8-bit CPU. Secondly, some of the CRC tables on the net actually contain errors (though you wouldn't notice it if you generate and compare the CRCs using the same tables). Lastly, on powerful CPUs, a CRC table will generate a lot of cache misses and therefore it's less fast than you might expect than the computational version.

cbscpe · Post by **cbscpe** » Wed Nov 04, 2015 7:00 am

Perhaps, but even for an 8bit computer a 512byte table is not much. The performance increase is important especially for high baud-rates you can calculate the CRC while receiving. So you have no delays between the receiption of a packet and a ACK or NAK. Which is one of the major reasons for low transfer rates. As for the cache misses, we are talking about normal 8bit computers, they mostly do not have caches. Todays computer with cache also include instructions that better support the calculation of CRC than shift and swap. And if you are afraid of wrong tables, then you could as well add a routine to your Xmodem that builds your table in advance in RAM. Or you challenge the table against your algorithm. You just need to run 2**24 cases. Takes about an hour on a 1Mhz 6502.

teamtempest · Post by **teamtempest** » Wed Nov 04, 2015 6:15 pm

I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights (crc ^= crc << 5). Maybe I'm wrong, but here's a couple of alternate versions:

Code: Select all

accumulate_crc16: 

/* Algorithm: CRC-16-CCITT, x16 + x12 + x5 + 1, 0x1021 / 0x8408 / 0x8810 
   Steps in C to update a running CRC with a new byte in ser_data:
    unsigned char ser_data;
   unsigned int crc;   -> accumulated CRC to update, initial value 0xFFFF
    crc  = (unsigned char)(crc >> 8) | (crc << 8);
   crc ^= ser_data;
   crc ^= (unsigned char)(crc & 0xff) >> 4;
   crc ^= (crc << 8) << 4;
    crc ^= ((crc & 0xff) << 4) << 1;
    Note that there are faster table-based implementations, but they
   consume more program space.
 */

crc_16_ccitt_update:

	; crc = (unsigned char)(crc >> 8 | crc << 8 );

	ldx crc_16_lo
	ldy crc_16_hi
	sty crc_16_lo
	stx crc_16_hi

	; crc ^= ser_data

	eor crc_16_lo
	sta crc_16_lo
	
	; crc ^= (unsigned char)(crc & 0xff) >> 4;

	lsr
	lsr
	lsr
	lsr
	eor crc_16_lo
	sta crc_16_lo

	; crc ^= (crc << 8) << 4;

	asl
	asl
	asl
	asl
	eor crc_16_hi
	sta crc_16_hi

	; crc ^= ((crc & 0xff) << 4) << 1;

	lda crc_16_lo
	cmp #$80
	ror
	cmp #$80
	ror
	cmp #$80
	ror
	tax
	and #%11100000
	eor crc_16_lo
	sta crc_16_lo 
	txa
	and #%00011111
	eor crc_16_hi
	sta crc_16_hi

	; done - 15*3 + 18*2 = 81 cycles

	rts

crc_16_ccitt_update2:

	; crc = (unsigned char)(crc >> 8 | crc << 8 );

	; crc ^= ser_data
	
	eor crc_16_hi	; implied swap
	sta temp	; crc_16_lo

	; crc ^= (unsigned char)(crc & 0xff) >> 4;

	lsr
	lsr
	lsr
	lsr
	eor temp
	sta temp

	; crc ^= (crc << 12);

	asl
	asl
	asl
	asl
	eor crc_16_lo	; implied swap
	sta crc_16_hi

	; crc ^= (unsigned char)(crc & 0xff) << 5;

	lda temp	; crc_16_lo
	cmp #$80
	ror
	cmp #$80
	ror
	cmp #$80
	ror
	tax
	and #%11100000
	eor temp
	sta crc_16_lo 
	txa
	and #%00011111
	eor crc_16_hi
	sta crc_16_hi

	; done - 11*3 + 18*2 = 69 cycles

	rts

Hey Ed, does this count as "a few lines" in your poll?

Snial · Post by **Snial** » Wed Nov 04, 2015 6:21 pm

Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".

But let's look at what the table driven algorithm gives us. If we start with the core part of the ‘C’ version:

Code: Select all

    const uint8_t *c = c_ptr;

    while (len--)
        crc = (crc << 8) ^ crctable[((crc >> 8) ^ *c++)];

    return crc;

In 6502 code that's:

Code: Select all

ldx #0
Crc16Loop:
lda (buff),y ;6c
iny ;2c
eor crc16+1 ;3c
sta crcTableLo
sta CrcTableHi ;both 3c
lda (crcTableLo,x) ;6c
sta crc16 ;3c
lda (CrcTableHi,x) ;6c
eor crc16+1 ;3c
sta crc16+1 ;3c
iny ;2c
bne crc16Loop ;3c

Total 53 cycles per byte, just 42% faster, for an algorithm that uses 760% more space.

So this illustrates the trade-offs we make in software. On a 6502, we can choose: a serial version which uses 47b and (15+13/2)*8+ 36c which is on average 208c/byte; or a routine that's 36% longer and over twice as fast; or one that’s 11.7x longer and 2.8x faster.

So you don't need much of a reason to choose the byte parallel version, it's only 16 bytes longer, but the only reason for choosing the table-driven algorithm is when an additional 42% of performance is more critical than 0.5Kb.

But if that's the case, wouldn't it make more sense to use a more powerful CPU or MCU? And in the MCU case memory is usually at a premium, which leaves powerful CPUs, which, if they use cache won't benefit much from table-driven methods.

And given what you say about really powerful CPUs having instructions to support crcs doesn't that swing things in favour of computational algorithms again?

Case closed I'd say

BigEd · Post by **BigEd** » Wed Nov 04, 2015 6:24 pm

(

teamtempest wrote:

...Hey Ed, does this count as "a few lines" in your poll?

I think that might be a few tens of lines... for interest, I see Snial has posted a couple of blog posts about bytewide CRC
)

Snial · Post by **Snial** » Wed Nov 04, 2015 6:53 pm

Quote:

I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights

The <<12 indeed leaves the carry in an unknown state, but the AND after the RORAs mask it out. After the first 3 RORAs we get:

<B2> B1:B0:??:B7:B6:B5:B4:B3

Where <xx> is the current carry contents and ?? Is the unknown state of the carry flag from the <<12, now in bit 5 of Acc. Then we save the value, then the AND #31 gives us:

<B2> 0:0:0:B7:B6:B5:B4:B3

Then we restore the rotated value and RORA again to get:

<B3> B2:B1:B0:??:B7:B6:B5:B4

So then when we AND #0xE0 we get:

<B3> B2:B1:B0:00:00:00:00:00

So in both cases, the unknown carry state is masked out. Using this approach we're able to shift an 8-bit value into a 16-bit result without having to shift/rotate a memory location.

cbscpe · Post by **cbscpe** » Wed Nov 04, 2015 8:37 pm

Code: Select all

	ldy	#0
	sty	crc
	sty	crc+1
loop:
	lda	(buf),y		; 5c
	eor 	crc+1		; 3c
	tax			; 2c 
	lda 	crc		; 3c
	eor 	crctabH,X	; 4c
	sta 	crc+1		; 4c
	lda 	crctabL,X	; 4c
	sta 	crc		; 4c
	iny			; 2c
	cpy	len		; 3c
	bne	loop		; 3c
	rts
;

37 cycles per byte, assuming crc, and len in zeropage, for a 256byte buffer cpy len is obsolete then it's 34 cycles per byte. For a sample program to generate the tables see

https://github.com/zqad/crc16-ccitt/blo ... te-table.c which of course must then be split into the H and L crc tables.

Snial · Post by **Snial** » Thu Nov 05, 2015 8:05 am

Well, it is true that your table driven version is faster. It turns out you can always leave the CPY Len out, because you can adjust the address of buf so that if y starts at (-Len) then (buf),y points to the first byte of buf (see how my byte parallel algorithm does it). That way, you always increment y until it gets to 0.

And yes I'm aware of the code to generate the table in the first place, it's simply an application of the bit serial algorithm.

But my point remains: even a 34 cycle table-driven CRC is only worthwhile if the performance is critical, which for an 8-bitter, means you should be using something else. And there are many situations where its benefit is questionable, some (such as in scripting applications over the net) where it's undesirable and some (constrained embedded applications) where it's not feasible. In addition, in practice the table-driven algorithm encourages lazy, erroneous implementations, and it's the sheer laziness which has made them so dominant. I'm well aware of them, they don't need any encouragement thanks.

The byte parallel version however doubles the performance with hardly any code cost and is easily replicable. But the level of awareness of them is low - they need encouragement. For the vast majority of applications, it's the best choice.

Tor · Post by **Tor** » Thu Nov 05, 2015 12:44 pm

Snial wrote:

Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".

I must disagree with that, I'm sure there are some that think like that but whenever CRC comes up in my line of work there is an obsession about making it as fast as possible. Even in this day and time with fast CPUs. There are some seriously huge datasets with CRC out there, and calculating the CRC is a major bottleneck because for many cases the CRC is defined in such a way that it more or less prevents effective parallelization of the processing. Two years ago I used several days to fine-tune my implementation of an old CRC-12 (!) algorithm to be as fast as possible on a 3.4GHz 8 core CPU - the algorithm included an intricate extra feedback so that a table couldn't be used. The original designers of the algorithm didn't see any problems with that back when it was created because it was all done in hardware on the satellite in question. But for me it created a problem to process the data fast enough before the next data dump arrived.

I think I (and my colleagues) have been fine-tuning CRCs for decades (you have to adjust implementations because CPUs dhange) and there's no end in sight. Of course these days it's very cheap to add CRC to protect small data structures (I do that now for small chunks of data passed around different processes) - it's fast. But for huge amounts of data, and there's more and more of that, CRC algorithms is not something you just find and throw in there because it works. It must be efficient too.

BigEd · Post by **BigEd** » Thu Nov 05, 2015 12:53 pm

(IIRC the ill-fated T9000 had a CRC instruction, the better to compute CRCs quickly, but it had the wrong bit-endianness and was not useful.)

floobydust · Post by **floobydust** » Thu Nov 05, 2015 11:52 pm

I have a CRC-16 routine in my Monitor code for doing the calculation for Xmodem-CRC via the console port driven from ExtraPutty or Teraterm. The executable code can be pretty short for a lookup table, but 512 bytes seem like a lot of overhead when trying to keep things small. You can use RAM to hold the tables, but you'll need the routine to build the tables.

I'm using similar code to Levanthal's but optimized for inline speed. I've used it with 4MHz 65C02 receiving data at 19.2K without any loss. The core routine is 64 bytes which resets the CRC-16 word, processes all 128 data bytes in the Xmodem receive buffer and compares the checksum to the one in the received frame.

Code: Select all

BLK_OKAY		STZ	CRCLO	;Reset the CRC value by
				STZ	CRCHI	;putting all bits off
				LDY	#$02	;Set index for data offset
CALCCRC		LDA	RBUFF,Y	;Get first data byte
				STA	VALUE	;Store it for CRC calculation
GENCRC		PHP	;Save registers
				PHY
				LDX	#$08	;Load index for 8 bits
CRCLOOP		ASL	VALUE	;Shift (next) bit to carry
				ROR	A	;Get bit into bit 7
				AND	#%10000000	;Mask off other bits
				EOR	CRCHI	;Exclusive OR with CRC hi
				ASL	CRCLO	;Shift CRC left
				ROL	A	;Then do high byte
				BCC	CRCLP1	;Branch is MSB is 1
				TAY	;Save CRC hi in Y reg
				LDA	CRCLO	;Get CRC lo
				EOR	#$21	;Exclusive OR with polynomial
				STA	CRCLO	;Store it
				TYA	;Get CRC hi back
				EOR	#$10	;Exclusive OR with polynomial
CRCLP1		STA	CRCHI	;Store CRC hi
				DEX	;Decrement index
				BNE	CRCLOOP	;Loop back for all 8 bits
				PLY	;Restore registers
				PLP
				INY	;Increment index to the next data byte
				CPY	#$82	;Check for 128 bytes read
				BNE	CALCCRC	;Branch back until all 128 fed to CRC routine
				LDA	RBUFF,Y	;Get received CRC hi byte
				CMP	CRCHI	;Compare against calculated CRC hi byte
				BNE	BADCRC	;If bad CRC, handle error
				INY	;Increment index pointer
				LDA	RBUFF,Y	;Get CRC lo byte
				CMP	CRCLO	;Compare against calculated CRC lo byte
				BEQ	GOODCRC	;If good, go move good frame to memory
; CRC was bad... need to retry and receive the last frame again
; Decrement the CRC retry count, send a NAK and try again
; Count allows up to 16 retries, then cancels the transfer
BADCRC		DEC	CRCCNT	;Decrement retry count
				BNE	CRCRTRY	;Retry again if count not zero

I've received hundreds of data files up to 30KB via ExtraPutty and Teraterm without any issues to date. I'm also doing double-data rate on a 65C51 (38.4KB) without issue.

Byte Parallel Crc16

Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16