Page 1 of 3
Byte Parallel Crc16
Posted: Thu Oct 29, 2015 9:51 am
by Snial
Hi folks,
Here's a nifty byte-parallel 16-bit CRC algorithm based on the algorithm here (works out as 9.75c/bit):
http://embdev.net/articles/CRC-16-CCITT ... _Assembler
Y=256-Length, BUFF^Actual buffer-256+Length.
Code: Select all
CRC16:
LDA (BUFF),Y
EOR CrcHi
STA CrcHi ;really CrcLo
LSRA
LSRA
LSRA
LSRA ;(Crc&0xff)>>4.
EOR CrcHi
STA Temp0 ;proper CrcLo now.
ASLA
ASLA
ASLA
ASLA ;(Crc<<12)
EOR CrcLo ;Actual CrcHi from swap.
STA CrcHi ;this is the new CrcHi
LDA Temp0 ;CrcLo
RORA
RORA
RORA ;so bottom 5 bits need EORing into CrcHi
STA Temp1
AND #31
EOR CrcHi
STA CrcHi
LDA Temp1
RORA ;top 3 bits need EORing into CRCLo
AND #$E0
EOR Temp0
STA CrcLo
INY
BNE CRC16
RTS ;2*15+3*14+6 = 78c/byte.
Re: Byte Parallel Crc16
Posted: Fri Oct 30, 2015 5:23 pm
by teamtempest
I'm having a little trouble this line in the 'C' version of the algorithm:
To me it looks like the low byte is shifted left four times, then once more. Isn't that the same as:
Or am I missing something?
Re: Byte Parallel Crc16
Posted: Fri Oct 30, 2015 5:46 pm
by Snial
Yep, it's the same.
However, the Avr has a nybble swapping instruction and the Gcc compiler could make better use of it if the code is expressed this way.
On the other hand it's in assembler- so what does it matter?

Note, the assembler code uses swap quite a bit! On an AtMega the speed can be improved further by using multiply instructions (*32 is faster than <<5).
It's worth comparing the code with the optimized one on the MDFS website. That Crc16 takes 20+8*(15+14/2)=196c/byte= about 24 cycles per bit, over 2x skewer than the algorithm presented here.
http://beebwiki.mdfs.net/CRC-16
The reason why I'm a little obsessed with byte parallel CRCs, is because a highly efficient 6800 implementation was published as a BSI standard in the 1970s, but somehow byte parallel versions have been obscured by bit and (sometimes erroneous) table-driven ones.
Re: Byte Parallel Crc16
Posted: Mon Nov 02, 2015 7:53 pm
by cbscpe
And why not consider an algorithm with look up tables?
Re: Byte Parallel Crc16
Posted: Mon Nov 02, 2015 8:59 pm
by Snial
And why not consider an algorithm with look up tables?
About a decade ago, it was hard to find the byte-parallel version, but recently it's become more common.
It turns out there are good reasons to prefer non-table versions, a 512b table is a fairly significant size for an 8-bit CPU. Secondly, some of the CRC tables on the net actually contain errors (though you wouldn't notice it if you generate and compare the CRCs using the same tables). Lastly, on powerful CPUs, a CRC table will generate a lot of cache misses and therefore it's less fast than you might expect than the computational version.
Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 7:00 am
by cbscpe
Perhaps, but even for an 8bit computer a 512byte table is not much. The performance increase is important especially for high baud-rates you can calculate the CRC while receiving. So you have no delays between the receiption of a packet and a ACK or NAK. Which is one of the major reasons for low transfer rates. As for the cache misses, we are talking about normal 8bit computers, they mostly do not have caches. Todays computer with cache also include instructions that better support the calculation of CRC than shift and swap. And if you are afraid of wrong tables, then you could as well add a routine to your Xmodem that builds your table in advance in RAM. Or you challenge the table against your algorithm. You just need to run 2**24 cases. Takes about an hour on a 1Mhz 6502.
Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 6:15 pm
by teamtempest
I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights (crc ^= crc << 5). Maybe I'm wrong, but here's a couple of alternate versions:
Code: Select all
accumulate_crc16:
/* Algorithm: CRC-16-CCITT, x16 + x12 + x5 + 1, 0x1021 / 0x8408 / 0x8810
Steps in C to update a running CRC with a new byte in ser_data:
unsigned char ser_data;
unsigned int crc; -> accumulated CRC to update, initial value 0xFFFF
crc = (unsigned char)(crc >> 8) | (crc << 8);
crc ^= ser_data;
crc ^= (unsigned char)(crc & 0xff) >> 4;
crc ^= (crc << 8) << 4;
crc ^= ((crc & 0xff) << 4) << 1;
Note that there are faster table-based implementations, but they
consume more program space.
*/
crc_16_ccitt_update:
; crc = (unsigned char)(crc >> 8 | crc << 8 );
ldx crc_16_lo
ldy crc_16_hi
sty crc_16_lo
stx crc_16_hi
; crc ^= ser_data
eor crc_16_lo
sta crc_16_lo
; crc ^= (unsigned char)(crc & 0xff) >> 4;
lsr
lsr
lsr
lsr
eor crc_16_lo
sta crc_16_lo
; crc ^= (crc << 8) << 4;
asl
asl
asl
asl
eor crc_16_hi
sta crc_16_hi
; crc ^= ((crc & 0xff) << 4) << 1;
lda crc_16_lo
cmp #$80
ror
cmp #$80
ror
cmp #$80
ror
tax
and #%11100000
eor crc_16_lo
sta crc_16_lo
txa
and #%00011111
eor crc_16_hi
sta crc_16_hi
; done - 15*3 + 18*2 = 81 cycles
rts
crc_16_ccitt_update2:
; crc = (unsigned char)(crc >> 8 | crc << 8 );
; crc ^= ser_data
eor crc_16_hi ; implied swap
sta temp ; crc_16_lo
; crc ^= (unsigned char)(crc & 0xff) >> 4;
lsr
lsr
lsr
lsr
eor temp
sta temp
; crc ^= (crc << 12);
asl
asl
asl
asl
eor crc_16_lo ; implied swap
sta crc_16_hi
; crc ^= (unsigned char)(crc & 0xff) << 5;
lda temp ; crc_16_lo
cmp #$80
ror
cmp #$80
ror
cmp #$80
ror
tax
and #%11100000
eor temp
sta crc_16_lo
txa
and #%00011111
eor crc_16_hi
sta crc_16_hi
; done - 11*3 + 18*2 = 69 cycles
rts
Hey Ed, does this count as "a few lines" in your poll?

Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 6:21 pm
by Snial
Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".
But let's look at what the table driven algorithm gives us. If we start with the core part of the ‘C’ version:
Code: Select all
const uint8_t *c = c_ptr;
while (len--)
crc = (crc << 8) ^ crctable[((crc >> 8) ^ *c++)];
return crc;
In 6502 code that's:
Code: Select all
ldx #0
Crc16Loop:
lda (buff),y ;6c
iny ;2c
eor crc16+1 ;3c
sta crcTableLo
sta CrcTableHi ;both 3c
lda (crcTableLo,x) ;6c
sta crc16 ;3c
lda (CrcTableHi,x) ;6c
eor crc16+1 ;3c
sta crc16+1 ;3c
iny ;2c
bne crc16Loop ;3c
Total 53 cycles per byte, just 42% faster, for an algorithm that uses 760% more space.
So this illustrates the trade-offs we make in software. On a 6502, we can choose: a serial version which uses 47b and (15+13/2)*8+ 36c which is on average 208c/byte; or a routine that's 36% longer and over twice as fast; or one that’s 11.7x longer and 2.8x faster.
So you don't need much of a reason to choose the byte parallel version, it's only 16 bytes longer, but the only reason for choosing the table-driven algorithm is when an additional 42% of performance is more critical than 0.5Kb.
But if that's the case, wouldn't it make more sense to use a more powerful CPU or MCU? And in the MCU case memory is usually at a premium, which leaves powerful CPUs, which, if they use cache won't benefit much from table-driven methods.
And given what you say about really powerful CPUs having instructions to support crcs doesn't that swing things in favour of computational algorithms again?
Case closed I'd say

Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 6:24 pm
by BigEd
(
...Hey Ed, does this count as "a few lines" in your poll?

I think that might be a few tens of lines... for interest, I see Snial has posted a
couple of blog posts about bytewide CRC
)
Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 6:53 pm
by Snial
I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights
The <<12 indeed leaves the carry in an unknown state, but the AND after the RORAs mask it out. After the first 3 RORAs we get:
<B2> B1:B0:??:B7:B6:B5:B4:B3
Where <xx> is the current carry contents and ?? Is the unknown state of the carry flag from the <<12, now in bit 5 of Acc. Then we save the value, then the AND #31 gives us:
<B2> 0:0:0:B7:B6:B5:B4:B3
Then we restore the rotated value and RORA again to get:
<B3> B2:B1:B0:??:B7:B6:B5:B4
So then when we AND #0xE0 we get:
<B3> B2:B1:B0:00:00:00:00:00
So in both cases, the unknown carry state is masked out. Using this approach we're able to shift an 8-bit value into a 16-bit result without having to shift/rotate a memory location.

Re: Byte Parallel Crc16
Posted: Wed Nov 04, 2015 8:37 pm
by cbscpe
Code: Select all
ldy #0
sty crc
sty crc+1
loop:
lda (buf),y ; 5c
eor crc+1 ; 3c
tax ; 2c
lda crc ; 3c
eor crctabH,X ; 4c
sta crc+1 ; 4c
lda crctabL,X ; 4c
sta crc ; 4c
iny ; 2c
cpy len ; 3c
bne loop ; 3c
rts
;
37 cycles per byte, assuming crc, and len in zeropage, for a 256byte buffer cpy len is obsolete then it's 34 cycles per byte. For a sample program to generate the tables see
https://github.com/zqad/crc16-ccitt/blo ... te-table.c which of course must then be split into the H and L crc tables.
Re: Byte Parallel Crc16
Posted: Thu Nov 05, 2015 8:05 am
by Snial
Well, it is true that your table driven version is faster. It turns out you can always leave the CPY Len out, because you can adjust the address of buf so that if y starts at (-Len) then (buf),y points to the first byte of buf (see how my byte parallel algorithm does it). That way, you always increment y until it gets to 0.
And yes I'm aware of the code to generate the table in the first place, it's simply an application of the bit serial algorithm.
But my point remains: even a 34 cycle table-driven CRC is only worthwhile if the performance is critical, which for an 8-bitter, means you should be using something else. And there are many situations where its benefit is questionable, some (such as in scripting applications over the net) where it's undesirable and some (constrained embedded applications) where it's not feasible. In addition, in practice the table-driven algorithm encourages lazy, erroneous implementations, and it's the sheer laziness which has made them so dominant. I'm well aware of them, they don't need any encouragement thanks.
The byte parallel version however doubles the performance with hardly any code cost and is easily replicable. But the level of awareness of them is low - they need encouragement. For the vast majority of applications, it's the best choice.
Re: Byte Parallel Crc16
Posted: Thu Nov 05, 2015 12:44 pm
by Tor
Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".
I must disagree with that, I'm sure there are some that think like that but whenever CRC comes up in my line of work there is an obsession about making it as fast as possible. Even in this day and time with fast CPUs. There are some seriously huge datasets with CRC out there, and calculating the CRC is a major bottleneck because for many cases the CRC is defined in such a way that it more or less prevents effective parallelization of the processing. Two years ago I used several days to fine-tune my implementation of an old CRC-12 (!) algorithm to be as fast as possible on a 3.4GHz 8 core CPU - the algorithm included an intricate extra feedback so that a table couldn't be used. The original designers of the algorithm didn't see any problems with that back when it was created because it was all done in hardware on the satellite in question. But for me it created a problem to process the data fast enough before the next data dump arrived.
I think I (and my colleagues) have been fine-tuning CRCs for decades (you have to adjust implementations because CPUs dhange) and there's no end in sight. Of course these days it's very cheap to add CRC to protect small data structures (I do that now for small chunks of data passed around different processes) - it's fast. But for huge amounts of data, and there's more and more of that, CRC algorithms is not something you just find and throw in there because it works. It must be efficient too.
Re: Byte Parallel Crc16
Posted: Thu Nov 05, 2015 12:53 pm
by BigEd
(IIRC the ill-fated T9000 had a CRC instruction, the better to compute CRCs quickly, but it had the wrong bit-endianness and was not useful.)
Re: Byte Parallel Crc16
Posted: Thu Nov 05, 2015 11:52 pm
by floobydust
I have a CRC-16 routine in my Monitor code for doing the calculation for Xmodem-CRC via the console port driven from ExtraPutty or Teraterm. The executable code can be pretty short for a lookup table, but 512 bytes seem like a lot of overhead when trying to keep things small. You can use RAM to hold the tables, but you'll need the routine to build the tables.
I'm using similar code to Levanthal's but optimized for inline speed. I've used it with 4MHz 65C02 receiving data at 19.2K without any loss. The core routine is 64 bytes which resets the CRC-16 word, processes all 128 data bytes in the Xmodem receive buffer and compares the checksum to the one in the received frame.
Code: Select all
BLK_OKAY STZ CRCLO ;Reset the CRC value by
STZ CRCHI ;putting all bits off
LDY #$02 ;Set index for data offset
CALCCRC LDA RBUFF,Y ;Get first data byte
STA VALUE ;Store it for CRC calculation
GENCRC PHP ;Save registers
PHY
LDX #$08 ;Load index for 8 bits
CRCLOOP ASL VALUE ;Shift (next) bit to carry
ROR A ;Get bit into bit 7
AND #%10000000 ;Mask off other bits
EOR CRCHI ;Exclusive OR with CRC hi
ASL CRCLO ;Shift CRC left
ROL A ;Then do high byte
BCC CRCLP1 ;Branch is MSB is 1
TAY ;Save CRC hi in Y reg
LDA CRCLO ;Get CRC lo
EOR #$21 ;Exclusive OR with polynomial
STA CRCLO ;Store it
TYA ;Get CRC hi back
EOR #$10 ;Exclusive OR with polynomial
CRCLP1 STA CRCHI ;Store CRC hi
DEX ;Decrement index
BNE CRCLOOP ;Loop back for all 8 bits
PLY ;Restore registers
PLP
INY ;Increment index to the next data byte
CPY #$82 ;Check for 128 bytes read
BNE CALCCRC ;Branch back until all 128 fed to CRC routine
LDA RBUFF,Y ;Get received CRC hi byte
CMP CRCHI ;Compare against calculated CRC hi byte
BNE BADCRC ;If bad CRC, handle error
INY ;Increment index pointer
LDA RBUFF,Y ;Get CRC lo byte
CMP CRCLO ;Compare against calculated CRC lo byte
BEQ GOODCRC ;If good, go move good frame to memory
; CRC was bad... need to retry and receive the last frame again
; Decrement the CRC retry count, send a NAK and try again
; Count allows up to 16 retries, then cancels the transfer
BADCRC DEC CRCCNT ;Decrement retry count
BNE CRCRTRY ;Retry again if count not zero
I've received hundreds of data files up to 30KB via ExtraPutty and Teraterm without any issues to date. I'm also doing double-data rate on a 65C51 (38.4KB) without issue.