6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 01, 2024 1:04 am

All times are UTC




Post new topic Reply to topic  [ 31 posts ]  Go to page 1, 2, 3  Next
Author Message
 Post subject: Byte Parallel Crc16
PostPosted: Thu Oct 29, 2015 9:51 am 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Hi folks,

Here's a nifty byte-parallel 16-bit CRC algorithm based on the algorithm here (works out as 9.75c/bit):

http://embdev.net/articles/CRC-16-CCITT ... _Assembler

Y=256-Length, BUFF^Actual buffer-256+Length.

Code:
CRC16:
   LDA (BUFF),Y
   EOR CrcHi
   STA CrcHi   ;really CrcLo
   LSRA
   LSRA
   LSRA
   LSRA   ;(Crc&0xff)>>4.
   EOR CrcHi
   STA Temp0 ;proper CrcLo now.
   ASLA
   ASLA
   ASLA
   ASLA   ;(Crc<<12)
   EOR CrcLo ;Actual CrcHi from swap.
   STA CrcHi   ;this is the new CrcHi
   LDA Temp0   ;CrcLo
   RORA
   RORA
   RORA ;so bottom 5 bits need EORing into CrcHi
   STA Temp1
   AND #31
   EOR CrcHi
   STA CrcHi
   LDA Temp1
   RORA ;top 3 bits need EORing into CRCLo
   AND #$E0
   EOR Temp0
   STA CrcLo
   INY
   BNE CRC16   
   RTS   ;2*15+3*14+6 = 78c/byte.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Fri Oct 30, 2015 5:23 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
I'm having a little trouble this line in the 'C' version of the algorithm:

Code:
crc ^= ((crc & 0xff) << 4) << 1;


To me it looks like the low byte is shifted left four times, then once more. Isn't that the same as:

Code:
crc ^= (crc & 0xff) << 5;


Or am I missing something?


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Fri Oct 30, 2015 5:46 pm 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Yep, it's the same.

However, the Avr has a nybble swapping instruction and the Gcc compiler could make better use of it if the code is expressed this way.

On the other hand it's in assembler- so what does it matter? ;-) Note, the assembler code uses swap quite a bit! On an AtMega the speed can be improved further by using multiply instructions (*32 is faster than <<5).

It's worth comparing the code with the optimized one on the MDFS website. That Crc16 takes 20+8*(15+14/2)=196c/byte= about 24 cycles per bit, over 2x skewer than the algorithm presented here.

http://beebwiki.mdfs.net/CRC-16

The reason why I'm a little obsessed with byte parallel CRCs, is because a highly efficient 6800 implementation was published as a BSI standard in the 1970s, but somehow byte parallel versions have been obscured by bit and (sometimes erroneous) table-driven ones.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Mon Nov 02, 2015 7:53 pm 
Offline
User avatar

Joined: Sun Oct 13, 2013 2:58 pm
Posts: 491
Location: Switzerland
And why not consider an algorithm with look up tables?


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Mon Nov 02, 2015 8:59 pm 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Quote:
And why not consider an algorithm with look up tables?


About a decade ago, it was hard to find the byte-parallel version, but recently it's become more common.

It turns out there are good reasons to prefer non-table versions, a 512b table is a fairly significant size for an 8-bit CPU. Secondly, some of the CRC tables on the net actually contain errors (though you wouldn't notice it if you generate and compare the CRCs using the same tables). Lastly, on powerful CPUs, a CRC table will generate a lot of cache misses and therefore it's less fast than you might expect than the computational version.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 7:00 am 
Offline
User avatar

Joined: Sun Oct 13, 2013 2:58 pm
Posts: 491
Location: Switzerland
Perhaps, but even for an 8bit computer a 512byte table is not much. The performance increase is important especially for high baud-rates you can calculate the CRC while receiving. So you have no delays between the receiption of a packet and a ACK or NAK. Which is one of the major reasons for low transfer rates. As for the cache misses, we are talking about normal 8bit computers, they mostly do not have caches. Todays computer with cache also include instructions that better support the calculation of CRC than shift and swap. And if you are afraid of wrong tables, then you could as well add a routine to your Xmodem that builds your table in advance in RAM. Or you challenge the table against your algorithm. You just need to run 2**24 cases. Takes about an hour on a 1Mhz 6502.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 6:15 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights (crc ^= crc << 5). Maybe I'm wrong, but here's a couple of alternate versions:

Code:
accumulate_crc16:

/* Algorithm: CRC-16-CCITT, x16 + x12 + x5 + 1, 0x1021 / 0x8408 / 0x8810
   Steps in C to update a running CRC with a new byte in ser_data:
    unsigned char ser_data;
   unsigned int crc;   -> accumulated CRC to update, initial value 0xFFFF
    crc  = (unsigned char)(crc >> 8) | (crc << 8);
   crc ^= ser_data;
   crc ^= (unsigned char)(crc & 0xff) >> 4;
   crc ^= (crc << 8) << 4;
    crc ^= ((crc & 0xff) << 4) << 1;
    Note that there are faster table-based implementations, but they
   consume more program space.
 */

crc_16_ccitt_update:

   ; crc = (unsigned char)(crc >> 8 | crc << 8 );

   ldx crc_16_lo
   ldy crc_16_hi
   sty crc_16_lo
   stx crc_16_hi

   ; crc ^= ser_data

   eor crc_16_lo
   sta crc_16_lo
   
   ; crc ^= (unsigned char)(crc & 0xff) >> 4;

   lsr
   lsr
   lsr
   lsr
   eor crc_16_lo
   sta crc_16_lo

   ; crc ^= (crc << 8) << 4;

   asl
   asl
   asl
   asl
   eor crc_16_hi
   sta crc_16_hi

   ; crc ^= ((crc & 0xff) << 4) << 1;

   lda crc_16_lo
   cmp #$80
   ror
   cmp #$80
   ror
   cmp #$80
   ror
   tax
   and #%11100000
   eor crc_16_lo
   sta crc_16_lo
   txa
   and #%00011111
   eor crc_16_hi
   sta crc_16_hi

   ; done - 15*3 + 18*2 = 81 cycles

   rts

crc_16_ccitt_update2:

   ; crc = (unsigned char)(crc >> 8 | crc << 8 );

   ; crc ^= ser_data
   
   eor crc_16_hi   ; implied swap
   sta temp   ; crc_16_lo

   ; crc ^= (unsigned char)(crc & 0xff) >> 4;

   lsr
   lsr
   lsr
   lsr
   eor temp
   sta temp

   ; crc ^= (crc << 12);

   asl
   asl
   asl
   asl
   eor crc_16_lo   ; implied swap
   sta crc_16_hi

   ; crc ^= (unsigned char)(crc & 0xff) << 5;

   lda temp   ; crc_16_lo
   cmp #$80
   ror
   cmp #$80
   ror
   cmp #$80
   ror
   tax
   and #%11100000
   eor temp
   sta crc_16_lo
   txa
   and #%00011111
   eor crc_16_hi
   sta crc_16_hi

   ; done - 11*3 + 18*2 = 69 cycles

   rts


Hey Ed, does this count as "a few lines" in your poll? :?


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 6:21 pm 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".

But let's look at what the table driven algorithm gives us. If we start with the core part of the ‘C’ version:

Code:
    const uint8_t *c = c_ptr;

    while (len--)
        crc = (crc << 8) ^ crctable[((crc >> 8) ^ *c++)];

    return crc;


In 6502 code that's:

Code:
ldx #0
Crc16Loop:
lda (buff),y ;6c
iny ;2c
eor crc16+1 ;3c
sta crcTableLo
sta CrcTableHi ;both 3c
lda (crcTableLo,x) ;6c
sta crc16 ;3c
lda (CrcTableHi,x) ;6c
eor crc16+1 ;3c
sta crc16+1 ;3c
iny ;2c
bne crc16Loop ;3c


Total 53 cycles per byte, just 42% faster, for an algorithm that uses 760% more space.

So this illustrates the trade-offs we make in software. On a 6502, we can choose: a serial version which uses 47b and (15+13/2)*8+ 36c which is on average 208c/byte; or a routine that's 36% longer and over twice as fast; or one that’s 11.7x longer and 2.8x faster.

So you don't need much of a reason to choose the byte parallel version, it's only 16 bytes longer, but the only reason for choosing the table-driven algorithm is when an additional 42% of performance is more critical than 0.5Kb.

But if that's the case, wouldn't it make more sense to use a more powerful CPU or MCU? And in the MCU case memory is usually at a premium, which leaves powerful CPUs, which, if they use cache won't benefit much from table-driven methods.

And given what you say about really powerful CPUs having instructions to support crcs doesn't that swing things in favour of computational algorithms again?

Case closed I'd say ;-)


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 6:24 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
(
teamtempest wrote:
...Hey Ed, does this count as "a few lines" in your poll? :?

I think that might be a few tens of lines... for interest, I see Snial has posted a couple of blog posts about bytewide CRC
)


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 6:53 pm 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Quote:
I'm suspicious that your left shifts (crc ^= crc << 12) leave a bit of unknown value in the carry flag that you later bring into the accumulator with the rotate rights


The <<12 indeed leaves the carry in an unknown state, but the AND after the RORAs mask it out. After the first 3 RORAs we get:

<B2> B1:B0:??:B7:B6:B5:B4:B3

Where <xx> is the current carry contents and ?? Is the unknown state of the carry flag from the <<12, now in bit 5 of Acc. Then we save the value, then the AND #31 gives us:

<B2> 0:0:0:B7:B6:B5:B4:B3

Then we restore the rotated value and RORA again to get:

<B3> B2:B1:B0:??:B7:B6:B5:B4

So then when we AND #0xE0 we get:

<B3> B2:B1:B0:00:00:00:00:00

So in both cases, the unknown carry state is masked out. Using this approach we're able to shift an 8-bit value into a 16-bit result without having to shift/rotate a memory location.

:-)


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Wed Nov 04, 2015 8:37 pm 
Offline
User avatar

Joined: Sun Oct 13, 2013 2:58 pm
Posts: 491
Location: Switzerland
Code:
   ldy   #0
   sty   crc
   sty   crc+1
loop:
   lda   (buf),y      ; 5c
   eor    crc+1      ; 3c
   tax         ; 2c
   lda    crc      ; 3c
   eor    crctabH,X   ; 4c
   sta    crc+1      ; 4c
   lda    crctabL,X   ; 4c
   sta    crc      ; 4c
   iny         ; 2c
   cpy   len      ; 3c
   bne   loop      ; 3c
   rts
;


37 cycles per byte, assuming crc, and len in zeropage, for a 256byte buffer cpy len is obsolete then it's 34 cycles per byte. For a sample program to generate the tables see

https://github.com/zqad/crc16-ccitt/blob/master/crc16-ccitt-generate-table.c which of course must then be split into the H and L crc tables.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Thu Nov 05, 2015 8:05 am 
Offline

Joined: Mon Oct 17, 2011 7:50 am
Posts: 33
Well, it is true that your table driven version is faster. It turns out you can always leave the CPY Len out, because you can adjust the address of buf so that if y starts at (-Len) then (buf),y points to the first byte of buf (see how my byte parallel algorithm does it). That way, you always increment y until it gets to 0.

And yes I'm aware of the code to generate the table in the first place, it's simply an application of the bit serial algorithm.

But my point remains: even a 34 cycle table-driven CRC is only worthwhile if the performance is critical, which for an 8-bitter, means you should be using something else. And there are many situations where its benefit is questionable, some (such as in scripting applications over the net) where it's undesirable and some (constrained embedded applications) where it's not feasible. In addition, in practice the table-driven algorithm encourages lazy, erroneous implementations, and it's the sheer laziness which has made them so dominant. I'm well aware of them, they don't need any encouragement thanks.

The byte parallel version however doubles the performance with hardly any code cost and is easily replicable. But the level of awareness of them is low - they need encouragement. For the vast majority of applications, it's the best choice.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Thu Nov 05, 2015 12:44 pm 
Offline

Joined: Sun Apr 10, 2011 8:29 am
Posts: 597
Location: Norway/Japan
Snial wrote:
Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".
I must disagree with that, I'm sure there are some that think like that but whenever CRC comes up in my line of work there is an obsession about making it as fast as possible. Even in this day and time with fast CPUs. There are some seriously huge datasets with CRC out there, and calculating the CRC is a major bottleneck because for many cases the CRC is defined in such a way that it more or less prevents effective parallelization of the processing. Two years ago I used several days to fine-tune my implementation of an old CRC-12 (!) algorithm to be as fast as possible on a 3.4GHz 8 core CPU - the algorithm included an intricate extra feedback so that a table couldn't be used. The original designers of the algorithm didn't see any problems with that back when it was created because it was all done in hardware on the satellite in question. But for me it created a problem to process the data fast enough before the next data dump arrived.

I think I (and my colleagues) have been fine-tuning CRCs for decades (you have to adjust implementations because CPUs dhange) and there's no end in sight. Of course these days it's very cheap to add CRC to protect small data structures (I do that now for small chunks of data passed around different processes) - it's fast. But for huge amounts of data, and there's more and more of that, CRC algorithms is not something you just find and throw in there because it works. It must be efficient too.


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Thu Nov 05, 2015 12:53 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
(IIRC the ill-fated T9000 had a CRC instruction, the better to compute CRCs quickly, but it had the wrong bit-endianness and was not useful.)


Top
 Profile  
Reply with quote  
 Post subject: Re: Byte Parallel Crc16
PostPosted: Thu Nov 05, 2015 11:52 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
I have a CRC-16 routine in my Monitor code for doing the calculation for Xmodem-CRC via the console port driven from ExtraPutty or Teraterm. The executable code can be pretty short for a lookup table, but 512 bytes seem like a lot of overhead when trying to keep things small. You can use RAM to hold the tables, but you'll need the routine to build the tables.

I'm using similar code to Levanthal's but optimized for inline speed. I've used it with 4MHz 65C02 receiving data at 19.2K without any loss. The core routine is 64 bytes which resets the CRC-16 word, processes all 128 data bytes in the Xmodem receive buffer and compares the checksum to the one in the received frame.

Code:
BLK_OKAY      STZ   CRCLO   ;Reset the CRC value by
            STZ   CRCHI   ;putting all bits off
            LDY   #$02   ;Set index for data offset
CALCCRC      LDA   RBUFF,Y   ;Get first data byte
            STA   VALUE   ;Store it for CRC calculation
GENCRC      PHP   ;Save registers
            PHY
            LDX   #$08   ;Load index for 8 bits
CRCLOOP      ASL   VALUE   ;Shift (next) bit to carry
            ROR   A   ;Get bit into bit 7
            AND   #%10000000   ;Mask off other bits
            EOR   CRCHI   ;Exclusive OR with CRC hi
            ASL   CRCLO   ;Shift CRC left
            ROL   A   ;Then do high byte
            BCC   CRCLP1   ;Branch is MSB is 1
            TAY   ;Save CRC hi in Y reg
            LDA   CRCLO   ;Get CRC lo
            EOR   #$21   ;Exclusive OR with polynomial
            STA   CRCLO   ;Store it
            TYA   ;Get CRC hi back
            EOR   #$10   ;Exclusive OR with polynomial
CRCLP1      STA   CRCHI   ;Store CRC hi
            DEX   ;Decrement index
            BNE   CRCLOOP   ;Loop back for all 8 bits
            PLY   ;Restore registers
            PLP
            INY   ;Increment index to the next data byte
            CPY   #$82   ;Check for 128 bytes read
            BNE   CALCCRC   ;Branch back until all 128 fed to CRC routine
            LDA   RBUFF,Y   ;Get received CRC hi byte
            CMP   CRCHI   ;Compare against calculated CRC hi byte
            BNE   BADCRC   ;If bad CRC, handle error
            INY   ;Increment index pointer
            LDA   RBUFF,Y   ;Get CRC lo byte
            CMP   CRCLO   ;Compare against calculated CRC lo byte
            BEQ   GOODCRC   ;If good, go move good frame to memory
; CRC was bad... need to retry and receive the last frame again
; Decrement the CRC retry count, send a NAK and try again
; Count allows up to 16 retries, then cancels the transfer
BADCRC      DEC   CRCCNT   ;Decrement retry count
            BNE   CRCRTRY   ;Retry again if count not zero


I've received hundreds of data files up to 30KB via ExtraPutty and Teraterm without any issues to date. I'm also doing double-data rate on a 65C51 (38.4KB) without issue.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 31 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 12 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: