Byte Parallel Crc16

Snial · Post by **Snial** » Fri Nov 06, 2015 7:47 am

Quote:

The executable code can be pretty short for a lookup table, but 512 bytes seem like a lot of overhead when trying to keep things small

Exactly, that's why the routine I published at the beginning of this topic is a byte-parallel CRC16 routine of almost the same length, but over twice as fast and doesn't use a 512b table. It's a direct replacement for a bit-serial routine.

A 4MHz 6502 could process it at a rate of over 40K bytes per second, around 115200 baud

-cheers Julz

Snial · Post by **Snial** » Fri Nov 06, 2015 7:48 am

Oops correction: it should be able to process CRCs at 230.4Kbaud.

cbscpe · Post by **cbscpe** » Fri Nov 06, 2015 2:10 pm

What system that requires a fast down/up load has only little memory? I don't see this combination being useful. So in other words, systems that need a fast down/up load normally have a decent amount of memory. If you have a fast down/up load then you probably also want to have the fastest algorithms. So I see no objection against tables using 512bytes. As mentioned already, the table could be dynamically generated, the code takes only very little memory and then you can get rid of the tables to re-use the space. And using the tables I have 230.4kbaud on a 2MHz system. Or 115.2kbaud on a 1.8432MHz system (using only one crystal for the 6502 and the 6551 and using 0x0 as the baud-rate for the 6551 giving 115.2kbaud).

floobydust · Post by **floobydust** » Fri Nov 06, 2015 2:35 pm

I would agree that small memory systems "shouldn't" need such fast download speeds, but it all depends on the application.

My 65C02 based system has 32KB RAM and 32KB ROM (less 256 bytes for an I/O page). The loader command in my monitor can download ASCII or Binary data using Xmodem-CRC via the terminal program to a default or specified memory address. It also "automagically" detects Motorola S19 records (created by the WDC tools linker) and processes those into memory as executable code with an optional address offset.

With a console baud rate of 38.4Kb, the inline routine is more than fast enough and small enough, so I can't see any reason to waste ROM or RAM space for 512 bytes of tables and/or the code to soft-gen the tables, which would likely require more ROM space than my current approach.

BigEd · Post by **BigEd** » Fri Nov 06, 2015 3:00 pm

I fear we're approaching an endless debate here. Seems to me that sometimes the smallest code is best, and sometimes the fastest. It's worthwhile to have a record here of the byte-parallel approach, especially as it's not well known. And it seems a valid point that a table-driven approach allows for a different kind of bug, compared to an algorithmic approach.

floobydust · Post by **floobydust** » Fri Nov 06, 2015 8:46 pm

It's not an endless debate, we've not even filled two pages yet

One point that should be mentioned however is the physical transfer of the data itself. Using Xmodem as an example, once a frame is sent, the sending machine waits until an ACK is received before sending the next consecutive frame. This gives the receiving machine time to generate the CRC value, compare it to the one in the frame and either send an ACK for successful receive or send a NAK to try again. In this scenario, the speed of processing the CRC check is likely a moot point.

As a simple test, I took one of my 4MHz board sets and configured the 65C51 for 115,200 baud and have successfully received Xmodem-CRC data without any issues. I'm pretty certain I could drop to a 2MHz clock without a problem... might try that later.

cbscpe · Post by **cbscpe** » Mon Nov 09, 2015 5:28 pm

No I don't think it is an endless discusssion. It's rather stimulating. I just did not like the statement that a table is less reliable because it could be a "wrong" table. That's simply not true, as with all programs you need to be careful and when you use tables they must be accurate as must be the algorithm. And as I pointed out, you could use a proven algorithm to generate the tables and then you are automatically on the save side, except for copy and paste errors.

Now to the performance. As everybody knows the delay with which a ACK or NAK is generated has a major impact on the overall throughput and therefore you must minimize the time between the reception of the last byte of a frame and the creation of the ACK/NAK. Therefore the best solution is to incrementally calculate the CRC for every byte received. Depending on the CPU clock and the transmission rate you have some cycles between two bytes. In a 4MHz system and with 115200baud you have roughly 340 cycles to receive one byte and calculate the CRC. So any algorithm presented here can be used to incrementally calculate the CRC. This has the advantage that when you receive the 2 CRC bytes at the end, the only thing you need to do is compare them to your results and create the ACK. Which is only some cycles after you received the CRC.

Snial · Post by **Snial** » Mon Nov 09, 2015 10:28 pm

Quote:

table is less reliable because it could be a "wrong" table. That's simply not true,

It's always possible for people to choose a correct implementation of a table-driven algorithm. The difficulty of that you can't control for how people use the resources they have available. For example, if you have a lazy programmer, then in the same way they are not going to prefer the 'best' github source to generate the tables, then you can't even expect them to generate the tables at all - they will simply copy the first thing they see.

I have seen programmers do this, on more than one occasion. They are just minimizing effort. I'm afraid, it just is true.

So the reason tables can be less reliable, is because they're easier to get wrong, if you're copying a whole pile of data, how will you notice an error in 512 entries? Even testing on input data sets will only reveal it if you cover all the possible inputs.

And unfortunately, the same reasoning that leads people to say things like "well, it might be a 512b table, but hey memory is plentiful" is the same principle that leads people to choose quick & easy solutions in every domain.

The same is potentially true of the bit serial or byte parallel computational algorithm: programmers aren't idealistic. However, in these cases, there is less to go wrong without you noticing it. It's easier to compare a correct version with a faulty version; which means the algorithms are inherently more reliable, from a psychological perspective.

Byte-parallel computational, nevertheless is a hard sell. That's because the code looks nothing like the error correction polynomial it's supposed to implement and therefore without some persuasion of its merits, programmers are likely to play safe. That's why when I faced that issue in some development last year, the programmer referred me to his version and the web page & JavaScript CRC calculator he got it from, rather than convert the C algorithm I was using into C#. The onus was on me to prove correctness, because his code "just worked". In the end, I spent some time validating my algorithm and we found an error in his tables.

This is why, I have a hunch, that the engineers who first introduced me to the byte parallel computational version were so keen to do so, before I started using the first implementation I saw (which would probably have been a bit-serial version at the time).

Like I did earlier, it really is all about trade-offs. There are cases where a correct table driven version is best. There are cases where bit serial is best (a hardware implementation springs to mind where the algorithm is just a pair of shift registers and a few xor gates).

Any way, into the next point:

Quote:

minimize the time between the reception of the last byte of a frame and the creation of the ACK/NAK. Therefore the best solution is to incrementally calculate the CRC for every byte received

Agreed, and in a 4MHz system at 230.4Kbaud you have about 170 cycles per byte to compute the CRC (78 cycles spare), which the bit-serial algorithm cannot do, but the byte parallel one can. Thus using it we'd improve a 340+200=540 cycle/byte algorithm into a 170 cycle/byte version for the cost of 16 bytes of code. Over 3x faster. The table-driven one would of course be faster still, it could probably cope with a 460.8Kbaud (85-34 = 51 cycles spare).

Cheers julz

floobydust · Post by **floobydust** » Tue Nov 10, 2015 12:47 am

I agree it's a good conversation... makes you think a bit more about it.

I agree that tables are fine, if you write the code to generate the CRC, then you can basically use the same code to generate the tables. Creating them in reclaimable RAM should eliminate errors, unless you manage to clobber them somewhere else in your code. However, between having a routine to dynamically create tables in RAM and another routine to use the tables for a faster CRC lookup, that will likely take more ROM space than having a single routine that generates the CRC in real-time. If you're looking to conserve ROM space, dynamic tables could be a less attractive method.

Using the idea of generating a CRC as each character is received is a great idea, just as long as the CPU can execute the CRC gen code and keep up with the data rate, then the reason for using tables is of little to no value. The only possible downside is if you have a hefty background interrupt processing routine that is consuming a lot of clock cycles.

I can likely change my Xmodem loader to generate the CRC character per received byte, but still need to finish the rest of the (inline) S-record processing, which sorta negates the entire performance concept above

As for lazy programmers.... we just call it "code reuse"

floobydust · Post by **floobydust** » Tue Nov 10, 2015 11:38 pm

I opted to go back and examine the code running my 65C02 board setup to determine available processing bandwidth. I've written a small extendable BIOS which supports a 65C51 as a console and a 65C22 for timer services and simple port I/O. I have a ROM based routine which front ends the IRQ vector. This saves registers on the stack, determines if a BRK or actual IRQ occurred and jumps to a soft vector on Page $03 to handle the actual interrupt. There's also a ROM routine to finish up from an IRQ handler and return to normal processing. This creates a bit of overhead but the flexibility is nice to have. The 65C51 IRQ handler supports full duplex send/receive with 128 byte circular buffers, catches and flags errors and provides for RTS/CTS handshaking. The 65C22 IRQ handler supports a RTC via Timer1 with a 4ms jiffy clock and an accurate 1ms delay using Timer2. Each handler has it's own exit vector which can be used to extend pre or post the ROM handlers. The BIOS also has character in and out routines for receive and transmit which get/put data to/from the I/O buffers. This has some basic overhead which will continuously consume some number of clock cycles for the RTC handler and also for the console provided by the 65C51.

As I check the 65C22 first for interrupts, the 65C51 has some time to wait before being serviced. Between the ROM pre amd post processing and the maximum of 7 clock cycles for the 65C02 to respond to an IRQ, that accounts for 61 clock cycles of processing overhead. To get a character received from the 65C51 and placed into the buffer, that's another maximum of 63 clock cycles as it manages wrap-around of the receive buffer. Adding the time required to get a character from the buffer, which has a worst case of 52 clock cycles, the worst case scenario to receive a character via BIOS and get it into the A register (also via BIOS) adds up to 61+63+52=176 clock cycles. Note that the RTC also consumes clock cycles at a rate of 250 interrupts per second. Minimum service time is 47 clock cycles and maximum is 109 clock cycles. Note that these are additional times to the basic pre/post handlers and checking the IRQ source. Based on these numbers, a maximum number of clock cycles under the worst case scenario would be the 176 plus 109 during a single byte received for a total of 285 clock cycles. i.e., the RTC and the ACIA both required servicing within a single interrupt request.

Using this to determine available bandwidth with a 4MHz CPU clock, the following exists for some standard baud rates:

19200 = 2083 clock cycles between received characters less 285 clock cycles worst case = 1798 free clock cycles between characters
38400 = 1041 clock cycles between received characters less 285 clock cycles worst case = 756 free clock cycles between characters
57600 = 694 clock cycles between received characters less 285 clock cycles worst case = 409 free clock cycles between characters
115200 = 347 clock cycles between received characters less 285 clock cycles worst case = 62 free clock cycles between characters

As a side note, my current CRC routine can process a byte of data in 376 clock cycles worst case if done inline with character receive. This would limit the baud rate to 57600 for reliable operation for the worst case scenario. This basically shows that you'll need either a faster BIOS routine, faster CPU clock rate or a more efficient CRC routine to manage realtime processing of the CRC during character receive for higher baud rates. Moving the CRC routine to process after the frame is received negates the higher performance requirement. Using the OP's Byte Parallel routine would be on the edge every 256 days for 115200 baud rate

Snial · Post by **Snial** » Wed Nov 11, 2015 8:15 am

Quote:

Using the OP's Byte Parallel routine would be on the edge every 256 days for 115200 baud rate

Gosh-thank you for spending the time to work out whether my routine could be used!

It looks like it's fairly close. It's worth noting, the timings for my routine are based on evaluating a whole packet. If it only has calculate one byte at a time, and the value is already in A then we should be able to save 11 cycles because there's no loop and no LDA (buff),y. If any of the registers are available for general use, we can save 2 cycles by replacing the temp1 access with tax / txa. This gives us 65 cycles per byte. Getting closer

cbscpe · Post by **cbscpe** » Wed Nov 11, 2015 12:30 pm

Depends on how you look at it. When you have a second 65C22 interrupt as in the example of floobydust then the byte parallel algorithm still requires too many cycles. Also the following

floobydust wrote:

As I check the 65C22 first for interrupts, the 65C51 has some time to wait before being serviced. Between the ROM pre amd post processing and the maximum of 7 clock cycles for the 65C02 to respond to an IRQ, that accounts for 61 clock cycles of processing overhead.

is true only if this refers to the fact, that the interrupted instruction needs to finish, this can go from 1 to 7 additional cycles, and then the PC, PS is pushed and vector is read. Even tighter in this case. On the other hand if you don't have this RTC interrupt from the 65C22 during XMODEM transfers your byte parallel algorithm looks good and probably looks still good with 230400baud because the overhead for the worstcase is way more than processing a single byte with the byte parallel algorithm.

Under the condition that the byte is already in the ACCU the table driven algorithm would need only 24 cycles, but you need to save and restore the X-register if you don't already do so in your interrupt handling routine, adding a penalty of 7 cycles. But it would allow a system with a second interrupt as in floobydusts system.

So it all depends on your system what will be the best and feasible solution.

Snial · Post by **Snial** » Wed Nov 11, 2015 6:56 pm

Quote:

then the byte parallel algorithm still requires too many cycles.

And yet - you may be quite surprised to discover this, but further analysis will show there should be plenty of time for my byte parallel routine within your interrupt code at 115200 baud.

For the moment, I'll leave it for our avid readers to work it out, but tonight or tomorrow, I'll demonstrate why

-cheers Julz

cbscpe · Post by **cbscpe** » Wed Nov 11, 2015 8:14 pm

You need to give at least some hints about your assumptions. E.g. CPU Clock and type and number of other interrupt sources.

floobydust · Post by **floobydust** » Wed Nov 11, 2015 8:42 pm

For calculating required CPU bandwidth, I always go with the worst case scenario, hence my inclusion of 7 clock cycles to respond to the interrupt. The same goes for the RTC interrupt handler, the worst case only occurs every 256 days, but it does occur. I would agree that in most cases, you will certainly have more available clock cycles to dedicate to the inline CRC checking, Overall, it will likely work even without enough free clock cycles as the interrupt handler will buffer up to 128 bytes regardless of the CRC routine. The transfer will also stop until an ACK is received. This pause in transmission will give the CRC code a chance to catch up and finish the frame, compare the checksum to the received one and then an ACK can be sent to start the next frame.

Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Timing is everything....

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16

Re: Byte Parallel Crc16