Software engineers generally choose table-driven CRC16 algorithms through laziness rather than for performance reasons. They just pick the first thing they see on the net and copy it, because "it works".
But let's look at what the table-driven algorithm gives us, starting with the core part of the 'C' version:
Code:
const uint8_t *c = c_ptr;
while (len--)
crc = (crc << 8) ^ crctable[((crc >> 8) ^ *c++)];
return crc;
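The crctable referenced here is a 256-entry table of 16-bit remainders; the excerpt doesn't show how it's built or which polynomial is used. A minimal sketch, assuming the common CRC-16/CCITT polynomial 0x1021 (the function name is illustrative):
Code:
#include <stdint.h>

/* 256-entry table of 16-bit CRC remainders, one per possible index byte.
   Polynomial 0x1021 (CRC-16/CCITT) is assumed here. */
uint16_t crctable[256];

void crc16_build_table(void)
{
    for (int i = 0; i < 256; i++) {
        uint16_t r = (uint16_t)i << 8;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 0x8000) ? (uint16_t)((r << 1) ^ 0x1021)
                             : (uint16_t)(r << 1);
        crctable[i] = r;
    }
}
At 256 entries x 2 bytes, that's the 0.5KB the table-driven approach has to pay for.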
In 6502 code, that inner loop is:
Code:
;crcTableLo/crcTableHi are zero-page pointers whose high bytes are
;preset to the page-aligned lo/hi halves of the CRC table
	ldx #0             ;X stays 0 for the (zp,x) table lookups
Crc16Loop:
	lda (buff),y       ;6c  next data byte
	eor crc16+1        ;3c  XOR with CRC high byte -> table index
	sta crcTableLo     ;3c  index becomes the low byte of the
	sta crcTableHi     ;3c  lo- and hi-table pointers (both 3c)
	lda (crcTableHi,x) ;6c  high-byte table entry
	eor crc16          ;3c  XOR with old CRC low byte -> new high byte
	sta crc16+1        ;3c
	lda (crcTableLo,x) ;6c  low-byte table entry -> new low byte
	sta crc16          ;3c
	iny                ;2c
	bne Crc16Loop      ;3c
Total 53 cycles per byte: just 42% faster than the byte-parallel version, for a routine that needs 760% more space.
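As an aside on where that space goes: the 6502 routine above assumes the 16-bit table has been split into two page-aligned 256-byte arrays (low bytes and high bytes), with the high bytes of the crcTableLo/crcTableHi zero-page pointers preset to those pages. A sketch of that split in C, reusing the crctable from the earlier sketch (array names are illustrative):
Code:
#include <stdint.h>

extern uint16_t crctable[256];    /* from the table-building sketch above */

/* Two page-aligned 256-byte halves of the table, so the 6502 code can drop
   the index byte straight into the low byte of a zero-page pointer. */
uint8_t crcTableLoBytes[256];
uint8_t crcTableHiBytes[256];

void crc16_split_table(void)
{
    for (int i = 0; i < 256; i++) {
        crcTableLoBytes[i] = (uint8_t)(crctable[i] & 0xFF);
        crcTableHiBytes[i] = (uint8_t)(crctable[i] >> 8);
    }
}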
So this illustrates the trade-offs we make in software. On a 6502 we can choose: a serial (bit-at-a-time) version that uses 47 bytes and averages (15 + 13/2)*8 + 36 = 208 cycles per byte; a byte-parallel routine that's 36% longer and over twice as fast; or a table-driven one that's 11.7x longer and 2.8x faster.
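The byte-parallel routine itself isn't shown in this excerpt; for context, a common tableless byte-at-a-time formulation of CRC-16/CCITT in C looks something like this (again assuming the 0x1021 polynomial; name and signature are illustrative):
Code:
#include <stdint.h>
#include <stddef.h>

/* Byte-parallel (tableless) CRC-16/CCITT. Each iteration does the work of
   one table lookup arithmetically, using the x ^= x >> 4 folding identity. */
uint16_t crc16_bytewise(uint16_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    while (len--) {
        uint8_t x = (crc >> 8) ^ *p++;   /* same index byte as the table version */
        x ^= x >> 4;
        crc = (crc << 8) ^ ((uint16_t)x << 12) ^ ((uint16_t)x << 5) ^ x;
    }
    return crc;
}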
So you don't need much of a reason to choose the byte-parallel version, since it's only 16 bytes longer; the only reason to choose the table-driven algorithm is when an additional 42% of performance is more critical than 0.5KB of RAM.
But if that's the case, wouldn't it make more sense to use a more powerful CPU or MCU? In the MCU case memory is usually at a premium, which leaves powerful CPUs; and if those rely on caches, they won't gain much from table-driven methods anyway.
And given what you say about really powerful CPUs having instructions to support CRCs, doesn't that swing things in favour of computational algorithms again?
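As an aside on that: the dedicated CRC instructions on current CPUs are generally tied to fixed polynomials rather than CRC16. SSE4.2, for instance, provides CRC-32C via the _mm_crc32_u8 intrinsic; a sketch (this is CRC-32C, not CRC16, so it illustrates the hardware-assisted approach rather than being a drop-in replacement):
Code:
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   /* SSE4.2: compile with -msse4.2 */

/* CRC-32C over a buffer using the SSE4.2 CRC32 instruction, one byte at a
   time (wider 32/64-bit variants exist; this keeps the sketch minimal). */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    crc = ~crc;                      /* conventional CRC-32C pre-inversion */
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;                     /* and post-inversion */
}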
Case closed, I'd say.