Firstly, I would have suggested a smaller problem if I knew the full scope of binary matrix transpose. Secondly, if you want something which will only be used occasionally and is easy to understand then use BitWise's implementation and don't look any further.
BitWise on Sun 27 Dec 2020 wrote:
Isn't it just something like this?
When I first encountered this problem, my inclination was to write the unrolled version of your implementation in MC68000. Your solution is, by far, the most concise, legible and easily maintained version that we are likely to see. Furthermore, it is ideally suited to be trimmed down for the subproblems. However, if all bits are passed through the carry flag then it is definitely not the fastest algorithm for the 8*8 case.
barrym95838 on Sun 27 Dec 2020 wrote:
EOR
You've given me an idea for implemenation. It is possible to swap variables without using a temporary variable by using
EOR swap. Many of the explanations for EOR swap are quite baffling. Therefore, I have drawn a Boolean diagram and truth table. This may or may not be helpful when satisfying yourself that EOR swap performs as expected. It may be useful to extend EOR swap to a bit matrix transpose by introducing bit masks, bit shifts and/or LUTs. (Apologies in advance for use of C notation. This directed at barrym95838 as pseudo-code for conversion into 6502 assembly.) The trivial EOR swap is:-
Code:
p ^= q;
q ^= p;
p ^= q;
With nybble swap LUT, this would be:-
Code:
p ^= lut4[q];
q ^= lut4[p];
p ^= lut4[q];
Unfortunately, this performs something slightly different to the desired algorithm. Specifically, this concurrently exchanges two variables while nybble swapping both variables. This can be adapted by using bit masks. One line of the 4*4 swap can be implemented as:-
Code:
p[x] ^= lut4[p[x+4] & 0x0F];
p[x+4] ^= lut4[p[x] & 0xF0];
p[x] ^= lut4[p[x+4] & 0x0F];
where the LUT swaps pairs of 4 bits, the inner array index and logical operation may be implemented using AND zp,X and the outer array index and logical operation may be implemented using EOR abs,Y. It may be desirable to substitute lut4 with bit shift operations but it is probably faster to use LUTs for the 2*2 and 1*1 cases. This solution exceeds 512 bytes and may exceed 256 clock cycles. It is probably less than 2x faster than BitWise's solution yet more than 16x larger. Regardless, it provides an example of why compilers fail to produce sensible output. Unaless the compiler's input is phrased in a very specific manner - with fore-knowledge of the target architecture - the compiler cannot apply a sensible set of transformations and optimizations.
barrym95838 on Sun 27 Dec 2020 wrote:
t = (x ^ (x >> 7)) & 0x00AA00AA;
I suspect the seven bit shift is derived from Donald Knuth's re-occurring binary fraction technique. If so, I sympathize with your bafflement.
I tried the suggested search and found a
similar question for PIC where someone suggested that it was a pointless codegolf exercise. Worse, many responded with Intel AVX512 "solutions". I also found a header in the FastLED library which is typically used to control WS2812 digital LEDs and similar devices which use incompatible serial formats. Anyhow, people are definitely running binary matrix transpose on 8 bit systems and people are definitely using it for practical tasks, such as light control.