6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Apr 25, 2024 9:25 am

All times are UTC




Post new topic Reply to topic  [ 10 posts ] 
Author Message
PostPosted: Mon Dec 21, 2020 1:43 pm 
Offline
User avatar

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I've encountered a gnarly problem repeatedly over many years. I don't have any immediate requirement for a solution on 6502 but it is likely that someone may require it sooner or later. Indeed, with a solution to hand, you may find immediate use for it.

The problem is a binary or bit matrix transpose. Starting with a square grid of 8*8 bits, flip the grid along one diagonal. Another explanation would be to take eight bytes of data and transform it such that all of the top bits are held in one byte, all of the next most significant bits are held in the next byte, all the way down to the least most significant bits from across the eight bytes grouped into the last byte. This problem occurs in numerous unrelated places. It occurs in audio and video compression. It was also used on the the Amiga port of Doom so that the chunky textures could be displayed on the planar video system of a 32 bit Amiga's AGA chipset. It can also be used to bit-bang synchronous serial streams. This could be for any purpose from MicroSD RAID, serial DACs to WS2812 digital RGB LEDs. There are related subproblems, such as 6*8 bit and 6*2 bit transpose, for example, to drive six 18 bit serial DACs.

There are places where bit transpose is not useful. For example, I thought that it would be a really great idea to save energy and incorporate an 8*8 bit transpose into a bit-banged, interrupt driven, eight port network switch. However, after realizing that it required a huge number of clock cycles every eight interrupts, I dropped it and used the simple method.

I have a good solution in C which is suitable for processor architectures with fairly symmetric registers. This is divided into separate cases for architectures with 64 bit registers (an absolute joy), 32 bit registers, 16 bit registers (should work really well on RCA1802 and TMS9900) and 8 bit registers. Unfortunately, it took me three days to get anything sane out of avr-gcc and I am concerned that any routine with, for example, char foo[7] will generate a huge amount of unnecessary 16 bit addition on 6502. C compilation on 65816 might be sane but I am not hopeful.

Starting with eight contiguous bytes in zero page, it is probably easiest to overwrite data. The dumbest technique to implement 8*8 binary transpose is 1-8 obtuse sequences of ROL zp instructions each ending with a fixup for the last bit. However, this requires more than 128 bytes, more than 256 clock cycles and moves everything through the carry bit. It is definitely possible to write a faster implementation. For 2^2N matrix bits (2^N along each side), it is possible to transpose in N steps. This is a fractal Battenberg cake. For 8*8 binary transpose, the first (or last) step is to swap only two opposite quadrants of 4*4 bits. This may or may not be aided with the six instruction nybble swap sequence. For 2*2 block swaps and 1*1 block swaps, shift operations should handle a minimum of four useful bits.

It may or may not be fastest to handle 4 bit, 2 bit and 1 bit left and right shift operations using LUTs [Look-Up Tables] because it allows masking and shifting to be handled in one step. Although it initially appears to require six pages of tables, left shifts only require 128, 64 and 16 table entries. It is possible for two of these tables to share one page by applying AND immediate or OR immediate to some table inputs while four bit left shift may be combined with nybble swap or implemented the hard way. This reduces tables to a maximum of four pages - and it may be possible to put the transpose algorithm in the 64 byte hole between the 128 and 64 entry tables.

Anyhow, if your solution requires more than 1KB or 256 clock cycles, there is definitely a smaller or faster solution. However, it is possible that the fastest solution exceeds 1.5KB.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


Top
 Profile  
Reply with quote  
PostPosted: Mon Dec 21, 2020 5:28 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
I might benefit from this transposing method when printing non-proportional fonts to the graphics screen if it is quite a bit faster than the ROL method you mention.

Do you have any sample code?


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 27, 2020 12:40 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
Isn't it just something like this?
Code:
                                             .page0
                                             .org    $20

00:0020  0000000000000000  : DATA            .space  8
00:0028  0000000000000000  : TRNS            .space  8

                                             .code
                                             .org    $300

                             Transpose
00:0300  A208              :                 ldx     #8
                                             repeat
00:0302  2620              :                  rol    <DATA+0
00:0304  2A                :                  rol    a
00:0305  2621              :                  rol    <DATA+1
00:0307  2A                :                  rol    a
00:0308  2622              :                  rol    <DATA+2
00:030A  2A                :                  rol    a
00:030B  2623              :                  rol    <DATA+3
00:030D  2A                :                  rol    a
00:030E  2624              :                  rol    <DATA+4
00:0310  2A                :                  rol    a
00:0311  2625              :                  rol    <DATA+5
00:0313  2A                :                  rol    a
00:0314  2626              :                  rol    <DATA+6
00:0316  2A                :                  rol    a
00:0317  2627              :                  rol    <DATA+7
00:0319  2A                :                  rol    a
00:031A  9528              :                  sta    <TRNS,x
00:031C  CA                :                  dex
00:031D  D0E3              :                 until eq
00:031F  60                :                 rts

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 27, 2020 1:08 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
Opps small out by one error in the output area.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 27, 2020 5:32 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1925
Location: Sacramento, CA, USA
There's a method that uses a bunch of 3-cycle EORs instead of a bunch of 5-cycle ROLs. I found the 32-bit C source, attributed to Henry S. Warren Jr., but I hesitate to post it here due to potential copyright issues. If I can successfully translate it to optimized 6502 assembly, I won't hestitate to post my version, but I'm having difficulty grokking the algorithm right now, and I have a bunch of other stuff on my plate. I won't be the least bit offended if someone else beats me to the punch.

Put:
t = (x ^ (x >> 7)) & 0x00AA00AA;
into your favorite search engine if you feel like giving it a go (it'll certainly be larger than Bitwise's 32-byte effort, and it's too early for me to tell if it has much chance of being any faster than 527 cycles).

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 15, 2021 11:32 am 
Offline
User avatar

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
Firstly, I would have suggested a smaller problem if I knew the full scope of binary matrix transpose. Secondly, if you want something which will only be used occasionally and is easy to understand then use BitWise's implementation and don't look any further.

BitWise on Sun 27 Dec 2020 wrote:
Isn't it just something like this?


When I first encountered this problem, my inclination was to write the unrolled version of your implementation in MC68000. Your solution is, by far, the most concise, legible and easily maintained version that we are likely to see. Furthermore, it is ideally suited to be trimmed down for the subproblems. However, if all bits are passed through the carry flag then it is definitely not the fastest algorithm for the 8*8 case.

barrym95838 on Sun 27 Dec 2020 wrote:
EOR


You've given me an idea for implemenation. It is possible to swap variables without using a temporary variable by using EOR swap. Many of the explanations for EOR swap are quite baffling. Therefore, I have drawn a Boolean diagram and truth table. This may or may not be helpful when satisfying yourself that EOR swap performs as expected. It may be useful to extend EOR swap to a bit matrix transpose by introducing bit masks, bit shifts and/or LUTs. (Apologies in advance for use of C notation. This directed at barrym95838 as pseudo-code for conversion into 6502 assembly.) The trivial EOR swap is:-

Code:
p ^= q;
q ^= p;
p ^= q;


With nybble swap LUT, this would be:-

Code:
p ^= lut4[q];
q ^= lut4[p];
p ^= lut4[q];


Unfortunately, this performs something slightly different to the desired algorithm. Specifically, this concurrently exchanges two variables while nybble swapping both variables. This can be adapted by using bit masks. One line of the 4*4 swap can be implemented as:-

Code:
p[x] ^= lut4[p[x+4] & 0x0F];
p[x+4] ^= lut4[p[x] & 0xF0];
p[x] ^= lut4[p[x+4] & 0x0F];


where the LUT swaps pairs of 4 bits, the inner array index and logical operation may be implemented using AND zp,X and the outer array index and logical operation may be implemented using EOR abs,Y. It may be desirable to substitute lut4 with bit shift operations but it is probably faster to use LUTs for the 2*2 and 1*1 cases. This solution exceeds 512 bytes and may exceed 256 clock cycles. It is probably less than 2x faster than BitWise's solution yet more than 16x larger. Regardless, it provides an example of why compilers fail to produce sensible output. Unaless the compiler's input is phrased in a very specific manner - with fore-knowledge of the target architecture - the compiler cannot apply a sensible set of transformations and optimizations.

barrym95838 on Sun 27 Dec 2020 wrote:
t = (x ^ (x >> 7)) & 0x00AA00AA;


I suspect the seven bit shift is derived from Donald Knuth's re-occurring binary fraction technique. If so, I sympathize with your bafflement.

I tried the suggested search and found a similar question for PIC where someone suggested that it was a pointless codegolf exercise. Worse, many responded with Intel AVX512 "solutions". I also found a header in the FastLED library which is typically used to control WS2812 digital LEDs and similar devices which use incompatible serial formats. Anyhow, people are definitely running binary matrix transpose on 8 bit systems and people are definitely using it for practical tasks, such as light control.


Attachments:
boolean-eor-swap0-0-1.pdf [99.46 KiB]
Downloaded 47 times
boolean-eor-swap0-0-0.odg [6.17 KiB]
Downloaded 44 times

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!
Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 16, 2021 7:57 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1925
Location: Sacramento, CA, USA
I took a brief swipe at translating the C to 6502 assembly, but gave up after about 15 minutes, when it became obvious that the attempt couldn't match Bitwise for size or cycle count, no matter which way I tried to slice it.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 16, 2021 2:24 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
barrym95838 wrote:
I took a brief swipe at translating the C to 6502 assembly, but gave up after about 15 minutes, when it became obvious that the attempt couldn't match Bitwise for size or cycle count, no matter which way I tried to slice it.

Good! I didn't think there was a faster way without creating huge tables, but still you made me doubt myself at writing effiecient code. :?


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 28, 2021 1:09 pm 
Offline
User avatar

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
EOR swap is a failure.
Code:
p[x] ^= lut4[p[x+4] & 0x0F];

becomes
Code:
LDA p+4,X
AND #$0F
TAY
LDA p,X
EOR lut4,Y
STA p,X

and
Code:
p[x+4] ^= lut4[p[x] & 0xF0];

becomes
Code:
LDA p,X
AND #$F0
TAY
LDA p+4,X
EOR lut4,Y
STA p+4,X

There are a few opportunities to optimize but the half byte swap takes about 41 clock cycles. Four sets are required at three different scales. There are minor opportunities to optimize but it requires something like 768 bytes of LUT, 360 bytes of program and 492 clock cycles or a little less. It might be a marginal speed improvement compared to the binary fraction technique but it is likely to be far larger.

Sheep64 on Fri 15 Jan 2021 wrote:
if all bits are passed through the carry flag then it is definitely not the fastest algorithm for the 8*8 case.


I was completely wrong about that. Despite the use of read-modify-write instructions and the possibility of dead cycles, passing all data through the carry flag and assembling rows of bits in the accumulator appears to be the fastest method by a wide margin. This is particularly true for smaller transpose operations. Assuming no dead cycles, BitWise's implementation requires 32 bytes and far less than 400 clock cycles.

With two extra instructions, it is possible to make BitWise's implementation perform two simultaneous matrix transpose operations. Specifically, the loop may begin with LDA <TRNS,x // ROL A and end with STA <TRNS,x // DEX. This disperses bits from one matrix while collating another. This doubles the efficiency of the rotate instructions and typically requires less than four clock cycles per bit transposed. The overhead of the tweaked algorithm is minimal to the extent that it may be preferable to only provide a double transpose operation. Even with this impediment, the tweaked algorithm is approximately three times faster than other suggestions and is a further demonstration of compiler inefficiency. This is especially true when an additional entry point allows simultaneous 8*X and X*8 transpose where X<=8.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 28, 2021 4:22 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
Sheep64 wrote:
With two extra instructions, it is possible to make BitWise's implementation perform two simultaneous matrix transpose operations. Specifically, the loop may begin with LDA <TRNS,x // ROL A and end with STA <TRNS,x // DEX. This disperses bits from one matrix while collating another. This doubles the efficiency of the rotate instructions and typically requires less than four clock cycles per bit transposed. The overhead of the tweaked algorithm is minimal to the extent that it may be preferable to only provide a double transpose operation. Even with this impediment, the tweaked algorithm is approximately three times faster than other suggestions and is a further demonstration of compiler inefficiency. This is especially true when an additional entry point allows simultaneous 8*X and X*8 transpose where X<=8.

This sounds good on paper but doesn't translate to code in some cases. It sounds like you are trying to use the Accumulator for both the DATA byte and the TRANSPOSE byte or transpose the horizontal and vertical axis in one pass. This might work if only transposing full bytes and not bits. This basically has a rotating square effect.

LDA DATA+0
STA TEMP
LDA DATA+8
STA DATA+0
LDA DATA+64
STA DATA+8
LDA DATA+56
STA DATA+64
LDA TEMP
STA DATA+56

Also comes to mind that speed may not be essential in some cases as once the data has been transposed, it can be saved as data that can take the place of the non-transposed data that is getting loaded each time and being transposed every time. Making proportional fonts out of bitmap fonts may fall in this category.

Rotating the graphics screen or window can be another use. But the question becomes, how wide is the pixel (in bits) if we want to preserve color?

Transposing 1-bit, 4-bits, 8-bits ..etc would require slight differences to the code.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 10 posts ] 

All times are UTC


Who is online

Users browsing this forum: drogon and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: