AndrewP wrote:
I'm afraid I cannot work out how to parallelise the adders. Doesn't the first addition have to fully happen before the second can start using it? It would be great if that's possible, as looking at the full blitter cycle I have a lot of time used already and a lot fewer nanoseconds to play with than I thought.
The first set of adders is computing A+B, where A is a constant and B is the accumulated value. The second set is then computing (A+B)+C, where C is another constant. Addition is associative, so that is the same as (A+C)+B, and A+C is also constant: it could be provided by the CPU instead of DEST_WIDTH (or calculated by the circuit with another set of adders). Both sets of adders then take B directly as an input, so the second no longer has to wait for the first.
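To convince yourself the two arrangements give the same results, here is a small software model. It's only a sketch: the 16-bit width (MASK) and the function names are my assumptions, not anything from your circuit.

```python
MASK = 0xFFFF  # assumed 16-bit adder width; not stated in the thread

def chained(a, b, c):
    """Second adder uses the first adder's output: two delays in series."""
    first = (a + b) & MASK
    second = (first + c) & MASK
    return first, second

def parallel(a, a_plus_c, b):
    """Both adders take B directly, so they can settle at the same time."""
    first = (a + b) & MASK
    second = (a_plus_c + b) & MASK
    return first, second

# A and C are constants, so A+C is computed once up front (e.g. by the CPU).
a, c = 320, 7
a_plus_c = (a + c) & MASK
for b in (0, 1, 1000, 0xFFFF):
    assert chained(a, b, c) == parallel(a, a_plus_c, b)
```

The results still match when the sums wrap around, because addition modulo a power of two is associative too.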
Quote:
I've been so focused on stretching smaller to larger that I hadn't thought about shrinking. I'm glad you pointed out that run-slice is way more optimal there, thanks. Mip-maps would help (maybe to the point that I don't need run-slice) but there's also a palette involved so it's a tad trickier but quite solvable. Why would I not want run-slice? I have an (unfounded) suspicion I could use the Bresenham line ICs backwards with minimal effort to shrink a sprite and I really would like to cut down the number of ICs I'm using. Gah, trade-offs.
I thought about this before, especially that maybe we can just swap the inputs to the calculator and swap the meanings of the outputs. The problem is that it will still iterate over all of the source pixels, even if it doesn't need to use them (because they get overwritten at the destination), and that's fundamental if the source address is tracked by a counter. If you want to skip source pixels then you need to be able to add other values to it instead of just 1. So there will be a hardware change needed to the way the source address is stored/updated.
Which values do we add? Well, in normal operation for a stretch, you add 1 to the destination address every cycle, and either 0 or 1 to the source address, depending on whether the error term overflows. For a squash, you can do the same thing, incrementing the destination every cycle, but add N or N+1 to the source address instead of 0 or 1.
I think N is (SOURCE_WIDTH-1)/(DEST_WIDTH-1) rounded down, and you then need to supply R = (SOURCE_WIDTH-1) % (DEST_WIDTH-1) + 1 as input to the Bresenham part of the circuit instead of the full source width. I'm not sure about the "-1" bits, it feels weird but seems necessary, a bit like how you have to subtract one from the source width for the stretching calculation as well.
So given 17 source pixels to squash into 4 destination pixels, N would be 5 and R would be 2. So it's like stretching 2 source pixels over the 4 destination pixels, except after every pixel written we also increase the source address by N=5.
Here's a comparison between stretching 2 pixels over 4, and squashing 17 pixels into 4, to illustrate that - note that the squashing source addresses are just incrementing by an extra 5 each step:
Code:
Stretching 2 pixels over 4: s0 => d0, s0 => d1, s1 => d2, s1 => d3.
Squashing 17 pixels into 4: s0 => d0, s5 => d1, s11 => d2, s16 => d3.
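If it helps, here is a rough software model of that scheme. It is a sketch under my own assumptions: the function names are made up, the error term is modelled as a round-to-nearest accumulator (which is what reproduces the two sequences above, but may not match how your Bresenham ICs are wired), and DEST_WIDTH must be at least 2.

```python
def squash_params(source_width, dest_width):
    """N and R as described above (the "-1"s are the post's own guess)."""
    n = (source_width - 1) // (dest_width - 1)
    r = (source_width - 1) % (dest_width - 1) + 1
    return n, r

def source_indices(source_width, dest_width):
    """Source index read for each destination pixel, in order."""
    n, r = squash_params(source_width, dest_width)
    dx = dest_width - 1   # destination steps
    dy = r - 1            # remainder spread over those steps
    acc = dx              # start half a unit up so we round to nearest
    src = 0
    out = [src]
    for _ in range(dx):
        acc += 2 * dy
        carry = 0
        if acc >= 2 * dx:     # error term overflowed
            acc -= 2 * dx
            carry = 1
        src += n + carry      # add N or N+1 (0 or 1 when stretching)
        out.append(src)
    return out

print(squash_params(17, 4))    # N and R for the squash example
print(source_indices(2, 4))    # stretching 2 pixels over 4
print(source_indices(17, 4))   # squashing 17 pixels into 4
```

Note that the same routine covers stretching: there N works out to 0, so the source address only ever moves by 0 or 1.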