6502.org • View topic - Worlds Worst Videocard BadApple Demo. I want more FPS!

All times are UTC

Worlds Worst Videocard BadApple Demo. I want more FPS!

gfoot

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sat Dec 09, 2023 3:13 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

NormalLuser wrote:

Though I now have to find a good way to deal with the 10 mandatory CRC bytes that SD cards barf out every 512 bytes. I currently count every 2 bytes read so that I can use an 8-bit counter, and then do an unrolled read of 10 bytes when it rolls over to zero. That's bad enough.
I don't want to count every bit.... Ouch... I'll have to think on how to handle that....

For this I was wondering about connecting PB6 up to the SD clock signal, and putting Timer 2 into the mode where it counts pulses on PB6 - then after the right number of pulses (512*8 I guess) you'd get an interrupt, and you could make the interrupt handler discard the next 80 bits, and reset Timer 2 to count the next 512 bytes, before returning. This way your mainline code doesn't need to count the bits at all, and you can treat it as a true bitstream.
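As a rough, untested sketch of the setup for that: the register addresses below assume the VIA sits at $6000 as on the stock Ben Eater board, the label names are made up for illustration, and PB6 is assumed to be configured as an input and wired to the SD clock. It's worth checking the 6522 datasheet for exactly which pulse raises the interrupt, as that decides whether the count should be 4096 or one less.

```asm
VIA_T2CL = $6008      ; Timer 2 counter/latch low (assumed VIA base $6000)
VIA_T2CH = $6009      ; Timer 2 counter high
VIA_ACR  = $600B      ; auxiliary control register
VIA_IER  = $600E      ; interrupt enable register

SetupBitCounter:
  LDA VIA_ACR
  ORA #%00100000      ; ACR bit 5 = 1: Timer 2 counts pulses on PB6
  STA VIA_ACR
  LDA #%10100000      ; bit 7 = "set", bit 5 = Timer 2 interrupt enable
  STA VIA_IER
                      ; falls through to arm the counter for the first block
ArmBitCounter:
  LDA #$00            ; low byte of 4096 = 512 * 8 clock pulses
  STA VIA_T2CL
  LDA #$10            ; high byte of 4096; this write starts the count
  STA VIA_T2CH        ; and clears any pending Timer 2 interrupt flag
  RTS
```

The interrupt handler would then discard the bits between blocks and JSR ArmBitCounter again before its RTI.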

NormalLuser wrote:

The below code takes something between 37 and 43 cycles a pixel/byte to fill the screen and I must encode the skips on the edge of the screen.
I'm sure there must be a faster way to do this:

Yes there are some decent optimisations to be had.

First, the way you're incrementing Screen is a bit clumsy - consider these changes:

Code:

  ;I don't like the screen inc code....
  ;seems like room for improvement?
  LDA Screen              ; remove this
  CMP #$FF                ; remove this, we'll just increment it and see if it hits zero
  BEQ .P1IncTop           ; remove this
  INC Screen
  BRA .P1DONE             ; change to BNE
.P1IncTop:
  INC ScreenH
  LDA #$00                ; remove this
  STA Screen              ; remove this

Next, you're keeping the Y register zero, but incrementing your pointer - it is more efficient to leave the low byte of the pointer zero, and increment Y instead using INY. I won't show the code for that change but it'll save a fair few cycles.

Next, this:

Code:

  LDA ScreenH
  CMP #$40
  BEQ .P1RstTop
  BRA .P1DONE
.P1RstTop:

Here you're comparing against #$40, but as ScreenH is only incremented, it's sufficient to just test that one bit, we don't need an exact comparison. This bit is easily tested using the BIT instruction, which sets the overflow flag based on bit 6 of the memory location read. The way you're branching is a bit backwards too. So replace the above five lines with this:

Code:

    BIT ScreenH
    BVC P1DONE    ; branch if bit 6 was not set yet

Now this:

Code:

  LDA #$20
  STA ScreenH

As ScreenH was $40 previously, you could just "LSR ScreenH" here.

And again here you're doing the branches backwards:

Code:

  BEQ .RLEDone
  BRA .RLETop
.RLEDone:

Why not just use a single "BNE RLETop" on its own? This trick with inverting jumps is only necessary if the branch target is out of range.
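Putting those pieces together, the pointer increment might end up looking something like this. It's only a sketch using the labels from the snippets above, and it relies on ScreenH wrapping from $40 back to $20 as in the original code:

```asm
  INC Screen          ; bump the low byte of the pointer
  BNE .P1DONE         ; it did not wrap to zero, so the high byte is unchanged
  INC ScreenH         ; low byte wrapped: carry into the high byte
  BIT ScreenH         ; copies bit 6 of ScreenH into the V flag
  BVC .P1DONE         ; bit 6 clear: still below $40, nothing more to do
  LSR ScreenH         ; ScreenH is $40 here, so halving it gives $20
.P1DONE:
```

In the common case only the first two instructions execute.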

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sat Dec 09, 2023 8:54 pm

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

gfoot wrote:

For this I was wondering about connecting PB6 up to the SD clock signal, and putting Timer 2 into the mode where it counts pulses on PB6

That is an awesome idea!
Unfortunately I need that timer for square wave music. Because I'm clocked at 5 MHz I need to use both timer 1 and timer 2 together to keep the sound frequency in the audible range.

When I finally do digitized music, either via a FIFO on the VIA or by interleaving the VGA and CPU clock allowing me to just bitbang digital music out a VIA pin or 4 (since VGA halts the CPU 72% of the time currently), I will look at using the pulse counter to speed up the SD read further.
I'm already getting so much of a better SD read speed than I thought. Thanks so much for all the help on that!

And yes, that code is indeed so clumsy. Thanks for the help!
I look forward to cleaning it up with your suggestions.
Some of the badness is a result of the iterative process I had in making it..
As well as the fact that 'The avoidance of premature optimization led to less than ideal outcomes. '

IE, I hacked together random junk that 'worked' and am now picking up the pieces.

Oh btw, I really like that bit with the 'bit'!

Code:

    
    BIT ScreenH
    BVC P1DONE    ; branch if bit 6 was not set yet

I've never had a good reason to use BIT yet like this, and now I get to shave something like 8 CPU cycles a drawn pixel using it!
Thanks again!

gfoot

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 12:27 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

NormalLuser wrote:

Some of the badness is a result of the iterative process I had in making it..
As well as the fact that 'The avoidance of premature optimization led to less than ideal outcomes. '

IE, I hacked together random junk that 'worked' and am now picking up the pieces.

It's called "technical debt", as over time you tend to end up "owing" more time to fix it than it would have cost if you'd fixed it sooner. The general idea is, it's fine to borrow, but you have to pay it back sooner rather than later to avoid being overwhelmed.

That's not to say you should optimise everything like crazy, but writing simpler code is always worth it, that much is never premature.

barnacle

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 8:53 am

Joined: Mon Jan 19, 2004 12:49 pm
Posts: 660
Location: Potsdam, DE

NormalLuser wrote:

Without having looked at your design... if your audio is out of normal hearing, is it possible to stick a divider on the audio output? Something like a *393 would give you eight octave outputs... (I'm kinda assuming that you are selecting a frequency and need one timer as a pre-divide.)

Neil

BigEd

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 10:06 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Just a little surprised by "the 10 mandatory CRC bytes that SD cards barf out every 512 bytes" - CRCs are usually, and in this case, surely just two bytes? Or is it more that at the end of one sector you're counting overhead of accessing the next sector?

ref: http://www.rjhcoding.com/avrc-sd-interface-4.php

gfoot

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 11:22 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

Good point. I believe he is using CMD18 to read multiple blocks. According to http://elm-chan.org/docs/mmc/mmc_e.html, each block is preceded by a data token ($FE) and followed by a two-byte CRC. In between, it looks like the card may idle (reading as $FF), but the duration might not be the same for different cards, unless that is specified somewhere. So you ought to read the 2-byte CRC and then keep reading until you get $FE (or just until the next zero bit?) rather than always reading 10 bytes.
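Something along these lines, as a minimal sketch. SpiReadByte is a hypothetical routine standing in for whatever byte read is already in place, assumed to return the next byte from the card in A. Error tokens and a timeout are left out, so a misbehaving card would hang this loop:

```asm
SkipToNextBlock:
  JSR SpiReadByte     ; discard the first CRC byte
  JSR SpiReadByte     ; discard the second CRC byte
.WaitToken:
  JSR SpiReadByte     ; idle bytes read back as $FF
  CMP #$FE            ; data token marks the start of the next block
  BNE .WaitToken
  RTS                 ; the next byte read is the first data byte
```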

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 6:20 pm

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

barnacle wrote:

Yea' I totally thought about using a flip-flop or counter to divide that down, but at this stage I am working on a 'Stock' demo for the Ben Eater kits.
So right now I'm just using PB7 for audio since that is as simple as a resistor, capacitor and speaker.
Other than that, extra bypass capacitors and better ground/power distribution, the only hardware change from stock is 1 jumper wire from VGA Vsync to the NMI CPU pin, and 1 jumper from the 5 MHz VGA counter output to the CPU clock in.

For the moment I'm locked with this hardware setup.

There will be a 'phase 2' demo in the future where I do digital audio and increase performance by fiddling with the hardware a bit more.
As a FYI, I'm trying to push a (mostly) 'stock' Ben Eater setup just as far as I can to develop the software and workflow I need while also maybe contributing something for others getting into 6502/assembly.
My eventual long term goal is a 6502 based 'Arcade Board' PCB that will live inside a full size arcade cabinet running a game that I created.
Other than bitmapped graphics and a 6502 it may not have much else in common with my current build.
So any hardware suggestions will be appreciated and filed for longer term reference!

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Sun Dec 10, 2023 6:33 pm

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

gfoot wrote:

I'm only working with Class 10 cards and they always toss out 10 bytes every 512 byte block.
I think.. And I am new at this, that Class 10 are forced at 512 byte block and a 32 bit CRC. Everyone says that if you turn off CRC it just ignores you and sends it anyway in multi-block transfer mode.
Not sure if that 32 bit CRC is transmitted with extra buffer bytes, or if it is ASCII HEX format or what the deal is?
All I know is that if I toss 10 bytes every 512 my data transfers are perfect.

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Mon Dec 11, 2023 12:32 am

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

gfoot wrote:

First, the way you're incrementing Screen is a bit clumsy - consider these changes:

Oh I considered them...

Code:

.TriDone:
  LDX #$00 
  ldy #$00
.RLETop:
  DEC RLECount
  LDA PlotColor
  sta (Screen),y
  INC Screen
  BNE .P1DONE
.P1IncTop:
  INC ScreenH
  LDA ScreenH
  BVC .P1DONE
.P1RstTop:
  LSR ScreenH
.P1DONE:
  CPX RLECount
  BNE .RLETop
.RLEDone:
  dec Block_Counter
  BEQ .BLOCK
  JMP .readloop 

That code looks so much better already!
I have not done proper benchmarking but visually I can tell that the slow frames are much smoother!
Thanks!

gfoot wrote:

I agree, but what would be a good way to do it? Have a Y position zero page value that I store at the end of the routine and load at the beginning? I still need to roll-over that high address..
I suppose if the overhead of that load and store is less than the savings from the INY?

Anyway, thanks again!

gfoot

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Mon Dec 11, 2023 1:12 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

NormalLuser wrote:

That code looks so much better already!
I have not done proper benchmarking but visually I can tell that the slow frames are much smoother!
Thanks!

Great! A couple more points:

Code:

.TriDone:
  LDX #$00 
  ldy #$00
.RLETop:
  DEC RLECount
  LDA PlotColor     ; you should be able to preload this before RLETop, as A isn't used elsewhere in the loop
  ...
.P1IncTop:
  INC ScreenH
  LDA ScreenH       ; this needs to be "BIT" not "LDA", LDA won't set/clear V, and it also clobbers A so that we can't preload it with PlotColor outside the loop
  BVC .P1DONE

Quote:

Yes, that's the idea. INY saves three cycles compared to INC <zp>, so as long as the loop executes more than three or four times it's worth paying the overhead of:

Code:

    LDY Screen
    STZ Screen

at the top and:

Code:

STY Screen

at the bottom. But if your loop is often very short then it could be slower this way - however that in itself could be a sign that the loop isn't right for you.

Also note that with the 65C02 you can use STA (Screen) which is one cycle cheaper than STA (Screen),Y. This may actually be better overall, because then you can use Y for something else.

Moving DEC RLECount to where CPX RLECount is (and removing the CPX) will I think work fine and be cheaper. Even better, LDX RLECount at the start and DEX in the loop, for the same reason as above.
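To make the trade-off concrete, here is a sketch of the loop with both changes applied and the usual 65C02 cycle counts in the comments. The `.Fill` and `.NoWrap` labels are made up for illustration, and RLECount is assumed to be non-zero on entry (a zero would loop 256 times):

```asm
  LDX RLECount        ; 3 cycles: run length now lives in X
  LDY Screen          ; 3 cycles: low byte of the address moves into Y
  STZ Screen          ; 3 cycles: pointer low byte stays zero from here on
  LDA PlotColor       ; 3 cycles: loaded once, A is not touched in the loop
.Fill:
  STA (Screen),Y      ; 6 cycles
  INY                 ; 2 cycles, versus 5 for INC Screen
  BNE .NoWrap         ; 3 cycles when taken (same page), the usual case
  INC ScreenH         ; only reached when Y wraps to zero
  BIT ScreenH         ; V flag = bit 6 of ScreenH
  BVC .NoWrap
  LSR ScreenH         ; $40 back to $20
.NoWrap:
  DEX                 ; 2 cycles
  BNE .Fill           ; 3 cycles when taken (same page)
  STY Screen          ; 3 cycles: put the low byte back for other routines
```

By this count the common path is 16 cycles per pixel, and the extra LDY/STZ/STY cost 9 cycles per run against 3 saved per pixel, so it breaks even at runs of three.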

Another higher-level thought is that if you can synchronise the RLE lengths with the line endings then you won't need to check for Screen/ScreenH overflow between all pixels, only after an entire RLE string is completed. It will use more data where you're splitting large spans, but could make this loop much faster as the BIT/BVC wouldn't be needed. If having larger data is not viable, then you could instead compare the RLECount to $40-Screen before starting the loop, and make your loop cycle only up to the lower of these two values, using a spare register to count down through these instead of looping all the way until RLECount is zero. Now you can store long RLE spans, but still not have to deal with Screen/ScreenH wrapping inside the loop.

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Mon Dec 11, 2023 4:13 am

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

Thanks for the tips!

gfoot wrote:

Yes, that's the idea. INY saves three cycles compared to INC <zp>, so as long as the loop executes more than three or four times it's worth paying the overhead of

OK, fix that up also...

Code:

.TriDone:
  LDX #$00 
  ldy Screen
  stz Screen
  LDA PlotColor
.RLETop:
  DEC RLECount
  sta (Screen),y
  INY
  BNE .P1DONE
.P1IncTop:
  INC ScreenH
  BIT ScreenH
  BVC .P1DONE
.P1RstTop:
  LSR ScreenH
.P1DONE:
  CPX RLECount
  BNE .RLETop
.RLEDone
  STY Screen
  dec Block_Counter
  BEQ .BLOCK
  JMP .readloop
 

For objective benchmarking I took off the Vsync's and got 41 FPS on average. A 10% or 4 FPS increase from 37 I had before encoding Vsync's on 'good' frames.

Subjectively the 'inversion' frames that switch black and white now fill noticeably quicker and the spiral sun and band zoom are much more fluid.
NICE!!!

Better though?
I am thinking that if I work on my other routines I can just 'share' y register and not do any of the loads and stores for y and just iny.
If so I think that will be good for another frame or two!
I bet these enhancements plus a 7 kb read buffer will get me pretty darn close to a silky smooth frame locked 30 FPS. And I still have performance gains from bit level encoding to tap!
Thanks a bunch!

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Tue Dec 12, 2023 12:14 am

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

Now I'm up to 46.4 FPS!

gfoot wrote:

Also note that with the 65C02 you can use STA (Screen) which is one cycle cheaper than STA (Screen),Y. This may actually be better overall, because then you can use Y for something else.

Hrm... I am using a 65C02.. and I did NOT know about the STA(), but I'm not sure I have a better use for Y?

gfoot wrote:

.....
Another higher-level thought is that if you can synchronise the RLE lengths with the line endings then you won't need to check for Screen/ScreenH overflow between all pixels, only after an entire RLE string is completed. ....

Yea' this was great to have on my mind when I realized that my encoder is ALWAYS in skip mode on the edge of the screen.
So I was also able to remove any checks for that in the Tri-pixel and RLE draw routine and just do an INY instead.
Neat!

I also worked through preserving the y register and now that my code is cleaned up a bit more I'm getting 46.4 FPS when run without vsync!
(Did I mention.. 46.4 fps?)

Now I'm working through this bit level encoding idea and with the stupid 10 CRC bytes from the SD card every 512 I need to do one of two things, either go ahead and count the darn little bits, or change the way I encode entirely to make it easier to keep track of the bits out of the SD card.

Right now an easy way to pack bits with little change to my existing code would be to simply use the first bit to indicate if it is a skip or a tripixel+RLE. If it is a skip I read 8 more bits and that is my RLE, for a total of 9 bits for a skip instead of 16. If it is a TriPixel I read 6 bits for my 3 pixel 64 entry lookup table, then read 7 bits for my RLE. It is only 7 bits for RLE for this because I can't draw off-screen, so an RLE draw line is never more than 100 pixels currently. That plus the original 1 bit to decide on TriPixel or Skip puts me at 14 bits vs. 16.
There are savings to be had....
But I can't really think of a good way to do this without counting bits.
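Setting the CRC problem aside for a moment, the decode side of that packing could be sketched roughly like this. All of the names are placeholders: ReadBit is a hypothetical routine that returns the next stream bit in the carry flag and preserves X, BitAcc and TriIndex are spare zero page bytes, and DoSkip and DoTriPixel stand for the existing draw code. Which flag value means "skip" and the MSB-first bit order are assumptions too:

```asm
DecodePacket:
  JSR ReadBit         ; 1 flag bit arrives in carry
  BCS .Skip           ; assumed: 1 = skip, 0 = tripixel + RLE
  LDX #6
  JSR ReadBits        ; 6-bit index into the tripixel lookup tables
  STA TriIndex
  LDX #7
  JSR ReadBits        ; 7-bit run length, 14 bits in total
  STA RLECount
  JMP DoTriPixel
.Skip:
  LDX #8
  JSR ReadBits        ; 8-bit skip length, 9 bits in total
  STA RLECount
  JMP DoSkip

; Read X bits (1..8), most significant bit first, and return them in A.
ReadBits:
  STZ BitAcc          ; start from zero (65C02 STZ)
.Next:
  JSR ReadBit         ; next bit in carry
  ROL BitAcc          ; shift it in at the bottom
  DEX
  BNE .Next
  LDA BitAcc
  RTS
```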

The other way would be to change my encoding. I'd need some way to keep the update messages 'balanced'

I'm thinking:

With the current encoding I need to encode:

Black
White
Light Grey
Dark Grey
Skip
Vsync/Frame

I can probably get rid of the Vsync/Frame packet. Since my decoder already rolls-over the screen on its own I may as well use that to indicate a frame and read data into the buffer with whatever time there is until it is time to draw again.

But that would still leave me 5 things to encode.

So, I could do 3 bit to do the branch/lookup and then use only 5 bits to skip/RLE up to 32 pixels at a time.
It would make my 'packets' a nice little byte. I'm not sure what the tradeoff on the encode would be but it is possible that in the real world it would help. On the empty/low change frames it would cost some extra packets, but on the busy frames it might do nothing other than save cycles?
Now that I'm 'peeling the onion' that is bit level encoding I'm thinking differently as to how I would do this.
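Splitting a packet byte like that could look something like the following sketch. The bit layout (command in the top 3 bits, length minus one in the bottom 5) and the CommandTable of handler addresses are assumptions for illustration, not something that exists in my code yet:

```asm
  ; A holds one packet byte: bits 7-5 = command, bits 4-0 = length - 1
  TAX                 ; keep a copy of the whole byte
  AND #%00011111      ; isolate the 5-bit length field (0..31)
  CLC
  ADC #1              ; stored as length - 1, so the run is 1..32
  STA RLECount
  TXA                 ; recover the packet byte
  LSR
  LSR
  LSR
  LSR
  LSR                 ; A = command number 0..7
  ASL                 ; times 2: offset into a table of 16-bit addresses
  TAX
  JMP (CommandTable,X) ; 65C02 indexed indirect jump to the handler
```

CommandTable would hold eight handler addresses, with the three unused command numbers pointing somewhere harmless.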

Right now the entire decoder (minus SD card stuff) looks like this:

Code:

  ;A has the color(skipping that now) tripixel lookup, or Skip Run token #64
  CMP #64
  BEQ .SkipRun ;it is 64, want to skip these. 
.TriPixel:
;ok, lame, but just hack the beep in here for now?
;only costs a few seconds in playtime.. but can be better.
; check for beep on skips instead maybe?
    CMP #255 ;Beep token
    BEQ .GotoBeep 

    TAX ;color/index to x
    LDA Array1-65,x
    sta (Screen),y
    INY
    LDA Array2-65,x
    sta (Screen),y
    INY
    LDA Array3-65,x
    sta (Screen),y
    STA PlotColor
.TriDone:
  LDX RLECount
  LDA PlotColor

.RLETop:
  sta (Screen),y
  INY
  DEX
  BNE .RLETop

.RLEDone
  dec Block_Counter
  BEQ .BLOCK
  JMP .readloop 

.SkipRun:
  clc  
  TYA
  adc RLECount
  TAY
  lda ScreenH
  adc #$00
  sta ScreenH
  CMP #$40
  BEQ .sRstTop
  
  dec Block_Counter
  BEQ .BLOCK
  JMP .readloop 
.sRstTop:
  LDA #$20
  STA ScreenH
  
  dec Block_Counter
  BEQ .BLOCK
  JMP .readloop 

But another thing I could maybe do for a quick change is that I could take my existing code and copy my 192 byte lookup table to zero page?
Is it worth it?
I'm not sure.. I got this from here: https://www.nesdev.org/wiki/6502_cycle_times

Mnemonic  Description        IMM  ZP  ZP,X  ABS  ABS,X  ABS,Y  (IND,X)  (IND),Y
LDA       LoaD Accumulator   2    3   4     4    4+     4+     6        5+
STA       STore Accumulator  -    3   4     4    5      5      6        6

I think if I move the lookup to zeropage I save 2 cycles a lookup without any other changes?
Correct? Because I like free cycles!

Again, thanks for all the help!

gfoot

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Tue Dec 12, 2023 1:19 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

NormalLuser wrote:

Right now an easy way to pack bits with little change to my existing code would be to simply use the first bit to indicate if it is a skip or a tripixel+RLE. If it is a skip I read 8 more bits and that is my RLE, for a total of 9 bits for a skip instead of 16. If it is a TriPixel I read 6 bits for my 3 pixel 64 entry lookup table, then read 7 bits for my RLE. It is only 7 bits for RLE for this because I can't draw off-screen, so an RLE draw line is never more than 100 pixels currently. That plus the original 1 bit to decide on TriPixel or Skip puts me at 14 bits vs. 16.
There are savings to be had....
But I can't really think of a good way to do this without counting bits.

If you can find a way to squeeze in a "read and discard the next 80 bits" command then you can make the encoder just insert this in the right places, i.e. any time the next command it wants to emit doesn't fit within the next 512 byte boundary. It may be tricky though as you may need to waste more than 80 bits in some cases, so you may also need to be able to specify a number of bits to skip.

This command is rare so needs to be added to the encoding cheaply. Can I presume that a tripixel command with zero length is invalid? Perhaps that's a good way to do it. So if the run length is zero then the decoder could ignore the tripixel command, and read and discard bits instead.

How many bits? At least 80 I guess but possibly up to 14 bits more. You could add the 6 bit tripixel lookup index to 80 to get a total number of bits to skip, perhaps.

It doesn't matter if the code to skip the bits is slow, it doesn't run often. The gain from doing it this way is that code that does run often (reading a bit in general) doesn't need to take any overhead counting bits or bytes. So every bit you read will then be a little bit faster.
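A sketch of how the decoder side of that escape might look, using the 80-bit figure from earlier in the thread. ReadBit, TriIndex, DrawRun and NextPacket are hypothetical names: ReadBit is assumed to consume one bit from the stream and preserve X, and TriIndex is assumed to hold the 6-bit index field that was just read:

```asm
  LDA RLECount
  BNE DrawRun         ; non-zero run: an ordinary tripixel packet
  LDA TriIndex        ; zero run: escape, index field = extra bits (0..63)
  CLC
  ADC #80             ; per-block overhead plus the encoder's padding
  TAX                 ; X = 80..143, still fits in one byte
.Discard:
  JSR ReadBit         ; throw one bit away
  DEX
  BNE .Discard
  JMP NextPacket      ; stream is now at the start of the next block
```

The encoder would have to pad in exactly the same way for the two to stay in step.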

Quote:

I think if I move the lookup to zeropage I save 2 cycles a lookup without any other changes?
Correct? Because I like free cycles!

If it happens often then it's worth doing; if these lookups are rare then it's not really worth it. I think you were using Kowalski's simulator to profile the code which seems a good approach to check you're optimising the right things.

On that note you also only need to optimise scenes where the frame rate is low, and understand why that is - is it the reading, is it the decoding, is it the drawing, in fact is there just so much drawing to do on these frames that the cpu will never be able to do it faster? Is it worth using a different encoding just for problem frames? Or as I said before can you make the encoder simplify the deltas, accepting a slightly lossy render to keep the frame rate up even if a few pixels aren't updated until the next frame?

NormalLuser

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Wed Dec 13, 2023 12:03 am

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45

gfoot wrote:

NormalLuser wrote:

I'm thinking there is a gain if I just count the darn bits. It does not need to really be that much of a gain to be worth it. Already I am SO CLOSE to full 30FPS all the time.

gfoot wrote:

Quote:

I think if I move the lookup to zeropage I save 2 cycles a lookup without any other changes?
Correct? Because I like free cycles!

Strange thing happened on the way to the simulator.
Unless I'm missing something it looks like:

Code:

 LDA $00,x
;and
 LDA $9000,x

Both use 4 cycles as long as you don't cross a page boundary?
So no gain by moving the lookup table to zero page like I would have thought?

gfoot wrote:

On that note you also only need to optimise scenes where the frame rate is low, and understand why that is - is it the reading, is it the decoding, is it the drawing, in fact is there just so much drawing to do on these frames that the cpu will never be able to do it faster? Is it worth using a different encoding just for problem frames? Or as I said before can you make the encoder simplify the deltas, accepting a slightly lossy render to keep the frame rate up even if a few pixels aren't updated until the next frame?

Right now I am SO CLOSE. The throwing of the apple and the spiral sun are full speed now, and the yin-yang spiral at the end seems pretty close to full speed as well. Only the girls with the wings sliding/rotating into view have any real slowdown anymore.

I think I might be able to get what I'm looking for with just the 7k read buffer now.
I'm very interested to see how well I can make a very simple one work.

Dr Jefyll

Post subject: Re: Worlds Worst Videocard BadApple Demo. I want more FPS!

Posted: Wed Dec 13, 2023 4:00 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3346
Location: Ontario, Canada

NormalLuser wrote:

Both use 4 cycles as long as you don't cross a page boundary?

'Fraid so!

Here's the play-by-play of CPU operations for LDA z-pg,x:

... and for LDA abs,x with no page crossing it is:

Notice the second case involves two simultaneous operations in cycle 3; that's why the total cycle count may seem surprising.
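In outline, and as a rough summary of the standard 6502 timing rather than a cycle-exact bus trace, the two sequences go something like this (the addresses are the ones from the question):

```asm
  LDA $00,X           ; zero page,X: 4 cycles
                      ;   1: fetch the opcode
                      ;   2: fetch the zero page base address
                      ;   3: add X to the base (the bus access here is discarded)
                      ;   4: read the data from base + X
  LDA $9000,X         ; absolute,X with no page crossing: 4 cycles
                      ;   1: fetch the opcode
                      ;   2: fetch the low byte of the base address
                      ;   3: fetch the high byte AND add X to the low byte
                      ;   4: read the data from base + X
```

The zero page form has one byte less to fetch but needs a cycle of its own for the addition, so the two come out even.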

For better detail than I've provided here, see Appendix A of the MOS MCS6500 Family Hardware Manual. Links to this manual can be found here.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
