
All times are UTC




PostPosted: Fri Dec 15, 2023 11:55 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
Dr Jefyll wrote:
NormalLuser wrote:
Both use 4 cycles as long as you don't cross a page boundary?
'Fraid so!
Here's the play-by-play of CPU operations for LDA z-pg,x:
..
Notice the second case involves two simultaneous operations in cycle 3; that's why the total cycle count may seem surprising.
-- Jeff


I did find that surprising. Thanks!
It got me wondering what the point would be of putting a table in zero page at all (assuming there is space elsewhere).
Then I realized that my table is 192 bytes!
So, if I understand correctly, my table AND the code that uses it need to be in the same page to avoid the page-crossing penalty, which is the penalty you avoid by having the table live in zero page?
Sounds tough with larger tables... like mine.

Crazily enough, even with only 64 bytes to play with around the lookup table(s), I managed by pure chance to wedge the only 3 spots that use the table into the 64 bytes left in the same page as the table!

Is that called accidental optimization?

:)

If I want to use that table elsewhere I'd pay the penalty. But I don't expect to.
That said, I'm only using one or two dozen bytes of zero page now.

So I wonder if there is any code I SHOULD copy in there?
What are some 'demo' or dedicated-task optimizations that run code out of zero page?

What code IS worth putting in zero page in my case? I copy bits from an SD card and I write bytes to RAM. I don't do much else.


PostPosted: Sat Dec 16, 2023 3:20 am 

Joined: Fri Apr 15, 2016 1:03 am
Posts: 136
The page crossing penalty cycle for abs,x addressing doesn't care where the instruction is, only where the table is.

If the instruction does a write at the calculated address, the extra cycle will be used.
If the instruction does a read at the calculated address and adding x to the address increments the hi byte of the address, the extra cycle will be used.
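
The two rules above can be put into a small cycle model. This is only a sketch in Python rather than 6502 code, and the function names are made up for illustration. It assumes the usual documented base counts of 4 cycles for a read such as LDA abs,X and 5 cycles for a write such as STA abs,X.

```python
def lda_abs_x_cycles(base, x):
    """Cycles for a read such as LDA base,X (absolute,X addressing)."""
    # The penalty depends only on the table address and X, not on where the code is.
    crossed = ((base & 0xFF) + x) > 0xFF  # low-byte add carried into the high byte
    return 4 + (1 if crossed else 0)


def sta_abs_x_cycles(base, x):
    """Cycles for a write such as STA base,X: the extra cycle is always used."""
    return 5


def table_can_cross(base, length):
    """True if some index into a `length`-byte table at `base` crosses a page."""
    return (base & 0xFF) + (length - 1) > 0xFF
```

Under this model a 192-byte table that starts on a page boundary never pays the read penalty, no matter which page the code that indexes it lives in.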


PostPosted: Sat Dec 16, 2023 12:47 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
leepivonka wrote:
If the instruction does a write at the calculated address, the extra cycle will be used.
If the instruction does a read at the calculated address and adding x to the address increments the hi byte of the address, the extra cycle will be used.



Thanks!
This is a very good explanation!


PostPosted: Sat Dec 16, 2023 1:37 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
The main reason you might (very rarely!) want to put code in zero page is if you are using self-modifying code. It saves one cycle each time you update a byte of code, and it also lets you use byte pairs in the code as pointers for indirect instructions elsewhere. The sprite plotting in the game "Exile" is an example of that. I think it also found a way to benefit from storing a palette lookup table in zero page, again with self-modifying code: rewriting the zero page operand of an LDA instruction with the index to look up in the palette.

This is only necessary in extremely finely tuned situations, where you have one small routine that needs to run faster and can benefit from these tricks.


PostPosted: Sat Dec 16, 2023 2:05 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
gfoot wrote:
The main reason you might - very rarely! - want to put code in zero page is if you are using self modifying code - it saves one cycle each time you update a byte of code, and also allows you to use byte pairs in the code in other indirect instructions. The "Exile" game's sprite plotting is an example of that, I think it also found a way to benefit from storing a palette lookup table in zero page as well - again self-modifying code, rewriting the zero page operand of an LDA instruction with the index to look up in the palette.

This is only necessary in extremely finely tuned situations, where you have one small routine that needs to run faster and can benefit from these tricks.


And note that the Microsoft 6502 BASICs all copy the CHRGET code to ZP and call it there to save a few cycles...

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Sun Dec 17, 2023 1:56 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
I'm wondering... is there a 2-bit SD card read mode?
I'm using the fast setup gfoot helped me figure out, with data-in hooked up to bit 0 of an otherwise empty VIA port, so all I do to read a byte is this:

Code:
    lda VIA_PORTA   ; first bit (the MSB) arrives on port bit 0
    asl             ; make room for the next bit
    ora VIA_PORTA
    asl
    ora VIA_PORTA
    asl
    ora VIA_PORTA
    asl
    ora VIA_PORTA
    asl
    ora VIA_PORTA
    asl
    ora VIA_PORTA
    asl
    ora VIA_PORTA   ; last bit (the LSB)

    sta RLECount


But looking at this SPI memory:
https://www.adafruit.com/product/5636?g ... QEQAvD_BwE

It has a 2-bit SPI fast read mode in which both the data-in and data-out lines pump out bits when clocked.
If I hooked that input up to bit 1 of the VIA, could get data from it, and doubled the asl, I'd still use 8 asl per read but only 3 ora instead of 7.
Can SD cards do that?
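
To make the instruction counts concrete, here is a rough Python model of both read sequences. It is a sketch under assumptions that are not from the post: the two data lines are taken to land on port bits 1 and 0, with the more significant bit on bit 1, and all other port bits read as 0. It says nothing about whether an SD card can actually do this; as far as I know, SD SPI mode has only a single data-out line.

```python
def read_byte_1bit(bits):
    """Model of lda + 7 x (asl, ora): one data bit per port read, on bit 0."""
    acc, shifts, reads = 0, 0, 0
    for i, bit in enumerate(bits):        # bits arrive most significant first
        if i:
            acc = (acc << 1) & 0xFF       # asl
            shifts += 1
        acc |= bit                        # lda / ora VIA_PORTA
        reads += 1
    return acc, shifts, reads


def read_byte_2bit(bits):
    """Hypothetical dual read: two data bits per port read, on bits 1 and 0."""
    acc, shifts, reads = 0, 0, 0
    for i in range(0, 8, 2):
        if i:
            acc = (acc << 2) & 0xFF       # asl, asl
            shifts += 2
        acc |= (bits[i] << 1) | bits[i + 1]   # lda / ora VIA_PORTA
        reads += 1
    return acc, shifts, reads
```

With this assumed wiring the dual read comes out as 4 port reads (1 lda plus 3 ora) and 6 asl per byte, against 8 port reads and 7 asl for the single-bit version. A different wiring could change the shift count.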


PostPosted: Wed Jan 03, 2024 11:55 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
Steamboat Willie demo on my 6502:
https://youtu.be/8Jy-tLoXLFw

I wanted to show off a few changes I made, but with something different from Bad Apple while doing this testing.
I made some optimizations after realizing two things. Since I always encode skips for every frame, and there are 28 off-screen pixels at the edge of the screen, I only need to check for screen roll-over on skips. I can also use that roll-over to indicate that a new 'frame' is about to start, which is where a Vsync routine can run.
My VGA is 60 FPS and my source is 30 FPS.
So I would ideally display each image for 2 frames.
But since I sometimes get behind, I don't want to always wait the 2 frames.
So I load 252 into the VGAClock zero page location.
This location is decremented each time the VGA Vsync triggers the NMI on the 6502.
Then I decode as normal, running this code on any skip. If the skip is a roll-over, I run the Vsync routine, which lets the decode fall up to 250 Vsyncs behind and still catch back up by not waiting for Vsync on the following frames until it has caught up.

Code:
.SkipRun: ;Just add this amount to the screen pointer
  clc
  TYA
  adc RLECount
  TAY
  lda ScreenH
  adc #$00
  sta ScreenH ;Since we always encode skips on the edge of the screen we only need to check
  CMP #$40    ;for the screen roll-over at $4000 in the Skip routine.
  BEQ .sRstTop;We never roll over while in the draw routines because we are always on-screen.

  BRA .readloop
.sRstTop:
  LDA #$20   ;New frame starts at $2000
  STA ScreenH;Reset screen pointer to first pixel

.Vsync:
  LDX VGAClock
  CPX #250
  BCS .EGVsync ; >= 250
; Less than 250: no wait. Try to catch up.
.Synced:
  INC VGAClock ;Add two for 30 frames a second
  INC VGAClock ;VGA is 60 FPS
  BRA .readloop;** Start next frame **
.EGVsync: ; >= 250
  BEQ .Synced  ; = 250: good Vsync. Done waiting.
  BRA .Vsync   ; > 250: wait for Vsync


Also, since I no longer need a separate check for Vsync, I could also drop the Beep commands, which were nested branches in the decode. This removed that branch decision from half the bytes read, speeding it up some more.
It is working pretty well. Looking forward to some more advances in the new year!
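
The catch-up behaviour of the VGAClock counter is easier to see in a small simulation. This is a Python sketch of the logic in the .Vsync routine above, not the routine itself. The function name and the idea of passing in each frame's decode time as a whole number of Vsync ticks are assumptions made for illustration.

```python
SYNC = 250  # value compared against in .Vsync (CPX #250)


def simulate(decode_ticks, clock=252):
    """Return (ticks waited per frame, final clock).

    decode_ticks: Vsync ticks that pass while each frame is decoded.
    The NMI decrements the clock once per tick; finishing a frame adds 2.
    """
    waits = []
    for ticks in decode_ticks:
        clock = (clock - ticks) & 0xFF      # NMI: one decrement per Vsync
        waited = 0
        while clock > SYNC:                 # ahead of schedule: spin in .Vsync
            clock = (clock - 1) & 0xFF      # ...until the NMI brings it to 250
            waited += 1
        clock = (clock + 2) & 0xFF          # .Synced: INC VGAClock twice
        waits.append(waited)
    return waits, clock
```

For example, `simulate([5, 1, 1, 1])` models one slow frame followed by three fast ones: no frame waits, and the clock is back at 252 once the deficit is made up. One caveat of the 8-bit counter, in the model as in the real routine: if the clock is decremented past 0 it wraps to a value above 250 and looks "ahead" again.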


PostPosted: Thu Jan 04, 2024 1:47 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Nice, that makes much more use of the greyscale!


PostPosted: Tue Jan 16, 2024 4:26 am 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
53 frames a second!
I have my new one-byte encoder/decoder and am now pushing 53 frames a second with Vsync off.
It has run-length, differential and TriPixel encoding.
With Vsync on, the video plays almost perfectly in sync, down to the second.
The one-byte encoder has brought the average bytes per frame down to 402 from 472.
That is a 20:1 compression ratio.

I am not using any kind of read buffer. I still stream off the SD card with just a VIA.
One benefit of this encoding scheme is that, since SD cards stream the bits highest to lowest, I get to do this:

Code:
; Load Control Bit 1
  lda VIA_PORTA ; Read the top bit (bit 7) of the byte from the SD card.
  bne .SkipRun  ; If 1 Skip Pixels
; Load Control Bit 2
  lda VIA_PORTA ; Else check the next bit (bit 6)
  bne .RLE      ; If 1 Repeat Pixels
  jmp .TriPixel ; Else TriPixel

.SkipRun: ;Just add this amount to the screen pointer
; Load 7 bits for a skip value up to 127


This is neat because now I don't have to shift these control bits into place like this:

Code:
 ora VIA_PORTA
  asl


That makes the decoder routine up there just about the fastest thing I can figure out to do.

I need some way to keep track of bytes because SD cards have 10 CRC bytes every 512 bytes.
I don't want to waste decode time counting bits, so I compromised and kept everything encoded with 8 bits.

For skips I get to use 7 bits and skip up to 127. Skip is used a lot and can span lines.
I only need 6 bits to encode the 64 lookup values for the 'TriPixel' runs of 3 pixels, so that fits well.
That means a repeat also gets only 6 bits, for a max of 64. Less than ideal.
But the screen is only 100 pixels wide and the end of every line has a minimum of 28 pixels to skip. For example, a full black or white fill will have a skip byte at the end of all 64 lines, so at worst you are using two bytes to fill the line.
The old two-byte routine was already using two bytes to do that, and since the end of every line has a skip, you were never filling more than 100 pixels with a 2-byte update anyway.
Add to that the fact that you are usually not filling more than half the screen with any given color, and the 64 repeat limit is not so limiting.
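
The one-byte layout described above can be sketched in Python as follows. This is an illustration, not the linked encoder: the function names are hypothetical, and storing the counts raw in the low bits (no offset) is an assumption.

```python
def pack_skip(count):
    """Top bit 1: skip, with a 7-bit count."""
    assert 0 <= count <= 127
    return 0x80 | count


def pack_rle(count):
    """Top bits 01: repeat the last color, with a 6-bit count."""
    assert 0 <= count <= 63
    return 0x40 | count


def pack_tripixel(index):
    """Top bits 00: 6-bit index into the 64-entry TriPixel lookup."""
    assert 0 <= index <= 63
    return index


def unpack(byte):
    """Mirror the decoder: test the top bit first, then the next one."""
    if byte & 0x80:
        return ("skip", byte & 0x7F)
    if byte & 0x40:
        return ("rle", byte & 0x3F)
    return ("tripixel", byte & 0x3F)
```

Because the control bits sit at the top of the byte and arrive first, the decoder can branch on each one as it is read, before any of the count bits have been shifted in.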

With this simple routine I got down to 2,569 KB from 50 MB of uncompressed frames. This shaved around 500 KB off my previous best. The faster decode and reduced bit reads also added almost 7 frames per second to the average framerate.
Even better, it lowered the number of frames that miss Vsync by quite a lot. There is almost no visible 'rubber banding', and the only way to tell it happens at all is to play it side by side with an mp4 on my PC.

So now I think I have just about the fastest SD card read possible with a plain VIA, and I think I have just about the fastest possible greyscale decoder for a 6502 + World's Worst Video Card. ( prove me wrong :D )
And while I really would love any tips or hints for improvements, I think I am about done with this portion.

I was afraid that a read buffer would be too small, given the overhead of writing to the buffer and then pulling from it, and that it would end up slowing things down in problem areas without helping anywhere else.
But now that I have lowered the bytes per frame and increased the decode speed, the 7.5 KB of RAM I have available (if I put the decoder in ROM) might be enough to actually improve the 30 FPS lock.

I still need to finish up the VIA square wave music for a proper demo, but the heavy lifting of this encoder/decoder seems to be done.

https://github.com/Fifty1Ford/Ben-Eater-Bad-Apple/blob/main/BadAppleNew1Byte7.asm
https://github.com/Fifty1Ford/Ben-Eater-Bad-Apple/blob/main/BadAppleNew1Byte7.py


PostPosted: Tue Jan 16, 2024 8:31 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Great result! There's a couple of things you can cheaply do with your counts - both your skip counts and your repeat counts. One is to apply an offset, so 0-63 becomes 10-73 or any other range which happens to work best. Another is to apply scaling, so 0-63 becomes 0-126, or combine both so 10-73 becomes 20-146, or maybe even scaling by 4 would be an advantage.

It might be that neither help, in your case, but it could be worth a little experimentation. Scaling only costs you two cycles because it's just a shift.
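
As a quick illustration of the offset and scaling ideas, here is a Python sketch of the decode side. The function name and parameters are hypothetical, and the offset of 10 is just the example figure from this post.

```python
def decode_count(raw, offset=0, shift=0):
    """Turn a stored 6-bit count into a pixel count.

    offset: constant added to every stored count.
    shift:  scaling as a left shift (1 means x2, 2 means x4).
    """
    return (raw + offset) << shift
```

With `offset=10` the stored range 0-63 decodes to 10-73, with `shift=1` it decodes to 0-126, and with both it decodes to 20-146.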


PostPosted: Wed Jan 17, 2024 1:34 am 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
BigEd wrote:
Great result! There's a couple of things you can cheaply do with your counts - both your skip counts and your repeat counts. One is to apply an offset, so 0-63 becomes 10-73 or any other range which happens to work best. Another is to apply scaling, so 0-63 becomes 0-126, or combine both so 10-73 becomes 20-146, or maybe even scaling by 4 would be an advantage.
It might be that neither help, in your case, but it could be worth a little experimentation. Scaling only costs you two cycles because it's just a shift.


Thanks!
I really like your idea of scaling x2 for only two cycles a byte!
I could do it for the skips as well, to get back to the 256 skip ability.
The question then is how to encode that correctly.
I'd have to end some skips or repeats 1 pixel early and then fill that pixel in with a TriPixel, because every count has to be a multiple of two, i.e. I need to deal with the 'odd' pixel. Since I already sometimes fill in 1 or 2 pixels that I don't strictly need to with TriPixels, stopping some skips or repeats 1 pixel 'early' might not even add any overhead or encoded bytes?
Maybe a two-pass encoder would be an easy way?
I'll have to think on it.

Neat idea!
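
One possible way to handle the 'odd' pixel on the encoder side, sketched in Python. This is a hypothetical helper, not something from the actual encoder: it splits a run into x2-scaled counts and reports the leftover pixel that something else (such as a TriPixel) would have to cover.

```python
def split_run(length, max_raw=63):
    """Split a run of `length` pixels for a x2-scaled count field.

    Returns (raw_counts, leftover), where each raw count stands for
    2 * raw pixels and leftover is 0 or 1 pixel that is not covered.
    """
    pairs, leftover = divmod(length, 2)
    raw_counts = []
    while pairs > 0:
        take = min(pairs, max_raw)   # a single byte can hold at most max_raw
        raw_counts.append(take)
        pairs -= take
    return raw_counts, leftover
```

For a run of 7 pixels this gives one scaled count of 3 (6 pixels) plus 1 leftover pixel.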


PostPosted: Sun Jan 28, 2024 11:39 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
Hey everyone!
An update on the progress I've made in my quest for FPS.
I just can't leave well enough alone and had to try to squeeze a little more performance out of this.
And I've finally found something useful to do on this project with basic self-modifying code!
For whatever reason I really wanted at least one routine to use it and gain performance.

I made a simple unrolled routine that draws a pixel, increments Y, and then draws another, up to a total of 64 times.
This is around 8.2 cycles a pixel vs 13.3 for the loop that did it before.
Each unrolled pixel is 3 bytes of code, so the jump target is a multiple of 3. I could shift and add to get the x3 I need for the jump, but a lookup is a few cycles faster.

Code:
 ...
  TAX ; RLE count
  LDA .RLEArray,x ; Load JMP location
  STA .RLEJump+1  ; Store low byte location
  lda PlotColor   ; Last pixel color used
.RLEJump:
  JMP .RLERender ;Self modify! Jump to count


.RLERender:
  sta (Screen),y; Draw it! 64 ; This is location 0 index 64
  iny ; Next pixel
  sta (Screen),y; Draw it! 63 ; This is location 3 index 63
  iny ; Next pixel
 
  ... ETC ...

  sta (Screen),y; Draw it! 2; This is location 186 index 2
  iny ; Next pixel
  sta (Screen),y; Draw it! 1 ; This is location 189 index 1
  iny ; Next pixel
  jmp .readloop ; Decode another byte
 
  .ORG $800 ;Page align
.RLEArray: ;Descending array of 3-byte steps into .RLERender
; I don't use 'zero' so use '1' twice, i.e. two 189
    .BYTE 189,189,186,183,180,177,174,171,168,165,162
        .BYTE 159,156,153,150,147,144,141,138,135,132
 .. ETC ..
       .BYTE   9,  6,  3,  0
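
The table values follow a simple rule, which this Python sketch reproduces. It is not how the table above was generated; the function name and parameters are made up, and it assumes each unrolled pixel is 3 bytes (a 2-byte sta (Screen),y plus a 1-byte iny), as the location comments above indicate.

```python
def rle_jump_table(max_run=64, bytes_per_pixel=3):
    """Low-byte offsets into the unrolled routine, indexed by run length.

    A run of n pixels must enter the routine n pixels before its end,
    i.e. at offset (max_run - n) * bytes_per_pixel.
    """
    table = []
    for count in range(max_run + 1):
        n = max(count, 1)            # a count of 0 is treated as 1
        table.append((max_run - n) * bytes_per_pixel)
    return table
```

The largest offset is 189, so every entry fits in one byte. Since only the low byte of the JMP operand is patched, this appears to rely on all the entry points in .RLERender sitting in a single page.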
       


What surprised me was that this did not really add that much to the overall FPS. It only increased the average by 0.3 FPS, to 50.4 FPS.
However, I turned on a debug pixel that shows red when a frame can't complete within a Vsync, and after watching the playback and thinking about it I realized a couple of things.
Most frames are fine now, approaching 99%+.
Of the frames that were still slow, the vast majority fell into two cases: some had lots of small variations over several frames, and some had large variations over a few frames.
The long runs of pixels seem to happen often only in the second case.
This means that while the routine is a lot faster, the number of frames that can really take advantage of it is actually pretty limited.

The good news is that the frames that did improve were some of the worst visually.
While objectively it did not add much to the performance using FPS as the metric, the subjective smoothness is much better now that these frames lag less.
I wonder if I'll find anything else to improve?


PostPosted: Sun Jan 28, 2024 11:45 pm 

Joined: Sun Sep 24, 2023 3:45 pm
Posts: 45
BigEd wrote:
Great result! There's a couple of things you can cheaply do with your counts - both your skip counts and your repeat counts. One is to apply an offset, so 0-63 becomes 10-73 or any other range which happens to work best. Another is to apply scaling, so 0-63 becomes 0-126, or combine both so 10-73 becomes 20-146, or maybe even scaling by 4 would be an advantage.

It might be that neither help, in your case, but it could be worth a little experimentation. Scaling only costs you two cycles because it's just a shift.


FYI, I tried the scaling for skip and RLE and I did get a slight reduction in size. It worked out to a 70 KB reduction from the ~2.5 MB file, or ~10 bytes a frame on average, but the overhead of scaling meant that it did not help the FPS overall.
However, I suspect that another source would see an improvement.
When I try to compress full color later on I will need to keep scaling in mind. Thanks!


PostPosted: Wed Feb 07, 2024 6:37 am 

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
I just wanted to mention, it's not hard to hit a desired frame rate. You have to realize that it's not necessary to update the entire set of pixels. Just pick the bigger blocks and cut the ones you don't have time to display. This basic principle is used in all temporal video codecs. We can't really see the details when they are in motion.
It's been done on a 1 MHz 6502 with an average of 70 bytes per frame at 12 fps, while also streaming from disk and playing music, here:
https://csdb.dk/release/?id=131628
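
A minimal sketch of the "keep the big blocks, drop the rest" idea, in Python. This is only one naive way to do it and is not taken from the linked release. The function name, the (changed pixels, cost in bytes) tuples and the per-frame byte budget are all assumptions for illustration.

```python
def pick_updates(blocks, budget):
    """Choose which changed blocks to send for one frame.

    blocks: list of (changed_pixels, cost_bytes) tuples.
    budget: maximum total bytes allowed for this frame.
    Largest changes are considered first; anything that no longer
    fits in the budget is dropped.
    """
    chosen, spent = [], 0
    for pixels, cost in sorted(blocks, key=lambda b: b[0], reverse=True):
        if spent + cost <= budget:
            chosen.append((pixels, cost))
            spent += cost
    return chosen
```

A real encoder would also need to carry dropped blocks forward so that stale pixels eventually get refreshed, which this sketch leaves out.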

