6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Mon Sep 23, 2024 3:22 am

All times are UTC




Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Mon Apr 20, 2020 3:40 pm 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Hi all,

I'm cleaning up a project I started late last year to play 'DIGIs' (digital samples) on the C64 SID. After figuring out how to process the sample is Audacity for good results: 8-bit samples at 8khz. I then reduce the samples to 4-bit, doing some rounding up, and pack two per byte.

I also tried to do a simple Run Length Encoding, which reduced the sample file size by about 25%. And while I can decode it easy enough on the C64 it takes far too long to do to process in an 8khz interrupt loop. It would work at 4khz I suspect though.

Anyhow, I was optimizing my simple, non-RLE, interrupt routine this morning and was unhappy with how long it took to toggle a bit in zero page that I'm using as a flag to know which nibble should be played next.

Originally I was doing this to check, and then toggle the bit at the end. The #- indicates clock cycles with (#) as the total for that block. So originally it took 14 clock cycles ignoring the single-cycle difference in BNE if branch take or not.

Code:
         ;every other NMI do *1 or *2
         LDA flag               ; 3- if flag==0 we just played upper nibble
         BNE lower             ; 2- (5) so skip ahead to load new byte

...other code

         LDA flag               ; 3- toggle hi/low nibble flag and exit NMI
         EOR #$1              ; 3-
         STA flag               ; 3- (9)


I reduced this to 12 cycles with the following:

Code:
         LDA #$01               ; 2-
         AND flag               ; 3-
         BNE lower              ; 2- (7)

         INC flag               ; 5 (5)


So, not a huge savings but even two clock cycles in an 8khz interrupt is a lot of cycles per second. It still seems like it could be simplified though. I'm not sure what else could be done with this structure but maybe it would be possible to combine the toggle and test into one block?

It got me wondering what clever ways had been devised to handle a simple task like this over the years? I did a search on the forum but the terms 'bit toggle' brings up many, many threads of course.


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 5:04 pm 
Offline

Joined: Sat Jan 02, 2016 10:22 am
Posts: 197
IF you don't mind using a whole zero page byte for one bit then $FF and be flipped to $00 with INC zp and flipped back again with DEC zp.

BIT ZP will then set the sign flag according to the state of the top bit, which is flipping between 0 and 1 (as are all the other bits).

That's 5 cycles to flip the flag and 3 to test, and it doesn't upset any registers.


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 5:27 pm 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Martin A wrote:
IF you don't mind using a whole zero page byte for one bit then $FF and be flipped to $00 with INC zp and flipped back again with DEC zp.


If I understand you correctly I would have to decide when to inc and when to dec which I don't want to do. Each time through the interrupt it just needs to toggle the flag. Using a whole byte of ZP for the flag is OK in this case.


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 6:28 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Jeff_Birt wrote:
Code:
         LDA #$01               ; 2-
         AND flag               ; 3-
         BNE lower              ; 2- (7)
...
         INC flag               ; 5 (5)

I can save you a byte (or maybe 3), but without further context I can't save you any cycles:
Code:
         LDA flag               ; 3
         LSR A                  ; 2
         INC flag               ; 5
         BCS lower              ; 2+ (12+)
...

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 6:37 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
If you had a CMOS 6502, you could use the TRB and TSB instructions. On Rockwell and WDC versions, you even have BBR/Sx and R/SMBx to assist. But on the C64, you don't have those niceties.

The most straightforward approach, which you seem to have missed, would be to use exclusive-or:
Code:
  LDA #1    ; 2 cyc - any non-zero constant will do here
  EOR flag  ; 3 cyc
  STA flag  ; 3 cyc
  BNE FlagWasClear  ; 2.5 cyc
FlagWasSet:
  ...
FlagWasClear:
  ...
That gives you a 10.5 cycle (on average) flag test and toggle, including the branch.

Using the INC/DEC method proposed by Martin:
Code:
  BIT flag  ; 3 cyc
  BPL FlagWasClear  ; 2.5 cyc
FlagWasSet:
  INC flag  ; 5 cyc
  ...
FlagWasClear:
  DEC flag  ; 5 cyc
  ...
Including the branch, that is also 10.5 cycles on average. Both of the above versions are also 8 bytes.

For yet another approach, we start by initialising the flag byte to $AA or $55, depending on which case you want to run first. Then we can get an alternating series with the following:
Code:
  ASL flag  ; 5 cyc
  BCC FlagWasClear  ; 2.5 cyc
  INC flag  ; 5 cyc - inject a set bit at the right-hand end of the byte to restore the pattern
FlagWasSet:
  ...
FlagWasClear:
  ...
This reduces the average cycle count to 10 cycles exactly, saving half a cycle, because the INC is only executed half the time. It also saves two bytes, for a total of 6.


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 7:44 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Chromatix wrote:
Code:
  ASL flag  ; 5 cyc
  BCC FlagWasClear  ; 2.5 cyc
  INC flag  ; 5 cyc - inject a set bit at the right-hand end of the byte to restore the pattern
FlagWasSet:
  ...
FlagWasClear:
  ...

That's too cool for school! 8)

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 8:33 pm 
Offline

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 275
Location: Placerville, CA
Yup, that's a bingo.


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 9:03 pm 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Chromatix wrote:
The most straightforward approach, which you seem to have missed, would be to use exclusive-or:
Code:
  LDA #1    ; 2 cyc - any non-zero constant will do here
  EOR flag  ; 3 cyc
  STA flag  ; 3 cyc
  BNE FlagWasClear  ; 2.5 cyc
FlagWasSet:
  ...
FlagWasClear:
  ...
That gives you a 10.5 cycle (on average) flag test and toggle, including the branch.

That is what I was talking about with 'combiining into one block' rather than having the bit test and toggle in seperate places, not even sure why I had them seperated to begin with. This is also the most striaghtforward approach and is very clear to anyone else reading the code what is happening.


Chromatix wrote:
Using the INC/DEC method proposed by Martin:
Code:
  BIT flag  ; 3 cyc
  BPL FlagWasClear  ; 2.5 cyc
FlagWasSet:
  INC flag  ; 5 cyc
  ...
FlagWasClear:
  DEC flag  ; 5 cyc
  ...
Including the branch, that is also 10.5 cycles on average. Both of the above versions are also 8 bytes.

That would work as well.


Chromatix wrote:
For yet another approach, we start by initialising the flag byte to $AA or $55, depending on which case you want to run first. Then we can get an alternating series with the following:
Code:
  ASL flag  ; 5 cyc
  BCC FlagWasClear  ; 2.5 cyc
  INC flag  ; 5 cyc - inject a set bit at the right-hand end of the byte to restore the pattern
FlagWasSet:
  ...
FlagWasClear:
  ...
This reduces the average cycle count to 10 cycles exactly, saving half a cycle, because the INC is only executed half the time. It also saves two bytes, for a total of 6.

I think this wins the clever award! It might well leave someone else scractching thier head or even me scratching my head 6 months from now when wondering why I did that :)

I think all three approaches have their own merit. I think I'll implement #1 and then post the whole ISR to see if there are other savings to be had.

Thanks everyone!


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 9:15 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
The same code, with different initialisation values, can be used to generate a set (or cleared) bit every 2 cycles, every 4 cycles, or every 8 cycles - indeed, in any repeating pattern of length 8. The execution time is 12 cycles for every set bit, and 8 cycles for every cleared bit. It has the additional advantage that the only register it clobbers is the status bits.

For an arbitrary interval between events, consider the following which is just as fast, and only goes back up to 8 bytes of code:
Code:
  DEC count  ; 5 cyc
  BNE NotYet ; 2-3 cyc
  LDA #interval ; 2 cyc
  STA count  ; 3 cyc
DoEvent:
  ...
NotYet:
  ...
This does clobber the accumulator, but you can substitute X or Y if that's more convenient. And yes, you can set interval=2 to get the behaviour desired by the OP.

There's always more than one way to do it. :-)


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 11:04 pm 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Chromatix wrote:
The same code, with different initialisation values, can be used to generate a set (or cleared) bit every 2 cycles, every 4 cycles, or every 8 cycles - indeed, in any repeating pattern of length 8. The execution time is 12 cycles for every set bit, and 8 cycles for every cleared bit. It has the additional advantage that the only register it clobbers is the status bits.


Another very interesting trick!


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 20, 2020 11:53 pm 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Here is the NMI for the DIGI player. One thing that just occurred to me is that I could make sure that the samples always end on a page boundary. That way I only need to compare the high byte of the sample ending address. That would save 9-10 cycles.

It might also be interesting to look at using a 4khz sample rate and interpolating between samples. Maybe something like '(current_nibble + last_nibble) > 1'

Code:
;-------------------------------------------------------------------------------
; NMI handler routine, plays one 4bit sample per pass
NMI_HANDLER       
        ; start with saving state
        PHA                     ; 3- will restore when returning
        TYA                     ; 2- from interrupt handler
        PHA                     ; 3- (8)

        ; play 4-bit sample, first sample byte saved during Init
        LDA sample              ; 3- load sample byte
        ORA #$10                ; 2- make sure we don't kill filter settings
        AND #$1F                ; 2- git rid of any dangling high bits
        STA SID+$18             ; 4- save to SID volume regsiter
        STA $D020               ; 4- change border color for something to look at
        LDA $DD0D               ; 4- (19)clear NMI

         ;every other NMI do *1 or *2
        LDA #$01                ; 2- if flag!=0 we just played upper nibble
        EOR flag                ; 3-
        STA flag                ; 3-
        BNE lower               ; 2-3 (10-11) so skip ahead to load new byte

upper   LDA sample              ; 3- *1 shift upper nibble down
        LSR a                   ; 2-
        LSR a                   ; 2-
        LSR a                   ; 2-
        LSR a                   ; 2-
        STA sample              ; 3- store it back to play next pass
        JMP exit                ; 3- (17) all done for this pass

lower   LDY #$0                 ; 2- *2 get a new packed sample byte
        LDA (ptr),y             ; 5-     
        STA sample              ; 3- save to temp location
        INC ptr                 ; 5- inc point to next sample byte
        BNE checkend            ; 2-3 did we roll low byte over to zero?
        INC ptr+1               ; 5- (22-23) if so inc the high byte of ptr too
   
checkend
        LDA ptr                 ; 3- if not @ end of sample exit/return from NMI
        CMP #<DATASTOP          ; 4- low byte
        BNE exit                ; 2-3 (10)
        LDA ptr+1               ; 3- high byte
        CMP #>DATASTOP          ; 4
        BNE exit                ; 2-3 (18-19)

         ; when done with sample, turn off NMI interrupt
        LDA #$00                ; 2- turn off NMI
        STA $DD0E               ; 4- timer A stop-CRA, CIA #1 DC0E
        LDA #$4F                ; 2- disable all CIA-2 NMIs
        STA $DD0D               ; 4- ICR - interrupt control / status
        LDA $DD0D               ; 4- (16) sta/lda to ack any pending int
 
        LDA #$37                ; 2- reset kernal banking
        STA $01                 ; 3- (5)

exit     
        PLA                     ; 3- restore state
        TAY                     ; 2-
        PLA                     ; 3-
        RTI                     ; 6- (14) return from this pass of NMI
                                ; 68 *1, 84-85,93-94 *2


Top
 Profile  
Reply with quote  
PostPosted: Tue Apr 21, 2020 5:03 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
A couple of good ways to save a few cycles, if you need them:

In general the fall-through branch is faster than taking it, so you should use the fall-through for the common case and take the branch for the exceptional case. That means moving code that is executed rarely out-of-line, and branching back to the main line. If you align the sample with the page end, as suggested, that means you should swap the code blocks beginning "checkend" and "exit".

In the ISR entry and exit, it's faster by 3 cycles each to preserve Y in a ZP byte than on the stack. If you have a spare byte, do that. If you have *two* spare bytes, put A there too to save another cycle at each end. In total that buys you 8 cycles. But you might not need to use Y at all, see below.

I assume this routine will be in RAM. In that case, save yet more cycles and some ZP space by switching the LDA (ptr),Y for an LDA absolute, and self-modify its operand. This saves 3 cycles directly; the instruction itself takes one fewer cycle, and you eliminate LDY #0. On the downside, the INC ptr (where ptr is now an absolute address in the NMI routine) in the fast path takes an extra cycle. But the biggest saving is 12 cycles in the ISR prologue/epilogue, as you no longer need TYA:PHA and PLA:TAY; this is the only place you needed the Y register.

Putting all of this together:
Code:
NMI_Handler:
  PHA

  ; load sample into SID
  LDA sample
  ORA #$10
  AND #$1F
  STA SID+$18
  STA $D020  ; border colour
  LDA $DD0D  ; clear NMI

  ; alternate high/low nybble
  ASL flag
  BCC @lower
  INC flag

@upper:
  LDA sample
  LSR A
  LSR A
  LSR A
  LSR A
  STA sample

  ; local exit code is now both smaller and faster than jumping to it
  PLA
  RTI

@lower:
  LDA $FFFF  ; self-mod pointer to sample - overwrite this operand during init
  STA sample
  INC @lower+1
  BEQ @nextPage

  ; local exit code is now both smaller and faster than jumping to it
  PLA
  RTI

@nextPage:
  ; this is only executed every 512 samples
  LDA @lower+2
  ADC #1     ; carry is clear from flag test
  CMP #stopPage
  BEQ @stop
  STA @lower+2

  ; local exit code is now both smaller and faster than jumping to it
  PLA
  RTI

@stop:
  ; switch off the timers
  LDA #0
  STA $DD0E
  LDA #$4F
  STA $DD0D
  LDA $DD0D

  ; restore banking
  LDA #$37
  STA $01
  PLA
  RTI


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 22, 2020 1:34 am 
Offline

Joined: Wed Jul 18, 2018 12:12 pm
Posts: 96
Thank you Chromatix. I did not get a chance to get back to this today. I set out to test how the C64 diagnostic harness worked on the C128 with a PCB a friend designed for the KB dongle and the C128 quit right in front of me. No clock coming out of the 8701 clock generation chip, but the 8701 itself was OK. The bloody crystal died sitting right there on the bench. Was using it, turned it off and back on about 15 minutes later and it was 'Dead Jim'.

Anyhow your changes make sense. I tend to shy away from self modifying code unless it has a distinct advantage and I think your use case shows a strong advantage. I'll give this a go in the next few days and then work on the interpolated 4khz version.


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2020 7:15 am 
Offline

Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
All the above I think leads to now removing the zp sample location which make the average case even quicker

Code:
NMI_Handler:
  PHA

  ; load sample into SID
@sample:
  LDA $FFFF  ; self-mod pointer to sample - overwrite this operand during init

; alternate high/low nybble
  ASL flag
  BCC @writeSID
  INC flag

  LSR A
  LSR A
  LSR A
  LSR A

  INC @sample+1
  BEQ @nextpage

@writesample
  ORA #$10
  AND #$1F
  STA SID+$18
  STA $D020  ; border colour
  LDA $DD0D  ; clear NMI

  ; local exit code is now both smaller and faster than jumping to it
  PLA
  RTI

@nextPage:
  ; this is only executed every 512 samples
  INC @sample+2
  PHA
  LDA @sample+2
  CMP #stopPage+1
  PLA
  BCC  writesample 

@stop:
  ORA #$10
  AND #$1F
  STA SID+$18
  STA $D020  ; border colour

  ; switch off the timers
  LDA #0
  STA $DD0E
  LDA #$4F
  STA $DD0D
  LDA $DD0D

  ; restore banking
  LDA #$37
  STA $01

  PLA
  RTI


The maybe typos but you should get the idea.
The ending looks as though It can be tidied up.


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2020 7:28 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I see what you're trying to do there, and I considered making a similar adjustment myself, but this now means there's a different delay between /NMI and the sample appearing in the SID for alternate samples; one path has to go through the LSR chain and possibly the page-increment code, while the other goes directly. That sort of timing effect could produce an audible distortion.

And, for the sake of one ZP byte, the code is a great deal messier and harder to read. That's in part because the labels aren't really consistent with function any more, but there is also more duplication of code. I think if you were that desperate for a ZP byte, you should just move the sample storage to an absolute address, with very little code size/speed penalty.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: