6502.org • View topic - Optimization competition: An efficient calling convention

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Programming

All times are UTC

Optimization competition: An efficient calling convention

Page 1 of 2

[ 26 posts ]

Go to page 1, 2 Next

Previous topic | Next topic

Author

Message

johnwbyrd

Post subject: Optimization competition: An efficient calling convention

Posted: Sun Jun 21, 2020 10:34 pm

Joined: Mon May 01, 2017 7:13 am
Posts: 82

Howdy programmers,

Here's a fun little mini programming challenge that you may enjoy.

Your task is to design and implement a highly specific 6502 calling convention.

In your calling convention, all the operands must be 8-bit pointers representing zero-page addresses. The function being called, will ONLY touch zero-page memory to do whatever work it needs to do -- it will NOT touch 16-bit memory.

All functions using your calling convention will have three or fewer operands. You don't need to worry about arbitrary numbers of operands.

For example, one function might take the value that the first operand points to, subtract the value that the second operand points to, and store the result in the location that the third operand points to.

You can use all the 6502 registers, the stack, zero page, main memory, or anything else you want in your calling convention.

For example, one calling convention might be to store the operands in three consecutive zero page locations.

Another calling convention would be to store the first operand in the X register, the second operand in the Y register, and the third operand in a fixed zero page location.

Another might be to store all the operands on the stack.

To demonstrate your calling convention, please show an efficient implementation of an addition function that:

- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.

The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.

Another win should also go to the calling convention with the smallest code size.

Thank you very much for playing with me! Looking forward to seeing your brilliancies.

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Sun Jun 21, 2020 11:06 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

Ground rule questions:

6502 instructions only or are 65C02 instructions allowed?

Is self-modifying code allowed?

Top

johnwbyrd

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Sun Jun 21, 2020 11:50 pm

Joined: Mon May 01, 2017 7:13 am
Posts: 82

BillG wrote:

6502 instructions only or are 65C02 instructions allowed?

It must be shown that a reasonably performant NMOS 6502 solution exists. However, a clever 65C02-only solution is definitely of interest. (The compiler can be told to target a specific 6502 variant.)

For that reason, a 65CE02-only solution, that may or may not use the Z accumulator, is also of interest.

However, a winning entry must show that you can do it cleanly, via documented instructions only, on the original 6502 metal, as per http://archive.6502.org/books/mcs6500_family_programming_manual.pdf .

BillG wrote:

Is self-modifying code allowed?

Same as above. It must be shown that a reasonably performant, non self modifying solution, exists. However, if there is a clever self modifying solution, that is definitely of interest as well. (Some targets will need the executable code to be in ROM. and some will not.)

Any other rules of engagement questions?

Last edited by johnwbyrd on Mon Jun 22, 2020 10:01 pm, edited 1 time in total.

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 12:07 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

This is my entry for fastest NMOS 6502:

Code:

        org     0

P1      dw      0       ; Pointer to first operand
P2      dw      0       ; Pointer to second operand

Num1    dw      1234    ; First number
Num2    dw      5678    ; Second number
Num3    dw      0       ; Sum

;
; (P1) + (P2) -> (P3)
;
Add
        sta     P1
        sty     P2

        ldy     #0
        clc
 
        lda     (P1),Y
        adc     (P2),Y
        sta     0,X
 
        iny
 
        lda     (P1),Y
        adc     (P2),Y
        sta     1,X

        rts

;
Start
        lda     #<Num1  ; Store pointer to first number

        ldy     #<Num2  ; Store pointer to second number
 
        ldx     #<Num3  ; Store pointer to sum

        jsr     Add     ; Add them

        end     Start

Edit: Made faster.

Last edited by BillG on Mon Jun 22, 2020 1:58 am, edited 1 time in total.

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 12:11 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

This is my entry for fastest 65C02:

Code:

        org     0

P1      dw      0       ; Pointer to first operand
P2      dw      0       ; Pointer to second operand

Num1    dw      1234    ; First number
Num2    dw      5678    ; Second number
Num3    dw      0       ; Sum

;
; (P1) + (P2) -> (P3)
;
Add
        sta     P1
        sty     P2

        ldy     #1
        clc
 
        lda     (P1)
        adc     (P2)
        sta     0,X
 
        lda     (P1),Y
        adc     (P2),Y
        sta     1,X

        rts

;
Start
        lda     #<Num1  ; Store pointer to first number

        ldy     #<Num2  ; Store pointer to second number
 
        ldx     #<Num3  ; Store pointer to sum

        jsr     Add     ; Add them

        end     Start

Edit: Made faster.

Last edited by BillG on Mon Jun 22, 2020 2:10 am, edited 1 time in total.

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 12:16 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

This is my entry for smallest NMOS 6502:

Code:

        org     0

P1      dw      0       ; Pointer to first operand
P2      dw      0       ; Pointer to second operand

Num1    db      >1234    ; First number (big endian)
        db      <1234
Num2    db      >5678    ; Second number (big endian)
        db      <5678
Num3    dw      0       ; Sum (big endian)

;
; (P1) + (P2) -> (P3)
;
Add
        sta     P1
        sty     P2

        ldy     #1
        clc
 
Loop
        lda     (P1),Y
        adc     (P2),Y
        sta     0,X

        dex 
        dey
 
        bne     Loop

        rts

;
Start
        lda     #<Num1  ; Store pointer to first number

        ldy     #<Num2  ; Store pointer to second number
 
        ldx     #<Num3+1; Store pointer to sum (Note the +1)

        jsr     Add     ; Add them

        end     Start

Edit: revised to benefit Add being called more than once.

Further edit: Made smaller.

Last edited by BillG on Mon Jun 22, 2020 2:32 am, edited 1 time in total.

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 12:41 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

This is my entry for smallest self-modifying 6502:

Code:

        org     0

Num1    db      >1234    ; First number (big endian)
        db      <1234
Num2    db      >5678    ; Second number (big endian)
        db      <5678
Num3    dw      0       ; Sum (big endian)

;
; (P1) + (P2) -> (P3)
;
Add
        sta     P1
        stx     P2
        sty     P3

        ldx     #1
        clc
 
Loop
P1      equ     *+1
        lda     0,X

P2      equ     *+1
        adc     0,X

P3      equ     *+1
        sta     0,X
 
        dex
 
        bne     Loop

        rts

;
Start
        lda     #<Num1  ; Store pointer to first number

        ldx     #<Num2  ; Store pointer to second number
 
        ldy     #<Num3  ; Store pointer to sum

        jsr     Add     ; Add them

        end     Start

Edit: Allow the code to be outside of the zero page.

Top

GARTHWILSON

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 3:46 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California

Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)? If you use a ZP data stack (yes, just like Forth does), your inputs and outputs will be on the data stack. So then it's just

Code:

JSR ADD

(This does the actual addition. It's not adding the parameters to the stack. They would already be there from previous operations, even if just a literal or a fetch.)

If you still want the inputs preserved, they'll need to be duplicated first, so you have

Code:

        JSR  2DUP
        JSR  ADD

Any desired level of equation complexity, address indirects, etc. can be done this way, with implicit parameter-passing.

The fourth page of the 6502 treatise on stacks (plural, not just the page-1 hardware stack) starts into this subject, at http://wilsonminesco.com/stacks/virtualstacks.html "Virtual stacks and various ways to implement them," and subsequent pages build on it.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

johnwbyrd

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 4:25 am

Joined: Mon May 01, 2017 7:13 am
Posts: 82

GARTHWILSON wrote:

[color=#000000]Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)?

Not merely the calling convention, but a specific demonstration of its use, so that its speed and size overhead can be fairly benchmarked against other proposals.

Recall that the goal of the exercise is not to leave a result on the stack, but to perform exactly the zero-page addition operation described.

Top

johnwbyrd

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 4:43 am

Joined: Mon May 01, 2017 7:13 am
Posts: 82

BillG clearly understands the intention of the exercise. I'm curious to see if anyone can come up with a method that works in fewer cycles.

Top

BigDumbDinosaur

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 4:47 am

Joined: Thu May 28, 2009 9:46 pm
Posts: 8176
Location: Midwestern USA

johnwbyrd wrote:

For that reason, a 65816-only solution, that may or may not use the Z accumulator, is also of interest.

There is no "Z accumulator" in the 65C816. The 65C816's registers are .A, .B, .X, .Y, DB, DP, PB, PC and SP.

_________________
x86? We ain't got no x86. We don't NEED no stinking x86!

Top

barrym95838

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 5:12 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA

johnwbyrd wrote:

... To demonstrate your calling convention, please show an efficient implementation of an addition function that:

- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.

The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.

Another win should also go to the calling convention with the smallest code size.

Thank you very much for playing with me! Looking forward to seeing your brilliancies.

I certainly don't know if I'm misunderstanding or oversimplifying, but my attempt which does what you seem to describe would be:

Code:

; add two 16-bit numbers in zp:
; x and y point to the addends, a points to the sum
; a and x are modified, y is preserved
add:
     sta  zptemp ; stash destination pointer
     clc
     lda  0,x
     adc  0,y
     pha         ; save lo result
     lda  1,x
     adc  1,y
     ldx  zptemp ; retrieve destination pointer
     sta  1,x    ; store hi result
     pla  
     sta  0,x    ; store lo result
     rts
main:
     lda  #dest
     ldx  #source1
     ldy  #source2
     jsr  add
     brk

If I count correctly (big if), 57 cycles between jmp main and brk, one cycle faster than BillG's fastest NMOS attempt.

[Edit: added "#"s to main, derp]

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 2:55 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

barrym95838 wrote:

If I count correctly (big if), 57 cycles between jmp main and brk, one cycle faster than BillG's fastest NMOS attempt.

Nicely done!

I'll see your cycle and raise (or is that lower) you another...

Code:

        org     0

P3      db      0       ; Pointer to third operand
SumLow  db      0

Num1    dw      1234    ; First number
Num2    dw      5678    ; Second number
Num3    dw      0       ; Sum

;
; (P1) + (P2) -> (P3)
;
Add
        sta     P3

        clc
 
        lda     0,Y
        adc     0,X
        sta     SumLow
 
        lda     1,Y
        adc     1,X
        ldx     P3      ; Point to sum
        sta     1,X
        lda     SumLow
        sta     0,X

        rts

;
Start
        ldy     #Num1   ; Store pointer to first number

        ldx     #Num2   ; Store pointer to second number
 
        lda     #Num3   ; Store pointer to sum

        jsr     Add     ; Add them

        end     Start

Code:

                          00011 ;
                          00012 ; (P1) + (P2) -> (P3)
                          00013 ;
 000C                     00014 Add
 000C 85 00           [3] 00015          sta    P1
 000E 84 02           [3] 00016          sty    P2
                          00017
 0010 A0 00           [2] 00018          ldy    #0
 0012 18              [2] 00019          clc
                          00020    
 0013 B1 00         [5/6] 00021          lda    (P1),Y
 0015 71 02         [5/6] 00022          adc    (P2),Y
 0017 95 00           [4] 00023          sta    0,X
                          00024    
 0019 C8              [2] 00025          iny
                          00026    
 001A B1 00         [5/6] 00027          lda    (P1),Y
 001C 71 02         [5/6] 00028          adc    (P2),Y
 001E 95 01           [4] 00029          sta    1,X
                          00030
 0020 60              [6] 00031          rts
                          00032
                          00033 ;
 0021                     00034 Start
 0021 A9 06           [2] 00035          lda    #Num1     ; Store pointer to first number
                          00036
 0023 A0 08           [2] 00037          ldy    #Num2     ; Store pointer to second number
                          00038    
 0025 A2 0A           [2] 00039          ldx    #Num3     ; Store pointer to sum
                          00040
 0027 20 000C         [6] 00041          jsr    Add       ; Add them

Σ = 58

vs.

Code:

                          00006 ; add two 16-bit numbers in zp:
                          00007 ; x and y point to the addends, a points to the sum
                          00008 ; a and x are modified, y is preserved
 0004                     00009 add:
 0004 85 00           [3] 00010      sta  zptemp ; stash destination pointer
 0006 18              [2] 00011      clc
 0007 B5 00           [4] 00012      lda  0,x
 0009 79 0000       [4/5] 00013      adc  0,y
 000C 48              [3] 00014      pha         ; save lo result
 000D B5 01           [4] 00015      lda  1,x
 000F 79 0001       [4/5] 00016      adc  1,y
 0012 A6 00           [3] 00017      ldx  zptemp ; retrieve destination pointer
 0014 95 01           [4] 00018      sta  1,x    ; store hi result
 0016 68              [4] 00019      pla 
 0017 95 00           [4] 00020      sta  0,x    ; store lo result
 0019 60              [6] 00021      rts
 001A                     00022 main:
 001A A9 01           [2] 00023      lda  #dest
 001C A2 02           [2] 00024      ldx  #source1
 001E A0 03           [2] 00025      ldy  #source2
 0020 20 0004         [6] 00026      jsr  add

Σ = 57

vs.

Code:

                          00010 ;
                          00011 ; (P1) + (P2) -> (P3)
                          00012 ;
 000A                     00013 Add
 000A 85 00           [3] 00014          sta    P3
                          00015
 000C 18              [2] 00016          clc
                          00017    
 000D B9 0000       [4/5] 00018          lda    0,Y
 0010 75 00           [4] 00019          adc    0,X
 0012 85 02           [3] 00020          sta    SumLow
                          00021    
 0014 B9 0001       [4/5] 00022          lda    1,Y
 0017 75 01           [4] 00023          adc    1,X
 0019 A6 00           [3] 00024          ldx    P3        ; Point to sum
 001B 95 01           [4] 00025          sta    1,X
 001D A5 02           [3] 00026          lda    SumLow
 001F 95 00           [4] 00027          sta    0,X
                          00028
 0021 60              [6] 00029          rts
                          00030
                          00031 ;
 0022                     00032 Start
 0022 A0 04           [2] 00033          ldy    #Num1     ; Store pointer to first number
                          00034
 0024 A2 06           [2] 00035          ldx    #Num2     ; Store pointer to second number
                          00036    
 0026 A9 08           [2] 00037          lda    #Num3     ; Store pointer to sum
                          00038
 0028 20 000A         [6] 00039          jsr    Add       ; Add them

Σ = 56

Keep the ideas coming...

Top

BillG

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 3:54 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas

GARTHWILSON wrote:

[color=#000000]Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)? If you use a ZP data stack (yes, just like Forth does), your inputs and outputs will be on the data stack. So then it's just

Code:

JSR ADD

(This does the actual addition. It's not adding the parameters to the stack. They would already be there from previous operations, even if just a literal or a fetch.)

Even though it does not meet the requirements of this competition, I think it will be interesting to see how the ADD word compares.

Can you provide that?

Edit: I am guessing that if you are using parallel stacks for low and high bytes, the code may look something like this:

Code:

ADD
        clc

        lda     StackL,X
        adc     StackL+1,X
        sta     StackL+1,X

        lda     StackH,X
        adc     StackH+1,X
        sta     StackH+1,X

        inx

        rts
:
:
        jsr     ADD

for a total of 40 cycles.

If this is used within FORTH, the cost of a trip through the inner interpreter would have to be added to the total to be fair. :wink:

Top

GARTHWILSON

Post subject: Re: Optimization competition: An efficient calling conventio

Posted: Mon Jun 22, 2020 10:22 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California

BillG wrote:

Even though it does not meet the requirements of this competition, I think it will be interesting to see how the ADD word compares.

Can you provide that?

Code:

ADD:    CLC             ; ( n1 n2  --  n1+n2 )
        LDA  0,X
        ADC  2,X
        STA  2,X
        LDA  1,X
        ADC  3,X
        STA  3,X
        INX
        INX
        RTS
 ;--------------

But after re-reading the first post, I am reminded that there's also the requirement that what you start with is not the actual numbers, but rather ZP pointers to them. So then you need to be able to fetch them:

Code:

                        ; Fetch the value at the addr pointed to by the top stack cell.
FETCH:  LDA  (0,X)      ; Get the first (ie, low) byte.                 ( addr -- n )
        PHA             ; Save it for later since we still need A & the addr at the top of the stack.
        INC  0,X        ; Increment the low byte of the address we're fetching.
        BNE  1$         ; If that turned it into 00,
        INC  1,X        ; then we have to increment the high byte too.
 1$:    LDA  (0,X)      ; Now get the second (ie, high) byte, and
                        ; PUT replaces the contents of the top stack cell.
PUT:    STA  1,X        ; Store the high byte in the high byte of the top-of-stack cell.
        PLA             ; Get the first (ie, low) byte back,
        STA  0,X        ; and store it in the low byte of the top-of-stack cell.
        RTS             ; Since you start and end with the same stack depth, you don't need INX or DEX.
 ;--------------

The code length is only a fraction as much on the 65816 though, where the 16-bit FETCH just becomes:

Code:

        LDA  (0,X)
        STA  0,X

and might as well be inlined as it takes only one more byte than a JSR, and runs twice as fast (12 cycles versus 24).

These routines are there for the calling in hundreds of other places in an application, so the amount of program memory taken, amortized, is small, unless you need something optimized badly enough to inline it and avoid the JSR-RTS overhead (which was easy to justify above in the case of FETCH on the 65816).

Quote:

Edit: I am guessing that if you are using parallel stacks for low and high bytes, the code may look something like this:

Doing it that way means you cannot use the (ZP,X) addressing mode. You could copy the two bytes to someplace where they can be together for that, but I don't think it really results in any net improvement in overall system performance.

Quote:

If this is used within FORTH, the cost of a trip through the inner interpreter would have to be added to the total to be fair. :wink:

NEXT, commonly called the "inner loop" or "inner interpreter" is not really a loop or an interpreter. Yes, it adds overhead if you're using indirect-threaded code (ITC) or direct-threaded code (DTC) Forth; but there is no NEXT in subroutine-threaded code (STC) Forth in which longer routines are called as subroutines, and the shorter ones can be inlined with no JSR-RTS overhead. The threshold between subroutine-calling and inlining depends on the desired balance between the priorities of speed and program memory size. A few of the shortest ones win in both areas, like DROP being just INX INX. But in optimization, I recently did this Forth line

Code:

R> DROP F_ISR_LIST 6 - >R

all in just two 65816 machine-language instructions, with zero NEXT overhead. (So yes, Forth compilation can be highly optimized.)

But back to the subject of calling convention:
The original post called for 16-bit numbers added to get a 16-bit sum, and one of the suggestions was putting it on the stack. Hence my suggestion to use a ZP data stack, which solves various problems of putting parameters on the hardware stack, as discussed in the 6502 stacks treatise. Two-byte stack cells will be overkill and inefficient for operations where the 6502 is only dealing with 8 bits, like with ASCII characters for example; but it sure works out easier to make all stack cells the same size. In any case, it sure makes for very small code size, as also brought up in the original post:

Quote:

Another win should also go to the calling convention with the smallest code size.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

Page 1 of 2

[ 26 posts ]

Go to page 1, 2 Next

Board index » 6502.org Users Forum » Programming

All times are UTC

Who is online

Users browsing this forum: No registered users and 16 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum