Optimization competition: An efficient calling convention
Optimization competition: An efficient calling convention
Howdy programmers,
Here's a fun little mini programming challenge that you may enjoy.
Your task is to design and implement a highly specific 6502 calling convention.
In your calling convention, all the operands must be 8-bit pointers representing zero-page addresses. The function being called, will ONLY touch zero-page memory to do whatever work it needs to do -- it will NOT touch 16-bit memory.
All functions using your calling convention will have three or fewer operands. You don't need to worry about arbitrary numbers of operands.
For example, one function might take the value that the first operand points to, subtract the value that the second operand points to, and store the result in the location that the third operand points to.
You can use all the 6502 registers, the stack, zero page, main memory, or anything else you want in your calling convention.
For example, one calling convention might be to store the operands in three consecutive zero page locations.
Another calling convention would be to store the first operand in the X register, the second operand in the Y register, and the third operand in a fixed zero page location.
Another might be to store all the operands on the stack.
To demonstrate your calling convention, please show an efficient implementation of an addition function that:
- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.
The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.
Another win should also go to the calling convention with the smallest code size.
Thank you very much for playing with me! Looking forward to seeing your brilliancies.
Here's a fun little mini programming challenge that you may enjoy.
Your task is to design and implement a highly specific 6502 calling convention.
In your calling convention, all the operands must be 8-bit pointers representing zero-page addresses. The function being called, will ONLY touch zero-page memory to do whatever work it needs to do -- it will NOT touch 16-bit memory.
All functions using your calling convention will have three or fewer operands. You don't need to worry about arbitrary numbers of operands.
For example, one function might take the value that the first operand points to, subtract the value that the second operand points to, and store the result in the location that the third operand points to.
You can use all the 6502 registers, the stack, zero page, main memory, or anything else you want in your calling convention.
For example, one calling convention might be to store the operands in three consecutive zero page locations.
Another calling convention would be to store the first operand in the X register, the second operand in the Y register, and the third operand in a fixed zero page location.
Another might be to store all the operands on the stack.
To demonstrate your calling convention, please show an efficient implementation of an addition function that:
- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.
The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.
Another win should also go to the calling convention with the smallest code size.
Thank you very much for playing with me! Looking forward to seeing your brilliancies.
Re: Optimization competition: An efficient calling conventio
Ground rule questions:
6502 instructions only or are 65C02 instructions allowed?
Is self-modifying code allowed?
6502 instructions only or are 65C02 instructions allowed?
Is self-modifying code allowed?
Re: Optimization competition: An efficient calling conventio
BillG wrote:
6502 instructions only or are 65C02 instructions allowed?
For that reason, a 65CE02-only solution, that may or may not use the Z accumulator, is also of interest.
However, a winning entry must show that you can do it cleanly, via documented instructions only, on the original 6502 metal, as per http://archive.6502.org/books/mcs6500_f ... manual.pdf .
BillG wrote:
Is self-modifying code allowed?
Any other rules of engagement questions?
Last edited by johnwbyrd on Mon Jun 22, 2020 10:01 pm, edited 1 time in total.
Re: Optimization competition: An efficient calling conventio
This is my entry for fastest NMOS 6502:
Edit: Made faster.
Code: Select all
org 0
P1 dw 0 ; Pointer to first operand
P2 dw 0 ; Pointer to second operand
Num1 dw 1234 ; First number
Num2 dw 5678 ; Second number
Num3 dw 0 ; Sum
;
; (P1) + (P2) -> (P3)
;
Add
sta P1
sty P2
ldy #0
clc
lda (P1),Y
adc (P2),Y
sta 0,X
iny
lda (P1),Y
adc (P2),Y
sta 1,X
rts
;
Start
lda #<Num1 ; Store pointer to first number
ldy #<Num2 ; Store pointer to second number
ldx #<Num3 ; Store pointer to sum
jsr Add ; Add them
end Start
Last edited by BillG on Mon Jun 22, 2020 1:58 am, edited 1 time in total.
Re: Optimization competition: An efficient calling conventio
This is my entry for fastest 65C02:
Edit: Made faster.
Code: Select all
org 0
P1 dw 0 ; Pointer to first operand
P2 dw 0 ; Pointer to second operand
Num1 dw 1234 ; First number
Num2 dw 5678 ; Second number
Num3 dw 0 ; Sum
;
; (P1) + (P2) -> (P3)
;
Add
sta P1
sty P2
ldy #1
clc
lda (P1)
adc (P2)
sta 0,X
lda (P1),Y
adc (P2),Y
sta 1,X
rts
;
Start
lda #<Num1 ; Store pointer to first number
ldy #<Num2 ; Store pointer to second number
ldx #<Num3 ; Store pointer to sum
jsr Add ; Add them
end Start
Last edited by BillG on Mon Jun 22, 2020 2:10 am, edited 1 time in total.
Re: Optimization competition: An efficient calling conventio
This is my entry for smallest NMOS 6502:
Edit: revised to benefit Add being called more than once.
Further edit: Made smaller.
Code: Select all
org 0
P1 dw 0 ; Pointer to first operand
P2 dw 0 ; Pointer to second operand
Num1 db >1234 ; First number (big endian)
db <1234
Num2 db >5678 ; Second number (big endian)
db <5678
Num3 dw 0 ; Sum (big endian)
;
; (P1) + (P2) -> (P3)
;
Add
sta P1
sty P2
ldy #1
clc
Loop
lda (P1),Y
adc (P2),Y
sta 0,X
dex
dey
bne Loop
rts
;
Start
lda #<Num1 ; Store pointer to first number
ldy #<Num2 ; Store pointer to second number
ldx #<Num3+1; Store pointer to sum (Note the +1)
jsr Add ; Add them
end Start
Further edit: Made smaller.
Last edited by BillG on Mon Jun 22, 2020 2:32 am, edited 1 time in total.
Re: Optimization competition: An efficient calling conventio
This is my entry for smallest self-modifying 6502:
Edit: Allow the code to be outside of the zero page.
Code: Select all
org 0
Num1 db >1234 ; First number (big endian)
db <1234
Num2 db >5678 ; Second number (big endian)
db <5678
Num3 dw 0 ; Sum (big endian)
;
; (P1) + (P2) -> (P3)
;
Add
sta P1
stx P2
sty P3
ldx #1
clc
Loop
P1 equ *+1
lda 0,X
P2 equ *+1
adc 0,X
P3 equ *+1
sta 0,X
dex
bne Loop
rts
;
Start
lda #<Num1 ; Store pointer to first number
ldx #<Num2 ; Store pointer to second number
ldy #<Num3 ; Store pointer to sum
jsr Add ; Add them
end Start
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Optimization competition: An efficient calling conventio
Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)? If you use a ZP data stack (yes, just like Forth does), your inputs and outputs will be on the data stack. So then it's just
(This does the actual addition. It's not adding the parameters to the stack. They would already be there from previous operations, even if just a literal or a fetch.)
If you still want the inputs preserved, they'll need to be duplicated first, so you have
Any desired level of equation complexity, address indirects, etc. can be done this way, with implicit parameter-passing.
The fourth page of the 6502 treatise on stacks (plural, not just the page-1 hardware stack) starts into this subject, at http://wilsonminesco.com/stacks/virtualstacks.html "Virtual stacks and various ways to implement them," and subsequent pages build on it.
Code: Select all
JSR ADD(This does the actual addition. It's not adding the parameters to the stack. They would already be there from previous operations, even if just a literal or a fetch.)
If you still want the inputs preserved, they'll need to be duplicated first, so you have
Code: Select all
JSR 2DUP
JSR ADDAny desired level of equation complexity, address indirects, etc. can be done this way, with implicit parameter-passing.
The fourth page of the 6502 treatise on stacks (plural, not just the page-1 hardware stack) starts into this subject, at http://wilsonminesco.com/stacks/virtualstacks.html "Virtual stacks and various ways to implement them," and subsequent pages build on it.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Optimization competition: An efficient calling conventio
GARTHWILSON wrote:
Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)?
Not merely the calling convention, but a specific demonstration of its use, so that its speed and size overhead can be fairly benchmarked against other proposals.
Recall that the goal of the exercise is not to leave a result on the stack, but to perform exactly the zero-page addition operation described.
Re: Optimization competition: An efficient calling conventio
BillG clearly understands the intention of the exercise. I'm curious to see if anyone can come up with a method that works in fewer cycles.
- BigDumbDinosaur
- Posts: 9425
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: Optimization competition: An efficient calling conventio
johnwbyrd wrote:
For that reason, a 65816-only solution, that may or may not use the Z accumulator, is also of interest.
x86? We ain't got no x86. We don't NEED no stinking x86!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Optimization competition: An efficient calling conventio
johnwbyrd wrote:
... To demonstrate your calling convention, please show an efficient implementation of an addition function that:
- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.
The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.
Another win should also go to the calling convention with the smallest code size.
Thank you very much for playing with me! Looking forward to seeing your brilliancies.
- adds a 16-bit value at a zero-page location pointed to by the first operand;
- to another 16-bit value at another zero-page location pointed to by the second operand;
- and stores the result at a third zero-page location pointed to by the third operand.
The winning proposal is the calling convention that sets up arbitrary operands, and completes the addition function, in the fewest clock cycles.
Another win should also go to the calling convention with the smallest code size.
Thank you very much for playing with me! Looking forward to seeing your brilliancies.
Code: Select all
; add two 16-bit numbers in zp:
; x and y point to the addends, a points to the sum
; a and x are modified, y is preserved
add:
sta zptemp ; stash destination pointer
clc
lda 0,x
adc 0,y
pha ; save lo result
lda 1,x
adc 1,y
ldx zptemp ; retrieve destination pointer
sta 1,x ; store hi result
pla
sta 0,x ; store lo result
rts
main:
lda #dest
ldx #source1
ldy #source2
jsr add
brk[Edit: added "#"s to main, derp]
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
Re: Optimization competition: An efficient calling conventio
barrym95838 wrote:
If I count correctly (big if), 57 cycles between jmp main and brk, one cycle faster than BillG's fastest NMOS attempt.
I'll see your cycle and raise (or is that lower) you another...
Code: Select all
org 0
P3 db 0 ; Pointer to third operand
SumLow db 0
Num1 dw 1234 ; First number
Num2 dw 5678 ; Second number
Num3 dw 0 ; Sum
;
; (P1) + (P2) -> (P3)
;
Add
sta P3
clc
lda 0,Y
adc 0,X
sta SumLow
lda 1,Y
adc 1,X
ldx P3 ; Point to sum
sta 1,X
lda SumLow
sta 0,X
rts
;
Start
ldy #Num1 ; Store pointer to first number
ldx #Num2 ; Store pointer to second number
lda #Num3 ; Store pointer to sum
jsr Add ; Add them
end Start
Code: Select all
00011 ;
00012 ; (P1) + (P2) -> (P3)
00013 ;
000C 00014 Add
000C 85 00 [3] 00015 sta P1
000E 84 02 [3] 00016 sty P2
00017
0010 A0 00 [2] 00018 ldy #0
0012 18 [2] 00019 clc
00020
0013 B1 00 [5/6] 00021 lda (P1),Y
0015 71 02 [5/6] 00022 adc (P2),Y
0017 95 00 [4] 00023 sta 0,X
00024
0019 C8 [2] 00025 iny
00026
001A B1 00 [5/6] 00027 lda (P1),Y
001C 71 02 [5/6] 00028 adc (P2),Y
001E 95 01 [4] 00029 sta 1,X
00030
0020 60 [6] 00031 rts
00032
00033 ;
0021 00034 Start
0021 A9 06 [2] 00035 lda #Num1 ; Store pointer to first number
00036
0023 A0 08 [2] 00037 ldy #Num2 ; Store pointer to second number
00038
0025 A2 0A [2] 00039 ldx #Num3 ; Store pointer to sum
00040
0027 20 000C [6] 00041 jsr Add ; Add them
Σ = 58
Code: Select all
00006 ; add two 16-bit numbers in zp:
00007 ; x and y point to the addends, a points to the sum
00008 ; a and x are modified, y is preserved
0004 00009 add:
0004 85 00 [3] 00010 sta zptemp ; stash destination pointer
0006 18 [2] 00011 clc
0007 B5 00 [4] 00012 lda 0,x
0009 79 0000 [4/5] 00013 adc 0,y
000C 48 [3] 00014 pha ; save lo result
000D B5 01 [4] 00015 lda 1,x
000F 79 0001 [4/5] 00016 adc 1,y
0012 A6 00 [3] 00017 ldx zptemp ; retrieve destination pointer
0014 95 01 [4] 00018 sta 1,x ; store hi result
0016 68 [4] 00019 pla
0017 95 00 [4] 00020 sta 0,x ; store lo result
0019 60 [6] 00021 rts
001A 00022 main:
001A A9 01 [2] 00023 lda #dest
001C A2 02 [2] 00024 ldx #source1
001E A0 03 [2] 00025 ldy #source2
0020 20 0004 [6] 00026 jsr add
Σ = 57
Code: Select all
00010 ;
00011 ; (P1) + (P2) -> (P3)
00012 ;
000A 00013 Add
000A 85 00 [3] 00014 sta P3
00015
000C 18 [2] 00016 clc
00017
000D B9 0000 [4/5] 00018 lda 0,Y
0010 75 00 [4] 00019 adc 0,X
0012 85 02 [3] 00020 sta SumLow
00021
0014 B9 0001 [4/5] 00022 lda 1,Y
0017 75 01 [4] 00023 adc 1,X
0019 A6 00 [3] 00024 ldx P3 ; Point to sum
001B 95 01 [4] 00025 sta 1,X
001D A5 02 [3] 00026 lda SumLow
001F 95 00 [4] 00027 sta 0,X
00028
0021 60 [6] 00029 rts
00030
00031 ;
0022 00032 Start
0022 A0 04 [2] 00033 ldy #Num1 ; Store pointer to first number
00034
0024 A2 06 [2] 00035 ldx #Num2 ; Store pointer to second number
00036
0026 A9 08 [2] 00037 lda #Num3 ; Store pointer to sum
00038
0028 20 000A [6] 00039 jsr Add ; Add them
Σ = 56
Re: Optimization competition: An efficient calling conventio
GARTHWILSON wrote:
Calling convention (as opposed to a complete source code without necessarily having a particular calling convention)? If you use a ZP data stack (yes, just like Forth does), your inputs and outputs will be on the data stack. So then it's just
(This does the actual addition. It's not adding the parameters to the stack. They would already be there from previous operations, even if just a literal or a fetch.)
Code: Select all
JSR ADDEven though it does not meet the requirements of this competition, I think it will be interesting to see how the ADD word compares.
Can you provide that?
Edit: I am guessing that if you are using parallel stacks for low and high bytes, the code may look something like this:
Code: Select all
ADD
clc
lda StackL,X
adc StackL+1,X
sta StackL+1,X
lda StackH,X
adc StackH+1,X
sta StackH+1,X
inx
rts
:
:
jsr ADD
If this is used within FORTH, the cost of a trip through the inner interpreter would have to be added to the total to be fair.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Optimization competition: An efficient calling conventio
BillG wrote:
Even though it does not meet the requirements of this competition, I think it will be interesting to see how the ADD word compares.
Can you provide that?
Can you provide that?
Code: Select all
ADD: CLC ; ( n1 n2 -- n1+n2 )
LDA 0,X
ADC 2,X
STA 2,X
LDA 1,X
ADC 3,X
STA 3,X
INX
INX
RTS
;--------------But after re-reading the first post, I am reminded that there's also the requirement that what you start with is not the actual numbers, but rather ZP pointers to them. So then you need to be able to fetch them:
Code: Select all
; Fetch the value at the addr pointed to by the top stack cell.
FETCH: LDA (0,X) ; Get the first (ie, low) byte. ( addr -- n )
PHA ; Save it for later since we still need A & the addr at the top of the stack.
INC 0,X ; Increment the low byte of the address we're fetching.
BNE 1$ ; If that turned it into 00,
INC 1,X ; then we have to increment the high byte too.
1$: LDA (0,X) ; Now get the second (ie, high) byte, and
; PUT replaces the contents of the top stack cell.
PUT: STA 1,X ; Store the high byte in the high byte of the top-of-stack cell.
PLA ; Get the first (ie, low) byte back,
STA 0,X ; and store it in the low byte of the top-of-stack cell.
RTS ; Since you start and end with the same stack depth, you don't need INX or DEX.
;--------------The code length is only a fraction as much on the 65816 though, where the 16-bit FETCH just becomes:
Code: Select all
LDA (0,X)
STA 0,Xand might as well be inlined as it takes only one more byte than a JSR, and runs twice as fast (12 cycles versus 24).
These routines are there for the calling in hundreds of other places in an application, so the amount of program memory taken, amortized, is small, unless you need something optimized badly enough to inline it and avoid the JSR-RTS overhead (which was easy to justify above in the case of FETCH on the 65816).
Quote:
Edit: I am guessing that if you are using parallel stacks for low and high bytes, the code may look something like this:
Doing it that way means you cannot use the (ZP,X) addressing mode. You could copy the two bytes to someplace where they can be together for that, but I don't think it really results in any net improvement in overall system performance.
Quote:
If this is used within FORTH, the cost of a trip through the inner interpreter would have to be added to the total to be fair. 
NEXT, commonly called the "inner loop" or "inner interpreter" is not really a loop or an interpreter. Yes, it adds overhead if you're using indirect-threaded code (ITC) or direct-threaded code (DTC) Forth; but there is no NEXT in subroutine-threaded code (STC) Forth in which longer routines are called as subroutines, and the shorter ones can be inlined with no JSR-RTS overhead. The threshold between subroutine-calling and inlining depends on the desired balance between the priorities of speed and program memory size. A few of the shortest ones win in both areas, like DROP being just INX INX. But in optimization, I recently did this Forth line
Code: Select all
R> DROP F_ISR_LIST 6 - >RBut back to the subject of calling convention:
The original post called for 16-bit numbers added to get a 16-bit sum, and one of the suggestions was putting it on the stack. Hence my suggestion to use a ZP data stack, which solves various problems of putting parameters on the hardware stack, as discussed in the 6502 stacks treatise. Two-byte stack cells will be overkill and inefficient for operations where the 6502 is only dealing with 8 bits, like with ASCII characters for example; but it sure works out easier to make all stack cells the same size. In any case, it sure makes for very small code size, as also brought up in the original post:
Quote:
Another win should also go to the calling convention with the smallest code size.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?