M65C02A Core - Page 9 - 6502.org

M65C02A Core

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Re: M65C02A Core

Post by MichaelM » Fri Sep 21, 2018 10:24 pm

I have almost achieved my goals for the M65C02A soft-core processor.

I have a model for the Py65 environment that implements the desired features. I am beginning the task of testing the 2013 distinct instruction encodings, which consist of the base instruction set of the Rockwell 65C02 plus 8 prefix instructions, 24 Forth-oriented instructions, five stack-related instructions, and a block move instruction. Remaining to implement are a co-processor and instructions to support fixed-point multiplies / divides.

In recent weeks, I've been focused on the Pascal Compiler and Assembler for the M65C02A core. I developed a generator for the instruction tables that the assembler uses to generate the object file. That instruction table generator outputs 2013 distinct encodings for 142 distinct instructions that support up to 49 separate addressing modes. If the various accumulator and implicit addressing modes are treated as a single addressing mode, then there are 46 distinct addressing modes.

Many of the addressing modes may not be particularly useful, but they are a result of the way I've built the logic in the M65C02A to support swapping index registers and the accumulator. For example, applying the OAY and the OSX+SIZ+IND (OIS) prefix instructions to the SBC (zp),Y instruction yields the SBC.yw ((zp,S)),A instruction. This instruction performs a double indirection post-indexed by A using the address found on the top of the stack, and performs a 16-bit subtraction using the Y register as both the left source operand and the destination. So far, I've not been able to come up with an algorithm that would require such contortions, but one can never tell. However, the instruction is the natural result of applying two of the 8 prefix instructions: OAY (0xFB) - swap A and Y, and OIS (0xDB) - apply stack-relative addressing plus promote the operation to 16 bits and add indirection.
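
As a minimal sketch of how the prefix bytes stack in front of a base opcode, the following Python fragment rebuilds two encodings that appear in the listings in this thread. The OAY and OIS values come from the paragraph above, SIZ (0xAB) comes from the Py65 disassembly later in the thread, and 0xF1 / 0xE9 are the standard 65C02 opcodes for SBC (zp),Y and SBC #imm. The `encode` helper and its table are illustrative names only, not part of the actual assembler.

```python
# Prefix opcodes as quoted in this thread (the table itself is illustrative).
PREFIX_OPCODES = {
    "OAY": 0xFB,  # swap the roles of A and Y
    "OIS": 0xDB,  # stack-relative + 16-bit operation + extra indirection
    "SIZ": 0xAB,  # promote the operation to 16 bits
}

SBC_ZP_IND_Y = 0xF1  # standard 65C02 opcode for SBC (zp),Y
SBC_IMM = 0xE9       # standard 65C02 opcode for SBC #imm


def encode(prefixes, base_opcode, operand=()):
    """Return the prefix bytes, then the base opcode, then the operand bytes."""
    prefix_bytes = bytes(PREFIX_OPCODES[name] for name in prefixes)
    return prefix_bytes + bytes([base_opcode]) + bytes(operand)


# sbc.yw ((1,S)),A  ->  OAY, OIS, SBC (zp),Y, operand 0x01
print(encode(["OAY", "OIS"], SBC_ZP_IND_Y, [0x01]).hex().upper())  # FBDBF101

# sbc.w #0x0442     ->  SIZ, SBC #imm, little-endian 16-bit operand
print(encode(["SIZ"], SBC_IMM, [0x42, 0x04]).hex().upper())        # ABE94204
```

Both results match the object code shown in the listings (FBDBF101 for `sbc.yw ((1,S)),A` and ABE94204 for `sbc.w #_bss_start`).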

I have cleaned up the Pascal Compiler and will now be starting its verification using the assembler and the Py65 processor model. Instead of writing the run-time library at this time, I am going to take a shortcut and change all subroutine calls to run-time library elements into COP (co-processor) instructions. These instructions will be trapped by the Py65 processor model and simulated in that environment. Once all of the instructions and logic of the compiler are verified, I'll move forward on the run-time library. At the present time, I am leaning toward implementing a co-processor to handle integer (16-bit) multiplies and divides as well as the basic floating-point operations: fadd, fsub, fmul, and fdiv. I may also support type conversions using the co-processor: ftoi, itof, etc.
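
One way to picture the COP shortcut is a dispatch table inside the simulator that maps a service number to a Python function. The sketch below is only that: the service numbers, the `trap_cop` name, and the calling convention are all hypothetical, since the post does not describe how the COP operands are assigned or how Py65 is hooked.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def imul(a, b):
    """16-bit signed multiply, result truncated to 16 bits."""
    return (to_signed16(a) * to_signed16(b)) & 0xFFFF


def idiv(a, b):
    """16-bit signed divide, truncating toward zero like Pascal's DIV."""
    a, b = to_signed16(a), to_signed16(b)
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return quotient & 0xFFFF


# Hypothetical service numbers; the real assignment is not given in the post.
COP_SERVICES = {0: imul, 1: idiv}


def trap_cop(service, a, b):
    """Simulate a trapped COP instruction by running the matching service."""
    return COP_SERVICES[service](a, b)


print(trap_cop(1, 1000, 2))  # 500, as in "limit := max DIV 2"
print(trap_cop(0, 2, 7))     # 14
```

The point of the table is that the compiler's output can be exercised before any run-time library exists; swapping a service for real M65C02A code later does not change the call sites.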

The following is a print listing from the most recent assembler for one of the Pascal Compiler's test programs:

(   1)                  ; ;    1: PROGRAM eratosthenes(output);
(   2)                  ; 	.stk 1024
(   3)                  ; 	.cod 512
(   4)                  ; STATIC_LINK .equ +5
(   5)                  ; RETURN_VALUE .equ -3
(   6)                  ; HIGH_RETURN_VALUE .equ -1
(   7)                  ; _start
(   8) 0200 ABBA        ; 	tsx.w
(   9) 0202 CBA21D10    ; 	lds.w #_stk_top
(  10) 0206 9C4204      ; 	stz _bss_start
(  11) 0209 ABA24204    ; 	ldx.w #_bss_start
(  12) 020D ABA04304    ; 	ldy.w #_bss_start+1
(  13) 0211 ABA91D10    ; 	lda.w #_stk_top
(  14) 0215 38          ; 	sec
(  15) 0216 ABE94204    ; 	sbc.w #_bss_start
(  16) 021A 540F        ; 	mov #15
(  17) 021C 4C1F02      ; 	jmp _pc65_main
(  18)                  ; ;    2: 
(  19)                  ; ;    3: CONST
(  20)                  ; ;    4:     max = 1000;
(  21)                  ; ;    5: 
(  22)                  ; ;    6: VAR
(  23)                  ; ;    7:     sieve : ARRAY [1..max] OF BOOLEAN;
(  24)                  ; ;    8:     i, j, limit, prime, factor : INTEGER;
(  25)                  ; ;    9: 
(  26)                  ; ;   10: BEGIN
(  27)                  ; _pc65_main .sub
(  28) 021F ABDA        ; 	phx.w
(  29) 0221 ABBA        ; 	tsx.w
(  30)                  ; ;   11:     limit := max DIV 2;
(  31) 0223 ABE2E803    ; 	psh.w #1000
(  32)                  ; 	pha.w
(  33) 0227 ABE20200    ; 	psh.w #2
(  34)                  ; 	pha.w
(  35) 022B 20FFFF      ; 	jsr _idiv
(  36) 022E C204        ; 	adj #4
(  37) 0230 AB8D170C    ; 	sta.w limit_005
(  38)                  ; ;   12:     sieve[1] := FALSE;
(  39) 0234 ABE24304    ; 	psh.w #sieve_002
(  40) 0238 A901        ; 	lda #1
(  41) 023A AB3A        ; 	dec.w a
(  42) 023C AB0A        ; 	asl.w a
(  43) 023E 18          ; 	clc
(  44) 023F CB7501      ; 	adc.w 1,S
(  45) 0242 CB9501      ; 	sta.w 1,S
(  46) 0245 A900        ; 	lda #0
(  47) 0247 6B          ; 	pli
(  48) 0248 AB8300      ; 	sta.w 0,I++
(  49)                  ; ;   13: 
(  50)                  ; ;   14:     FOR i := 2 TO max DO
(  51) 024B A902        ; 	lda #2
(  52) 024D AB8D130C    ; 	sta.w i_003
(  53)                  ; L_008
(  54) 0251 ABA9E803    ; 	lda.w #1000
(  55) 0255 ABCD130C    ; 	cmp.w i_003
(  56) 0259 AB5003      ; 	bge L_009
(  57) 025C 4C7F02      ; 	jmp L_010
(  58)                  ; L_009
(  59)                  ; ;   15:         sieve[i] := TRUE;
(  60) 025F ABE24304    ; 	psh.w #sieve_002
(  61) 0263 ABAD130C    ; 	lda.w i_003
(  62) 0267 AB3A        ; 	dec.w a
(  63) 0269 AB0A        ; 	asl.w a
(  64) 026B 18          ; 	clc
(  65) 026C CB7501      ; 	adc.w 1,S
(  66) 026F CB9501      ; 	sta.w 1,S
(  67) 0272 A901        ; 	lda #1
(  68) 0274 6B          ; 	pli
(  69) 0275 AB8300      ; 	sta.w 0,I++
(  70) 0278 ABEE130C    ; 	inc.w i_003
(  71) 027C 4C5102      ; 	jmp L_008
(  72)                  ; L_010
(  73) 027F ABCE130C    ; 	dec.w i_003
(  74)                  ; ;   16: 
(  75)                  ; ;   17:     prime := 1;
(  76) 0283 A901        ; 	lda #1
(  77) 0285 AB8D190C    ; 	sta.w prime_006
(  78)                  ; ;   18: 
(  79)                  ; ;   19:     REPEAT
(  80)                  ; L_011
(  81)                  ; ;   20:         prime := prime + 1;
(  82) 0289 ABAD190C    ; 	lda.w prime_006
(  83) 028D AB48        ; 	pha.w
(  84) 028F A901        ; 	lda #1
(  85) 0291 18          ; 	clc
(  86) 0292 CB7501      ; 	adc.w 1,S
(  87) 0295 C202        ; 	adj #2
(  88) 0297 AB8D190C    ; 	sta.w prime_006
(  89)                  ; ;   21:         WHILE NOT sieve[prime] DO
(  90)                  ; L_013
(  91) 029B ABE24304    ; 	psh.w #sieve_002
(  92) 029F ABAD190C    ; 	lda.w prime_006
(  93) 02A3 AB3A        ; 	dec.w a
(  94) 02A5 AB0A        ; 	asl.w a
(  95) 02A7 18          ; 	clc
(  96) 02A8 CB7501      ; 	adc.w 1,S
(  97) 02AB CB9501      ; 	sta.w 1,S
(  98) 02AE 6B          ; 	pli
(  99) 02AF ABA300      ; 	lda.w 0,I++
( 100) 02B2 4901        ; 	eor #1
( 101) 02B4 ABC90100    ; 	cmp.w #1
( 102) 02B8 F003        ; 	beq L_014
( 103) 02BA 4CD202      ; 	jmp L_015
( 104)                  ; L_014
( 105)                  ; ;   22:             prime := prime + 1;
( 106) 02BD ABAD190C    ; 	lda.w prime_006
( 107) 02C1 AB48        ; 	pha.w
( 108) 02C3 A901        ; 	lda #1
( 109) 02C5 18          ; 	clc
( 110) 02C6 CB7501      ; 	adc.w 1,S
( 111) 02C9 C202        ; 	adj #2
( 112) 02CB AB8D190C    ; 	sta.w prime_006
( 113) 02CF 4C9B02      ; 	jmp L_013
( 114)                  ; L_015
( 115)                  ; ;   23: 
( 116)                  ; ;   24:         factor := 2*prime;
( 117) 02D2 ABE20200    ; 	psh.w #2
( 118)                  ; 	pha.w
( 119) 02D6 ABAD190C    ; 	lda.w prime_006
( 120) 02DA AB48        ; 	pha.w
( 121) 02DC 20FFFF      ; 	jsr _imul
( 122) 02DF C204        ; 	adj #4
( 123) 02E1 AB8D1B0C    ; 	sta.w factor_007
( 124)                  ; ;   25: 
( 125)                  ; ;   26:         WHILE factor <= max DO BEGIN
( 126)                  ; L_016
( 127) 02E5 ABAD1B0C    ; 	lda.w factor_007
( 128) 02E9 AB48        ; 	pha.w
( 129) 02EB ABA9E803    ; 	lda.w #1000
( 130) 02EF CB4401      ; 	xma.w 1,S
( 131) 02F2 CBD501      ; 	cmp.w 1,S
( 132) 02F5 C202        ; 	adj #2
( 133) 02F7 08          ; 	php
( 134) 02F8 A901        ; 	lda #1
( 135) 02FA 28          ; 	plp
( 136) 02FB AB3002      ; 	ble L_019
( 137) 02FE A900        ; 	lda #0
( 138)                  ; L_019
( 139) 0300 ABC90100    ; 	cmp.w #1
( 140) 0304 F003        ; 	beq L_017
( 141) 0306 4C3903      ; 	jmp L_018
( 142)                  ; L_017
( 143)                  ; ;   27:             sieve[factor] := FALSE;
( 144) 0309 ABE24304    ; 	psh.w #sieve_002
( 145) 030D ABAD1B0C    ; 	lda.w factor_007
( 146) 0311 AB3A        ; 	dec.w a
( 147) 0313 AB0A        ; 	asl.w a
( 148) 0315 18          ; 	clc
( 149) 0316 CB7501      ; 	adc.w 1,S
( 150) 0319 CB9501      ; 	sta.w 1,S
( 151) 031C A900        ; 	lda #0
( 152) 031E 6B          ; 	pli
( 153) 031F AB8300      ; 	sta.w 0,I++
( 154)                  ; ;   28:             factor := factor + prime;
( 155) 0322 ABAD1B0C    ; 	lda.w factor_007
( 156) 0326 AB48        ; 	pha.w
( 157) 0328 ABAD190C    ; 	lda.w prime_006
( 158) 032C 18          ; 	clc
( 159) 032D CB7501      ; 	adc.w 1,S
( 160) 0330 C202        ; 	adj #2
( 161) 0332 AB8D1B0C    ; 	sta.w factor_007
( 162)                  ; ;   29:         END
( 163)                  ; ;   30:     UNTIL prime > limit;
( 164) 0336 4CE502      ; 	jmp L_016
( 165)                  ; L_018
( 166) 0339 ABAD190C    ; 	lda.w prime_006
( 167) 033D AB48        ; 	pha.w
( 168) 033F ABAD170C    ; 	lda.w limit_005
( 169) 0343 CB4401      ; 	xma.w 1,S
( 170) 0346 CBD501      ; 	cmp.w 1,S
( 171) 0349 C202        ; 	adj #2
( 172) 034B 08          ; 	php
( 173) 034C A901        ; 	lda #1
( 174) 034E 28          ; 	plp
( 175) 034F AB1002      ; 	bgt L_020
( 176) 0352 A900        ; 	lda #0
( 177)                  ; L_020
( 178) 0354 ABC90100    ; 	cmp.w #1
( 179) 0358 F003        ; 	beq L_012
( 180) 035A 4C8902      ; 	jmp L_011
( 181)                  ; L_012
( 182)                  ; ;   31: 
( 183)                  ; ;   32:     writeln('Sieve of Eratosthenes');
( 184) 035D ABE22D04    ; 	psh.w #S_021
( 185) 0361 ABE20000    ; 	psh.w #0
( 186) 0365 ABE21500    ; 	psh.w #21
( 187) 0369 20FFFF      ; 	jsr _swrite
( 188) 036C C206        ; 	adj #6
( 189) 036E 20FFFF      ; 	jsr _writeln
( 190)                  ; ;   33:     writeln;
( 191) 0371 20FFFF      ; 	jsr _writeln
( 192)                  ; ;   34: 
( 193)                  ; ;   35:     i := 1;
( 194) 0374 A901        ; 	lda #1
( 195) 0376 AB8D130C    ; 	sta.w i_003
( 196)                  ; ;   36:     REPEAT
( 197)                  ; L_022
( 198)                  ; ;   37:         FOR j := 0 TO 19 DO BEGIN
( 199) 037A A900        ; 	lda #0
( 200) 037C AB8D150C    ; 	sta.w j_004
( 201)                  ; L_024
( 202) 0380 A913        ; 	lda #19
( 203) 0382 ABCD150C    ; 	cmp.w j_004
( 204) 0386 AB5003      ; 	bge L_025
( 205) 0389 4CEA03      ; 	jmp L_026
( 206)                  ; L_025
( 207)                  ; ;   38:             prime := i + j;
( 208) 038C ABAD130C    ; 	lda.w i_003
( 209) 0390 AB48        ; 	pha.w
( 210) 0392 ABAD150C    ; 	lda.w j_004
( 211) 0396 18          ; 	clc
( 212) 0397 CB7501      ; 	adc.w 1,S
( 213) 039A C202        ; 	adj #2
( 214) 039C AB8D190C    ; 	sta.w prime_006
( 215)                  ; ;   39:             IF sieve[prime] THEN
( 216) 03A0 ABE24304    ; 	psh.w #sieve_002
( 217) 03A4 ABAD190C    ; 	lda.w prime_006
( 218) 03A8 AB3A        ; 	dec.w a
( 219) 03AA AB0A        ; 	asl.w a
( 220) 03AC 18          ; 	clc
( 221) 03AD CB7501      ; 	adc.w 1,S
( 222) 03B0 CB9501      ; 	sta.w 1,S
( 223) 03B3 6B          ; 	pli
( 224) 03B4 ABA300      ; 	lda.w 0,I++
( 225) 03B7 ABC90100    ; 	cmp.w #1
( 226) 03BB F003        ; 	beq L_027
( 227) 03BD 4CD203      ; 	jmp L_028
( 228)                  ; L_027
( 229)                  ; ;   40:                 write(prime:3)
( 230) 03C0 ABAD190C    ; 	lda.w prime_006
( 231) 03C4 AB48        ; 	pha.w
( 232) 03C6 ABE20300    ; 	psh.w #3
( 233)                  ; 	pha.w
( 234) 03CA 20FFFF      ; 	jsr _iwrite
( 235) 03CD C204        ; 	adj #4
( 236)                  ; ;   41:             ELSE
( 237) 03CF 4CE303      ; 	jmp L_029
( 238)                  ; L_028
( 239)                  ; ;   42:                 write('   ');
( 240) 03D2 ABE22A04    ; 	psh.w #S_030
( 241) 03D6 ABE20000    ; 	psh.w #0
( 242) 03DA ABE20300    ; 	psh.w #3
( 243) 03DE 20FFFF      ; 	jsr _swrite
( 244) 03E1 C206        ; 	adj #6
( 245)                  ; L_029
( 246)                  ; ;   43:         END;
( 247) 03E3 ABEE150C    ; 	inc.w j_004
( 248) 03E7 4C8003      ; 	jmp L_024
( 249)                  ; L_026
( 250) 03EA ABCE150C    ; 	dec.w j_004
( 251)                  ; ;   44:         writeln;
( 252) 03EE 20FFFF      ; 	jsr _writeln
( 253)                  ; ;   45:         i := i + 20
( 254) 03F1 ABAD130C    ; 	lda.w i_003
( 255) 03F5 AB48        ; 	pha.w
( 256) 03F7 A914        ; 	lda #20
( 257)                  ; ;   46:     UNTIL i > max
( 258) 03F9 18          ; 	clc
( 259) 03FA CB7501      ; 	adc.w 1,S
( 260) 03FD C202        ; 	adj #2
( 261) 03FF AB8D130C    ; 	sta.w i_003
( 262) 0403 ABAD130C    ; 	lda.w i_003
( 263) 0407 AB48        ; 	pha.w
( 264)                  ; ;   47: END.
( 265) 0409 ABA9E803    ; 	lda.w #1000
( 266) 040D CB4401      ; 	xma.w 1,S
( 267) 0410 CBD501      ; 	cmp.w 1,S
( 268) 0413 C202        ; 	adj #2
( 269) 0415 08          ; 	php
( 270) 0416 A901        ; 	lda #1
( 271) 0418 28          ; 	plp
( 272) 0419 AB1002      ; 	bgt L_031
( 273) 041C A900        ; 	lda #0
( 274)                  ; L_031
( 275) 041E ABC90100    ; 	cmp.w #1
( 276) 0422 F003        ; 	beq L_023
( 277) 0424 4C7A03      ; 	jmp L_022
( 278)                  ; L_023
( 279) 0427 ABFA        ; 	plx.w
( 280) 0429 60          ; 	rts
( 281)                  ; 	.end _pc65_main
( 282)                  ; 
( 283)                  ; 	.dat
( 284)                  ; 
( 285) 042A 202020      ; S_030 .str "   "
( 286) 042D 53696576    ; S_021 .str "Sieve of Eratosthenes"
       0431 65206F6620457261746F737468656E6573
( 287) 0442 00          ; _bss_start .byt 1
( 288) 0443 00000000    ; sieve_002 .byt 2000
       0447 0000000000000000000000000000000000000000000000000000000000000000
       0467 0000000000000000000000000000000000000000000000000000000000000000
       0487 0000000000000000000000000000000000000000000000000000000000000000
       04A7 0000000000000000000000000000000000000000000000000000000000000000
       04C7 0000000000000000000000000000000000000000000000000000000000000000
       04E7 0000000000000000000000000000000000000000000000000000000000000000
       0507 0000000000000000000000000000000000000000000000000000000000000000
       0527 0000000000000000000000000000000000000000000000000000000000000000
       0547 0000000000000000000000000000000000000000000000000000000000000000
       0567 0000000000000000000000000000000000000000000000000000000000000000
       0587 0000000000000000000000000000000000000000000000000000000000000000
       05A7 0000000000000000000000000000000000000000000000000000000000000000
       05C7 0000000000000000000000000000000000000000000000000000000000000000
       05E7 0000000000000000000000000000000000000000000000000000000000000000
       0607 0000000000000000000000000000000000000000000000000000000000000000
       0627 0000000000000000000000000000000000000000000000000000000000000000
       0647 0000000000000000000000000000000000000000000000000000000000000000
       0667 0000000000000000000000000000000000000000000000000000000000000000
       0687 0000000000000000000000000000000000000000000000000000000000000000
       06A7 0000000000000000000000000000000000000000000000000000000000000000
       06C7 0000000000000000000000000000000000000000000000000000000000000000
       06E7 0000000000000000000000000000000000000000000000000000000000000000
       0707 0000000000000000000000000000000000000000000000000000000000000000
       0727 0000000000000000000000000000000000000000000000000000000000000000
       0747 0000000000000000000000000000000000000000000000000000000000000000
       0767 0000000000000000000000000000000000000000000000000000000000000000
       0787 0000000000000000000000000000000000000000000000000000000000000000
       07A7 0000000000000000000000000000000000000000000000000000000000000000
       07C7 0000000000000000000000000000000000000000000000000000000000000000
       07E7 0000000000000000000000000000000000000000000000000000000000000000
       0807 0000000000000000000000000000000000000000000000000000000000000000
       0827 0000000000000000000000000000000000000000000000000000000000000000
       0847 0000000000000000000000000000000000000000000000000000000000000000
       0867 0000000000000000000000000000000000000000000000000000000000000000
       0887 0000000000000000000000000000000000000000000000000000000000000000
       08A7 0000000000000000000000000000000000000000000000000000000000000000
       08C7 0000000000000000000000000000000000000000000000000000000000000000
       08E7 0000000000000000000000000000000000000000000000000000000000000000
       0907 0000000000000000000000000000000000000000000000000000000000000000
       0927 0000000000000000000000000000000000000000000000000000000000000000
       0947 0000000000000000000000000000000000000000000000000000000000000000
       0967 0000000000000000000000000000000000000000000000000000000000000000
       0987 0000000000000000000000000000000000000000000000000000000000000000
       09A7 0000000000000000000000000000000000000000000000000000000000000000
       09C7 0000000000000000000000000000000000000000000000000000000000000000
       09E7 0000000000000000000000000000000000000000000000000000000000000000
       0A07 0000000000000000000000000000000000000000000000000000000000000000
       0A27 0000000000000000000000000000000000000000000000000000000000000000
       0A47 0000000000000000000000000000000000000000000000000000000000000000
       0A67 0000000000000000000000000000000000000000000000000000000000000000
       0A87 0000000000000000000000000000000000000000000000000000000000000000
       0AA7 0000000000000000000000000000000000000000000000000000000000000000
       0AC7 0000000000000000000000000000000000000000000000000000000000000000
       0AE7 0000000000000000000000000000000000000000000000000000000000000000
       0B07 0000000000000000000000000000000000000000000000000000000000000000
       0B27 0000000000000000000000000000000000000000000000000000000000000000
       0B47 0000000000000000000000000000000000000000000000000000000000000000
       0B67 0000000000000000000000000000000000000000000000000000000000000000
       0B87 0000000000000000000000000000000000000000000000000000000000000000
       0BA7 0000000000000000000000000000000000000000000000000000000000000000
       0BC7 0000000000000000000000000000000000000000000000000000000000000000
       0BE7 0000000000000000000000000000000000000000000000000000000000000000
       0C07 000000000000000000000000
( 289) 0C13 0000        ; i_003 .wrd 1
( 290) 0C15 0000        ; j_004 .wrd 1
( 291) 0C17 0000        ; limit_005 .wrd 1
( 292) 0C19 0000        ; prime_006 .wrd 1
( 293) 0C1B 0000        ; factor_007 .wrd 1
( 294) 0C1D 00          ; _bss_end .byt 1
( 295) 0C1E 00000000    ; _stk .byt 1023
       0C22 0000000000000000000000000000000000000000000000000000000000000000
       0C42 0000000000000000000000000000000000000000000000000000000000000000
       0C62 0000000000000000000000000000000000000000000000000000000000000000
       0C82 0000000000000000000000000000000000000000000000000000000000000000
       0CA2 0000000000000000000000000000000000000000000000000000000000000000
       0CC2 0000000000000000000000000000000000000000000000000000000000000000
       0CE2 0000000000000000000000000000000000000000000000000000000000000000
       0D02 0000000000000000000000000000000000000000000000000000000000000000
       0D22 0000000000000000000000000000000000000000000000000000000000000000
       0D42 0000000000000000000000000000000000000000000000000000000000000000
       0D62 0000000000000000000000000000000000000000000000000000000000000000
       0D82 0000000000000000000000000000000000000000000000000000000000000000
       0DA2 0000000000000000000000000000000000000000000000000000000000000000
       0DC2 0000000000000000000000000000000000000000000000000000000000000000
       0DE2 0000000000000000000000000000000000000000000000000000000000000000
       0E02 0000000000000000000000000000000000000000000000000000000000000000
       0E22 0000000000000000000000000000000000000000000000000000000000000000
       0E42 0000000000000000000000000000000000000000000000000000000000000000
       0E62 0000000000000000000000000000000000000000000000000000000000000000
       0E82 0000000000000000000000000000000000000000000000000000000000000000
       0EA2 0000000000000000000000000000000000000000000000000000000000000000
       0EC2 0000000000000000000000000000000000000000000000000000000000000000
       0EE2 0000000000000000000000000000000000000000000000000000000000000000
       0F02 0000000000000000000000000000000000000000000000000000000000000000
       0F22 0000000000000000000000000000000000000000000000000000000000000000
       0F42 0000000000000000000000000000000000000000000000000000000000000000
       0F62 0000000000000000000000000000000000000000000000000000000000000000
       0F82 0000000000000000000000000000000000000000000000000000000000000000
       0FA2 0000000000000000000000000000000000000000000000000000000000000000
       0FC2 0000000000000000000000000000000000000000000000000000000000000000
       0FE2 0000000000000000000000000000000000000000000000000000000000000000
       1002 000000000000000000000000000000000000000000000000000000
( 296) 101D 00          ; _stk_top .byt 1
( 297)                  ; 
( 298)                  ; 	.end
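
Reading the listing, every `sieve[...]` access follows the same pattern: push the array base, subtract the lower bound from the index (`dec.w a`), scale by two (`asl.w a`), add the base, and then go through the pointer with `pli` and `0,I++`. A minimal Python sketch of that address calculation is shown below. The two-byte element size is inferred from the 2000 bytes reserved for 1000 BOOLEAN elements, and the function name is illustrative.

```python
SIEVE_BASE = 0x0443  # address of sieve_002 in the listing
ELEMENT_SIZE = 2     # inferred: 2000 bytes reserved for 1000 elements
LOWER_BOUND = 1      # sieve : ARRAY [1..max] OF BOOLEAN


def element_address(index):
    """Address of sieve[index]: base + (index - lower bound) * element size."""
    return SIEVE_BASE + (index - LOWER_BOUND) * ELEMENT_SIZE


print(hex(element_address(1)))     # 0x443
print(hex(element_address(1000)))  # 0xc11
```

The last element occupies 0x0C11-0x0C12, which is consistent with `i_003` starting at 0x0C13 in the listing.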

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Sat Sep 22, 2018 3:39 pm

Added the capability to assemble sbc.yw ((1,S)),A to the assembler, and also the capability to generate binary output files that Py65 can load and run. The following is the PyAsm65 source file used for this test in the printer format:

(   1)                  ;     .cod    512
(   2)                  ; ;
(   3)                  ; ;   Setup
(   4)                  ; ;
(   5) 0200 ABE28000    ;     psh.w #128
(   6) 0204 AB9C8000    ;     stz.w 0x80
(   7) 0208 AB9C0000    ;     stz.w 0x00
(   8) 020C ABA90002    ;     lda.w #512
(   9) 0210 ABAC0002    ;     ldy.w 0o1000
(  10) 0214 38          ;     sec
(  11)                  ; ;
(  12)                  ; ;   Instruction Under Test
(  13)                  ; ;
(  14) 0215 FBDBF101    ;     sbc.yw ((1,S)),A
(  15)                  ; ;
(  16)                  ;     .end

I've got a bit more work to do on the assembler to support the full instruction set. One issue I found last night is that, in Pass 1 of the assembler, I'm not correctly determining whether zero page or absolute addressing should / can be used for a particular instruction. I will have to work on that over the next week or so. However, it is gratifying that the Python built-in function eval() can be used to evaluate expressions for the memory operand. As can be seen in the source listing above, a combination of decimal, hexadecimal, and octal representations was used for the various operands.
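
A minimal sketch of operand evaluation with eval() follows. It assumes the operand text is already a valid Python expression and that labels are supplied through a symbol dictionary; `evaluate_operand` is an illustrative name, and how PyAsm65 actually calls eval() is not shown in this post. Emptying `__builtins__` only keeps ordinary function names out of expressions; it is not a security sandbox.

```python
def evaluate_operand(text, symbols=None):
    """Evaluate an operand expression using Python's eval().

    Builtins are removed so that only literals, operators, and the
    supplied symbols are visible to the expression.
    """
    return eval(text, {"__builtins__": {}}, dict(symbols or {}))


# Decimal, hexadecimal, and octal literals, as used in the test source above.
for source in ("128", "0x80", "512", "0o1000"):
    print(source, "->", evaluate_operand(source))

# A label expression such as "_bss_start+1" resolved through a symbol table.
print(hex(evaluate_operand("_bss_start+1", {"_bss_start": 0x0442})))  # 0x443
```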

The following is a capture of the console output for a simple test program that I assembled, loaded, and ran on Py65 to test that instruction. The load function of Py65 is used to load the test program, and then the go command is used to run the program. I modified Py65 to print out the cycle-by-cycle operations performed, and this is shown below, where IR refers to instruction fetches, rdPM refers to reads of instruction operands, rdDM refers to data memory reads, and wrDM refers to data memory writes. I also modified the Py65 cycle counting feature to distinguish between the various operations, and to provide the average CPI and instruction length for a particular program. Examining the tail of the instruction trace clearly shows that the sbc.yw ((1,S)),A instruction is four bytes long and executes in 10 cycles, with the last 6 cycles being the two 2-cycle indirections followed by the 2-cycle fetch of the 16-bit operand.
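
The data-side behaviour of the instruction under test can be sketched in a few lines of Python. This is a simplified stand-in rather than the Py65 model: it handles binary mode only, ignores the N and V flags, and uses illustrative names. The memory contents and register values are the ones established by the setup code in the test source above (0x0080 pushed on the stack, zeroed words at 0x0080 and 0x0000, A = 0x0200, and Y loaded from the program bytes at 0x0200).

```python
def read16(memory, address, reads):
    """Read a little-endian 16-bit value and log both byte addresses."""
    low, high = address & 0xFFFF, (address + 1) & 0xFFFF
    reads.extend([low, high])
    return memory.get(low, 0) | (memory.get(high, 0) << 8)


def sbc_yw_stack_ind_ind_a(memory, s, a, y, carry, offset):
    """Sketch of sbc.yw ((offset,S)),A in binary mode."""
    reads = []
    pointer = read16(memory, s + offset, reads)  # first indirection: (offset,S)
    base = read16(memory, pointer, reads)        # second indirection
    address = (base + a) & 0xFFFF                # post-index by A
    operand = read16(memory, address, reads)     # 16-bit operand fetch
    difference = y - operand - (1 - carry)       # Y is source and destination
    return address, difference & 0xFFFF, int(difference >= 0), reads


memory = {
    0x01FE: 0x80, 0x01FF: 0x00,  # psh.w #128
    0x0080: 0x00, 0x0081: 0x00,  # stz.w 0x80
    0x0200: 0xAB, 0x0201: 0xE2,  # first two program bytes
}
address, new_y, new_carry, reads = sbc_yw_stack_ind_ind_a(
    memory, s=0x01FD, a=0x0200, y=0xE2AB, carry=1, offset=1
)
print(f"{address:04X} {new_y:04X} {new_carry}")  # 0200 0000 1
print(" ".join(f"{r:04X}" for r in reads))       # 01FE 01FF 0080 0081 0200 0201
```

The six logged addresses line up with the six rdDM cycles at the end of the trace, and the zero result with carry set agrees with the final register display (YR = 0000, Z and C set).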

.load tst_sbc_zpSIIA.bin 200
Wrote +25 bytes from $0200 to $0218

          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0200 0000 0000 0000 01FF 0000 00110000
              0000 0000 0000 01FF 0000
              0000 0000 0000
.g 200
   IR: AB <= mem[0200]
   IR: E2 <= mem[0201]
 rdPM: 80 <= mem[0202]
 rdPM: 00 <= mem[0203]
 wrDM: 00 => mem[01FF]
 wrDM: 80 => mem[01FE]
   IR: AB <= mem[0204]
   IR: 9C <= mem[0205]
 rdPM: 80 <= mem[0206]
 rdPM: 00 <= mem[0207]
 wrDM: 00 => mem[0080]
 wrDM: 00 => mem[0081]
   IR: AB <= mem[0208]
   IR: 9C <= mem[0209]
 rdPM: 00 <= mem[020A]
 rdPM: 00 <= mem[020B]
 wrDM: 00 => mem[0000]
 wrDM: 00 => mem[0001]
   IR: AB <= mem[020C]
   IR: A9 <= mem[020D]
 rdPM: 00 <= mem[020E]
 rdPM: 02 <= mem[020F]
   IR: AB <= mem[0210]
   IR: AC <= mem[0211]
 rdPM: 00 <= mem[0212]
 rdPM: 02 <= mem[0213]
 rdDM: AB <= mem[0200]
 rdDM: E2 <= mem[0201]
   IR: 38 <= mem[0214]
   IR: FB <= mem[0215]
   IR: DB <= mem[0216]
   IR: F1 <= mem[0217]
 rdPM: 01 <= mem[0218]
 rdDM: 80 <= mem[01FE]
 rdDM: 00 <= mem[01FF]
 rdDM: 00 <= mem[0080]
 rdDM: 00 <= mem[0081]
 rdDM: AB <= mem[0200]
 rdDM: E2 <= mem[0201]

          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0219 0200 0000 0000 01FD 0000 00110011
              0000 0000 0000 01FF 0000
              0000 0000 0000
.cycles
Total = 39, Num Inst = 7, Pgm Rd = 25, Data Rd = 8, Data Wr = 6, Dummy Cycles = 0
  CPI = 5.57, Avg Inst Len = 3.57


          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0219 0200 0000 0000 01FD 0000 00110011
              0000 0000 0000 01FF 0000
              0000 0000 0000
.d 200:218
$0200  AB        SIZ
$0201  E2 80     PSH #$80
$0203  00        ???
$0204  AB        SIZ
$0205  9C 80 00  STZ $0080
$0208  AB        SIZ
$0209  9C 00 00  STZ $0000
$020C  AB        SIZ
$020D  A9 00     LDA #$00
$020F  02        ???
$0210  AB        SIZ
$0211  AC 00 02  LDY $0200
$0214  38        SEC
$0215  FB        OAY
$0216  DB        OIS
$0217  F1 01     SBC ($01),Y

          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0219 0200 0000 0000 01FD 0000 00110011
              0000 0000 0000 01FF 0000
              0000 0000 0000
.
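
The summary figures can be cross-checked from the trace tags. The sketch below tallies the ten trace lines of the instruction under test and then recomputes CPI and average instruction length from the totals reported by the cycles command. It assumes that "Pgm Rd" is the sum of the IR and rdPM cycles, which agrees with the counts in the trace; the `tally` helper is an illustrative name.

```python
from collections import Counter


def tally(trace_text):
    """Count trace lines by tag (IR, rdPM, rdDM, wrDM)."""
    return Counter(
        line.split(":")[0].strip()
        for line in trace_text.splitlines()
        if ":" in line
    )


SBC_TRACE = """\
   IR: FB <= mem[0215]
   IR: DB <= mem[0216]
   IR: F1 <= mem[0217]
 rdPM: 01 <= mem[0218]
 rdDM: 80 <= mem[01FE]
 rdDM: 00 <= mem[01FF]
 rdDM: 00 <= mem[0080]
 rdDM: 00 <= mem[0081]
 rdDM: AB <= mem[0200]
 rdDM: E2 <= mem[0201]
"""

counts = tally(SBC_TRACE)
instruction_length = counts["IR"] + counts["rdPM"]
instruction_cycles = sum(counts.values())
print(instruction_length, instruction_cycles)  # 4 10

# Whole-program totals as reported by the cycles command.
program_reads, data_reads, data_writes, instructions = 25, 8, 6, 7
total_cycles = program_reads + data_reads + data_writes
cpi = round(total_cycles / instructions, 2)
average_length = round(program_reads / instructions, 2)
print(total_cycles, cpi, average_length)  # 39 5.57 3.57
```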

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Wed Sep 26, 2018 11:16 pm

Well, I had an OMG moment, several actually, this morning. As I related above, I am working toward testing the Python model of the M65C02A, with the end goal of developing test vectors from this effort to wrap around the Verilog RTL implementation.

Earlier in the week I succeeded in getting Mike Naberezny's test_monitor.py unit tests for Py65's monitor.py adjusted to my M65C02A model. I got all 130 tests that he'd written running and passing correctly. I only needed to make changes to 20 tests to compensate for the different widths and number of registers that the M65C02A core has relative to a 6502/65C02 processor.

This morning I began to develop the unit tests for the 112 implicit addressing mode instructions. Some of these instructions are simple modifications, using the M65C02A prefix instructions, of well-known instructions like php/plp. After some effort, much of it simply studying the example provided by the Py65 monitor unit test functions, I succeeded in getting the kernel and user mode versions of the php/plp instructions tested and passing. This took me about 1 hour. That is certainly not a particularly fast pace, but the routines for the php.s/plp.s instructions were then completed in less than 0.5 additional hours.

The OMG moment came when I realized I needed to wrap up and get ready for work. I had been working about 1.5 hours, had written about 20 lines of Python code for the first instruction, and had copied and edited it 3 times for a total of about 80 lines of working code. There are 2006 more instructions to go. Hopefully, I can get much faster at writing these tests, or it's going to take several more years to finish this task.
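
The scale of the remaining work can be estimated with simple arithmetic on the figures above. The sketch assumes, purely for illustration, that the pace and test size stay the same as for the first four instructions (php, plp, php.s, plp.s).

```python
hours_spent = 1.5
instructions_tested = 4       # php, plp, php.s, plp.s
test_lines_written = 80
instructions_remaining = 2006

hours_per_instruction = hours_spent / instructions_tested
lines_per_instruction = test_lines_written // instructions_tested

hours_remaining = instructions_remaining * hours_per_instruction
lines_remaining = instructions_remaining * lines_per_instruction
sessions_remaining = hours_remaining / hours_spent  # 1.5-hour sessions

print(hours_remaining, lines_remaining, sessions_remaining)  # 752.25 40120 501.5
```

At an unchanged pace that is roughly 750 hours, or about 500 more 1.5-hour sessions.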

The second OMG moment: I wrote the Python code models for these instructions, and they are essentially 2 line functions. The two instructions I selected for my first effort are pretty simple: only the sec/clc, sei/cli, etc. instructions are simpler. There are about 80 lines of test code to verify the operation of what can really be characterized as 4 lines of Python code. That is a 20x expansion of test code to deliverable, working code. Granted, much of the code is copied and pasted, but virtually every line in the edited code sets up or tests a different feature/function of these instructions. More to the point, I did not fully verify the processor state. In other words, for these instructions I really only tested that the stack, stack pointers, program counter, and PSW were correct. I did not check the rest of the processor state to ensure the instruction model did not have unspecified / unintended side effects.
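
One way to catch unintended side effects without writing an assertion per register is to snapshot the whole processor state before and after the instruction and assert on the set of fields that changed. The sketch below uses a toy `ToyCPU` class and a toy `php` function as stand-ins; they are not the Py65 MPU class or the M65C02A model, and the 0x30 initial status value is simply the one shown in the monitor display.

```python
import unittest
from dataclasses import asdict, dataclass, field


@dataclass
class ToyCPU:
    """Stand-in processor state; not the Py65 or M65C02A model."""
    pc: int = 0x0200
    a: int = 0
    x: int = 0
    y: int = 0
    sp: int = 0x01FF
    p: int = 0x30
    memory: dict = field(default_factory=dict)


def php(cpu):
    """Toy PHP: push the status byte, then step past the one-byte opcode."""
    cpu.memory[cpu.sp] = cpu.p
    cpu.sp = (cpu.sp - 1) & 0xFFFF
    cpu.pc = (cpu.pc + 1) & 0xFFFF


def changed_fields(before, after):
    """Names of the state fields whose values differ between two snapshots."""
    return {name for name in before if before[name] != after[name]}


class PhpTest(unittest.TestCase):
    def setUp(self):
        # Fresh state for every test, seeded with non-zero register values
        # so that accidental clears or overwrites are visible.
        self.cpu = ToyCPU(a=0x1234, x=0x5678, y=0x9ABC)

    def test_php_changes_only_expected_state(self):
        before = asdict(self.cpu)
        php(self.cpu)
        after = asdict(self.cpu)
        self.assertEqual(after["memory"], {0x01FF: 0x30})
        self.assertEqual(after["sp"], 0x01FE)
        self.assertEqual(after["pc"], 0x0201)
        # Anything outside this set would be an unintended side effect.
        self.assertEqual(changed_fields(before, after), {"pc", "sp", "memory"})
```

The test can be run with `python -m unittest`. The final assertion is the part that covers "the rest of the processor state" in a single line.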

The third OMG moment came when I realized that this test code is almost as complicated as, if not more complicated than, the functions it was checking. I carefully followed Mike's example and essentially reestablished the processor state for each test function. This means that each test routine starts from a pristine processor state and can be called independently by the test harness in any order.

From reading Mike's test code, I know that he's been particularly careful. But I found a small bug in one of his routines. The "bug" did not affect the result of the test function, but it does mean that the particular feature he was attempting to test in that function was not actually tested there. The feature may well have been tested successfully elsewhere, but it does point out that test code is not infallible / error free. For me, this means that the 20x increase in code will likely contain errors of its own, hopefully only a few. That really drives home the point recently made to me that the test code supporting a commercial aviation flight SW certification is easily 3x or more the size (and schedule) of the code being certified for flight.

Before this effort on my part, I was somewhat skeptical of those production vs test code ratios. Perhaps I will get better and the 20x ratio I have produced so far will go down drastically, but I seriously doubt it. I will have to adopt another strategy in order to complete the task I set out for myself on this Python model. There are only a limited number of instructions and addressing modes. Their product currently stands at 2010, but in reality it's much greater than that. I have purposely restricted the tables driving the assembler to only one variation of a particular instruction. Given the number of prefix instructions, it is possible to create the same instruction using more than one base instruction and different combinations of prefix instructions; and there's also the potential to create non-optimal sequences of prefix instructions and base instructions that provide the same end result.

Michael A.

BigEd: Posts: 11464; Joined: 11 Dec 2008; Location: England

Post by BigEd » Thu Sep 27, 2018 6:10 am

Very interesting observations - hopefully not disheartening but certainly a calibration. I'm not skilled in the art of processor verification but I know it is a specialism, and I know there's scope for automation and using lots of CPU time.

At minimum, I think these observations show the merits of a symmetrical instruction set: you can get a long way with A+B tests, a much smaller number than A x B. Possibly you can write A+B fragments and automatically generate A x B tests?
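
As a rough illustration of that idea, here is a minimal Python sketch that keeps one hand-written fragment per instruction and one per addressing mode, then generates the full cross product automatically. The mnemonics, operand templates, and the `make_test` helper are hypothetical placeholders, not taken from the M65C02A test suite.

```python
from itertools import product

# Hypothetical fragments: A instruction mnemonics and B operand templates.
INSTRUCTIONS = ["lda", "adc", "sbc", "cmp"]
ADDRESSING_MODES = {
    "zp": "$10",
    "zp,X": "$10,X",
    "abs": "$1234",
    "(zp),Y": "($10),Y",
}


def make_test(mnemonic, mode_name, operand):
    """Combine one instruction fragment with one addressing-mode fragment."""
    return f"; test {mnemonic} {mode_name}\n\t{mnemonic} {operand}"


def generate_tests():
    """Yield A x B test cases built from A + B fragments."""
    for mnemonic, (mode_name, operand) in product(
        INSTRUCTIONS, ADDRESSING_MODES.items()
    ):
        yield make_test(mnemonic, mode_name, operand)


tests = list(generate_tests())
print(len(INSTRUCTIONS) + len(ADDRESSING_MODES), "fragments ->", len(tests), "tests")
```

A real generator would also need per-combination expected results, which is where most of the effort would likely go.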

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Fri Sep 28, 2018 12:21 pm

I am starting the process of rethinking the approach to testing that I will take. I spent some time going over my implementation for the M65C02A Python model. There are some opportunities for refactoring that I will probably undertake. Next, it looks like I can construct tests of the various routines that are common among the various basic and enhanced instructions. Thus, an approach that verifies the 49 addressing modes, followed by the verification of the 142 instructions, may suffice to test the model. In some cases, one set of tests will cover both parts of the instruction. For example, instructions described as using the implicit / accumulator addressing mode must be tested individually.

Michael A.

Rob Finch: Posts: 465; Joined: 29 Dec 2002; Location: Canada

Post by Rob Finch » Sat Oct 06, 2018 10:15 pm

It sounds like your testing goals are pretty stringent. Is this core being built with some form of certification in mind? I find for my own personal use I don’t test that much beyond common cases. Corner cases and unusual cases are unlikely to be tested. As the size of the project grows I find there’s more to it than I could ever test by myself. Sometimes testing is sheer pain to work through and this is supposed to be fun! At some point one can only say the thing is believed to be working as opposed to provably working.

http://www.finitron.ca

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Sat Oct 06, 2018 11:23 pm

Rob Finch wrote:

Sometimes testing is sheer pain to work through and this is supposed to be fun!

You nailed it. I want to enjoy the journey. Sometimes I extend what I am doing on the core to what I'm doing in my day job. And that's the rub. The level of testing that I need to do, from a day job perspective, just got to be a bit more than I want to do using the first approach. I am still looking for a way to make the task less of a drudge job and more of a technical / fun task.

I took the day off, and built another tool for one of those projects that I just can't seem to get finished with.

Rob Finch wrote:

Is this core being built with some form of certification in mind?

In my current day job, I am working a problem where certification of processors and software is required. The current approach relies heavily on testing, and my little exercise on my core was designed to give me some perspective on the scope of the problem for a processor architecture significantly more complicated than the M65C02A. The results are not encouraging unless some type of automation can be brought to bear on the problem. I have yet to convince myself that randomized testing, or even constrained randomized testing, of HW / SW systems is sufficient to establish the level of confidence that I feel is necessary for "safety-critical" systems. So I am looking for ways to tackle the problem of verifying my M65C02A Python model, and extending those results to my day job.

Last edited by MichaelM on Mon Oct 15, 2018 10:44 pm, edited 1 time in total.

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Mon Oct 15, 2018 10:40 pm

Over the last few days, I've been working toward executing the Sieve.pas program in the Py65 environment. I have written the 5 run-time functions needed by the program: _idiv(), _imul(), _writeln(), _swrite(), and _iwrite(). After a few days of work, I have successfully executed the program. The _iwrite() function is incomplete; it does not currently handle the field width specifier. I had to modify the _imul() previously described to return the 32-bit product in {ATOS, ANOS} as {PL, PH} instead of {PH, PL}. Also, the PyAsm65 assembler does not automatically link the program with the run-time library functions that it identifies; I manually linked together the Sieve.asm file with the five run-time functions, creating the file Sieve2.asm that I assembled into the executable image loaded into Py65. The following is a screen capture of the Sieve program successfully running in Py65 in a terminal window:

Code: Select all

.load Sieve2.bin 200
Wrote +3799 bytes from $0200 to $10D6

          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0200 0000 0000 0000 01FF 0000 00110000
              0000 0000 0000 01FF 0000 DL YXSIZ
              0000 0000 0000           00 00000

.g 200
Sieve of Eratosthenes

   0000200003   00005   00007         00011   00013         00017   00019   
      00023               00029   00031               00037         
00041   00043         00047               00053               00059   
00061               00067         00071   00073               00079   
      00083               00089                     00097         
00101   00103         00107   00109         00113                     
                  00127         00131               00137   00139   
                        00149   00151               00157         
      00163         00167               00173               00179   
00181                           00191   00193         00197   00199   
                              00211                           
      00223         00227   00229         00233               00239   
00241                           00251               00257         
      00263               00269   00271               00277         
00281   00283                           00293                     
                  00307         00311   00313         00317         
                              00331               00337         
                  00347   00349         00353               00359   
                  00367               00373               00379   
      00383               00389                     00397         
00401                     00409                           00419   
00421                           00431   00433               00439   
      00443               00449                     00457         
00461   00463         00467                                 00479   
                  00487         00491                     00499   
      00503               00509                                 
00521   00523                                                   
00541               00547                           00557         
      00563               00569   00571               00577         
                  00587               00593               00599   
00601               00607               00613         00617   00619   
                              00631                           
00641   00643         00647               00653               00659   
00661                                 00673         00677         
      00683                     00691                           
00701                     00709                           00719   
                  00727               00733               00739   
      00743                     00751               00757         
00761                     00769         00773                     
                  00787                           00797         
                        00809   00811                           
00821   00823         00827   00829                           00839   
                                    00853         00857   00859   
      00863                                       00877         
00881   00883         00887                                       
                  00907         00911                     00919   
                        00929                     00937         
00941               00947               00953                     
                  00967         00971               00977         
      00983                     00991               00997         

          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0001 0001 10D6 0000 10D8 04E6 00110001
              0000 0000 0000 01FF 0000 DL YXSIZ
              0030 0000 0000           00 00000

.cycles
Total = 883908, Num Inst = 323490, Pgm Rd = 683061, Data Rd = 122262, Data Wr = 77577, Dummy Cycles = 1008
  CPI = 2.73, Avg Inst Len = 2.11


          PC   AC   XR   YR   SP   VM  NVMBDIZC
M65C02A: 0001 0001 10D6 0000 10D8 04E6 00110001
              0000 0000 0000 01FF 0000 DL YXSIZ
              0030 0000 0000           00 00000

.q

Note: the load command shows the total size of the executable image. Currently, that image includes the BSS, heap, and stack of the program. For the Sieve program, this area is 3035 bytes. That data will not be included in the executable image in the future. The M65C02A mov #imm instruction is used to zero out this memory. The 6072 cycles required by that instruction to zero out the allocated BSS variables, the heap, and the stack area is included in the cycle count shown above.

The Num Inst field of the Py65 cycles command shows the number of basic instructions executed by the program, excluding any and all M65C02A prefix instructions. The Pgm Rd field shows the number of program memory reads made by the program, including fetches of prefix instructions and instruction operands. The Data Rd and Data Wr fields show the number of reads from and writes to data memory. The Dummy Cycles field shows the number of M65C02A cycles that do not perform either a read or a write of memory. The CPI field is the total number of cycles divided by the number of instructions executed. The Avg Inst Len field is the number of Pgm Rd cycles divided by the number of instructions executed.
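
The relationship between these fields can be checked with a few lines of Python. This is only a sketch: the function name `derived_metrics` is illustrative and not part of Py65, and the consistency check reflects the observation that the four cycle categories in the listing above add up to Total.

```python
def derived_metrics(total, num_inst, pgm_rd, data_rd, data_wr, dummy):
    """Recompute CPI and Avg Inst Len from the raw cycle counters."""
    # In the listing above, the four categories sum to the total cycle count.
    assert total == pgm_rd + data_rd + data_wr + dummy
    cpi = total / num_inst            # cycles per (non-prefix) instruction
    avg_inst_len = pgm_rd / num_inst  # program reads per instruction
    return round(cpi, 2), round(avg_inst_len, 2)


# Counters copied from the .cycles output of the Sieve2 run shown above.
cpi, avg_len = derived_metrics(883908, 323490, 683061, 122262, 77577, 1008)
print(cpi, avg_len)
```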

In the M65C02A, dummy cycles for read/modify/write operations have been eliminated. The M65C02A pul zp/abs and phr rel16 instructions require dummy cycles. The dummy cycles in the Sieve program are associated with the csr rel16 instructions. The csr rel16 instruction is the pc-relative subroutine call instruction formed by prefixing the phr rel16 instruction with the ind prefix instruction. In the program, the csr rel16 instruction is used in the _iwrite() to call _idiv() and _swrite(). PC-relative subroutine calls require an extra cycle, i.e. the dummy cycles noted above, but they do enable position-independent programming. The compiler is not using csr rel16 in its code generator, and level 1 variables, i.e. global variables, are not allocated on the stack at the present time. Thus, a bit more work is required to support fully relocatable programs with the PC65 compiler.

Interestingly, the program exhibits a data memory read for every 5.6 program memory reads, and a data memory write for every 8.8 program memory reads. I may add more metrics to the cycles command to measure the number of branches, branches taken/not taken, subroutine calls, etc. It would be interesting to see what the ratios of these instructions are relative to all instructions.

In general, I am pleased with the PC65 code generator. I did not have to make any changes to the compiler in order to get the program to execute. It could really use some optimization. A bunch of cycles can be saved for simple increments of the loop variables. As it is, for the M65C02A targeting a Spartan 3A -4 FPGA with a reported PAR frequency of 24 MHz, the Sieve program runs in approximately 36.8ms. Not too bad for a high level implementation in Pascal with roughly half of the sieve array wasted because of the 16-bit representation used; eliminating this inefficiency would dramatically decrease the overall cycle count. Another area of improvement could be to adopt the C approach to testing Boolean variables. The Pascal compiler explicitly tests against a value of 1, instead of 0 or non-zero. The code required for handling Boolean variables in this manner is non-trivial. In the C approach, the Z flag could be used, and simply loading the accumulator is all that would be required to set that flag.
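
The read/write ratios and the run time quoted above follow directly from the cycle counters. A minimal sketch of the arithmetic, assuming one memory cycle per clock at the reported 24 MHz (the constant names are illustrative):

```python
# Counters from the .cycles output of the Sieve2 run.
TOTAL_CYCLES = 883908
PGM_RD, DATA_RD, DATA_WR = 683061, 122262, 77577

# Assumption: one memory cycle per clock at the reported PAR frequency.
CLOCK_HZ = 24_000_000

reads_per_data_read = round(PGM_RD / DATA_RD, 1)
reads_per_data_write = round(PGM_RD / DATA_WR, 1)
run_time_ms = round(TOTAL_CYCLES / CLOCK_HZ * 1e3, 1)

print(reads_per_data_read, reads_per_data_write, run_time_ms)
```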

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Tue Oct 16, 2018 4:08 am

Without changing the basic algorithm, as defined by the Pascal program, I optimized many of the operations and derived an assembly language version of the Sieve program. The total number of cycles went down quite a bit, but not as much as I expected. I suspected that a significant number of the total number of cycles were associated with the output. Therefore, I measured the Sieve computation and the Sieve output separately. In the optimized version, the Sieve computation requires:

Code: Select all

.cycles
Total = 136809, Num Inst = 32307, Pgm Rd = 98951, Data Rd = 24861, Data Wr = 12997, Dummy Cycles = 0
  CPI = 4.23, Avg Inst Len = 3.06

The Sieve output requires:

Code: Select all

.cycles
Total = 481257, Num Inst = 209792, Pgm Rd = 395999, Data Rd = 57287, Data Wr = 26963, Dummy Cycles = 1008
  CPI = 2.29, Avg Inst Len = 1.89

For comparison, I went back and measured Sieve2 in the same manner. The Sieve computation requires:

Code: Select all

.cycles
Total = 364801, Num Inst = 104939, Pgm Rd = 265283, Data Rd = 56390, Data Wr = 43128, Dummy Cycles = 0
  CPI = 3.48, Avg Inst Len = 2.53

The Sieve output requires:

Code: Select all

.cycles
Total = 513005, Num Inst = 218541, Pgm Rd = 417747, Data Rd = 62837, Data Wr = 31413, Dummy Cycles = 1008
  CPI = 2.35, Avg Inst Len = 1.91

I did optimize the Sieve output code in Sieve3 vs Sieve2: 481257 (optimized) versus 513005 (PC65 code generator). Optimization in the Sieve computation section provided a more dramatic reduction in the number of cycles required: 136809 (optimized) versus 364801 (PC65 code generator). The optimizations provided a 2.67x improvement in the number of cycles needed to run the sieve.
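
For anyone wanting to repeat the comparison, the `.cycles` output lines can be parsed and compared mechanically. The sketch below uses a hypothetical helper, `parse_cycles`, that is not part of Py65; it only assumes the `Name = value` layout shown in the listings above.

```python
import re


def parse_cycles(line):
    """Turn 'Total = 1, Num Inst = 2, ...' into {'Total': 1, 'Num Inst': 2, ...}."""
    pairs = re.findall(r"([A-Za-z ]+?)\s*=\s*(\d+)", line)
    return {name.strip(): int(value) for name, value in pairs}


# Sieve computation only, copied from the listings above.
sieve2 = parse_cycles(
    "Total = 364801, Num Inst = 104939, Pgm Rd = 265283, "
    "Data Rd = 56390, Data Wr = 43128, Dummy Cycles = 0"
)
sieve3 = parse_cycles(
    "Total = 136809, Num Inst = 32307, Pgm Rd = 98951, "
    "Data Rd = 24861, Data Wr = 12997, Dummy Cycles = 0"
)

speedup = sieve2["Total"] / sieve3["Total"]
print(f"{speedup:.2f}x fewer cycles")
```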

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Mon Feb 18, 2019 6:07 pm

Moved post to Forth Thread.

Michael A.

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Sun Jun 13, 2021 6:55 pm

Recently I have made some improvements on the assembler. I added a number of peephole optimizations that have improved the overall time required to execute the Pascal Sieve benchmark. The following table provides the performance measured for four different Sieve benchmarks. All of the benchmarks are derived from the fundamental Sieve benchmark.

Code: Select all

       | Time (s) | CPI  | Avg Inst Len | Total Cycles | Num Inst | sec/cycle (s) |    MIPS    |
Sieve  |   0.777  | 3.53 |     2.53     |   370,905    | 105,047  |  0.00000209   | 135,158.80 |
Sieve2 |   0.374  | 3.88 |     2.96     |   187,977    |  48,495  |  0.00000199   | 129,590.55 |
Sieve3 |   0.361  | 4.30 |     3.06     |   139,759    |  32,509  |  0.00000258   |  89,935.19 |
Sieve4 |   0.306  | 4.32 |     3.07     |   139,186    |  32,222  |  0.00000220   | 105,331.56 |

The measured time is only that of the portion of the program that initializes the boolean prime number array and the sieve computation that identifies the primes from 2 to 1000. The output portion of the program is untimed. The times provided in the table are measured using the high-performance timers of the machine on which I'm running my version of py65 using Python 3.8.4rc1. (The python module for accessing these timers was provided by Gabriel Staples on Stack Exchange.)

Sieve is the basic benchmark. It uses a 16-bit boolean array that it initializes at the start of the program. No peephole optimization is performed. It is strictly the output of the code generator of the recursive descent PC65 Pascal compiler.

Sieve2 is an optimized version of Sieve benchmark. Several peephole optimizations have been applied. It maintains booleans as 16-bit quantities.

Sieve3 is a hand-optimized version of the Sieve benchmark. It uses the M65C02A mov instruction to initialize the boolean array, and it uses 8-bit quantities to represent booleans instead of the 16-bit quantity of a standard Pascal boolean.

Sieve4 is a hand-optimized version of the Sieve3 benchmark. The additional optimizations applied pertain to some of the branching structures where the complementary branch instruction is used to remove an unconditional branch that follows a conditional branch.
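
The branch transformation used in Sieve4 can be pictured as a small peephole pass. The following Python sketch is only an illustration of the idea, not the code in PyAsm65: the `label:` syntax, the branch table, and the function name are assumptions, and a real pass would also have to confirm that the new branch target is within relative-branch range.

```python
# Standard 6502 conditional branches and their complements.
COMPLEMENT = {
    "beq": "bne", "bne": "beq",
    "bcc": "bcs", "bcs": "bcc",
    "bmi": "bpl", "bpl": "bmi",
    "bvc": "bvs", "bvs": "bvc",
}


def fold_branch_over_jump(lines):
    """Rewrite 'bxx L1 / jmp L2 / L1:' as 'b(not xx) L2 / L1:'."""
    out = []
    i = 0
    while i < len(lines):
        if i + 2 < len(lines):
            op, _, target = lines[i].partition(" ")
            jump_op, _, jump_target = lines[i + 1].partition(" ")
            if (op in COMPLEMENT and jump_op == "jmp"
                    and lines[i + 2] == target + ":"):
                # Branch range is NOT checked here (assumed to be in range).
                out.append(f"{COMPLEMENT[op]} {jump_target}")
                i += 2  # drop the branch and the jmp; the label line is kept
                continue
        out.append(lines[i])
        i += 1
    return out


before = ["cmp.w #1", "beq L_012", "jmp L_011", "L_012:"]
after = fold_branch_over_jump(before)
print(after)
```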

From the sec/cycle column, it can be seen that py65 is executing an emulated 6502 memory cycle in approximately 2 microseconds, or approximately 500,000 memory cycles per second. Given all that py65 is doing to provide a simulated 6502 (or in my case, a simulated 16-bit extension of a 65C02), the time per cycle is pretty impressive, at least to me.

The MIPS column is also interesting. The slowest of the four benchmarks, Sieve, has the highest calculated MIPS rating. Sieve also has the lowest CPI (clocks per instruction) and average instruction length. Sieve2 shows that the simple peephole optimizations that I added to PyAsm65 greatly improve the performance of the program. Sieve2 reduces the number of cycles to compute the sieve by roughly half. Its CPI is approximately 0.35 higher than that of Sieve, and its average instruction length is slightly higher. With these differences, the MIPS rating for Sieve2 is just less than that of Sieve.
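
The derived columns of the table can be recomputed from the measured time, cycle count, and instruction count. This is a sketch with illustrative names; because the times in the table are rounded to three decimal places, the instructions-per-second values it prints differ slightly from the last column of the table.

```python
# (time in seconds, total cycles, instructions), copied from the table above.
ROWS = {
    "Sieve": (0.777, 370905, 105047),
    "Sieve2": (0.374, 187977, 48495),
    "Sieve3": (0.361, 139759, 32509),
    "Sieve4": (0.306, 139186, 32222),
}

metrics = {}
for name, (seconds, cycles, insts) in ROWS.items():
    metrics[name] = {
        "cpi": cycles / insts,                    # emulated cycles per instruction
        "us_per_cycle": seconds / cycles * 1e6,   # host time per emulated cycle
        "inst_per_sec": insts / seconds,          # emulated instructions per host second
    }
    m = metrics[name]
    print(f"{name:7s} CPI={m['cpi']:.2f} "
          f"us/cycle={m['us_per_cycle']:.2f} inst/s={m['inst_per_sec']:,.0f}")
```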

The CPI and average instruction length of the remaining two benchmarks are very similar, and both benchmarks beat the output of the PC65 code generator and peephole optimizer. However, their advantage over the Sieve2 benchmark is somewhat muted by their 4.30 / 4.32 CPI ratings and 3.06 / 3.07 average instruction lengths.

It seems that in this case, the lower the CPI, the lower the performance. I suppose this assessment, which tends to go against the current trend toward lower CPI ratings, is an artifact of the 8-bit nature of the basic machines used in this implementation: 6502/65C02/M65C02A. Clearly there must be a happy medium between 3.5 and 4.3 CPI where the best performance with the least effort can be achieved. Given the limited number of registers in the 6502/65C02/M65C02A architecture, extensive use of an external stack was made in the implementation of these benchmarks.

The M65C02A provides a number of enhancements relative to the stack such as a 16-bit stack pointer, stack-relative addressing, and base-pointer addressing, direct adjustment of the stack pointer following subroutine calls, etc. These enhancements, when effectively used, as demonstrated by the Sieve2 benchmark, appear to provide reasonable performance for a HLL-driven Sieve benchmark implementation. Additional performance improvements can be had, as demonstrated by the Sieve3 and Sieve4 benchmarks, but realizing those improvements requires recognizing that loops and array initialization can be replaced with the M65C02A mov instruction and that some unconditional branches can be replaced with the complementary conditional branch.

Thoughts?

Michael A.

BigEd: Posts: 11464; Joined: 11 Dec 2008; Location: England

Post by BigEd » Sun Jun 13, 2021 8:05 pm

> It seems that in this case, the lower the CPI, the lower the performance.

Is that the same as saying that the highest performing code has used more complex instructions? (They take more ticks, but they get more done.)

(Is it correct that the final two columns are only telling us about the python emulation performance, and have nothing to say about the merits of the CPU you've designed?)

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Sun Jun 13, 2021 9:56 pm

BigEd wrote:

> (Is it correct that the final two columns are only telling us about the python emulation performance, and have nothing to say about the merits of the CPU you've designed?)

Yes. They refer to the performance of the py65 environment and the M65C02A MPU model / class that I built for it. Given the amount of stuff the environment and model do under the hood, I think an execution / emulation rate of 500,000 memory cycles per second is pretty good. Redoing the environment and model in C/C++ may improve the performance, but I'm not sure that the effort required will be worth it given the performance that py65 is providing right out of the box.

The py65 emulation of the console port, and the output of the primes found to the console port executes in about 5 to 5.5 seconds. This time swamps the time required to find the primes regardless of the executable tested.

Regarding the merits of the M65C02A architecture, I think the large reduction in total number of cycles between Sieve and Sieve2 indicates that the additional characteristics / features I added (using prefix instructions rather than mode bits), and in particular the ability to perform many of the 16-bit operations required by the Pascal model of computation in a single instruction, are good for overall performance. The similar capabilities of the '816 provide a similar increase in performance.

The software multiplication routines that BitWise (Andrew Jacobs) and I wrote and compared some time back seem to indicate that the prefix instruction approach that I chose to use does not unduly burden the programmer or reduce the overall efficiency of the M65C02A compared to the '816 in that particular use case. Granted that I took advantage of some additional internal registers and some peculiar architectural support for software multiplication that I included in the expanded instruction set, but the overall performance of the M65C02A matched up well with that of the '816. However, that is just a single use case, and the M65C02A still retains a strictly 16-bit address range.

Extended memory can be added, but to access it, some type of paging like that used on 16-bit minicomputers would be required. That approach would be aided by the kernel / user mode support that I included in the architecture, but it would not be particularly easy to manage, and I've not yet added support for virtual memory or instruction abort / restart.

Michael A.

BigEd: Posts: 11464; Joined: 11 Dec 2008; Location: England

Post by BigEd » Mon Jun 14, 2021 11:28 am

Thanks - very interesting to see a case study of prefixed instruction codings bringing benefits!

MichaelM: Posts: 761; Joined: 23 Apr 2012; Location: Huntsville, AL

Post by MichaelM » Mon Mar 14, 2022 7:22 pm

Updated the M65C02A repository with a new instruction set map. The new instruction set map is for the M65C02B. The M65C02B is a modification of the base instruction set of the M65C02A soft-core processor.

The M65C02A has five unused opcodes in Column 2. One of these is reserved for a co-processor, and the remaining four opcodes were reserved for signed / unsigned multiplication and division opcodes. However, Column 3 of the M65C02A instruction set implemented an ip,I++ register-indirect with autoincrement addressing mode (using the 16-bit IP register from the M65C02A's built-in Forth VM). The sixteen instructions implemented were:

Code: Select all

ORA/ASL/AND/ROL/EOR/LSR/ADD/ROR/STA/TSB/LDA/TRB/CMP/DEC/SUB/INC ip,I++

This complement of instructions was seen as representing nearly all of the important instructions of the basic 6502/65C02 with this new addressing mode.

I originally saw these Column 3 instructions as being important for the implementation of a fast Forth, but that assumption was woefully lacking in understanding of Forth. In the end, only the STA/LDA ip,I++ instructions were used, and the auto-increment of the IP was often unwanted.

In the implementation of the basic extended 6502/65C02 architecture of the M65C02A, the emphasis was toward making the instruction set support the code generator of the Mak Pascal compiler, a recursive descent compiler originally targeting the 8086 instruction set. After studying the compiler's code generator, I determined that the M65C02A would require both BP-relative and SP-relative addressing modes.

The standard pre-indexed (by X) zero page and pre-indexed (by X) absolute addressing modes would easily support the BP-relative addressing mode required to access parameters and local variables passed on the stack. A specific modification was needed to use zp,X addressing mode in the manner required for BP-relative addressing mode available with the 8086: if the BP in X is greater than 512, the parameter is treated as signed, else the parameter is treated as unsigned. When X is less than 512, X points into page 0 or page 1, in which case, for backward compatibility, the 8-bit offset in the instruction must be treated as unsigned as it is in the 6502/65C02 processors. When X is greater than 512, the assumption is made that the extended stack capability of the M65C02A (which supports 16-bit stack pointers with stacks to 64k in size) is being used, and the base pointer (BP) placed in X, using the 16-bit TSX.w instruction of the M65C02A, points into a stack with sufficient space to support both parameter passing and local variable declarations. In this situation, the 8-bit parameter in the instruction is treated as signed and sign-extended automatically.
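
The rule for the 8-bit offset can be summarized in a short Python sketch. It is a model of the description above, not of the actual M65C02A logic: the function name is illustrative, the behaviour at exactly X = 512 is an assumption (the text only says "greater than" and "less than"), and address arithmetic is assumed to wrap at 16 bits.

```python
def bp_relative_address(x, offset):
    """Effective address of 'zp,X' with a 16-bit X, per the rule described above."""
    if not 0 <= offset <= 0xFF:
        raise ValueError("offset must be an 8-bit value")
    # Assumption: X == 512 is treated like X < 512 (unsigned offset).
    if x > 512 and offset & 0x80:
        offset -= 0x100  # sign-extend the 8-bit offset
    return (x + offset) & 0xFFFF  # assumed 16-bit wraparound


# X in page 1: offset is unsigned, as on the 6502/65C02.
print(hex(bp_relative_address(0x0100, 0xFE)))
# X used as a base pointer into an extended stack: offset is signed.
print(hex(bp_relative_address(0x8000, 0xFE)))
print(hex(bp_relative_address(0x8000, 0x04)))
```

With a signed offset, parameters and local variables can sit on either side of the base pointer, which matches how BP-relative addressing is used on the 8086.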

The SP-relative addressing mode is easily constructed from the zero page pre-indexed (by X) and absolute pre-indexed (by X) addressing modes. By applying the OSX (override X with S) prefix instruction, these two addressing modes are converted to the required SP-relative addressing mode, with 8-bit and 16-bit unsigned offsets, respectively. For those instructions in the base 6502/65C02 instruction set which don't support pre-indexed (by X) addressing, applying the OSX prefix instruction is allowed, and will provide SP-relative addressing for those instructions. However, these instructions will not support the BP-relative addressing mode which only applies to instructions supporting the pre-indexed by X addressing modes.

Given those new addressing modes and a few other custom instructions, the code generator of the Mak Pascal compiler was easily modified to support the extended 6502/65C02 architecture of the M65C02A. One problem that came up was that the order of the operands for subtractions and comparisons was not as needed. For the 8086, the evaluation order could be easily reversed so the order of the operands on the evaluation / operand stack was not an issue. The issue arises primarily because the recursive descent compiler pushes the operands of an expression onto the operand stack in left to right order. This puts the left operand deeper in the operand stack than the right operand of operations, such as subtraction and compare, in which the order of the operands is important. Popping the first operand (right operand) on the stack into the accumulator leaves the second operand (left operand) on the stack. With the 8086, the second operand could be popped into a second register, and the order of the operations using these two registers can be reversed.

I resolved this issue by adding an instruction, XMA zp,X, that exchanges A with an operand in memory, i.e. in the operand stack when the SP-relative form of the addressing mode is used: XMA[.w] 1,S. As useful as this instruction is in correctly ordering the operands of subtract / compare operations, it just did not sit well with me that subtractions / comparisons require an additional 7 clock cycles compared to the other two-operand instructions of the 6502/65C02/M65C02A instruction set. Without major changes to the compiler, these additional 7 cycles would be required for all subtraction and comparison operations.
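
The operand ordering problem, and the way the exchange fixes it, can be mimicked in a few lines of Python. This is a toy model with illustrative names: a Python list stands in for the operand stack, a variable stands in for the accumulator, and no flags or cycle counts are modeled.

```python
def subtract_with_exchange(left, right):
    """Evaluate 'left - right' the way the generated code sequence does."""
    stack = []
    stack.append(left)   # left operand is pushed first, so it ends up deeper
    a = right            # right operand is the one left in the accumulator

    # Without an exchange, "A minus top of stack" would compute right - left.
    a, stack[-1] = stack[-1], a   # xma.w 1,S : swap A with the stack operand
    result = a - stack[-1]        # sub.w / cmp.w 1,S : now left - right
    stack.pop()                   # adj #2 : discard the stack operand
    return result


print(subtract_with_exchange(10, 3))
```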

The SP-relative indirect addressing mode, created by applying the OSX and IND prefix flags to a pre-indexed by X addressing mode, allows the processor to load / access the value of a variable whose address is on the stack. This addressing mode is very useful, and fairly inexpensive, but requires more cycles than if the pointer is first popped off the stack and the LDA ip,I++ or STA ip,I++ instructions in Column 3 are used instead. The SP-relative indirect addressing mode uses two additional memory cycles to resolve the address of the operand, and leaves the pointer on the stack, which frequently requires that the operand stack be cleaned up by popping the value at a typical cost of an additional two memory cycles. Pulling the pointer off the operand stack into IP solves the stack cleanup problem, but if the pointer is also to be used to store the value of the expression into the same variable, the auto-increment feature of the ip,I++ addressing mode will have modified the pointer in the IP register. Thus, sometimes it is necessary to use an instruction sequence using the SP-relative indirect addressing mode, especially if the variable whose address is on the stack is being updated with the results of the operation: CLC [1], ADC.w (1,S) [6], STA.w (1,S) [6], ADJ #2 [2]. Using the register-indirect mechanism supported by the Column 3 ip,I++ instructions is difficult since the pointer loaded in IP is auto-incremented and there's no mechanism to override the auto-increment function, nor is there a decrement IP instruction in the M65C02A instruction set.

After some other work with the compiler and the assembler to speed up basic instruction sequences, I started considering what changes could be made to the instruction set of the M65C02A that would improve the situation with respect to a few bottlenecks that I noticed in the compiled code: (1) comparison operations, which are quite common, require a lot of additional clock cycles because the order of the operands on the evaluation stack is incorrect for the CMP instruction; (2) Pascal boolean operations require a lot of clock cycles to perform; (3) 8-bit SP-relative indirect addressing mode requires a second prefix instruction to implement.

The last of these three is not that important because the Pascal compiler doesn't use 8-bit constants, but it just seemed to me that the 8-bit LDA (1,S) [7] instruction (OSX, IND, LDA 1,X) should not be longer than the 16-bit LDA.w (1,S) [7] instruction (OIS, LDA 1,X). To make the 8-bit and 16-bit versions of this instruction the same length, another multi-flag prefix instruction would be needed: OSI {OSX + IND}.

Comparisons occur quite frequently, and in the left to right expression parsing of the Pascal compiler, the operands for the CMP instruction are in the wrong order on the evaluation stack. This means that the XMA.w 1,S [7] instruction (OSZ, XMA 1,X) must be used to place the left operand in A and the right operand on the stack (after having first popped it into A). Only in this way will the subtraction operation in the CMP instruction yield the comparison results as written in the source code. It is possible to complement the branch instruction used to test the results of the comparison, but I decided that would obfuscate the generated code more than I was willing to allow.

A complete example of the sequence of instructions generated by the compiler for a REPEAT UNTIL construction (from the Sieve of Eratosthenes program) is as follows:

Code: Select all

( 163)                  ; ;   30:     UNTIL prime > limit;
( 164) 033C 4CEA02      ; 	jmp L_016
( 165)                  ; L_018
( 166) 033F ABAD180D    ; 	lda.w prime_006
( 167) 0343 AB48        ; 	pha.w
( 168) 0345 ABAD160D    ; 	lda.w limit_005
( 169) 0349 CB4401      ; 	xma.w 1,S
( 170) 034C CBD501      ; 	cmp.w 1,S
( 171) 034F C202        ; 	adj #2
( 172) 0351 08          ; 	php
( 173) 0352 A901        ; 	lda #1
( 174) 0354 28          ; 	plp
( 175) 0355 AB1002      ; 	bgt L_020
( 176) 0358 A900        ; 	lda #0
( 177)                  ; L_020
( 178) 035A ABC90100    ; 	cmp.w #1
( 179) 035E F003        ; 	beq L_012
( 180) 0360 4C8D02      ; 	jmp L_011
( 181)                  ; L_012

A couple of peephole optimizations that I added to the assembler transformed the previous code sequence into the following shorter / faster code sequence:

( 164)                  ; ;   30:     UNTIL prime > limit;
( 165) 02CD 4C9A02      ; 	jmp L_016
( 166)                  ; L_018
( 167) 02D0 ABAD710C    ; 	lda.w prime_006
( 171) 02D4 ABCD6F0C    ; 	cmp.w limit_005
( 173) 02D8 AB1004      ; 	bgt L_020-2
( 174) 02DB A900        ; 	lda #0
( 175) 02DD 8002        ; 	bra L_020
( 177) 02DF A901        ; 	lda #1
( 178)                  ; L_020
( 180) 02E1 D003        ; 	bne L_012
( 181) 02E3 4C7002      ; 	jmp L_011
( 182)                  ; L_012
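
One of the rewrites visible in the two listings (pha.w / lda.w / xma.w 1,S / cmp.w 1,S / adj #2 collapsing into a single cmp.w) can be sketched as a pattern match over a window of instructions. This is a hypothetical illustration in Python, not the assembler's actual optimizer: the tuple representation and the function name are assumptions.

```python
# Hypothetical peephole pass over (mnemonic, operand) tuples.
# Rewrites:  pha.w / lda.w X / xma.w 1,S / cmp.w 1,S / adj #2   ->   cmp.w X
# A already holds the left operand before pha.w, so the push, swap and pop
# are unnecessary when X can be addressed directly by cmp.w.

PATTERN = ["pha.w", "lda.w", "xma.w", "cmp.w", "adj"]


def peephole_compare(code):
    out = []
    i = 0
    while i < len(code):
        window = code[i:i + 5]
        if (len(window) == 5
                and [op for op, _ in window] == PATTERN
                and not window[1][1].endswith(",S")  # push shifts S offsets
                and window[2][1] == "1,S"
                and window[3][1] == "1,S"
                and window[4][1] == "#2"):
            out.append(("cmp.w", window[1][1]))
            i += 5
        else:
            out.append(code[i])
            i += 1
    return out
```

The guard on stack-relative operands is a conservative assumption: after pha.w the stack offsets differ by two, so such operands would need adjusting rather than copying.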

If the bgt L_020-2 instruction left the boolean result in the accumulator, instead of relying on the code sequence in lines 174-177 (missing line numbers represent code removed by the peephole optimizer), then 6 cycles could be saved from the example loop termination code sequence.

The M65C02B instruction set adds a mode to the branch instructions: when prefixed with the OSX prefix instruction, a branch instruction deposits the value of the internal CC flag, i.e., the Condition Code flag, into the accumulator. The unconditional BRA rel branch instruction becomes a conditional branch when OSX is prefixed: it takes the branch if the accumulator is non-zero, and otherwise falls through. The above optimized code sequence would resemble the following:

( 164)                  ; ;   30:     UNTIL prime > limit;
( 165) 02CD 4C9A02      ; 	jmp L_016
( 166)                  ; L_018
( 167) 02D0 ABAD710C    ; 	lda.w prime_006
( 171) 02D4 ABCD6F0C    ; 	cmp.w limit_005
( 173) 02D8 CB1000      ; 	bgt.s L_020
( 178)                  ; L_020
( 180) 02E1 CBD8003     ; 	bra.s L_012
( 181) 02E3 4C7002      ; 	jmp L_011
( 182)                  ; L_012

(Note: if the range is longer than supported by an 8-bit offset, the range can be easily extended from 8 to 16 bits by using the OIS prefix instruction.)
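
The behavior of the OSX-prefixed branches described above can be modeled roughly as follows. This is a sketch under stated assumptions, not the core's actual logic: the function names are hypothetical, the deposited value is assumed to be 1 or 0 (matching the lda #1 / lda #0 pair it replaces), and the conditional branch is assumed to still branch on CC.

```python
# Rough model of OSX-prefixed branches (assumptions noted in the comments).


def bcc_s(cc):
    """OSX-prefixed conditional branch such as bgt.s.

    Returns (new_A, taken). Assumes A receives 1 or 0 and that the branch
    itself is still taken when CC is true.
    """
    return (1 if cc else 0), bool(cc)


def bra_s(a):
    """OSX-prefixed BRA: taken only when the accumulator is non-zero."""
    return a != 0


def until_exits(prime, limit):
    """UNTIL prime > limit: cmp.w + bgt.s deposit the boolean, bra.s tests it."""
    a, _ = bcc_s(prime > limit)   # lda.w prime / cmp.w limit / bgt.s L_020
    return bra_s(a)               # bra.s L_012 leaves the loop when taken
```

For example, until_exits(11, 10) returns True (the loop terminates) and until_exits(10, 10) returns False (control falls through to jmp L_011).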

In the preceding example from the Sieve of Eratosthenes program, there are no subroutines, so all relevant variables are at level 0, and therefore utilize absolute addressing. The situation is a bit different if the loop variables are local variables within a procedure / function. In this case, the variables are loaded on the stack, in the wrong order. This brings us to the changes that I've made to the instructions in Column 3 of the M65C02A instruction set. For these instructions, the order of the ALU operands is reversed. In other words, the normally left operand is treated as the right operand and vice versa. However, the destination is always the accumulator. For example, the definition of the standard subtraction instruction is A <= A + (~M) + C. For Column 3 instructions, subtraction is defined as A <= M + (~A) + C, where M is either the value read from the memory location addressed by post-indexed (by Y) IP-indirect addressing mode, or the Y register: A <= [I+Y] + (~A) + C or A <= Y + (~A) + C. The following are the instructions of the M65C02B in Column 3:

0x03:   NEG  A      // A <= 0 - A
0x13:   ORA  Y      // A <= Y | A
0x23:   CLR  A      // A <= 0
0x33:   AND  Y      // A <= Y & A
0x43:   CPL  A      // A <= ~A
0x53:   EOR  Y      // A <= Y ^ A
0x63:   ADC  I,Y    // M <= [I+Y]; A <= M + A + C
0x73:   ADC  Y      // A <= Y + A + C
0x83:   STA  I,Y    // [I+Y] <= A
0x93:   STA  I      // [I] <= A
0xA3:   LDA  I,Y    // A <= [I+Y]
0xB3:   LDA  I      // A <= [I]
0xC3:   CMP  I,Y    // M <= [I+Y]; (SIZ ? {N,V,Z,C} : {N,Z,C}) <= M - A
0xD3:   CMP  Y      // (SIZ ? {N,V,Z,C} : {N,Z,C}) <= Y - A
0xE3:   SBC  I,Y    // M <= [I+Y]; A <= M + ~A + C
0xF3:   SBC  Y      // A <= Y + ~A + C
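
The difference between the standard and the Column 3 subtraction can be checked with a short Python sketch of the two formulas. It models 8-bit binary arithmetic only (no decimal mode, no N/V/Z flags), and the function names are hypothetical.

```python
# 8-bit sketch of the two subtraction definitions (binary mode only).

MASK8 = 0xFF


def sbc_standard(a, m, c):
    """A <= A + (~M) + C. Returns (result, carry_out)."""
    total = a + (~m & MASK8) + c
    return total & MASK8, total >> 8


def sbc_reversed(a, m, c):
    """Column 3 form: A <= M + (~A) + C. Returns (result, carry_out)."""
    total = m + (~a & MASK8) + c
    return total & MASK8, total >> 8
```

With C = 1, sbc_standard(5, 3, 1) yields (2, 1), i.e. 5 - 3, while sbc_reversed(5, 3, 1) yields (0xFE, 0), i.e. 3 - 5 with a borrow. Swapping the arguments of the reversed form, sbc_reversed(3, 5, 1), gives (2, 1) again.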

The following are the four instructions added to the M65C02B in Column 2:

0x22:   JSR I,Y
0x42:   OSI
0x62:   OWI
0x82:   JMP I,Y

The OSI instruction rounds out the prefix instructions, and OWI adds another prefix instruction which allows W to override IP when the IP register-indirect addressing mode is used. The JSR I,Y and JMP I,Y instructions were added to see whether they would be useful in some circumstances not currently supported by the compiler. JSR I,Y can certainly be useful if support for indirect function calls, or function pointers, were added to the compiler.

Edit: modified example. Original incorrect. Given example not as effective in communicating the advantage of the use of the OSX prefix with the BRA rel8 instruction.

Last edited by MichaelM on Sat Mar 19, 2022 3:36 pm, edited 4 times in total.

Michael A.
