6502.org

Posted: **Fri Mar 28, 2014 12:10 am**

Bregalad wrote:

A big problem of the register machine is that, after a call, you don't know what to expect in your regs. Even if it was a call to 6502 code, it could have itself called another VM code, and override the regs. Using a sliding register window like AcheronVM solves this, but is a source of bloat in RAM usage and VM itself.

It's not a source of memory bloat. If intermediate state needs to be saved, it will take up RAM, full stop. The mechanism by which that happens (manual, vs part of function call semantics, vs automatic allocation in stacks) is kind of irrelevant to the presence of that load on the memory footprint.

Regarding AcheronVM, most of its features actually were 6502 implementation details that I leaked out to useful features the VM itself, meaning that such features happen "for free", like the carry stack and various 8-bit address indexing modes. The additional effect on VM code footprint by having a sliding register window in comparison to stack-based operations or accessing an orthgonally indexed register bank, is likely under a dozen bytes and could be zero depending on your point of reference.

I do think you need to do gain a bit more understanding about the concepts of execution complexity and code representation, and how those affect the real tradeoffs in various approaches (hardcoded, stack-based, register files, etc). There are no silver bullets. However, the particular downsides you back away from may not really affect things much: Because of the low level of abstraction of 6502 code, many of the directions discussed here will be very advantageous regardless of perceived downsides.

One thing that I think is certain, however, is that you will need to perform some major rearchitecting of your project. In my personal opinion, your project does not sound big enough to be fearful of refactoring & porting, so go for it. You will end up raising the abstraction level of your code by most of these techniques, simplifying the authoring processes.

Posted: **Fri Mar 28, 2014 12:52 am**

Sweet16's 'ADD R0' encodes into a single byte, and does a pretty reasonable 16-bit-wide 'ASL A' ... as you probably already know, the source is widely available and not that hard to custom-tailor to your uses, after you recognize some of Woz's coding quirks. I believe that he was the first to use the 'RTS' dispatch technique on the 6502.

It should even be possible to change his 4/4 op-code/register split to 5/3, should you decide that eight registers are enough (despite Jeff's urging to try it, I can't wrap my aging head around coding for a machine with so many registers).

Posted: **Fri Mar 28, 2014 2:23 pm**

Bregalad wrote:

If I ever go as far as using a VM, I'd probably engine my own one instead of using a pre-existing one.
As far I can tell, there is the following "problems" with SWEET16 (not they are actual problems, but could be sources of inefficiencies) :
- No way to call 6502 code directly
- No way to call SWEET16 code directly from 6502, you'll have to either switch mode and call, or call and switch mode

To call 6502 code directly from Sweet16, you could just use one of the unimplemented Sweet16 opcodes followed by a two-byte address to make a new Sweet16 instruction. You've got $0D, $0E, and $0F available. Let's use the $0D opcode and call our new instruction "EXT" (for extend). Assuming an otherwise stock Sweet16 it could be as simple as hooking up the vector and adding this:

Code: Select all

EXT     LDX #0		; so SETZ will use R0 for the indirect
        LDY #1		; point to high byte of transfer address
        JSR SETZ
        JMP (R0)	; implement your 6502 extension to Sweet16 here!
; USE RTS FROM NATIVE EXTENSION ROUTINE TO RETURN TO SWEET16

To get around the issue of JSR SW16 consuming three bytes, I jacked my BRK vector to call Sweet16, so it only takes one byte. In my situation, this has the added advantage that the PET ROM code throws all 6502 registers on the machine stack prior to sending control to Sweet16

Bregalad wrote:

- No way to load a 8-bit constant in a register (this means you'll waste a $00 in ROM for the high byte)

This one's a little more tricky. Why not use another unimplemented opcode for that? I propose calling it SETB (set byte). Since there's no room for the register nybble, it always loads the byte following the $0E (SETB) instruction into R0.

Bregalad wrote:

- No shift operations. I use them so often, it's a major source of inefficiency in native 6502 when you have to do ASL/ROL or LSR/ROR in order to shift a 16-bit value, how could they leave them out ?
- No AND/OR/XOR operations. In several cases you could simulate them with ADD or SUB, but not always.

Easily done with the new Sweet16 EXT opcode

Bregalad wrote:

- Use a separate stack instead of 6502 stack
- 15 registers sounds a bit like overkill (although it's hard to tell before using them)
- Don't like the fact it tries to simulate the status flags of a processor. Why no higher level branches, like "branch if this is greater than this", or things like that

Don't know how to fix these for you, which seem more a matter of personal preference. As a Forth programmer, I tend to think of separate stacks as a strong advantage.

Posted: **Fri Mar 28, 2014 2:26 pm**

barrym95838 wrote:

Sweet16's 'ADD R0' encodes into a single byte, and does a pretty reasonable 16-bit-wide 'ASL A' ... as you probably already know, the source is widely available and not that hard to custom-tailor to your uses, after you recognize some of Woz's coding quirks. I believe that he was the first to use the 'RTS' dispatch technique on the 6502.

It should even be possible to change his 4/4 op-code/register split to 5/3, should you decide that eight registers are enough (despite Jeff's urging to try it, I can't wrap my aging head around coding for a machine with so many registers).

This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?

Posted: **Fri Mar 28, 2014 3:20 pm**

Quote:

This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?

AND, OR, XOR, LSR, ASL, LD and ST indirect without post-increment (when writing to a serial port).

The problem is that implementing more instructions also means the VM takes more space, killing its benefits of saving bytes.

@chitselb : yeah, using $0b and $0e that way is clever. But that doesn't fix the lack of LSR and logical operations. Of course they can be done easily switching back to 6502.

Quote:

If intermediate state needs to be saved, it will take up RAM, full stop

Sounds like Acheron VM saves 16 16-bit registers (that's 32 bytes) each time, no matter if you use them or not (and no matter if you must save all of them or not). Please correct me if I misunderstood that, if I did then the RAM usage isn't bloated.

Posted: **Fri Mar 28, 2014 3:22 pm**

Well, I'm kinda thinking that I would want AND, OR, EOR, LSR, BRL, BSL, ROL, ROR, MUL, DIV, MOD, and a way to more gracefully call a 6502 subroutine without having to always do something like 'RTN : LDA $00 : LDX $02 : LDY $04 : JSR Native : STA $00 : STX $02 : STY $04 : JSR Sweet16'. A new instruction (call it 'J02' or something) could be just three bytes long and would do all of that for you. If I could get all of that at the cost of losing half of my registers, I would be tempted to do so.

Mike

Posted: **Fri Mar 28, 2014 4:05 pm**

Bregalad wrote:

Sounds like Acheron VM saves 16 16-bit registers (that's 32 bytes) each time, no matter if you use them or not (and no matter if you must save all of them or not). Please correct me if I misunderstood that, if I did then the RAM usage isn't bloated.

The register stack is sort of like C stack frames. You tell it how much to slide. On function entry, you can say "grow 3" and you'll have slid the register window by 3 registers. If you use "mgrow 3", then it marks the current window location and a "retm" (instead of "ret") at the end of the function will automatically slide the register window back.

Regarding SWEET16, I do think that it's better for the global VM design to just drop to 6502 mode rather than call 6502 subroutines directly. 6502 code is simply too open-ended in terms of preconditions that you're better off setting up the call from 6502 code. SWEET16 itself can't really be general-purpose and have a fixed global 6502 calling convention, but your specific implementation can (for instance, solely using SW16 registers as parameters), so I think it's rightly left to extensions. Also note that the SWEET16 dispatch requires all instruction handlers to be on a single page, so making complex instructions which pass state between 6502 and SW16 registers might go beyond that and require additional JMP stumps.

Are you planning on releasing source to your project? All of this is pretty contextual to the particulars you're working with, and it's hard to know what will work best for your code base. It's next to impossible to take many of the things from this thread as global, gospel, constant truth, equally applicable to all projects ever.

Posted: **Tue Apr 01, 2014 6:53 pm**

barrym95838 wrote:

Well, I'm kinda thinking that I would want AND, OR, EOR, LSR, BRL, BSL, ROL, ROR, MUL, DIV, MOD, and a way to more gracefully call a 6502 subroutine without having to always do something like 'RTN : LDA $00 : LDX $02 : LDY $04 : JSR Native : STA $00 : STX $02 : STY $04 : JSR Sweet16'. A new instruction (call it 'J02' or something) could be just three bytes long and would do all of that for you. If I could get all of that at the cost of losing half of my registers, I would be tempted to do so.

Mike

Adding code to Sweet16 is kind of a toughie, but it's doable. For general purpose operators, I'd flag it with one of the single-byte opcodes ($0x) followed by a second opcode with a register, e.g. ANDD R3 (for AND double) would be $0e $13, and ORD R10 (OR double) would be $0e $2a. I didn't need to do that, but I managed to shrink my (FIND) word down by 31 bytes and it's still pretty peppy. I haven't benchmarked it yet vs. a real Forth

Code: Select all

33 bytes (wordlen16 and strcomp16 6502)
>C:0635  84 17 a0 02  b1 0a 29 3f  85 16 60 a4  0c c8 c8 b1   ......)?..`.....
>C:0645  0a 51 02 0a  d0 07 88 c0  02 d0 f4 38  24 18 26 1d   .Q.........8$.&.
>C:0655  60         

55 bytes (opfind sweet16 with ext)
>C:0956  00 b0 38 21  37 0e 21 35  0d 35 06 2b  36 27 35 0d   ..8!7.!5.5.+6'5.
>C:0966  35 06 26 db  02 1a 07 05  0d 40 06 03  03 65 01 ee   5.&......@...e..
>C:0976  65 25 a6 31  e1 45 b6 15  80 00 f8 d5  06 02 e8 e8   e%.1.E..........
>C:0986  0f 28 31 00  4c 80 00        

88 bytes totatl

Originally was all 6502

Code: Select all

119 bytes
>C:09a6  a0 01 20 1f  07 84 0f a0  02 b1 0a 29  1f 85 0c b1   .. ........)....
>C:09b6  02 85 0e 29  1f 85 0d a5  0c c5 0d 90  43 d0 31 a5   ...)........C.1.
>C:09c6  0e 29 20 d0  2b a5 0d 69  01 a8 b1 02  51 0a 0a d0   .) .+..i....Q...
>C:09d6  1f 88 c0 02  d0 f4 a4 0d  c8 c8 98 65  02 85 02 a4   ...........e....
>C:09e6  03 90 01 c8  84 03 a5 0e  0a 0a b0 23  c6 0f 90 1f   ...........#....
>C:09f6  a0 00 b1 02  48 c8 b1 02  85 03 68 85  02 c8 d0 af   ....H.....h.....
>C:0a06  a5 0b 85 03  a5 0a 69 02  85 02 90 03  e6 03 18 a4   ......i.........
>C:0a16  0f 98 69 00  4c f5 08

and this is what the source looks like, at the end of my sweet16.a65 file

Code: Select all

	.BYT <BS-1	;C
	.BYT <INR-1	;EX
	.BYT <EXT-1	;D
	.BYT <DCR-1	;FX
	.BYT <PULL-1	;E
	.BYT 0		;UNUSED
	.BYT <PUSH-1	;F
...
BK	RTS
RS	JMP RSZ
RTN	JMP RTNZ
EXT	LDA #R11	; SETZ will use R11 for the indirect transfer
	STA R14H
	ASL
	TAX
	LDY #1		; point to high byte of transfer address
	JSR SETZ
	JMP (r11)	; implement your 6502 extension to Sweet16 here!
PULL	BPL PULLZ	; bra
PUSH	JSR getpstack	; get PETTIL stack pointer
...

; returns just the low 6 bits of a length byte
;inputs
;	N0 (R5) = address
;returns
;	R11 = low bits of length byte at (N0+2)
wordlen16	sty r11+1		; clear high byte
		ldy #2
		lda (n),y
		and #$3f
		sta r11			; R11 = length
		rts

;inputs
;	TOS (R1) = addr-2 of a counted string
;	N0 (R5) = current LFA in the dictionary chain (initially the last LFA)
;	R7 = length of words
;returns
;	C = true if the strings match
strcomp16	ldy n+2		; the length
		iny
		iny
strcomp16a	lda (n),y
		eor (tos),y
		asl		; ignore bit7
		bne strcomp16b	; different? outtie fail
		dey
		cpy #2
		bne strcomp16a	; C flag is set on successful strcmp
		sec		; success!
		.byt $24	; BIT zp instruction
strcomp16b	clc		; fail
		rol R14H	; tell Sweet16 about the C flag
		rts

and how it's invoked from the main program

Code: Select all

;--------------------------------------------------------------
;
;	(FIND)   ( name-2 LFA -- name-2 0 | CFA flag )
;
; * outer interpreter headerless
;
; name-2 is the address-2 of a counted string we are searching for.
; LFA is the LFA at the head of a chain of LFAs
; returns
;	( CFA -1 ) if found normal word
;	( CFA 1 ) if found immediate word
;	( name 0 ) if not found
;
;pfindlfa	.byt $de,$ad
;		.byt (pfind-*-1)|bit7
;		.asc "(FIND",")"|bit7
opfind		brk
		.byt sub | R0
		.byt st | N3
		.byt ld | TOS
		.byt st | N2
		.byt pull
		.byt ld | TOS
		.byt st | N0
		.byt ext
		.word wordlen16
		.byt ld | R11		; set search length
		.byt st | N1
		.byt ld | N2
opfind02	.byt st | N0
		.byt ext
		.word wordlen16
		.byt ld | N1		; search length
		.byt cpr | R11		; dict length
		.byt bnc , <(opfind06-*-2)	; we went past it.  outtie
		.byt bnz , <(opfind03-*-2)	; different lengths, skip
		.byt ext
		.word strcomp16
		.byt bc , <(opfind04-*-2)
opfind03	.byt ldd | N0			; hop
		.byt br , <(opfind02-*-2)
opfind04	;winner!
		.byt ldd | N0		; add 2 to the LFA
		.byt ld | N0		; now it's an NFA
		.byt add | N1		; add the length
		.byt st | TOS		; now it's almost a CFA
		.byt inr | TOS		; add 1, now it's a CFA
		.byt ldi | N0		; fetch dictionary length byte
		.byt sub | N1		; subtract clean length leaving only bits
		.byt set | N0
		.word $80
		.byt dcr | N3		; assume it's a normal word
		.byt cpr | N0
		.byt bz , <(opfind06-*-2)
		.byt inr | N3
		.byt inr | N3
opfind06	.byt push
		.byt ld | N3
		.byt st | TOS
		.byt rtn
     		jmp next

Posted: **Wed Apr 16, 2014 10:52 pm**

I found 12 more bytes and probably shaved a few clocks in Sweet16. </blogpimp>

Posted: **Mon Apr 28, 2014 4:35 pm**

chitselb wrote:

This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?

Unless I am mistaken this is not about discarding opcodes, rather you half the number of accessible registers and double the number of instructions. I think this sounds like an interesting idea.

To expand on this I would propose keeping SWEET-16 instructions as is (though the last 3 instructions can of course be used freely). Instead add a new VM which for sake of argument we can call CUTE-16. SWEET-16 is made for one thing and does it well: pointer handling. Hence the rather large number of registers. What people here on the board request are ALU type instructions and that can be used in CUTE-16. The trick is to use the same 32 bytes register file. Since multiplication is sought after I would suggest a few changes:
- Accumulator is R0
- on invocation the 6502 accumulator goes into low byte of R0
- For multiplication the overflow or most significant part goes into R1
- the B-register is R2 (shades of dual accumulator machine here), with MSP in R3 if needed (not sure if it is though)
- index registers R4 and R5 are loaded from X and Y respectively
- R6 and R7 are spare
- R8 - R15 are not directly visible, though might use a few for for-loops

Posted: **Tue Apr 29, 2014 9:16 pm**

Here's details on the 1802 CPU on which Sweet-16 is based, starting in Chapter 11 on page 888. It powered the Cosmac ELF, which I never got to play with, but apparently Woz did.

So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7 and overlay the instruction set adding several more ALU-oriented register opcodes. In binary, the opcodes would be
0000 yyyy Original non register opcode (for all 16 values of yyyy)
yyyy 0xxx Original Sweet-16 register opcode (for 15 nonzero values of yyyy)
yyyy 1xxx new CUTE-16 opcodes (for 15 nonzero values of yyyy)

I would probably include a register bank switch opcode to toggle between the low R0-R7 registers and the upper R8-R15 registers, because sometimes it might be useful to directly load the program counter with ST 15 (e.g. from a jump table), or to unnest a BS call with POPD R12. Am I doing it wrong? What would the new opcodes be?

Posted: **Thu May 01, 2014 1:41 pm**

Quote:

So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7

In fact I don't see any reason to keep R8-R15 if you can't access them. I don't see a need for so many registers. PC can be either a separate, non-accessible register, or can be remapped to R7 (I'd opt for the first option personally, as jumps tables sure are useful, but not extremely common, and they could be done in native 6502 if such a need were to arise).

Quote:

What would the new opcodes be?

Things like logical operations, and shifts. Some other addressing modes maybe, it'd depends on the needs really.
Also a new opcode to call a native 6502 sub seems necessary at this point (but it would be a non-register opcode of course).

Posted: **Thu May 01, 2014 10:30 pm**

Bregalad wrote:

In fact I don't see any reason to keep R8-R15 if you can't access them. I don't see a need for so many registers. PC can be either a separate, non-accessible register, or can be remapped to R7 (I'd opt for the first option personally, as jumps tables sure are useful, but not extremely common, and they could be done in native 6502 if such a need were to arise).

There are 16 register for historical reasons (that's what the RCA 1802 CPU had). I have never come close to using them all, but the 16 extra bytes for the R8-R15 can be used by something else if I know my CUTE-16 code is never going to go there. The second reason to leave all 16 registers in there is If the PC were in R7, the flags in R6, the compare register in R5, the subroutine stack pointer in R4, and accumulator in R0, well, there goes more than half of your registers. The third reason to leave them all in there is to be able to overlay CUTE-16 on top of SWEET-16 and still have a working SWEET-16.

I'm also hating on the idea of preloading the CUTE-16 registers with e.g. the 6502 A register. If this is needed by an application, the caller could STA R1 (or whatever) prior to invoking CUTE-16. Preloading burdens every call with that overhead even when it isn't required.

Bregalad wrote:

Quote:

What would the new opcodes be?

Things like logical operations, and shifts. Some other addressing modes maybe, it'd depends on the needs really.
Also a new opcode to call a native 6502 sub seems necessary at this point (but it would be a non-register opcode of course).

That is a pretty vague requirement.

Posted: **Fri May 02, 2014 1:37 pm**

Quote:

the caller could STA R1 (or whatever) prior to invoking CUTE-16

This is a waste of 2 bytes

Quote:

Preloading burdens every call with that overhead even when it isn't required

Are you talking about execution time or bytes overhead ?
If there is a time overhead, I don't care since the very idea to use such a VM is to have a time overhead as a price to pay to save bytes.

If there is bytes overhead, then that's a problem. Of course the calling code will be a tiny bit larger (like the 2 bytes I've been mentionning above), but at least they only appear once in the ROM. I might have misunderstand what you're telling me (or the other way arround) though.

Also, in my particular case, I care 0% about compatibility with original SWEET-16 instructions or whatever, I'm just looking for bytes saving. I'm still unsure if I'll start going that route at all when I'll be going to miss bytes through, dropping some unimportant parts of the games is a much easier route than engineering and writing a VM and convert some of 6502 code for that VM.

Posted: **Fri May 02, 2014 10:20 pm**

Looked at in a very general way, Sweet16 is a token-threaded subroutine library that happens to closely mimic the RCA 1802 (Cosmac ELF) processor. You could make something very similar in your game and change something like

LDY SPRITENUMBER
LDX NORTH
JSR MOVESPRITE5PX [7 bytes]

into
.byte mvnorth | SPRITENUMBER [1 byte]

(just by way of example)

6502.org

Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process