Page 4 of 6
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 12:10 am
by White Flame
A big problem of the register machine is that, after a call, you don't know what to expect in your regs. Even if it was a call to 6502 code, it could have itself called another VM code, and override the regs. Using a sliding register window like AcheronVM solves this, but is a source of bloat in RAM usage and VM itself.
It's not a source of memory bloat. If intermediate state needs to be saved, it will take up RAM, full stop. The mechanism by which that happens (manual, vs part of function call semantics, vs automatic allocation in stacks) is kind of irrelevant to the presence of that load on the memory footprint.
Regarding AcheronVM, most of its features actually were 6502 implementation details that I leaked out to useful features the VM itself, meaning that such features happen "for free", like the carry stack and various 8-bit address indexing modes. The additional effect on VM code footprint by having a sliding register window in comparison to stack-based operations or accessing an orthgonally indexed register bank, is likely under a dozen bytes and could be zero depending on your point of reference.
I do think you need to do gain a bit more understanding about the concepts of execution complexity and code representation, and how those affect the real tradeoffs in various approaches (hardcoded, stack-based, register files, etc). There are no silver bullets. However, the particular downsides you back away from may not really affect things much: Because of the low level of abstraction of 6502 code, many of the directions discussed here will be very advantageous regardless of perceived downsides.
One thing that I think is certain, however, is that you will need to perform some major rearchitecting of your project. In my personal opinion, your project does not sound big enough to be fearful of refactoring & porting, so go for it. You will end up raising the abstraction level of your code by most of these techniques, simplifying the authoring processes.
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 12:52 am
by barrym95838
Sweet16's 'ADD R0' encodes into a single byte, and does a pretty reasonable 16-bit-wide 'ASL A' ... as you probably already know, the source is widely available and not that hard to custom-tailor to your uses, after you recognize some of Woz's coding quirks. I believe that he was the first to use the 'RTS' dispatch technique on the 6502.
It should even be possible to change his 4/4 op-code/register split to 5/3, should you decide that eight registers are enough (despite Jeff's urging to try it, I can't wrap my aging head around coding for a machine with so many registers).
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 2:23 pm
by chitselb
If I ever go as far as using a VM, I'd probably engine my own one instead of using a pre-existing one.
As far I can tell, there is the following "problems" with SWEET16 (not they are actual problems, but could be sources of inefficiencies) :
- No way to call 6502 code directly
- No way to call SWEET16 code directly from 6502, you'll have to either switch mode and call, or call and switch mode
To call 6502 code directly from Sweet16, you could just use one of the unimplemented Sweet16 opcodes followed by a two-byte address to make a new Sweet16 instruction. You've got $0D, $0E, and $0F available. Let's use the $0D opcode and call our new instruction "EXT" (for extend). Assuming an otherwise
stock Sweet16 it could be as simple as hooking up the vector and adding this:
Code: Select all
EXT LDX #0 ; so SETZ will use R0 for the indirect
LDY #1 ; point to high byte of transfer address
JSR SETZ
JMP (R0) ; implement your 6502 extension to Sweet16 here!
; USE RTS FROM NATIVE EXTENSION ROUTINE TO RETURN TO SWEET16
To get around the issue of JSR SW16 consuming three bytes, I jacked my BRK vector to call Sweet16, so it only takes one byte. In my situation, this has the added advantage that the PET ROM code throws all 6502 registers on the machine stack prior to sending control to Sweet16
- No way to load a 8-bit constant in a register (this means you'll waste a $00 in ROM for the high byte)
This one's a little more tricky. Why not use another unimplemented opcode for that? I propose calling it SETB (set byte). Since there's no room for the register nybble, it always loads the byte following the $0E (SETB) instruction into R0.
- No shift operations. I use them so often, it's a major source of inefficiency in native 6502 when you have to do ASL/ROL or LSR/ROR in order to shift a 16-bit value, how could they leave them out ?
- No AND/OR/XOR operations. In several cases you could simulate them with ADD or SUB, but not always.
Easily done with the new Sweet16 EXT opcode
- Use a separate stack instead of 6502 stack
- 15 registers sounds a bit like overkill (although it's hard to tell before using them)
- Don't like the fact it tries to simulate the status flags of a processor. Why no higher level branches, like "branch if this is greater than this", or things like that
Don't know how to fix these for you, which seem more a matter of personal preference. As a Forth programmer, I tend to think of separate stacks as a strong advantage.
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 2:26 pm
by chitselb
Sweet16's 'ADD R0' encodes into a single byte, and does a pretty reasonable 16-bit-wide 'ASL A' ... as you probably already know, the source is widely available and not that hard to custom-tailor to your uses, after you recognize some of Woz's coding quirks. I believe that he was the first to use the 'RTS' dispatch technique on the 6502.
It should even be possible to change his 4/4 op-code/register split to 5/3, should you decide that eight registers are enough (despite Jeff's urging to try it, I can't wrap my aging head around coding for a machine with so many registers).
This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 3:20 pm
by Bregalad
This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?
AND, OR, XOR, LSR, ASL, LD and ST indirect without post-increment (when writing to a serial port).
The problem is that implementing more instructions also means the VM takes more space, killing its benefits of saving bytes.
@chitselb : yeah, using $0b and $0e that way is clever. But that doesn't fix the lack of LSR and logical operations. Of course they can be done easily switching back to 6502.
If intermediate state needs to be saved, it will take up RAM, full stop
Sounds like Acheron VM saves 16 16-bit registers (that's 32 bytes) each time, no matter if you use them or not (and no matter if you must save all of them or not). Please correct me if I misunderstood that, if I did then the RAM usage isn't bloated.
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 3:22 pm
by barrym95838
Well, I'm kinda thinking that I would want AND, OR, EOR, LSR, BRL, BSL, ROL, ROR, MUL, DIV, MOD, and a way to more gracefully call a 6502 subroutine without having to always do something like 'RTN : LDA $00 : LDX $02 : LDY $04 : JSR Native : STA $00 : STX $02 : STY $04 : JSR Sweet16'. A new instruction (call it 'J02' or something) could be just three bytes long and would do all of that for you. If I could get all of that at the cost of losing half of my registers, I would be tempted to do so.
Mike
Re: Compressing during the compilation process
Posted: Fri Mar 28, 2014 4:05 pm
by White Flame
Sounds like Acheron VM saves 16 16-bit registers (that's 32 bytes) each time, no matter if you use them or not (and no matter if you must save all of them or not). Please correct me if I misunderstood that, if I did then the RAM usage isn't bloated.
The register stack is sort of like C stack frames. You tell it how much to slide. On function entry, you can say "grow 3" and you'll have slid the register window by 3 registers. If you use "mgrow 3", then it marks the current window location and a "retm" (instead of "ret") at the end of the function will automatically slide the register window back.
Regarding SWEET16, I do think that it's better for the
global VM design to just drop to 6502 mode rather than call 6502 subroutines directly. 6502 code is simply too open-ended in terms of preconditions that you're better off setting up the call from 6502 code. SWEET16 itself can't really be general-purpose and have a fixed global 6502 calling convention, but your specific implementation can (for instance, solely using SW16 registers as parameters), so I think it's rightly left to extensions. Also note that the SWEET16 dispatch requires all instruction handlers to be on a single page, so making complex instructions which pass state between 6502 and SW16 registers might go beyond that and require additional JMP stumps.
Are you planning on releasing source to your project? All of this is pretty contextual to the particulars you're working with, and it's hard to know what will work best for your code base. It's next to impossible to take many of the things from this thread as global, gospel, constant truth, equally applicable to all projects ever.
Re: Compressing during the compilation process
Posted: Tue Apr 01, 2014 6:53 pm
by chitselb
Well, I'm kinda thinking that I would want AND, OR, EOR, LSR, BRL, BSL, ROL, ROR, MUL, DIV, MOD, and a way to more gracefully call a 6502 subroutine without having to always do something like 'RTN : LDA $00 : LDX $02 : LDY $04 : JSR Native : STA $00 : STX $02 : STY $04 : JSR Sweet16'. A new instruction (call it 'J02' or something) could be just three bytes long and would do all of that for you. If I could get all of that at the cost of losing half of my registers, I would be tempted to do so.
Mike
Adding code to Sweet16 is kind of a toughie, but it's doable. For general purpose operators, I'd flag it with one of the single-byte opcodes ($0x) followed by a second opcode with a register, e.g. ANDD R3 (for AND double) would be $0e $13, and ORD R10 (OR double) would be $0e $2a. I didn't need to do that, but I managed to shrink my (FIND) word down by 31 bytes and it's still pretty peppy. I haven't benchmarked it yet vs. a real Forth
Code: Select all
33 bytes (wordlen16 and strcomp16 6502)
>C:0635 84 17 a0 02 b1 0a 29 3f 85 16 60 a4 0c c8 c8 b1 ......)?..`.....
>C:0645 0a 51 02 0a d0 07 88 c0 02 d0 f4 38 24 18 26 1d .Q.........8$.&.
>C:0655 60
55 bytes (opfind sweet16 with ext)
>C:0956 00 b0 38 21 37 0e 21 35 0d 35 06 2b 36 27 35 0d ..8!7.!5.5.+6'5.
>C:0966 35 06 26 db 02 1a 07 05 0d 40 06 03 03 65 01 ee 5.&......@...e..
>C:0976 65 25 a6 31 e1 45 b6 15 80 00 f8 d5 06 02 e8 e8 e%.1.E..........
>C:0986 0f 28 31 00 4c 80 00
88 bytes totatl
Originally was all 6502
Code: Select all
119 bytes
>C:09a6 a0 01 20 1f 07 84 0f a0 02 b1 0a 29 1f 85 0c b1 .. ........)....
>C:09b6 02 85 0e 29 1f 85 0d a5 0c c5 0d 90 43 d0 31 a5 ...)........C.1.
>C:09c6 0e 29 20 d0 2b a5 0d 69 01 a8 b1 02 51 0a 0a d0 .) .+..i....Q...
>C:09d6 1f 88 c0 02 d0 f4 a4 0d c8 c8 98 65 02 85 02 a4 ...........e....
>C:09e6 03 90 01 c8 84 03 a5 0e 0a 0a b0 23 c6 0f 90 1f ...........#....
>C:09f6 a0 00 b1 02 48 c8 b1 02 85 03 68 85 02 c8 d0 af ....H.....h.....
>C:0a06 a5 0b 85 03 a5 0a 69 02 85 02 90 03 e6 03 18 a4 ......i.........
>C:0a16 0f 98 69 00 4c f5 08
and this is what the source looks like, at the end of my sweet16.a65 file
Code: Select all
.BYT <BS-1 ;C
.BYT <INR-1 ;EX
.BYT <EXT-1 ;D
.BYT <DCR-1 ;FX
.BYT <PULL-1 ;E
.BYT 0 ;UNUSED
.BYT <PUSH-1 ;F
...
BK RTS
RS JMP RSZ
RTN JMP RTNZ
EXT LDA #R11 ; SETZ will use R11 for the indirect transfer
STA R14H
ASL
TAX
LDY #1 ; point to high byte of transfer address
JSR SETZ
JMP (r11) ; implement your 6502 extension to Sweet16 here!
PULL BPL PULLZ ; bra
PUSH JSR getpstack ; get PETTIL stack pointer
...
; returns just the low 6 bits of a length byte
;inputs
; N0 (R5) = address
;returns
; R11 = low bits of length byte at (N0+2)
wordlen16 sty r11+1 ; clear high byte
ldy #2
lda (n),y
and #$3f
sta r11 ; R11 = length
rts
;inputs
; TOS (R1) = addr-2 of a counted string
; N0 (R5) = current LFA in the dictionary chain (initially the last LFA)
; R7 = length of words
;returns
; C = true if the strings match
strcomp16 ldy n+2 ; the length
iny
iny
strcomp16a lda (n),y
eor (tos),y
asl ; ignore bit7
bne strcomp16b ; different? outtie fail
dey
cpy #2
bne strcomp16a ; C flag is set on successful strcmp
sec ; success!
.byt $24 ; BIT zp instruction
strcomp16b clc ; fail
rol R14H ; tell Sweet16 about the C flag
rts
and how it's invoked from the main program
Code: Select all
;--------------------------------------------------------------
;
; (FIND) ( name-2 LFA -- name-2 0 | CFA flag )
;
; * outer interpreter headerless
;
; name-2 is the address-2 of a counted string we are searching for.
; LFA is the LFA at the head of a chain of LFAs
; returns
; ( CFA -1 ) if found normal word
; ( CFA 1 ) if found immediate word
; ( name 0 ) if not found
;
;pfindlfa .byt $de,$ad
; .byt (pfind-*-1)|bit7
; .asc "(FIND",")"|bit7
opfind brk
.byt sub | R0
.byt st | N3
.byt ld | TOS
.byt st | N2
.byt pull
.byt ld | TOS
.byt st | N0
.byt ext
.word wordlen16
.byt ld | R11 ; set search length
.byt st | N1
.byt ld | N2
opfind02 .byt st | N0
.byt ext
.word wordlen16
.byt ld | N1 ; search length
.byt cpr | R11 ; dict length
.byt bnc , <(opfind06-*-2) ; we went past it. outtie
.byt bnz , <(opfind03-*-2) ; different lengths, skip
.byt ext
.word strcomp16
.byt bc , <(opfind04-*-2)
opfind03 .byt ldd | N0 ; hop
.byt br , <(opfind02-*-2)
opfind04 ;winner!
.byt ldd | N0 ; add 2 to the LFA
.byt ld | N0 ; now it's an NFA
.byt add | N1 ; add the length
.byt st | TOS ; now it's almost a CFA
.byt inr | TOS ; add 1, now it's a CFA
.byt ldi | N0 ; fetch dictionary length byte
.byt sub | N1 ; subtract clean length leaving only bits
.byt set | N0
.word $80
.byt dcr | N3 ; assume it's a normal word
.byt cpr | N0
.byt bz , <(opfind06-*-2)
.byt inr | N3
.byt inr | N3
opfind06 .byt push
.byt ld | N3
.byt st | TOS
.byt rtn
jmp next
Re: Compressing during the compilation process
Posted: Wed Apr 16, 2014 10:52 pm
by chitselb
Re: Compressing during the compilation process
Posted: Mon Apr 28, 2014 4:35 pm
by Alienthe
This 5/3 idea is mind-boggling. Which single-byte opcodes would you discard or convert to a register-associated instruction? What new opcodes would you implement?
Unless I am mistaken this is not about discarding opcodes, rather you half the number of accessible registers and double the number of instructions. I think this sounds like an interesting idea.
To expand on this I would propose keeping SWEET-16 instructions as is (though the last 3 instructions can of course be used freely). Instead add a new VM which for sake of argument we can call CUTE-16. SWEET-16 is made for one thing and does it well: pointer handling. Hence the rather large number of registers. What people here on the board request are ALU type instructions and that can be used in CUTE-16. The trick is to use the same 32 bytes register file. Since multiplication is sought after I would suggest a few changes:
- Accumulator is
R0
- on invocation the 6502 accumulator goes into low byte of
R0
- For multiplication the overflow or most significant part goes into
R1
- the B-register is
R2 (shades of dual accumulator machine here), with MSP in
R3 if needed (not sure if it is though)
- index registers
R4 and
R5 are loaded from X and Y respectively
-
R6 and
R7 are spare
-
R8 -
R15 are not directly visible, though might use a few for for-loops
Re: Compressing during the compilation process
Posted: Tue Apr 29, 2014 9:16 pm
by chitselb
Here's details on the
1802 CPU on which Sweet-16 is based, starting in Chapter 11 on page 888. It powered the
Cosmac ELF, which I never got to play with, but apparently Woz did.
So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7 and overlay the instruction set adding several more ALU-oriented register opcodes. In binary, the opcodes would be
0000 yyyy Original non register opcode (for all 16 values of yyyy)
yyyy 0xxx Original Sweet-16 register opcode (for 15 nonzero values of yyyy)
yyyy 1xxx new CUTE-16 opcodes (for 15 nonzero values of yyyy)
I would probably include a register bank switch opcode to toggle between the low R0-R7 registers and the upper R8-R15 registers, because sometimes it might be useful to directly load the program counter with ST 15 (e.g. from a jump table), or to unnest a BS call with POPD R12. Am I doing it wrong? What would the new opcodes be?
Re: Compressing during the compilation process
Posted: Thu May 01, 2014 1:41 pm
by Bregalad
So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7
In fact I don't see any reason to keep R8-R15 if you can't access them. I don't see a need for so many registers. PC can be either a separate, non-accessible register, or can be remapped to R7 (I'd opt for the first option personally, as jumps tables sure are useful, but not extremely common, and they could be done in native 6502 if such a need were to arise).
What would the new opcodes be?
Things like logical operations, and shifts. Some other addressing modes maybe, it'd depends on the needs really.
Also a new opcode to call a native 6502 sub seems necessary at this point (but it would be a non-register opcode of course).
Re: Compressing during the compilation process
Posted: Thu May 01, 2014 10:30 pm
by chitselb
In fact I don't see any reason to keep R8-R15 if you can't access them. I don't see a need for so many registers. PC can be either a separate, non-accessible register, or can be remapped to R7 (I'd opt for the first option personally, as jumps tables sure are useful, but not extremely common, and they could be done in native 6502 if such a need were to arise).
There are 16 register for historical reasons (that's what the RCA 1802 CPU had). I have never come close to using them all, but the 16 extra bytes for the R8-R15 can be used by something else if I know my CUTE-16 code is never going to go there. The second reason to leave all 16 registers in there is If the PC were in R7, the flags in R6, the compare register in R5, the subroutine stack pointer in R4, and accumulator in R0, well, there goes more than half of your registers. The third reason to leave them all in there is to be able to overlay CUTE-16 on top of SWEET-16 and still have a working SWEET-16.
I'm also hating on the idea of preloading the CUTE-16 registers with e.g. the 6502 A register. If this is needed by an application, the caller could STA R1 (or whatever) prior to invoking CUTE-16. Preloading burdens every call with that overhead even when it isn't required.
What would the new opcodes be?
Things like logical operations, and shifts. Some other addressing modes maybe, it'd depends on the needs really.
Also a new opcode to call a native 6502 sub seems necessary at this point (but it would be a non-register opcode of course).
That is a pretty vague requirement.
Re: Compressing during the compilation process
Posted: Fri May 02, 2014 1:37 pm
by Bregalad
the caller could STA R1 (or whatever) prior to invoking CUTE-16
This is a waste of 2 bytes
Preloading burdens every call with that overhead even when it isn't required
Are you talking about execution time or bytes overhead ?
If there is a time overhead, I don't care since the very idea to use such a VM is to have a time overhead as a price to pay to save bytes.
If there is bytes overhead, then that's a problem. Of course the calling code will be a tiny bit larger (like the 2 bytes I've been mentionning above), but at least they only appear once in the ROM. I might have misunderstand what you're telling me (or the other way arround) though.
Also, in my particular case, I care 0% about compatibility with original SWEET-16 instructions or whatever, I'm just looking for bytes saving. I'm still unsure if I'll start going that route at all when I'll be going to miss bytes through, dropping some unimportant parts of the games is a much easier route than engineering and writing a VM and convert some of 6502 code for that VM.
Re: Compressing during the compilation process
Posted: Fri May 02, 2014 10:20 pm
by chitselb
Looked at in a very general way, Sweet16 is a token-threaded subroutine library that happens to closely mimic the RCA 1802 (Cosmac ELF) processor. You could make something very similar in your game and change something like
LDY SPRITENUMBER
LDX NORTH
JSR MOVESPRITE5PX [7 bytes]
into
.byte mvnorth | SPRITENUMBER [1 byte]
(just by way of example)