Page 5 of 6
Re: Compressing during the compilation process
Posted: Sat May 03, 2014 8:31 pm
by Alienthe
Here's details on the
1802 CPU on which Sweet-16 is based, starting in Chapter 11 on page 888. It powered the
Cosmac ELF, which I never got to play with, but apparently Woz did.
1802 is interesting in a slightly weird way. 1805 improved things with more instructions using a prefix code but was still missing a proper stack, something Woz fixed in his SWEET16.
So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7 and overlay the instruction set adding several more ALU-oriented register opcodes.
More or less, yes, though I haven't decided on specifics though. My hunch is that for ALU heavy work a dual accumulator ISA would be useful. Having 4 index registers is also tempting. Having 4 data registers and 4 index registers reminds me a little of the 68k structure but it does save opcode space. I would also like to change a few of the non-register instructions such as branch to subroutine to take a 16 bit displacement for relocatable code. There is also my hobby horse regarding decrement and branch if not zero, mentioned earlier. To save opcode space I would like to use 2 index registers for auto increment and 2 without. This idea is borrowed from DCPU-16.
I would probably include a register bank switch opcode to toggle between the low R0-R7 registers and the upper R8-R15 registers, because sometimes it might be useful to directly load the program counter with ST 15 (e.g. from a jump table), or to unnest a BS call with POPD R12. Am I doing it wrong? What would the new opcodes be?
That would be pointer handling so a code to enter SWEET-16 might take care of that need.
Re: Compressing during the compilation process
Posted: Sat Jun 07, 2014 1:54 am
by barrym95838
... Also, in my particular case, I care 0% about compatibility with original SWEET-16 instructions or whatever, I'm just looking for bytes saving. I'm still unsure if I'll start going that route at all when I'll be going to miss bytes through, dropping some unimportant parts of the games is a much easier route than engineering and writing a VM and convert some of 6502 code for that VM.
[... bump ...]
I just ran across
THIS today, from a certainly gifted Mr. Schmenk. At first glance, it seems a bit like an alternative to Forth, with similar performance and footprint, but with a syntax that looks a bit more "C-like". Is this something that you or anyone else here might find to be interesting?
Here's a modestly technical description of what PLASMA is about:
https://github.com/dschmenk/PLASMA
Mike
Re: Compressing during the compilation process
Posted: Sat Jun 07, 2014 8:40 am
by BigEd
(Started a
new thread for this subtopic about the PLASMA virtual machine/programming environment)
Re: Compressing during the compilation process
Posted: Sat Jun 14, 2014 7:24 am
by Bregalad
Looks pretty awesome to tell the truth.
However if I want to really use this in the context of my current project, it will be hard anyways since I didn't plan such a thing from the begining. So I don't know how well it'll be doable to mix the tools for programming in PLASMA and those I use currently (WLA-DX assembler suite). At least it sounds like they made it really easy to intermix with native code which is good.
I'll also have to see how small is the runtime, and if there is room to remove unnecessary functions (such as MUL or DIV).
Re: Compressing during the compilation process
Posted: Mon Nov 10, 2014 11:08 am
by Bregalad
So I finally got to program my own VM, this is pretty fun, the only drawback is that it's hard to debug code.
It's inspired by SWEET-16 but I made many changes to suit my own needs. The specifications are still pretty open, as I might remove instructions I don't need and add instructions I need.
The only con is that the interpreter is still a bit too big as opposed to what I'd like (>600 bytes), and it's very slow too (it's between 90 and 190 cycles per instruction). The only way to make the interpreter faster would be to make it bigger, which I obviously don't want to.
The interpreter is based on a register machine. There is 8 registers, called R0-R7, and a single flag, called C.
The registers are 16-bit wide. When switching to/from assembly code, C = C, R0L = A, R1L = X, R2L = Y, and other registers correspond to "Temp" variables.
Unlike SWEET-16, the "PC", the stack pointer and the SREG are not accessible, they are hidden. The stack is the same as the 6502 stack. I could easily make PC visible as R7 for example if this was necessarly, but I doubt this is, so an extra register is probably a nicer thing to have.
The instruction encoding is the same as on SWEET-16, exept that 5 bits are used for opcode and 3 bits for register (instead of 4/4). There is therefore room for 8 global instructions and 31 register instructions (although not all of them have to be used).
Global instructions (1 byte unless specified) :
BRCS - Branch if carry is set
BRCC - Branch if carry is clear
CALL - Call bytecode subroutine (3 bytes)
CASM - Call assembly subroutine (3 bytes)
RETN - Return to caller (agnostic - will automatically return to bytecode or assembly mode)
JUMP - Jump to constant adress (3 bytes)
SETC - Set carry
CLRC - Clear carry
Register instructions (1 byte unless specified) :
CLER - Clear register (Rn := 0)
LDMW - Load memory word (R0 or R1 := (Rn)) (takes 2 instructions slots)
LDMB - Load memory byte (R0 or R1 := (Rn)) (idem)
STMW - Store memory word ((Rn) := R0 or R1) (idem)
STMB - Store memory byte ((Rn := R0 or R1) (idem)
LDIB - Load immediate byte (Rn := imm8) (2 bytes)
LDIW - Load immediate word (Rn := imm16) (3 bytes)
MOVE - Move register (Rn := R0 or R0 := Rn) (takes 2 instruction slots)
INCR - Increment register (Rn+=1)
DECR - Decrement register (Rn-=1)
LSHL - Logical shift left (R0 <<= Rn)
LSHR - Logical shift right (R0 >>= Rn)
LIOR - Logical inclusive OR (R0 |= Rn)
LAND - Logical AND (R0 &= Rn)
LXOR - Logical XOR (R0 ^= Rn)
ADDD - Addition (R0 += Rn)
NEGA - Negate (Rn := -Rn)
PUSH - Push Rn on stack (SP-=2, (SP+1):=Rn)
PULL - Pop Rn from stack (Rn:=(SP+1), SP+=2)
DJNZ - Decrement Rn and jump if non zero (2 bytes)
BRNZ - Branch if Rn is nonzero (2 bytes)
BRZE - Branch if Rn is zero (2 bytes)
BRPL - Branch if Rn is positive (2 bytes)
BRMI - Branch if Rn is negative (2 bytes)
That's it. Thanks to macros it makes it simple to "assemble" those instructions. Mnemonics are 4 letters to be sure not to confuse them with 6502 ones.
Only the ADDD, LSHL, LSHR. CLRC and SETC instructions affects the carry. Please tell me what do you think about this VM. It looks like code reduces to about 72% of it's original size (28% of savings) so it's good.
I'm not sure I'll need all those instructions, especially the load word and store word ones. I could replace this with more awesome adressing modes, for only 8-bit transfers. What do you think ?
It's also strange I toss out the status flags but keep C. In fact I'm so used to use C as an input/output flag for subroutines I just can't get rid of this habbit. Tell me if I'm crazy or if this is OK.
The 16-bits registers are useless most of the time exept when used as a counter or as a pointer, but when it's useful it saves so many bytes it's worth the trouble having them 16-bit all the time.
As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.
Re: Compressing during the compilation process
Posted: Mon Nov 10, 2014 4:02 pm
by BigEd
This is really a post worthy of a new thread. Garth, perhaps you could promote it to be one?
Re: Compressing during the compilation process
Posted: Sat Nov 15, 2014 1:30 am
by barrym95838
... As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.
I'm interested! I love browsing through size-optimized code, especially if it's commented well enough to understand the optimizations.
Mike
Re: Compressing during the compilation process
Posted: Sat Nov 15, 2014 9:39 am
by Bregalad
I uploaded the source
here, please tell me if you have any questions (you should because it's a bit a mess

)
I have special instructions for writing/reading NES' VRAM, and also fetching bytecode from VROM, that is why I have some writes/reads to $2006 and $2007 registers.
Re: Compressing during the compilation process
Posted: Sun Nov 16, 2014 12:17 am
by teamtempest
Playing with the code a bit (actually for too much time altogether

), I think you might have a bug with your coding of LDMW. If the address in the source register is of the form $xxFF, then just incrementing the low byte is going to point to $xx00, not $(xx+1)00. You might have to always do a full 16-bits on both increment and decrement to avoid this.
Re: Compressing during the compilation process
Posted: Sun Nov 16, 2014 12:30 am
by teamtempest
Ah, this is what I've played with so far. Haven't got the time to finish it - only register instructions as yet. It's a bit different than your organization, but not that much. The main difference is in dispatching instructions, which may have the effect of reducing overall size. Also the instruction coding is slightly different, but I imagine macros would cover that up in use.
Code: Select all
InterpreterLoop:
jsr NextByteCode
jsr Dispatch
jmp InterpreterLoop
Dispatch:
tay ; save opcode
and #%11000001 ; isolate register number
beq :+ ; b: non-register instruction
cmp #$80 ; adjust register number
rol
cmp #$80
rol ; register number
asl ; ...times two
+ tax ; =zero if non-register instruction
tya ; recover opcode
and #%00111110 ; isolate instruction (times two)
tay ; offset into dispatch table
lda DoOpcode,y ; push instruction address (minus one) on stack
pha
lda DoOpcode+1,y
pha
rts ; execute instruction
; ----------------------------------------
; destination memory register instructions
; ----------------------------------------
; R0L -> (Rx)
Do_STMB:
lda r0
sta (r0,x)
rts
; Rx -> stack
Do_PUSH:
pla ; discard return address (a constant value)
pla
lda r0+1,x
pha
lda r0,x
pha
jmp InterpreterLoop
; ------------------------------------
; destination Rx register instructions
; ------------------------------------
; stack -> Rx
Do_PULL:
pla ; discard return address (a constant value)
pla
pla
sta r0,x
pla
sta r0+1,x
jmp InterpreterLoop
; Rx EOR $FF -> Rx
Do_NEGA:
lda r0+1,x
eor #$FF
tay
lda r0,x
eor #$FF
jmp W2Rx
; immediate byte -> Rx
Do_LDIB:
jsr NextByteCode
.byte $2C ; skip word
; $0000 -> Rx
Do_CLER:
lda #$00 ; skipped if LDIB instruction
ldy #$00
beq W2Rx
; R0 -> Rx
Do_STR0:
lda r0
ldy r0+1
; A/Y word -> Rx
W2Rx:
sta r0,x
sty r0+1,x
rts
; immediate word -> Rx
Do_LDIW:
jsr NextByteCode
sta r0,x
jsr NextByteCode
sta r0+1,x
rts
; Rx - immediate byte -> Rx
Do_SUBI:
sec
jsr NextByteCode
eor #$FF
ldy #$FF
bmi addd1
; Rx + R0 -> Rx
Do_ADDD:
clc
lda r0
ldy r0+1
addd1:
adc r0,x
sta r0,x
tya
adc r0+1,x
sta r0+1,x
ror c
rts
; Rx + 1 -> Rx
Do_INCR:
inc r0,x
bne +
inc r0+1,x
+ rts
; Rx - 1 -> Rx
Do_DECR:
lda r0,x
bne +
dec r0+1,x
+ dec r0,x
rts
; do shift instruction
; - wonky result if shift value is zero
Do_SHFT:
jsr NextByteCode
tay
bmi shft1
; Rx << immediate byte
- asl r0,x
rol r0+1,x
dey
bne -
beq shft2
; Rx >> immediate byte
shft1:
lsr r0,x
ror r0+1,x
dey
bne shft_rgt
shft2:
ror c
rts
; ------------------------------------
; destination R0 register instructions
; ------------------------------------
; word (Rx) -> R0
Do_LDMW:
jsr Do_INCR
lda (r0,x)
tay
jsr Do_DECR
.byte $2C ; skip word
; byte (Rx) -> R0
Do_LDMB:
ldy #$00 ; skipped if LDMW
lda (r0,x)
jmp W2R0
; Rx -> R0
Do_LDR0:
lda r0,x
ldy r0+1,x
; A/Y word -> R0
W2R0:
sta r0
sty r0+1
rts
; Rx bitwise inclusive OR R0 -> R0
Do_LIOR:
lda r0+1
ora r0+1,x
tay
lda r0
ora r0,x
jmp W2R0
; Rx bitwise AND R0 -> R0
Do_LAND:
lda r0+1
and r0+1,x
tay
lda r0
and r0,x
jmp W2R0
; Rx bitwise exclusive OR R0 -> R0
Do_LXOR:
lda r0+1
eor r0+1,x
tay
lda r0
eor r0,x
jmp W2R0
; -------------------------------
NextByteCode:
inc ByteCodePtr
bne +
inc ByteCodePtr+1
+ bit ByteCodePtr+1
bmi +
lda $2007
rts
+ ldy #$00
lda (ByteCodePtr),y
rts
Re: Compressing during the compilation process
Posted: Sun Nov 16, 2014 11:23 am
by Bregalad
Wow, those are very cool optimisations ! Thank you SOOO much.
Unfortunately my interpreter is still over 600 bytes

(613 to be exact)
The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).
(EDIT : Right now the distance between Do_DJNZ and Do_LXOR is 306, so we are not too far to getting it under 256, to toss the high byte lookup table)
I uploaded my link to show the resulting interpreter.
Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.
Your dispatcher is interesting, however as it saves 8 bytes compared to mine, it needs 8 extra bytes of lookup table, as the non-register instructions are looked up by words. Since the result is neutral I kept mine (but keep in mind that any alternatives are welcome, of course).
Another idea I had was to toss lookup table entierely and write the opcode directly to PCL, so that each instruction has for exemple 8 bytes reserved for it. The problem is that while this would make dispatching simpler and faster, the instructions themselves would be hard to code, and would be a puzzle of jumping in all directions.
At some point I wanted to make the C flag of the interpreter match exactly the C flag of the interpreted code, so I wouldn't need this C location at all. The problem was that the shifts in the dispatching code made this impossible, as well as the comparison of the branching code. If only there was an alternative for those 2 cases, that'd be very cool.
I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so.
I was also wondering if self-modifying code could be useful anywhere in the interpreter, but I didn't think that was the case. Please tell me if I missed something.
As for the instruction themselves, the specs are still very open, I change the instruction as I am writing code for the VM, so it's no big surprise some of them have changed every time.
Re: Compressing during the compilation process
Posted: Sun Nov 16, 2014 11:47 pm
by teamtempest
The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).
You'd still have to be careful to start that window on a page boundary, or else risk crossing a page boundary for the later instructions in the window.
The same page-crossing problem could occur with the global instructions in your split dispatch table. It also might not - it depends on where the global instructions start in a page and their collective length.
Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.
So I did. There's also an error in the dispatch routine, which should push the high byte first (and actually will as it is if the addresses are stored MSB/LSB instead of LSB/MSB).
I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so
Here's one way that doesn't require either the 'tay' nor manually saving for each branch instruction. The idea is not to pay attention to the branch value until it's actually needed. This code also might be a bit shorter:
Code: Select all
; --------------------------
; global branch instructions
; --------------------------
; branch if carry set
Do_BRCS:
lda #$00
.byte $2C ; skip next instruction
; branch if carry clear
Do_BRCC:
lda #$80 ; skipped if BRCS
eor c ; sets high bit if condition true
; branch if negative
BranchMinus:
bpl ConditionFalse
; branch on condition true
ConditionTrue:
ldx #$00
jsr NextByteCode ; get branch value
clc
bpl + ; b: forward branch
eor #$FF ; backward branch
dex ; = $FF
sec
+ adc ByteCodePointer
sta ByteCodePointer
txa
adc ByteCodePointer+1
sta ByteCodePointer+1
rts
; ----------------------------
; register branch instructions
; ----------------------------
; decrement Rx and branch if not zero
Do_DJNZ:
jsr Do_DECR
; branch if Rx not zero
Do_BRNZ:
lda r0,x
ora r0+1,x
; branch if not zero
BranchNotZero:
bne ConditionTrue
; false condition
ConditionFalse:
jmp NextByteCode ; discard branch value
; branch if Rx zero
Do_BRZE:
lda r0,x
ora r0+1,x
; branch if zero
BranchZero:
beq ConditionTrue
bne ConditionFalse
; branch if Rx positive
; - checks only low byte
Do_BRPL:
lda r0,x
bpl ConditionTrue
bmi ConditionFalse
; branch if Rx negative
; - checks only low byte
Do_BRMI:
lda r0,x
jmp BranchMinus
; branch if R0 != Rx
Do_BRNE:
lda r0,x
cmp r0
bne ConditionTrue
lda r0+1,x
cmp r0+1
jmp BranchNotZero
; branch if R0 == Rx
Do_BREQ:
lda r0,x
cmp r0
bne ConditionFalse
lda r0+1,x
cmp r0+1
jmp BranchZero
The routines accessing VRAM have common code that can be factored out. It doesn't really save much code space overall and increases the cycle count, but the common code can be placed outside the 256-byte window you're looking for:
Code: Select all
; R0L -> (Rx)
Do_STVB:
jsr SetVRAMAddr
lda r0
sta $2007
jmp Do_INCR
and
Code: Select all
; byte (Rx) -> R0
Do_LDVB:
jsr SetVRAMAddr
lda $2007
lda $2007
jsr Do_INCR
ldy #$00
beq W2R0
and
Code: Select all
; Rx -> VRAM registers
SetVRAMAddr:
bit $2002
lda r0+1,x
sta $2006
lda r0,x
sta $2006
rts
Re: Compressing during the compilation process
Posted: Mon Nov 17, 2014 1:28 pm
by Bregalad
Very cool, now the interpreter is 578 bytes, and I have some ideas for other further size optimisations.
There is no need for that eor #$ff and sec in the branching code, just dex is enough. And it's very cool to do it like this, I have spend a few hours on this piece of code without ever finding an acceptable solution, when it was in fact so simple.
I am not sure if I will keep this C flag in the interpreted code. It seems to take too much place, I will see in my ByteCode if this is absolutely nessesarly, and if I can somehow get arround it I will completely remove it, and add cooler adressing modes with auto-increment instead, which seems more useful to reach extreme compactness.
PS : 14 more saved bytes and I'll be able to remove the high byte from the lookup table. I'll probably remove instructions I don't need such as XOR so this is very likely to finally become possible !
PPS : I hope you don't mind that I don't use the same assember as you do (they use different syntax)
Re: Compressing during the compilation process
Posted: Mon Nov 17, 2014 2:23 pm
by teamtempest
There is no need for that eor #$ff and sec in the branching code, just dex is enough.
Yup - it did eventually occur to me that sign extension was all that was needed, but not before you caught it!
Re: Compressing during the compilation process
Posted: Fri May 20, 2016 8:49 pm
by Bregalad
So, after a long time of inactivity, I went back to THAT project again. I made some small but significant changes to the bytecode system, I deleted some instructions that weren't used anyway, and I ... doccumented them. TADA !
Documentation of the bytecode interpreter: http://dl.dropboxusercontent.com/u/2346 ... de_doc.txt
Source code of the bytecode interpreter: http://dl.dropboxusercontent.com/u/2346 ... tecode.asm
Now everything is pretty much completed BUT my instruction jump stable still needs to be 2 bytes wide because the critical part (between .block "interpreter_critical" and .endb in the source code) is still 277 bytes long. It needs to reach 255 bytes or less. I doubt this is possible so I'll have to resort to an explicit jump table (using a chain of JMP instructions) for the first couple of instructions I guess
If anyone has suggetions or improvements, I am all ears.
Since I deleted some instructions, there is now room for other new instructions if they are useful. This would increase the size of the interpreter, but in some cases it might be worth it. I am particularly thinking about adding more adressing modes than just register+post-increment, but I don't know which ones. A zero page constant location adressing mode would be nice to have, but I don't know how to interpret it : I'll have to need two dummy bytes in zero page just for that, which is an annoyance.