6502.org

Posted: **Sat May 03, 2014 8:31 pm**

chitselb wrote:

Here's details on the 1802 CPU on which Sweet-16 is based, starting in Chapter 11 on page 888. It powered the Cosmac ELF, which I never got to play with, but apparently Woz did.

1802 is interesting in a slightly weird way. 1805 improved things with more instructions using a prefix code but was still missing a proper stack, something Woz fixed in his SWEET16.

Quote:

So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7 and overlay the instruction set adding several more ALU-oriented register opcodes.

More or less, yes, though I haven't decided on specifics though. My hunch is that for ALU heavy work a dual accumulator ISA would be useful. Having 4 index registers is also tempting. Having 4 data registers and 4 index registers reminds me a little of the 68k structure but it does save opcode space. I would also like to change a few of the non-register instructions such as branch to subroutine to take a 16 bit displacement for relocatable code. There is also my hobby horse regarding decrement and branch if not zero, mentioned earlier. To save opcode space I would like to use 2 index registers for auto increment and 2 without. This idea is borrowed from DCPU-16.

Quote:

I would probably include a register bank switch opcode to toggle between the low R0-R7 registers and the upper R8-R15 registers, because sometimes it might be useful to directly load the program counter with ST 15 (e.g. from a jump table), or to unnest a BS call with POPD R12. Am I doing it wrong? What would the new opcodes be?

That would be pointer handling so a code to enter SWEET-16 might take care of that need.

Posted: **Sat Jun 07, 2014 1:54 am**

Bregalad wrote:

... Also, in my particular case, I care 0% about compatibility with original SWEET-16 instructions or whatever, I'm just looking for bytes saving. I'm still unsure if I'll start going that route at all when I'll be going to miss bytes through, dropping some unimportant parts of the games is a much easier route than engineering and writing a VM and convert some of 6502 code for that VM.

[... bump ...]

I just ran across THIS today, from a certainly gifted Mr. Schmenk. At first glance, it seems a bit like an alternative to Forth, with similar performance and footprint, but with a syntax that looks a bit more "C-like". Is this something that you or anyone else here might find to be interesting?

Here's a modestly technical description of what PLASMA is about:

https://github.com/dschmenk/PLASMA

Mike

Posted: **Sat Jun 07, 2014 8:40 am**

(Started a new thread for this subtopic about the PLASMA virtual machine/programming environment)

Posted: **Sat Jun 14, 2014 7:24 am**

Looks pretty awesome to tell the truth.

However if I want to really use this in the context of my current project, it will be hard anyways since I didn't plan such a thing from the begining. So I don't know how well it'll be doable to mix the tools for programming in PLASMA and those I use currently (WLA-DX assembler suite). At least it sounds like they made it really easy to intermix with native code which is good.

I'll also have to see how small is the runtime, and if there is room to remove unnecessary functions (such as MUL or DIV).

Posted: **Mon Nov 10, 2014 11:08 am**

So I finally got to program my own VM, this is pretty fun, the only drawback is that it's hard to debug code.
It's inspired by SWEET-16 but I made many changes to suit my own needs. The specifications are still pretty open, as I might remove instructions I don't need and add instructions I need.
The only con is that the interpreter is still a bit too big as opposed to what I'd like (>600 bytes), and it's very slow too (it's between 90 and 190 cycles per instruction). The only way to make the interpreter faster would be to make it bigger, which I obviously don't want to.

The interpreter is based on a register machine. There is 8 registers, called R0-R7, and a single flag, called C.

The registers are 16-bit wide. When switching to/from assembly code, C = C, R0L = A, R1L = X, R2L = Y, and other registers correspond to "Temp" variables.

Unlike SWEET-16, the "PC", the stack pointer and the SREG are not accessible, they are hidden. The stack is the same as the 6502 stack. I could easily make PC visible as R7 for example if this was necessarly, but I doubt this is, so an extra register is probably a nicer thing to have.

The instruction encoding is the same as on SWEET-16, exept that 5 bits are used for opcode and 3 bits for register (instead of 4/4). There is therefore room for 8 global instructions and 31 register instructions (although not all of them have to be used).

Global instructions (1 byte unless specified) :
BRCS - Branch if carry is set
BRCC - Branch if carry is clear
CALL - Call bytecode subroutine (3 bytes)
CASM - Call assembly subroutine (3 bytes)
RETN - Return to caller (agnostic - will automatically return to bytecode or assembly mode)
JUMP - Jump to constant adress (3 bytes)
SETC - Set carry
CLRC - Clear carry

Register instructions (1 byte unless specified) :
CLER - Clear register (Rn := 0)
LDMW - Load memory word (R0 or R1 := (Rn)) (takes 2 instructions slots)
LDMB - Load memory byte (R0 or R1 := (Rn)) (idem)
STMW - Store memory word ((Rn) := R0 or R1) (idem)
STMB - Store memory byte ((Rn := R0 or R1) (idem)
LDIB - Load immediate byte (Rn := imm8) (2 bytes)
LDIW - Load immediate word (Rn := imm16) (3 bytes)
MOVE - Move register (Rn := R0 or R0 := Rn) (takes 2 instruction slots)
INCR - Increment register (Rn+=1)
DECR - Decrement register (Rn-=1)
LSHL - Logical shift left (R0 <<= Rn)
LSHR - Logical shift right (R0 >>= Rn)
LIOR - Logical inclusive OR (R0 |= Rn)
LAND - Logical AND (R0 &= Rn)
LXOR - Logical XOR (R0 ^= Rn)
ADDD - Addition (R0 += Rn)
NEGA - Negate (Rn := -Rn)
PUSH - Push Rn on stack (SP-=2, (SP+1):=Rn)
PULL - Pop Rn from stack (Rn:=(SP+1), SP+=2)
DJNZ - Decrement Rn and jump if non zero (2 bytes)
BRNZ - Branch if Rn is nonzero (2 bytes)
BRZE - Branch if Rn is zero (2 bytes)
BRPL - Branch if Rn is positive (2 bytes)
BRMI - Branch if Rn is negative (2 bytes)

That's it. Thanks to macros it makes it simple to "assemble" those instructions. Mnemonics are 4 letters to be sure not to confuse them with 6502 ones.

Only the ADDD, LSHL, LSHR. CLRC and SETC instructions affects the carry. Please tell me what do you think about this VM. It looks like code reduces to about 72% of it's original size (28% of savings) so it's good.

I'm not sure I'll need all those instructions, especially the load word and store word ones. I could replace this with more awesome adressing modes, for only 8-bit transfers. What do you think ?

It's also strange I toss out the status flags but keep C. In fact I'm so used to use C as an input/output flag for subroutines I just can't get rid of this habbit. Tell me if I'm crazy or if this is OK.

The 16-bits registers are useless most of the time exept when used as a counter or as a pointer, but when it's useful it saves so many bytes it's worth the trouble having them 16-bit all the time.

As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.

Posted: **Mon Nov 10, 2014 4:02 pm**

This is really a post worthy of a new thread. Garth, perhaps you could promote it to be one?

Posted: **Sat Nov 15, 2014 1:30 am**

Bregalad wrote:

... As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.

I'm interested! I love browsing through size-optimized code, especially if it's commented well enough to understand the optimizations.

Mike

Posted: **Sat Nov 15, 2014 9:39 am**

I uploaded the source here, please tell me if you have any questions (you should because it's a bit a mess

)

I have special instructions for writing/reading NES' VRAM, and also fetching bytecode from VROM, that is why I have some writes/reads to $2006 and $2007 registers.

Posted: **Sun Nov 16, 2014 12:17 am**

Playing with the code a bit (actually for too much time altogether

), I think you might have a bug with your coding of LDMW. If the address in the source register is of the form $xxFF, then just incrementing the low byte is going to point to $xx00, not $(xx+1)00. You might have to always do a full 16-bits on both increment and decrement to avoid this.

Posted: **Sun Nov 16, 2014 12:30 am**

Ah, this is what I've played with so far. Haven't got the time to finish it - only register instructions as yet. It's a bit different than your organization, but not that much. The main difference is in dispatching instructions, which may have the effect of reducing overall size. Also the instruction coding is slightly different, but I imagine macros would cover that up in use.

Code: Select all

InterpreterLoop:
	jsr NextByteCode
	jsr Dispatch
	jmp InterpreterLoop

Dispatch:
	tay		; save opcode
	and #%11000001	; isolate register number
	beq :+		; b: non-register instruction
	cmp #$80	; adjust register number
	rol
	cmp #$80
	rol		; register number
	asl		; ...times two
 +	tax		; =zero if non-register instruction
	tya		; recover opcode
	and #%00111110	; isolate instruction (times two)
	tay		; offset into dispatch table
 	lda DoOpcode,y	; push instruction address (minus one) on stack
	pha
	lda DoOpcode+1,y
	pha
	rts		; execute instruction

; ----------------------------------------
; destination memory register instructions
; ----------------------------------------

; R0L -> (Rx)

Do_STMB:
	lda r0
	sta (r0,x)
	rts

; Rx -> stack

Do_PUSH:
	pla		; discard return address (a constant value)
	pla
	lda r0+1,x
	pha
	lda r0,x
	pha
	jmp InterpreterLoop

; ------------------------------------
; destination Rx register instructions
; ------------------------------------

; stack -> Rx

Do_PULL:
	pla		; discard return address (a constant value)
	pla
	pla
	sta r0,x
	pla
	sta r0+1,x
	jmp InterpreterLoop
	
; Rx EOR $FF -> Rx

Do_NEGA:
	lda r0+1,x
	eor #$FF
	tay
	lda r0,x
	eor #$FF
	jmp W2Rx

; immediate byte -> Rx

Do_LDIB:
	jsr NextByteCode
	.byte	$2C	; skip word

; $0000 -> Rx

Do_CLER:
	lda #$00	; skipped if LDIB instruction
	ldy #$00
	beq W2Rx

; R0 -> Rx

Do_STR0:
	lda r0
	ldy r0+1

; A/Y word -> Rx

W2Rx:
	sta r0,x
	sty r0+1,x
	rts

; immediate word -> Rx

Do_LDIW:
	jsr NextByteCode
	sta r0,x
	jsr NextByteCode
	sta r0+1,x
	rts

; Rx - immediate byte -> Rx
	
Do_SUBI:
	sec
	jsr NextByteCode
	eor #$FF
	ldy #$FF
	bmi addd1
	
; Rx + R0 -> Rx

Do_ADDD:
	clc
	lda r0
	ldy r0+1

addd1:
	adc r0,x
	sta r0,x
	tya
	adc r0+1,x
	sta r0+1,x
	ror c
	rts

; Rx + 1 -> Rx

Do_INCR:
	inc r0,x
	bne +
	inc r0+1,x
 +	rts

; Rx - 1 -> Rx

Do_DECR:
	lda r0,x
	bne +
	dec r0+1,x
+	dec r0,x
	rts

; do shift instruction
; - wonky result if shift value is zero

Do_SHFT:
	jsr NextByteCode
	tay
	bmi shft1

; Rx << immediate byte

 -	asl r0,x
	rol r0+1,x
	dey
	bne -
	beq shft2

; Rx >> immediate byte

shft1:
	lsr r0,x
	ror r0+1,x
	dey
	bne shft_rgt

shft2:
	ror c
	rts
	
; ------------------------------------
; destination R0 register instructions
; ------------------------------------

; word (Rx) -> R0

Do_LDMW:
	jsr Do_INCR
	lda (r0,x)
	tay
	jsr Do_DECR
	.byte $2C	; skip word

; byte (Rx) -> R0

Do_LDMB:
	ldy #$00	; skipped if LDMW
	lda (r0,x)
	jmp W2R0

; Rx -> R0

Do_LDR0:
	lda r0,x
	ldy r0+1,x

; A/Y word -> R0

W2R0:
	sta r0
	sty r0+1
	rts

; Rx bitwise inclusive OR R0 -> R0

Do_LIOR:
	lda r0+1
	ora r0+1,x
	tay
	lda r0
	ora r0,x
	jmp W2R0

; Rx bitwise AND R0 -> R0

Do_LAND:
	lda r0+1
	and r0+1,x
	tay
	lda r0
	and r0,x
	jmp W2R0

; Rx bitwise exclusive OR R0 -> R0

Do_LXOR:
	lda r0+1
	eor r0+1,x
	tay
	lda r0
	eor r0,x
	jmp W2R0

; -------------------------------

NextByteCode:
	inc	ByteCodePtr
	bne	+
	inc	ByteCodePtr+1
 +	bit	ByteCodePtr+1
	bmi	+
	lda	$2007
	rts

 +	ldy	#$00
	lda	(ByteCodePtr),y
	rts

Posted: **Sun Nov 16, 2014 11:23 am**

Wow, those are very cool optimisations ! Thank you SOOO much.

Unfortunately my interpreter is still over 600 bytes

(613 to be exact)

The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).
(EDIT : Right now the distance between Do_DJNZ and Do_LXOR is 306, so we are not too far to getting it under 256, to toss the high byte lookup table)

I uploaded my link to show the resulting interpreter.

Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.

Your dispatcher is interesting, however as it saves 8 bytes compared to mine, it needs 8 extra bytes of lookup table, as the non-register instructions are looked up by words. Since the result is neutral I kept mine (but keep in mind that any alternatives are welcome, of course).

Another idea I had was to toss lookup table entierely and write the opcode directly to PCL, so that each instruction has for exemple 8 bytes reserved for it. The problem is that while this would make dispatching simpler and faster, the instructions themselves would be hard to code, and would be a puzzle of jumping in all directions.

At some point I wanted to make the C flag of the interpreter match exactly the C flag of the interpreted code, so I wouldn't need this C location at all. The problem was that the shifts in the dispatching code made this impossible, as well as the comparison of the branching code. If only there was an alternative for those 2 cases, that'd be very cool.

I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so.

I was also wondering if self-modifying code could be useful anywhere in the interpreter, but I didn't think that was the case. Please tell me if I missed something.

As for the instruction themselves, the specs are still very open, I change the instruction as I am writing code for the VM, so it's no big surprise some of them have changed every time.

Posted: **Sun Nov 16, 2014 11:47 pm**

Quote:

The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).

You'd still have to be careful to start that window on a page boundary, or else risk crossing a page boundary for the later instructions in the window.

The same page-crossing problem could occur with the global instructions in your split dispatch table. It also might not - it depends on where the global instructions start in a page and their collective length.

Quote:

Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.

So I did. There's also an error in the dispatch routine, which should push the high byte first (and actually will as it is if the addresses are stored MSB/LSB instead of LSB/MSB).

Quote:

I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so

Here's one way that doesn't require either the 'tay' nor manually saving for each branch instruction. The idea is not to pay attention to the branch value until it's actually needed. This code also might be a bit shorter:

Code: Select all

; --------------------------
; global branch instructions
; --------------------------

; branch if carry set

Do_BRCS:
	lda #$00
	.byte $2C		; skip next instruction

; branch if carry clear

Do_BRCC:
	lda #$80		; skipped if BRCS
	eor c			; sets high bit if condition true

; branch if negative

BranchMinus:
	bpl ConditionFalse

; branch on condition true

ConditionTrue:
	ldx #$00
	jsr NextByteCode	; get branch value
        clc
	bpl +			; b: forward branch
	eor #$FF		; backward branch
	dex			; = $FF
	sec
 +	adc ByteCodePointer
	sta ByteCodePointer
	txa
	adc ByteCodePointer+1
	sta ByteCodePointer+1
	rts

; ----------------------------
; register branch instructions
; ----------------------------

; decrement Rx and branch if not zero

Do_DJNZ:
	jsr Do_DECR

; branch if Rx not zero

Do_BRNZ:
	lda r0,x
	ora r0+1,x

; branch if not zero

BranchNotZero:
	bne ConditionTrue

; false condition

ConditionFalse:
	jmp NextByteCode	; discard branch value

; branch if Rx zero

Do_BRZE:
	lda r0,x
	ora r0+1,x

; branch if zero

BranchZero:
	beq ConditionTrue
	bne ConditionFalse

; branch if Rx positive
; - checks only low byte

Do_BRPL:
	lda r0,x
	bpl ConditionTrue
	bmi ConditionFalse

; branch if Rx negative
; - checks only low byte

Do_BRMI:
	lda r0,x
	jmp BranchMinus

; branch if R0 != Rx

Do_BRNE:
	lda r0,x
	cmp r0
	bne ConditionTrue
	lda r0+1,x
	cmp r0+1
	jmp BranchNotZero
	
; branch if R0 == Rx

Do_BREQ:
	lda r0,x
	cmp r0
	bne ConditionFalse
	lda r0+1,x
	cmp r0+1
	jmp BranchZero

The routines accessing VRAM have common code that can be factored out. It doesn't really save much code space overall and increases the cycle count, but the common code can be placed outside the 256-byte window you're looking for:

Code: Select all


; R0L -> (Rx)

Do_STVB:
	jsr SetVRAMAddr
	lda r0
	sta $2007
	jmp Do_INCR

and

Code: Select all

; byte (Rx) -> R0

Do_LDVB:
	jsr SetVRAMAddr
	lda $2007
	lda $2007
	jsr Do_INCR
	ldy #$00
	beq W2R0

and

Code: Select all

; Rx -> VRAM registers

SetVRAMAddr:
	bit $2002
	lda r0+1,x
	sta $2006
	lda r0,x
	sta $2006
	rts

Posted: **Mon Nov 17, 2014 1:28 pm**

Very cool, now the interpreter is 578 bytes, and I have some ideas for other further size optimisations.

There is no need for that eor #$ff and sec in the branching code, just dex is enough. And it's very cool to do it like this, I have spend a few hours on this piece of code without ever finding an acceptable solution, when it was in fact so simple.

I am not sure if I will keep this C flag in the interpreted code. It seems to take too much place, I will see in my ByteCode if this is absolutely nessesarly, and if I can somehow get arround it I will completely remove it, and add cooler adressing modes with auto-increment instead, which seems more useful to reach extreme compactness.

PS : 14 more saved bytes and I'll be able to remove the high byte from the lookup table. I'll probably remove instructions I don't need such as XOR so this is very likely to finally become possible !

PPS : I hope you don't mind that I don't use the same assember as you do (they use different syntax)

Posted: **Mon Nov 17, 2014 2:23 pm**

Quote:

There is no need for that eor #$ff and sec in the branching code, just dex is enough.

Yup - it did eventually occur to me that sign extension was all that was needed, but not before you caught it!

Posted: **Fri May 20, 2016 8:49 pm**

So, after a long time of inactivity, I went back to THAT project again. I made some small but significant changes to the bytecode system, I deleted some instructions that weren't used anyway, and I ... doccumented them. TADA !

Documentation of the bytecode interpreter: http://dl.dropboxusercontent.com/u/2346 ... de_doc.txt
Source code of the bytecode interpreter: http://dl.dropboxusercontent.com/u/2346 ... tecode.asm

Now everything is pretty much completed BUT my instruction jump stable still needs to be 2 bytes wide because the critical part (between .block "interpreter_critical" and .endb in the source code) is still 277 bytes long. It needs to reach 255 bytes or less. I doubt this is possible so I'll have to resort to an explicit jump table (using a chain of JMP instructions) for the first couple of instructions I guess

If anyone has suggetions or improvements, I am all ears.

Since I deleted some instructions, there is now room for other new instructions if they are useful. This would increase the size of the interpreter, but in some cases it might be worth it. I am particularly thinking about adding more adressing modes than just register+post-increment, but I don't know which ones. A zero page constant location adressing mode would be nice to have, but I don't know how to interpret it : I'll have to need two dummy bytes in zero page just for that, which is an annoyance.

6502.org

Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process

Re: Compressing during the compilation process