6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Mar 29, 2024 9:25 am

All times are UTC




Post new topic Reply to topic  [ 79 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
Author Message
PostPosted: Sat May 03, 2014 8:31 pm 
Offline

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
chitselb wrote:
Here's details on the 1802 CPU on which Sweet-16 is based, starting in Chapter 11 on page 888. It powered the Cosmac ELF, which I never got to play with, but apparently Woz did.
1802 is interesting in a slightly weird way. 1805 improved things with more instructions using a prefix code but was still missing a proper stack, something Woz fixed in his SWEET16.
Quote:

So what you seem to be proposing is to retain the original Sweet-16 instruction set, which consists of 16 non-register opcodes and 15 register opcodes, but limit the accessible registers to R0-R7 and overlay the instruction set adding several more ALU-oriented register opcodes.
More or less, yes, though I haven't decided on specifics though. My hunch is that for ALU heavy work a dual accumulator ISA would be useful. Having 4 index registers is also tempting. Having 4 data registers and 4 index registers reminds me a little of the 68k structure but it does save opcode space. I would also like to change a few of the non-register instructions such as branch to subroutine to take a 16 bit displacement for relocatable code. There is also my hobby horse regarding decrement and branch if not zero, mentioned earlier. To save opcode space I would like to use 2 index registers for auto increment and 2 without. This idea is borrowed from DCPU-16.
Quote:

I would probably include a register bank switch opcode to toggle between the low R0-R7 registers and the upper R8-R15 registers, because sometimes it might be useful to directly load the program counter with ST 15 (e.g. from a jump table), or to unnest a BS call with POPD R12. Am I doing it wrong? What would the new opcodes be?
That would be pointer handling so a code to enter SWEET-16 might take care of that need.


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 07, 2014 1:54 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1918
Location: Sacramento, CA, USA
Bregalad wrote:
... Also, in my particular case, I care 0% about compatibility with original SWEET-16 instructions or whatever, I'm just looking for bytes saving. I'm still unsure if I'll start going that route at all when I'll be going to miss bytes through, dropping some unimportant parts of the games is a much easier route than engineering and writing a VM and convert some of 6502 code for that VM.

[... bump ...]

I just ran across THIS today, from a certainly gifted Mr. Schmenk. At first glance, it seems a bit like an alternative to Forth, with similar performance and footprint, but with a syntax that looks a bit more "C-like". Is this something that you or anyone else here might find to be interesting?

Here's a modestly technical description of what PLASMA is about:

https://github.com/dschmenk/PLASMA

Mike


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 07, 2014 8:40 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
(Started a new thread for this subtopic about the PLASMA virtual machine/programming environment)


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 14, 2014 7:24 am 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Looks pretty awesome to tell the truth.

However if I want to really use this in the context of my current project, it will be hard anyways since I didn't plan such a thing from the begining. So I don't know how well it'll be doable to mix the tools for programming in PLASMA and those I use currently (WLA-DX assembler suite). At least it sounds like they made it really easy to intermix with native code which is good.

I'll also have to see how small is the runtime, and if there is room to remove unnecessary functions (such as MUL or DIV).


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 10, 2014 11:08 am 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
So I finally got to program my own VM, this is pretty fun, the only drawback is that it's hard to debug code.
It's inspired by SWEET-16 but I made many changes to suit my own needs. The specifications are still pretty open, as I might remove instructions I don't need and add instructions I need.
The only con is that the interpreter is still a bit too big as opposed to what I'd like (>600 bytes), and it's very slow too (it's between 90 and 190 cycles per instruction). The only way to make the interpreter faster would be to make it bigger, which I obviously don't want to.

The interpreter is based on a register machine. There is 8 registers, called R0-R7, and a single flag, called C.

The registers are 16-bit wide. When switching to/from assembly code, C = C, R0L = A, R1L = X, R2L = Y, and other registers correspond to "Temp" variables.

Unlike SWEET-16, the "PC", the stack pointer and the SREG are not accessible, they are hidden. The stack is the same as the 6502 stack. I could easily make PC visible as R7 for example if this was necessarly, but I doubt this is, so an extra register is probably a nicer thing to have.

The instruction encoding is the same as on SWEET-16, exept that 5 bits are used for opcode and 3 bits for register (instead of 4/4). There is therefore room for 8 global instructions and 31 register instructions (although not all of them have to be used).

Global instructions (1 byte unless specified) :
BRCS - Branch if carry is set
BRCC - Branch if carry is clear
CALL - Call bytecode subroutine (3 bytes)
CASM - Call assembly subroutine (3 bytes)
RETN - Return to caller (agnostic - will automatically return to bytecode or assembly mode)
JUMP - Jump to constant adress (3 bytes)
SETC - Set carry
CLRC - Clear carry

Register instructions (1 byte unless specified) :
CLER - Clear register (Rn := 0)
LDMW - Load memory word (R0 or R1 := (Rn)) (takes 2 instructions slots)
LDMB - Load memory byte (R0 or R1 := (Rn)) (idem)
STMW - Store memory word ((Rn) := R0 or R1) (idem)
STMB - Store memory byte ((Rn := R0 or R1) (idem)
LDIB - Load immediate byte (Rn := imm8) (2 bytes)
LDIW - Load immediate word (Rn := imm16) (3 bytes)
MOVE - Move register (Rn := R0 or R0 := Rn) (takes 2 instruction slots)
INCR - Increment register (Rn+=1)
DECR - Decrement register (Rn-=1)
LSHL - Logical shift left (R0 <<= Rn)
LSHR - Logical shift right (R0 >>= Rn)
LIOR - Logical inclusive OR (R0 |= Rn)
LAND - Logical AND (R0 &= Rn)
LXOR - Logical XOR (R0 ^= Rn)
ADDD - Addition (R0 += Rn)
NEGA - Negate (Rn := -Rn)
PUSH - Push Rn on stack (SP-=2, (SP+1):=Rn)
PULL - Pop Rn from stack (Rn:=(SP+1), SP+=2)
DJNZ - Decrement Rn and jump if non zero (2 bytes)
BRNZ - Branch if Rn is nonzero (2 bytes)
BRZE - Branch if Rn is zero (2 bytes)
BRPL - Branch if Rn is positive (2 bytes)
BRMI - Branch if Rn is negative (2 bytes)

That's it. Thanks to macros it makes it simple to "assemble" those instructions. Mnemonics are 4 letters to be sure not to confuse them with 6502 ones.

Only the ADDD, LSHL, LSHR. CLRC and SETC instructions affects the carry. Please tell me what do you think about this VM. It looks like code reduces to about 72% of it's original size (28% of savings) so it's good.

I'm not sure I'll need all those instructions, especially the load word and store word ones. I could replace this with more awesome adressing modes, for only 8-bit transfers. What do you think ?

It's also strange I toss out the status flags but keep C. In fact I'm so used to use C as an input/output flag for subroutines I just can't get rid of this habbit. Tell me if I'm crazy or if this is OK.

The 16-bits registers are useless most of the time exept when used as a counter or as a pointer, but when it's useful it saves so many bytes it's worth the trouble having them 16-bit all the time.

As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 10, 2014 4:02 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
This is really a post worthy of a new thread. Garth, perhaps you could promote it to be one?


Top
 Profile  
Reply with quote  
PostPosted: Sat Nov 15, 2014 1:30 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1918
Location: Sacramento, CA, USA
Bregalad wrote:
... As for the source code it's till a mess because the VM is optimized for size (which means functions appear in an illogical order, in order to replace JMPs with branches with conditions which are always met). Tell me if anyone is interested.

I'm interested! I love browsing through size-optimized code, especially if it's commented well enough to understand the optimizations.

Mike


Top
 Profile  
Reply with quote  
PostPosted: Sat Nov 15, 2014 9:39 am 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
I uploaded the source here, please tell me if you have any questions (you should because it's a bit a mess ;) )

I have special instructions for writing/reading NES' VRAM, and also fetching bytecode from VROM, that is why I have some writes/reads to $2006 and $2007 registers.


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 16, 2014 12:17 am 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 384
Location: Minnesota
Playing with the code a bit (actually for too much time altogether :? ), I think you might have a bug with your coding of LDMW. If the address in the source register is of the form $xxFF, then just incrementing the low byte is going to point to $xx00, not $(xx+1)00. You might have to always do a full 16-bits on both increment and decrement to avoid this.


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 16, 2014 12:30 am 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 384
Location: Minnesota
Ah, this is what I've played with so far. Haven't got the time to finish it - only register instructions as yet. It's a bit different than your organization, but not that much. The main difference is in dispatching instructions, which may have the effect of reducing overall size. Also the instruction coding is slightly different, but I imagine macros would cover that up in use.

Code:
InterpreterLoop:
   jsr NextByteCode
   jsr Dispatch
   jmp InterpreterLoop

Dispatch:
   tay      ; save opcode
   and #%11000001   ; isolate register number
   beq :+      ; b: non-register instruction
   cmp #$80   ; adjust register number
   rol
   cmp #$80
   rol      ; register number
   asl      ; ...times two
 +   tax      ; =zero if non-register instruction
   tya      ; recover opcode
   and #%00111110   ; isolate instruction (times two)
   tay      ; offset into dispatch table
    lda DoOpcode,y   ; push instruction address (minus one) on stack
   pha
   lda DoOpcode+1,y
   pha
   rts      ; execute instruction

; ----------------------------------------
; destination memory register instructions
; ----------------------------------------

; R0L -> (Rx)

Do_STMB:
   lda r0
   sta (r0,x)
   rts

; Rx -> stack

Do_PUSH:
   pla      ; discard return address (a constant value)
   pla
   lda r0+1,x
   pha
   lda r0,x
   pha
   jmp InterpreterLoop

; ------------------------------------
; destination Rx register instructions
; ------------------------------------

; stack -> Rx

Do_PULL:
   pla      ; discard return address (a constant value)
   pla
   pla
   sta r0,x
   pla
   sta r0+1,x
   jmp InterpreterLoop
   
; Rx EOR $FF -> Rx

Do_NEGA:
   lda r0+1,x
   eor #$FF
   tay
   lda r0,x
   eor #$FF
   jmp W2Rx

; immediate byte -> Rx

Do_LDIB:
   jsr NextByteCode
   .byte   $2C   ; skip word

; $0000 -> Rx

Do_CLER:
   lda #$00   ; skipped if LDIB instruction
   ldy #$00
   beq W2Rx

; R0 -> Rx

Do_STR0:
   lda r0
   ldy r0+1

; A/Y word -> Rx

W2Rx:
   sta r0,x
   sty r0+1,x
   rts

; immediate word -> Rx

Do_LDIW:
   jsr NextByteCode
   sta r0,x
   jsr NextByteCode
   sta r0+1,x
   rts

; Rx - immediate byte -> Rx
   
Do_SUBI:
   sec
   jsr NextByteCode
   eor #$FF
   ldy #$FF
   bmi addd1
   
; Rx + R0 -> Rx

Do_ADDD:
   clc
   lda r0
   ldy r0+1

addd1:
   adc r0,x
   sta r0,x
   tya
   adc r0+1,x
   sta r0+1,x
   ror c
   rts

; Rx + 1 -> Rx

Do_INCR:
   inc r0,x
   bne +
   inc r0+1,x
 +   rts

; Rx - 1 -> Rx

Do_DECR:
   lda r0,x
   bne +
   dec r0+1,x
+   dec r0,x
   rts

; do shift instruction
; - wonky result if shift value is zero

Do_SHFT:
   jsr NextByteCode
   tay
   bmi shft1

; Rx << immediate byte

 -   asl r0,x
   rol r0+1,x
   dey
   bne -
   beq shft2

; Rx >> immediate byte

shft1:
   lsr r0,x
   ror r0+1,x
   dey
   bne shft_rgt

shft2:
   ror c
   rts
   
; ------------------------------------
; destination R0 register instructions
; ------------------------------------

; word (Rx) -> R0

Do_LDMW:
   jsr Do_INCR
   lda (r0,x)
   tay
   jsr Do_DECR
   .byte $2C   ; skip word

; byte (Rx) -> R0

Do_LDMB:
   ldy #$00   ; skipped if LDMW
   lda (r0,x)
   jmp W2R0

; Rx -> R0

Do_LDR0:
   lda r0,x
   ldy r0+1,x

; A/Y word -> R0

W2R0:
   sta r0
   sty r0+1
   rts

; Rx bitwise inclusive OR R0 -> R0

Do_LIOR:
   lda r0+1
   ora r0+1,x
   tay
   lda r0
   ora r0,x
   jmp W2R0

; Rx bitwise AND R0 -> R0

Do_LAND:
   lda r0+1
   and r0+1,x
   tay
   lda r0
   and r0,x
   jmp W2R0

; Rx bitwise exclusive OR R0 -> R0

Do_LXOR:
   lda r0+1
   eor r0+1,x
   tay
   lda r0
   eor r0,x
   jmp W2R0

; -------------------------------

NextByteCode:
   inc   ByteCodePtr
   bne   +
   inc   ByteCodePtr+1
 +   bit   ByteCodePtr+1
   bmi   +
   lda   $2007
   rts

 +   ldy   #$00
   lda   (ByteCodePtr),y
   rts
   


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 16, 2014 11:23 am 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Wow, those are very cool optimisations ! Thank you SOOO much. :D
Unfortunately my interpreter is still over 600 bytes :( (613 to be exact)

The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).
(EDIT : Right now the distance between Do_DJNZ and Do_LXOR is 306, so we are not too far to getting it under 256, to toss the high byte lookup table)

I uploaded my link to show the resulting interpreter.

Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.

Your dispatcher is interesting, however as it saves 8 bytes compared to mine, it needs 8 extra bytes of lookup table, as the non-register instructions are looked up by words. Since the result is neutral I kept mine (but keep in mind that any alternatives are welcome, of course).

Another idea I had was to toss lookup table entierely and write the opcode directly to PCL, so that each instruction has for exemple 8 bytes reserved for it. The problem is that while this would make dispatching simpler and faster, the instructions themselves would be hard to code, and would be a puzzle of jumping in all directions.

At some point I wanted to make the C flag of the interpreter match exactly the C flag of the interpreted code, so I wouldn't need this C location at all. The problem was that the shifts in the dispatching code made this impossible, as well as the comparison of the branching code. If only there was an alternative for those 2 cases, that'd be very cool.

I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so.

I was also wondering if self-modifying code could be useful anywhere in the interpreter, but I didn't think that was the case. Please tell me if I missed something.

As for the instruction themselves, the specs are still very open, I change the instruction as I am writing code for the VM, so it's no big surprise some of them have changed every time.


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 16, 2014 11:47 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 384
Location: Minnesota
Quote:
The main problem now is that it'd be great to make the register instructions fit in a 256 byte window, so that I can toss the lookup table for the high byte (which eats 31 bytes).


You'd still have to be careful to start that window on a page boundary, or else risk crossing a page boundary for the later instructions in the window.

The same page-crossing problem could occur with the global instructions in your split dispatch table. It also might not - it depends on where the global instructions start in a page and their collective length.

Quote:
Also you introduced an error by replacing the iny by the dey in the shift right loop, the iny is correct, since Y starts negative and grows up to zero.


So I did. There's also an error in the dispatch routine, which should push the high byte first (and actually will as it is if the addresses are stored MSB/LSB instead of LSB/MSB).

Quote:
I say that you removed the 'tay' at the end of the NewByteCode routine. The problem is that without this I'd have to do it manually each time for each branch instruction. I don't think there's another way arround. If there is one, please tell me so


Here's one way that doesn't require either the 'tay' nor manually saving for each branch instruction. The idea is not to pay attention to the branch value until it's actually needed. This code also might be a bit shorter:

Code:
; --------------------------
; global branch instructions
; --------------------------

; branch if carry set

Do_BRCS:
   lda #$00
   .byte $2C      ; skip next instruction

; branch if carry clear

Do_BRCC:
   lda #$80      ; skipped if BRCS
   eor c         ; sets high bit if condition true

; branch if negative

BranchMinus:
   bpl ConditionFalse

; branch on condition true

ConditionTrue:
   ldx #$00
   jsr NextByteCode   ; get branch value
        clc
   bpl +         ; b: forward branch
   eor #$FF      ; backward branch
   dex         ; = $FF
   sec
 +   adc ByteCodePointer
   sta ByteCodePointer
   txa
   adc ByteCodePointer+1
   sta ByteCodePointer+1
   rts

; ----------------------------
; register branch instructions
; ----------------------------

; decrement Rx and branch if not zero

Do_DJNZ:
   jsr Do_DECR

; branch if Rx not zero

Do_BRNZ:
   lda r0,x
   ora r0+1,x

; branch if not zero

BranchNotZero:
   bne ConditionTrue

; false condition

ConditionFalse:
   jmp NextByteCode   ; discard branch value

; branch if Rx zero

Do_BRZE:
   lda r0,x
   ora r0+1,x

; branch if zero

BranchZero:
   beq ConditionTrue
   bne ConditionFalse

; branch if Rx positive
; - checks only low byte

Do_BRPL:
   lda r0,x
   bpl ConditionTrue
   bmi ConditionFalse

; branch if Rx negative
; - checks only low byte

Do_BRMI:
   lda r0,x
   jmp BranchMinus

; branch if R0 != Rx

Do_BRNE:
   lda r0,x
   cmp r0
   bne ConditionTrue
   lda r0+1,x
   cmp r0+1
   jmp BranchNotZero
   
; branch if R0 == Rx

Do_BREQ:
   lda r0,x
   cmp r0
   bne ConditionFalse
   lda r0+1,x
   cmp r0+1
   jmp BranchZero


The routines accessing VRAM have common code that can be factored out. It doesn't really save much code space overall and increases the cycle count, but the common code can be placed outside the 256-byte window you're looking for:

Code:

; R0L -> (Rx)

Do_STVB:
   jsr SetVRAMAddr
   lda r0
   sta $2007
   jmp Do_INCR


and

Code:
; byte (Rx) -> R0

Do_LDVB:
   jsr SetVRAMAddr
   lda $2007
   lda $2007
   jsr Do_INCR
   ldy #$00
   beq W2R0


and

Code:
; Rx -> VRAM registers

SetVRAMAddr:
   bit $2002
   lda r0+1,x
   sta $2006
   lda r0,x
   sta $2006
   rts


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 17, 2014 1:28 pm 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Very cool, now the interpreter is 578 bytes, and I have some ideas for other further size optimisations.

There is no need for that eor #$ff and sec in the branching code, just dex is enough. And it's very cool to do it like this, I have spend a few hours on this piece of code without ever finding an acceptable solution, when it was in fact so simple.

I am not sure if I will keep this C flag in the interpreted code. It seems to take too much place, I will see in my ByteCode if this is absolutely nessesarly, and if I can somehow get arround it I will completely remove it, and add cooler adressing modes with auto-increment instead, which seems more useful to reach extreme compactness.

PS : 14 more saved bytes and I'll be able to remove the high byte from the lookup table. I'll probably remove instructions I don't need such as XOR so this is very likely to finally become possible !

PPS : I hope you don't mind that I don't use the same assember as you do (they use different syntax)


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 17, 2014 2:23 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 384
Location: Minnesota
Quote:
There is no need for that eor #$ff and sec in the branching code, just dex is enough.


Yup - it did eventually occur to me that sign extension was all that was needed, but not before you caught it!


Top
 Profile  
Reply with quote  
PostPosted: Fri May 20, 2016 8:49 pm 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
So, after a long time of inactivity, I went back to THAT project again. I made some small but significant changes to the bytecode system, I deleted some instructions that weren't used anyway, and I ... doccumented them. TADA !

Documentation of the bytecode interpreter: http://dl.dropboxusercontent.com/u/23465629/NES_junk/bytecode_doc.txt
Source code of the bytecode interpreter: http://dl.dropboxusercontent.com/u/23465629/NES_junk/bytecode.asm

Now everything is pretty much completed BUT my instruction jump stable still needs to be 2 bytes wide because the critical part (between .block "interpreter_critical" and .endb in the source code) is still 277 bytes long. It needs to reach 255 bytes or less. I doubt this is possible so I'll have to resort to an explicit jump table (using a chain of JMP instructions) for the first couple of instructions I guess :(

If anyone has suggetions or improvements, I am all ears.

Since I deleted some instructions, there is now room for other new instructions if they are useful. This would increase the size of the interpreter, but in some cases it might be worth it. I am particularly thinking about adding more adressing modes than just register+post-increment, but I don't know which ones. A zero page constant location adressing mode would be nice to have, but I don't know how to interpret it : I'll have to need two dummy bytes in zero page just for that, which is an annoyance.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 79 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 5 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: