Announce: Acheron VM

GaBuZoMeu · Post by **GaBuZoMeu** » Sat Oct 05, 2019 6:42 pm

The 'with' op and it's nearly transparent implemantation is a key element together with the sliding registers and Rp. It is a powerful and versatile and elegant way to avoid register-register moves and more. I would not omit this feature without serious reasons.

128 instructions should be sufficient for many applications. And in theory there could be one 'shift' op (with or without an argument) that could temporarily / permanent / or else(°) switch the instruction dispatch table to a (or multiple) different one. (°: Perhaps a similar technique as your sliding register set could be used to extend the instruction set by changing the start of the dispatch table. Highly dangerous in some situations like subroutines and interrupt services

but freaky

)

White Flame · Post by **White Flame** » Sat Oct 05, 2019 7:45 pm

In the 256-opcode dispatcher, 'with' versions of individual instructions can be added with the first 2 lines here:

Code: Select all

OPWITH bswap
 get_rp
OP bswap, none, bits, "Swap high and low bytes of rP."
 dey
 save_y
 ldy 0,x
 lda 1,x
 sta 0,x
 sty 1,x
 jmp mainLoopGetY

Since the OP macros do not inject any bytes into the code stream at that point (they all go into different ca65 segments), entry points can chain together this way.

The usage would remain as normal:

Code: Select all

 with r3
 bswap

but compiles to the OPWITH-marked entry point of bswap instead.

This means that bswap has 2 entries in the opcode table, one for its normal version, and another that consumes the 'with' parameter. The 'with' instruction is also available standalone there, but obviously that version has full dispatch overhead.

I still think the 7-bit + 'with' flag dispatcher is better overall, but I do leave the option to use the other dispatchers.

--

I hadn't considered swapping between dispatch tables at runtime. I like the freaky/dangerous aspects of it, but we'll see how necessary it would be, if projects ever scale to that size. I can imagine a multitasking system where each process can have its own instruction set. It would certainly add space overhead, but might be doable, and many entries of the tables would share a common subset of implementation pointers. But that's for future-me to tackle.

Chromatix · Post by **Chromatix** » Sun Oct 06, 2019 5:24 am

You could also have escape codes as prefixes to lesser-used and/or slower instructions. You'd need extra dispatch tables to accommodate them, of course, but it means you can have a large instruction set without compromising the speed of frequently used instructions.

White Flame · Post by **White Flame** » Sun Oct 06, 2019 6:45 am

Conceptually, there's not much difference between a prefix byte, and a byte to swap dispatch tables. It's only the duration of its effect that differs.

GaBuZoMeu · Post by **GaBuZoMeu** » Sun Oct 06, 2019 1:35 pm

Yes, if a prefix|shift|escape byte only lasts for the next dispatch it's not really freaky. But if it should last longer then context switches (multitasking) would require that the task dispatcher has knowledge about this - and this would increase the overhead. Even common subroutines would require certain attention and if they are nested things could get easily nasty or bulky at least.

On the other hand a "from now on use FP64" would be nice and perhaps the following code would be more readable than frequently using a "shift prefix".

I remembered the XOP operations available for the TI TMS9900 CPU. But these instructions are actually only software traps like sixteen BRKs with different vectors. But perhaps a "use FP64_ADD for XOP1" could inject the address of the FP64_ADD subroutine into the dispatch table where XOP1 resides? This would last until it is redefined. A context switch still would be difficult but common subroutines might simply not use any XOPn and if necessary use a prefix to get access to special instructions.

Lot to think about - perhaps another year

dmsc · Post by **dmsc** » Sun Oct 06, 2019 11:29 pm

Hi!

White Flame wrote:

Yep, so you can see why it's advantageous to adjust the instruction set itself. ~20 cycles to dispatch to custom instructions, each use only takes a byte (plus params), and you can remove existing instructions if you're not using them. Then you can consider what operations are often used and what's seldomly used, and figure out your balance per project.

While the 7-bit 'with' dispatcher supports 128 instructions, you can dispatch on a full 256 as well, but you'd have to deal with the 'with' operation separately.

Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .

I also use a table dispatch, and also use ca65 macros to automatically populate the dispatch table and assign values to the instructions, see for example at https://github.com/dmsc/fastbasic/blob/ ... ub.asm#L38 the definition of the "ADD" and "SUB" instructions, and this is the macro definition https://github.com/dmsc/fastbasic/blob/ ... ok.inc#L31 .

My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.

A brief description of the VM:

- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
- The optional FP support adds instructions that operate on 6 bytes floating point, there is a separate floating point accumulator and a separate floating point stack.
- Also, there are many instructions that are contractions of others to make the code smaller and faster.
- The dispatch code uses only the Y register, and loads the stack pointer to Y before calling operations.
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
- The CALL and RETURN instructions use the 6502 stack to store the return address.

The instruction names are not really good, as were only tough as a BASIC representation, but are fairly generic, so many languages could be compiled to the VM.

Example of code generation:

For the BASIC code " DPOKE 1000, 1234 " (write the value 1234 to memory locations 1000 and 1001), this is 4 VM instructions:
NUM 1000 ; SADDR ; NUM 1234; DPOKE

NUM loads a 16 byte immediate to the accumulator, SADDR copies the accumulator to the pointer, DPOKE writes the value in the accumulator to the pointer address.

The BASIC code " X(100) = A " compiles to:

VAR_LOAD X; PUSH; BYTE 100 ; SHL ; ADD ; SADDR ; VAR_LOAD A ; DPOKE

The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:

VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE

Variables can be used for function parameters, as I explicitly dis-allow recursive functions.

Have Fun!

White Flame · Post by **White Flame** » Mon Oct 07, 2019 7:18 pm

dmsc wrote:

Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .

Cool, it's nice to have other points of comparison, especially if it uses nonstandard and fully-integrated designs.

Regarding differences, I think the main one between our projects relates to your last example:

Quote:

The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE

Note that with 6 instructions, the dispatcher burns 6*27=162 cycles. While not as bad as other systems, that still is 46% of your spent CPU time. That's an optimization opportunity I'm really focusing on in Acheron. For instance, "stmi r3, 200" would replace 4 fastbasic instructions (push_byte 200, add, saddr, dpoke): It would store the current 16-bit value using STA (r3),Y addressing with .Y=200-201 to avoid the separate offset pushing & 16-bit add. Of course, this pointer-offset addressing only supports a 256 byte range, while a general 16-bit add can reach the entire address space.

With register orthogonality, there's no instructions like SADDR or PUSH in general. Values just live where they are, and are directly referenced, without being copied around with additional instructions (and more importantly, without their respective dispatch overhead). But like I said in my talk, the assumption is that this is beneficial when values are reused multiple times, either as a constant or statefully. Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.

There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.

Quote:

My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.

My 3 qualities were speed, space, and power (higher level of abstraction, powerful but still general purpose instructions, etc). And of course having a user-editable ISA is the ultimate per-project optimization, but with compatibility tradeoffs. Since mine is basically a high-level assembly language it only takes 1.9kB at the moment, but it'll certainly grow tooling in the same directions.

Quote:

- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.

Would it be fair to describe this as a stack machine, with the TOS held in registers? I noticed in the things I glanced that there's a fair amount of [operation lo], PHA, TXA, [operation hi], TAX, PLA in order to swap bytes of the accumulator, and JSR/RTS to deal with a number of standard AX operations, so that's something to consider performance-wise. Since moving away from bit-packed opcodes/operands, I've eliminated all JSRs in the instruction implementations, except for tohex/fromhex which call a local helper per digit.

Quote:

- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.

I decided to stick with 16-bit ALU operations for mine, since in my environment you can drop into 6502 asm inline and deal with 8-bit computations when it'd be worth the overhead savings. Load/store instructions support both 8 and 16 values.

Quote:

- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.

Yeah, mine is called the "globals" page (named in contrast to function-local registers), although I stuck with a 256-byte page like zeropage does.

Quote:

Variables can be used for function parameters, as I explicitly dis-allow recursive functions.

Does that mean a BASIC routine can't recursively GOSUB itself, or is it a more low level restriction?

dmsc · Post by **dmsc** » Wed Oct 09, 2019 12:19 am

Hi!

White Flame wrote:

dmsc wrote:

Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .

Cool, it's nice to have other points of comparison, especially if it uses nonstandard and fully-integrated designs.

Regarding differences, I think the main one between our projects relates to your last example:

Quote:

The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE

Note that with 6 instructions, the dispatcher burns 6*27=162 cycles. While not as bad as other systems, that still is 46% of your spent CPU time. That's an optimization opportunity I'm really focusing on in Acheron. For instance, "stmi r3, 200" would replace 4 fastbasic instructions (push_byte 200, add, saddr, dpoke): It would store the current 16-bit value using STA (r3),Y addressing with .Y=200-201 to avoid the separate offset pushing & 16-bit add. Of course, this pointer-offset addressing only supports a 256 byte range, while a general 16-bit add can reach the entire address space.

Yes. This is an optimization that I specifically avoided as it is very difficult for the optimizer to narrow variables to a range, currently the optimizer is only a simple peephole pass. And in common code (I have used some game code and benchmarks like the "sieve" bench) most array indexes are 16-bit sized variables.

Quote:

With register orthogonality, there's no instructions like SADDR or PUSH in general. Values just live where they are, and are directly referenced, without being copied around with additional instructions (and more importantly, without their respective dispatch overhead). But like I said in my talk, the assumption is that this is beneficial when values are reused multiple times, either as a constant or statefully.

Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!

Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser

But for a cross-compiler, surely this is possible.

Quote:

Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.

There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.

Quote:

My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.

My 3 qualities were speed, space, and power (higher level of abstraction, powerful but still general purpose instructions, etc). And of course having a user-editable ISA is the ultimate per-project optimization, but with compatibility tradeoffs. Since mine is basically a high-level assembly language it only takes 1.9kB at the moment, but it'll certainly grow tooling in the same directions.

Quote:

- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.

Would it be fair to describe this as a stack machine, with the TOS held in registers?

Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.

Quote:

I noticed in the things I glanced that there's a fair amount of [operation lo], PHA, TXA, [operation hi], TAX, PLA in order to swap bytes of the accumulator, and JSR/RTS to deal with a number of standard AX operations, so that's something to consider performance-wise.

Yes, and I have tough about replacing the X register with a ZP location - but my simple measurements show that it would not be a net gain in speed - and be a loss in size. And having the accumulator in two ZP locations is slower, I tested it at the beginning. This is because in many programs arithmetic operations don't dominate the runtime.

This is the top used VM operations in the editor ( https://github.com/dmsc/fastbasic/blob/ ... or.bas#L27 ), when compiled without optimizations:

Code: Select all

	290	TOK_VAR_LOAD    ; Loads value of variable to AX
	273	TOK_BYTE        ; Loads immediate byte to A, X = 0
	263	TOK_PUSH        ; Push AX to stack
	103	TOK_VAR_STORE   ; Store AX to variable
	86	TOK_CJUMP       ; Jump to target if A != 1
	80	TOK_CALL        ; Calls subroutine
	66	TOK_ADD         ; Adds TOS into AX
	55	TOK_JUMP        ; Jumps to target
	39	TOK_SUB         ; Subtracts AX from TOS
	35	TOK_EQ          ; Compare AX with TOS, set A=1 if equal
	31	TOK_VAR_SADDR   ; Loads address of variable to SADDR

As you see, less than half of opcodes actually use the stack. When using the optimizer (that is only available in the cross-compiler), the statistics change:

Code: Select all

     171   TOK_VAR_LOAD         ; Loads value of variable to AX
     111   TOK_PUSH_VAR_LOAD    ; Push AX, then loads value of variable to AX
     100   TOK_VAR_STORE        ; Store AX to variable
     78    TOK_PUSH_BYTE        ; Push AX, then loads immediate byte to A, X = 0
     64    TOK_CJUMP            ; Jump to target if A != 1
     55    TOK_ADD              ; Adds TOS into AX
     48    TOK_BYTE             ; Loads immediate byte to A, X = 0
     46    TOK_CALL             ; Calls subroutine
     46    TOK_0                ; Loads #0 into AX
     36    TOK_PUSH_1           ; Push AX, then loads #1 into AX
     34    TOK_EQ               ; Compare AX with TOS, set A=1 if equal
     33    TOK_JUMP             ; Jumps to target
     33    TOK_1                ; Loads #1 into AX
     32    TOK_SUB              ; Subtracts AX from TOS

Quote:

Since moving away from bit-packed opcodes/operands, I've eliminated all JSRs in the instruction implementations, except for tohex/fromhex which call a local helper per digit.

Yes, that is an interesting idea, I could get smaller instruction sizes by storing the "PUSH before the instruction" in the high bit. But, I need to do that efficiently without using the 6502 accumulator... perhaps:

Code: Select all

.proc   interpreter
next_instruction:
cload:  ldy     $1234
        inc     z:cload+1
        bne     adj
        inc     z:cload+2
adj:    sty     z:jump+1
        asl     z:jump+1
        bcs     do_push
ldsptr: ldy     #0
jump:   jmp     (__JUMPTAB_RUN__)

It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.

Quote:

- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.

I decided to stick with 16-bit ALU operations for mine, since in my environment you can drop into 6502 asm inline and deal with 8-bit computations when it'd be worth the overhead savings. Load/store instructions support both 8 and 16 values.

The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.

Quote:

- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.

Yeah, mine is called the "globals" page (named in contrast to function-local registers), although I stuck with a 256-byte page like zeropage does.

I used 512 bytes instead of the simpler 256 because you can store floating point values, that use 6 bytes, so I tough that it would allow bigger programs. But perhaps limiting to 256 is not that limited. Also, I could add a conditional compilation option that uses the simpler implementation if the used space is less than 256 bytes.

Quote:

Variables can be used for function parameters, as I explicitly dis-allow recursive functions.

Does that mean a BASIC routine can't recursively GOSUB itself, or is it a more low level restriction?

Well, FastBasic does not support GOTO, GOSUB or line numbers

But no, it means that if you call a procedure recursively, the value of the variables is overwritten. The optimizer is capable of tail-call optimizations, this is really useful for the editor, as it means that code that ends calling a subroutine can simple use JUMP instead of CALL/RETURN.

For example:

Code: Select all

PROC decide
  IF a>0
    EXEC do_positive
  ELIF a < 0
    EXEC do_negative
  ELSE
    EXEC do_zero
  ENDIF
ENDPROC

This will be compiled to comparisons and conditional jumps:

Code: Select all

proc_lbl_DECIDE:
    VAR_LOAD  #0
    PUSH_0
    GT
    CNJUMP    proc_lbl_DO_POSITIVE
    VAR_LOAD  #0
    PUSH_0
    LT
    CNJUMP    proc_lbl_DO_NEGATIVE
    JUMP      proc_lbl_DO_ZERO

I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.

Thanks for your comments, and Have Fun!

Chromatix · Post by **Chromatix** » Wed Oct 09, 2019 2:08 am

Quote:

I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read…

Why are you using the Y register this way? If you assume a 65C02, there's a non-indexed zero-page indirect mode for LDA.

Conversely, if you keep the low byte of PC constant and only increment Y, then you only need to increment the high byte of PC when Y wraps around. That would also save quite a few cycles, at the expense of needing to preserve Y somewhere.

dmsc · Post by **dmsc** » Wed Oct 09, 2019 2:29 am

Hi!

Chromatix wrote:

Quote:

I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read…

Why are you using the Y register this way? If you assume a 65C02, there's a non-indexed zero-page indirect mode for LDA.

Because Fastbasic runs in the Atari 8-bit computers, that use the 6502 CPU.

Quote:

Conversely, if you keep the low byte of PC constant and only increment Y, then you only need to increment the high byte of PC when Y wraps around. That would also save quite a few cycles, at the expense of needing to preserve Y somewhere.

That would be slower and bigger, I tried: " LDY IDX ; LDA (PC),Y ; INY ; STY IDX ; BNE *+2; INC PC+1 ".

Current solution is a lot faster because the main dispatch code is in zeropage, so the pointer is read directly with an "LDY PTR", no indirection.

Have Fun!

Chromatix · Post by **Chromatix** » Wed Oct 09, 2019 3:13 am

Right, but what if you need two instruction bytes in succession? That saves you an LDY/STY pair, as long as you don't need the Y register meanwhile.

dmsc · Post by **dmsc** » Wed Oct 09, 2019 3:35 am

Hi!

Chromatix wrote:

Right, but what if you need two instruction bytes in succession? That saves you an LDY/STY pair, as long as you don't need the Y register meanwhile.

See: https://github.com/dmsc/fastbasic/blob/ ... st.asm#L41

Reading two bytes with only Y as pointer is more difficult, because you need to test for wrapping from 255 to 1 and from 254 to 0.

There is one alternative to all this: restrict the code so that pages are never crossed, and store one "next-page" instruction at the last position of the page. I have not done this because currently the VM code is relocatable, as the IDE emits the code to one address but compiles to another, but it could make the interpreter faster.

White Flame · Post by **White Flame** » Wed Oct 09, 2019 9:35 am

dmsc wrote:

Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!

Well, comparing that to burning 50-100 cycles of dispatches by forcing it to multiple instructions, that's not bad. I have face-value inefficiencies in quite a few of my instructions as well, but compared to issuing 3-4 other instructions and the register pressure of using additional intermediates, it's actually a net gain. Just need to zoom out more to see the full comparison in a larger unit of work.

Also, in reading from the instruction stream in Acheron, I keep the value of .Y live between instructions, offsetting into the instruction stream. Instructions simply LDA (iptr),Y INY and don't bother checking for overflow. The dispatcher does a BMI and rolls it into the base pointer if .Y gets larger than 127, and doesn't have to catch the exact transition.

The implication of this is that instructions cannot be longer than 128 bytes (consider a switch/case table as a single instruction, or long literals), but that's not generally a problem. It eliminates all that scattered checking from the instruction code, and even the dispatcher PC update is just 4 cycles of INY BMI in the vastly common case (jump/call/ret all set the 16-bit base and clear .Y, which prevents a lot of overflow). I don't even have my dispatch code in zeropage, and it's still 22 cycles or so including the 'with' bit check.

Code: Select all

OP seti16, imm16, regs, "rP := imm16"
 lda (iptr),y
 sta 0,x
 iny
 lda (iptr),y
 sta 1,x
 jmp mainLoop1  ; expects .Y to point to the last byte

Quote:

Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser

I have humans directly writing Acheron code for now, so those should come with at least a semi-working register allocator!

Quote:

Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.

Okay, then it is actually a true hybrid. Neat!

Code: Select all

     171   TOK_VAR_LOAD         ; Loads value of variable to AX
     111   TOK_PUSH_VAR_LOAD    ; Push AX, then loads value of variable to AX
     100   TOK_VAR_STORE        ; Store AX to variable
     78    TOK_PUSH_BYTE        ; Push AX, then loads immediate byte to A, X = 0

I can't help but go back to my claims in the talk. In register style, var/reg selection (assuming it's live & visible, which is a language difference) & immediate bytes are often operands to instructions instead of separate ones, and PUSH for use isn't a concept there. These quantified populations of what I called "overhead" do confirm my assumptions.

You can certainly make a mostly-stack machine that's faster than common stack machines or other BASIC interpreters, but the very architecture still is mandating doing a lot of fiddly things that a departure might be able to avoid. But it is style A vs style B, and both have their tradeoffs and a ton of long-tail effects. Stack style probably still wins a bit in memory footprint (when using 1-byte instructions), but I'm fairly confident about my speed decisions.

Also, bytecode population != runtime usage counts, which would also be interesting to see.

Quote:

It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.

It's quite a bold decision to keep both .A and .X live between instructions, so it's interesting that you've been able to tackle all the problems so far. But yeah, stuff like this is hard to make fast/small without the accumulator. I would be very hesitant to put more burden on the dispatcher if it can't be maximally optimized, because that cost compounds on everything.

Quote:

The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.

Ah, ok. That's much different than CBM BASIC, which stores its truth values as 16-bit integers, 0 for false, and -1 ($FFFF) for true. That allows you do numeric boolean masking, like "A = (X<10) AND B" or "A = (X<10)*-B + (X>9)*-C" to conditionally convert bools to full values within expressions themselves, without needing multi-line IF statements (and CBM BASIC has no ELSE).

Quote:

I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.

I think so, too. Both bytes & performance.

Of course, differences in screen hardware, ROM usage, and target language assumptions could also sway the tides but we'll have to see. I'll definitely give it a whirl when I'm ready!

BigEd · Post by **BigEd** » Wed Oct 09, 2019 11:34 am

(Excellent discussion!)

dmsc · Post by **dmsc** » Wed Oct 09, 2019 12:16 pm

Hi!

White Flame wrote:

dmsc wrote:

Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!

Well, comparing that to burning 50-100 cycles of dispatches by forcing it to multiple instructions, that's not bad. I have face-value inefficiencies in quite a few of my instructions as well, but compared to issuing 3-4 other instructions and the register pressure of using additional intermediates, it's actually a net gain. Just need to zoom out more to see the full comparison in a larger unit of work.

Also, in reading from the instruction stream in Acheron, I keep the value of .Y live between instructions, offsetting into the instruction stream. Instructions simply LDA (iptr),Y INY and don't bother checking for overflow. The dispatcher does a BMI and rolls it into the base pointer if .Y gets larger than 127, and doesn't have to catch the exact transition.

That is an interesting idea. But in my VM, keeping Y unmodified between calls is not easy. I miss an "PHY / PLY" pair

Quote:

The implication of this is that instructions cannot be longer than 128 bytes (consider a switch/case table as a single instruction, or long literals), but that's not generally a problem.

Well, in my case the only larger instruction is the "ConstantSTRING", this is up to 256 bytes long. But it could e handled in a special way without problem.

Quote:

It eliminates all that scattered checking from the instruction code, and even the dispatcher PC update is just 4 cycles of INY BMI in the vastly common case (jump/call/ret all set the 16-bit base and clear .Y, which prevents a lot of overflow). I don't even have my dispatch code in zeropage, and it's still 22 cycles or so including the 'with' bit check.

Code: Select all

OP seti16, imm16, regs, "rP := imm16"
 lda (iptr),y
 sta 0,x
 iny
 lda (iptr),y
 sta 1,x
 jmp mainLoop1  ; expects .Y to point to the last byte

Quote:

Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser

I have humans directly writing Acheron code for now, so those should come with at least a semi-working register allocator!

Quote:

Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.

Okay, then it is actually a true hybrid. Neat!

Code: Select all

     171   TOK_VAR_LOAD         ; Loads value of variable to AX
     111   TOK_PUSH_VAR_LOAD    ; Push AX, then loads value of variable to AX
     100   TOK_VAR_STORE        ; Store AX to variable
     78    TOK_PUSH_BYTE        ; Push AX, then loads immediate byte to A, X = 0

I can't help but go back to my claims in the talk. In register style, var/reg selection (assuming it's live & visible, which is a language difference) & immediate bytes are often operands to instructions instead of separate ones, and PUSH for use isn't a concept there. These quantified populations of what I called "overhead" do confirm my assumptions.

You can certainly make a mostly-stack machine that's faster than common stack machines or other BASIC interpreters, but the very architecture still is mandating doing a lot of fiddly things that a departure might be able to avoid. But it is style A vs style B, and both have their tradeoffs and a ton of long-tail effects. Stack style probably still wins a bit in memory footprint (when using 1-byte instructions), but I'm fairly confident about my speed decisions.

Also, bytecode population != runtime usage counts, which would also be interesting to see.

For the editor this is not easy, as it would mean instrumenting it to do some particular task. But for the sieve benchmark, calculating first 1899 primes, those are the runtime statistics:

With Optimizer:

Code: Select all

Total VM instructions: 2159246
Total 6502 cycles: 145821995
Mean cycles/instruction: 67.5

13.38  288871  TOK_PUSH_VAR_LOAD
12.50  269880  TOK_ADD
11.62  250904  TOK_VAR_LOAD
10.74  231910  TOK_CJUMP
10.74  231910  TOK_FOR_NEXT
 6.95  149990  TOK_SADDR
 6.95  149990  TOK_POKE
 6.95  149990  TOK_1
 4.67  100911  TOK_CNJUMP
 3.79   81910  TOK_COMP_0
 3.79   81910  TOK_PEEK
 0.88   19003  TOK_VAR_STORE
 0.88   19001  TOK_NUM
 0.88   19001  TOK_FOR
 0.88   19001  TOK_DPOKE
 0.88   19001  TOK_VAR_SADDR
 0.88   19001  TOK_FOR_EXIT
 0.88   18990  TOK_INCVAR
 0.88   18990  TOK_USHL
 0.88   18990  TOK_PUSH_BYTE
 0.00      20  TOK_PUSH_0
 0.00      12  TOK_PUSH_1
 0.00      10  TOK_MSET
 0.00      10  TOK_0
 0.00      10  TOK_PUSH_NUM
 0.00      10  TOK_PRINT_STR
 0.00       7  TOK_CSTRING
 0.00       4  TOK_PRINT_EOL
 0.00       3  TOK_INT_STR
 0.00       2  TOK_TIME
 0.00       1  TOK_SUB
 0.00       1  TOK_END
 0.00       1  TOK_DIM
 0.00       1  TOK_BYTE

Without Optimizer:

Code: Select all

Total VM instructions: 2606028
Total 6502 cycles: 160654974
Mean cycles/instruction: 61.6

21.44  558765  TOK_VAR_LOAD)
12.54  326882  TOK_PUSH)
12.04  313820  TOK_CJUMP)
11.08  288871  TOK_ADD)
 8.90  231910  TOK_FOR_NEXT)
 6.49  169012  TOK_BYTE)
 5.76  149990  TOK_SADDR)
 5.76  149990  TOK_POKE)
 3.14   81910  TOK_COMP_0)
 3.14   81910  TOK_PEEK)
 3.14   81910  TOK_L_NOT)
 1.46   37991  TOK_VAR_SADDR)
 0.73   19011  TOK_NUM)
 0.73   19003  TOK_VAR_STORE)
 0.73   19001  TOK_CNJUMP)
 0.73   19001  TOK_FOR)
 0.73   19001  TOK_DPOKE)
 0.73   19001  TOK_FOR_EXIT)
 0.73   18990  TOK_INC)
 0.00      11  TOK_PUSH_1)
 0.00      10  TOK_MSET)
 0.00      10  TOK_PRINT_STR)
 0.00       7  TOK_CSTRING)
 0.00       4  TOK_IOCHN)
 0.00       4  TOK_0)
 0.00       4  TOK_PRINT_EOL)
 0.00       3  TOK_INT_STR)
 0.00       2  TOK_TIME)
 0.00       1  TOK_SUB)
 0.00       1  TOK_END)
 0.00       1  TOK_DIM)
 0.00       1  TOK_PUSH_BYTE)

Quote:

It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.

It's quite a bold decision to keep both .A and .X live between instructions, so it's interesting that you've been able to tackle all the problems so far. But yeah, stuff like this is hard to make fast/small without the accumulator. I would be very hesitant to put more burden on the dispatcher if it can't be maximally optimized, because that cost compounds on everything.

Quote:

The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.

Ah, ok. That's much different than CBM BASIC, which stores its truth values as 16-bit integers, 0 for false, and -1 ($FFFF) for true. That allows you do numeric boolean masking, like "A = (X<10) AND B" or "A = (X<10)*-B + (X>9)*-C" to conditionally convert bools to full values within expressions themselves, without needing multi-line IF statements (and CBM BASIC has no ELSE).

Fastbasic does support this, but using Atari Basic convention that TRUE == 1, FALSE == 0, so you can do "X=(A>2)*25". The parser inserts instructions that convert booleans to integers and vice versa for those cases.

Quote:

I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.

I think so, too. Both bytes & performance.

Of course, differences in screen hardware, ROM usage, and target language assumptions could also sway the tides but we'll have to see. I'll definitely give it a whirl when I'm ready!

That would be great. The editor does not have too much hardware dependencies, it uses PRINT to output to the screen, only relies on be able to write control codes (cursor movement, insert line, delete line) and read the current cursor position.

Have Fun!

Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM

Re: Announce: Acheron VM