128 instructions should be sufficient for many applications. And in theory there could be one 'shift' op (with or without an argument) that could temporarily / permanent / or else(°) switch the instruction dispatch table to a (or multiple) different one. (°: Perhaps a similar technique as your sliding register set could be used to extend the instruction set by changing the start of the dispatch table. Highly dangerous in some situations like subroutines and interrupt services
Announce: Acheron VM
Re: Announce: Acheron VM
The 'with' op and it's nearly transparent implemantation is a key element together with the sliding registers and Rp. It is a powerful and versatile and elegant way to avoid register-register moves and more. I would not omit this feature without serious reasons.
128 instructions should be sufficient for many applications. And in theory there could be one 'shift' op (with or without an argument) that could temporarily / permanent / or else(°) switch the instruction dispatch table to a (or multiple) different one. (°: Perhaps a similar technique as your sliding register set could be used to extend the instruction set by changing the start of the dispatch table. Highly dangerous in some situations like subroutines and interrupt services
but freaky
)
128 instructions should be sufficient for many applications. And in theory there could be one 'shift' op (with or without an argument) that could temporarily / permanent / or else(°) switch the instruction dispatch table to a (or multiple) different one. (°: Perhaps a similar technique as your sliding register set could be used to extend the instruction set by changing the start of the dispatch table. Highly dangerous in some situations like subroutines and interrupt services
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: Announce: Acheron VM
In the 256-opcode dispatcher, 'with' versions of individual instructions can be added with the first 2 lines here:
Since the OP macros do not inject any bytes into the code stream at that point (they all go into different ca65 segments), entry points can chain together this way.
The usage would remain as normal:
but compiles to the OPWITH-marked entry point of bswap instead.
This means that bswap has 2 entries in the opcode table, one for its normal version, and another that consumes the 'with' parameter. The 'with' instruction is also available standalone there, but obviously that version has full dispatch overhead.
I still think the 7-bit + 'with' flag dispatcher is better overall, but I do leave the option to use the other dispatchers.
--
I hadn't considered swapping between dispatch tables at runtime. I like the freaky/dangerous aspects of it, but we'll see how necessary it would be, if projects ever scale to that size. I can imagine a multitasking system where each process can have its own instruction set. It would certainly add space overhead, but might be doable, and many entries of the tables would share a common subset of implementation pointers. But that's for future-me to tackle.
Code: Select all
OPWITH bswap
get_rp
OP bswap, none, bits, "Swap high and low bytes of rP."
dey
save_y
ldy 0,x
lda 1,x
sta 0,x
sty 1,x
jmp mainLoopGetY
The usage would remain as normal:
Code: Select all
with r3
bswap
This means that bswap has 2 entries in the opcode table, one for its normal version, and another that consumes the 'with' parameter. The 'with' instruction is also available standalone there, but obviously that version has full dispatch overhead.
I still think the 7-bit + 'with' flag dispatcher is better overall, but I do leave the option to use the other dispatchers.
--
I hadn't considered swapping between dispatch tables at runtime. I like the freaky/dangerous aspects of it, but we'll see how necessary it would be, if projects ever scale to that size. I can imagine a multitasking system where each process can have its own instruction set. It would certainly add space overhead, but might be doable, and many entries of the tables would share a common subset of implementation pointers. But that's for future-me to tackle.
Re: Announce: Acheron VM
You could also have escape codes as prefixes to lesser-used and/or slower instructions. You'd need extra dispatch tables to accommodate them, of course, but it means you can have a large instruction set without compromising the speed of frequently used instructions.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: Announce: Acheron VM
Conceptually, there's not much difference between a prefix byte, and a byte to swap dispatch tables. It's only the duration of its effect that differs.
Re: Announce: Acheron VM
Yes, if a prefix|shift|escape byte only lasts for the next dispatch it's not really freaky. But if it should last longer then context switches (multitasking) would require that the task dispatcher has knowledge about this - and this would increase the overhead. Even common subroutines would require certain attention and if they are nested things could get easily nasty or bulky at least.
On the other hand a "from now on use FP64" would be nice and perhaps the following code would be more readable than frequently using a "shift prefix".
I remembered the XOP operations available for the TI TMS9900 CPU. But these instructions are actually only software traps like sixteen BRKs with different vectors. But perhaps a "use FP64_ADD for XOP1" could inject the address of the FP64_ADD subroutine into the dispatch table where XOP1 resides? This would last until it is redefined. A context switch still would be difficult but common subroutines might simply not use any XOPn and if necessary use a prefix to get access to special instructions.
Lot to think about - perhaps another year
On the other hand a "from now on use FP64" would be nice and perhaps the following code would be more readable than frequently using a "shift prefix".
I remembered the XOP operations available for the TI TMS9900 CPU. But these instructions are actually only software traps like sixteen BRKs with different vectors. But perhaps a "use FP64_ADD for XOP1" could inject the address of the FP64_ADD subroutine into the dispatch table where XOP1 resides? This would last until it is redefined. A context switch still would be difficult but common subroutines might simply not use any XOPn and if necessary use a prefix to get access to special instructions.
Lot to think about - perhaps another year
Re: Announce: Acheron VM
Hi!
Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .
I also use a table dispatch, and also use ca65 macros to automatically populate the dispatch table and assign values to the instructions, see for example at https://github.com/dmsc/fastbasic/blob/ ... ub.asm#L38 the definition of the "ADD" and "SUB" instructions, and this is the macro definition https://github.com/dmsc/fastbasic/blob/ ... ok.inc#L31 .
My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.
A brief description of the VM:
- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
- The optional FP support adds instructions that operate on 6 bytes floating point, there is a separate floating point accumulator and a separate floating point stack.
- Also, there are many instructions that are contractions of others to make the code smaller and faster.
- The dispatch code uses only the Y register, and loads the stack pointer to Y before calling operations.
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
- The CALL and RETURN instructions use the 6502 stack to store the return address.
The instruction names are not really good, as were only tough as a BASIC representation, but are fairly generic, so many languages could be compiled to the VM.
Example of code generation:
For the BASIC code " DPOKE 1000, 1234 " (write the value 1234 to memory locations 1000 and 1001), this is 4 VM instructions:
NUM 1000 ; SADDR ; NUM 1234; DPOKE
NUM loads a 16 byte immediate to the accumulator, SADDR copies the accumulator to the pointer, DPOKE writes the value in the accumulator to the pointer address.
The BASIC code " X(100) = A " compiles to:
VAR_LOAD X; PUSH; BYTE 100 ; SHL ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
Variables can be used for function parameters, as I explicitly dis-allow recursive functions.
Have Fun!
White Flame wrote:
Yep, so you can see why it's advantageous to adjust the instruction set itself. ~20 cycles to dispatch to custom instructions, each use only takes a byte (plus params), and you can remove existing instructions if you're not using them. Then you can consider what operations are often used and what's seldomly used, and figure out your balance per project.
While the 7-bit 'with' dispatcher supports 128 instructions, you can dispatch on a full 256 as well, but you'd have to deal with the 'with' operation separately.
While the 7-bit 'with' dispatcher supports 128 instructions, you can dispatch on a full 256 as well, but you'd have to deal with the 'with' operation separately.
I also use a table dispatch, and also use ca65 macros to automatically populate the dispatch table and assign values to the instructions, see for example at https://github.com/dmsc/fastbasic/blob/ ... ub.asm#L38 the definition of the "ADD" and "SUB" instructions, and this is the macro definition https://github.com/dmsc/fastbasic/blob/ ... ok.inc#L31 .
My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.
A brief description of the VM:
- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
- The optional FP support adds instructions that operate on 6 bytes floating point, there is a separate floating point accumulator and a separate floating point stack.
- Also, there are many instructions that are contractions of others to make the code smaller and faster.
- The dispatch code uses only the Y register, and loads the stack pointer to Y before calling operations.
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
- The CALL and RETURN instructions use the 6502 stack to store the return address.
The instruction names are not really good, as were only tough as a BASIC representation, but are fairly generic, so many languages could be compiled to the VM.
Example of code generation:
For the BASIC code " DPOKE 1000, 1234 " (write the value 1234 to memory locations 1000 and 1001), this is 4 VM instructions:
NUM 1000 ; SADDR ; NUM 1234; DPOKE
NUM loads a 16 byte immediate to the accumulator, SADDR copies the accumulator to the pointer, DPOKE writes the value in the accumulator to the pointer address.
The BASIC code " X(100) = A " compiles to:
VAR_LOAD X; PUSH; BYTE 100 ; SHL ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
Variables can be used for function parameters, as I explicitly dis-allow recursive functions.
Have Fun!
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: Announce: Acheron VM
dmsc wrote:
Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .
Regarding differences, I think the main one between our projects relates to your last example:
Quote:
The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
With register orthogonality, there's no instructions like SADDR or PUSH in general. Values just live where they are, and are directly referenced, without being copied around with additional instructions (and more importantly, without their respective dispatch overhead). But like I said in my talk, the assumption is that this is beneficial when values are reused multiple times, either as a constant or statefully. Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.
There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.
Quote:
My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.
Quote:
- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
- All binary operations use the accumulator and the stack if needed.
Quote:
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
Quote:
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
Quote:
Variables can be used for function parameters, as I explicitly dis-allow recursive functions.
Re: Announce: Acheron VM
Hi!
Cool, it's nice to have other points of comparison, especially if it uses nonstandard and fully-integrated designs.
Regarding differences, I think the main one between our projects relates to your last example:
Note that with 6 instructions, the dispatcher burns 6*27=162 cycles. While not as bad as other systems, that still is 46% of your spent CPU time. That's an optimization opportunity I'm really focusing on in Acheron. For instance, "stmi r3, 200" would replace 4 fastbasic instructions (push_byte 200, add, saddr, dpoke): It would store the current 16-bit value using STA (r3),Y addressing with .Y=200-201 to avoid the separate offset pushing & 16-bit add. Of course, this pointer-offset addressing only supports a 256 byte range, while a general 16-bit add can reach the entire address space.
Yes. This is an optimization that I specifically avoided as it is very difficult for the optimizer to narrow variables to a range, currently the optimizer is only a simple peephole pass. And in common code (I have used some game code and benchmarks like the "sieve" bench) most array indexes are 16-bit sized variables.
Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!
Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser
But for a cross-compiler, surely this is possible.
Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.
There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.
My 3 qualities were speed, space, and power (higher level of abstraction, powerful but still general purpose instructions, etc). And of course having a user-editable ISA is the ultimate per-project optimization, but with compatibility tradeoffs. Since mine is basically a high-level assembly language it only takes 1.9kB at the moment, but it'll certainly grow tooling in the same directions.
Would it be fair to describe this as a stack machine, with the TOS held in registers?
Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.
Yes, and I have tough about replacing the X register with a ZP location - but my simple measurements show that it would not be a net gain in speed - and be a loss in size. And having the accumulator in two ZP locations is slower, I tested it at the beginning. This is because in many programs arithmetic operations don't dominate the runtime.
This is the top used VM operations in the editor ( https://github.com/dmsc/fastbasic/blob/ ... or.bas#L27 ), when compiled without optimizations:
As you see, less than half of opcodes actually use the stack. When using the optimizer (that is only available in the cross-compiler), the statistics change:
Yes, that is an interesting idea, I could get smaller instruction sizes by storing the "PUSH before the instruction" in the high bit. But, I need to do that efficiently without using the 6502 accumulator... perhaps:
It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.
I decided to stick with 16-bit ALU operations for mine, since in my environment you can drop into 6502 asm inline and deal with 8-bit computations when it'd be worth the overhead savings. Load/store instructions support both 8 and 16 values.
The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.
Yeah, mine is called the "globals" page (named in contrast to function-local registers), although I stuck with a 256-byte page like zeropage does.
I used 512 bytes instead of the simpler 256 because you can store floating point values, that use 6 bytes, so I tough that it would allow bigger programs. But perhaps limiting to 256 is not that limited. Also, I could add a conditional compilation option that uses the simpler implementation if the used space is less than 256 bytes.
Does that mean a BASIC routine can't recursively GOSUB itself, or is it a more low level restriction?
Well, FastBasic does not support GOTO, GOSUB or line numbers
But no, it means that if you call a procedure recursively, the value of the variables is overwritten. The optimizer is capable of tail-call optimizations, this is really useful for the editor, as it means that code that ends calling a subroutine can simple use JUMP instead of CALL/RETURN.
For example:
This will be compiled to comparisons and conditional jumps:
I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.
Thanks for your comments, and Have Fun!
White Flame wrote:
dmsc wrote:
Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (https://github.com/dmsc/fastbasic ) .
Regarding differences, I think the main one between our projects relates to your last example:
Quote:
The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
Quote:
With register orthogonality, there's no instructions like SADDR or PUSH in general. Values just live where they are, and are directly referenced, without being copied around with additional instructions (and more importantly, without their respective dispatch overhead). But like I said in my talk, the assumption is that this is beneficial when values are reused multiple times, either as a constant or statefully.
Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser
But for a cross-compiler, surely this is possible.
Quote:
Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.
There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.
Quote:
My VM is different because is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version, this includes strings, arrays, graphics, etc.
Quote:
- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16 bit the pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
- All binary operations use the accumulator and the stack if needed.
Quote:
I noticed in the things I glanced that there's a fair amount of [operation lo], PHA, TXA, [operation hi], TAX, PLA in order to swap bytes of the accumulator, and JSR/RTS to deal with a number of standard AX operations, so that's something to consider performance-wise.
This is the top used VM operations in the editor ( https://github.com/dmsc/fastbasic/blob/ ... or.bas#L27 ), when compiled without optimizations:
Code: Select all
290 TOK_VAR_LOAD ; Loads value of variable to AX
273 TOK_BYTE ; Loads immediate byte to A, X = 0
263 TOK_PUSH ; Push AX to stack
103 TOK_VAR_STORE ; Store AX to variable
86 TOK_CJUMP ; Jump to target if A != 1
80 TOK_CALL ; Calls subroutine
66 TOK_ADD ; Adds TOS into AX
55 TOK_JUMP ; Jumps to target
39 TOK_SUB ; Subtracts AX from TOS
35 TOK_EQ ; Compare AX with TOS, set A=1 if equal
31 TOK_VAR_SADDR ; Loads address of variable to SADDR
Code: Select all
171 TOK_VAR_LOAD ; Loads value of variable to AX
111 TOK_PUSH_VAR_LOAD ; Push AX, then loads value of variable to AX
100 TOK_VAR_STORE ; Store AX to variable
78 TOK_PUSH_BYTE ; Push AX, then loads immediate byte to A, X = 0
64 TOK_CJUMP ; Jump to target if A != 1
55 TOK_ADD ; Adds TOS into AX
48 TOK_BYTE ; Loads immediate byte to A, X = 0
46 TOK_CALL ; Calls subroutine
46 TOK_0 ; Loads #0 into AX
36 TOK_PUSH_1 ; Push AX, then loads #1 into AX
34 TOK_EQ ; Compare AX with TOS, set A=1 if equal
33 TOK_JUMP ; Jumps to target
33 TOK_1 ; Loads #1 into AX
32 TOK_SUB ; Subtracts AX from TOS
Quote:
Since moving away from bit-packed opcodes/operands, I've eliminated all JSRs in the instruction implementations, except for tohex/fromhex which call a local helper per digit.
Code: Select all
.proc interpreter
next_instruction:
cload: ldy $1234
inc z:cload+1
bne adj
inc z:cload+2
adj: sty z:jump+1
asl z:jump+1
bcs do_push
ldsptr: ldy #0
jump: jmp (__JUMPTAB_RUN__)
Quote:
Quote:
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
Quote:
Quote:
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
Quote:
Quote:
Variables can be used for function parameters, as I explicitly dis-allow recursive functions.
But no, it means that if you call a procedure recursively, the value of the variables is overwritten. The optimizer is capable of tail-call optimizations, this is really useful for the editor, as it means that code that ends calling a subroutine can simple use JUMP instead of CALL/RETURN.
For example:
Code: Select all
PROC decide
IF a>0
EXEC do_positive
ELIF a < 0
EXEC do_negative
ELSE
EXEC do_zero
ENDIF
ENDPROC
Code: Select all
proc_lbl_DECIDE:
VAR_LOAD #0
PUSH_0
GT
CNJUMP proc_lbl_DO_POSITIVE
VAR_LOAD #0
PUSH_0
LT
CNJUMP proc_lbl_DO_NEGATIVE
JUMP proc_lbl_DO_ZERO
Thanks for your comments, and Have Fun!
Re: Announce: Acheron VM
Quote:
I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read…
Conversely, if you keep the low byte of PC constant and only increment Y, then you only need to increment the high byte of PC when Y wraps around. That would also save quite a few cycles, at the expense of needing to preserve Y somewhere.
Re: Announce: Acheron VM
Hi!
Why are you using the Y register this way? If you assume a 65C02, there's a non-indexed zero-page indirect mode for LDA.
Because Fastbasic runs in the Atari 8-bit computers, that use the 6502 CPU.
That would be slower and bigger, I tried: " LDY IDX ; LDA (PC),Y ; INY ; STY IDX ; BNE *+2; INC PC+1 ".
Current solution is a lot faster because the main dispatch code is in zeropage, so the pointer is read directly with an "LDY PTR", no indirection.
Have Fun!
Chromatix wrote:
Quote:
I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read…
Quote:
Conversely, if you keep the low byte of PC constant and only increment Y, then you only need to increment the high byte of PC when Y wraps around. That would also save quite a few cycles, at the expense of needing to preserve Y somewhere.
Current solution is a lot faster because the main dispatch code is in zeropage, so the pointer is read directly with an "LDY PTR", no indirection.
Have Fun!
Re: Announce: Acheron VM
Right, but what if you need two instruction bytes in succession? That saves you an LDY/STY pair, as long as you don't need the Y register meanwhile.
Re: Announce: Acheron VM
Hi!
See: https://github.com/dmsc/fastbasic/blob/ ... st.asm#L41
Reading two bytes with only Y as pointer is more difficult, because you need to test for wrapping from 255 to 1 and from 254 to 0.
There is one alternative to all this: restrict the code so that pages are never crossed, and store one "next-page" instruction at the last position of the page. I have not done this because currently the VM code is relocatable, as the IDE emits the code to one address but compiles to another, but it could make the interpreter faster.
Chromatix wrote:
Right, but what if you need two instruction bytes in succession? That saves you an LDY/STY pair, as long as you don't need the Y register meanwhile.
Reading two bytes with only Y as pointer is more difficult, because you need to test for wrapping from 255 to 1 and from 254 to 0.
There is one alternative to all this: restrict the code so that pages are never crossed, and store one "next-page" instruction at the last position of the page. I have not done this because currently the VM code is relocatable, as the IDE emits the code to one address but compiles to another, but it could make the interpreter faster.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: Announce: Acheron VM
dmsc wrote:
Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!
Also, in reading from the instruction stream in Acheron, I keep the value of .Y live between instructions, offsetting into the instruction stream. Instructions simply LDA (iptr),Y INY and don't bother checking for overflow. The dispatcher does a BMI and rolls it into the base pointer if .Y gets larger than 127, and doesn't have to catch the exact transition.
The implication of this is that instructions cannot be longer than 128 bytes (consider a switch/case table as a single instruction, or long literals), but that's not generally a problem. It eliminates all that scattered checking from the instruction code, and even the dispatcher PC update is just 4 cycles of INY BMI in the vastly common case (jump/call/ret all set the 16-bit base and clear .Y, which prevents a lot of overflow). I don't even have my dispatch code in zeropage, and it's still 22 cycles or so including the 'with' bit check.
Code: Select all
OP seti16, imm16, regs, "rP := imm16"
lda (iptr),y
sta 0,x
iny
lda (iptr),y
sta 1,x
jmp mainLoop1 ; expects .Y to point to the last byteQuote:
Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser 
Quote:
Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.
Code: Select all
171 TOK_VAR_LOAD ; Loads value of variable to AX
111 TOK_PUSH_VAR_LOAD ; Push AX, then loads value of variable to AX
100 TOK_VAR_STORE ; Store AX to variable
78 TOK_PUSH_BYTE ; Push AX, then loads immediate byte to A, X = 0
You can certainly make a mostly-stack machine that's faster than common stack machines or other BASIC interpreters, but the very architecture still is mandating doing a lot of fiddly things that a departure might be able to avoid. But it is style A vs style B, and both have their tradeoffs and a ton of long-tail effects. Stack style probably still wins a bit in memory footprint (when using 1-byte instructions), but I'm fairly confident about my speed decisions.
Also, bytecode population != runtime usage counts, which would also be interesting to see.
Quote:
It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.
Quote:
The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.
Quote:
I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.
Re: Announce: Acheron VM
(Excellent discussion!)
Re: Announce: Acheron VM
Hi!
Well, comparing that to burning 50-100 cycles of dispatches by forcing it to multiple instructions, that's not bad. I have face-value inefficiencies in quite a few of my instructions as well, but compared to issuing 3-4 other instructions and the register pressure of using additional intermediates, it's actually a net gain. Just need to zoom out more to see the full comparison in a larger unit of work.
Also, in reading from the instruction stream in Acheron, I keep the value of .Y live between instructions, offsetting into the instruction stream. Instructions simply LDA (iptr),Y INY and don't bother checking for overflow. The dispatcher does a BMI and rolls it into the base pointer if .Y gets larger than 127, and doesn't have to catch the exact transition.That is an interesting idea. But in my VM, keeping Y unmodified between calls is not easy. I miss an "PHY / PLY" pair
Well, in my case the only larger instruction is the "ConstantSTRING", this is up to 256 bytes long. But it could e handled in a special way without problem.
It eliminates all that scattered checking from the instruction code, and even the dispatcher PC update is just 4 cycles of INY BMI in the vastly common case (jump/call/ret all set the 16-bit base and clear .Y, which prevents a lot of overflow). I don't even have my dispatch code in zeropage, and it's still 22 cycles or so including the 'with' bit check.
I have humans directly writing Acheron code for now, so those should come with at least a semi-working register allocator!
Okay, then it is actually a true hybrid. Neat!
I can't help but go back to my claims in the talk. In register style, var/reg selection (assuming it's live & visible, which is a language difference) & immediate bytes are often operands to instructions instead of separate ones, and PUSH for use isn't a concept there. These quantified populations of what I called "overhead" do confirm my assumptions.
You can certainly make a mostly-stack machine that's faster than common stack machines or other BASIC interpreters, but the very architecture still is mandating doing a lot of fiddly things that a departure might be able to avoid. But it is style A vs style B, and both have their tradeoffs and a ton of long-tail effects. Stack style probably still wins a bit in memory footprint (when using 1-byte instructions), but I'm fairly confident about my speed decisions.
Also, bytecode population != runtime usage counts, which would also be interesting to see. For the editor this is not easy, as it would mean instrumenting it to do some particular task. But for the sieve benchmark, calculating first 1899 primes, those are the runtime statistics:
With Optimizer:
Without Optimizer:
It's quite a bold decision to keep both .A and .X live between instructions, so it's interesting that you've been able to tackle all the problems so far. But yeah, stuff like this is hard to make fast/small without the accumulator. I would be very hesitant to put more burden on the dispatcher if it can't be maximally optimized, because that cost compounds on everything.
Ah, ok. That's much different than CBM BASIC, which stores its truth values as 16-bit integers, 0 for false, and -1 ($FFFF) for true. That allows you do numeric boolean masking, like "A = (X<10) AND B" or "A = (X<10)*-B + (X>9)*-C" to conditionally convert bools to full values within expressions themselves, without needing multi-line IF statements (and CBM BASIC has no ELSE).
Fastbasic does support this, but using Atari Basic convention that TRUE == 1, FALSE == 0, so you can do "X=(A>2)*25". The parser inserts instructions that convert booleans to integers and vice versa for those cases.
I think so, too. Both bytes & performance.
Of course, differences in screen hardware, ROM usage, and target language assumptions could also sway the tides but we'll have to see. I'll definitely give it a whirl when I'm ready!
That would be great. The editor does not have too much hardware dependencies, it uses PRINT to output to the screen, only relies on be able to write control codes (cursor movement, insert line, delete line) and read the current cursor position.
Have Fun!
White Flame wrote:
dmsc wrote:
Yes, I see the benefits of registers in the VM. But when doing optimizations, I realized also that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC), Y : INC PC : BNE *+2 ; INC PC+1 " for each read, this is 10 bytes and 15 cycles in the best case, half of what i takes to do a instruction dispatch!
Also, in reading from the instruction stream in Acheron, I keep the value of .Y live between instructions, offsetting into the instruction stream. Instructions simply LDA (iptr),Y INY and don't bother checking for overflow. The dispatcher does a BMI and rolls it into the base pointer if .Y gets larger than 127, and doesn't have to catch the exact transition.
Quote:
The implication of this is that instructions cannot be longer than 128 bytes (consider a switch/case table as a single instruction, or long literals), but that's not generally a problem.
Quote:
It eliminates all that scattered checking from the instruction code, and even the dispatcher PC update is just 4 cycles of INY BMI in the vastly common case (jump/call/ret all set the 16-bit base and clear .Y, which prevents a lot of overflow). I don't even have my dispatch code in zeropage, and it's still 22 cycles or so including the 'with' bit check.
Code: Select all
OP seti16, imm16, regs, "rP := imm16"
lda (iptr),y
sta 0,x
iny
lda (iptr),y
sta 1,x
jmp mainLoop1 ; expects .Y to point to the last byteQuote:
Also, for good code generation with registers, you need a good register allocator - this means liveness analysis and other high level techniques - I can't do that efficiently in my simple parser 
Quote:
Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM don't need to PUSH/POP when loading the TOS. This is specially important in the simple expressions, like " A = 1 ", where the stack is not used at all - this is compiled to " BYTE #3 ; VAR_STORE 'A' " even without the optimizer.
Code: Select all
171 TOK_VAR_LOAD ; Loads value of variable to AX
111 TOK_PUSH_VAR_LOAD ; Push AX, then loads value of variable to AX
100 TOK_VAR_STORE ; Store AX to variable
78 TOK_PUSH_BYTE ; Push AX, then loads immediate byte to A, X = 0
You can certainly make a mostly-stack machine that's faster than common stack machines or other BASIC interpreters, but the very architecture still is mandating doing a lot of fiddly things that a departure might be able to avoid. But it is style A vs style B, and both have their tradeoffs and a ton of long-tail effects. Stack style probably still wins a bit in memory footprint (when using 1-byte instructions), but I'm fairly confident about my speed decisions.
Also, bytecode population != runtime usage counts, which would also be interesting to see.
With Optimizer:
Code: Select all
Total VM instructions: 2159246
Total 6502 cycles: 145821995
Mean cycles/instruction: 67.5
13.38 288871 TOK_PUSH_VAR_LOAD
12.50 269880 TOK_ADD
11.62 250904 TOK_VAR_LOAD
10.74 231910 TOK_CJUMP
10.74 231910 TOK_FOR_NEXT
6.95 149990 TOK_SADDR
6.95 149990 TOK_POKE
6.95 149990 TOK_1
4.67 100911 TOK_CNJUMP
3.79 81910 TOK_COMP_0
3.79 81910 TOK_PEEK
0.88 19003 TOK_VAR_STORE
0.88 19001 TOK_NUM
0.88 19001 TOK_FOR
0.88 19001 TOK_DPOKE
0.88 19001 TOK_VAR_SADDR
0.88 19001 TOK_FOR_EXIT
0.88 18990 TOK_INCVAR
0.88 18990 TOK_USHL
0.88 18990 TOK_PUSH_BYTE
0.00 20 TOK_PUSH_0
0.00 12 TOK_PUSH_1
0.00 10 TOK_MSET
0.00 10 TOK_0
0.00 10 TOK_PUSH_NUM
0.00 10 TOK_PRINT_STR
0.00 7 TOK_CSTRING
0.00 4 TOK_PRINT_EOL
0.00 3 TOK_INT_STR
0.00 2 TOK_TIME
0.00 1 TOK_SUB
0.00 1 TOK_END
0.00 1 TOK_DIM
0.00 1 TOK_BYTECode: Select all
Total VM instructions: 2606028
Total 6502 cycles: 160654974
Mean cycles/instruction: 61.6
21.44 558765 TOK_VAR_LOAD)
12.54 326882 TOK_PUSH)
12.04 313820 TOK_CJUMP)
11.08 288871 TOK_ADD)
8.90 231910 TOK_FOR_NEXT)
6.49 169012 TOK_BYTE)
5.76 149990 TOK_SADDR)
5.76 149990 TOK_POKE)
3.14 81910 TOK_COMP_0)
3.14 81910 TOK_PEEK)
3.14 81910 TOK_L_NOT)
1.46 37991 TOK_VAR_SADDR)
0.73 19011 TOK_NUM)
0.73 19003 TOK_VAR_STORE)
0.73 19001 TOK_CNJUMP)
0.73 19001 TOK_FOR)
0.73 19001 TOK_DPOKE)
0.73 19001 TOK_FOR_EXIT)
0.73 18990 TOK_INC)
0.00 11 TOK_PUSH_1)
0.00 10 TOK_MSET)
0.00 10 TOK_PRINT_STR)
0.00 7 TOK_CSTRING)
0.00 4 TOK_IOCHN)
0.00 4 TOK_0)
0.00 4 TOK_PRINT_EOL)
0.00 3 TOK_INT_STR)
0.00 2 TOK_TIME)
0.00 1 TOK_SUB)
0.00 1 TOK_END)
0.00 1 TOK_DIM)
0.00 1 TOK_PUSH_BYTE)
Quote:
Quote:
It's 7 extra cycles per instruction, but could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit PUSH by adding 128 to the next emitted instruction.
Quote:
The big gain is in the boolean operations, as it makes the code like "IF (A>1) AND (A<3)" a lot faster, as the result if the comparison, the "AND" and the conditional jump all operate on bytes instead of words.
Quote:
Quote:
I think that a good test for a register VM like yours would be to see how many bytes my editor uses, the current editor is compiler to 2267 bytes, this includes all string constants.
Have Fun!