Hi!
White Flame wrote:
dmsc wrote:
Skimmed over your presentation, certainly your VM has some similarities - and differences - with my FastBasic VM (
https://github.com/dmsc/fastbasic ) .
Cool, it's nice to have other points of comparison, especially if it uses nonstandard and fully-integrated designs.
Regarding differences, I think the main one between our projects relates to your last example:
Quote:
The optimizer can transform that code to the shorter, total execution time is 350 cycles for the 6 instructions:
VAR_LOAD X; PUSH_BYTE 200 ; ADD ; SADDR ; VAR_LOAD A ; DPOKE
Note that with 6 instructions, the dispatcher burns 6*27=162 cycles. While not as bad as other systems, that still is 46% of your spent CPU time. That's an optimization opportunity I'm really focusing on in Acheron. For instance, "stmi r3, 200" would replace 4 fastbasic instructions (push_byte 200, add, saddr, dpoke): It would store the current 16-bit value using STA (r3),Y addressing with .Y=200-201 to avoid the separate offset pushing & 16-bit add. Of course, this pointer-offset addressing only supports a 256 byte range, while a general 16-bit add can reach the entire address space.
Yes. This is an optimization that I specifically avoided, as it is very difficult for the optimizer to narrow variables to a range; currently the optimizer is only a simple peephole pass. And in common code (I have used some game code and benchmarks like the "sieve" bench), most array indexes are 16-bit variables.
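For readers unfamiliar with the term, the kind of peephole pass I mean can be sketched in a few lines of Python. This is only a toy model: the token names follow this post, but the pattern table and representation are illustrative, not FastBasic's actual code.
Code:
```python
# Minimal peephole optimizer sketch: scan a token list and replace known
# instruction patterns with shorter fused equivalents. Token and pattern
# names are illustrative, not FastBasic's actual tables.
def peephole(tokens, patterns):
    out = []
    i = 0
    while i < len(tokens):
        for pat, repl in patterns:
            if tokens[i:i + len(pat)] == list(pat):
                out.extend(repl)      # emit the shorter replacement
                i += len(pat)
                break
        else:
            out.append(tokens[i])     # no pattern matched: copy as-is
            i += 1
    return out

# Example: fuse PUSH followed by BYTE into a single PUSH_BYTE token,
# mirroring the fused tokens visible in the optimized statistics below.
patterns = [(("PUSH", "BYTE"), ["PUSH_BYTE"]),
            (("PUSH", "VAR_LOAD"), ["PUSH_VAR_LOAD"])]
code = ["VAR_LOAD", "PUSH", "BYTE", "ADD"]
print(peephole(code, patterns))  # ['VAR_LOAD', 'PUSH_BYTE', 'ADD']
```
A pass like this only sees a fixed-size window of adjacent tokens, which is why range-narrowing a variable (a whole-program property) is out of its reach.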
Quote:
With register orthogonality, there's no instructions like SADDR or PUSH in general. Values just live where they are, and are directly referenced, without being copied around with additional instructions (and more importantly, without their respective dispatch overhead). But like I said in my talk, the assumption is that this is beneficial when values are reused multiple times, either as a constant or statefully.
Yes, I see the benefits of registers in the VM. But when doing optimizations, I also realized that reading from the instruction stream is very slow, as you need " LDY #0 ; LDA (PC),Y ; INC PC ; BNE *+4 ; INC PC+1 " for each read; this is 10 bytes and 15 cycles in the best case, half of what it takes to do an instruction dispatch!
Also, for good code generation with registers, you need a good register allocator; this means liveness analysis and other high-level techniques, and I can't do that efficiently in my simple parser.
But for a cross-compiler, this is certainly possible.
Quote:
Acheron has functions with local registers that are directly processed, and these registers are generally longer lived (function-scope) than BASIC's expression evaluation intermediate values, so this performance assumption should apply in a language-specific fashion. However, I haven't done any overall measurement yet, rather focusing on niggling infrastructural clock cycle sinks like this, having repeatedly changed its fundamental design to reduce them.
There are a lot of advantages to a more dedicated pointer register (though address1->address2 copies aren't one of them), and figuring out how to take advantage of that in a clean/fast/powerful way in AcheronVM has been bothering me for some time. But I have recently cracked that nut within my orthogonal register model, and one of my next updates is going to be finally addressing that.
Quote:
My VM is different because it is optimized for speed, space and to be easy to compile to, as the entire IDE (editor, compiler and interpreter) is less than 8KB for the integer version and about 9.5KB for the floating point version; this includes strings, arrays, graphics, etc.
My 3 qualities were speed, space, and power (higher level of abstraction, powerful but still general purpose instructions, etc). And of course having a user-editable ISA is the ultimate per-project optimization, but with compatibility tradeoffs. Since mine is basically a high-level assembly language it only takes 1.9kB at the moment, but it'll certainly grow tooling in the same directions.
Quote:
- The VM has two registers, the accumulator (that is kept in the 6502 A and X registers) and one 16-bit pointer, in ZP.
- All binary operations use the accumulator and the stack if needed.
Would it be fair to describe this as a stack machine, with the TOS held in registers?
Yes, but with one caveat: the TOS register is completely independent of the stack, so the VM doesn't need to PUSH/POP when loading the TOS. This is especially important in simple expressions, like " A = 1 ", where the stack is not used at all; this is compiled to " BYTE #1 ; VAR_STORE 'A' " even without the optimizer.
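That caveat is easy to model. Here is a toy Python sketch (not FastBasic's code) of a stack machine whose TOS lives in a register: loads simply overwrite the register, and only an explicit PUSH or a binary operation touches the stack.
Code:
```python
# Toy model of a stack machine with the top-of-stack held in a register
# (like FastBasic's AX accumulator). Loads overwrite the register
# directly, so "A = 1" never touches the stack at all.
class ToyVM:
    def __init__(self):
        self.acc = 0          # TOS register (AX in the real VM)
        self.stack = []
        self.vars = {}

    def run(self, program):
        for op, *arg in program:
            if op == "BYTE":
                self.acc = arg[0]                 # immediate load
            elif op == "PUSH":
                self.stack.append(self.acc)       # explicit spill only
            elif op == "VAR_LOAD":
                self.acc = self.vars[arg[0]]
            elif op == "VAR_STORE":
                self.vars[arg[0]] = self.acc
            elif op == "ADD":
                self.acc = (self.stack.pop() + self.acc) & 0xFFFF

vm = ToyVM()
vm.run([("BYTE", 1), ("VAR_STORE", "A")])   # A = 1, stack untouched
print(vm.vars["A"], vm.stack)               # 1 []
```
In a classic stack machine every load would be a push and every store a pop; keeping the TOS in a register removes both memory operations from the common case.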
Quote:
I noticed in the things I glanced that there's a fair amount of [operation lo], PHA, TXA, [operation hi], TAX, PLA in order to swap bytes of the accumulator, and JSR/RTS to deal with a number of standard AX operations, so that's something to consider performance-wise.
Yes, and I have thought about replacing the X register with a ZP location, but my simple measurements show that it would not be a net gain in speed, and it would be a loss in size. And keeping the accumulator in two ZP locations is slower; I tested that at the beginning. This is because in many programs arithmetic operations don't dominate the runtime.
These are the top used VM operations in the editor (
https://github.com/dmsc/fastbasic/blob/ ... or.bas#L27 ), when compiled without optimizations:
Code:
290 TOK_VAR_LOAD ; Loads value of variable to AX
273 TOK_BYTE ; Loads immediate byte to A, X = 0
263 TOK_PUSH ; Push AX to stack
103 TOK_VAR_STORE ; Store AX to variable
86 TOK_CJUMP ; Jump to target if A != 1
80 TOK_CALL ; Calls subroutine
66 TOK_ADD ; Adds TOS into AX
55 TOK_JUMP ; Jumps to target
39 TOK_SUB ; Subtracts AX from TOS
35 TOK_EQ ; Compare AX with TOS, set A=1 if equal
31 TOK_VAR_SADDR ; Loads address of variable to SADDR
As you can see, less than half of the opcodes actually use the stack. When using the optimizer (which is only available in the cross-compiler), the statistics change:
Code:
171 TOK_VAR_LOAD ; Loads value of variable to AX
111 TOK_PUSH_VAR_LOAD ; Push AX, then loads value of variable to AX
100 TOK_VAR_STORE ; Store AX to variable
78 TOK_PUSH_BYTE ; Push AX, then loads immediate byte to A, X = 0
64 TOK_CJUMP ; Jump to target if A != 1
55 TOK_ADD ; Adds TOS into AX
48 TOK_BYTE ; Loads immediate byte to A, X = 0
46 TOK_CALL ; Calls subroutine
46 TOK_0 ; Loads #0 into AX
36 TOK_PUSH_1 ; Push AX, then loads #1 into AX
34 TOK_EQ ; Compare AX with TOS, set A=1 if equal
33 TOK_JUMP ; Jumps to target
33 TOK_1 ; Loads #1 into AX
32 TOK_SUB ; Subtracts AX from TOS
Quote:
Since moving away from bit-packed opcodes/operands, I've eliminated all JSRs in the instruction implementations, except for tohex/fromhex which call a local helper per digit.
Yes, that is an interesting idea. I could get smaller instruction sizes by storing the "PUSH before the instruction" flag in the high bit. But I need to do that efficiently without using the 6502 accumulator... perhaps:
Code:
.proc interpreter
next_instruction:
cload: ldy $1234 ; self-modified operand: fetch the next token byte
inc z:cload+1 ; advance the embedded fetch address...
bne adj
inc z:cload+2 ; ...carrying into the high byte when it wraps
adj: sty z:jump+1 ; token index becomes the jump table offset
asl z:jump+1 ; times 2 for 2-byte entries; bit 7 falls into carry
bcs do_push ; high bit set: push AX first, then dispatch
ldsptr: ldy #0 ; self-modified: reload Y before dispatch
jump: jmp (__JUMPTAB_RUN__)
It's 7 extra cycles per instruction, but it could make the VM smaller by removing 6 PUSH_* instructions, and the code generator could simply be instructed to emit a PUSH by adding 128 to the next emitted instruction.
Quote:
Quote:
- There are instructions that operate on 16 bit values (most) and others that operate on 8 bit values, like all boolean operations.
I decided to stick with 16-bit ALU operations for mine, since in my environment you can drop into 6502 asm inline and deal with 8-bit computations when it'd be worth the overhead savings. Load/store instructions support both 8 and 16 values.
The big gain is in the boolean operations, as it makes code like "IF (A>1) AND (A<3)" a lot faster: the result of the comparison, the "AND" and the conditional jump all operate on bytes instead of words.
Quote:
Quote:
- The VM also supports "variables", those are 512 bytes that can be addressed directly with only one byte parameter. The variables area can move in memory, supporting more than one program running concurrently. Integer and pointer variables use 2 bytes and floating point variables use 6 bytes.
Yeah, mine is called the "globals" page (named in contrast to function-local registers), although I stuck with a 256-byte page like zeropage does.
I used 512 bytes instead of the simpler 256 because you can store floating point values, which use 6 bytes, so I thought it would allow bigger programs. But perhaps limiting it to 256 is not that restrictive. Also, I could add a conditional compilation option that uses the simpler implementation when the used space is less than 256 bytes.
Quote:
Quote:
Variables can be used for function parameters, as I explicitly dis-allow recursive functions.
Does that mean a BASIC routine can't recursively GOSUB itself, or is it a more low level restriction?
Well, FastBasic does not support GOTO, GOSUB or line numbers.
But no, it means that if you call a procedure recursively, the values of its variables are overwritten. The optimizer is capable of tail-call optimization; this is really useful for the editor, as it means that code that ends by calling a subroutine can simply use JUMP instead of CALL/RETURN.
For example:
Code:
PROC decide
IF a>0
EXEC do_positive
ELIF a < 0
EXEC do_negative
ELSE
EXEC do_zero
ENDIF
ENDPROC
This will be compiled to comparisons and conditional jumps:
Code:
proc_lbl_DECIDE:
VAR_LOAD #0
PUSH_0
GT
CNJUMP proc_lbl_DO_POSITIVE
VAR_LOAD #0
PUSH_0
LT
CNJUMP proc_lbl_DO_NEGATIVE
JUMP proc_lbl_DO_ZERO
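The tail-call rewrite itself can be sketched as another peephole over the emitted tokens. Again this is illustrative Python, not the actual optimizer, and the token representation is made up for the example:
Code:
```python
# Tail-call optimization sketch: a CALL immediately followed by RETURN
# becomes a plain JUMP, so the callee's own RETURN goes straight back
# to the original caller and no stack frame is consumed.
def tail_call_optimize(tokens):
    out = []
    i = 0
    while i < len(tokens):
        if (tokens[i][0] == "CALL" and i + 1 < len(tokens)
                and tokens[i + 1][0] == "RETURN"):
            out.append(("JUMP", tokens[i][1]))  # reuse caller's return
            i += 2                              # drop the RETURN too
        else:
            out.append(tokens[i])
            i += 1
    return out

body = [("VAR_LOAD", 0), ("CALL", "do_zero"), ("RETURN",)]
print(tail_call_optimize(body))
# [('VAR_LOAD', 0), ('JUMP', 'do_zero')]
```
That is how the ELSE branch above ends in a bare JUMP proc_lbl_DO_ZERO: the EXEC at the end of the procedure was a CALL followed by the procedure's RETURN.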
I think that a good test for a register VM like yours would be to see how many bytes my editor takes; the current editor compiles to 2267 bytes, including all string constants.
Thanks for your comments, and Have Fun!