Announce: Acheron VM

Miles J. · Post by **Miles J.** » Sat Jul 28, 2012 9:07 pm

@White Flame
Just one thing:
I had one problem with your solution: your routine 'convertString' messes with the string pointer and expects the caller to use the returned changed string address. TBH that was not what I intended. The idea was to put a string at an address given by the caller (so you can add strings together e.g. for generating a disassembly output: MOVE #<string here>, R0). However, rewriting the convertstring routine might be a bit boring. So I thought about the following:
In 6502 assembly a routine for writing a value as hex nibbles would usually look like

Code:

		PHA
		LSR
		LSR
		LSR
		LSR
		ORA	#'0'
		CMP	#'9' + 1
		BCC	?0
		ADC	#'a' - '9' - 2
?0:	STA	...
		PLA
		AND	#$f
		ORA	#'0'
		...

Theoretically, this should be a good example for using the rP. I've already written a routine for the RISC processor that will print a 16-bit value as '$01ab'. But to make it more complex, may I suggest the following: instead of passing a file handle as a parameter, we now pass a pointer to a device object. This device object has a pointer at offset $12 that contains the address of a method 'outstring', which must be called to write the string to the device. (This is something beyond the normal capabilities of the 6502.) Another method pointer at offset $14 contains the address of the routine for writing a character (like COUT on the C64). The API looks like this:
1) convertstring
   in:  register 1: value
        register 2: pointer to string
   out: register 2: new pointer to string (pointing to the end of the string: '\0')
2) outhex16
   in:  register 1: value
        register 2: pointer to device object
   out: register 3: errorcode

If you like you can implement method 'outstring':
3) outstring
   in:  register 1: pointer to string
        register 2: pointer to device object
   out: register 3: errorcode
4) outchar
   in:  register 1: character
        register 2: pointer to device object
   out: register 3: errorcode
How you implement outchar is up to you and not part of the exercise.
Please note: registers 1 and 2 may not be destroyed (except for routine 1). In addition to the errorcode a flag will indicate whether the operation was successful or not. For example, a 68000 will use the Z-flag for this, x86 and 6502 the C-flag.
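To make the calling convention concrete, here is a rough model in Python rather than in any of the instruction sets discussed in this thread. It is only a sketch under assumptions: memory is a flat byte array, method pointers are little-endian 16-bit values, routines are Python callables registered by address, and an errorcode of 0 with a True flag means success. Every name (`Machine`, `call_method`, `make_machine`, the addresses $0200, $0300, $1000, $1100) is made up for illustration.

```python
# Sketch only: a device object whose method pointers sit at offsets $12 and $14.
# The memory model, the routine table and every name here are assumptions.

OUTSTRING_OFFSET = 0x12  # pointer to the device's 'outstring' method
OUTCHAR_OFFSET = 0x14    # pointer to the device's 'outchar' method


class Machine:
    def __init__(self):
        self.mem = bytearray(0x10000)
        self.routines = {}   # address -> Python callable standing in for code
        self.output = []     # what the example device has "printed"

    def read16(self, addr):
        return self.mem[addr] | (self.mem[addr + 1] << 8)

    def write16(self, addr, value):
        self.mem[addr] = value & 0xFF
        self.mem[addr + 1] = (value >> 8) & 0xFF

    def call_method(self, device, offset, arg):
        """Fetch the method pointer from the device object and call through it."""
        target = self.read16(device + offset)
        return self.routines[target](self, arg, device)


def outchar(m, char, device):
    m.output.append(chr(char))
    return 0, True           # (errorcode, success flag)


def outstring(m, string_ptr, device):
    # Walk the zero-terminated string, sending each byte through outchar.
    while m.mem[string_ptr] != 0:
        err, ok = m.call_method(device, OUTCHAR_OFFSET, m.mem[string_ptr])
        if not ok:
            return err, False
        string_ptr += 1
    return 0, True


def outhex16(m, value, device, buffer=0x0200):
    # Render '$' plus four lowercase hex digits into a scratch buffer.
    text = "$%04x" % (value & 0xFFFF)
    for i, ch in enumerate(text):
        m.mem[buffer + i] = ord(ch)
    m.mem[buffer + len(text)] = 0
    return m.call_method(device, OUTSTRING_OFFSET, buffer)


def make_machine(device=0x0300):
    m = Machine()
    m.routines[0x1000] = outstring
    m.routines[0x1100] = outchar
    m.write16(device + OUTSTRING_OFFSET, 0x1000)
    m.write16(device + OUTCHAR_OFFSET, 0x1100)
    return m, device
```

The point of the indirection is that `outhex16` never names a concrete output routine; swapping the two pointers in the device object redirects all output without touching the caller.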
I'll give you the example in RISC code soon. All I can say so far is that convertstring + outhex16 + outstring take $5a (90) bytes. Should be easy for you to beat this.
Cheers
Miles

White Flame · Post by **White Flame** » Sat Jul 28, 2012 11:22 pm

Integer numeric rendering into decimal digits is almost always done in backwards order to a fixed-size temp buffer of size appropriate for the numeric type, then that rendered buffer pointer is sent to the final output. There might be algorithms to do it forward, but I'm not familiar with any of them, and I've dug into a fair number of standard libraries. If you're just doing byte-aligned output in hex, though, that's another story, and is better suited for native 8-bit code than bothering with VM overhead.
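The backwards-into-a-temp-buffer approach can be sketched as follows. This is a Python illustration, not code from any of the libraries mentioned; the 6-byte buffer size and the names `render_u16` and `as_text` are assumptions chosen for a 16-bit unsigned value.

```python
# Sketch only: render an unsigned 16-bit value into a fixed 6-byte buffer,
# filling it from the end. Buffer layout and names are assumptions.

BUFFER_SIZE = 6  # 5 digits for 65535 plus the terminating zero


def render_u16(value, buf):
    """Write the decimal digits of value backwards into buf.

    Returns the index of the first digit, i.e. the "pointer" to the string.
    """
    value &= 0xFFFF
    pos = BUFFER_SIZE - 1
    buf[pos] = 0                      # terminator sits in the last byte
    while True:
        value, digit = divmod(value, 10)
        pos -= 1
        buf[pos] = ord("0") + digit   # least significant digit is written first
        if value == 0:                # also handles the lone "0" case
            return pos


def as_text(buf, start):
    """Read the zero-terminated string that starts at index start."""
    end = buf.index(0, start)
    return buf[start:end].decode("ascii")
```

Because the digits come out least significant first, leading zeros are never produced, and the returned start index is what gets handed to the final output routine.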

Barring the output object dispatch, though, is the rest that indicative of language power? How about a malloc next? Linked list manipulations? A 16-bit sort algorithm? Fixed-point math? All of these environments have means to call or inline native routines, so doing what native code already does well doesn't seem like it shows much. However, you should write what I did in your platform for comparison: 16-bit unsigned decimal rendering with no leading zeros (unless the number is zero itself), returning a pointer to the string.

I do have good support for function lookup tables, specifically because I also recognized it as something hard in native code, and it takes advantage of lda (zp),y in the implementation instead of relying on regular VM instructions to add indexes slowly:

Code:

  ldmi r0, r1, $14  ; r0 = memory(r1 + $14)
  callp             ; call subroutine at rP

Regarding rendering hex bytes, I've got nybble swap, but my CMP equivalents aren't quite fleshed out yet. A table-based approach is easier for now, but I can jimmy up an add-based version based on just current instructions:

Code:

  byte = r0
  tmp = r1
  
  copy tmp, byte       ; tmp = byte
  nswap                ; nybble swap, to work with high nybble first
  andp #$0f
  decloop tmp, 10, :+  ; decrement by 10, branch if it didn't underflow, which would normally continue the loop
   subp 'a' - '0' - 10 ; the number was less than 10: base it off '0' rather than 'a', and add back the 10 that decloop took out
: addp 'a'             ; 10..15 already had the 10 removed by decloop, so only 'a' is added
  ...call output...

  with byte            ; low nybble, same thing
  andp #$0f
  ...

This could be collapsed into a subroutine call per nybble, of course. Or I could hop into native code (or create an instruction) to convert the low byte of a reg into a 2-byte ASCII word filling the reg. Then it'd be appropriate for 16-bit writes from VM code, which seems like a much better idea than working with 8-bit value operations at this layer.
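The add-based idea (subtract 10, then correct on underflow, with one shared final add) can be checked quickly outside the VM. The following is a Python sketch, not Acheron code: the constants are worked out for this sketch so that both paths end in the same add of 'a', and the names `hex_digit` and `hex_byte` are assumptions.

```python
# Sketch only: hex digits without a compare, modelled on the
# "subtract 10 and see whether it underflowed" idea. Names are assumptions.

def hex_digit(nibble):
    nibble &= 0x0F
    value = nibble - 10               # the decrement-by-10 step
    if value < 0:                     # underflow: the nibble was 0..9
        # rebase from 'a' to '0' and restore the 10 that was taken out
        value -= ord("a") - ord("0") - 10
    return chr(value + ord("a"))      # final add shared by both paths


def hex_byte(byte):
    swapped = ((byte << 4) | (byte >> 4)) & 0xFF    # nybble swap
    return hex_digit(swapped) + hex_digit(byte)     # high nybble first
```

For a nibble of 3 the underflow path gives 3 - 10 - 39 + 97 = 51, which is '3'; for 12 the other path gives 12 - 10 + 97 = 99, which is 'c'.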

I'll work on finishing this, but two things:
- If convertString returns a pointer to the terminating zero, that's not a pointer to the string anymore.
- "Please note: registers 1 and 2 may not be destroyed (except for routine 1)." is kind of funny given the discussion about stack based systems, which always destroy their arguments.

In all the languages' low-level calling semantics that I know, parameter values are generally free to be overwritten unless specifically exempted (if that's even possible), and return values often share the same location as input parameters. It really doesn't matter here as there isn't much register pressure, though.

White Flame · Post by **White Flame** » Sun Jul 29, 2012 5:03 am

I just added something I was very dumb for not including in the first place: Register stack markers.

Old way:

Code:

grow 3
...
rets 3 ; return + shrink 3

New way:

Code:

mgrow 3  ; mark + grow 3
...
retm     ; return subroutine + return register stack to marker

The markers go onto the CPU stack, so if you use them in a subroutine it will take 3 bytes (return address + rstack marker) instead of 2, but it seems the right place for it. This is faster and shorter than doing math during return/shrink, and saves me the 16 opcodes of the 1-byte parameter-embedded rets instruction. "Well, Duh!" says I.

Growing & shrinking the rstack arbitrarily without marking is still in there, but I replaced the 1-byte embedded grow4 with mgrow4 instead of leaving both in. The non-marker grow is now always a 2-byte instruction with a signed 8-bit parameter. shrinkm is also included to just pop back to a marker.
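As a rough model of why markers make the return path cheaper, here is a Python sketch. It is not the Acheron implementation: the downward growth direction, the starting pointer $80, the 2-byte register size and the class and method layout are all assumptions. What it takes from the description above is that the marker is a single byte pushed on the CPU stack next to the 2-byte return address, and that returning restores the pointer instead of doing arithmetic.

```python
# Sketch only: register-stack markers kept on the CPU stack.
# Growth direction, 2-byte registers and all names are assumptions.

REG_SIZE = 2  # bytes per 16-bit register


class VM:
    def __init__(self, rsp=0x80):
        self.rsp = rsp          # register-stack pointer (one byte of state)
        self.pc = 0
        self.cpu_stack = []     # models the 6502 stack, one entry per byte

    def call(self, target):
        ret = self.pc
        self.cpu_stack.append(ret >> 8)      # 2-byte return address
        self.cpu_stack.append(ret & 0xFF)
        self.pc = target

    def mgrow(self, count):
        self.cpu_stack.append(self.rsp)      # 1-byte marker
        self.rsp -= count * REG_SIZE         # open up `count` fresh registers

    def shrinkm(self):
        self.rsp = self.cpu_stack.pop()      # pop back to the marker

    def retm(self):
        self.shrinkm()                       # restore the register stack...
        lo = self.cpu_stack.pop()            # ...then return as usual
        hi = self.cpu_stack.pop()
        self.pc = (hi << 8) | lo
```

Note that `retm` carries no count at all, which is what frees the opcodes that the parameter-embedded `rets` used to occupy.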

Alienthe · Post by **Alienthe** » Mon Jul 30, 2012 5:26 pm

This looks very interesting.

White Flame wrote:

Goals

Significantly increase code density over native code for complex data-oriented operations
Achieve better speed than other VMs/interpreters (threaded Forth, Sweet16, various BASICs, etc)
Good compiler target for high level languages
Collect a contributed stable of custom instructions and modifications to the VM

Have you tested code densities yet? I would be interested in how it compares with SWEET16.

Quote:

I'd really appreciate feedback on any aspect you'd care to comment about, and it could use others' testing.

I'm going to move it to a publicly hosted VCS at some point, but for preview it's currently hidden away on my personal site.

Some documentation refers to rD, rA and others. Are these real registers that point into the 16 entry register file or just a shorthand notation? Also if these are real registers, how are they updated? Using with?

RCA 1802 used registers that pointed into its 16 entry register file, sadly the accumulator was treated separately.

Sliding register file was used earlier, in SPARC I believe but was in the end not as good an idea as first imagined. How are you overcoming the problems they faced?

BigEd · Post by **BigEd** » Mon Jul 30, 2012 5:34 pm

Alienthe wrote:

This looks very interesting.

(Agreed! I'm watching with interest but haven't had a chance to digest.)

Arlet · Post by **Arlet** » Tue Jul 31, 2012 3:20 pm

Alienthe wrote:

Sliding register file was used earlier, in SPARC I believe but was in the end not as good an idea as first imagined. How are you overcoming the problems they faced?

In SPARC, the register file is limited by hardware, so only a limited number of register window shifts was possible. In arbitrary code, it would be tricky to keep track of the number of free register slots. In a VM, this would be easy to trap, and then allocate some extra memory to extend the register file.
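A software version of that trap might look like the following Python sketch. It only illustrates the idea of checking on every window slide and growing the backing store when it runs out; the window size of 16, the growth step and the names `RegisterFile` and `slide` are assumptions, and it says nothing about how Acheron actually lays out its registers.

```python
# Sketch only: a sliding register window whose backing store grows on demand.
# The initial size, growth step and names are assumptions.

class RegisterFile:
    def __init__(self, initial=16, step=16):
        self.regs = [0] * initial
        self.base = 0          # index of the current window's r0
        self.step = step
        self.traps = 0         # how many times the overflow check fired

    def slide(self, count, window=16):
        """Move the window by `count` registers, extending storage if needed."""
        new_base = self.base + count
        if new_base < 0:
            raise ValueError("register window underflow")
        while new_base + window > len(self.regs):
            self.traps += 1                     # the software "trap"
            self.regs.extend([0] * self.step)   # allocate more register space
        self.base = new_base

    def __getitem__(self, n):
        return self.regs[self.base + n]

    def __setitem__(self, n, value):
        self.regs[self.base + n] = value & 0xFFFF
```

Sliding by 3 from a full 16-entry file fires the check once and grows the store to 32 entries; sliding back leaves the caller's registers where they were.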

White Flame · Post by **White Flame** » Fri Aug 03, 2012 7:49 am

I haven't been online for a bit, but I did get the code up on github, with a few more changes in the code & docs:
project: https://github.com/AcheronVM/acheronvm
docs: http://acheronvm.github.com/acheronvm/

Alienthe wrote:

This looks very interesting.
Have you tested code densities yet? I would be interested in how it compares with SWEET16.

Only the basics. Not having to copy in & out of an accumulator for multi-parameter calculations seems to be a benefit, and because the "prior" register acts as an accumulator when needed, SW16's density advantages there are included as well. A lot more instructions and composite equivalents can and should still be written to increase density (and speed) further. I do not have post/pre-inc/dec on memory ops yet, only standalone inc/dec and inc2/dec2 instructions on the prior address.

Both single byte ((4-bit-opcode << 4) + 4-bit-param) and double byte (8-bit-opcode, 4-bit-param) encodings are supported (among many other multi-param encodings), to balance the tradeoff between opcode space and instruction size & frequency.
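The two basic forms can be written out as a small packing sketch in Python. The single-byte layout follows the formula given above; how the second byte of the double-byte form is laid out is not stated here, so placing the 4-bit parameter in its low bits is an assumption, as are all function names.

```python
# Sketch only: packing the two instruction forms described above.
# How the second byte is laid out, and all names, are assumptions.

def encode_short(opcode4, param4):
    """One byte: (4-bit opcode << 4) + 4-bit parameter."""
    if not (0 <= opcode4 <= 0xF and 0 <= param4 <= 0xF):
        raise ValueError("both fields must fit in 4 bits")
    return bytes([(opcode4 << 4) + param4])


def decode_short(code):
    return code[0] >> 4, code[0] & 0x0F


def encode_long(opcode8, param4):
    """Two bytes: 8-bit opcode, then a byte carrying the 4-bit parameter."""
    if not (0 <= opcode8 <= 0xFF and 0 <= param4 <= 0xF):
        raise ValueError("opcode must fit in 8 bits, parameter in 4")
    return bytes([opcode8, param4])


def decode_long(code):
    return code[0], code[1] & 0x0F
```

The tradeoff is visible in the sizes: the short form uses up 16 of the 256 first-byte values per opcode but costs one byte per use, while the long form costs two bytes per use and leaves the first byte free to name the operation.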

Quote:

Some documentation refers to rD, rA and others. Are these real registers that point into the 16 entry register file or just a shorthand notation? Also if these are real registers, how are they updated? Using with?

The generated instruction set doc has the register legend up top. They are shorthand. "copy rD, rA" just means you can do "copy r3, r9" or whatever with all 16 regs. I don't think the docs say this explicitly, though, so I'll update it.

Quote:

RCA 1802 used registers that pointed into its 16 entry register file, sadly the accumulator was treated separately.

Yes, I tried working with something vaguely similar to its register file, having the "current data register" and "current address register", without the separate acc. Running in software, reducing both the number of instructions dispatched and the number of parameters decoded is important for performance, so using implied registers is a good thing, but that prior attempt got real clunky real fast. Acheron's notion of a single "prior" register I think balances the tradeoffs better, though it can lead to more implied addressing modes for the same basic instruction.

Quote:

Sliding register file was used earlier, in SPARC I believe but was in the end not as good an idea as first imagined. How are you overcoming the problems they faced?

Like Arlet said, it was very fixed. It only slid at 8-register increments, and had the first 8 registers as normal non-windowed static registers. I do better on the former, but completely discarded the latter to avoid special cases in register dereferencing.

The "global variables" page seeks to replace static registers, but hasn't been fleshed out yet. Spillover for large numbers of parameters will depend on how it's used regarding memory allocation, though I recently realized that the 2-byte instruction form with the lone 4-bit parameter can technically address r0-r127, which I will take advantage of in some way.

BigEd wrote:

(Agreed! I'm watching with interest but haven't had a chance to digest.)

The new github link has more introductory information in the docs. Hopefully that'll help digestion. Portions might still be a little underripe.

BigEd · Post by **BigEd** » Fri Oct 04, 2019 8:18 am

Here's a talk from VCF Midwest - I think this is our very own White Flame presenting:
AcheronVM: 16-bit code on the 6502, taken too far (48 mins, youtube)

Here's the video description:

Quote:

AcheronVM is a small, customizable 16-bit software CPU for the 6502. It has thrown out traditional models to pursue all 3 competing aspects of density, speed, and power solely from within the context of the 6502's tradeoffs. Notable features include a unique hybrid register model, try/catch/finally support, pointer-offset addressing modes, easy instruction set modifications, and a purely ca65 macro-based implementation. This talk spans its design, implementation, and use.

Here's the repository, announcing an imminent update to match the talk:
https://github.com/AcheronVM/acheronvm

GaBuZoMeu · Post by **GaBuZoMeu** » Fri Oct 04, 2019 1:08 pm

This VM is a brilliant piece of - hmm - artwork, I think, is the most fitting word.

Thank you for sharing this.

(And TY for the link BigEd.)

Chromatix · Post by **Chromatix** » Fri Oct 04, 2019 4:07 pm

Certainly there is an elegance to the design. Just a shame the github repo is so out of date.

I watched the talk and immediately noticed that the registers don't actually need to be 16 bits each, they're just named in 16-bit increments and the "mgrow" parameter is similarly scaled. At the bytecode level, registers appear to be addressed with byte granularity. So there's nothing fundamentally preventing an extension to support 24-bit addresses, 32-bit integers, fixed-point values of arbitrary size and shape, and/or 48- or 128-bit floating point. Which could be useful…

White Flame · Post by **White Flame** » Fri Oct 04, 2019 6:52 pm

I still haven't pushed my repo, as it is still very inconsistent between the docs and the code, and a lot of debugging needs to be done. Of course, I should simply do so and let people poke into it themselves, but I don't like uploading wrong stuff publicly.

But yeah, there's no limits to the possibilities of tweaking this model, and it's explicitly intended to be so. I've been exploring different ISAs, and having a build system with an easily editable ISA is something that I left in. Regarding register width, of course the primary constraint would be the size of zeropage allocated to the registers.

GaBuZoMeu · Post by **GaBuZoMeu** » Sat Oct 05, 2019 1:57 am

White Flame wrote:

I still haven't pushed my repo, as it is still very inconsistent between the docs and the code, and a lot of debugging needs to be done. Of course, I should simply do so and let people poke into it themselves, but I don't like uploading wrong stuff publicly.

Perhaps you could freshen up the page a little and add a big note "work in progress".

Roughly 20 years ago I did some investigations into Sweet-16. To get the execution times I triggered a timer (6522), then called Sw16, executed a single instruction, and returned. I noted that the empty run (JSR Sw16 / RTN) took 101 cycles. Others like ADD n took 108 cycles (the addition only, not counting the call and return). These times are much bigger than those you mentioned in your presentation. I'm sure that my timer was clocked with the system clock, so the values are clock cycles.

Do you have a cycle count for leaving Acheron and re-entering it? In other words: what would be the overhead for "inline assembly"?

Regards,
Arne

White Flame · Post by **White Flame** » Sat Oct 05, 2019 3:26 am

The cycle times that I posted for Forth and Sweet16 are purely for dispatching the next instruction, not including the instruction execution itself, and not including exiting/entering the environment. Basically, as you're running code in the language, that'd be the time spent between instruction implementations. And I guess it doesn't include the JMP to get back to the main dispatch loop either. With Forth, it also doesn't count DOCOL overhead for secondary words, so yeah actual round-trip cycle times would be much more.

Counting cycles from the source code:

28 cycles to switch from 6502->Acheron, including the 'jsr acheron'
19 cycles to switch from Acheron->6502 (plus the instruction dispatch, which is around 20 cycles depending on which dispatcher is used)

So I'd guess about 70 cycles round-trip. I play it safe in these transitions in terms of storing state, so I didn't think about it being ultra-fast. If you're mode switching, it should be that you're doing a fair amount of work in the other mode that it's worth it.
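The "about 70" figure is just the three quoted numbers added together: 19 + 28 + roughly 20 = 67. The Python sketch below does that sum and, as an illustration only, shows what share of an excursion into native code is pure mode switching for a given amount of native work. The constant and function names are made up, and the dispatch cost is the approximate value quoted above.

```python
# Sketch only: adding up the quoted cycle counts for one native-code excursion.
ENTER_VM = 28      # 6502 -> Acheron, including the 'jsr acheron'
LEAVE_VM = 19      # Acheron -> 6502
DISPATCH = 20      # approximate; depends on which dispatcher is used


def round_trip_overhead(dispatch=DISPATCH):
    """Cycles spent leaving the VM, coming back, and dispatching once."""
    return LEAVE_VM + ENTER_VM + dispatch


def overhead_share(native_cycles, dispatch=DISPATCH):
    """Fraction of an excursion's total time that is pure mode switching."""
    overhead = round_trip_overhead(dispatch)
    return overhead / (overhead + native_cycles)
```

With these numbers, a native routine that itself runs for 67 cycles spends half of the excursion on switching, which is the reason for only switching modes when there is a fair amount of work to do on the other side.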

Certainly a faster mode switch is possible, but it'd have to be more "dangerous" in terms of the called code's handling of registers.

GaBuZoMeu · Post by **GaBuZoMeu** » Sat Oct 05, 2019 4:33 am

Ah OK. That explains the somewhat huge difference.

The housekeeping (virtual PC and flags) takes its time.

The transition from Acheron to 6502 and back that I was asking about is the penalty one has to pay (or consider) when falling back to native code for speed reasons and being too lazy to add the appropriate word.

I remember that I was missing Boolean operations with Sweet-16. But the round-trip time was so huge that I wasn't confident in the results at all.

White Flame · Post by **White Flame** » Sat Oct 05, 2019 11:44 am

Yep, so you can see why it's advantageous to adjust the instruction set itself. ~20 cycles to dispatch to custom instructions, each use only takes a byte (plus params), and you can remove existing instructions if you're not using them. Then you can consider what operations are often used and what's seldom used, and figure out your balance per project.

While the 7-bit 'with' dispatcher supports 128 instructions, you can dispatch on a full 256 as well, but you'd have to deal with the 'with' operation separately.
