PostPosted: Mon Dec 21, 2020 1:51 pm 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I am greatly enthused by projects such as White Flame's AcheronVM. This is exactly why I joined the 6502 forum: projects like this are being developed and I'd like to help. Unfortunately, White Flame has the extremely difficult task of working against an established competitor with a 35 year lead. Despite this, White Flame's work is vastly superior. It offers roughly 3x the performance and a vastly more flexible bytecode. For example, SWEET16 is adequate for skipping through common data structures but has zero support for logical operations.

AcheronVM is intended to magnify the 16 bit address-space of a 6502 and typically does so using less than 2KB of software. After reading that it is possible to write a 6502 Huffman compressor in 700 bytes, I wondered if it is possible to make a decompressor and executable page caching mechanism for native code which is smaller and faster than AcheronVM. The answer is yes. However, the break-even point is extremely high, and we have to pass through several levels of absurdity to reach it. Therefore, AcheronVM remains the preferred choice. Regardless, my investigation is a complementary technique which provides additional scope for decompression and may also be applied to other data types.

I explain the decompressor, the ABI, the memory model, cache contention and the packer. The primary intention is to implement a page caching mechanism which has less overhead than a byte dispatch virtual machine.

It appears that White Flame and I have arrived at similar conclusions independently. In particular, handling of bit-packed fields may indicate a sub-optimal design. Irregular packing is the worst. Nybble packing is better but often requires three or four shift operations. It is preferable to eliminate bit shifts entirely. Rather than attempt a 700 byte Huffman decompressor, I wondered if it was possible to implement a shorter, faster algorithm. The answer is yes. The result may not be optimal, due to nybble packing, but it is lightweight and easiest to describe from the decompression side. I assume something like this has been implemented before. Regardless, this technique is a placeholder to demonstrate that it requires less than 700 bytes and fewer clock cycles than virtual machine dispatch.

Compressed fragments of data are held in two pieces. A substitution table with a maximum size of 256 bytes is also required. A separate substitution table may be required for each data type. Excluding the substitution table and indexes to the packed data, the most optimistic decompression step converts 8 bytes of packed data into 16 bytes of unpacked data. The unpacking process may be part of a binary tree and therefore there may be multiple opportunities to apply decompression, with a best case of 2:1 expansion at each step. One of the pieces of data is the lower nybbles of byte pairs, packed two per byte. The other piece is the upper nybbles, run-length encoded. Some spans may require a dummy value to be encoded. Alternatively, the RLE may terminate early. Therefore, in the most optimistic case, no RLE data is required at all. This occurs when only 16 distinct values appear in the uncompressed data; the substitution table places no restriction on which 16 values those are. The primary application for this scheme is intended to be decompression of 6502 executables, which are known to be highly skewed toward common opcodes.
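
To make the inner loop concrete, here is a minimal sketch of the most optimistic case: no RLE data, so every upper nybble is zero and each 4-bit code indexes the substitution table directly. This assumes 65C02 opcodes and ca65-style syntax; src and dst are hypothetical zero-page pointers, table is a hypothetical substitution table and LEN is the number of packed bytes. Merging a non-zero RLE stream would OR the upper nybbles back in ahead of the table lookup without disturbing the structure.

Code:
            ldy #0              ; index into the packed source
    loop:   lda (src),y         ; one packed byte holds two 4-bit codes
            pha                 ; keep a copy for the low nybble
            lsr a               ; shift the upper code down
            lsr a
            lsr a
            lsr a
            phy
            tay
            lda table,y         ; substitute the first output byte
            ply
            sta (dst)           ; 65C02 (zp) addressing
            inc dst             ; low byte only: payload never crosses a page
            pla
            and #$0F            ; low nybble is the second code
            phy
            tay
            lda table,y         ; substitute the second output byte
            ply
            sta (dst)
            inc dst
            iny
            cpy #LEN            ; LEN packed bytes in this payload
            bne loop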

In the most optimistic case, a 6502 opcode may be packed into one nybble. A similarly skewed distribution is expected with AcheronVM bytecode, strings, 8 bit PCM audio and similar. Data may be packed such that unpacking never crosses page boundaries. Optimal packing is a knapsack problem and therefore may be solved on an Apple II in Integer BASIC. The laziest way to pack payloads is to sort them by descending size and infill available gaps. This technique is unlikely to waste more than one page.
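
For the lazy packing, a first-fit decreasing loop is small enough to run on the target machine itself. This is a hedged sketch in ca65-style 65C02 syntax with hypothetical arrays: sizes holds up to 255 payload lengths already sorted descending (each under 256 bytes; a full-page payload simply receives its own page), used holds the bytes consumed in each of PAGES cache pages (initialized to zero) and slot receives the page assigned to each payload.

Code:
            ldy #0              ; payload index
    next:   ldx #0              ; candidate page index
    try:    clc
            lda used,x
            adc sizes,y
            bcs skip            ; would cross the page boundary
            sta used,x          ; commit: payload fits here
            txa
            sta slot,y          ; record the assignment
            bra placed
    skip:   inx
            cpx #PAGES
            bne try
            brk                 ; no page has room: handle as an error
    placed: iny
            cpy #N
            bne next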

I like to write software in a stodgy, uninteresting style without tricksy elements. In particular, I like to minimize action-at-a-distance and therefore I typically write callee-saves code and validate immediately before use. This leads to some of the slowest but most reliable software. Unfortunately, when subroutines are typically prefixed with validation, it incurs particular overhead to decompress a subroutine in full, perform validation and then terminate early. It would be greatly appreciated if a subroutine could be decompressed with, for example, a 64:192 byte split. This would allow small subroutines to be decompressed in full without incurring repeated overhead, while also allowing large subroutines to be speculatively decompressed. Decompressed subroutines only begin on page boundaries. This initially limits subroutine size to 256 bytes. However, it allows the source and destination of the decompressor to ignore page boundaries.

For native code, the first pass of decompression applies to a 64 byte payload. The decompressor fills the subsequent 192 bytes with a repeating 3 byte sequence of JSR abs. This provides a transparent mechanism to decompress the remainder of the page with a second pass. If one of these JSR abs instructions is executed, the decompressor determines the page to complete, performs the second pass of decompression, nudges the return address back by 3 bytes and returns. Well, I try not to write tricksy software. This arrangement also introduces the obscure constraint that forward branches which cross phases of decompression must be aligned on 3 byte boundaries.
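
A hedged sketch of that second-pass trap, in ca65-style 65C02 syntax. A JSR pushes the address of its own third byte, so pulling the return address both names the page to complete and locates the JSR slot; re-pushing it minus 3 makes the final RTS land on the freshly decompressed code at the start of that slot. The fill begins at offset 64, so subtracting 3 from the low byte never borrows.

Code:
    second_pass:
            pla                 ; low byte of (JSR address + 2)
            sec
            sbc #3              ; retarget RTS at the JSR itself
            tax                 ; stash the adjusted low byte
            pla                 ; high byte names the page to finish
            ; ... second-pass decompression of page A goes here,
            ; preserving A and X ...
            pha                 ; re-push the high byte
            phx                 ; re-push the adjusted low byte
            rts                 ; resumes on freshly unpacked code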

For a virtual machine, alignment constraints may not apply. In this case, the first pass of decompression sets the remainder of the page to a reserved value, such as zero. Any attempt to execute the reserved opcode incurs decompression before execution resumes at the same address.
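
For example, a byte dispatch loop might reserve bytecode zero for this purpose. The fragment below is a hypothetical sketch in ca65-style 65C02 syntax: vpc is an assumed zero-page virtual program counter, optable an assumed table of up to 128 handler vectors and unpack_rest an assumed routine which finishes the current page.

Code:
    dispatch:
            lda (vpc)           ; fetch the next bytecode (65C02)
            bne run             ; nonzero: a real instruction
            jsr unpack_rest     ; finish decompressing this page...
            bra dispatch        ; ...and retry at the same address
    run:    asl a               ; two bytes per handler vector
            tax
            jmp (optable,x)     ; 65C02 indexed indirect jump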

I initially thought that the decompressor would be called manually. For example, every invocation of JMP and JSR would be a macro. Furthermore, I assumed that flags and one or more registers would be destroyed between subroutines. This is in addition to a subroutine limit of 256 bytes and a maximum of 256 subroutines. This is all optional because it is possible to increasingly generalize the model. In particular, decompression may be completely transparent to an application which does not use indirection. I begin with the small model, with macros and registers destroyed. Then I work outwards to something less masochistic.

In the trivial case, JSR is replaced with something akin to LDX imm // JSR abs where the target address is the decompressor. (JMP is similarly substituted.) The decompressor unpacks the page specified in Register X (partially or in full) then jumps to the beginning of that page. This destroys two (or more) registers between subroutines. It is possible to preserve registers with PHP // PHX // LDX imm // JSR abs and some further care in the decompressor. At this point, the compression system is fairly transparent, with the exception that subroutines are restricted to 256 bytes which, unfortunately, includes the 7 byte JSR sequences and minor constraints on some forward branches. It is possible to make register preservation optional. It is also possible to hide the page number after the JSR. This is significant because it reduces a JSR to a 4 byte sequence. It is also possible to reduce JMP by performing JSR to a dummy routine. For very frequently invoked subroutines, the JSR/JMP overhead can be reduced further by maintaining trampolines outside of the executable page cache.
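
As a hedged sketch, the small model call sites might be wrapped in ca65 macros like these, where decomp and decomp_p are hypothetical entry points into the decompressor; decomp_p is assumed to juggle the stacked flags and X back into place before entering the target page.

Code:
    .macro  XJSR page_id        ; trashes X and the flags
            ldx #page_id        ; which compressed subroutine
            jsr decomp          ; unpack if absent, then call it
    .endmacro

    .macro  XJMP page_id        ; tail call: no return address
            ldx #page_id
            jmp decomp
    .endmacro

    .macro  XJSR_P page_id      ; register-preserving, 7 bytes
            php
            phx
            ldx #page_id
            jsr decomp_p        ; restores X and the flags itself
    .endmacro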

It should be obvious that an application begins with LDX #0 // JMP abs for the purpose of decompressing and executing the first page of the application. It should also be obvious that it is possible to have 2^16 subroutines (or more) running in a 16 bit address-space. Obviously, this requires an external paging mechanism, such as retrieval of 4KB pages from disk or network. This leads to the curious situation where we may have a transparent calling mechanism, but failure to load a page may be indicated with an error code in the accumulator which was never placed there by the application. It just gets there by magic.

After getting this far, I had no idea how to implement the caching mechanism. I was concerned that my idea was infeasible or that I would lose much of the gains. I initially considered a dynamic caching mechanism, for example, LRU cache eviction or my stochastic extension thereof. Given the atypical uses of the TXS opcode found in the wild, it is not possible to reference count. Anyhow, it is preferable not to incur overhead upon entry and exit of every subroutine. So, how is it possible to implement an executable page cache which only intercepts JSR and JMP (plus minor constraints on branching)? Given these conditions, it may be preferable to perform static analysis on the call graph. There will always be pathological cases where the decompressor is frequently invoked. However, all of these cases should incur less overhead than a byte dispatch virtual machine. This is particularly true if decompression occurs in two or more phases with the aid of a tricksy continuation.

From the vantage point of any leaf subroutine, we only require the ancestor subroutines to be fully or partially decompressed. These can be numbered statically such that the bottom bits are unique. This also provides a mechanism to scale the executable page cache to available memory. For example, an application which requires a minimum of 16 pages also works with 32 pages. However, said application may or may not run faster with the larger cache. This static allocation system also alleviates the subroutine size limit. It is now possible for a contiguous allocation to span multiple pages. It is also possible for a contiguous allocation to be selectively decompressed using a continuation which does not cross a page boundary. Obviously, native subroutines cannot exceed the size of the executable page cache. Regardless, it is practical to have multiple subroutines which exceed 1KB and use the executable page cache concurrently. This scheme also handles self recursion and mutual recursion.
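
Mapping a static subroutine number to a cache page is then one mask and one add. A minimal sketch, assuming ca65-style 65C02 syntax, a power-of-two CACHE_PAGES, a page-aligned hypothetical CACHE_BASE, a hypothetical tags array for residency checks and a hypothetical unpack routine which decompresses through the zero-page pointer dst:

Code:
            txa                 ; X = static subroutine number
            and #CACHE_PAGES-1  ; bottom bits select the slot
            tay                 ; Y = slot index
            clc
            adc #>CACHE_BASE
            sta dst+1           ; decompressor writes to this page
            stz dst             ; page-aligned destination (65C02 STZ)
            txa
            cmp tags,y          ; is this subroutine resident?
            beq hit             ; yes: jump straight in
            sta tags,y          ; no: claim the slot...
            jsr unpack          ; ...and decompress into (dst)
    hit:    jmp (dst)           ; enter at the page boundary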

Unfortunately, for native code, it is probably not worthwhile to compress an application which is less than 32KB. Even then, it is a borderline case. It may be preferable to only compress a native application which exceeds 64KB. At that point, desperate measures are required and these typically incur a re-write, a virtual machine or program banking. If performance is not critical, transparent compression may incur the least effort and the least degradation. There are smaller cases of merit. For example, it may be possible to supply more than 64KB of program and data in a 16KB ROM. I believe this format was popular on Acorn computers. However, an application such as EhBASIC is only about 11KB and allocation of an executable page cache would be detrimental to its primary function as an interpreter.

It is a different matter for AcheronVM because the virtual machine plus decompressor may be smaller than the virtual machine alone. Therefore, savings may be immediate. It is also possible to use the same decompressor on data, bytecode and infrequently executed native code. AcheronVM bytecode is typically intermixed with native code and it is typical to link AcheronVM as a subroutine. It is therefore possible to selectively unpack some or all of AcheronVM as required and to do so in a manner which takes no precedence over any other part of an application.

Finally, we get to my favorite part: the packer. The preferred embodiment of the packer is an assembler option. In addition to packing manually written assembly, it also packs output from suitable compilers. This allows applications written in high level languages to exceed the 64KB executable limit and run on stock, vintage hardware without bank switching. It incurs a modest overhead which is expected to be less than that of bytecode or bank switching. This technique is also compatible with the leading bytecode interpreters. Specifically, it is possible to intermix any proportion of native code and bytecode while vastly exceeding the 64KB executable limit.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

