6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 11:05 pm

All times are UTC




Post new topic Reply to topic  [ 171 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7, 8, 9 ... 12  Next
Author Message
PostPosted: Sun Sep 01, 2019 1:49 pm 
Offline
User avatar

Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi qus,

I'm aware that I'm entering a bit late to this thread, and I may not be fully qualified to discuss on it, particularly because I have no practical experience with the 6502 architecture and instruction set. However, I was recently able to implement a LLVM compiler backend for a custom architecture, and maybe I am able to point out something that might be useful. Please forget this if you have already considered it or if it's an idea that doesn't apply.

If I understand it correctly, you are basing code generation on the physical existence of a Data Stack that you use to store temporary values and perform operations, in a Forth like fashion. So for example to perform an addition, you push the first operand to the stack, then the second operand, and then perform the add operation, which result is available on top of the stack (Please correct me if this is not the case)

For what I have learned from the LLVM compiler, I think that modern compilers use a different approach. Compilers will build a graph based on an Intermediate Representation of the source code in which each node is in "Static Single Assignment" form or SSA. The idea is that the compiler initially assumes an infinite number of virtual registers, where each variable or intermediate result is assigned exactly once to a virtual register, and every variable is always defined before it is used. So for example the expression
Code:
x = a+b*c
will be represented in SSA form like this (assuming all are global variables):
Code:

vr0 = (globaladdressof x)
vr1 = (globaladdressof a)
vr2 = load vr1
vr3 = (globaladdressof b)
vr4 = load vr3
vr5 = (globaladdressof c)
vr6 = load vr5
vr7 = vr4 * vr6
vr8 = vr2 + vr7
(void) = store vr8 -> vr0
To generate that in the proper order, a compile-time stack can be used to help with operator precedence, and in fact this output almost resembles a stack based machine, but the point is that the stack does not explicitly exist in the machine. Since there is no dependence of a physical stack and there's infinite number of virtual registers to play with, array indexing or struct member access is not much harder than simple arithmetic operations, as the former can be turned into the later. In fact everything except jumps and calls can be turned into sequences of arithmetic or logical operations. For example, the code you proposed:
Code:
objectArrayInstance[4].someField = 69
can be turned into the following SSA sequence:
Code:
vr0 = (globaladdressof objectArrayInstance) 
vr1 = (constantsizeof objectArrayInstance)
vr2 = 4
vr3 = vr1 * vr2
vr4 = vr0 + vr3
vr5 = (constantoffsetof somefield)
vr6 = vr4 + vr5
vr7 = 69
(void) = store vr7 -> vr6

The last SSA instruction, the store, does not create any new virtual register because it does not generate any new result, but of course both the destination address and the stored value are still available in vr6 and vr7. The SSA instruction sequences are preferably represented as a linked graph in the compiler, but some compilers such as the gcc I think that just keep them plainly linearly in arrays, which should definitely be easier to debug, although (according to the llvm compiler developers) this can make some optimisations harder and time consuming.

After the SSA sequence (or graph) is generated, it is optimised. This essentially searches for repeated patterns and opportunities for register reuse and elimination. In the case above, clearly, vr1 and vr2 registers are no longer used after the third instruction, so any of the registers used later can be replaced by them and so on. This step can also apply some target specific optimisations or insert/replace IR instructions by convenient 'pseudo' instructions that will be easily converted later into physical ones. Compilers are kind of 'black art' so I suppose anything that can improve final code generation is allowed anywhere.

Next, physical register allocation begins. This essentially means replacing the remaining virtual registers by existing physical registers, and creating spills for the ones that are not available. Since the 6502 does not have by any means the number of physical registers required by such kind of compiled code, my approach would be to reserve a sufficiently large number of memory slots in page zero and use that memory as 'physical registers'. Instead of a Forth-like data stack, I would have just maybe 8 or 16 slots, or even 32 slots in memory, that would be treated as actual physical registers during compiler code generation. The more 'physical registers' the better, so I if you have already allocated some fixed memory for the data stack, then the same memory can be used for as many registers as would fit in there (!)

The last step is converting the optimized internal instruction representation into actual 6502 instructions. This offers additional optimisation opportunities, that in this case are purely target specific (for the 6502 processor). For example, several intermediate instructions in a row may get folded into simpler native 6502 sequences, such as load/store pairs or constant stores, etc.

As said, the above is what I have learned from my recent work on a LLVM backend, and I'm unsure if any of this would be useful or applicable for a 6502 compiler. Anyway, I hope this was an interesting reading.

Joan


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 01, 2019 7:14 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
Thanks for informative post! Oh, I wish someone created LLVM backend for 6502... I wouldn't have to write my compiler :D Btw - how hard was it? I took a look at introduction to LLVM backend docs, but it was like "War And Peace" of manuals! Also - a few implementations of backends for 6502 seem to be dead, which says a lot. (See: https://github.com/c64scene-ar/llvm-6502 - I couldn't make it work, maybe you have more luck?)

Anyway - Wolin behaves in a way similar to what you've described. It uses infinite number of registers internally and that will allow easy creation of graphs and optimization you've mentioned. But due to how 6502 was designed - all good addressing modes operate only on Zero Page (firtst 256 bytes of memory - in theory - in reality more than half of these addresses are used by OS) I need to stick to first 256 bytes to make things easy. So these infinite registers are simulated using software stack located in ZP. Plus - since poor 6502 has only 3 hardware registers, some people claim ZP was really designed to be used as flexible 256 registers... Who knows? Judging by how nice it works out in Wolin, I think they might be right! (See also here: https://lists.llvm.org/pipermail/llvm-d ... 95439.html )

So - because I can use "only" about 72 words or 155 bytes of "registetrs" (on Commodore 64) I need to allocate/deallocate them constantly every time I parse ANTLR graph. I can't really imagine code complicated enough to fill this stack.


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 01, 2019 11:15 pm 
Offline
User avatar

Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi qus, I celebrate that you found my post informative. You are doing a great work, and the approach you took at making your compiler from scratch is very reasonable specially considering the peculiarities of the 6502. Also, a limitation of LLVM is that the existing front end only covers c and c++, so unless we want to enter into the even more unattainable world of front end development, we get stuck on these two. I developed a backend for an invented but complete architecture and I am posting my work in the sibling forum anycpy.org http://anycpu.org/forum/viewtopic.php?f=23&t=583. In the last page/post there's the link to the compiler source code.

As far as I know, there are two separate "attempts" at making a LLVM backend for the 6502. The one that you linked from c64scene-ar is a joke. It's nothing else than an exact copy of the SPARC backend, with all the "Sparc" texts replaced by "Mos6502" and the files renamed. I don't really get why somebody would want to do so and make it public without any further work. It really is nothing more than that, so it has very little to to with the 6502.

The other one that I am aware can be found here https://github.com/beholdnec/llvm-m6502/tree/m6502-on-release_60/lib/Target/M6502. This one looks like a more serious attempt, which has some remarkable work done, but for what I can see it remains largely unfinished. I think that the 6502 instruction set is particularly difficult to fit in a modern compiler because such compilers are only happy with risc architectures and lots of registers. This does not imply that it couldn't be done thought, however I believe that making it really good would be very difficult. The link that you also posted is quite informative on what it can be reasonably done.

I think that if I had to implement a LLVM 6502 backend, I would define a set of native registers -physically sitting on the zero page-, adopt a small but complete set of convenient virtual instructions -not coinciding with the 6502 native ones, but using those zero page registers as operands- and make the compiler generate code with that. In a very late compile pass, translate the virtual instructions into native 6502 sequences, or subroutine calls. I think that an approach like that would work, but the compiler would generate rather inefficient code compared (possibly) with the hard coded compilers of the earlier generations.


Top
 Profile  
Reply with quote  
PostPosted: Mon Sep 02, 2019 12:16 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Yes, a big difficulty of the 6502 is that you can't actually load and operate on an address-sized value as a single entity. This means you have to interleave load-operate-store sequences for both halves, whereas a compiler thinks naturally of loading, then operating on, then storing the entire value. Modern CPUs, and even many of the 6502's contemporaries, don't have this difficulty.

The workarounds of treating the zero-page area as a large bank of registers, and/or emitting code for a virtual machine rather than directly for the 6502 itself, do make the problem more tractable if less optimised. But I think qus is using a virtual machine with many registers as an IR, and then generating real 6502 code from that.

The 65816 does have some features which should make conventional code generation easier than the 6502. For example, it has stack-relative addressing, as well as index and stack registers large enough to actually use as pointers, or address realistically-sized stack frames, in their own right. The Direct Page Register even allows extending the zero-page-as-registers idea to a SPARC-like "register window" structure, should you be so inclined, as it points to an arbitrary 256-byte section of the first 64K bank.


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 07, 2019 2:34 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
As an update - I'm overhauling pseudo-asm type system to support referencing/dereferencing of registers and variables (like & and * in C). While I could probably make it work without it, the pseudo asm / template syntax wout be inconsistent, as the same syntax would mean different things depending on context. I don't like that. So... it looks like this:

Code:
let SP(?dst)[ubyte*] = *?src[ubyte] -> """
    lda #<{src}
    sta {dst},x
    lda #>{src}
    sta {dst}+1,x
"""


meaning: set pointer to ubyte on stack dst to address of ubyte variable src.

Code:
let &SP(?dst)[ubyte*] = &SP(?src)[ubyte*] -> """
    lda (src,x)
    sta (dst,x)
"""


meaning: there's a pointer to ubyte on stack, please set the memory to which it points to contents of memory to which src pointer on stack points.

I guess that should fix all kinds of ambiguity, so back to derefrencing arrays, objects, arrays of objects, objects with arrays, arrays of objects with arrays, arrays of arrays of objects...


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 7:09 am 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
Update: ref/deref system is working pretty well, but it also causes generation of pretty twisted pseudo asm, like:

Code:
evalless SP(4)<__wolin_reg4>[bool] = &SP(2)<__wolin_reg5>[uword*], SP(0)<__wolin_reg6>[uword]


(imagine writing a code that checks if value REFERENCED by ZP register is less than value in ZP register - ugly and slow), so I decided it is high time to write optimizer that would get rid of such constructs. So my first step is looking for registers that get assigned values only once. All occurences of such register will be simply replaced by the value that was assigned.

So, a fragment of code that fills 1000 bytes of screen memory with something, namely "while(i<1000)":

Code:
alloc SP<__wolin_reg5>, #2 // LEFT for <
let SP(0)<__wolin_reg5>[uword*] = *__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword] // przez sprawdzacz typow - simple id from var
// switchType to:uword by type from pl.qus.wolin.i
// top type already set: __wolin_reg5: uword* = 0 (LEFT for <)
alloc SP<__wolin_reg6>, #2 // RIGHT for <
// switchType to:uword by parse literal constant
let SP(0)<__wolin_reg6>[uword] = #1000[uword] // atomic ex
// top type already set: __wolin_reg6: uword = 0 (RIGHT for <)
evalless SP(4)<__wolin_reg4>[bool] = &SP(2)<__wolin_reg5>[uword*], SP(0)<__wolin_reg6>[uword]
free SP<__wolin_reg6>, #2 // RIGHT for <
free SP<__wolin_reg5>, #2 // LEFT for <


reg5 is a candidate for removal obviously. It contains just a reference to "i" uword variable:

Code:
*__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]


Since here it is dereferenced by "&":

Code:
evalless SP(4)<__wolin_reg4>[bool] = &SP(2)<__wolin_reg5>[uword*], SP(0)<__wolin_reg6>[uword]


it's quite easy, we just drop the "*"

Code:
evalless SP(4)<__wolin_reg4>[bool] = __wolin_pl_qus_wolin_i<pl.qus.wolin.i>, SP(0)<__wolin_reg6>[uword]


and that just requires cmping memory location with ZP location! alloc/free for reg5 get dropped, stack vectors will get shifted as required by deducing register size. Tadam!


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 1:06 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
And here's the simple magic at work. Original pseudo-asm:

Code:
setup HEADER
setup SPF = 251[ubyte], 40959[uword] // call stack pointer at 251 = 40959
setup SP = 143[ubyte] // register stack top = 142
setup HEAP = 176[ubyte]
// inicjalizacja zmiennej pl.qus.wolin.i
alloc SP<__wolin_reg0>, #1 // for var pl.qus.wolin.i init expression
// switchType to:ubyte by parse literal constant
let SP(0)<__wolin_reg0>[ubyte] = #0[ubyte] // atomic ex
// top type already set: __wolin_reg0: ubyte = 0 (for var pl.qus.wolin.i init expression)
let __wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword] = SP(0)<__wolin_reg0>[ubyte] // podstawic wynik inicjalizacji expression do zmiennej pl.qus.wolin.i
free SP<__wolin_reg0>, #1 // for var pl.qus.wolin.i init expression
// inicjalizacja zmiennej pl.qus.wolin.znak
alloc SP<__wolin_reg1>, #1 // for var pl.qus.wolin.znak init expression
// switchType to:ubyte by parse literal constant
let SP(0)<__wolin_reg1>[ubyte] = #0[ubyte] // atomic ex
// top type already set: __wolin_reg1: ubyte = 0 (for var pl.qus.wolin.znak init expression)
let __wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte] = SP(0)<__wolin_reg1>[ubyte] // podstawic wynik inicjalizacji expression do zmiennej pl.qus.wolin.znak
free SP<__wolin_reg1>, #1 // for var pl.qus.wolin.znak init expression
//  main function entry
//  otwarcie stosu na wywolanie pl.qus.wolin.main
alloc SPF, #0
//  tu podajemy argumenty dla pl.qus.wolin.main
//  po argumentach dla pl.qus.wolin.main
call __wolin_pl_qus_wolin_main[adr]
free SPF <unit>, #2 // free return value of pl.qus.wolin.main from call stack
ret
// switchType to:unit* by function declaration

// ****************************************
// funkcja: fun pl.qus.wolin.main():unit
// ****************************************
label __wolin_pl_qus_wolin_main
alloc SP<__wolin_reg4>, #1 // for while condition
label __wolin_lab_loopStart_1
alloc SP<__wolin_reg5>, #2 // LEFT for <
let SP(0)<__wolin_reg5>[uword*] = *__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword] // przez sprawdzacz typow - simple id from var
// switchType to:uword by type from pl.qus.wolin.i
// top type already set: __wolin_reg5: uword* = 0 (LEFT for <)
alloc SP<__wolin_reg6>, #2 // RIGHT for <
// switchType to:uword by parse literal constant
let SP(0)<__wolin_reg6>[uword] = #1000[uword] // atomic ex
// top type already set: __wolin_reg6: uword = 0 (RIGHT for <)
evalless SP(4)<__wolin_reg4>[bool] = &SP(2)<__wolin_reg5>[uword*], SP(0)<__wolin_reg6>[uword]
free SP<__wolin_reg6>, #2 // RIGHT for <
free SP<__wolin_reg5>, #2 // LEFT for <
// top type already set: __wolin_reg4: bool = 0 (for while condition)
bne SP(0)<__wolin_reg4>[bool] = #1[bool], __wolin_lab_loopEnd_1<label_po_if>[adr]
//
// == ASSIGNMENT LEFT =======================================
alloc SP<__wolin_reg9>, #2 // ASSIGNMENT target
alloc SP<__wolin_reg10>, #2 // arr_deref
//  LEWA strona array access, czyli co to za zmienna
let SP(0)<__wolin_reg10>[ubyte*] = 1024[ubyte*] // simple id - fixed array var
// switchType to:ubyte* by type from pl.qus.wolin.ekran
//  PRAWA strona array access, czyli indeks w nawiasach
alloc SP<__wolin_reg11>, #2 // For calculating index
let SP(0)<__wolin_reg11>[uword*] = *__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword] // przez sprawdzacz typow - operator ++
add __wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword] = __wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword], #1[uword] // simple id
// switchType to:uword by ++ operator
// FORCE TOP: __wolin_reg11: uword* = 0 (For calculating index) -> uword*
add SP(2)<__wolin_reg10>[ubyte*] = SP(2)<__wolin_reg10>[ubyte*], &SP(0)<__wolin_reg11>[uword*] // long index, single byte per element array
free SP<__wolin_reg11>, #2 // For calculating index
// **ARRAY Changing current type to prevReg type __wolin_reg10: ubyte* = 0 (arr_deref)
//  after index
// dereference value at topRegister
//  kod obsługi tablicy
//  non-fast array, changing top reg to ptr
let SP(2)<__wolin_reg9>[ubyte*] = SP(0)<__wolin_reg10>[ubyte*] // przez sprawdzacz typow - non-fast array
free SP<__wolin_reg10>, #2 // arr_deref
// top type already set: __wolin_reg9: ubyte* = 0 (ASSIGNMENT target)
// == ASSIGNMENT RIGHT =======================================
alloc SP<__wolin_reg12>, #2 // ASSIGNMENT value
let SP(0)<__wolin_reg12>[ubyte*] = *__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte] // przez sprawdzacz typow - operator ++
add __wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte] = __wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte], #1[ubyte] // simple id
// switchType to:ubyte by ++ operator
let &SP(2)<__wolin_reg9>[ubyte*] = &SP(0)<__wolin_reg12>[ubyte*] // przez sprawdzacz typow - process assignment
free SP<__wolin_reg12>, #2 // ASSIGNMENT value, type = ubyte
free SP<__wolin_reg9>, #2 // ASSIGNMENT target
// == ASSIGNMENT END =======================================
//
// switchType to:unit* by assignment
// top type already set: __wolin_reg8: unit* = 0 (for expression)
goto __wolin_lab_loopStart_1[adr]
label __wolin_lab_loopEnd_1
free SP<__wolin_reg4>, #1 // for while condition
// top type already set: __wolin_reg3: unit* = 0 (for expression)
// caller ma obowiązek zwolnoć wartość zwrotną z SPF!!!
// return from function body
ret



// ****************************************
// LAMBDAS
// ****************************************


// ****************************************
// STATIC SPACE
// ****************************************
label __wolin_indirect_jsr
goto 65535[adr]
label __wolin_pl_qus_wolin_i
alloc 0[uword]  // pl.qus.wolin.i
label __wolin_pl_qus_wolin_znak
alloc 0[ubyte]  // pl.qus.wolin.znak


Optimized pseudo - asm (badly formatted, as it is just dump from ANTLR):

Code:
setupSPF=251[ubyte],40959[uword]
setupSP=143[ubyte]
setupHEAP=176[ubyte]
let__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=#0[ubyte]
let__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=#0[ubyte]
allocSPF,#0
call__wolin_pl_qus_wolin_main[adr]
freeSPF<unit>,#2
ret

label__wolin_pl_qus_wolin_main
allocSP<__wolin_reg4>,#1
label__wolin_lab_loopStart_1
evallessSP(4)<__wolin_reg4>[bool]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1000[uword]
bneSP(0)<__wolin_reg4>[bool]=#1[bool],__wolin_lab_loopEnd_1<label_po_if>[adr]
allocSP<__wolin_reg10>,#2
letSP(0)<__wolin_reg10>[ubyte*]=1024[ubyte*]
add__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1[uword]
addSP(2)<__wolin_reg10>[ubyte*]=SP(2)<__wolin_reg10>[ubyte*],__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]
freeSP<__wolin_reg10>,#2
add__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte],#1[ubyte]
goto__wolin_lab_loopStart_1[adr]
label__wolin_lab_loopEnd_1
freeSP<__wolin_reg4>,#1
ret

label__wolin_indirect_jsr
goto65535[adr]

label__wolin_pl_qus_wolin_i
alloc0[uword]

label__wolin_pl_qus_wolin_znak
alloc0[ubyte]


Again, this is code to put increasing values into 1000 addresses starting at 1024:

Code:
package pl.qus.wolin

var border: ubyte^53280

var ekran: ubyte[]^1024

var i: uword = 0

var znak: ubyte = 0

fun main() {
    while(i<1000) {
        ekran[i++] = znak++
    }
}


Look, mom! Just two ZP registers allocated! (Note though that shifting of stack locations due to removed registers is not implemented yet)


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 3:49 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
And the same optimized code after bug fixes and with shifted stack registers:

setupHEADER
setupSPF=251[ubyte],40959[uword]
setupSP=143[ubyte]
setupHEAP=176[ubyte]
let__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=#0[ubyte]
let__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=#0[ubyte]
allocSPF,#0
call__wolin_pl_qus_wolin_main[adr]
freeSPF<unit>,#2
ret

label__wolin_pl_qus_wolin_main
allocSP<__wolin_reg4>,#1
label__wolin_lab_loopStart_1
evallessSP(0)<__wolin_reg4>[bool]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1000[uword]
bneSP(0)<__wolin_reg4>[bool]=#1[bool],__wolin_lab_loopEnd_1<label_po_if>[adr]
allocSP<__wolin_reg9>,#2
allocSP<__wolin_reg10>,#2
letSP(0)<__wolin_reg10>[ubyte*]=1024[ubyte*]
add__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1[uword]
addSP(0)<__wolin_reg10>[ubyte*]=SP(0)<__wolin_reg10>[ubyte*],__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]
letSP(2)<__wolin_reg9>[ubyte*]=SP(0)<__wolin_reg10>[ubyte*]
freeSP<__wolin_reg10>,#2
add__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte],#1[ubyte]
let&SP(0)<__wolin_reg9>[ubyte*]=__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]
freeSP<__wolin_reg9>,#2
goto__wolin_lab_loopStart_1[adr]
label__wolin_lab_loopEnd_1
freeSP<__wolin_reg4>,#1
ret
label__wolin_indirect_jsr
goto65535[adr]
label__wolin_pl_qus_wolin_i
alloc0[uword]
label__wolin_pl_qus_wolin_znak
alloc0[ubyte]


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 4:39 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
And finaly 6502 code of optimized Wolin code:

Code:
; setupHEADER


;**********************************************
;*
;* BASIC header
;*
;* compile with:
;* cl65.exe -o assembler.prg -t c64 -C c64-asm.cfg -g -Ln labels.txt assembler.s
;*
;**********************************************
            .org 2049
            .export LOADADDR = *
Bas10:      .word BasEnd
            .word 10
            .byte 158 ; sys
            .byte " 2064"
            .byte 0
BasEnd:     .word 0
            .word 0
            ;


; setupSPF=251[ubyte],40959[uword]


; prepare function stack
__wolin_spf := 251 ; function stack ptr
__wolin_spf_hi := 251+1 ; function stack ptr

__wolin_spf_top := 40959 ; function stack top
__wolin_spf_top_hi := 40959+1 ; function stack top
    lda #<__wolin_spf_top ; set function stack top
    sta __wolin_spf
    lda #>__wolin_spf_top
    sta __wolin_spf+1

; setupSP=143[ubyte]


; prepare program stack
__wolin_sp_top := 143 ; program stack top
__wolin_sp_top_hi := 143+1 ; program stack top
    ldx #__wolin_sp_top ; set program stack top

; setupHEAP=176[ubyte]


__wolin_this_ptr := 176
__wolin_this_ptr_hi := 176+1


; let__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=#0[ubyte]


    lda #<0
    sta __wolin_pl_qus_wolin_i
    lda #>0
    sta __wolin_pl_qus_wolin_i+1

; let__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=#0[ubyte]


    lda 0
    sta __wolin_pl_qus_wolin_znak

; allocSPF,#0

 

; call__wolin_pl_qus_wolin_main[adr]

    jsr __wolin_pl_qus_wolin_main

; freeSPF<unit>,#2


    clc
    lda __wolin_spf
    adc #2
    sta __wolin_spf
    bcc :+
    inc __wolin_spf+1
:

; ret

    rts

; label__wolin_pl_qus_wolin_main

__wolin_pl_qus_wolin_main:

; allocSP<__wolin_reg4>,#1

    dex

; label__wolin_lab_loopStart_1

__wolin_lab_loopStart_1:

; evallessSP(0)<__wolin_reg4>[bool]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1000[uword]


    lda #1 ; mniejsze
    sta 0,x
    lda __wolin_pl_qus_wolin_i+1
    cmp #>1000
    bcc :+ ; mniejsze
    lda __wolin_pl_qus_wolin_i
    cmp #<1000
    bcc :+ ; mniejsze
    lda #0 ; jednak wieksze
    sta 0,x
:


; bneSP(0)<__wolin_reg4>[bool]=#1[bool],__wolin_lab_loopEnd_1<label_po_if>[adr]


    lda 0,x
    beq __wolin_lab_loopEnd_1

; allocSP<__wolin_reg9>,#2


    dex
    dex

; allocSP<__wolin_reg10>,#2


    dex
    dex

; letSP(0)<__wolin_reg10>[ubyte*]=1024[ubyte*]


    lda #<1024
    sta 0,x
    lda #>1024
    sta 0+1,x

; add__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]=__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword],#1[uword]


    clc
    lda __wolin_pl_qus_wolin_i
    adc #<1
    sta __wolin_pl_qus_wolin_i
    lda __wolin_pl_qus_wolin_i+1
    adc #>1
    sta __wolin_pl_qus_wolin_i+1


; addSP(0)<__wolin_reg10>[ubyte*]=SP(0)<__wolin_reg10>[ubyte*],__wolin_pl_qus_wolin_i<pl.qus.wolin.i>[uword]


    clc
    lda 0,x
    adc __wolin_pl_qus_wolin_i
    sta 0,x
    lda 0+1,x
    adc __wolin_pl_qus_wolin_i+1
    sta 0+1,x

; letSP(2)<__wolin_reg9>[ubyte*]=SP(0)<__wolin_reg10>[ubyte*]


    lda 0,x
    sta 2,x
    lda 0+1,x
    sta 2+1,x

; freeSP<__wolin_reg10>,#2


    inx
    inx

; add__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]=__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte],#1[ubyte]


    inc __wolin_pl_qus_wolin_znak

; let&SP(0)<__wolin_reg9>[ubyte*]=__wolin_pl_qus_wolin_znak<pl.qus.wolin.znak>[ubyte]


    lda __wolin_pl_qus_wolin_znak
    sta (0,x)

; freeSP<__wolin_reg9>,#2


    inx
    inx

; goto__wolin_lab_loopStart_1[adr]

    jmp __wolin_lab_loopStart_1

; label__wolin_lab_loopEnd_1

__wolin_lab_loopEnd_1:

; freeSP<__wolin_reg4>,#1

    inx

; ret

    rts

; label__wolin_indirect_jsr

__wolin_indirect_jsr:

; goto65535[adr]

    jmp 65535

; label__wolin_pl_qus_wolin_i

__wolin_pl_qus_wolin_i:

; alloc0[uword]

    .word 0

; label__wolin_pl_qus_wolin_znak

__wolin_pl_qus_wolin_znak:

; alloc0[ubyte]

    .byte 0



I still need to consolidate subsequent allocs/frees


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 5:22 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Looks like a significant improvement...


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 22, 2019 1:25 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
Well - I was planning to do optimization as last part of the project, believing it will make things more complicated. So it's quite surptising it actually simplifies things...


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 22, 2019 7:47 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Optimisation can make things complicated *if* you implement instruction scheduling (which is irrelevant for the 6502, because it's neither pipelined nor superscalar) or loop unrolling (which you haven't - yet). Most relevant optimisations do actually simplify things, by removing unnecessary artefacts of the code generation process and leaving behind code that mostly makes real progress on the computation.


Top
 Profile  
Reply with quote  
PostPosted: Sun Oct 20, 2019 5:48 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
Not much time lately, and a lot of things are broken due to reference/dereference logic (plus I make a lot of silly mistakes in templates), but this evening I was able to fix this:

Code:
package pl.qus.wolin

var border: ubyte^53280

fun suma(a: ubyte, b: ubyte): ubyte {
    return a+b
}

fun main() {
    border = suma(2,2)
}


Which generates a working proggy:

Code:
; setupHEADER


;**********************************************
;*
;* BASIC header
;*
;* compile with:
;* cl65.exe -o assembler.prg -t c64 -C c64-asm.cfg -g -Ln labels.txt assembler.s
;*
;**********************************************
            .org 2049
            .export LOADADDR = *
Bas10:      .word BasEnd
            .word 10
            .byte 158 ; sys
            .byte " 2064"
            .byte 0
BasEnd:     .word 0
            .word 0
            ;


; setupSPF=251[ubyte],40959[uword]


; prepare function stack
__wolin_spf := 251 ; function stack ptr
__wolin_spf_hi := 251+1 ; function stack ptr

__wolin_spf_top := 40959 ; function stack top
__wolin_spf_top_hi := 40959+1 ; function stack top
    lda #<__wolin_spf_top ; set function stack top
    sta __wolin_spf
    lda #>__wolin_spf_top
    sta __wolin_spf+1

; setupSP=143[ubyte]


; prepare program stack
__wolin_sp_top := 143 ; program stack top
__wolin_sp_top_hi := 143+1 ; program stack top
    ldx #__wolin_sp_top ; set program stack top

; setupHEAP=176[ubyte]


__wolin_this_ptr := 176
__wolin_this_ptr_hi := 176+1

; allocSPF,#0


; call__wolin_pl_qus_wolin_main[adr]

    jsr __wolin_pl_qus_wolin_main

; freeSPF<unit>,#0

 

; ret

    rts

; label__wolin_pl_qus_wolin_suma

__wolin_pl_qus_wolin_suma:

; allocSP<__wolin_reg3>,#2


    dex
    dex

; letSP(0)<__wolin_reg3>[ubyte*]=*SPF(1)<pl.qus.wolin.suma.a>[ubyte]


    clc
    lda __wolin_spf
    adc #1
    sta 0,x
    lda __wolin_spf+1
    adc #0
    sta 0+1,x

; add&SP(0)<__wolin_reg3>[ubyte*]=&SP(0)<__wolin_reg3>[ubyte*],SPF(0)<pl.qus.wolin.suma.b>[ubyte]


    clc
    lda (0,x)
    ldy #0
    adc (__wolin_spf), y
    sta (0,x)


; letSPF(2)<returnValue>[ubyte]=&SP(0)<__wolin_reg3>[ubyte*]


    lda (0,x)
    ldy #2
    sta (__wolin_spf),y


; freeSP<__wolin_reg3>,#2


    inx
    inx

; freeSPF,#2


    clc
    lda __wolin_spf
    adc #2
    sta __wolin_spf
    bcc :+
    inc __wolin_spf+1
:

; ret

    rts

; label__wolin_pl_qus_wolin_main

__wolin_pl_qus_wolin_main:

; allocSP<__wolin_reg7>,#2


    dex
    dex

; letSP(0)<__wolin_reg7>[ubyte*]=53280[ubyte*]


    lda #<53280
    sta 0,x
    lda #>53280
    sta 0+1,x

; allocSP<__wolin_reg8>,#1

    dex

; allocSPF,#3


    clc
    lda __wolin_spf
    sbc #3
    sta __wolin_spf
    bcs :+
    dec __wolin_spf+1
:

; letSPF(1)[ubyte]=#2[ubyte]


    ldy #1
    lda #2
    sta (__wolin_spf),y

; letSPF(0)[ubyte]=#2[ubyte]


    ldy #0
    lda #2
    sta (__wolin_spf),y

; call__wolin_pl_qus_wolin_suma[adr]

    jsr __wolin_pl_qus_wolin_suma

; letSP(0)<__wolin_reg8>[ubyte]=SPF(0)<returnValue>[ubyte]


    ldy #0
    lda (__wolin_spf),y
    sta 0,x


; freeSPF<ubyte>,#1


    clc
    lda __wolin_spf
    adc #1
    sta __wolin_spf
    bcc :+
    inc __wolin_spf+1
:

; let&SP(1)<__wolin_reg7>[ubyte*]=SP(0)<__wolin_reg8>[ubyte]


    lda 0,x
    sta (1,x)

; freeSP<__wolin_reg8>,#1

    inx

; freeSP<__wolin_reg7>,#2


    inx
    inx

; ret

    rts

; label__wolin_indirect_jsr

__wolin_indirect_jsr:

; goto65535[adr]

    jmp 65535

; alloc0[ubyte]

    .byte 0


So I can go back to arrays/object instances etc.


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 23, 2019 12:41 pm 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
New pieces of code = new surprises with optimizer. Although it is obvious as soon as you think about it, I forgot I have to shift stack vectors of substituted registers. I.e.

Let's say some reg0 is assigned some function stack stack variable (a local variable, function parameter or return value):

reg0 = 3rd variable on function stack from top

and then I call another function, first allocating new space on function stack :D

Obviously "3rd variable on function stack" isn't 3rd anymore, so if I use it in place of reg0 I will get crap...

Back to drawing board ;)


Top
 Profile  
Reply with quote  
PostPosted: Sat Dec 07, 2019 9:59 am 
Offline

Joined: Sat Apr 20, 2019 5:31 pm
Posts: 104
Small update - I was stuck for a moment due to how I did the optimizations, namely: manipulating ANTLR tree in memory. It is tricky, as these are meant to be read-only, so optimizing code that way is a hack, really. Anyway - I was able to fix the bug described above and now call stack args get properly shifted on alloc/free. But for next optimization phases I will probably work by changin pseudosam in plain text instead of hacking ANTLR tree do to what it was not designed to do.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 171 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7, 8, 9 ... 12  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 17 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: