PostPosted: Fri Jan 26, 2018 6:27 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1938
Location: Sacramento, CA, USA
Martin_H wrote:
The second is a 16 bit stack pointer in page zero that grows downward from ram top: https://github.com/Martin-H1/Lisp65/blob/master/bigstack.asm

You might end up encountering some difficulties assembling those stz (_sp),y instructions :wink:

Mike B.


PostPosted: Fri Jan 26, 2018 6:28 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8461
Location: Southern California
Druzyek wrote:
Quote:
You can save X on the stack too, but then take that into account when formulating the number to add the index to.

Would you do that by having the macro assign a temporary value that is added to the hard coded base address and replace all push and pull arguments with macros that increase or decrease that value? It seems like it would be tough to keep adding and subtracting to the base yourself, especially if you go back and add a push or pull and have to redo the following ones.

I haven't thought about how to have a macro do that. I guess I'd like to get more code done with it to identify a pattern so the macro truly clarifies things. However, the following is from the parameter-passing chapter of the 6502 stacks treatise, about 30% of the way down the long page:

      If you want to access that output data that's on the stack before pulling it off, perhaps even in arbitrary order, remember that X is still valid from the subroutine using it for indexing into the stack, as shown below. If you've done TSX again, adjust your indexed numbers to account for the fact that the return address is no longer there, and in the case below, that there was a PLA after the TSX such that now after the subroutine return, the outputs are at 101,X to 104,X. If you have not done TSX again, the old indexed numbers will still be valid.

      The example above calls the UM_STAR subroutine below, which takes Bruce Clark's improvement on my commented bug fix on the UM* multiplication in fig-Forth, and modifies it for I/O on the hardware stack. The structures are per my structure macro article and source code linked there. They assemble exactly the same thing you would by hand, but make the conditions and branches in the source code clearer.

      Code:
      UM_STAR: LDA #0                 ; Unsigned, mixed-precision (16-bit by 16-bit input, 32-bit output)
               PHA                    ; multiply.  Add a variable byte to the stack, initializing it as 0.
               TSX                    ; Now 101,X holds that new variable, 102,X and 103,X hold the return
                LSR $107,X             ; address, and 104,X to 107,X hold the inputs and later the outputs.
               ROR $106,X
               FOR_Y  16, DOWN_TO, 0  ; Loop 16x.  The DEY, BNE in NEXT_Y below will drop through on 0.
                   IF_CARRY_SET
                       CLC
                       PHA            ; Note that the PHA (and PLA below) doesn't affect the indexing.
                          LDA $101,X
                          ADC $104,X
                          STA $101,X
                       PLA
                       ADC $105,X
                   END_IF
                   ROR
                   ROR $101,X
                   ROR $107,X
                   ROR $106,X
               NEXT_Y
               STA $105,X
               PLA                    ; Retrieve the variable byte we added at the top, cleaning up the stack.
               STA $104,X             ; Again note that the PLA changed S but not X, so the 104 is still 104.
               RTS
       ;------------------

      (Note: In many situations, it will be advantageous to give names to the items on the stack, using EQUates, so it's clearer what 101,X is, what 102,X is, 103,X, and so on. But don't forget to still put the ,X after them. More on that later.)
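As a sketch of that naming idea (the names here are invented for illustration, not taken from any actual listing), the UM_STAR offsets above might be EQUated like this:

```asm
SCRATCH  EQU  $101      ; the variable byte pushed at the top
RTN_LO   EQU  $102      ; subroutine return address, low byte
RTN_HI   EQU  $103      ;   and high byte
IN1_LO   EQU  $104      ; first 16-bit input (later two of the output bytes)
IN1_HI   EQU  $105
IN2_LO   EQU  $106      ; second 16-bit input, shifted out during the loop
IN2_HI   EQU  $107

; so for example:
         LSR  IN2_HI,X  ; assembles the same as LSR $107,X
         ROR  IN2_LO,X  ; assembles the same as ROR $106,X
```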

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Fri Jan 26, 2018 8:12 am 

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
I haven't thought about how to have a macro do that.
I'm not sure if the ca65 macros, which are really great, could do it either. I could not get it to count how many arguments I needed and then push that many, regardless of where the function was called. Instead, I pushed the count of arguments to the stack, so my loop takes a few extra cycles. I do think it would be very doable with a custom macro system. I wonder if I could pass a source file through a macro processor without assembling, then change some things and run it back through. That would be extremely powerful without having to write a new macro system.

Quote:
Note that the PHA (and PLA below) doesn't affect the indexing
This is really slick. I think what I'm afraid of with this kind of stack stuff is needing nested X and Y loops that also need to use X to index into the stack and the value of X for something else. It seems like you would have a lot of switching overhead.

Would you tend to use X indexing for everything until you absolutely need the extra speed? What do you like about doing it this way over zero page with no X index?


PostPosted: Fri Jan 26, 2018 8:59 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8461
Location: Southern California
Something I liked about the 2500AD assembler I used at work from 1986 to about 1990 was that it allowed you to say in a macro definition in essence "if there's a fourth macro parameter, do this with it;" or "If there's a fifth one, do the following with it." The C32 assembler I use now requires you to use the same number of parameters in the macro invocation that the macro was written to use, although I suppose you could pad unused ones with 0's or whatever value would guarantee you're not using it (although that doesn't help readability).

I have never used a macro that I myself did not write though. They're very quick to write, assuming you know what you want. They can also have any amount of conditional assembly you could want. As much as I've used and written about them, I think there's still a lot more that macros could be used to do that I haven't thought of yet, or at least haven't thought of how to do, like parsing a string. Of course, it would depend on the particular assembler's macro capabilities, how powerful its macro language is. (Most assemblers are pretty similar in this respect though.) This is all part of extending the power and capabilities of assembly language. I don't think we're anywhere near having maxed it out yet. We just need to get more creative and resourceful and share our developments to build upon.

Much of what's been talked about in the preceding posts was about passing parameters on the hardware stack, C-style. Sometimes that will be appropriate; but it does come with penalties, as I describe in the 6502 stacks treatise. Virtually all of my own parameter-passing by stacks goes on in the ZP data stack, using X as the stack pointer. This is in Forth though, so I very seldom need X for anything else. When I do, I save and restore it. As they say, you can get 95% of max performance with only 5% of your code in assembly, provided it's the right 5%, the critical 5%. In Forth it's very easy to dip into assembly when you need the performance or the tightest control of the machine. In those cases, I have to remember to save and restore X if I use it for something besides the data-stack pointer.



PostPosted: Sat Jan 27, 2018 1:56 am 

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
The C32 assembler I use now requires you to use the same number of parameters in the macro invocation that the macro was written to use, although I suppose you could pad unused ones with 0's or whatever value would guarantee you're not using it (although that doesn't help readability).
I swear no one is paying me to shill for the ca65 assembler; I am just really impressed. You can supply fewer than the number of arguments the macro was written to use and conditionally assemble based on whether an argument is blank or how many arguments it receives: http://www.cc65.org/doc/ca65-12.html
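As a minimal ca65 sketch of that (macro and operand values invented here), .PARAMCOUNT gives the number of arguments actually supplied to an invocation, so trailing arguments can be optional:

```asm
.macro  copyb  src, dest, dest2
        lda  src
        sta  dest
.if .paramcount > 2       ; assemble the second store only if a third argument was given
        sta  dest2
.endif
.endmacro

; invocations may omit the trailing argument:
        copyb  $10, $20       ; two-argument form
        copyb  $10, $20, $30  ; three-argument form: copies the byte twice
```

(.ifblank works similarly, testing whether one particular argument was left empty.)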

Quote:
This is all part of extending the power and capabilities of assembly language. I don't think we're anywhere near having maxed it out yet. We just need to get more creative and resourceful and share our developments to build upon.
Could be. Maybe I am premature in saying you can't do what I want to with macros. There are two problems I have not been able to see how macros could solve without being designed specifically to do it:

Code:
;Assuming functions use zero page addresses $00-$07
main:
;This macro should jump to func1 without pushing anything
macro_jsr func1

func1:
;This macro should assign a-d to addresses $00-$03
macro_var a, b, c, d
;Code using a, b, c, and d
...
IF something
;This macro should jump to func2 without pushing anything
macro_jsr func2
ENDIF
rts

func2:
;This macro should assign a-d to addresses $04-$07 since $00-$03 are taken
macro_var a, b, c, d
;Code using a, b, c, and d
...
;This macro should push $00-$03 since the first call to it is already using them
;Note that the first call to func1 did not push anything!
macro_jsr func1
rts
I think to do this you might need some higher level abstraction but maybe it can be done with macros.

Another problem is where to put the memory to reduce pushes. Imagine if you assigned $00-$07 like this:
func1: $00-$03
func2: $04-$07
func3: ??? (4 bytes)

If the calling order is func1->func2->func3, it doesn't matter since it will have to push anyway.
With this order: func2->func1 and func2->func3, func3 should use $00-$03.
With this order: func1->func2 and func1->func3, func3 should use $04-$07.

Another useful feature would be basic optimizing like this:
Code:
main:
ldx #25
lda #VALUE
macro_jsr func
ldx #35
...

func:
clc
adc #4
IF A, EQUALS, 9
ina
rts
ELSE
dex
stx PERIPHERAL1
lda #0
ENDIF
If #VALUE is 5 then it should automatically reduce to this, since it is both smaller and faster than what it replaces:
Code:
lda #10
ldx #35
On the other hand if #VALUE is 7 (and it's your only call to func) it could become this:
Code:
ldx #24
stx PERIPHERAL1
lda #0
ldx #35

The trick is, I think, having the assembler spot that #VALUE is a constant that always leads to the same result and computing it for you. The strength is you only have to write the function once and it still works when A is loaded with something not constant, in which case the function should stay intact.

I can deal without most of what C has to offer but it would be nice to have just enough abstraction to do those kinds of optimizations. I think if you 1.) treated normal labels as function names, 2.) only used temporary labels within functions, and 3.) only jumped to functions using their names, rather than hard coded addresses, you could do quite a lot. I think you would also never have to compromise performance and you would only increase program size if you chose speed over size. I suspect you would need a lot more than macros though.

Quote:
This is in Forth though, so I very seldom need X for anything else.
Interesting. Out of curiosity, do you have an idea of the speed trade-off versus doing the same job in assembly?


PostPosted: Sat Jan 27, 2018 2:16 am 

Joined: Tue Oct 25, 2016 8:56 pm
Posts: 360
Druzyek wrote:
I swear no one is paying me to shill for the ca65 assembler; I am just really impressed. You can supply fewer than the number of arguments the macro was written to use and conditionally assemble based on whether an argument is blank or how many arguments it receives: http://www.cc65.org/doc/ca65-12.html


You can also do it in the WDC Official Assembler. Here's a trivial example using WDC syntax; it extends my CopyByte macro so that you can make two copies of the same byte by passing a second destination argument.
Code:
       ARGCHK OFF
COPYB  MACRO  mSrc, mDest, mDest2
       LDA  mSrc
       STA  mDest
       IFMA 2              ; does mDest2 exist?
       STA  mDest2
       ENDIF
       ENDM
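For instance (labels invented), the two invocation forms would be:

```asm
        COPYB  SRC, DEST         ; one copy; IFMA 2 is false, so no second STA
        COPYB  SRC, DEST, DEST2  ; two copies of the same byte
```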

_________________
Want to design a PCB for your project? I strongly recommend KiCad. It's free, it's multiplatform, and it's easy to learn!
Also, I maintain KiCad libraries of Retro Computing and Arduino components you might find useful.


PostPosted: Sat Jan 27, 2018 11:37 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8461
Location: Southern California
Druzyek wrote:
Quote:
The C32 assembler I use now requires you to use the same number of parameters in the macro invocation that the macro was written to use, although I suppose you could pad unused ones with 0's or whatever value would guarantee you're not using it (although that doesn't help readability).
I swear no one is paying me to shill for the ca65 assembler; I am just really impressed. You can supply fewer than the number of arguments the macro was written to use and conditionally assemble based on whether an argument is blank or how many arguments it receives: http://www.cc65.org/doc/ca65-12.html

Quote:
This is all part of extending the power and capabilities of assembly language. I don't think we're anywhere near having maxed it out yet. We just need to get more creative and resourceful and share our developments to build upon.
Could be. Maybe I am premature in saying you can't do what I want to with macros. There are two problems I have not been able to see how macros could solve without being designed specifically to do it:

Code:
;Assuming functions use zero page addresses $00-$07
main:
;This macro should jump to func1 without pushing anything
macro_jsr func1

func1:
;This macro should assign a-d to addresses $00-$03
macro_var a, b, c, d
;Code using a, b, c, and d
...
IF something
;This macro should jump to func2 without pushing anything
macro_jsr func2
ENDIF
rts

func2:
;This macro should assign a-d to addresses $04-$07 since $00-$03 are taken
macro_var a, b, c, d
;Code using a, b, c, and d
...
;This macro should push $00-$03 since the first call to it is already using them
;Note that the first call to func1 did not push anything!
macro_jsr func1
rts
I think to do this you might need some higher level abstraction but maybe it can be done with macros.

Another problem is where to put the memory to reduce pushes. Imagine if you assigned $00-$07 like this:
func1: $00-$03
func2: $04-$07
func3: ??? (4 bytes)

If the calling order is func1->func2->func3, it doesn't matter since it will have to push anyway.
With this order: func2->func1 and func2->func3, func3 should use $00-$03.
With this order: func1->func2 and func1->func3, func3 should use $04-$07.


Assuming I understand correctly what you want to do (which may not be a safe assumption), you can get around these problems by using a data stack, separate from the return stack, as discussed in the 6502 stacks treatise, particularly chapters 4 through 6. Although I'm a macro junkie, this might eliminate the complication that macros are meant to hide or automate, because the matter of passing parameters becomes implicit, rather than explicit. (I hope you're not disappointed.)

One or more of the variables a, b, c, and d may not need to exist as variables at all if they were just calculated by recent operations and won't be needed ever again after main and/or func1 and/or func2 is done with them. They take space on the data stack only while they're needed, then they cease to exist. They never had a hard address, so the problem of trying to figure out where to put them (like $00-03 versus $04-07) is gone. If you pass them on the return (page-1) stack instead, C-style, there's extra overhead. [Some of the following is just copied from the stacks treatise.] Consider the problem where one routine passes parameters to another, and now there's a subroutine-return address on top putting the target data farther down the stack than expected, so you get incorrect results.

The separate data stack in ZP avoids these problems, because the subroutine return addresses don't go on it. ZP addressing itself is also more efficient than absolute addressing of course, and there are more addressing modes for ZP. Parameter-passing becomes implicit. Samuel Falvo (kc5tja on this forum) has an excellent post about this, the second post in the topic "Stack Frames?" Key benefits he mentions are:

  • not having to re-mirror the input parameters on the stack all the time. This results in the fastest possible execution while still using stack frames.
  • reducing the impact of recursion (the 6502 stacks treatise covers recursion in chapter 15)
  • it's easy to return multiple values on the data stack too (even varying numbers of values, if it's advantageous enough)
  • it results in more compact code, and even permits better factoring because you don't have to constantly push and pop duplicate arguments all the time when calling sub-subroutines
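As a minimal sketch of such a primitive (not taken from the treatise; it assumes cells are stored low byte at 0,X, high byte at 1,X, with X pointing at the top cell), a Forth-style + would add the top two 16-bit cells in place:

```asm
PLUS:    CLC
         LDA  0,X       ; low byte of the top cell
         ADC  2,X       ;   plus low byte of the second cell
         STA  2,X
         LDA  1,X       ; high bytes, with the carry
         ADC  3,X
         STA  3,X
         INX            ; drop the consumed cell;
         INX            ;   the sum is now the top of stack
         RTS
```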

If a high-level language passes parameters through the hardware stack, you could have an instruction like:

Code:
        CALL REPLINE (A(4), M(), "Limit=", B, 600, T2)

and all those parameters have to be specifically dealt with in the call. Several steps must be carried out at the call and return times, beyond just the call and return. However, with a separate data stack, input parameters that have been derived earlier can be left on the stack, just waiting there, without interfering with anything, until REPLINE is called, not storing them elsewhere, and not having to re-load them to put them in a stack frame now. Subroutine outputs can also be read from the stack and dealt with whenever you get around to it, again without having to store them right away as part of the return process. The resulting simpler, more efficient instruction might then be simply,

Code:
        REPLINE

or in the case of 6502 assembly language,

Code:
        JSR  REPLINE

func1 can call func2 which can call func3 etc., and maybe func1 calls func3 directly after func2 is done running, and there is absolutely no overhead to passing the parameters. This is the way Forth does it, although you can do the same kind of thing in assembly.
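A sketch of that implicit passing (CALC1 and CALC2 are invented names; REPLINE is the routine named above): each routine just leaves its result cells on the ZP data stack, where the next routine finds its inputs, with no explicit parameter handling in between:

```asm
        JSR  CALC1      ; leaves a result cell on the data stack
        JSR  CALC2      ; leaves another on top of it
        JSR  REPLINE    ; consumes both cells as its inputs
```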

In the following:
Quote:
Another useful feature would be basic optimizing like this:
Code:
main:
ldx #25
lda #VALUE
macro_jsr func
ldx #35
...

func:
clc
adc #4
IF A, EQUALS, 9
ina
rts
ELSE
dex
stx PERIPHERAL1
lda #0
ENDIF
If #VALUE is 5 then it should automatically reduce to this, since it is both smaller and faster than what it replaces:
Code:
lda #10
ldx #35
On the other hand if #VALUE is 7 (and it's your only call to func) it could become this:
Code:
ldx #24
stx PERIPHERAL1
lda #0
ldx #35

The trick is, I think, having the assembler spot that #VALUE is a constant that always leads to the same result and computing it for you. The strength is you only have to write the function once and it still works when A is loaded with something not constant, in which case the function should stay intact.

the accumulator being tested in the IF line doesn't exist at assembly time, so the assembler would have to somehow examine the preceding instructions, ones that are not in the macro definition or invocation. No assembler that I know of can do this. A possible workaround is to have a macro to replace the LDA#, and besides laying down the LDA# instruction, it stores what you loaded there, and the func macro (if that's what it is) can examine this to use in its conditional assembly. Even then, it might be risky, because what if another intervening instruction alters A, so it's no longer what the LDA# put in it.
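That workaround might look something like the following in a macro language with assembly-time variables (a ca65-flavored sketch with invented names; the risk just mentioned still applies if anything between the two macros alters A):

```asm
LAST_A  .set  -1          ; -1 means "contents of A unknown at assembly time"

.macro  ldai  val         ; used in place of LDA # so the macros can track A
        lda  #val
LAST_A  .set  val
.endmacro

.macro  call_func
.if LAST_A = 5            ; the known-constant case reduces to its precomputed result
        lda  #10          ; i.e., 5 + 4 = 9, which the IF then increments
.else
        jsr  func         ; otherwise lay down the general call
.endif
LAST_A  .set  -1          ; after this, A is no longer known
.endmacro
```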

Quote:
I can deal without most of what C has to offer but it would be nice to have just enough abstraction to do those kinds of optimizations. I think if you 1.) treated normal labels as function names,
I do it sometimes. Macro invocations do that automatically.

Quote:
2.) only used temporary labels within functions,
That's covered in chapter 14 of the 6502 stacks treatise. Putting local variables on one stack or the other, they only take space while they're needed, during the running of the particular function or for being passed as inputs and outputs, and then they cease to exist.

Quote:
and 3.) only jumped to functions using their names, rather than hard coded addresses, you could do quite a lot. I think you would also never have to compromise performance and you would only increase program size if you chose speed over size. I suspect you would need a lot more than macros though.
Every routine will have an address, although part of the assembler's job is to hide it. If you want the routines relocatable, well, the '02 can do it—sorta—but it's very poorly suited for that. Chapters 12 and 13 deal with that. The 65816 does much better.

Quote:
This is in Forth though, so I very seldom need X for anything else.
Interesting. Out of curiosity, do you have an idea of the speed trade-off versus doing the same job in assembly?

Depending on the characteristics one is after, there are several ways to run Forth, so there could be a wide range of answers. I won't write them all up here as it would get too long; but I'll just give a few tidbits:

  • Data on the stacks is normally handled in cells which are a standard size. For the 6502, that's 16 bits, i.e., two bytes, even if you only need 8 bits and you ignore the other 8. The 16 is because the '02 has a 16-bit address bus, and cells often contain addresses. If you want 32-, 48-, or 64-bit double-, triple-, or quad-precision numbers, you can combine cells. (You virtually never need more than double, and most operations are on single cells.) Strings normally use one byte per character; but if you put just one character in a stack cell, it will take two bytes and the high byte is zeroed and/or ignored.

  • In this sense, the 8-bit 6502 has to work harder than it would to handle just 8 bits at a time, so there's some inefficiency that is accepted in order to avoid the programming confusion and bugs you'd get if you tried to have mixed-length cells. (The treatise discusses this too.) It definitely works out well for development though. The 65816 with its 16-bit accumulator in particular is much more efficient at it, and my '816 Forth runs two to three times as fast as my '02 Forth at a given clock rate.

  • There is always some overhead to get from one routine/function/word to the next. In the case of subroutine-threaded code, or STC, that's a JSR-RTS pair; but if something takes no more bytes than the JSR instruction itself, you might as well straightline it and speed it up; for example, the word DROP drops a cell from the data stack by using INX INX, which is only two bytes compared to a JSR's three, and much faster than JSR, INX, INX, RTS, so obviously it's worth straightlining. Even if something is a little longer, you might decide to optimize for speed instead of program size. Even with direct-threaded code (DTC) and indirect-threaded code (ITC) Forth, there are surprisingly low-overhead ways to get from one routine to the next in a list. (See Bruce Clark's ideas here and here.) Token threading is one of two other ways, though I don't know of any implementations of it in '02 Forth: each of the 255 most-used routines is given a byte value so it can be referred to by a single byte, making the program more compact, but I'm sure the address look-up time slows things down.

  • In any case, if you need greater speed for something, you can dip into assembly language at any time; so you're not really giving up performance when it matters most.



PostPosted: Sat Jan 27, 2018 9:43 pm 

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
Assuming I understand correctly what you want to do (which may not be a safe assumption), you can get around these problems by using a data stack, separate from the return stack, as discussed in the 6502 stacks treatise, particularly chapters 4 through 6. Although I'm a macro junkie, this might eliminate the complication that macros are meant to hide or automate, because the matter of passing parameters becomes implicit, rather than explicit. (I hope you're not disappointed.)
The point of doing things differently to me is the gain in performance with local variables (not with passing arguments), although I do see how you would prefer using stacks for that too, even though it's slower. LDA, STA, and ADC (just as examples) only take 3 cycles from zero page, whereas they take 4 cycles with X indexing into a zero page stack or the return stack.

Imagine a function that uses four temporary variables in zero page with values calculated in the function (ie not arguments passed in). Here is an example of a fragment of an X indexed loop.
Code:
LDA var1 ;3 cycles
STA peripheral ;3 cycles
STX peripheral ;4 cycles (if peripheral not in zero page)
LDA var2 ;3 cycles
STA peripheral ;4 cycles
LDA (var3,X) ;6 cycles
STA peripheral ;4 cycles
STA var4 ;3 cycles
If you store your variables in zero page, this fragment takes 30 cycles. Here is how I think you would do it with a data stack in zero page (please let me know if there is a faster way to do it).
Code:
LDX data_sp ;3 cycles if in zero page, or ? cycles if ? (on stack?)
LDA var1,X ;4 cycles
STA peripheral ;4 cycles
LDX x_copy ;3 cycles if in zero page
STX peripheral ;4 cycles
LDX data_sp ;3 cycles
LDA var2, X ;4 cycles
STA peripheral ;4 cycles
LDA var3, X ;4 cycles
STA indirection_temp ;3 cycles, assuming in zero page
LDX x_copy ;3 cycles
LDA (indirection_temp,X) ;6 cycles
STA peripheral ;4 cycles
LDX data_sp ;3 cycles
STA var4, X ;4 cycles
This takes 56 cycles (or can you speed it up?), which is almost twice as slow! Of course, I am deliberately trying to create the worst case scenario to prove a point, which is that unindexed zero page is faster when you are working with intermediate values. If you are using arguments passed to the function, which might already be on the stack, as you do with the Forth style stack you mentioned, I see why it is faster to leave them on the data stack and not make a new copy for every function call. However, every time the function rereads one of those stack arguments with X indexing, it incurs a one cycle penalty. If you access it enough times, it will be faster to allocate space for it in zero page and make a copy.

Quote:
One or more of the variables a, b, c, and d may not need to exist as variables at all if they were just calculated by recent operations and won't be needed ever again after main and/or func1 and/or func2 is done with them. They take space on the data stack only while they're needed, then they cease to exist.
Right. I am using a, b, c, and d as temporaries. They also cease to exist when func1 or func2 exit since other functions are free to use that zero page memory (the same concept as a stack).

Quote:
They never had a hard address, so the problem of trying to figure out where to put them (like $00-03 versus $04-07) is gone.
I see the convenience in not having to figure out where to put them. My only point was that figuring out where to put them yourself avoids X indexing for temporary local variables like a, b, c, and d, which is faster. I was doing things like this myself and it quickly became unmanageable, as well as probably not as efficient as a computer could manage. One caveat I see is that the speed gain has to outweigh the longer time it takes to push and pull zero page to a stack, versus adjusting X for your data stack.

Quote:
However, with a separate data stack, input parameters that have been derived earlier can be left on the stack, just waiting there, without interfering with anything, until REPLINE is called, not storing them elsewhere, and not having to re-load them to put them in a stack frame now.
How would you handle an HLL function like this?:
Code:
x=Calculation1();
y=Calculation2();
z=Calculation3();
func(x,y,z);
func(z,x,y);
Would you take these steps?:
1.) leave the calculated x, y, and z values on the stack without storing them somewhere else (yet)
2.) not make a copy of them for the first call to func
3.) make a copy of them on the stack in a different order for the second call to func

Quote:
the accumulator being tested in the IF line doesn't exist at assembly time, so the assembler would have to somehow examine the preceding instructions, ones that are not in the macro definition or invocation. No assembler that I know of can do this.
I imagine you would have to expand all the macros first of all so that you are left with only instructions and data. Then you could interpret the code line by line and figure out which paths are useful and which can be boiled down to constants. Then you could hand it over to the assembler after you had made your own changes.

Do you know of an assembler that will output source after macro expansion but before assembling?

EDIT: Woops, missed part of your post:
Quote:
Quote:
and 3.) only jumped to functions using their names, rather than hard coded addresses, you could do quite a lot. I think you would also never have to compromise performance and you would only increase program size if you chose speed over size. I suspect you would need a lot more than macros though.
Every routine will have an address, although part of the assembler's job is to hide it. If you want the routines relocatable, well, the '02 can do it—sorta—but it's very poorly suited for that. Chapters 12 and 13 deal with that. The 65816 does much better.
I meant if you wanted to try to get a program to optimize like in my example, you would have to follow a few rules of abstraction. There is no way to tell from a JMP instruction if it is entering a new function or just going back to the head of a loop in the same function. You would have to know that (I think) to optimize well.


PostPosted: Sun Jan 28, 2018 9:07 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8461
Location: Southern California
Druzyek wrote:
Quote:
Assuming I understand correctly what you want to do (which may not be a safe assumption), you can get around these problems by using a data stack, separate from the return stack, as discussed in the 6502 stacks treatise, particularly chapters 4 through 6. Although I'm a macro junkie, this might eliminate the complication that macros are meant to hide or automate, because the matter of passing parameters becomes implicit, rather than explicit. (I hope you're not disappointed.)

The point of doing things differently to me is the gain in performance with local variables (not with passing arguments), although I do see how you would prefer using stacks for that too, even though it's slower. LDA, STA, and ADC (just as examples) only take 3 cycles from zero page, whereas they take 4 cycles with X indexing into a zero page stack or the return stack.

If something can be done both ways, the stack method often will be slower. Its benefit comes partly from being able to be re-entrant (including recursive) (the title of this topic is "Reentrant Programming [...]"), and from reducing the number of variables needed, variables which may even run you out of ZP space otherwise. The translation of your first five lines below brings no memory penalty and almost no speed penalty. If you're doing things by stacks, your X will be the stack pointer almost full time, and you won't keep swapping it out to use for something else like your LDX/STX data_sp / x_copy. It may take a little adjustment period to accustom oneself to doing things differently. The stacks method for your last three lines won't look favorable, at least not up front.

I shortened your comments a bit below so hopefully when I added more, they won't exceed anyone's window width:
Quote:
Imagine a function that uses four temporary variables in zero page with values calculated in the function (ie not arguments passed in). Here is an example of a fragment of an X indexed loop.
Code:
        LDA  var1        ;3 cycles
        STA  peripheral  ;3 cycles
        STX  peripheral  ;4 cycles (if peripheral not in zero page)
        LDA  var2        ;3 cycles
        STA  peripheral  ;4 cycles

        LDA  (var3,X)    ;6 cycles
        STA  peripheral  ;4 cycles
        STA  var4        ;3 cycles

First I'll comment on the first five lines:
Quote:
If you store your variables in zero page, this fragment takes 30 cycles. Here is how I think you would do it with a data stack in zero page (please let me know if there is a faster way to do it).
Code:
    x   LDX  data_sp     ;3 cy if in ZP, or ? cycles if ? (on stack?)      X is a nearly full-time stack pointer,
        LDA  var1,X      ;4 cy                                             so you probably don't need to re-load it.
        STA  peripheral  ;4 cy
    x   LDX  x_copy      ;3 cy if in ZP  You'd probably also have the desired value on the stack, so it's just LDA ZP,X.
        STX  peripheral  ;4 cy           You could re-use A anyway, because you don't need its contents from above again.
    x   LDX  data_sp     ;3 cy           Then this line would be unnecessary, because you didn't overwrite X above.
        LDA  var2,X      ;4 cy
        STA  peripheral  ;4 cy           So far, we have 5 lines and 13 bytes (just as before), and 20 versus 17 cycles.


The indexing for the next part is particularly inefficient for the '02 because indirection_temp is two bytes and you have to handle the low and high bytes separately, meaning it's even worse than you wrote. (It may be alleviated a bit if the high byte can remain fixed.) The '816 is a lot more efficient at handling 16 bits at once. Instead of using more ragtag variables, which are hard to tame so that one routine won't accidentally overwrite data still needed by another pending routine, a ZP space of about 8 bytes which we call N works well along with the stacks. N is always finished being used by the time you reach the end of a routine; it's never used to pass parameters, and its use is never interrupted by subroutines. If the rules are followed, it's always safe. Anyway, I can't offer much of an improvement on the part replacing your three lines starting with LDA (var3,X).
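As a sketch of how N can serve the indirection (the address of N and the stack-cell layout here are my assumptions, not taken from the treatise): copy the two-byte pointer from the top of the ZP data stack into N, then go through it with post-indexed addressing.

Code:
N       =    $10         ; hypothetical location of the 8-byte scratch area
        LDA  0,X         ; low byte of the pointer on top of the data stack
        STA  N
        LDA  1,X         ; high byte of the pointer
        STA  N+1
        LDY  #0
        LDA  (N),Y       ; fetch the byte the pointer addresses

Nothing else may touch N before the routine finishes, which is what keeps it safe.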

Quote:
This takes 56 cycles (or can you speed it up?), which is almost twice as slow! Of course, I am deliberately trying to create the worst case scenario to prove a point, which is that unindexed zero page is faster when you are working with intermediate values.
Fortunately it's also not very common.

You can however do an unlimited number of levels of indirection. On the '816 with A in 16-bit mode (for shorter, clearer code), it's just a series of
Code:
        LDA  (0,X)
        STA  0,X
        LDA  (0,X)
        STA  0,X
        <etc.>

and you can insert an indexing anywhere, with something like CLC, ADC, STA 0,X.
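Inserting an indexing step partway down such a chain might look like this ('816, A still 16-bit; OFFSET is a hypothetical field displacement, not from the original post):

Code:
        LDA  (0,X)       ; follow the pointer on top of the data stack
        CLC
        ADC  #OFFSET     ; step to a field inside the target structure
        STA  0,X         ; the adjusted address replaces the old top of stack
        LDA  (0,X)       ; dereference the new address
        STA  0,X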

Quote:
If you are using arguments passed to the function, which might already be on the stack, as you do with the Forth style stack you mentioned, I see why it is faster to leave them on the data stack and not make a new copy for every function call. However, every time the function rereads one of those stack arguments with X indexing, it incurs a one-cycle penalty. If you access it enough times, it will be faster to allocate space for it in zero page and make a copy.

True; but if avoiding the stack means you have to do more processing elsewhere to find a suitable place to put the data where it won't step on other data you still need, that will take extra time. If you further need it re-entrant (again the title of the topic), you're sunk. The ZP data stack solves these problems as well as ones that may have no solution at all outside of stacks. You can mix methods when needed though.

Quote:
Quote:
One or more of the variables a, b, c, and d may not need to exist as variables at all if they were just calculated by recent operations and won't be needed ever again after main and/or func1 and/or func2 is done with them. They take space on the data stack only while they're needed, then they cease to exist.

Right. I am using a, b, c, and d as temporaries. They also cease to exist when func1 or func2 exit since other functions are free to use that zero page memory (the same concept as a stack).

I've used the same memory locations for multiple things on the PIC16 where RAM is so limited and stack operations are rather prohibitive, and you sure have to be careful to make sure two different things don't need it at the same time when there are flurries of subroutines calling each other in different orders and directions. It's very bug-prone. I documented heavily, and still wound up with some real head-scratchers.

Quote:
Quote:
However, with a separate data stack, input parameters that have been derived earlier can be left on the stack, just waiting there, without interfering with anything, until REPLINE is called, not storing them elsewhere, and not having to re-load them to put them in a stack frame now.

How would you handle an HLL function like this?:
Code:
x=Calculation1();
y=Calculation2();
z=Calculation3();
func(x,y,z);
func(z,x,y);

Would you take these steps?:
1.) leave the calculated x, y, and z values on the stack without storing them somewhere else (yet)
2.) not make a copy of them for the first call to func
3.) make a copy of them on the stack in a different order for the second call to func

Code:
        JSR  <calc_x>   ; Leaves x on the data stack.  (Not necessarily a subroutine.)
        JSR  <calc_y>   ; Leaves y on the data stack, on top of x.      "
        JSR  <calc_z>   ; Leaves z on the data stack, on top of y.      "
        JSR  3DUP       ; DUPlicate the top 3 stack items.  (See notes below.)
        JSR  func       ; I assume your func takes 3 stack items, and returns none.
        JSR  minusROT   ; ROTate z back behind x & y.  (ROT goes the other direction.)
        JSR  func       ; 

If you wanted func to use the stack data but not do the customary deletion of cells it uses when it's done with them, you could omit the 3DUP.

Quote:
Quote:
the accumulator being tested in the IF line doesn't exist at assembly time, so the assembler would have to somehow examine the preceding instructions, ones that are not in the macro definition or invocation. No assembler that I know of can do this.

I imagine you would have to expand all the macros first of all so that you are left with only instructions and data. Then you could interpret the code line by line and figure out which paths are useful and which can be boiled down to constants. Then you could hand it over to the assembler after you had made your own changes.

The assembler would also have to see if any intervening called subroutines affect it, and if not, if any of the ones they call affect it, etc., however many levels deep it might go. If any of those have not been encountered yet in the assembly because they're further down in the source code, later passes might produce phase errors, and it may take a lot of passes to resolve them. (That part in itself is not necessarily a problem.) If there's a way to do it though, it sounds attractive. You could do similar things by watching the carry flag, for example, so it knows if it can omit the CLC before an ADC or the SEC before a SBC, or, more rarely, the decimal-mode flag. Maybe more.

Quote:
Do you know of an assembler that will output source after macro expansion but before assembling?

No, but enso started a topic "A sensible macro engine" where he proposes and explores the idea of a macro pre-processor that could then be used with various assemblers including ones with no macro capability. (This and other links are in my article on forming program-structure macros in assembly language through macros.)

Quote:
EDIT: Woops, missed part of your post:
Quote:
Quote:
and 3.) only jumped to functions using their names, rather than hard coded addresses, you could do quite a lot. I think you would also never have to compromise performance and you would only increase program size if you chose speed over size. I suspect you would need a lot more than macros though.

Every routine will have an address, although part of the assembler's job is to hide it. If you want the routines relocatable, well, the '02 can do it—sorta—but it's very poorly suited for that. Chapters 12 and 13 deal with that. The 65816 does much better.

I meant if you wanted to try to get a program to optimize like in my example, you would have to follow a few rules of abstraction. There is no way to tell from a JMP instruction if it is entering a new function or just going back to the head of a loop in the same function. You would have to know that (I think) to optimize well.

My article on using macros to form program structures shows this kind of thing where going back to the head of a loop does not require a label. It can be like BEGIN...UNTIL, BEGIN...WHILE...REPEAT, FOR_X...NEXT_X, etc., using indentation to make the structure clear, and not needing labels.
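A counted loop under such macros might read like this sketch (the macro names here are only suggestive of the style, not copied from the article):

Code:
        LDX  #10
        BEGIN
           LDA  data,X   ; loop body, using the counter directly
           STA  dest,X
           DEX
        UNTIL_ZERO       ; assembles the branch back to BEGIN, taken until Z is set

The macros assemble exactly the DEX/BNE you would write by hand; only the label disappears.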

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 28, 2018 1:57 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
barrym95838 wrote:
You might end up encountering some difficulties assembling those stz (_sp),y instructions :wink:

Mike B.

Oops. Good point.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 28, 2018 11:58 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
If something can be done both ways, the stack method will often be slower. Its benefit comes partly from being re-entrant (including recursive; the title of this topic is "Reentrant Programming [...]"), and partly from reducing the number of variables needed, variables which may even run you out of ZP space otherwise.
As long as you have stack space to hold copies of the zero page you are using temporarily, using zero page memory is also re-entrant and permits recursion. If you leave arguments on the stack and only use zero page to hold temporary calculations, wouldn't you use the same amount of memory as doing the same job with stack space? Here is what I have done in the past:

Code:
main:
        LDA  #10
        JSR  count_down  ; Recurse down to 1
        ...

count_down:
        LDX  $00         ; Save a copy of $00
        PHX
        STA  $00         ; Do something useful with $00
        STA  screen
        DEA              ; Count down (from 10)
        BEQ  _zero
        JSR  count_down  ; Recurse
_zero:
        LDA  $00         ; Do more useful stuff with $00
        STA  peripheral
        PLX              ; Restore calling function's $00
        STX  $00
        RTS
The temporary usage of $00 here is not necessary but it shows that you can allocate temporary storage in zero page and still use recursion. Also, it doesn't take up more stack space than using the stack for temporary storage.

Quote:
If you're doing thing by stacks, your X will be the stack pointer almost full time, and you don't keep swapping it out to use for something else like your LDX/STX data_sp / x-copy.
You do have to do it at least once through each loop for the DEX/BNE if you are using X as your loop counter, though that may happen only once per loop. Is there a faster way to loop than using X when you not only need to do something a certain number of times but also need to use the value of the counter each time through the loop? Too bad there is no LDA Address, (zp, X) to replace LDA Address, X.

Quote:
LDX x_copy ;3 cy if in ZP You'd probably also have the desired value on the stack, so it's just LDA ZP,X.
Perhaps, but you still have to copy X to the stack somehow after you do DEX (unless you are looping a different way like I mentioned above. DEC Counter, X in that case?). It does seem like keeping the counter variable (updated only once per loop) on the stack would be faster than LDX x_copy/DEX/STX/BNE.
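For the once-per-iteration update, keeping the counter in a cell on the ZP data stack might look like this (the counter's offset of 2 from X is an assumed stack layout):

Code:
loop:
        LDA  2,X         ; read the counter's current value without disturbing X
        STA  peripheral
        <other actions>
        DEC  2,X         ; count down in place on the data stack
        BNE  loop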

Quote:
Fortunately it's also not very common.
Agreed. Seems like worst case scenario.

Quote:
True; but if avoiding the stack means you have to do more processing elsewhere to find a suitable place to put the data where it won't step on other data you still need, that will take extra time.
Right, but the difference is that the optimizer/assembler, not the program, takes longer to figure that out. It will not slow down run-time performance, only the compile time (hopefully by less than a second).

Quote:
If you further need it re-entrant (again the title of the topic), you're sunk.
Pushing zero page addresses is just as re-entrant as using the stack, and it also permits recursion. If a function uses $00-$03 in zero page, just push those bytes to a/the stack every time you enter the function and restore them when you exit. It is similar to pushing registers. The added benefit is that you can spot, yourself or with a macro/optimizer, when they don't need to be pushed before the function begins, and skip the push/restore sequence. Don't HLLs often assign register use in a smart way and only push the ones a function will need before jumping? I'm thinking of zero page as 256 registers.

Quote:
It's very bug-prone. I documented heavily, and still wound up with some real head-scratchers.
Exactly. This is why I think a program to figure out what goes where would be a great help.

Quote:
The assembler would also have to see if any intervening called subroutines affect it, and if not, if any of the ones they call affect it, etc., however many levels deep it might go.
This would also be absolutely necessary anyway to let the program assign zero page memory for temporary usage. I think it makes sense to do both things at the same time.

Quote:
No, but enso started a topic "A sensible macro engine" where he proposes and explores the idea of a macro pre-processor that could then be used with various assemblers including ones with no macro capability.
I got it to preprocess only with Macroassembler AS using arguments -P +G. I'm not sure how good it is as an assembler but it claims to support 65xx (and it was the first Google hit).

Quote:
My article on using macros to form program structures shows this kind of thing where going back to the head of a loop does not require a label. It can be like BEGIN...UNTIL, BEGIN...WHILE...REPEAT, FOR_X...NEXT_X, etc., using indentation to make the structure clear, and not needing labels.
My idea was that you could use any macro you wanted and the program would only work on the assembly that that generates. Otherwise, you would be locked in to supporting certain structures, which I would like to avoid. I think it would be neat if it relied on as little abstraction as possible. I think you would only need "byte" and "word" to reserve temporary zp, "call" and maybe "function," with all of those being optional and mixable with everything else like JSR.
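A source fragment using such minimal annotations might look like this (every directive name here is hypothetical, just to illustrate the idea):

Code:
func1:  function         ; marks a function entry, as opposed to a loop head
        byte  tmp1       ; request a temporary ZP byte for this function
        word  ptr1       ; request a temporary ZP pointer
        <body using tmp1 and ptr1>
        call  func2      ; like JSR, but visible to the analyzer
        RTS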


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 29, 2018 4:09 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8461
Location: Southern California
Druzyek wrote:
Quote:
If something can be done both ways, the stack method will often be slower. Its benefit comes partly from being re-entrant (including recursive; the title of this topic is "Reentrant Programming [...]"), and partly from reducing the number of variables needed, variables which may even run you out of ZP space otherwise.
As long as you have stack space to hold copies of the zero page you are using temporarily, using zero page memory is also re-entrant and permits recursion. If you leave arguments on the stack and only use zero page to hold temporary calculations, wouldn't you use the same amount of memory as doing the same job with stack space? Here is what I have done in the past:

Code:
main:
        LDA  #10
        JSR  count_down  ; Recurse down to 1
        ...

count_down:
        LDX  $00         ; Save a copy of $00
        PHX
        STA  $00         ; Do something useful with $00
        STA  screen
        DEA              ; Count down (from 10)
        BEQ  _zero
        JSR  count_down  ; Recurse
_zero:
        LDA  $00         ; Do more useful stuff with $00
        STA  peripheral
        PLX              ; Restore calling function's $00
        STX  $00
        RTS
The temporary usage of $00 here is not necessary but it shows that you can allocate temporary storage in zero page and still use recursion. Also, it doesn't take up more stack space than using the stack for temporary storage.

What you're suggesting is the first point after the heading "A couple of locals methods:" in the Local variables & environments chapter of the 6502 stacks treatise. I recommend going through the whole treatise.

Quote:
Quote:
If you're doing things by stacks, your X will be the stack pointer almost full time, and you don't keep swapping it out to use for something else like your LDX/STX data_sp / x-copy.
You do have to do it at least once through each loop for the DEX/BNE if you are using X as your loop counter, though that may happen only once per loop. Is there a faster way to loop than using X when you not only need to do something a certain number of times but also need to use the value of the counter each time through the loop? Too bad there is no LDA Address, (zp, X) to replace LDA Address, X.

For direct indexed operations (ie, not indirect), Y can be used for most of the same stuff. You can always use a variable as a counter too, doing DEC <var>, BNE. I'm not sure what you mean about there not being an LDA (ZP,X). It's op code $A1, used constantly in Forth.

No one method will be best for all situations. The point is that when you add tools to your toolbox, you open up more possibilities of what you can do.


Quote:
Quote:
True; but if avoiding the stack means you have to do more processing elsewhere to find a suitable place to put the data where it won't step on other data you still need, that will take extra time.
Right but the difference is the optimizer/assembler will take longer to figure that out than not optimizing. It will not slow down performance, only the compile time (hopefully by less than a second).

You have to leave room for varying conditions that may come up during run time. There may be varying sizes of data which cannot be anticipated at assembly time, or combinations of branches based on conditions that also cannot be anticipated at assembly time. There is of course the hardware stack you can save things on and work in to some extent (more in the 65816 because of its stack-relative addressing modes), and a ZP data stack, and you can have other stacks as well, in higher pages.

A way you could avoid some use of stacks is to allocate buffers. These have their own advantages and disadvantages. It's just another tool. One advantage is that they can be bigger than a one-page stack allows—even tens of KB. Another is that they don't have to be a last-in-first-out kind of thing; IOW, you can delete (or even resize) allocated buffers that are not at the end. A disadvantage is that it's generally slower, especially if you have to do some memory-moving to prevent fragmentation.

An example buffer usage would be to hold a symbol table during assembly on the 6502 computer itself. The table remains for the duration of the assembly process, then gets deleted, possibly after being stored on mass storage. The speed of the assembler is not as important as getting the flexibility you need. The resulting assembled code can be fast; but the onboard assembler that translates the assembly language into machine language doesn't need to be that fast.

I recently wrote a set of words to allocate, resize, and delete buffers which would never fragment memory. I was a bit disappointed at how much code it took to do it. The code itself could sure benefit in clarity from a method I have in mind to implement named local variables (including local arrays, which ANS Forth doesn't provide for) in Forth, which again uses a buffer. Chicken and egg, but it should be doable. Although the existing code needed almost no debugging to get it going, I still have never had this degree of stack gymnastics to deal with. The local variables would clear the air. I may also re-write the material in assembly someday.


Quote:
Quote:
My article on using macros to form program structures shows this kind of thing where going back to the head of a loop does not require a label. It can be like BEGIN...UNTIL, BEGIN...WHILE...REPEAT, FOR_X...NEXT_X, etc., using indentation to make the structure clear, and not needing labels.
My idea was that you could use any macro you wanted and the program would only work on the assembly that that generates. Otherwise, you would be locked in to supporting certain structures, which I would like to avoid. I think it would be neat if it relied on as little abstraction as possible. I think you would only need "byte" and "word" to reserve temporary zp, "call" and maybe "function," with all of those being optional and mixable with everything else like JSR.

That's pretty much what the proposed method allows. Unused macros of course don't take any memory at all in the target system, and I've had 50 lines or more of a macro that only assembled a couple of bytes of machine code because that's what the conditions called for at the time, and the conditional assembly in the macro complied with those conditions. In any case, you're never locked into any set of macros, as you can always modify them or make your own. The ones I present are debugged for the assemblers I use, and I have used the PIC16 set quite extensively. For the program structures, you could have for example
Code:
        CMP  #14
        IF_EQ
           <actions>
           <actions>
           <actions>
        ELSE_
           <actions>
           <actions>
           <actions>
        END_IF

and it would assemble exactly what you would do by hand, but it's more clear. It's also nestable. It will reduce bugs too, because you can spot them more easily before you even try assembling.

OTOH if you're talking about extending into using macros for forming nestable data structures and hiding some of the stack-type details, I'm interested, and if I ever catch up, I'll implement that too, and then expand the stacks treatise to cover it. :D

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 29, 2018 5:11 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
What you're suggesting is the first point after the heading "A couple of locals methods:" in the Local variables & environments chapter of the 6502 stacks treatise. I recommend going through the whole treatise.
I skimmed it a while ago. You have a lot of good stuff in there. I like your idea of mixing different methods depending on the situation. This is another area where an optimizer could do a lot of good by comparing the penalty of pushing zero page versus X indexing for you (I would not want to have to figure this out by hand).

Quote:
Quote:
Too bad there is no LDA Address, (zp, X) to replace LDA Address, X
For direct indexed operations (ie, not indirect), Y can be used for most of the same stuff. You can always use a variable as a counter too, doing DEC <var>, BNE. I'm not sure what you mean about there not being an LDA (ZP,X). It's op code $A1, used constantly in Forth.
Woops, that was my lame attempt at a joke. :oops: You can do LDX zp, X then LDA Address, X but not combined in one step like LDA Address, (zp, X).

Quote:
You have to leave room for varying conditions that may come up during run time. There may be varying sizes of data which cannot be anticipated at assembly time, or combinations of branches based on conditions that also cannot be anticipated at assembly time.
Right, you can't pre-compute everything, but why not look for things that you can? You can also look for parts of code that your program will never reach.

Quote:
In any case, you're never locked into any set of macros, as you can always modify them or make your own.
I just mean that if I worked on a program to optimize assembly, I would definitely be against adding any sort of keywords you have to use. I like the idea of keeping it 100% compatible. Otherwise, you start to drift towards your own language, which I would not want to do.

Quote:
It will reduce bugs too, because you can spot them more easily before you even try assembling.
Good point.

Quote:
OTOH if you're talking about extending into using macros for forming nestable data structures and hiding some of the stack-type details, I'm interested, and if I ever catch up, I'll implement that too, and then expand the stacks treatise to cover it. :D
That could be interesting too but all of the stuff I'm thinking about seems to be well beyond what you could do with a macro.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 29, 2018 7:55 pm 
Offline

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
Druzyek wrote:
Exactly. This is why I think a program to figure out what goes where would be a great help.

They call programs that do this "compilers".

I don't think it would be a real stretch to write a "high level assembler" that has Garth-like iteration and control structures with higher level routine management, and the compiler can do some of these optimizations of allocating zero page work variables based on graph and data flow analysis.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 30, 2018 3:15 pm 
Offline

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 126
Location: Lake Tahoe
whartung wrote:
Druzyek wrote:
Exactly. This is why I think a program to figure out what goes where would be a great help.

They call programs that do this "compilers".

I don't think it would be a real stretch to write a "high level assembler" that has Garth-like iteration and control structures with higher level routine management, and the compiler can do some of these optimizations of allocating zero page work variables based on graph and data flow analysis.


This is exactly what prompted me to come up with PLASMA: Proto Language AsSembler for Apple. Too low level to be called a real compiler, but higher level than an assembler. PLASMA has grown a little beyond this over the years, but being able to do a little analysis can reap big rewards. One of the big benefits was to track the zero page evaluation stack during expression evaluation and adjust the stack offset instead of INX or DEX. Only with flow-of-control changes did X need to be reconciled. Caching TOS in A and Y helped a great deal. Local function variables can easily be managed as well.
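The offset-tracking idea can be sketched like this, with byte-sized cells for brevity (an illustration of the technique only, not PLASMA's actual output):

Code:
; Naive: every push and pop moves X at run time.
        DEX
        LDA  val2
        STA  0,X         ; push val2
        DEX
        LDA  val1
        STA  0,X         ; push val1
        LDA  0,X         ; evaluate val1 + val2
        CLC
        ADC  1,X
        INX              ; pop one cell
        STA  0,X         ; leave the sum on top
; Offset-tracked: the compiler knows where each value conceptually
; sits, so only the net stack motion reaches run time.
        LDA  val1
        CLC
        ADC  val2
        DEX              ; one DEX instead of DEX/DEX/INX
        STA  0,X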

Dave...


Top
 Profile  
Reply with quote  