6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Apr 27, 2024 4:22 pm

All times are UTC




Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Tue Apr 02, 2024 1:30 pm 
Offline

Joined: Tue Sep 26, 2023 11:09 am
Posts: 51
For obscure reasons (background below), I got interested in this problem:

We need a relocatable 256 byte 65c02 code sequence where I can safely JSR to any of the 256 locations and have the IP eventually advance to the final location (all roads lead to `rome` ...), without changing any memory location nor the X register.

A naive solution is to fill the 256 locations with NOP ($EA), but that requires an average of 2*128 cycles to get from a random landing point to the end of the region. i.e.

Code:
rome - 256:
    nop
    nop
    ...
rome:


What's the best average number of cycles you can come up with? What if you can't use 65c02 opcodes? I have a version with a ladder of BRA instructions using offsets that are also safe landing instructions which averages about 10-12 cycles.

Background: thinking about more compact literals in TaliForth, which are normally coded as JSR literal_runtime / lsb / msb, i.e. 5 bytes per literal. For byte-sized values, we could instead use JSR (rome - value), capturing the literal value in the JSR address (so 3 bytes instead of 5) which we can peek via the return address on the stack. I don't necessarily think it's a very useful solution but is amusing to think about.


Top
 Profile  
Reply with quote  
PostPosted: Tue Apr 02, 2024 3:30 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Some mixture of BVC and BVS should, I think, get close to your BRA example.


Top
 Profile  
Reply with quote  
PostPosted: Tue Apr 02, 2024 3:42 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA
pdragon wrote:
I have a version with a ladder of BRA instructions using offsets that are also safe landing instructions which averages about 10-12 cycles.
That sounds optimal, or very close to it.
Quote:
Background: thinking about more compact literals in TaliForth, which are normally coded as JSR literal_runtime / lsb / msb, i.e. 5 bytes per literal. For byte-sized values, we could instead use JSR (rome - value), capturing the literal value in the JSR address (so 3 bytes instead of 5) which we can peek via the return address on the stack. I don't necessarily think it's a very useful solution but is amusing to think about.
If I understand the situation properly, the return address on the stack will be conveying information about the address of the LIT, not the value of the literal you want to push.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Tue Apr 02, 2024 4:27 pm 
Offline

Joined: Tue Sep 26, 2023 11:09 am
Posts: 51
right, after JSR (rome - value) the return stack has the address of MSB, so one byte earlier is the LSB from which we could infer value


Top
 Profile  
Reply with quote  
PostPosted: Tue Apr 02, 2024 8:49 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
I am a little lost here since if you already know the value, why would you need to infer it?

But here is an ideaa. Fill the entire 256 byte page with RTS ($60). Then any location can be caluclated with:

JSR location

...

TSX
LDY $00FF,X
LDA $0100,X

and Y becomes the value if on a page boundary, or if not, then subtract Rome lo-byte from this Y value (lo-byte) to get the infered value. It is assuming that Rome is always at the end of the 256 byte page.


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 03, 2024 2:38 pm 
Offline

Joined: Tue Sep 26, 2023 11:09 am
Posts: 51
i should have explained the context a bit more. It arises when compiling forth code like "3 2 + .", which results in something like this:

Code:
    jsr literal
    .byte $3, $0
    jsr literal
    .byte $2, $0
    jsr plus
    ...


so each literal compiles to 5 bytes (where the `literal` routine uses the caller's return address to grab the 16-bit word and push it on the forth stack, returning 2 bytes further).
Many literals are byte sized, so have been looking for a scheme that would use less space to compile small literals. Obviously a separate `literal_byte` could save one byte by knowing to extract only one trailing byte. The `rome` idea is another probably impractical idea. it would replace `literal` and use the caller's jsr address to infer the literal, so use only 3 bytes per compiled literal, but 'wastes' 256 bytes for the landing pad. Another possibility would be to hijack BRK and use the isr to extract trailing data byte(s). But that seems extreme. other suggestions welcome!


Last edited by pdragon on Wed Apr 03, 2024 3:13 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 03, 2024 3:10 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 387
Location: Minnesota
If I understand what you're trying to do, your proposed solution is very general, capable of handing a lot of small values. It might be better to figure out just which ones are likely to be used. Zero and one, perhaps minus one and two as well. Making each one into its own subroutine might save space.

LDA #immediate, PSH, LDA #$00, PHA, RTS (or reverse the order if the high byte should be first) is seven bytes. In 256 bytes there's enough space for 37 routines like that.

Clever coding might get you even more. How many are really common enough to be worth it?


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 03, 2024 3:14 pm 
Offline

Joined: Tue Sep 26, 2023 11:09 am
Posts: 51
yup, that's true, and the current approach has dedicated words for zero, one, and two iirc.


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 04, 2024 2:09 am 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
The Forth I use has a CLIT which is very easy to implement. The routine fills in the hi-byte with a zero for you. And the compiled code is JSR CLIT followed by one byte instead of 2.

Aother idea is create math routines that treats the literal as two single 8-bit bytes instead of a 16-bit word. It would basically compile to:

Code:
JSR LITERAL
.byte 0A 0B
JSR PLUS8B


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 04, 2024 8:59 pm 
Offline

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 122
Location: Lake Tahoe
Not a general solution for all 8 bit values, but for 64 values, one could fill the 256 bytes with:
Code:
LIT64BASE:
    LDA #1
    BNE +126 ; FIRST 32 ENTRIES MUST USE BRANCH TRAMPOLINE TO GET TO ROME
    LDA #2
    BNE +126
    ...
    LDA #62
    BNE ROME
    LDA #63
    BNE ROME
    LDA #64
ROME:
    STA STACK,X
    INX
    LDA #0
    STA STACK,X
    INX
    RTS


Then call as:

Code:
    JSR LIT64BASE + (literal - 1) * 4


Four bytes per literal in the 256 page. Note that my branch trampoline offset for the first 32 entries may be off a bit, I'm thumb nailing this as I go along ;-) Also expecting there to be a specific word for ZERO.

I kind of did this for PLASMA where the first 32 byte codes represent a constant nibble value (no odd byte codes, so had to double it)

Oh, and the Pascal p-code interpreter does this for the first 128 byte codes (constant 0 through 127). The Java VM does something similar for constants -1 to 5.


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 04, 2024 10:41 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3346
Location: Ontario, Canada
pdragon wrote:
Another possibility would be to hijack BRK [...]. But that seems extreme. other suggestions welcome!
If you're soliciting extreme suggestions then I am game for the challenge! :twisted:

What I'll describe is a novel encoding for CLIT (and it's also adaptable for LIT). It's very small and very fast, but there's a unusual requirement which, depending on circumstances, may or may not be deemed acceptable.

Usage is like this...
Code:
 jsr cliteral
 .byte $3
 ...

... and cliteral looks like this.
Code:
cliteral:
 pla
 sta n      ;n and n+1 are a pair of scratch bytes in z-page
 pla
 sta n+1    ;the z-pg pair at n points to the third byte of the JSR instruction

 dex        ;make space for a 16-bit value
 dex
 stz 1,x    ;msb of the value is zero
 ldy# 1
 lda (n),y  ;lsb of the value is found 1 past the third byte of the JSR instruction
 sta 0,x 

 jmp (n)    ;sic
So, yeah... There's no RTS! :lol:

Instead, execution jumps to the last byte of the JSR instruction, which of course is its ADH operand... except now it gets digested as an opcode. :shock:

And most of you already know this next trick. Certain two-byte instructions opcodes such as CMP immediate -- whose opcode is $C9 -- can be used in place of a BRA when the goal is to skip forward by one byte only. In this case, and using $C9 as an example, the last byte of the JSR instruction is arranged to be $C9, and the CPU parses the $C9 and the one-byte inline parameter as a 2-byte, CMP# instruction. To us, the CMP functions as a NOP. Its only purpose it is to harmlessly guide the CPU to the code that follows the 2-byte CMP# instruction. This all works wonderfully smoothly, but we need cliteral to be mapped to an address whose high-byte is equal to a suitable opcode.

Luckily there are plenty of candidates! CMP#, CPX#, CPY# and BIT# are acceptable because Tali won't mind if the flags get altered. And AFAIK Tali can also tolerate LDY#; likewise LDA#, ADC#, SBC#, AND#, ORA# and EOR#.

And, for 65C02, the opcodes $02, $22, $42, $62, $82, $C2 and $E2 are also acceptable, because they are all 2-byte, 2-cycle NOPs. (The 65C02 has other intriguing potential in regard to NOPs; I wrote about that subject here.)

None of the 2-byte instructions mentioned so far will access memory (other than the 2 cycles accessing memory at PC). But at the cost of one or more extra cycles, a bunch of additional candidate 2-byte instructions become available, such as BIT z-pg, LDA z-pg,x and CMP (ind),y. Just be aware that the extra memory accesses can be dangerous if there's I/O in the system whose state may get altered by an unexpected read.

Finally, opcodes for certain three-byte instructions can be used in place of a BRA when the goal is to skip forward by two bytes. This means the inline value after the JSR could be increased to 16-bit (as required by LIT). And, as above, there's a read that occurs which adds delay and could touch I/O.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Fri Apr 05, 2024 3:35 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 05, 2024 3:28 pm 
Offline

Joined: Tue Sep 26, 2023 11:09 am
Posts: 51
this is a nice way to avoid manipulating the return address to make cliteral faster.

in my use case I care more about space than time, so another approach I've been fooling with is to prepend
`jsr literal` before the definition of certain compiled words that are often preceded by a literal, e.g. "+".
When compiling "1 2 +" the compiler would normally generate
Code:
  jsr literal
  .word 1 
  jsr literal
  .word 2
  jsr xt_plus

but if we've defined plus as
Code:
   jsr literal
xt_plus:
   ...
z_plus:

then it's easy for the compiler to notice that it's just emitted a literal when it's ready to emit "+", and could rewind to instead write:
Code:
  jsr literal
  .word 1 
  jsr xt_plus - 3  ; slurp the literal and then run plus normally
  .word 2


this costs a few bytes per targeted word but often saves 3 bytes (or 4 if you handle cliteral as well). this strategy folds any literal into a subset of words, rather than optimizing a subset of literals for any word. in the code base I looked at you only need to tag a few dozen words with the prefix to fold about 80% of literals away.


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 05, 2024 3:43 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA
pdragon wrote:
in my use case I care more about space than time ...

TaliForth is very nice, but if you're really all in for the highest code density you're gonna have a hard time beating Charlie's DTC PETTIL, even though he limits himself to the NMOS instruction set. It's no slouch in the speed department either.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 05, 2024 6:03 pm 
Offline

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 122
Location: Lake Tahoe
Take 2:
You could always fill the page with $4C. At location $4C4C, put a JMP ROME there. Probably pretty hard to stick a jump like that in the middle of the address space like that.

However, I think there are better ways than trying to encode the literal into the JSR address.


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 05, 2024 9:06 pm 
Offline
User avatar

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1398
Location: Scotland
pdragon wrote:
this is a nice way to avoid manipulating the return address to make cliteral faster.

in my use case I care more about space than time, so another approach I've been fooling with is to prepend
`jsr literal` before the definition of certain compiled words that are often preceded by a literal, e.g. "+".
When compiling "1 2 +" the compiler would normally generate


Bytecode VM?

the one I use would compile that into 2 bytes:

Code:
L1                         Reg A := 1
A2                         Reg A := Reg A + 2


Maybe in your case a set of JSRs to load small constants to the stack might help? This one has instructions to load 0 through 10 and minus 1, and add 1 through 5. (as well as a few subtracts. 1 through 4) although it's not Forth.

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 26 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: