PostPosted: Sun Apr 16, 2023 4:12 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8155
Location: Midwestern USA
Hugh Aguilar wrote:
The weirdest thing about the '816 was that it had 8-bit and 16-bit mode.

Those aren’t “modes.” They are register sizes. The 816 has two modes: emulation and native. In native mode, the accumulator and index registers may be set to eight- or 16-bit width.

As you note, that design was a direct result of Apple’s influence on Bill Mensch. A better arrangement would have been to use 16-bit opcodes for instructions that operate on 16-bit quantities, thus eliminating the primary purpose of REP and SEP.
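
For reference, a minimal generic sketch of that arrangement in ca65-style assembly (not tied to any particular system): enter native mode with XCE, then pick register widths with REP and SEP.

Code:
        clc
        xce                     ; E=0: leave emulation mode for native mode
        rep     #$30            ; M=0, X=0: 16-bit accumulator and index registers
        .a16
        .i16
        ; ... 16-bit work here ...
        sep     #$20            ; M=1: back to an 8-bit accumulator; X/Y stay 16-bit
        .a8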

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Sun Apr 16, 2023 7:18 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
(I think of the '816 as having 5 modes - it seems the simplest way, for me, to think about it.)

Hugh Aguilar wrote:
BigEd wrote:
A relatively major problem, I think, is the number of corner cases in the '816 behaviour...

What is a "corner case" in the '816 behavior?


I'm thinking of all the little details found in Bruce's major document


PostPosted: Sun Apr 16, 2023 7:44 am 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
BigEd wrote:
I'm thinking of all the little details found in Bruce's major document

That document/webpage has helped me so much while learning the 65816 (and still does).
One corner case/quirk that I found there was that JMP (abs) and JMP [abs] read their pointer from Bank 0, while JMP (abs,X) reads it from the current Program Bank instead.
It makes little sense to me, and I would honestly rather have the non-indexed versions also access the current Program Bank (or better, have all of them access the Data Bank like all other absolute addressing modes, so the jump table can be anywhere in memory regardless of where the program executes).

But hey, that's just how I see it. Maybe they had a reason to restrict the non-indexed indirect jumps to Bank 0.
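
To make the quirk concrete, a small sketch of the three forms side by side (hypothetical labels, WDC/ca65-style syntax; each line is a standalone alternative, not a code path):

Code:
        jmp     (vector)        ; 16-bit pointer fetched from Bank 0; stays in the current Program Bank
        jmp     [vector]        ; 24-bit pointer fetched from Bank 0; can land in any bank
        jmp     (table,x)       ; pointer fetched from the current Program Bank (X = word offset into "table")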


PostPosted: Sun Apr 16, 2023 7:56 am 

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 293
BigEd wrote:
(I think of the '816 as having 5 modes - it seems the simplest way, for me, to think about it.)

WDC wrote:
The Decimal (D), IRQ Disable (I), Memory/Accumulator (M), and Index (X) bits are used as mode select flags

E = 0 and E = 1 are described as modes a little further down. There are several equivalent ways of thinking about these bits, and none of them are wrong.

(I've never thought of I as a mode flag, but it does make sense)


PostPosted: Sun Apr 16, 2023 8:28 am 

Joined: Thu Mar 03, 2011 5:56 pm
Posts: 277
drogon wrote:
Hugh Aguilar wrote:
The W65c816 is a good target for a byte-code VM because all of the opcodes are one-byte, as compared to the more popular MC6811 and MC6809 that have multi-byte opcodes.


I wouldn't let the CPU's underlying opcodes be a reason for/against running a bytecode VM on it, however ...

I have a bytecode VM running on my '816 systems. I do not feel it's a good target for a bytecode engine host, for many reasons...

One - affecting the core function of fetching bytes - is that the '816 lacks an instruction to load an 8-bit byte into the 16-bit accumulator while zeroing the top 8 bits at the same time. So the core fetch code in my VM wastes a memory cycle loading 16 bits, then wastes another 3 cycles on a 16-bit AND instruction before it can use that value to index the jump table. This overhead adds 4 cycles to every single bytecode fetched, which may not seem like a lot, but it adds up.

You may think that dropping into 8-bit (memory) mode might help, but that just adds cycles going each way and it still doesn't zero the top 8 bits of the 16-bit accumulator.

Other reasons include dealing with the 64K banks of RAM, because my target high-level language (BCPL) wants a linear address space. But even with that, it does run well and has enabled me to do what I want in BCPL, including running the compiler on the hardware and writing a multi-tasking OS with it.

The only C I've used on the 65xx is cc65, which doesn't support the '816 - I've really no idea how the other compilers fare when dealing with data structures > 64KB in ways transparent to your code, or with things that compile to > 64KB of code. Then there's the issue of loading your program+data into all those 64K banks before you can even run it... I suspect that may be left as an exercise for the user... but if anyone has some examples (make a video?), I'd like to see them.

Cheers,

-Gordon


If you have 8-bit indexes and one index register free, you may be able to use a combination of LDX followed by TXA (or LDY followed by TYA)... If I understand correctly, TXA/TYA from an 8-bit index to a 16-bit accumulator should clear the upper 8 bits of the accumulator.
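
If that's right, the sequence might look something like this (a sketch assuming the accumulator stays 16-bit while the index registers are temporarily switched to 8-bit; "byteval" is a hypothetical operand):

Code:
        sep     #$10            ; X=1: 8-bit index registers, accumulator stays 16-bit
        .i8
        ldx     byteval         ; load one byte into X
        txa                     ; low byte of A = X, high byte of A cleared to zero
        rep     #$10            ; X=0: back to 16-bit index registers
        .i16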


PostPosted: Sun Apr 16, 2023 4:43 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
rwiker wrote:

If you have 8-bit indexes and one index register free, you may be able to use a combination of LDX followed by TXA (or LDY followed by TYA)... If I understand correctly, TXA/TYA from an 8-bit index to a 16-bit accumulator should clear the upper 8 bits of the accumulator.


(Apologies for more thread derailment)

It's a bytecode interpreter, and in this particular bytecode there are 255 instructions defined. So the index could be 8-bit, but it would then take two fetches to get the 16-bit address of the target code for that particular instruction.

For the most part the system runs in native mode with 16-bit registers/memory. The dispatch code, which is executed in-line for almost all the bytecode instructions, takes 29 cycles most of the time and 37 cycles when it has to increment the top 16 bits of the 32-bit program counter. (I know we only have a 24-bit real PC, but that's not an issue here - it's also a rare thing to happen, but it does happen from time to time with large programs.)

All registers/modes/widths/whatever are free on entry, but the targets expect registers and memory to be in 16-bit mode. Additionally, all '816 executable code and data lives in Bank 0. It's entirely possible I'm missing a trick or two, but I really don't know.

Have fun...

Code:
        lda     [regPC]         ; Load 16-bit value                             (7)
        and     #$00FF          ; We only want 8-bits...                        (3)
        asl                     ; Double for indexing in 16-bit wide jump table (2)
        tax                     ;                                               (2)

; Increment the PC

        inc     regPC+0         ; Low word                                      (7)
        beq     incH1           ; 2 cycles + 1 when branch taken                (2)
        jmp     (opcodesLo,x)   ;                                               (6) = 29

incH1:  inc     regPC+2         ; Top word                                      (7)
        jmp     (opcodesLo,x)   ;                                               (6) = 37



Numbers in () are the opcode cycle counts. Saving 4 cycles from the LDA and the AND would be nice. I've been through all the scenarios of delaying the PC increment (by pushing it into the target opcode's handler), which might make jump instructions go faster, but the compiler sometimes tries to arrange things so that the jumps don't happen at all (by negating the test and using the 'opposite' conditional jump), so the gains there might be less than the effort required.

In practice it's not that bad on my 16MHz system, but there's always the "wouldn't it be nice if ..." :-)


-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Sun Apr 16, 2023 7:16 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
I mean, is that AND really that bad? Those few cycles will add up, sure, but it could be worse!

Look at my own bytecode VM for the 65816 (SW32VM), for example...

Because I'm dumb and wanted to make the whole thing relocatable (which it isn't anymore anyway), I didn't want to make use of indirect jumps; instead, instructions are chosen by just going through a chain of DEC A instructions, checking if A is 0 - if not, continue to the next; if it is, execute the selected instruction.

It's not fast at all, and it means instructions further down the chain take longer to execute by default. I really need to rewrite that someday.
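
Roughly like this, for anyone who hasn't seen that kind of dispatch before (a hypothetical fragment in ca65 syntax, not the actual SW32VM source):

Code:
        ; A holds the opcode number on entry
        dec     a               ; was it opcode 1?
        bne     :+
        jmp     do_op1          ; yes - run it (handler names hypothetical)
:       dec     a               ; opcode 2?
        bne     :+
        jmp     do_op2
:       dec     a               ; opcode 3? ...and so on down the chain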

But on the other hand, I'm also not really making use of indirect addressing in general. Because the index registers are 16 bits wide, I've opted to use X as the lower 16 bits of the PC, and the Data Bank Register as the upper 8 bits.
This means I load instructions by doing LDA a:$0000,X
and incrementing the PC is just INX followed by a BNE in case it rolled over and the Data Bank needs to be adjusted. Though I still keep a copy of the PC in memory in case X gets muddled.
In hindsight, it might've been much easier and faster to just use indirect addressing.
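
In code, that fetch comes out to something like the sketch below (hypothetical labels; the bank-adjustment step is only hinted at):

Code:
        ; X = low 16 bits of the VM PC, Data Bank Register = top 8 bits
nextOp: lda     a:$0000,x       ; fetch from the current data bank
        inx                     ; step the VM PC
        bne     decode          ; no 16-bit rollover, carry straight on
        ; rollover: bump the Data Bank Register here (push the new bank byte, then PLB)
decode:
        ; ... dispatch on the opcode now in A ...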

Again, the entire thing is ripe for an almost complete rewrite...
But maybe there is something in there that could help you, or make you realize that it probably isn't worth trying to hyper-optimize every last line of code...


PostPosted: Sun Apr 16, 2023 8:15 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
Proxy wrote:
I mean, is that AND really that bad? Those few cycles will add up, sure, but it could be worse!


You're right - it's not that bad, but like the extra cycle on the LDA, it sort of makes me feel a bit "Grr", thinking it should have been better. Not just that, but removing the AND will save 3 bytes - and yes, some will point out that I've said very recently that you should use that RAM, but in this case, for reasons of "architecture", I want my BCPL bytecode VM to live entirely inside a 16KB region of RAM. Right now I have about 100 bytes free... And because this code is in-lined with every one of those 255 opcodes (sometimes twice), saving 3 bytes here will save me at least 765 bytes overall, which I can use to speed something else up...

It's also annoying when some of the opcodes take fewer cycles than the dispatcher... Here is an example:

Code:
.proc   ccA1
        .a16
        .i16

        inc     regA+0          ; Add 1 - ie. increment. Special case to go faster
        beq     :+
        nextOpcode
:       inc     regA+2
        nextOpcode
.endproc


The "add 1" opcode is used in almost every single FOR loop...
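
(For context: nextOpcode here is a macro wrapping the in-line dispatch sequence posted above; roughly this, in ca65 terms - a sketch, not the exact Ruby source:)

Code:
.macro  nextOpcode
        .local  incHi
        lda     [regPC]         ; fetch the next bytecode (16 bits)
        and     #$00FF          ; keep only the low 8 bits
        asl                     ; double it: word index into the jump table
        tax
        inc     regPC+0         ; bump the VM program counter, low word
        beq     incHi
        jmp     (opcodesLo,x)
incHi:  inc     regPC+2         ; top word
        jmp     (opcodesLo,x)
.endmacro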


However, it did help me achieve my objective of running a high-level language on the '816 and also running the compiler directly on it, even if it is a shade slow. But how slow? It might be nice one day to run the benchmarks I recently ran under BASIC and BCPL with a C compiler for the '816 ...

Quote:
Look at my own bytecode VM for the 65816 (SW32VM), for example...

Because I'm dumb and wanted to make the whole thing relocatable (which it isn't anymore anyway), I didn't want to make use of indirect jumps; instead, instructions are chosen by just going through a chain of DEC A instructions, checking if A is 0 - if not, continue to the next; if it is, execute the selected instruction.

It's not fast at all, and it means instructions further down the chain take longer to execute by default. I really need to rewrite that someday.


That's quite something!


Quote:
But on the other hand, I'm also not really making use of indirect addressing in general. Because the index registers are 16 bits wide, I've opted to use X as the lower 16 bits of the PC, and the Data Bank Register as the upper 8 bits.
This means I load instructions by doing LDA a:$0000,X
and incrementing the PC is just INX followed by a BNE in case it rolled over and the Data Bank needs to be adjusted. Though I still keep a copy of the PC in memory in case X gets muddled.
In hindsight, it might've been much easier and faster to just use indirect addressing.

Again, the entire thing is ripe for an almost complete rewrite...
But maybe there is something in there that could help you, or make you realize that it probably isn't worth trying to hyper-optimize every last line of code...


I will have a deeper look, thanks. I have been guilty of the "premature optimisation" scenario too - details here: https://projects.drogon.net/ruby816-premature-optimisation-and-all-that/

Trying to get back on-topic... One reason I went down the path of BCPL is that I think we have lost all the C compilers that once ran directly on the 6502 (and '816). And by lost I mean no public source code is available to let us port them to new systems. I used Aztec C on the Apple II back then. The Apple IIgs has a Pascal compiler and a C compiler (written in Pascal) that run on the system. There are a few that run on the BBC Micro, but what else?

We have to cross-compile now, which is not always what I want to do.

So BCPL lets me edit, compile and run programs directly on my system without using BASIC. If only it were a shade faster... ;-)

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Mon Apr 17, 2023 3:27 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
drogon wrote:
...why are you bothering with it at all? Just pick a nice CPU that's well known and understood, doesn't have the 64K bank shenanigans and use that instead. I can recommend RISC-V having worked with it recently - it's a joy to use. Several C compilers and easy to code for in assembly and FreeRTOS already going on it...

The RISC-V certainly has fanatic followers! More than one person has tried to sell me on this. I glanced over the specs for it and had the impression that it is a toy. Nobody has ever indicated any practical use for the bloated monster --- they just say that it is a "joy to use" (I have no idea what that means).

The #1 priority in micro-controllers is low interrupt latency. There is no indication that the RISC-V makes any attempt at this. My own processor has only a few 16-bit registers, and it stores these in shadow registers during an ISR (rather than push them to the return-stack). Exchanging multiple registers with their shadows is a single clock-cycle save or restore.

Note that, historically, the 6502 was primarily famous for having low interrupt latency. It did push the registers to the return-stack rather than exchange them with shadow registers, but there were only the 8-bit P, A and Y (the X wasn't usually needed). The (zp),Y addressing mode of the 6502 was great for implementing circular buffers of 256 bytes. The pointer to the buffer was in a zero-page variable, so it didn't need to be loaded into a register in the ISR, which means there was no pointer register to save/restore in the ISR. The (zp) addressing mode of the 65c02 was even better, because then the Y register didn't need to be saved/restored in the ISR either. The only register that needed to be saved and restored was the A register (the P and PC were saved/restored automatically). The 6502/65c02 would have been slightly improved if the A register got saved/restored automatically along with P and PC, but the PHA and PLA instructions were pretty fast, so this wasn't a big deal.
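
As a concrete illustration of that 65c02 case (a sketch - the device register and buffer names are hypothetical, and it assumes a page-aligned 256-byte buffer with its pointer in zero page):

Code:
irq:    pha                     ; A is the only register we have to save
        lda     ACIA_DATA       ; read the incoming byte from the device
        sta     (bufptr)        ; 65c02 (zp) mode - no Y register involved
        inc     bufptr          ; bump the pointer's low byte; it wraps within the 256-byte buffer
        pla
        rti                     ; P and PC come back automatically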

All in all, the 6502 seemed to be designed for micro-controllers. The fact that it is mostly famous for use in personal computers was due to Steve Wozniak choosing it for the Apple II, and he did that mostly because the only other choice was the Z80, which was horribly slow (https://map.grauw.nl/resources/z80instr.php). Notice that a lot of Z80 instructions take 19 or more clock cycles, and nothing takes less than 4 clock cycles. The Z80 does have more interrupts (the 6502 only had IRQ and NMI), but no effort was made to provide low interrupt latency.

From what I've seen, the RISC-V is to the 2020s what the Z80 was to the 1980s. Processors such as these seem to be designed by academic types who want a scaled-down mini-computer on their desktop so that it will be a "joy to use," but they have no understanding or concern for real-world issues such as low interrupt latency.


PostPosted: Mon Apr 17, 2023 4:37 am 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Where did you see that RISC-V has high interrupt latency?

AFAIK the Unprivileged Spec doesn't specify an interrupt system at all, so you can just use something custom (like shadow registers, though they're pretty useless if you allow for nested interrupts).
And even though the Privileged Spec does have interrupts, the latency would be basically nothing, as no memory accesses ever occur during the interrupt sequence, because only internal registers get used.
And of course, during an ISR you would use as few registers as possible so you don't have to save and restore a lot of them.

Hugh Aguilar wrote:
Nobody has ever indicated any practical use for the bloated monster

Now you've lost me... RISC-V is designed to be simple and easy to understand, with optional extensions to increase functionality if desired. It's pretty much the opposite of bloated, as you only include what you need.
Most current RISC-V systems aim for modern functionality, so they use the more complex Privileged Spec for things like memory protection, multiple system modes, etc.
Of course such features could seem like "bloat" to people who've mostly worked with basic 8-bit systems. :wink:

Hugh Aguilar wrote:
they just say that it is a "joy to use" (I have no idea what that means).

Did you never write assembly for a specific ISA and think to yourself, "damn, this is nice to use"? That's exactly what they mean, myself included.
RISC-V has a very nice ISA to write assembly for: fully orthogonal (no special-purpose registers like on the 6502/Z80), no status register to worry about, and of course with macros you can build yourself powerful pseudo-instructions.


PostPosted: Mon Apr 17, 2023 6:01 am 

Joined: Mon Jan 19, 2004 12:49 pm
Posts: 664
Location: Potsdam, DE
Hugh Aguilar wrote:

All in all, the 6502 seemed to be designed for micro-controllers. The fact that it is mostly famous for use in personal computers was due to Steve Wozniak choosing it for the Apple II, and he did that mostly because the only other choice was the Z80, which was horribly slow (https://map.grauw.nl/resources/z80instr.php). Notice that a lot of Z80 instructions take 19 or more clock cycles, and nothing takes less than 4 clock cycles. The Z80 does have more interrupts (the 6502 only had IRQ and NMI), but no effort was made to provide low interrupt latency.


Remember though that the Z80 often had a faster clock than the 6502 - 4MHz was common, I think.

Though I do recall an informal side-by-side comparison of a 4MHz Nascom Z80 system against my 0.75MHz Tangerine 6502 system, both running MS BASIC (32-bit floats for the Nascom, 40-bit floats for the Tangerine), which had the 6502 coming out slightly ahead.

(Personally I was never a fan of the Z80 instruction set; I like the 8080/85 set with which the Z80 is backwards compatible, but the extra Z80 instructions - with the possible exception of the register exchange, which was handy - never seemed to do quite what I wanted them to do. A couple of employers ago, I had to rewrite someone else's non-standard floating point code to give better precision and a more standard format, on a Z80. It made my head hurt, but at least it was a change from PIC assembly...)

Neil


PostPosted: Mon Apr 17, 2023 7:43 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Indeed, you simply can't compare clock speeds of the Z80 and the 6502 - you need to count memory accesses. There's a difference in architecture.

I'm interested to know if RISC-V has, or could have, the sort of shadow registers which ARM has, allowing some few private registers for each level of interrupt. That saves on stack churn.


PostPosted: Mon Apr 17, 2023 9:18 am 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Well as said, the unprivileged ISA doesn't specify an interrupt system, so you can basically do whatever you want as long as it doesn't interfere with the standard.
So you could implement shadow registers if you wanted to, just none of the existing compilers would be able to make use of them.


PostPosted: Mon Apr 17, 2023 9:22 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
the sort of shadow registers which ARM has

The shadow registers were a feature of the older ARM architectures. Modern ARM Cortex has a mechanism where a number of registers (the "caller saved") are automatically pushed on the stack so that the ISR can be handled in a normal C function.

As far as latency is concerned, the ARM Cortex takes 12 cycles to save the registers, and automatically jump to the correct ISR. If a 2nd interrupt is pending when the first is finished, the registers are left on the stack, reducing latency. Since you already have a number of free registers, you can usually handle the entire ISR in fewer cycles than a 6502. This is especially true if multiple interrupt sources are used with different priorities.


PostPosted: Tue Apr 18, 2023 7:14 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Interesting - thanks! (Link for more info.)

