Improving the 6502, some ideas

John West · Post by **John West** » Mon Aug 10, 2009 12:54 pm

BigEd wrote:

The start of this thread was about ideas for improving the 6502, but it became a thread about implementing an improved 6502, which is hard. On the positive side, several forum members have in fact done it (at least Ruud, John West, Rob Finch, Sprow: apologies to others I've missed)

I have to protest this. I never implemented anything, just followed through an idea for extending the 6502 to 32 bits (it was very much like your "a 6502 where every byte is 16bits"). I never implemented it.

I've been keeping out of this because while ideas are easy, implementing them is not. Anyone doing an implementation will probably want to work on their own ideas, not mine. If mine are ever to get implemented, it'll be up to me to do it. Given the amount of work required and my usual attention span, I'm not even going to try.

So, to anyone who wants to see a bigger better 6502, I say this: Do it! Get yourself an FPGA development board and some software (I have Xilinx's Spartan 3A starter kit - the software is excellent). Or just a simulator. Find a lot of spare time. Be prepared to get a lot of things wrong. Make something.

kc5tja · Post by **kc5tja** » Mon Aug 10, 2009 3:18 pm

OwenS wrote:

You have a completely different notion of a read/write barrier to me (And GCC - and Intel in its Itanium ABI).

OK, it looks like we have yet another case of the industry f**king over everyone as a whole by once again overloading terms incompatibly.

I already linked to a paper which uses read- and write-barriers in a very, very different way than what's apparently used by the compiler industry.

I feel sorry for those who are writing compilers for a garbage collected language!!!

OwenS · Post by **OwenS** » Mon Aug 10, 2009 8:54 pm

kc5tja wrote:

I feel sorry for those who are writing compilers for a garbage collected language!!!

Me?

The GC design I'm working with doesn't use that kind of synchronization though; when a collection is ordered, it walks the stack to find pointers, and compacts the heap by rewriting them. Of course, the language I'm writing would best be described as a scripting language - though a fast one [Thanks to being built on LLVM's brilliant JIT] - where it's acceptable (for now) to use a stop-the-world collector.

fachat wrote:

Removing the need to snoop other processors' memory accesses greatly improves scalability, as snooping only scales IIRC to about 4 procs and quickly gets inefficient. Java programs using the memory model with the read/write barriers I described are capable of running on a lot more processors in parallel.

I believe AMD implement this using a complex messaging system over HyperTransport in their Opteron processors. It doesn't scale to thousand-CPU clusters, but it scales fine to ~8-chip machines. [I presume it scales better; I just don't know of any machines with a higher chip count.]

BigEd · Post by **BigEd** » Sat Mar 13, 2010 6:42 pm

John West wrote:

I never implemented anything, just followed through an idea ...

So, to anyone who wants to see a bigger better 6502, I say this: Do it! Get yourself an FPGA development board ... Be prepared to get a lot of things wrong. Make something.

Hi John
sorry for misrepresenting your project. I'm absolutely with your position on this.

VBR: I wonder, did you get what you wanted from this thread? Are you likely to try to implement something?

jt_eaton · Post by **jt_eaton** » Wed Jul 07, 2010 8:54 pm

What are the simplest, most useful improvements you could make, taking the WDC65C02 as a base?

1) Split the instruction space from the data space.

Run the program counter into a 64K rom that contains all opcodes and
instructions. It would not contain any vectors, tables, data or other non-
executables. Those would be in a separate 64K data space. This would
double your address space without having to bank and you could fetch
instructions during the same cycle that you are accessing data or the
stack.

2) Move the stack out of page 01 and into a LIFO. If all you need are
pushes and pulls then a dedicated lifo can do that without competing
for memory cycles. Plus you can size the stack depth for your
application and have hardware checks for overflow and underflow.

You couldn't do any sort of stack manipulations so this would only
be useful for smaller embedded designs.

3) Move page 00 off the memory bus and into the cpu. A direct connection
to this ram would take a lot of load off the main data bus. Plus you
could build it as a 16 bit memory so that pointer addresses stored on
even addresses could be read in one clock instead of two.
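
As a rough illustration of item 2, a dedicated LIFO with depth checks might behave like the Python model below. The class name `HardwareLifo`, the default depth of 16, and the use of exceptions to stand in for hardware fault signals are all assumptions made for this sketch, not part of the proposal.

```python
class HardwareLifo:
    """Behavioral model of a fixed-depth, push/pull-only stack (hypothetical)."""

    def __init__(self, depth=16):
        self.depth = depth      # sized per application, as item 2 suggests
        self.cells = []

    def push(self, byte):
        if len(self.cells) == self.depth:
            raise OverflowError("stack overflow")   # stands in for a hardware check
        self.cells.append(byte & 0xFF)              # 8-bit cells, as on a 65C02

    def pull(self):
        if not self.cells:
            raise IndexError("stack underflow")     # stands in for a hardware check
        return self.cells.pop()
```

Only `push` and `pull` exist, which mirrors the restriction noted in item 2: there is no way to index into the stack.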

John Eaton

barrym95838 · Post by **barrym95838** » Tue Aug 06, 2013 6:29 am

Hi all.

I'm new to the forum, but have been lurking for some time, and this is one of my favorite threads, because it is so thought-provoking. I decided to just pull the trigger and post it back near the top ... I hope that I don't upset anyone by doing so. I know that there are forums out there that frown on this, so I'm taking a chance ...

The idea of designing a more capable work-alike to the 6502 family is very appealing to me, if only from a "mental exercise" angle. My background is not in operating system design or hardware, just simple programs and algorithms as a hobby. The following excerpts are from an incomplete specification document that has gone through several revisions without actually being completed (kind of the story of my life).

My question to the group: Could I be on to something here, or am I barking up the wrong tree?


Proposed instruction set for the 65m32, by barrym95838.

I started with my understanding of Garth's proposal, and took it as far as I could, short of
  writing an actual simulator.
I borrowed ideas from the pdp-11, Nova, 6809, and most of all, the 6502.
Proposed additions, simplifications, and/or criticisms are welcome. 

Instruction bit format:  oooo ooaa ffff rrri iiii iiii iiii iiii

15 bits specify the operation, and 17 bits provide an 'inherent' constant that can be used to
  encode -65536 to 65535 without using a second word.

Addressing modes:

rrr  =     0       1       2       3       4       5       6       7
aa   =0   #,a     #,b     #,x     #,y     #,z     #,u     #,s     #,n
     =1   $,a     $,b     $,x     $,y     $,z     $,u     $,s     $,n
     =2   $,a+    $,b+    $,x+    $,y+    $,z+    $,u+    $,s+    $,n+
     =3   $,-a    $,-b    $,-x    $,-y    $,-z    $,-u    $,-s    $,-n

There are eight registers, plus p.  n is PC.  z is permanently hardwired to zero, meaning that
  '$,z' '$,z+' and '$,-z' are all equivalent.  # and $ come from ...i iiii iiii iiii iiii (the
  17-bit twos-complement number is extended to a full 32 bits before the operand calculation).

An example opcode matrix:


     x0    x1    x2    x3    x4    x5    x6    x7    x8    x9    xa    xb    xc    xd    xe    xf
   +-----------------------------------------------------------------------------------------------
0x | ora   ora   ora   ora   and   and   and   and   eor   eor   eor   eor   bit   bit   bit   bit
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
1x | adc   adc   adc   adc   add   add   add   add   sub   sub   sub   sub   sbc   sbc   sbc   sbc
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
2x | mul   mul   mul   mul   div   div   div   div   mod   mod   mod   mod   ???   ???   ???   ???
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
3x | ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
4x | lda   lda   lda   lda   ldb   ldb   ldb   ldb   ldx   ldx   ldx   ldx   ldy   ldy   ldy   ldy
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
5x | ldz   ldz   ldz   ldz   ldu   ldu   ldu   ldu   lds   lds   lds   lds   ldn   ldn   ldn   ldn
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
6x | hla   hla   hla   hla   hlb   hlb   hlb   hlb   hlx   hlx   hlx   hlx   hly   hly   hly   hly
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
7x | hlz   hlz   hlz   hlz   hlu   hlu   hlu   hlu   brk   brk   brk   brk   hln   hln   hln   hln
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
8x | sta   sta   sta   sta   stb   stb   stb   stb   stx   stx   stx   stx   sty   sty   sty   sty
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
9x | stz   stz   stz   stz   stu   stu   stu   stu   sts   sts   sts   sts   stn   stn   stn   stn
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
ax | asl   asl   asl   asl   rol   rol   rol   rol   ror   ror   ror   ror   lsr   lsr   lsr   lsr
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
bx | exa   exa   exa   exa   ???   ???   ???   ???   dec   dec   dec   dec   inc   inc   inc   inc
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
cx | cmp   cmp   cmp   cmp   cpb   cpb   cpb   cpb   cpx   cpx   cpx   cpx   cpy   cpy   cpy   cpy
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
dx | cpz   cpz   cpz   cpz   cpu   cpu   cpu   cpu   cps   cps   cps   cps   psh   psh   psh   psh
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
ex | ???   ???   ???   ???   adb   adb   adb   adb   rep   rep   rep   rep   sep   sep   sep   sep
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   |-----------------------------------------------------------------------------------------------
fx | ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???   ???
   | #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r  #,r   $,r   $,r+  $,-r
   +-----------------------------------------------------------------------------------------------
     x0    x1    x2    x3    x4    x5    x6    x7    x8    x9    xa    xb    xc    xd    xe    xf

Some simple instruction translations:


------------- Examples that translate with minimal effort -------------------
--------- 65816 ----------------        --------- 65m32 ------------
:a90080         lda  #32768             :40088000         lda  #32768 (*)

:a200f0         ldx  #-4096             :4809f000         ldx  #-4096 (*)

:ac9a78         ldy  $789a              :4d08789a         ldy  $789a  (*)

:e8             inx                     :48040001         ldx  #1,x

:88             dey                     :4c07ffff         ldy  #-1,y

:48             pha                     :830c0000         sta  ,-s

:68             pla                     :420c0000         lda  ,s+

:60             rts                     :5e0c0000         ldn  ,s+

:4c5476         jmp  $7654              :5c087654         ldn  #$7654 (*)

:6c5713         jmp  ($1357)            :5d081357         ldn  $1357  (*)

:8a             txa                     :40040000         lda  #,x

:205634         jsr  $3456              :7c083456         hln  #$3456 (*)

:d00d           bne  .+15               :5c6e000e         ldn  [ne],#14,n

:10ce           bpl  .-48               :5cafffcf         ldn  [pl],#-49,n

:5c563412       jml  $123456            :5e0e0000         ldn  ,n+
                                        :00123456         .dw  $123456

:22658709       jsl  $98765             :7e0e0000         hln  ,n+
                                        :00098765         .dw  $98765

:fcefcd         jsr  ($cdef,x)          :7d04cdef         hln  $cdef,x

(*):  There is a hidden ",z" within this instruction; this is assumed if
      no other index register is specified in the operand field.

Notes:  Binary code density obviously favors the 65816, but memory cycle
  counts obviously favor the wider-bus 65m32.
  These 65m32 assembly examples show the nuts-and-bolts for the sake of
  illustration, but an assembler could easily allow macros and/or aliases
  with more familiar mnemonics -- 'ldn  ,s+' <-> 'rts', 'ldn  [ne],# ...'
  <-> 'bne ...', etc.
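
The field layout can be checked against the hand-assembled words above with a small decoder. This is a sketch in Python; `decode` and the dictionary keys are names invented here, and the register order is read off the addressing-mode table (rrr = 0..7 for a, b, x, y, z, u, s, n). The constant is returned as the raw 17-bit field, without sign extension.

```python
REGISTERS = "abxyzusn"   # rrr = 0..7, read off the addressing-mode table

def decode(word):
    """Split a 32-bit word laid out as oooo ooaa ffff rrri iiii iiii iiii iiii."""
    return {
        "opcode": (word >> 24) & 0xFF,   # oooooo and aa together: row/column in the opcode matrix
        "aa": (word >> 24) & 0x3,        # mode: 0 '#,r', 1 '$,r', 2 '$,r+', 3 '$,-r'
        "ffff": (word >> 20) & 0xF,      # non-zero only in the bne/bpl examples above
        "reg": REGISTERS[(word >> 17) & 0x7],
        "imm17": word & 0x1FFFF,         # raw field, not yet sign-extended
    }

print(decode(0x4C07FFFF))   # dey  ->  ldy  #-1,y
print(decode(0x830C0000))   # pha  ->  sta  ,-s
```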

I am in the process of hand-translating FIGFORTH, and it looks like I'm able to do pretty much whatever I need to do with about 1/4 of the instructions required by the NMOS 6502, with the added benefit (?) that everything has been promoted to 32 bits.

What I believe is the true key to the 65m32's efficiency and ease of use is NOT its instruction repertoire, which is rather ordinary with few exceptions, but its flexible operand structure. Once one fully understands how this structure works, programming with it becomes natural and simple (at least for me). The way that it works is as follows:

ANY register except for the processor status register can be used as an index, including the accumulator and the instruction pointer.

There are two families of operand modes, immediate and absolute. The immediate mode is indicated by a leading # in the operand field, and means that the operand value is to be used at 'face-value'. The immediate value isn't just a static entity, though, because it is (with few exceptions) added to the contents of the specified index register (identified with a leading comma) before use. #1,x is always equal to the contents of register x, plus 1. To make the assembly language easier to type, I have specified that either the numeric part or the register name (but not both) can be omitted. A missing numeric is assumed to be 0, and a missing register name is assumed to be ,z.

There are three absolute modes; they are indicated by the absence of a leading # in the operand field, and always imply an additional memory access (read, write, or read-modify-write). This is because the operand value (which is calculated in the same manner as it is for immediate mode) is used as a pointer to main memory. Automatic post-increment and pre-decrement options for the indicated index register should be self-explanatory.
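
A behavioral sketch of the immediate and absolute operand modes described above, in Python. The `Machine` class and its method names are invented for illustration. The step size of one word for auto-increment and auto-decrement, and the ordering of the pre-decrement relative to the address calculation, are assumptions about a word-addressed machine rather than something the specification excerpt states.

```python
MASK32 = 0xFFFFFFFF

class Machine:
    """Toy model of the 65m32 operand adder (names are hypothetical)."""

    def __init__(self):
        self.regs = dict.fromkeys("abxyzusn", 0)
        self.mem = {}                      # sparse, word-addressed memory

    def read_reg(self, name):
        return 0 if name == "z" else self.regs[name]   # z is hardwired to zero

    def write_reg(self, name, value):
        if name != "z":
            self.regs[name] = value & MASK32

    def operand(self, aa, reg, imm=0):
        """Return the operand for mode aa (0 '#,r', 1 '$,r', 2 '$,r+', 3 '$,-r')."""
        if aa == 3:                                    # pre-decrement the index
            self.write_reg(reg, self.read_reg(reg) - 1)
        value = (imm + self.read_reg(reg)) & MASK32    # constant plus index
        if aa == 0:
            return value                               # immediate: used at face value
        if aa == 2:                                    # post-increment the index
            self.write_reg(reg, self.read_reg(reg) + 1)
        return self.mem.get(value, 0)                  # absolute: value is a pointer

m = Machine()
m.regs["x"] = 5
m.regs["s"] = 0x1FF
m.mem[0x1FF] = 42
print(m.operand(0, "x", 1))    # '#1,x' -> contents of x plus 1
print(m.operand(2, "s"))       # ',s+'  -> word at s, then s advances
```

Because `z` always reads as zero and ignores writes, the three absolute modes collapse into one when `z` is the index, as the specification notes.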

The 65m32 is 32-bits all-the-way, and technically EVERY instruction is a single word. Of course, most instructions require operand data to specify an immediate value or an address, and it is impossible to fit a 32-bit operand and an op-code into 32-bits.

One way that the 65m32 gets around the problem is by promoting an embedded 17-bit numeric operand to 32-bits before using it, by duplicating bit 16 in bits 17 to 31 before adding it to the index. But that only works most of the time, depending on what you're doing with the operand. -65536 ... 65535 is a respectable range that can be used for small increments, constants and offsets, but doesn't enable the 65m32's full potential.

The other way that the 65m32 gets around the problem is by treating the instruction pointer as just another index register. This allows in-lining a full 32-bit operand immediately after the instruction, and loading it using the instruction pointer in absolute addressing mode, with auto-increment (so the next op-code after the operand is executed next). The PDP-11 does this, and I think that it's quite elegant. When composing small (<64kW) programs, this technique is typically only needed for large constants, like bit-masks and such, since the inherent 17-bit constant provides plenty of reach for relative branch targets, increments, initializations, and more. While translating FigForth from 6502 to 65m32, I have so far only found two occasions in hundreds of instructions where this 'long-immediate' technique is necessary, and they were only necessary because of the four-char-per-word dictionary name storage convention that I've implemented.
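
The choice between the embedded constant and the in-line 'long-immediate' word can be sketched as follows. This is Python written for illustration; `sign_extend17` and `encode_ldn` are hypothetical helper names, and the two base words are taken from the hand-assembled `ldn #$7654` and `ldn ,n+` examples earlier in this post.

```python
MASK32 = 0xFFFFFFFF

def sign_extend17(field):
    """Promote a 17-bit field to 32 bits by copying bit 16 into bits 17..31."""
    field &= 0x1FFFF
    if field & 0x10000:
        field |= 0xFFFE0000
    return field

def encode_ldn(target):
    """Return the word(s) for a jump to `target`, short form when it fits."""
    target &= MASK32
    short = target & 0x1FFFF
    if sign_extend17(short) == target:
        return [0x5C080000 | short]        # ldn #target  (hidden ,z index)
    return [0x5E0E0000, target]            # ldn ,n+  followed by .dw target

print([hex(w) for w in encode_ldn(0x7654)])      # matches the jmp $7654 example
print([hex(w) for w in encode_ldn(0x123456)])    # matches the jml $123456 example
```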

Before I spend too many more hours on this, it would help me to know what the more experienced readers think about the design. Does it still have some of that 6502 flavor, or is it too polluted from its other influences to deserve to be called something like 65m32? I am a fully-grown man, and I can take the negative opinions with the positives, so I would like to ask that you don't pull any punches if you have a reasoned argument as to why it might suck in some way or another. I promise that I won't get butt-hurt and run away.

Sincerely,

Mike

GARTHWILSON · Post by **GARTHWILSON** » Tue Aug 06, 2013 7:30 am

I'm glad to see more discussion on it, but I might need some more explanation on the notation. I do understand some of it.

It does indeed lose some of the 6502 flavor, but apparently still deals with the same registers and mostly the same instructions-- just different notation, with some flexibility added.

If I understand the way you're doing it, then for your "LDN #$7654", you will want to be able to add the ",_" where the underscore is filled in with the name of one of the 32-bit registers that are the equivalent of the 65816's data and program banks, for 32-bit offsets. This will be necessary for address-agnostic code, including for when a program or an array it accesses gets moved after already being loaded. There are several reasons for wanting to be able to move it. I started writing up examples and then decided to leave it for later if it comes up.

barrym95838 · Post by **barrym95838** » Wed Aug 07, 2013 2:34 am

Thanks for responding, Garth. I was hoping that you, above all others, would ... excellent!

My operand notation is relatively simple, but maybe a little bit eccentric:

# usually indicates an immediate numeric value added to a register (exceptions below). A missing register name is replaced with z, and a missing immediate constant is replaced with 0. If you want to do y = x + 100, you code it as ldy #100,x. If you want to branch 1000 words forward, you code it as ldn #1000,n or its equivalent bra .+1001. The 6502's tsx would be ldx #,s or tsx, depending on how 6502-friendly the assembler is.

No # indicates that the operand will be used as an effective address. A missing register name is replaced with z, and a missing numeric offset is replaced with 0. If you want to load a with the first element of a table in memory pointed to by u, you code it as lda ,u. In my translation of FigForth, NEXT would be ldn ,y+ or its equivalent jmp (,y+).
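
The off-by-one between ldn #1000,n and bra .+1001 follows from n having already stepped past the branch word by the time the constant is added. A Python sketch of that arithmetic, checked against the hand-assembled bne and bpl words from my earlier post: `branch_word` is a made-up helper, and the condition codes 6 (ne) and 10 (pl) are inferred from those two examples only.

```python
LDN_IMM_N = 0x5C0E0000                          # 'ldn #,n' with ffff = 0 and a zero constant
COND = {"always": 0x0, "ne": 0x6, "pl": 0xA}    # ne and pl inferred from the examples

def branch_word(cond, here, target):
    """Encode a relative branch located at word address `here`."""
    offset = target - (here + 1)                # n already points past the branch
    if not -65536 <= offset <= 65535:
        raise ValueError("target out of 17-bit range")
    return LDN_IMM_N | (COND[cond] << 20) | (offset & 0x1FFFF)

print(hex(branch_word("ne", 100, 115)))         # bne .+15  ->  ldn [ne],#14,n
print(hex(branch_word("pl", 100, 52)))          # bpl .-48  ->  ldn [pl],#-49,n
```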

My mnemonics have only a few irregularities, and plenty of op-code space for important stuff that has certainly slipped my notice ... lots of available bits means lots of flexibility!

The first that may be unfamiliar are hla ... hln. What they do is push the indicated register on the system stack before loading it with a new value, but after any operand side-effects like auto-increment. This allows short or long jsrs (with post-word) to push the correct return address. They are also an economical way to save a register and immediately begin using it, in one instruction. If you want to stack a table pointer address contained in a and load a with the second element from that table, you would code it as hla 1,a. Later, you would restore the pointer with a lda ,s+ or its equivalent pla.
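
To show why the ordering matters for a long jsr, here is a small Python model of hln ,n+ with its post-word. The function names and the addresses are invented for the example. It assumes that n already points at the word after the instruction when the operand is read, and that pushes pre-decrement s the same way sta ,-s does.

```python
def hln_long(regs, mem):
    """Model 'hln ,n+' followed by an in-line target word (a long jsr)."""
    target = mem[regs["n"]]        # read the in-line word through n ...
    regs["n"] += 1                 # ... and post-increment, skipping over it
    regs["s"] -= 1                 # push n only after that side effect
    mem[regs["s"]] = regs["n"]
    regs["n"] = target             # finally load n: the jump itself

def rts(regs, mem):
    """Model 'ldn ,s+'."""
    regs["n"] = mem[regs["s"]]
    regs["s"] += 1

regs = {"n": 0x101, "s": 0x200}
mem = {0x101: 0x98765}             # target word following 'hln ,n+' at 0x100
hln_long(regs, mem)
print(hex(regs["n"]), hex(mem[0x1FF]))   # at the target; 0x102 saved on the stack
rts(regs, mem)
print(hex(regs["n"]))                    # back at the word after the in-line operand
```

If the push happened before the auto-increment, the saved address would point at the post-word itself and the return would try to execute the target address as an instruction.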

What about the weird stuff like sta #,b or asl #,a? You can't store into or shift an immediate, but you can use sta # to transfer register-to-register without affecting the N or Z flags, which lda # would do (at least for a, b, x, and y). For asl #, the immediate constant would be a shift count: negative values could exclude the carry, and positive values could include it.

Regarding your questions about relocation, please share or link me to any examples that you can create ... I'm chomping at the bit to see if I can adapt my concept to a processor that someone would want to use in a more complex environment. My 6502 experience does not include writing any 65c816 programs, but I can certainly grok the need for additional address translation beyond what the original NMOS could do ... I just need appropriate examples to dissect!

Thanks,

Mike

GARTHWILSON · Post by **GARTHWILSON** » Wed Aug 07, 2013 4:50 am

It would be nice to have an additional register or two that could be used as stack pointers too (although interrupts and subroutines would still use S). Bruce's idea of just using bank 0 on the '816 for a DTC Forth program and RTS as a 6-clock NEXT is quite efficient but mostly forbids interrupts. His ITC Forth idea of using PLX, JMP(0,X) for a 13-clock NEXT also has that problem. I have had other situations too where a separate hardware stack would make things more efficient. If the pushing and pulling instructions were the same thing only specifying a different register, that would be nice.

The '816 has a program bank register (PBR, also called K) and a data bank register (DBR, also called B) which are only 8 bits but dictate which 64K bank the referenced program addresses and data accesses will take place in, minus things like DP and stack and vectors which are always in bank 0. Any given program can be using the normal 16-bit addresses but the system can assign what bank(s) it will use to make it happily coexist with other programs that get their turn to run in a multitasking environment. Every program can think that it starts at address 0 and owns the hill, but in reality they start at integer multiples of 65,536 apart. I would like these "bank" registers to be 32-bit, like everything else in the processor, and then of course there are no bank boundaries. Your method of specifying the registers does the same thing but adds flexibility. An example I'm thinking of is where you have data mixed in with the program, as well as a data array somewhere else, and data accesses for both could be done with the same instructions but specifying which register to add in, with no need to be changing those registers. In this example, instructions to fetch data mixed in with the program in the program area would specify K to use the PBR equivalent instead of automatically going with B to use the DBR equivalent just 'cause they're accessing data.

The '816 also has a 16-bit direct-page register (also called D) which is an offset in bank 0 for where the direct page starts. Direct page is like zero page except that it can be moved around, so different tasks can have their own ZPs without stepping on each other. In an all-32-bit version, it's kind of like the whole 4-gigaword address space is in direct page, but a 32-bit offset register would still be valuable for the same reasons the PBR and DBR are.

I'm perhaps on thin ice talking at the limits of my knowledge, because I don't have any experience myself with using multiple banks on the '816 since I only have the '802 which is like a bank-0-only '816, and that's what I wrote and tested my '816 Forth on. BDD and others might be able to contribute more in this area.

If you haven't already, be sure your ideas accommodate actions like the 816's PEI, PER, and PEA. These can be synthesized in 65c02 also, but it takes a lot of instructions there to do what the '816 does in a single instruction.

What does the H stand for when it's the first letter of a mnemonic?

Although the above would be useful in multitasking, I have written several times that the kind of work I do with the workbench computer rules out normal multitasking OSs. I still have plenty of use for having multiple programs in memory at once though, and in fact I might like to do something like my HP-71 hand-held computer from the 80's does. It is not really a multitasking system but it does allow you to have hundreds of files, many if not most being program files, in file chains in non-volatile RAM, and they do not have to be loaded into another part of RAM to run, but can be run in place, even calling other programs and subprograms in other files, or even itself, with deep nesting. It supports local environments too, so the various programs that have been started don't step on each other's resources. Recursion is of course possible then too, and a pseudo-multithreading method allowed me to write a text editor for it that allowed me to have dozens of files open at once. If a program resizes a file that comes before it in the file chain, especially enlarging it, the system has to scoot things around to prevent fragmentation, and so all the free memory is together at one end. You can see that without the various offset registers I discuss above, that program would crash the first time there's a jump or return-from-subroutine after everything got moved and things are no longer where it expected them. In this case the system would update the offsets, and a return-from-subroutine for example goes to the right place even though the stack contents were not changed.
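
The relocation scenario can be made concrete with a toy model in Python. The names `Task`, `k`, `call`, and `ret` are invented here; the only idea taken from the description above is that saved return addresses are offsets, and the offset register is added when they are used.

```python
class Task:
    """Toy model: return addresses are stored relative to an offset register."""

    def __init__(self, program_base):
        self.k = program_base      # hypothetical 32-bit program offset register
        self.stack = []

    def call(self, return_offset):
        self.stack.append(return_offset)      # save an offset, not an absolute address

    def ret(self):
        return self.k + self.stack.pop()      # offset register added on use

task = Task(program_base=0x10000)
task.call(return_offset=0x42)
task.k = 0x25000                   # system moved the program and updated the offset
print(hex(task.ret()))             # lands in the relocated copy
```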

barrym95838 · Post by **barrym95838** » Wed Aug 07, 2013 5:47 am

Quote:

It would be nice to have an additional register or two that could be used as stack pointers too (although interrupts and subroutines would still use S).

Although my design is still accumulator-centric like the 6502, my addressing modes are brutally orthogonal, meaning that practically any registers besides a, z, and n can be stack pointers without penalty. The main ALU's destination is almost always a, but the operand adder can do all kinds of LEA-style things.

Did you notice my FigForth NEXT from above? It's a single 32-bit instruction: ldn ,y+ AKA jmp (,y+)
EXECUTE is jmp (,x+)
I didn't have to use y as the code pointer, but it looked more 6502-like than jmp (,u+) which reminds me more of the 6809.
It shouldn't be much of a surprise that it greatly resembles rts, which is jmp (,s+)
It also follows neatly that a long jump is jmp (,n+) and a long call is jsr (,n+)
In fact, I have already translated over 100 machine instructions from 6502 FigForth to 65m32 FigForth, and so far all except one are 32 bits long! Here's a brief hand-assembled list file excerpt ... the binary translations are NOT correct, because I remapped the op-codes after assembling it, and I won't re-assemble it until my op-code map is stable:


00001021:            43 ;Execute a word by its code field address on the
00001021:            44 ;  dstack
00001021: 00001025   45 EXEC    .dw  .+4
00001022: c5584543   46         .db  "E",'XEC'  ;EXECUTE
00001023: d5544500   47         .db  "U",'TE'
00001024: 00001000   48         .dw  LIT        ;link to LIT
00001025: 29700000   49         jmp  (,x+)
00001026:            50 ;Adjust IP by in-line literal
00001026: 0000102a   51 BRAN    .dw  .+4
00001027: c252414e   52         .db  "B",'RAN'  ;BRANCH
00001028: c3480000   53         .db  "C",'H'
00001029: 00001021   54         .dw  EXEC       ;link to EXECUTE
0000102a: 31000000   55         tya
0000102b: 38500000   56         add  ,y+        ;IP += literal
0000102c: 01300000   57         tay
0000102d: f170ffd8   58         bra  NEXT
0000102e:            59 ;If bottom of dstack is zero then branch
0000102e: 00001032   60 ZBRAN   .dw  .+4
0000102f: b0425241   61         .db  "0",'BRA'  ;0BRANCH
00001030: ce434800   62         .db  "N",'CH'
00001031: 00001026   63         .dw  BRAN       ;link to BRANCH
00001032: 60040000   64         lda  ,x+
00001033: f177fff6   65         beq  BRAN+4
00001034: 31300001   66 BUMP    iny             ;Skip IP over literal
00001035: f170ffd0   67         bra  NEXT
00001036:            68 ;Increment loop index, loop until >= limit
00001036: 0000103a   69 PLOOP   .dw  .+4
00001037: a84c4f4f   70         .db  "(",'LOO'  ;(LOOP)
00001038: d0290000   71         .db  "P",')'
00001039: 0000102e   72         .dw  ZBRAN      ;link to 0BRANCH
0000103a: 66f00000   73         inc  ,s         ;index
0000103b: 500c0001   74 PL1     lda  1,s        ;limit
0000103c: 67000000   75         cmp  ,s         ;index
0000103d: f17bffec   76 PL2     bmi  BRAN+4
0000103e: 61600002   77         lds  #2,s
0000103f: f170fff4   78         bra  BUMP

bra NEXT is replaced with jmp (,y+) after debugging.

Quote:

The '816 has a program bank register (PBR, also called K) and a data bank register (DBR, also called B) which are only 8 bits but dictate which 64K bank the referenced program addresses and data accesses will take place in, minus things like DP and stack and vectors which are always in bank 0. Any given program can be using the normal 16-bit addresses but the system can assign what bank(s) it will use to make it happily coexist with other programs that get their turn to run in a multitasking environment. Every program can think that it starts at address 0 and owns the hill, but in reality they start at integer multiples of 65,536 apart.

This is the kind of stuff that I would like to investigate further. My understanding is that the '816 doesn't have a privileged mode, so a misbehaving 16-bit program could pretty much walk all over the other processes by modifying B and K. So far, I have only investigated and partially implemented the process-view model here, where everything ORGs around zero. It would help greatly if I could figure out an efficient way to add a privileged 'bank' register that always adds to the effective address, and can only be accessed and modified in 'privileged' mode, but at this point, my only concern would be making sure that I leave enough op-code space available for those privileged instructions.

Quote:

If you haven't already, be sure your ideas accommodate actions like the 816's PEI, PER, and PEA. These can be synthesized in 65c02 also, but it takes a lot of instructions there to do what the '816 does in a single instruction.

Done, I think.
pei #10 would be psh #10
pea foo would be psh #foo or psh ,n+ : dw foo if foo is > 16-bits.
per foobar would be ... uhh ... I'd better look that up first, but I think that I can do it hassle-free.

Quote:

What does the H stand for when it's the first letter of a mnemonic?

Well, it's a little bit embarrassing, but I think that equal-length mnemonics are sexy. They make for aesthetically-pleasing source listings, and all of my designs use them. My 65m32 exclusively uses three letters out of respect for the 6502, my m-824 exclusively uses four letters, and my 10-bit decimal design uses ... well I'm not sure yet, but I'm sure that it will be equal-length.

hla and its siblings are a disappointing side-effect of this obsession. It means to pusH then Load register A with something. PLA is too ambiguous, and SLA lingered for awhile, but I'm not completely satisfied with any of them. Suggestions are welcome, as long as it's three letters.

More to come,

Mike

[EDIT 2013.10.07]: After some further translation efforts, I have discovered the embarrassing fact that jmp (,y+) is not compatible with a classic ITC Forth implementation, since IP points to the word's CFA, not the word's actual machine code.

Since register u is available for use as W, I have decided that the best 65m32 ITC NEXT would look like this:

NEXT    ;(2 instructions, 2 machine words, 5 machine cycles)
        ldu  ,y+        ; W = (IP) , IP += 1	
        jmp  (,u+)      ; execute code @ (W) , W += 1

I have also revised my mnemonic table: hla (pusH-Load-A) is now pda (Push-loaD-A) ... I believe that this change is in the spirit of the 6502's pha and php mnemonics.
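The two-instruction NEXT above can be modelled in a few lines of Python. This is only a sketch under my own assumptions: memory is word-addressed (hence `+= 1`), "machine code" is represented by Python callables looked up by code address, and all names and addresses in `demo()` are arbitrary.

```python
class ITCMachine:
    """Toy indirect-threaded-code interpreter: y plays IP, u plays W."""

    def __init__(self, mem, primitives):
        self.mem = mem                # word-addressed memory (list of ints)
        self.primitives = primitives  # code address -> Python callable
        self.ip = 0
        self.w = 0
        self.stack = []
        self.running = False

    def next(self):
        self.w = self.mem[self.ip]    # ldu ,y+   : W = (IP)
        self.ip += 1                  #           : IP += 1
        code = self.mem[self.w]       # jmp (,u+) : fetch code address from CFA
        self.w += 1                   #           : W now points past the CFA
        self.primitives[code](self)

    def run(self, ip):
        self.ip = ip
        self.running = True
        while self.running:
            self.next()


def demo():
    def lit(m):                       # push the in-line cell that follows
        m.stack.append(m.mem[m.ip])
        m.ip += 1

    def halt(m):
        m.running = False

    mem = [0] * 200
    mem[0:3] = [100, 42, 110]         # thread: LIT 42 HALT
    mem[100] = 1000                   # CFA of LIT holds its code address
    mem[110] = 1001                   # CFA of HALT holds its code address
    machine = ITCMachine(mem, {1000: lit, 1001: halt})
    machine.run(0)
    return machine
```

Leaving W pointing one cell past the CFA is what lets a colon-definition or variable handler find its parameter field without extra arithmetic.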

GARTHWILSON · Post by **GARTHWILSON** » Wed Aug 07, 2013 7:07 am

I'm not very concerned with privileged mode or memory protection, as we don't expect this to be used commercially. Instead, probably anyone who uses it will be developing their own software, or at least starting with someone else's source code. Either way, they'll have full control, and can debug it all and not be at the mercy of any software suppliers that are secretive or unresponsive or incompetent to fix their bugs. When I get done developing a piece of software for my workbench computer, it has no bugs at all--or should I say, none ever show up--and there's definitely no crashing. This cannot be said for commercial bloatware; but I don't see this processor being used that way, especially for consumer-type uses. (To go further, I can usually have it recover from a crash in a second or two and get back on the job without re-loading things, since most crashes seem to be a matter of being safely stuck in a loop, not scribbling all over memory. I tell about it in Tip #40.)

Quote:

Did you notice my [...]

Yes, but I did not understand it all yet. It may still take a while.

I do like that the 6502's and 816's mnemonics are all three letters. Many things could be given aliases to get them down to three letters also, like your JMP(,S+) being RTS. I do that with PIC16 programming too, using 6502 mnemonics to invoke a macro to do something the PIC needs multiple instructions for but the 6502 can do in a single instruction.

Quote:

I didn't have to use y as the code pointer, but it looked more 6502-like than jmp (,u+) which reminds me more of the 6809.

On the 6502 & '816 we use X for the data stack pointer in Forth (as you know). Although the '816 does not need Y in Forth in the same way the '02 does, it would be nice to keep X and Y available, not tied up. Even the '816 does sometimes require saving X as the data stack pointer to use X temporarily for something else, and then has to restore X before the end of the primitive.

Quote:

In fact, I have already translated over 100 machine instructions from 6502 FigForth to 65Org32b FigForth, and so far all except one are 32 bits long!

So is that one 64 bits long? (Since the data bus is 32 bits wide, there are no narrower options, just as the 6502 does not have a 4-bit load or store.) What does your @ (fetch) look like? I compare the 6502 @ (10 instructions) with the much, much shorter (2 instructions) '816 @ at viewtopic.php?f=9&t=1505&p=9705#p9705 . Does yours get this down to a single instruction also, in spite of being 32-bit? That would be great for performance! Lemme see... Can it be done with 1 clock for instruction read, one for operand read, one for data read, and one for data write? That would be four clocks to do the 32-bit job whereas the '816 takes 12 clocks to do the 16-bit job. I realize of course that so far it's only talk, with no hardware.

My 32-bit 6502 DO...LOOP and associated words are at http://wilsonminesco.com/Forth/32DOLOOP.FTH and you can see how many, many instructions it takes to do it on the 6502. It will be fun to see that shortened to just a few instructions.

barrym95838 · Post by **barrym95838** » Wed Aug 07, 2013 7:26 am

Thanks for your interest Garth! I need to get some sleep now, but I'll find (or quickly write) @ and DO tomorrow after work, and post them here ... unless someone else beats me to it!

Mike

[Edit: I snuck away from work long enough to write @:]

lda ,x ;get address
lda ,a ;fetch value @ address
sta ,x ;store value

One of the possible disadvantages with my design is that (direct,x) and (direct),y have to be synthesized. There is much more flexibility though. With no wait states, it's conceivable that my @ could execute in nine (or possibly even six) cycles.
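In case the three instructions above are too terse, here is the same @ written as a Python sketch. It assumes word-addressed memory and x as the data stack pointer; the function name is just for illustration.

```python
def fetch(mem, x):
    """Model of @ : replace the address on top of the data stack with its contents."""
    a = mem[x]    # lda ,x   get address from top of stack
    a = mem[a]    # lda ,a   fetch value at that address
    mem[x] = a    # sta ,x   store value back on top of stack
    return a
```

Each line is one memory access beyond the instruction fetch itself, which is where the cycle estimates come from.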

No time for DO ... back to work!!

GARTHWILSON · Post by **GARTHWILSON** » Wed Aug 07, 2013 7:36 am

About your listing above-- I know you were copying out of the fig-Forth source, but zowee what a difference a good macro assembler makes! Here are similar pieces from my '816 Forth:

         HEADER "EXECUTE", NOT_IMMEDIATE   ; ( addr -- )
EXECUTE: PRIMITIVE
         LDA     0,X
 xeq1:   STA     W
         INX_INX
         JMP     W-1
 ;-------------------
         HEADER "PERFORM", NOT_IMMEDIATE   ; ( addr -- )
PERFORM: PRIMITIVE                         ; same as  @ EXECUTE
         LDA     (0,X)
         BRA     xeq1
 ;-------------------
         HEADER "EXIT", NOT_IMMEDIATE      ; ( -- )
EXIT:    DWL     unnest+2                  ; (primitive)
 ;-------------------
         HEADER "branch", NOT_IMMEDIATE    ; ( -- )
branch:  PRIMITIVE                         ; Set the IP to the absolute addr
         LDA     (IP)                      ; pointed to by the cell following the
         STA     IP                        ; execution token of branch. It's faster
         GO_NEXT                           ; this way not making it relative.
 ;-------------------
         HEADER "0branch", NOT_IMMEDIATE   ; ( n -- )
Zbranch: PRIMITIVE
         INX_INX
         LDA     $FFFE,X         ; (FFFE,X works for '802.)  Get the value that was at
         BEQ     branch+2        ; TOS before INX_INX .  Do the branch if TOS was 0.

 bump:   LDA     IP              ; bump (advance) the instruction pointer by two
         INA_INA                 ; LDA, INA, INA, STA  is faster than  INC INC.
         STA     IP
         GO_NEXT
 ;-------------------
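
For readers who don't have the macros in front of them, the logic of branch and 0branch in the listing above can be sketched in Python as follows. This is a simplified model with assumptions of my own: byte-addressed memory, 2-byte little-endian cells, and a Python list standing in for the X-indexed data stack.

```python
CELL = 2  # bytes per cell on the 16-bit '816 Forth


def read_cell(mem, addr):
    """Read a 16-bit little-endian cell."""
    return mem[addr] | (mem[addr + 1] << 8)


def branch(mem, ip):
    """Return the new IP: the absolute address stored in the cell at IP."""
    return read_cell(mem, ip)


def zero_branch(mem, ip, stack):
    """Pop the flag; take the branch if it was zero, otherwise skip the operand cell."""
    flag = stack.pop()
    if flag == 0:
        return branch(mem, ip)
    return ip + CELL  # the 'bump' path
```

Storing an absolute target rather than an offset is what lets branch get by with a load and a store.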

BigEd · Post by **BigEd** » Wed Aug 07, 2013 5:27 pm

barrym95838 wrote:

... Does it still have some of that 6502 flavor, or is it too polluted from its other influences to deserve to be called something like 65Org32b?

Hi Mike, I'm happy to see the discussions of your core, but if I can express a preference, I'd prefer you don't call it the 65Org32b: the 65org32 is a particular thing, and if we make any variations like we did with 65Org16, they might be called 65Org32.b - so, if you could pick something like 65X32, for any letter of your choice other than O (or I, or B,...), that might help minimise future confusion.

Cheers
Ed

barrym95838 · Post by **barrym95838** » Wed Aug 07, 2013 6:20 pm

BigEd:

Message received, understood, and respected! I will go back and edit all of my posts ... my core will be 65m32.

Mike
