WDC's W65T32 Terbium 32-Bit Processor
- GARTHWILSON
Quote:
I'm not aware of any microcontrollers which have separate instruction and data address spaces. I'm aware of quite a few DSPs which have this, but not any microcontrollers.
TMorita wrote:
From a compiler code generation point of view, it is indeed a mistake.
For example, in certain situations, gcc needs to dynamically generate code at runtime. On most processors, this can be done on the runtime stack, but if a processor has separate instruction/data address spaces, this can't be done.
From your description I see that, from a compiler builder's point of view, consistency and flexibility are important: flexibility in that many or all registers can be used for the same purposes (e.g. output values), and consistency in that all registers and operations behave the same or similarly (e.g. with addressing modes).
I guess this much improves the compiler's ability to schedule(?) operations on different registers, so as to avoid moving data around between registers and temporary variables.
BTW: How does cc65 work in this respect?
André
Quote:
For example, in certain situations, gcc needs to generate dynamically
generate code at runtime.
generate code at runtime.
What situations are these for? I cannot think of any reason why gcc would need to dynamically create code. This might be a feature of glibc (e.g., to optimize memory moves given a specific pattern of input parameters), but not gcc.
Quote:
I'm not aware of any microcontrollers which have separate instruction and data address spaces. I'm aware of quite a few DSPs which have this, but not any microcontrollers.
I might add that GCC for ATmega microcontrollers (Atmel's top-of-the-line units) produces amazingly good code, although I hardly exercised it when I tried it. Still, it was amazingly nice.
fachat wrote:
Does that not collide with the "no-execute" features of modern e.g. x86 CPUs?
The biggest hit comes from the fact that you cannot have a single cache line in the code and data caches at the same time (engineered specifically this way because, at the time the 80486 came about, self-modifying code was still an issue, apparently). Other CPUs, like the PowerPC and 68K series, had to execute special cache-control operations to flush a line to memory and invalidate it in the instruction cache before executing freshly written code (e.g., the PowerPC's dcbst/sync/icbi/isync sequence). The Amiga proved that this was not such a huge burden; when the 68030 became available, the impact on existing software was minimal, and within two months, all errant software had new releases that were 68030 compatible, and fully backward compatible with the 68000 through 68020. In short, it was a non-issue, a normal industry growing pain that was quickly resolved.
However, as usual, Intel favored increasing levels of complexity rather than simplicity.
Quote:
I guess this much improves the compiler's ability to schedule(?) operations on different registers, so as to avoid moving data around between registers and temporary variables.
Given a 65816 built on a modern semiconductor process, you should be able to realize a Commodore 64 that runs at several GHz, too. But the caches would need to be engineered differently to compensate for the 65816's heavy use of RAM (instead of registers). It's doable, but is the return on investment worth it?
Quote:
BTW: How does cc65 work in this respect?
However, I'm thinking more seriously now of implementing a MachineForth compiler for the 65816 (yes, I know it's not C, but not all languages have to be like C!) using GForth, and hopefully with any luck, I can speed up my Kestrel's firmware development by doing so.
kc5tja wrote:
What situations are these for? I cannot think of any reason why gcc would need to dynamically create code. This might be a feature of glibc (e.g., to optimize memory moves given a specific pattern of input parameters), but not gcc.
Anyway, when nested functions are used, gcc will create and execute code on the runtime stack when calling the nested function. I don't remember the exact reason why, unfortunately. I think it had something to do with massaging the arguments, but I'm not sure about this.
Toshi
I think TMorita might be referring to small blocks of code created at run-time that must be able to find associated data without any additional information. They would do this by storing the data with the code, relative to the program counter. I remember this kind of thing from the M68K days of the Mac, and the solution then was to store the data with the code, but not in an instruction, so no code was being modified when the data values were set. But on the Mac you weren't generating this at run-time the way it sounds like gcc does. GCC has lots of non-portable extensions to the language. I don't think plain C has any need for run-time code generation.
Indeed, that kind of runtime code generation sounds like a no-win situation to me. I've observed my C code output in disassembly form, and never observed dynamically generated code. But, I also don't use the nested functions solution.
However, I do occasionally use Oberon, and I'm well aware that that environment does not utilize dynamically generated code either. So GCC's solution to the problem seems to me very much like driving a finishing nail with a rock thrown from a ballista.
Where I do see dynamically generated code as being exceptionally helpful is in graphics blitter routines, where blit operations can be compiled given a set of constraints (kind of like compiling SQL queries prior to their use). I intend to use this approach for a more sophisticated graphics interface to the Kestrel, since the 65816 is so utterly horrible when it comes to general-purpose bit manipulation (single-bit shifts and rotates really, really, really suck).
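A sketch in C of what "compiling" a blit might look like, with runtime code generation replaced by up-front selection of a specialized inner loop (all names here are illustrative, not from any Kestrel source):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: instead of emitting machine code at runtime,
   "compile" a blit by picking a specialized inner loop from the blit's
   constraints ahead of time -- much like preparing an SQL query before
   executing it. */
typedef void (*blit_fn)(uint8_t *dst, const uint8_t *src, size_t n);

static void blit_copy(uint8_t *d, const uint8_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) d[i] = s[i];
}
static void blit_or(uint8_t *d, const uint8_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) d[i] |= s[i];
}
static void blit_xor(uint8_t *d, const uint8_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) d[i] ^= s[i];
}

enum blit_op { BLIT_COPY, BLIT_OR, BLIT_XOR };

/* the "compile" step: constraints in, specialized routine out */
static blit_fn compile_blit(enum blit_op op) {
    switch (op) {
    case BLIT_OR:  return blit_or;
    case BLIT_XOR: return blit_xor;
    default:       return blit_copy;
    }
}
```

A real runtime code generator would go further and stitch together shift, mask, and combine fragments into one loop, but the prepare-then-execute shape is the same.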
Terbium "the chip" didn't disappear; something which never existed cannot possibly disappear.
The 65T32 product announcement was for a 32-bit address space, 16-bit word-addressed processor (thus an 8GiB address space) that is a superset of the 65816. It is still in development. And, yes, it's still vaporware until we see real silicon or IP licenses being offered.
However, what started out as a project name has evolved into a trademark for the company. They're applying that trademark to all their products. When I asked for an update on the status, they were tight-lipped -- far more so than they ever have been. This suggests that they are relatively close to some kind of Terbium-related announcement. I cannot predict what kind of announcement it will be. It's definitely NOT the TIDE though, for it is already public knowledge.
This is not an unheard of thing for companies like WDC to do. It happens all the time. When I worked for Hifn, we did the same thing.
Though, I did suggest to WDC that they get rid of the 65T32 name -- it's not well liked by WDC or by me, because 6532 is already taken by another 65xx-class chip.
I personally would prefer 65000; however, I did humorously suggest that they resurrect the 65832 name, since that processor died while still on the drawing board. They got the market all hyped up about it, gave release dates, and even sample datasheets. And then . . . *poof*. Nothing. Which adds all the more to the humor of using the 65832 moniker.
I still prefer the name "65000" though, because that leaves plenty of space open for families of processors based on the architecture.
kc5tja wrote:
Indeed, that kind of runtime code generation sounds like a no-win situation to me. I've observed my C code output in disassembly form, and never observed dynamically generated code. But, I also don't use the nested functions solution.
...
"In the GCC compiler, trampoline refers to a technique for implementing pointers to nested functions. The trampoline is a small piece of code which is constructed on the fly on the stack when the address of a nested function is taken. The trampoline sets up the static link pointer, which allows the nested function to access local variables of the enclosing functions. The function pointer is then simply the address of the trampoline. This avoids having to use "fat" function pointers for nested functions which carry both the code address and the static link. See GCC internals: Trampolines for Nested Functions. "
Toshi
fachat wrote:
BTW: How does cc65 work in this respect?
Hampered, but not impossible. You can use dp,x as a substitute, for example. Recursion is limited, of course.
However, that being the case, the 6502 is better suited for languages which lack any concept of "activation frames" (e.g., Forth). The 65816 fares MUCH better than the 6502 for supporting stack frames. Generalized activation frames (e.g., Scheme's "continuations"), however, still give it quite a bit of trouble.
kc5tja wrote:
Hampered, but not impossible. You can use dp,x as a substitute, for example. Recursion is limited, of course.
However, that being the case, the 6502 is better suited for languages which lack any concept of "activation frames" (e.g., Forth).
Thus proving my point: mid-80s-era BASIC implementations (of which EhBASIC is an instance) simply lack activation frames, thus preventing effective use of structured programming and, in particular, modules. Nobody needs local variables!!
Of course, you can fake locals with arrays:
All you're doing here is hard-coding the activation frame (check it -- it even allows recursion, but I still won't recommend it) mechanisms by hand. It might even compile efficiently (it certainly won't interpret very fast!).
Code:
10 SP=0:REM My fake stack pointer
20 DIM IS(100), SS$(100):REM The stacks.
...
1000 REM Compute Celsius from Fahrenheit the HARD way.
1010 SP=SP+1
1020 IS(SP)=(IS(SP-1)-32)/180
1030 R=IS(SP)*100
1040 SP=SP-1:RETURN
...
This reminds me of a VisualBasic 3 programming project I once had to do, where I sorely lacked various features provided by proper activation frames. I ended up not only re-implementing them as above, but I also implemented "virtual CPU registers" as global variables too, to facilitate parameter passing. This was for a dentistry management application. The code was ugly, but the code ran correctly. We shipped on schedule.
However, the above approach simply doesn't give you the benefit of modular programming, and moreover, heavens help you if you attempt object oriented or functional programming.
Languages like Lisp or Scheme have activation records, not frames. That means, these things are actually data structures, allocated on the heap just like any other object a program would manipulate (this is taken advantage of in Scheme by exposing continuations, an entire chain of activation records). These provide all the same benefits of activation frames, but may be more suitable to the 6502 because they're normal data structures, and not based on a stack. As such, you can implement them much more freely (their semantics aren't so hard-set in concrete). Hence, I actually expect a language like Lisp/Scheme to outperform C on the 6502.
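A rough sketch in C of what a heap-allocated activation record looks like (illustrative only, not taken from any particular Lisp). Each call allocates a record and links it to its caller; a garbage-collected system would let records outlive their caller, which is exactly what makes continuations possible:

```c
#include <stdlib.h>

/* an activation record as an ordinary heap object, chained by a
   caller link instead of pushed on a hardware stack */
typedef struct Frame {
    struct Frame *caller;  /* dynamic link back to the calling record */
    int           n;       /* this activation's local variable */
} Frame;

static int fact(Frame *caller, int n) {
    Frame *f = malloc(sizeof *f);  /* allocate the activation record */
    if (!f) abort();
    f->caller = caller;
    f->n = n;
    int r = (f->n <= 1) ? 1 : f->n * fact(f, f->n - 1);
    free(f);  /* a real Lisp would leave this to the collector */
    return r;
}
```

On a 6502 this shape is attractive precisely because the records are plain data structures: you can place them anywhere, address them with ordinary indexed modes, and aren't tied to the tiny hardware stack.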
I should note that Forth on the 65816 relies heavily on stacks, and even in the best of cases (subroutine-threaded with primitive inlining), you're looking at an average of 10 clock cycles per Forth primitive. Colon definitions or their equivalents are a minimum of 12 cycles. So, as a general rule, if you have a 10MHz 65816, you're pulling about 1 Forth MIPS. It varies, and deeply nested colon definitions will drop that to about 0.5 Forth MIPS, but within that range it's amazingly consistent. I would expect all MIPS measurements to drop by a factor of 2 for the 6502.
Although it relies heavily on stacks, it gets its performance from not having activation frames -- it implicitly assumes the operands you're working with are always at the top of the stack, something even a 6502 can handle relatively easily.
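To make the "operands are always at the top of the stack" point concrete, here's a tiny frameless data stack sketched in C (names are mine, not from any particular Forth):

```c
/* a Forth-style data stack: primitives implicitly take their operands
   from the top of the stack, so there are no activation frames and no
   per-call operand addressing at all */
static int stack[64];
static int sp;  /* index of the next free slot */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

/* each primitive is a self-contained operation on the stack top */
static void f_dup(void)  { int t = stack[sp - 1]; push(t); }
static void f_add(void)  { int b = pop(); push(pop() + b); }
static void f_swap(void) { int b = pop(), a = pop(); push(b); push(a); }
```

On the 6502 the same idea maps nicely onto a zero-page stack indexed by X, which is why Forth fits the chip so well.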
Someday, I'll toy with the idea of making a Lisp-like language of some kind for the 65816, based on my experience with Haskell.