The RTF65002 Core

ElEctric_EyE · Post by **ElEctric_EyE** » Wed Sep 25, 2013 11:20 am

Sounds like you got it alright. With Arlet's core it requires some gutting of the code for the overlap. I quit a year or so ago after a quick dive in, because it got a bit complicated, plus I was busy and still am. But soon I would like to use a 32 bit 6502 core for a controller board I will need. Doesn't have to be 100MHz, but it would be nice as the hardware it will be driving will be fast as well.
So what I am saying is, I can add my time as a troubleshooter if someone else does this core.

BigEd · Post by **BigEd** » Wed Sep 25, 2013 7:21 pm

Rob Finch wrote:

You've probably heard this before but,

Interesting idea about w - of course you are right that some pairs of opcodes collapse into the same behaviour.

If I were to tackle the 65Org32, I'd try to do it as a conditional configuration of Arlet's core - which might be a mistake - and I'd make a version which changed the minimum possible. This version would act as a vanilla base version for launching off into architectural variations. That's what I tried to do with 65Org16.

Such thinking does rule out a short form immediate. As the machine is word-addressed, all instruction fetches and other memory accesses are 32 bits wide. Packing an operand into an instruction is tempting for 65Org16 and even more tempting for 65Org32, but in a base version I personally would leave it out.

Interesting idea about ADD and SUB. I'd have to think about that, but on the face of it you're right, that support for easy multiword arithmetic is barely worthwhile.

It's quite possible that a from-scratch effort would make more sense than to use an existing core as a base, and in that case a few of the "extras" ideas might well be attractive enough to appear even in the base version.

Cheers
Ed

GARTHWILSON · Post by **GARTHWILSON** » Wed Sep 25, 2013 8:19 pm

Rob Finch wrote:

Quote:

the 65Org32 is strictly a 32-bit machine, both for address and data...

You've probably heard this before but,

It's many pages long, but the discussion on it is at viewtopic.php?f=1&t=1419. You'll see I mentioned you right up front. The 65Org32 is like a 65816 (minus the emulation mode) but with all the registers being 32-bit; so it's far more suitable than a big 6502 for multitasking, multithreading, relocatable code, higher-level languages, etc. Since the bank registers are 32-bit also, there are no bank boundaries, because the entire 4gigaword space is available all the time to every addressing mode, so the bank registers simply become offsets, and every program can think that it (and its private data) starts at address 0 (even if it had to be moved after having been loaded).
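A minimal sketch of the "bank registers become offsets" idea, in Python rather than any real 65Org32 toolchain. The function name `effective_address` and the wrap-around at 2**32 are assumptions made for illustration; the thread does not specify them.

```python
MASK32 = 0xFFFFFFFF  # assumption: addresses wrap within the 4-gigaword space

def effective_address(bank_offset: int, operand: int, index: int = 0) -> int:
    """Treat the 32-bit bank register as a plain offset added to operand + index."""
    return (bank_offset + operand + index) & MASK32

# The program always refers to its own address 0x10; only the bank
# register changes when the program is loaded (or moved) somewhere else.
first_load = effective_address(0x0004_0000, 0x10)
after_move = effective_address(0x0123_0000, 0x10)
print(hex(first_load), hex(after_move))
```

Because the offset is a full 32 bits, there is no bank boundary to cross: any operand plus any offset is just another address in the same flat space.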

Quote:

If the 65org32 is strictly 32 bits, then the absolute addressing modes are redundant. abs,x is the same as zp,x. and abs is the same as zp.
I'd suggest reusing the address mode opcodes to add another index register 'w'. abs,y becomes zp,y and abs,x becomes zp,w.

There would certainly be enough op codes available to do anything we wanted. Although ZP (or, in the parlance of the '816, "DP" since it can be moved around) covers the whole 4gigaword space, it is still desirable to keep it, and the DP register (which is also 32-bit) becomes another offset. The 816's "long" address modes would be used not for anything particularly longer but for ignoring the bank registers in instructions that look outside the currently running program.

I might be in favor of another index register if it doesn't come with a penalty somewhere else. BigEd observed, "With 6502, I suspect more than one beginner has wondered why they can't do arithmetic or logic operations on X or Y, or struggled to remember which addressing modes use which of the two. And then the intermediate 6502 programmer will be loading and saving X and Y while the expert always seems to have the right values already in place." I really had little desire for an additional index register until working through an idea for a third stack for high-precision or floating-point as we discussed on the forum years ago and I'm expanding on it for the stacks primer (which I've been able to work on again for the last few days). [Edit: It's up, at http://wilsonminesco.com/stacks/ .] Really the only higher-level language I've used on the 6502 is Forth, and it uses X constantly as the data stack pointer and seldom has any need to save it to use X for anything else. If I were to implement a complex floating-point stack too though, it would make it nice to have the equivalent of another X register, in this case apparently W. It would be good to hear from those who intimately know the insides of C or other compiled languages, to see what would be most helpful there. Just throwing registers at it without a clear plan of what to use them for may not be a very good idea.

Quote:

It would also save code space to have a short form immediate, e.g. 16 bits instead of 32 (did this on the rtf65002).

Do you mean merging the operand with the op code in the same word? The BBS, BBR, SMB, and RMB instructions do that, and they are considered kind of oddball instructions on the 65c02, since they're mainly useful for I/O and yet I/O is seldom in ZP where they operate, except in the microcontrollers these instructions get used in. I know very little about HDL design of processors' internals, but it would seem to me that instruction decoding would become more complex than the traditional 6502 way. On the PIC microcontrollers, having the op code and operand merged sometimes makes it harder to check things in the list code, but that might be remedied with a clean and consistent division between hex digits so no digit has bits from both.

Quote:

Then with 32 bits it also makes sense to use a plain ADD/SUB instruction rather than ADC/SBC.
Otherwise I'd keep the rest of the processor the same in order to keep it small.

There's total freedom to do whatever a designer wants of course; but the goal with the 65Org32 itself is to retain the 6502/816 flavor and make a truly 65-family processor with a lot more computing power. I have lamented that other efforts like the 65GZ032 (which, although never finished, did have working hardware) were really RISCs with very little resemblance to the 6502 except in emulation mode. I think the idea was a plug-in replacement to run in the Commodore 64, which is why the emulation mode, for boot-up.

Quote:

there'd be no backwards compatibility.

Probably the only reasons for backwards compatibility are so customers will migrate more easily and so already-written software can continue to be sold. Otherwise I see backwards compatibility as a big damper on performance. The backward-compatibility part here is really more a matter of taking advantage of a lot of 6502 programming experience and making it capable of going much further than it ever could on a 6502 or even '816. That can still happen even if the same mnemonics result in different op codes, the bus widths are different, etc.

Quote:

No barrel shifter, no additional registers (save w). no additional instructions. No cache. The goal being to fit the processor in a relatively small FPGA.

With 32 bits, I think a barrel shifter is going to be rather necessary. Doing ASL for example 20 or 30 times would be a killer to performance.

Rob Finch · Post by **Rob Finch** » Wed Sep 25, 2013 11:38 pm

Quote:

Do you mean merging the operand with the op code in the same word?

Yes, I'd use up some space in the 32 bit opcode for a 16 bit constant. Otherwise a lot of space would be wasted. This could be done for zero page mode too, if zero page were limited to 64k. It would make zero page mode a cycle faster too. It means the state machine is different than some of the 6502 cores.
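A minimal sketch of what packing a 16-bit constant into a 32-bit instruction word could look like. The field layout here (opcode in bits 7..0, constant in bits 31..16, bits 15..8 unused) is purely hypothetical and is not taken from the rtf65002 or any 65Org32 proposal; it is chosen so that no hex digit mixes opcode and operand bits. The opcode value $A9 is simply the 6502's LDA immediate, used as a familiar example.

```python
OPCODE_MASK = 0xFF   # hypothetical: opcode lives in the low 8 bits
IMM_SHIFT = 16       # hypothetical: 16-bit constant lives in the high half

def pack(opcode: int, imm16: int) -> int:
    """Build one 32-bit instruction word holding both opcode and constant."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("constant must fit in 16 bits")
    return (imm16 << IMM_SHIFT) | opcode

def unpack(word: int) -> tuple[int, int]:
    """Split a 32-bit instruction word back into (opcode, constant)."""
    return word & OPCODE_MASK, (word >> IMM_SHIFT) & 0xFFFF

word = pack(0xA9, 0x1234)
print(f"{word:08X}", unpack(word))
```

With this layout one fetch delivers both opcode and operand, which is where the saved cycle and saved code space would come from; a constant that does not fit in 16 bits would still need a second word.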

Quote:

Doing ASL for example 20 or 30 times would be a killer to performance.

Often one only needs to shift a couple of bits. Multiply and divide instructions can also be used to shift bits, and aren't a lot more expensive than a barrel shifter. Multipliers are built into some FPGAs. It might be cheaper to use a multiplier rather than a barrel shifter. I seem to remember reading an article about using multipliers to do rotates as well.
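A minimal sketch of the arithmetic behind using a multiplier for shifts and rotates, assuming a 32x32 multiply that yields a 64-bit product. The function names are made up for illustration, and this models only the arithmetic, not any particular FPGA multiplier block.

```python
MASK32 = 0xFFFFFFFF

def shl_via_multiply(value: int, n: int) -> int:
    """Shift left by n (0..31): multiply by 2**n and keep the low 32 bits."""
    product = (value & MASK32) * (1 << n)   # 64-bit product
    return product & MASK32

def rol_via_multiply(value: int, n: int) -> int:
    """Rotate left by n (0..31): the bits shifted out of the low half of the
    product reappear in the high half, so OR the two halves together."""
    product = (value & MASK32) * (1 << n)
    low = product & MASK32
    high = (product >> 32) & MASK32
    return low | high

print(hex(shl_via_multiply(0x80000001, 4)))
print(hex(rol_via_multiply(0x12345678, 8)))
```

The multiplier operand 2**n still has to be produced from the shift count n, for example with a small decoder or lookup table.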

enso · Post by **enso** » Thu Sep 26, 2013 4:44 am

OK, I have to know. How long does it take to build your core, from verilog to configured FPGA?

GARTHWILSON · Post by **GARTHWILSON** » Thu Sep 26, 2013 6:03 am

Rob Finch wrote:

Quote:

Do you mean merging the operand with the op code in the same word?

Yes, I'd use up some space in the 32 bit opcode for a 16 bit constant. Otherwise a lot of space would be wasted. This could be done for zero page mode too, if zero page were limited to 64k. It would make zero page mode a cycle faster too. It means the state machine is different than some of the 6502 cores.

With a 4 gigaword address space, myself, I have not been very concerned for the amount of memory taken for the programs. In my experience, the main reason to have a huge memory space is for data. I could wish SRAM were cheaper today, but even without adjusting for inflation, I think it's far cheaper than even DRAM was in the mid-1970's when the 6502 was designed. Without using more-major pipelining, would merging the operand with the op code really make for any significant performance improvement, since the instruction won't be in until the end of the read cycle and the operand could be getting fetched in the next cycle while the instruction is getting decoded? It seems to me like you'll have the same delay, whether the operand is fetched in cycle 1 or cycle 2.

Quote:

Doing ASL for example 20 or 30 times would be a killer to performance.

Often one only needs to shift a couple of bits. Multiply and divide instructions can also be used to shift bits, and aren't a lot more expensive than a barrel shifter. Multipliers are built into some FPGAs. It might be cheaper to use a multiplier rather than a barrel shifter. I seem to remember reading an article about using multipliers to do rotates as well.

I am definitely in favor of having a hardware multiplier, and the A and B accumulators of the '816 which are accumulator C when you put them together would become the two 32-bit inputs of the multiply, with accumulator C holding the 64-bit product when you're done. As for the shifting and rotating, I have also thought that being able to go in steps of 8 bits would be good so if you want to shift something over 24 bits for example, you do three shifts of 8 bits at a time instead of 24 single-bit shifts. So instead of being able to shift any arbitrary number of bits, the instructions could go in increments of 1 and 8 bits. This might be used when compressing text to get four bytes in one word for example. I would definitely like a hardware divide instruction too, but I feel like that's asking too much. Maybe I'm wrong.
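A minimal sketch of the "increments of 1 and 8 bits" idea, assuming hypothetical shift-by-8 and shift-by-1 instructions. It counts how many instructions an arbitrary shift would take and shows the text-packing use case; none of the names come from an actual 65Org32 definition.

```python
MASK32 = 0xFFFFFFFF

def shift_steps(n: int) -> tuple[int, int]:
    """Split a shift of n bits into (8-bit steps, 1-bit steps)."""
    return divmod(n, 8)

def pack_four_chars(chars: bytes) -> int:
    """Pack four 8-bit characters into one 32-bit word, first character in
    the top byte, using only 8-bit left shifts and ORs."""
    if len(chars) != 4:
        raise ValueError("expected exactly four bytes")
    word = 0
    for b in chars:
        word = ((word << 8) | b) & MASK32
    return word

eights, ones = shift_steps(20)
print(eights + ones, "instructions instead of 20")
print(hex(pack_four_chars(b"6502")))
```

The worst case under this scheme is a 31-bit shift (three 8-bit steps plus seven 1-bit steps), so it trades some speed against a full barrel shifter in exchange for less logic.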

Arlet · Post by **Arlet** » Thu Sep 26, 2013 6:17 am

GARTHWILSON wrote:

With a 4 gigaword address space, myself, I have not been very concerned for the amount of memory taken for the programs. In my experience, the main reason to have a huge memory space is for data. I could wish SRAM were cheaper today, but even without adjusting for inflation, I think it's far cheaper than even DRAM was in the mid-1970's when the 6502 was designed. Without using more-major pipelining, would merging the operand with the op code really make for any significant performance improvement, since the instruction won't be in until the end of the read cycle and the operand could be getting fetched in the next cycle while the instruction is getting decoded? It seems to me like you'll have the same delay, whether the operand is fetched in cycle 1 or cycle 2.

Compact code makes a lot better use of limited cache memory.

Rob Finch · Post by **Rob Finch** » Thu Sep 26, 2013 6:40 am

Quote:

Compact code makes a lot better use of limited cache memory.

I suppose the i-cache could implement just the low order eight bits.

In order to implement separate I and D caches, an additional signal like 'VPA' on the '816 is required.
Otherwise the cache controller would have to watch the bus and decode instructions to know what to store off.

I've been looking at Arlet's core and I can't see how the pc increment works for single byte instructions. It looks like the pc would be incremented by two. The pc increment looks like it takes place in both the IFETCH and DECODE states.

I tried synthesizing the 65Org16 code but with 32 bit databus width and 64 bit address bus width. If only 32 bits is desired for addressing, the upper 32 bits could just be left unconnected. Even with 64 bit addressing the core's only about 1,000 LUTs.

Arlet · Post by **Arlet** » Thu Sep 26, 2013 6:48 am

Rob Finch wrote:

I've been looking at Arlet's core and I can't see how the pc increment works for single byte instructions. It looks like the pc would be incremented by two. The pc increment looks like it takes place in both the IFETCH and DECODE states.

In the single byte instructions the state machine goes from DECODE -> REG -> DECODE. It doesn't do a FETCH state, because the next opcode has already been read.

GARTHWILSON · Post by **GARTHWILSON** » Thu Sep 26, 2013 7:13 am

Arlet wrote:

Compact code makes a lot better use of limited cache memory.

Um, cache? In the 65Org32? What does that mean for the determinateness of interrupt response time? The low jitter is perhaps even more important to me than the latency itself, and I suspect that a cache miss once in a while could make interrupt response time very indeterminate.

Arlet · Post by **Arlet** » Thu Sep 26, 2013 7:32 am

Typically, you'd use a 32 bit system with a large memory, and large memories mean SDRAM, and SDRAM doesn't work well without a cache.

But, of course, if you only use SRAM, then a cache is optional. On the other hand, if you want to run (partially) from internal FPGA memory, compact code is equally important, because BRAM is a scarce resource.

And it's always possible to put general code in cached external memory, and put your low latency ISR in local memory, where it's fast and predictable.

What do you need low jitter for? Depending on the application, it may be easier to add a smart peripheral. For instance, if you want to sample an ADC, the FPGA can take care of reading the ADC at an exact period, and put the results in a FIFO. The CPU then doesn't have to worry about jitter.

GARTHWILSON · Post by **GARTHWILSON** » Thu Sep 26, 2013 8:50 am

Arlet wrote:

And it's always possible to put general code in cached external memory, and put your low latency ISR in local memory, where it's fast and predictable.

What happens if the ISR hits during a cache refill because of the miss? Will it be dozens, hundreds, or even thousands of cycles before the processor resumes normal operation so it can even start the interrupt sequence? (I'm not particularly challenging, just wanting to make sure everything relevant is considered.)

Quote:

What do you need low jitter for? Depending on the application, it may be easier to add a smart peripheral. For instance, if you want to sample an ADC, the FPGA can take care of reading the ADC at an exact period, and put the results in a FIFO. The CPU then doesn't have to worry about jitter.

Yes, that's much of it, but I was hoping to avoid that complexity. The length of FIFO required also makes it harder to start instantly without delay, and stop on a dime. If you can do it with interrupts, I expect that it would open up a wider range of general-purpose applications beyond what a sound-card manufacturer had in mind.

Arlet · Post by **Arlet** » Thu Sep 26, 2013 9:03 am

GARTHWILSON wrote:

What happens if the ISR hits during a cache refill because of the miss? Will it be dozens, hundreds, or even thousands of cycles before the processor resumes normal operation so it can even start the interrupt sequence? (I'm not particularly challenging, just wanting to make sure everything relevant is considered.)

That depends on the design. Not more than dozens, even with the simplest design possible. But you have to realize that with slow external memory, like SDRAM, it's not much better without a cache. If you happen to be reading from SDRAM while an interrupt hits, the CPU will still finish the current instruction. Even with SRAM you can have this problem, if there are multiple bus masters, such as DMA or video access.

Quote:

Yes, that's much of it, but I was hoping to avoid that complexity. The length of FIFO required also makes it harder to start instantly without delay, and stop on a dime. If you can do it with interrupts, I expect that it would open up a wider range of general-purpose applications beyond what a sound-card manufacturer had in mind.

You could have a 'FIFO' of 1. Basically, the FPGA takes a sample from the ADC at the exact time you want it, keeps the value in a register, and sends an interrupt to the CPU. The CPU then has a whole period to respond before the ADC register is overwritten with a new value. Or use an ADC that already has periodic sampling capability.
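A minimal sketch of why a 'FIFO' of 1 decouples sample timing from interrupt jitter. This is a simplified model with assumed names (`run`, `latencies`, `adc`): sample k is latched at exactly k * period, and the CPU reads it some variable latency later. A read that arrives a full period or more late is counted as a lost sample.

```python
from typing import Callable

def run(period: int, latencies: list[int], adc: Callable[[int], int]):
    """Return (received, lost).

    received: list of (sample_time, value) pairs the CPU read in time.
    lost: indices of samples overwritten before the CPU got to them.
    The sample time is always k * period, regardless of CPU latency.
    """
    received, lost = [], []
    for k, latency in enumerate(latencies):
        if latency < period:
            received.append((k * period, adc(k)))
        else:
            lost.append(k)
    return received, lost

# CPU response varies from 5 to 120 time units against a period of 100.
received, lost = run(100, [5, 60, 99, 120], lambda k: k * 10)
print(received, lost)
```

In this model the jitter in the CPU's response never shows up in the sample timestamps; it only matters once the latency reaches a whole sample period.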

Rob Finch · Post by **Rob Finch** » Thu Sep 26, 2013 9:17 am

Quote:

the state machine goes from DECODE -> REG -> DECODE

Okay, I got it. I just got confused because I called the instruction fetch stage IFETCH on the rtf65002.

Quote:

The length of FIFO required also makes it harder to start instantly without delay,

Could you use a dual cpu system with one cpu dedicated to servicing the ADC, and the other handling other tasks as required ? They could communicate through the dual port BRAMS.

GARTHWILSON · Post by **GARTHWILSON** » Thu Sep 26, 2013 9:48 am

How fast do you have to be going to need SDRAM? SRAM goes down at least as low as 10ns, and I know I've seen 6ns but maybe not in the denser ones.

Quote:

Could you use a dual cpu system with one cpu dedicated to servicing the ADC, and the other handling other tasks as required ? They could communicate through the dual port BRAMS.

I've thought about a dual-CPU setup, but for a different reason. Having only one do the realtime applications and the human I/O and file system too seems prohibitive. A dual-port RAM is one way I've thought about linking them though.

As for DMA, if there are any dead bus cycles at all, those can be used to get DMA without taking any time away from the processor, as discussed in the topic "The secret, hidden, transparent 6502 DMA channel."
