The RTF65002 Core

Arlet · Post by **Arlet** » Thu Sep 26, 2013 10:26 am

GARTHWILSON wrote:

How fast do you have to be going to need SDRAM? SRAM goes down at least as low as 10ns, and I know I've seen 6ns but maybe not in the denser ones.

What makes plain old SDRAM interesting is the higher memory density/package, and low price, while still having a fairly simple interface. It's not that much faster than SRAM, except when doing a write burst. In SDRAMs, doing back to back single cycle writes is supported and documented, while on SRAM, it's a dark area that the datasheets do not speak about.

Keep in mind that 10 ns SRAM only gets you a random access cycle time of about 20 ns, as enso has discovered.

Rob Finch · Post by **Rob Finch** » Thu Sep 26, 2013 8:02 pm

Quote:

OK, I have to know. How long does it take to build your core, from verilog to configured FPGA?

Enso, it takes about 10 minutes to build the core from editing to configuring the FPGA. I'm a bit impatient, so I keep the system small enough that it doesn't take too long to build. I build the system almost continuously, one build after the other, editing and testing between builds. So it's built up little by little.

GARTHWILSON · Post by **GARTHWILSON** » Fri Sep 27, 2013 9:04 am

Arlet wrote:

GARTHWILSON wrote:

How fast do you have to be going to need SDRAM? SRAM goes down at least as low as 10ns, and I know I've seen 6ns but maybe not in the denser ones.

What makes plain old SDRAM interesting is the higher memory density/package, and low price, while still having a fairly simple interface. It's not that much faster than SRAM, except when doing a write burst. In SDRAMs, doing back to back single cycle writes is supported and documented, while on SRAM, it's a dark area that the datasheets do not speak about.

Keep in mind that 10 ns SRAM only gets you a random access cycle time of about 20 ns, as enso has discovered.

Sure, with set-up times, address-decode times, etc. I was expecting about 50MHz. A 50MHz 32-bit processor like barrym95838 has been working on would be about as fast as a 1GHz 6502 if you're constantly dealing with 32-bit values in a higher-level language, without the complexities of cache and DRAM management. (The instruction ratio is about 8:1, and he predicts an average of under two clocks per instruction versus the 6502's four.)
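As a rough check of that estimate, here is a minimal Python sketch of the arithmetic. The function name and the 1.6 clocks-per-instruction value are illustrative assumptions; only the 50MHz clock, the 8:1 instruction ratio, and the 6502's four clocks per instruction come from the post.

```python
def equivalent_6502_mhz(clock_mhz, cpi, instruction_ratio=8.0, cpi_6502=4.0):
    """Clock (in MHz) a 6502 would need to match a wider CPU on 32-bit work."""
    wide_mips = clock_mhz / cpi                # millions of 32-bit instructions per second
    mips_6502 = wide_mips * instruction_ratio  # 6502 instructions needed for the same work
    return mips_6502 * cpi_6502                # clocks the 6502 spends on them, in MHz

# Exactly two clocks per instruction gives about 800 MHz.
print(equivalent_6502_mhz(50, 2.0))
# "Under two" clocks, here an assumed 1.6, gives about 1 GHz.
print(equivalent_6502_mhz(50, 1.6))
```

So the 1GHz figure holds if the average really does come in somewhat under two clocks per instruction.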

Rob Finch · Post by **Rob Finch** » Fri Sep 27, 2013 5:11 pm

Quote:

It's not that much faster than SRAM, except when doing a write burst.

Don't forget burst reading is very fast. It works well with a cache. It's just the random access reads that are tardy. On the Atlys board I've got the DDR2 clocked at 312.5 MHz. Since DDR2 uses both clock edges, that gives 625 MHz memory performance. It works out to 1.25 GB/s. That's a 1.6 ns access time! Seems to work.
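The figures above can be reproduced with a short Python sketch. The 2-byte (16-bit) bus width is an assumption inferred from the 1.25 GB/s number rather than something stated in the post, and the function name is made up for illustration.

```python
def ddr_peak(clock_mhz, bus_bytes, transfers_per_clock=2):
    """Peak burst figures for a double-data-rate memory interface."""
    mega_transfers = clock_mhz * transfers_per_clock  # both clock edges carry data
    gb_per_s = mega_transfers * bus_bytes / 1000      # MB/s converted to GB/s
    ns_per_transfer = 1000 / mega_transfers           # time per transfer within a burst
    return mega_transfers, gb_per_s, ns_per_transfer

# Assumed 16-bit (2-byte) data bus.
mt, gb, ns = ddr_peak(312.5, bus_bytes=2)
print(mt, gb, ns)
```

Note that the 1.6 ns is the time per transfer inside a burst, not the latency of a random access, which is the slow case mentioned above.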

SDRAM would work well if it's burst fed to/from a fifo.

Rob Finch · Post by **Rob Finch** » Sun Sep 29, 2013 5:38 am

I've managed to trim the core down to a size that might fit into an xc6sLx9 with a simple UART: 5258 LUTs. It still might not route.
This was accomplished by removing several instructions, which can in theory be supported by emulating them in an illegal opcode routine. Performance would be lousy, but if it fits? The core size can still be reduced slightly further by removing and emulating the barrel shift instructions.

Rob
Scratching my head over a software bug at the moment.

BigEd · Post by **BigEd** » Sun Sep 29, 2013 6:08 am

That's progress! The reason I have an LX9, and the reason it might be a good target, is that it's available on a relatively affordable dev board, with 16-bit wide RAM too. That makes it available to anyone who wants an FPGA project without a soldering project. (http://www.xilinx.com/products/boards-a ... MB-LX9.htm)

I do recall when I attempted to add a barrel shifter to 65Org16 that it came out pretty large. So, substituting 1-bit and 8-bit shifts might be very worthwhile from a point of view of fitting into a smaller device. Multiplication is cheap because the multipliers are already sitting there whether you use them or not. So it's only the right shifts which need a mux.

(Division is not cheap: in my view a divide step instruction is as far as it's reasonable to go, and even that is marginal)
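To illustrate the substitution suggested above, here is a minimal Python sketch of a 32-bit logical right shift built only from 8-bit and 1-bit shift steps. The function names and the step counter are illustrative assumptions, not part of any of the cores discussed.

```python
MASK32 = 0xFFFFFFFF

def shr1(x):
    """Single-bit logical right shift of a 32-bit value."""
    return (x & MASK32) >> 1

def shr8(x):
    """Byte-wide logical right shift of a 32-bit value."""
    return (x & MASK32) >> 8

def shift_right(x, n):
    """Shift right by n using only 8-bit and 1-bit steps; returns (result, steps)."""
    steps = 0
    while n >= 8:      # take whole bytes first
        x = shr8(x)
        n -= 8
        steps += 1
    while n > 0:       # finish with single-bit shifts
        x = shr1(x)
        n -= 1
        steps += 1
    return x, steps

result, steps = shift_right(0x80000000, 13)
print(hex(result), steps)
```

A shift by 13 takes one byte step and five bit steps here, so the trade is a handful of extra cycles in exchange for dropping the full barrel shifter.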

barrym95838 · Post by **barrym95838** » Sun Sep 29, 2013 8:47 pm

GARTHWILSON wrote:

... A 50MHz 32-bit processor like barrym95838 has been working on would be about as fast as a 1GHz
6502 if you're constantly dealing with 32-bit values in a higher-level language, without the complexities of
cache and DRAM management. (The instruction ratio is about 8:1, and he predicts an average of under two
clocks per instruction versus the 6502's four.)

My ears are burning, Garth! Seriously, I have completed enough of the specification document to begin
earnest work on a simulator, thanks to ttlworks and teamtempest. I will not start a new 65m32 thread
until both are ready for public view. I have not given serious thought to a supervisor state yet, but a
working user-state should be adequate to illustrate the proof-of-concept.

As mentioned, the 65m32 needs only one 32-bit memory cycle for instruction fetch, and zero, one or two
additional cycles for the execution, making the average about two memory cycles per instruction. With the
exceptions of mul, div, and mod, the decode and execution should successfully interleave and allow the
machine cycle and memory cycle to be synonymous ... those three instructions would likely be demoted to
instruction traps at this point, depending on details that I have not fully developed.

I am trying to study other examples to catch up on my knowledge in my rather limited spare time ...
please be patient if you can't offer to help me work out some of the dozens of unfinished details. I
am still finding myself wondering if it was a wise choice to "spill the beans" before I had them fully-
cooked ... remember, I'm just an amateur hobbyist who happens to hold a 22-year-old CpE degree, and
not much else!

Thanks to all,

Mike

Rob Finch · Post by **Rob Finch** » Mon Sep 30, 2013 4:42 am

Quote:

The instruction ratio is about 8:1, and he predicts an average of under two
clocks per instruction versus the 6502's four.)

It's really difficult to get under 2 CPI because loads typically stall the pipeline, and they make up about 25% of instructions. Branches also toast a pipeline and they make up another 25% of instructions. The complexities of pipelining may result in a lower clock frequency.

I'm guessing the CPI for the RTF65002 is somewhere between 3 and 4, slightly better than the 6502 because the core fetches whole instructions at once. Like the '02, many instructions execute in just 2 clocks. Running at 25MHz the RTF65002 is probably equivalent to a 250MHz 6502, given that 8:1 instruction ratio.

Quote:

I am trying to study other examples to catch up on my knowledge in my rather limited spare time

There are lots of sample cores on OpenCores.org, including a couple of MIPS-compatible cores. My own core, the Raptor64, includes things like branch prediction and caches. If you have questions, post or PM, I might be able to answer some. But then again maybe it's bad advice since I'm non-pro.

barrym95838 · Post by **barrym95838** » Mon Sep 30, 2013 5:29 am

Rob Finch wrote:

... If you have questions, post or PM, I might be able to answer some.
But then again maybe it's bad advice since I'm non-pro.

Thanks, Rob. I might take you up on that ... just as soon as I figure out
the proper questions to ask!

Mike

P.S. I just found this .pdf in which AMD claims a sustained 17 MIPS at
25 MHz for their 29000. I definitely want to add this to my reading list!

GARTHWILSON · Post by **GARTHWILSON** » Mon Sep 30, 2013 11:26 pm

BigEd wrote:

(Division is not cheap: in my view a divide step instruction is as far as it's reasonable to go, and even that is marginal)

Does the entirely different approach at http://6502org.wikidot.com/software-math-fastdiv help? It looks like Bruce's work. He's amazing at this kind of thing. It does need more explanation for me to understand it. To do it in software makes for a short routine with no looping; so maybe doing it in hardware would require only a small number of gates.

Rob Finch · Post by **Rob Finch** » Tue Oct 01, 2013 12:04 am

Quote:

Does the entirely different approach at http://6502org.wikidot.com/software-math-fastdiv help? It looks like Bruce's work. He's amazing at this kind of thing. It does need more explanation for me to understand it. To do it in software makes for a short routine with no looping; so maybe doing it in hardware would require only a small number of gates.

I'm not sure I understand that approach either. But I counted the number of operations and there seem to be at least as many ops as there would be in a standard hardware division. The code requires a shifter and an adder plus some branching (multiplexer). I've studied hardware dividers some, and have not come across a divider that makes use of that approach. A hardware divider only uses a subtracter, shift register and multiplexer. A multiplier is simpler than a divider so uses fewer gates.
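For reference, the subtracter, shift register and multiplexer arrangement can be modelled as a radix-2 restoring divider. This is a minimal Python sketch of that textbook scheme, not the RTF65002's actual divider; the function name and the 32-bit default width are assumptions.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned radix-2 restoring division, one quotient bit per iteration."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        # Shift register: bring the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Subtracter: trial subtraction of the divisor.
        diff = remainder - divisor
        # Multiplexer: keep the difference if it fits, else keep the old remainder.
        if diff >= 0:
            remainder = diff
            quotient |= 1 << i
    return quotient, remainder

print(restoring_divide(100, 7))
```

Each loop iteration corresponds to one clock in a simple hardware version, so a 32-bit divide takes about 32 clocks at radix 2.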

I could be using a higher radix (e.g. radix 4) divider because the clock frequency of the core (about 50MHz max) is low enough that a higher radix divider wouldn't affect it. However, it's an even larger design then. There's also a cached reciprocal divider that allows divides in only three clock cycles.

Rob Finch · Post by **Rob Finch** » Tue Oct 01, 2013 2:05 am

I modified the assembler to output some statistics. Here's the code density for the bootrom and TinyBasic:

Number of instructions processed: 5131
Number of opcode bytes: 14576
Bytes per instruction: 2.840772

Not bad for 32 bit processing.

Rob Finch · Post by **Rob Finch** » Fri Oct 04, 2013 12:01 am

I did some more statistics to calculate an approximate CPI and it turned out to be almost PI:

For the RTF65002:
Number of instructions processed: 5261
Number of opcode bytes: 15051 <- wow a palindrome
Bytes per instruction: 2.860863
Clock cycle count: 16560
Clocks per instruction: 3.147691 <- and PI

The above statistics are only estimates.

The CPI assumes data memory access requires two clock cycles and instruction
access is single cycle. The actual CPI may be higher if there are memory wait
states, or lower if data is found in the cache.

For the 6502 (EhBASIC):
Number of instructions processed: 4554
Number of opcode bytes: 9105
Bytes per instruction: 1.999341
Clock cycle count: 15929
Clocks per instruction: 3.497804

The above statistics are only estimates.

The CPI assumes data memory access requires two clock cycles and instruction
access is single cycle. The actual CPI may be higher if there are memory wait
states, or lower if data is found in the cache.
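The two ratios in each listing follow directly from the raw counts. Here is a minimal Python sketch that recomputes them; the function name is illustrative and the counts are copied from the listings above.

```python
def density_and_cpi(instructions, opcode_bytes, clock_cycles):
    """Return (bytes per instruction, clocks per instruction)."""
    return opcode_bytes / instructions, clock_cycles / instructions

counts = {
    "RTF65002": (5261, 15051, 16560),
    "6502 (EhBASIC)": (4554, 9105, 15929),
}

for name, (n_instr, n_bytes, n_clocks) in counts.items():
    bpi, cpi = density_and_cpi(n_instr, n_bytes, n_clocks)
    print(f"{name}: {bpi:.6f} bytes/instruction, {cpi:.6f} clocks/instruction")
```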

ElEctric_EyE · Post by **ElEctric_EyE** » Fri Oct 04, 2013 1:10 am

***Watching with interest***

Rob Finch wrote:

I did some more statistics to calculate an approximate CPI and it turned out to be almost PI...

Is this pure coincidence or some kind of clue to something that's happening on a deeper level?

Rob Finch · Post by **Rob Finch** » Thu Oct 10, 2013 3:35 am

I hooked up a temperature sensor (Dallas 1626) and now one can use the Atlys as an expensive thermometer. A readout of the temp can be done by typing TE at the prompt.

Added to the core most recently are bitmap bit instructions and a string compare instruction. Adding them didn't increase the code bloat too much. The bitmap instructions set/clear/flip or test a bit relative to a starting address for the bitmap. The bit number to work on is stored in the accumulator. These are read-modify-write instructions so the bus is locked until the update completes.
So
LDA #7000
BMC $1000 ; bitmap clear
clears the 7000th bit relative to the starting address $1000.
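A rough software model of what such a bitmap instruction does, as a Python sketch. It assumes byte-addressable memory with bits numbered from the least significant bit of each byte; the real core's addressing granularity and bit ordering may differ, and the function name and return value are made up for illustration.

```python
def bitmap_op(memory, base, bit, op):
    """Set, clear, flip or test bit number `bit` of the bitmap starting at `base`.

    Returns the bit's previous state. Assumes byte-wide memory, LSB-first bits.
    """
    index = base + bit // 8     # which byte holds the bit
    mask = 1 << (bit % 8)       # position of the bit inside that byte
    old = memory[index]         # read ...
    if op == "set":
        memory[index] = old | mask            # ... modify, write
    elif op == "clear":
        memory[index] = old & ~mask & 0xFF
    elif op == "flip":
        memory[index] = old ^ mask
    elif op != "test":
        raise ValueError(f"unknown bitmap operation: {op}")
    return bool(old & mask)

memory = bytearray(b"\xff" * 0x2000)
was_set = bitmap_op(memory, 0x1000, 7000, "clear")  # like LDA #7000 / BMC $1000
print(was_set, hex(memory[0x1000 + 7000 // 8]))
```

In hardware the read and the write happen under a bus lock, which this single-threaded model does not need to represent.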

The string compare opcode (CMPS) compares two strings addressed by the .x and .y registers until the strings differ or the count stored in the acc expires. The flags are set appropriately as a result of the compare. I hope to have a character search function too.
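The behaviour described for CMPS can be modelled roughly as follows in Python. This is only a sketch under assumptions: memory is treated as a byte array, the flag result is reduced to a sign value (-1, 0 or 1), and the remaining count is returned; none of these details are confirmed for the actual instruction.

```python
def cmps(memory, x, y, count):
    """Compare up to `count` bytes at addresses x and y.

    Returns (sign, remaining): sign is -1, 0 or 1 for less, equal or greater,
    and remaining is the count left when the compare stopped.
    """
    while count > 0:
        a, b = memory[x], memory[y]
        if a != b:                      # strings differ: stop early
            return (1 if a > b else -1), count
        x += 1
        y += 1
        count -= 1
    return 0, count                     # count expired with no difference

memory = bytearray(b"HELLO WORLDHELLO THERE")
print(cmps(memory, 0, 11, 11))  # the strings first differ at 'W' versus 'T'
print(cmps(memory, 0, 11, 6))   # the first six characters match
```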

Opcodes for the processor spilled over into a second opcode page, so there is a prefix instruction to indicate a second-page opcode.
