
All times are UTC




PostPosted: Sun Nov 12, 2017 11:59 am 
This is a 6502-like core which compiles to around 250-270 LUTs on a MachXO3 and runs at up to 150MHz on that FPGA (100MHz stable). All instructions take 1 cycle to complete, with the exception of the first instruction after reset, which takes 2 cycles.

That is about 4-16 times faster than a full 6502 core on this FPGA (the WDC 6502 runs at 75MHz on this FPGA; Arlet's extended 65C02 (by BigEd & Dave) runs at about 28MHz).

The N6502 implements a cut-down subset of the normal 6502 instruction set, with a lot of restrictions:

Each instruction is two bytes, including address/data. The total program memory is 512 bytes, which means 256 words (or 256 instructions). Branch offsets count words (i.e. units of 2 bytes), but this can easily be changed in the code for 6502 compatibility (the range is then limited to +/-128 bytes).

Data memory is a separate space of 256 bytes. So no self-modifying code with only one core.

Stack memory is another separate space of 256 bytes, making the total memory 1024 bytes.
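
To make the word-based branch concrete, here is a minimal Python sketch of the two branch-target calculations. The function names are hypothetical, and two things are assumed rather than taken from the N6502 source: that the operand is a signed 8-bit offset relative to the instruction after the branch (as on a stock 6502), and that the program counter wraps at the end of program memory.

```python
PROGRAM_WORDS = 256          # 512 bytes of program memory, 2 bytes per instruction

def sign_extend8(value):
    """Interpret an 8-bit operand as a signed offset in the range -128..127."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value

def branch_target_words(pc_word, operand):
    """Word-based branch: PC and offset both count 2-byte instructions.

    Assumption: the offset is relative to the instruction after the branch.
    """
    return (pc_word + 1 + sign_extend8(operand)) % PROGRAM_WORDS

def branch_target_bytes(pc_byte, operand):
    """6502-style branch: PC and offset count bytes, so the reach is halved."""
    return (pc_byte + 2 + sign_extend8(operand)) % (PROGRAM_WORDS * 2)

# A branch at word 10 with operand $FE (-2) lands on word 9.
print(branch_target_words(10, 0xFE))
# The same operand in byte mode, at byte address 20, branches to itself.
print(branch_target_bytes(20, 0xFE))
```

In word mode a single signed byte can reach the whole 256-word program space, which is why the byte-compatible variant ends up with half the range.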

The following instructions are currently implemented:
Code:
   // LDA #nn     is $A9, %10101001
   // LDA $ZP     is $A5, %10100101
   // LDA $ZP,X   is $B5, %10110101
   // LDX #nn     is $A2, %10100010
   // LDX $ZP     is $A6, %10100110
   // STA $ZP     is $85, %10000101
   // STA $ZP,X   is $95, %10010101
   // STX $ZP     is $86, %10000110
   // BCC $xx     is $90, %10010000
   // BCS $xx     is $B0, %10110000
   // ADC #nn     is $69, %01101001
   // ADC $ZP     is $65, %01100101
   // ADC $ZP,X   is $75, %01110101
   // SBC #nn     is $E9, %11101001
   // SBC $ZP     is $E5, %11100101
   // SBC $ZP,X   is $F5, %11110101
   // CMP #nn     is $C9, %11001001
   // CMP $ZP     is $C5, %11000101
   // CPX #nn     is $E0, %11100000
   // CPX $ZP     is $E4, %11100100
   // DEX         is $CA, %11001010
   // AND #nn     is $29, %00101001
   // AND $ZP     is $25, %00100101
   // AND $ZP,X   is $35, %00110101
   // ORA #nn     is $09, %00001001
   // ORA $ZP     is $05, %00000101
   // ORA $ZP,X   is $15, %00010101
   // CLC         is $18, %00011000
   // SEC         is $38, %00111000
   // TXA         is $8A, %10001010
   // TAX         is $AA, %10101010
   // ROL A       is $2A, %00101010
   // ROR A       is $6A, %01101010
   // ASL A       is $0A, %00001010
   // LSR A       is $4A, %01001010
   // PHA         is $48, %01001000
   // PLA         is $68, %01101000
   // SQR #nn     is $49, %01001001
   // SQR $ZP     is $45, %01000101
   // SQR $ZP,X   is $55, %01010101

   // Note: $AD, $BD, $CD, $6D, $7D, $3D, $2D, $1D and $0D also trigger (as $x5 instructions)
   // NOP: all other opcodes


The SQR instruction replaces EOR and returns the square of its operand in the accumulator and X register. It requires a 512-byte lookup table to work.
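
One way a 512-byte table can hold all 8-bit squares is as 256 low bytes followed by 256 high bytes. The Python sketch below shows that layout; the table order, the `sqr` helper name, and the split of the result into a low and a high byte are assumptions made for illustration, since the post does not say how the table is organized or which register receives which half.

```python
# Assumed layout: 256 low bytes of n*n, followed by 256 high bytes of n*n.
SQR_TABLE = bytes((n * n) & 0xFF for n in range(256)) + \
            bytes((n * n) >> 8 for n in range(256))

def sqr(n):
    """Return the (low, high) bytes of n*n using two table lookups."""
    n &= 0xFF
    return SQR_TABLE[n], SQR_TABLE[256 + n]

low, high = sqr(255)          # 255 * 255 = 65025 = $FE01
print(hex(high), hex(low))
```

The largest result, 255 squared, still fits in 16 bits, so two bytes per entry are enough.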

Note that all 1-byte opcodes require an extra (unused) byte after them to become two bytes. Except for that and the difference in branch, you can use your preferred assembler for this.
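
The padding rule can be sketched as a tiny encoder in Python. The opcode values come from the instruction list; the `encode` and `assemble` helpers, the mode names, and the choice of $00 as the padding byte are hypothetical (the post only says the extra byte is unused).

```python
# Opcode values copied from the instruction list; only a few are included here.
OPCODES = {
    ("LDA", "imm"): 0xA9,
    ("ADC", "zp"): 0x65,
    ("STA", "zp"): 0x85,
    ("CLC", "implied"): 0x18,
    ("DEX", "implied"): 0xCA,
}
PAD_BYTE = 0x00   # assumption: the unused byte is ignored, so any value would do

def encode(mnemonic, mode, operand=PAD_BYTE):
    """Encode one instruction as exactly two bytes: opcode, then operand or padding."""
    return bytes([OPCODES[(mnemonic, mode)], operand & 0xFF])

def assemble(program):
    """Concatenate the two-byte encodings of a list of instructions."""
    return b"".join(encode(*instruction) for instruction in program)

image = assemble([
    ("CLC", "implied"),       # 1-byte opcode, padded to two bytes
    ("LDA", "imm", 5),
    ("ADC", "zp", 0x10),
    ("STA", "zp", 0x11),
])
print(image.hex(" "))
```

With a stock assembler the same effect could be had by following each implied instruction with a one-byte data directive.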

Constructive comments are appreciated. The core can probably run stand-alone in some small CPLDs or as a co-processor for a full 6502. Stack sharing (for multi-core systems) is not currently implemented, but is planned for a future release.

Please note that this is a core under development, so files and features may change without further notice.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Attachment:
File comment: Source files for the N6502 version 0.1.
N6502-v0.1.zip [24.69 KiB]
Downloaded 274 times


PostPosted: Sun Nov 12, 2017 3:27 pm 
An interesting variation on a theme! Of course my first reaction is to wonder about mechanisms to eke out just a little more program space, or a little more data space. It's a bit of a challenge with fixed-length instructions, how to embed an address which is adequately long. I suppose an indirection register or two would do the job - perhaps banking for the program memory and indirection for the data memory.

Then again, perhaps you have confidence that 256 words is enough for some interesting or useful programs?


PostPosted: Sun Nov 12, 2017 9:19 pm 
The intention was to make something fast, and one has to choose between fast and large. It's not intended to be used as a single core, though: I have already tested it as two cores, and you can add even more.

Why many cores? They are difficult to program and limiting in many instances, but some applications need massive data throughput, and then it helps a lot to have more than one core. Fast spectroscopy, for instance (I am a physicist), in which you need to collect 16 or 24 bits of data at ns-intervals and analyze it on the go. Or neural networks for AI applications, in which you want to push through a lot of simulated "neurons". If you can get 25 cores at 100MHz, you can probably push through 1 million neurons in a few milliseconds.
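
As a rough check of the neuron estimate, the instruction budget per neuron can be worked out in a few lines of Python. The 4 ms window is an assumed value for "a few milliseconds", and the function name is hypothetical; the only property of the core used here is its one instruction per clock.

```python
def instructions_per_neuron(cores, clock_hz, window_ms, neurons):
    """Instruction budget per neuron, assuming one instruction per clock per core."""
    total_instructions = cores * clock_hz * window_ms / 1000
    return total_instructions / neurons

# 25 cores at 100 MHz, one million neurons, an assumed 4 ms window
budget = instructions_per_neuron(cores=25, clock_hz=100_000_000,
                                 window_ms=4, neurons=1_000_000)
print(budget)
```

Under these assumptions each neuron gets about ten single-cycle instructions, so the estimate only holds if a neuron update fits in a handful of instructions.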

The data memory is set up as dual-port memory so that one can read and write to the same memory space at the same time (thereby enabling single-cycle instructions). This is not true for stack or program memory, so the idea is that a separate core (65C02, 65816, 6522 or whatever) can write to the program memory to manipulate the code, or push data onto and pull data off the stack.

So there are many ways to use this core. If you need larger program (or data) memory, add a blitter that swaps out a word at a time to some external memory at 100MHz, and it can run a continuous program for as long as you want. Or just increase the program memory. Since it doesn't write or read to the program memory, 8-bit addressing only limits the data memory, which in this core can be viewed as a kind of large register. Feed it a memory stream from outside, and it can continue to gnaw on that for as long as you want.

The other way is, as you point out, to extend the absolute addressing. The reason for all instructions being two bytes is that the program RAM can then be accessed over a 16-bit data bus, which avoids the need for two accesses to the RAM per instruction. It would be no problem to widen this to 24 or 32 bits to get a 16- or 24-bit address space, but it would probably hurt the speed (and size). In that respect, a traditional full 65C02 or 65C816 is probably better.
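
The single-fetch idea can be illustrated with a small Python model in which program memory is a list of 16-bit words, so opcode and operand arrive together. This is only a behavioural sketch: the word layout (opcode in the low byte), the `pack` and `run` helpers, and the binary-only ADC are assumptions, and only four opcodes from the instruction list are modelled.

```python
def pack(opcode, operand=0):
    """Assumed word layout: opcode in the low byte, operand in the high byte."""
    return (operand & 0xFF) << 8 | (opcode & 0xFF)

def run(program_words, data, steps):
    """Execute `steps` instructions, one 16-bit program fetch per instruction."""
    a, carry, pc = 0, 0, 0
    for _ in range(steps):
        word = program_words[pc]                   # single fetch: opcode and operand together
        opcode, operand = word & 0xFF, word >> 8
        pc = (pc + 1) & 0xFF                       # PC counts words, not bytes
        if opcode == 0xA9:                         # LDA #nn
            a = operand
        elif opcode == 0x65:                       # ADC $ZP (binary mode only)
            total = a + data[operand] + carry
            a, carry = total & 0xFF, total >> 8
        elif opcode == 0x85:                       # STA $ZP
            data[operand] = a
        elif opcode == 0x18:                       # CLC
            carry = 0
        # every other opcode is treated as a NOP, as in the instruction list
    return a, carry

data = bytearray(256)                              # separate 256-byte data memory
data[0x10] = 0xF0
program = [pack(0x18), pack(0xA9, 0x20), pack(0x65, 0x10), pack(0x85, 0x11)]
a, carry = run(program, data, steps=len(program))
print(hex(a), carry, hex(data[0x11]))
```

Widening the instruction word to 24 or 32 bits would only change `pack` and the unpacking line in this model; the cost in real hardware is the wider memory and address path.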

I understand that this deviates quite a bit from the 6502, but it's interesting to explore ways to get higher speed while keeping to the instruction set we all know so well. I added the SQR instruction since I need to do calculations, but it's in no way finished, and I will probably need to make some changes there.


PostPosted: Mon Nov 13, 2017 9:27 am 
OK, I do see that: a small and potentially tileable acceleration engine.


PostPosted: Fri Apr 27, 2018 4:12 am 
I have to comment on the speed of this core. If you synthesize it, the tools will not report a maximum speed anywhere close to what it achieves. In fact, Lattice Diamond reports a maximum of around 20-25MHz based on signal timing and slack. I still ran it at 106MHz without problems (with memory clocked at twice that).

This is possible since the data and code areas are separated; that is, writes that happen late do not interfere with instruction reads. I was thinking of making this into a full 6502 core, which is still possible, but it would impose severe speed limits once the data and code areas were no longer separated. It may be possible to get around this with paged memory: writing to the same page as the code is in would take longer than writing to a separate page. That should (in theory) enable a larger common address space without always slowing it down.
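
The paged-memory idea amounts to a write cost that depends on whether the target shares a page with the executing code. A minimal Python sketch of that rule follows; the 256-byte page size and the cycle counts are placeholder assumptions, since no such scheme exists in the core yet.

```python
PAGE_SIZE = 256
FAST_WRITE_CYCLES = 1     # assumed cost when code and data pages differ
SLOW_WRITE_CYCLES = 2     # assumed cost when a write could collide with instruction fetch

def write_cycles(pc, write_address):
    """Return the assumed cycle cost of a write issued by the instruction at `pc`."""
    same_page = (pc // PAGE_SIZE) == (write_address // PAGE_SIZE)
    return SLOW_WRITE_CYCLES if same_page else FAST_WRITE_CYCLES

print(write_cycles(pc=0x0210, write_address=0x0280))  # same page as the code
print(write_cycles(pc=0x0210, write_address=0x0400))  # different page
```

Code that keeps its data in other pages would then run at full speed, and only self-modifying or same-page writes would pay the penalty.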


PostPosted: Sat Apr 28, 2018 2:33 am 
I don't know that I would rely on the type of behavior that you describe. It is generally an indication that the RTL style / design is confusing the synthesizer. It may be worthwhile studying the design with the objective of identifying the reason that the synthesizer is missing so badly on the predicted speed. In addition, being able to run the core at such a high rate should be setting off alarm bells for you. I was chasing this type of behavior with one of my cores.

After some rework from a dual-cycle to a single-cycle configuration, I finally got the core to have matching predicted, reported, and real speeds. There is a long and rambling discussion regarding that core over on anycpu.com. It exhibited performance roughly twice the predicted performance, very similar to the behavior you reported above.

The two cycle version of the core used a clock enable that put a significant amount of the logic into a clock domain having half the frequency of the system clock. By using both edges of the clock, I was able to use internal Block RAM in the same cycle as the rest of the processor. This allowed the synthesizer to correctly determine the asynchronous logic path delays (in each half of the cycle). The result was that the core's predicted, reported, and actual operating speeds were all consistent.

With the previous implementation of the core, a very slow rate was reported; in a manner similar to what you described above. When I put it on a prototype card, I found that I could boost the clock rate to 117+ MHz. Although it appeared stable at that frequency, when running at these high speeds, some small parts of the logic misbehaved. That was very unsatisfying: being unable to rely on the reported speed from MAP/PAR to operate the SoC in which the core was used.

Therefore, I recommend examining your RTL/implementation for multi-cycle paths/domains in your design. I would remove that coding/design style to the extent possible. Although it may not be universally true, I think a simpler, more consistent design will be more reliable, maintainable, etc. Being unable to rely on the values that MAP/PAR and synthesis provide is a sure way to get into an endless loop that considers the core as erroneous first, and which is hard to break out from.

Since you are looking for ways to put many small processors into your projects, perhaps you should consider a technique known as Simultaneous Multi-Threading (SMT), otherwise known as hyper-threading, or C-slow retiming. Tobias Strauch has a number of articles on the subject, with an emphasis on FPGA implementations. Attached is one of his articles in PDF form.

I've investigated the technique applied to the address generator of my M65C02A core, and there's a marked improvement of the performance. For the cost of some additional pipeline registers, I've been able to boost the speed of the address generator from 25-30 MHz to 250-300 MHz. Thus, the additional complexity of the technique allows the basic core logic to be reused in a manner that effectively creates C cores in slightly more area than 1 core. Each core still runs at 25-30 MHz, but there can be as many as ten simultaneous threads of execution running.
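
The interleaving behind C-slow retiming can be modelled in Python: the logic of one full pass is cut into C slices with a register after each, and C independent thread states circulate through them, so each thread completes one pass every C clocks. The stage functions below are arbitrary stand-ins, not the M65C02A address generator, and `run_c_slow` is a hypothetical name.

```python
def run_c_slow(registers, stages, clocks):
    """Advance a C-slowed datapath by `clocks` cycles.

    stages[i] is the slice of logic that follows pipeline register i;
    the last stage feeds back into register 0. Threads that start in
    later registers are treated as already partway through a pass.
    """
    c = len(stages)
    regs = list(registers)
    for _ in range(clocks):
        # Every register captures the output of the stage before it.
        regs = [stages[(i - 1) % c](regs[(i - 1) % c]) for i in range(c)]
    return regs

# One full pass, x -> (5*x + 1) mod 256, split into two shorter stages (C = 2).
stages = [lambda x: x * 5, lambda x: (x + 1) & 0xFF]

print(run_c_slow([3, 7], stages, clocks=2))   # register 0 holds one full pass applied to 3
print(run_c_slow([3, 7], stages, clocks=4))   # register 0 holds two full passes applied to 3
```

Each stage is shorter than the whole pass, which is where the higher clock rate comes from; with ten stages clocked at 250-300 MHz, every thread still advances at 25-30 MHz.
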
Attachment:
File comment: C-Slow Re-Timing Article - Tobias Strauch
1502.01237.pdf [291.46 KiB]
Downloaded 257 times

_________________
Michael A.


PostPosted: Mon Apr 30, 2018 11:35 am 
Thanks for the notes and the pointer to C-slow retiming, very interesting.

I wish I could do a decent writeup of it, but we had some interesting adventures and developments in the OPC (one-page computing) world. From the usual starting point, where the CPU expected async RAM and so we clocked the on-chip RAM on the opposite edge, we tried with gated clocks and with different phase relationships, and with wait states. Eventually we rejigged the CPU to work with sync RAM. We went from 25MHz, to 50MHz with a wait state, to 50MHz.

It was very useful - almost essential - to have the tools build the CPU as a fully contained clocked design. Having unconstrained paths entering and leaving the chip is bad enough, but if you have combinatorial paths which leave and re-enter then timing analysis doesn't stand a chance.

Having got the CPU into a shape where timing analysis was complete and correct and informative, revaldinho was able to spot that the long path was a false path - a situation which did not happen in reality - and a bit of recoding, in this case to make a register recirculate, was enough to bump the synthesis from 44MHz apparent speed (but working at 50) to 57MHz apparent speed. One problem with a false path being critical is that there might be almost-as-long true paths which are not then minimised because they seem not to be critical. So, fixing up the design to remove false paths results in an actually better implementation.

Long story short: We went from 25MHz, to 50MHz with a wait state (effectively about 33MHz), to 50MHz with no wait state. That was on Lattice, and the final result on Xilinx was 100MHz.

All that said, we found that the IceStorm tools (for Lattice FPGAs) were a bit lacking compared to ISE for Xilinx. It's not so easy to get a timing report for specific signals, and the Place & Route is not timing-aware so you kind of need to run it a dozen times and take the best result - the variation is 10% or more from run to run.


PostPosted: Thu Oct 10, 2019 7:04 am 
I haven't been thinking about the mini-6502 for some time, but it's always in the back of my head. The reason is that I am making a 6502 MMU for another project, one that I hope to combine with the mini-6502.

What does the 6502MMU add to this core? Well, the limit of one page for code and data is certainly very restrictive, and an MMU has always been about handling and extending a CPU's memory needs. So the idea is that a mini6502 + 6502MMU can end up being a full 6502 and even offer an extended addressing range.

In general I hope I can implement it this way:

6502 code --> 6502MMU --> mini-6502

Currently the MMU works this way:

6502 code --> 6502MMU --> 6502

The current implementation of the 6502 MMU takes the $Axxx address and a page to extend the memory to 256MiB. All very classic with regard to MMUs. Then I started reading about the Burroughs B5000 (from 1961), which did all sorts of MMU things without an MMU, and I realized that the mini-6502's limitations could be used to its advantage by adding the 6502MMU into it. The mini-6502 kind of acts as an optimized core in this setup, with the 6502MMU handling the 1-3 byte variable instruction size. It also makes it possible to add other types of memory (flash) to the extended address range.
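
The window-plus-page scheme can be sketched in Python as a simple address calculation. A 4 KiB window at $Axxx gives a 12-bit offset, so reaching 256MiB (2^28 bytes) implies a 16-bit page number; that width, the function name, and the error handling are inferred for illustration and are not taken from the MMU's source.

```python
WINDOW_BASE = 0xA000      # the $Axxx window from the post
WINDOW_SIZE = 0x1000      # 4 KiB, i.e. a 12-bit offset
PAGE_BITS = 16            # assumption: 2**16 pages * 4 KiB = 256 MiB

def window_to_physical(cpu_address, page):
    """Map a CPU address inside the $Axxx window to a 28-bit extended address."""
    if not WINDOW_BASE <= cpu_address < WINDOW_BASE + WINDOW_SIZE:
        raise ValueError("address is outside the paged window")
    if not 0 <= page < (1 << PAGE_BITS):
        raise ValueError("page number out of range")
    return (page << 12) | (cpu_address - WINDOW_BASE)

print(hex(window_to_physical(0xA123, 0x0001)))
print(hex(window_to_physical(0xAFFF, 0xFFFF)))   # top of the 256 MiB range
```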

For now it's all in my head, but I think this can turn out to be a really interesting implementation. That is, if it can be made (and I am certain it can).

Suggestions and ideas are always welcome.

