
All times are UTC




PostPosted: Tue Aug 11, 2020 11:15 am 

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
65f02 wrote:
JimDrew wrote:
I would like to have some flexibility here to have some write-through caching.
Hmm, at what level would you do the caching? Not the low-level disk I/O addresses, I assume. And even the RWTS sector buffer is not likely to see a write-and-read-back-immediately pattern all that often, I'd think.
Are you planning to keep track of the track/sector numbers accessed by the RWTS routine, and buffer many sectors? That should work, but I would argue that it does not "belong" in a CPU accelerator, because it would need to have quite a bit of knowledge about the software (RWTS parameters and buffer use).

I am tailoring my device to work with (and know about) a lot of different hardware. It's not designed to be a generic "unaware" device. It is programmable from the device itself. My original intention really was to have a diagnostic tool for stand-up arcade machines, with memory maps and information for hundreds of machines contained within the device. It just kind of mutated into an accelerator because I found that I could emulate the 6502 instructions ridiculously fast. :)

The write-through would work just like caching on other CPUs (I have a lot of experience with CPU emulation of the Motorola 68030/040/060 and Intel 80x86 through Mac and PC emulations I have written). Reads of data would come from internal (fast) memory that mirrors external (slow) memory. You would have to map the region as write-through/cacheable. This would help code that reads from graphics memory in order to update the same graphics memory (scrolling, rotating, etc.).
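A minimal software sketch of the write-through scheme described above, in Python rather than FPGA logic. The class name `WriteThroughRegion`, the `slow_read`/`slow_write` callbacks, and the choice of $2000-$3FFF as the cached region are illustrative assumptions, not details of the actual device.

```python
class WriteThroughRegion:
    """Fast mirror of a slow external memory region (illustrative model only)."""

    def __init__(self, base, size, slow_read, slow_write):
        self.base = base
        self.size = size
        self.slow_write = slow_write
        # Fill the mirror once so that later reads never touch the slow bus.
        self.mirror = bytearray(slow_read(base + i) for i in range(size))

    def contains(self, addr):
        return self.base <= addr < self.base + self.size

    def read(self, addr):
        # Reads are served from the fast internal copy.
        return self.mirror[addr - self.base]

    def write(self, addr, value):
        # Writes update the mirror and are also passed through to external memory.
        self.mirror[addr - self.base] = value & 0xFF
        self.slow_write(addr, value & 0xFF)


# Hypothetical external memory, with counters for slow bus cycles.
external = bytearray(0x10000)
bus_cycles = {"read": 0, "write": 0}

def slow_read(addr):
    bus_cycles["read"] += 1
    return external[addr]

def slow_write(addr, value):
    bus_cycles["write"] += 1
    external[addr] = value

region = WriteThroughRegion(0x2000, 0x2000, slow_read, slow_write)
bus_cycles["read"] = 0  # ignore the one-time initial fill

# Read-modify-write of one byte, as when setting a pixel.
addr = 0x2100
region.write(addr, region.read(addr) | 0x01)
```

In this model the read-modify-write costs one slow bus cycle (the write) instead of two. The mirror only stays valid as long as nothing else on the bus writes to the mapped region.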


Last edited by JimDrew on Tue Aug 11, 2020 11:33 am, edited 1 time in total.

PostPosted: Tue Aug 11, 2020 11:26 am

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
I am a little confused! I thought you were this project: https://hackaday.io/project/165624-mock ... eplacement

I looked at your board layout from a few pages back and it's obviously not this one! There are two of you doing the same FPGA thing?!?! Your board looks like it is using the TSX/TXE/LSF0108 level translators. Which are you using, and how are you wiring those (pull-ups on both sides with an open-collector setup, or some push-pull configuration)?


PostPosted: Tue Aug 11, 2020 11:36 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
JimDrew wrote:
I noticed you are (likely) using the SN74CBTD3861 bus switches for your level shifting. How are you wiring these so you can get true 5v outputs? I looked at these and the downfall (noted in the various TI forums) is that you don't get a full 5v output, the diode clamped side is ~3v, and that voltage is affected by the voltage on the other side of the switch as well. We don't need much current, but the voltage levels have to be correct. I have used a variety of things and I ended up using a level shifter with dual voltage inputs to have proper 5v and 3.3v rails. I am curious what you did that everyone else seems to have problems with.

To translate the outputs from 3V to 5V, I use plain old 74HCT245 bus drivers (four in total for data, addresses, and control/clock signals). For the inputs, two 74LVC245 powered by 3.3V; these have 5V-tolerant inputs. All are available in the very compact QFN package from Nexperia.

For the data bus I had looked into "proper" dual-supply level converters -- TI's SN74LVC8T245 or Nexperia's 74AXP8T245. But these turned out to be difficult to source, and more importantly the slightly larger form factor didn't fit my layout well. So I stuck with two separate '245 drivers for input/output, which are mounted back-to-back (one on the top, one on the bottom) to keep the routing manageable.

The input threshold of the LVCs is at 1.6V. Nice for TTL environments; not perfect but quite workable for CMOS. No pullups are used.


PostPosted: Tue Aug 11, 2020 11:48 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
JimDrew wrote:
I am a little confused! I thought you were this project: https://hackaday.io/project/165624-mock ... eplacement
I looked at your board layout from a few pages back and it's obviously not this one! There are two of you doing the same FPGA thing?!?!

That's a different project indeed, and I only came across it a few weeks ago. The focus seems a bit different -- not on acceleration, but on replicating a range of hard-to-obtain vintage CPUs -- but the resulting hardware solution is obviously similar. The "MockA65" needs configurable pin assignments for the different CPU types it supports, which I think drives its take on level conversion.

EDIT: Just noticed that a schematic has been posted for the MockA65; that's a recent addition to the page. It uses 74CBTD3861 level shifters. I have never looked into these, and only glanced at the schematic quickly. But at first glance it seems that the direction of each pin is actually fixed and can't be configured as I had assumed?


PostPosted: Tue Aug 11, 2020 7:34 pm

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
Thanks for the info. I looked at the 74CBTD3861 level shifters as well. The TSSOP is the smallest footprint they make. You can see from the MockA65 schematic that every pin of the mock CPU has one level shifter (even VCC and GND???) along with a 10K pull-up resistor to +5V for all 40 CPU pins. The 74CBTD3861 is a bidirectional, automatically sensing level shifter. It requires pull-ups to your required voltage level, as it acts strictly as an active-low switch.

In my device I use two 74ACHT541s for the address bus, a Nexperia 74LVC8T245 for the data bus, and five SC-70 size 74LVC2G17 parts for the remaining miscellaneous pins. The 74LVC8T245 is easy to get and works well, but you're right, the smallest package (24-pin TSSOP) is too large for something where you need to keep all of the components within the 0.6" width. I have never been concerned with the size of my device because I have always had a large board (more than double the width of the 6502) to make debugging easy. Now I am looking at consolidating the board and trying to fit everything within the normal 6502's size. I did find the MockA65's space solution rather interesting. It uses a 20-pin machined-pin male SIP header on each side, with the center 18 pins clipped. Each end of the SIP is through-hole and anchors it to the board, and the center 18 pins are soldered to the board like an SMT component. I wonder how robust this will be with a lot of insertions/removals? This approach does allow for a much larger surface area, reducing the routing headaches.


PostPosted: Sun Aug 23, 2020 10:03 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Here's an Apple II benchmark result I had meant to post for a while -- with thanks to BigEd, who suggested caching the video RAM (on the first page of this thread).

"Brian's Theme" was a graphics demo that shipped with the Apple, programmed by Brian Howard. It would draw sets of closely spaced radial lines from a random center point. The Apple's limited graphics resolution translated into nice moiré effects. It looked like this (not my video): https://www.youtube.com/watch?v=2SQGNY_shZw

I timed 20 repeats of Brian's plots on my Apple II plus:
Original 1 MHz speed: 390 seconds
100 MHz, reading and writing the video RAM externally: 16.5 seconds
100 MHz, caching the video RAM so reads are fast: 7 seconds

The caching brings more acceleration than I had expected -- apparently the code spends more time reading the video RAM than writing to it. So thanks for the idea, Ed!
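For reference, the speedup factors implied by the three timings above can be worked out directly. The variable names are illustrative.

```python
# Timings for 20 repeats of "Brian's Theme", in seconds, as quoted above.
t_original = 390.0   # stock 1 MHz
t_uncached = 16.5    # 100 MHz, video RAM read and written externally
t_cached = 7.0       # 100 MHz, video RAM reads served from the cache

speedup_uncached = t_original / t_uncached
speedup_cached = t_original / t_cached
gain_from_cache = t_uncached / t_cached

print(f"uncached:   {speedup_uncached:.1f}x")
print(f"cached:     {speedup_cached:.1f}x")
print(f"cache gain: {gain_from_cache:.2f}x")
```

That is roughly 24x without the cache and 56x with it, so the read cache alone accounts for a factor of about 2.4.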


PostPosted: Sun Aug 23, 2020 12:54 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
Simply outstanding!


PostPosted: Sun Aug 23, 2020 4:28 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Nice result! Oddly enough, in our own experiments, we're evidently not finding enough benchmarks that do enough reads, because the caching idea is turning out to bring very little gain... although some of that will have to do with relative speeds. The faster the accelerated machine, the more the slow reads would stand out and be worth avoiding. Your 100MHz CPU in a 1MHz system has a higher gain than our 16MHz or 80MHz accelerator in a 2MHz system.
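The point about relative speeds can be illustrated with a simple additive timing model. This is only a rough sketch: the functions `runtime` and `cache_gain`, the workload numbers, and the assumption that each external access costs exactly one host bus cycle are all illustrative, not measurements from either project.

```python
def runtime(core_cycles, ext_reads, ext_writes, f_fast_mhz, f_bus_mhz, cache_reads):
    """Estimated runtime in microseconds under a simple additive model."""
    t_fast = 1.0 / f_fast_mhz   # one internal cycle, in microseconds
    t_bus = 1.0 / f_bus_mhz     # one external bus cycle, in microseconds
    # With the cache enabled, reads cost an internal cycle instead of a bus cycle.
    read_cost = t_fast if cache_reads else t_bus
    return core_cycles * t_fast + ext_reads * read_cost + ext_writes * t_bus


def cache_gain(core_cycles, ext_reads, ext_writes, f_fast_mhz, f_bus_mhz):
    """Ratio of uncached to cached runtime for the same workload."""
    slow = runtime(core_cycles, ext_reads, ext_writes, f_fast_mhz, f_bus_mhz, False)
    fast = runtime(core_cycles, ext_reads, ext_writes, f_fast_mhz, f_bus_mhz, True)
    return slow / fast


# Hypothetical workload: 1000 internal cycles, 30 external reads, 10 external writes.
workload = (1000, 30, 10)
gain_100_in_1 = cache_gain(*workload, f_fast_mhz=100, f_bus_mhz=1)
gain_80_in_2 = cache_gain(*workload, f_fast_mhz=80, f_bus_mhz=2)
gain_16_in_2 = cache_gain(*workload, f_fast_mhz=16, f_bus_mhz=2)

print(f"100 MHz in 1 MHz host: {gain_100_in_1:.2f}x")
print(f" 80 MHz in 2 MHz host: {gain_80_in_2:.2f}x")
print(f" 16 MHz in 2 MHz host: {gain_16_in_2:.2f}x")
```

For the same workload, the modelled gain from caching shrinks as the ratio of accelerator clock to host bus clock shrinks, which matches the trend described above.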


PostPosted: Sun Aug 23, 2020 7:00 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
I tried this as a benchmark since I thought it would highlight the benefits of the cache: no complex computations, and a lot of plotting with access to the graphics RAM. But the effect was more pronounced than I had expected. It seems that for every write access to the RAM, the line-drawing routine requires three read accesses:

4 s would be the "perfect" runtime if everything were accelerated to 100 MHz.
A further 3 s are added with the read cache enabled -- let's assume that's mostly for video RAM writes.
Another 9 s are added on top of that when the cache is disabled -- three times as much as the "penalty" for writes.

Surprising (to me); I would have expected a 1:1 ratio of reads to writes:
"read a byte, set the additional pixel, write the byte".
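The breakdown above can be checked with a few lines of arithmetic. This is a sketch under the post's own assumption that the 3 s term is mostly video RAM writes, and under the further assumption that an external read and an external write cost the same bus time. The variable names are illustrative.

```python
t_ideal = 4.0          # everything at 100 MHz
t_write_penalty = 3.0  # added with the read cache enabled (assumed: mostly writes)
t_read_penalty = 9.0   # added again when the read cache is disabled

t_cached = t_ideal + t_write_penalty
t_uncached = t_cached + t_read_penalty

# If reads and writes cost the same bus time, the time ratio is the access ratio.
reads_per_write = t_read_penalty / t_write_penalty

print(t_cached, t_uncached, reads_per_write)
```

The sums give 7 s with the cache and 16 s without it, close to the measured 7 s and 16.5 s, and a ratio of three reads per write.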


PostPosted: Thu Aug 27, 2020 11:59 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
I am following your project with great interest and this may be what I need to finally learn FPGA.

hoglet had a good idea for using it to implement an ICE: viewtopic.php?p=76834#p76834

Other things I find interesting are:

* accelerators for other processors such as 6800, 6809, 6309, 8080, 8085, Z80, 9900, 1802.
* Frankenchips with the pinout of processor X that run the instruction set of processor Y
* a version of an ATMEL AVR 328p with close to 64K of SRAM.


PostPosted: Fri Aug 28, 2020 11:33 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Bill, thank you for your interest! I can certainly recommend learning FPGA development, whether for this flavour of project or other real-time designs.

Where would you use a "Frankenchip" which combines one chip's pinout with another chip's instruction set? It seems to me that this would imply a custom computer design -- in which case, why not use the "right" pinout for the chip? Or, if you want to experiment with various CPUs, design pinout adapters for the original CPUs?

For the large SRAM Atmel, what applications would you be thinking of? 64k of RAM will only work if you don't need any ROM, limited by the Atmel address space as well as the on-chip memory in a Spartan-6 (in this package). Do you have applications in mind which would benefit from the large RAM while running a very small program?

Another concern here would be the form factor. I don't think this PCB can be made much smaller than DIP-40, so it can't be a drop-in replacement for the Atmel ATmega 328. But if a custom host design is required anyway, why "fake" an Atmel 328, rather than starting with a suitable processor?

I still like hoglet's ICE idea; there obviously is a lot of commonality between an ICE and the accelerator hardware. But there are also significant gaps one still would have to fill: adding a fast interface and PC software. Probably not something I will get around to any time soon. But I would be pleased to see someone else take my design as a starting point and run with it. I will make it available "real soon now", but want to add a couple of features and clean things up first.


PostPosted: Fri Aug 28, 2020 12:11 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
65f02 wrote:
Where would you use a "Frankenchip" which combines one chip's pinout with another chip's instruction set? It seems to me that this would imply a custom computer design -- in which case, why not use the "right" pinout for the chip? Or, if you want to experiment with various CPUs, design pinout adapters for the original CPUs?


Just a wild idea to play with different processor instruction sets without having to have many different computers. Just swap the CPU, monitor ROM and boot a different operating system.

65f02 wrote:
For the large SRAM Atmel, what applications would you be thinking of? 64k of RAM will only work if you don't need any ROM, limited by the Atmel address space as well as the on-chip memory in a Spartan-6 (in this package). Do you have applications in mind which would benefit from the large RAM while running a very small program?

Another concern here would be the form factor. I don't think this PCB can be made much smaller than DIP-40, so it can't be a drop-in replacement for the Atmel ATmega 328. But if a custom host design is required anyway, why "fake" an Atmel 328, rather than starting with a suitable processor?


The AVR instruction set is very nice for an 8-bitter:

* one cycle each for many instructions
* 32 8-bit registers, 16 of which are full featured
* 3 register pairs can be used for indexing
* Harvard architecture which is sometimes a pain, but it does allow for up to 128K of program space and almost 64K of RAM with 16-bit addresses.
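The address-space figures in the last bullet follow from the AVR's word-addressed program memory and byte-addressed data space. A quick check, assuming the ATmega328P data memory layout in which the first 0x100 data addresses are taken by the register file and I/O registers (the constant names are illustrative):

```python
ADDRESS_BITS = 16

# Program memory is addressed in 16-bit words, so each address covers 2 bytes.
program_bytes = (1 << ADDRESS_BITS) * 2

# Data space is byte-addressed; on the ATmega328P the first 0x100 addresses
# hold the 32 registers, 64 I/O registers, and 160 extended I/O registers.
RESERVED_DATA_BYTES = 32 + 64 + 160
max_sram_bytes = (1 << ADDRESS_BITS) - RESERVED_DATA_BYTES

print(program_bytes // 1024, "KiB of program space")
print(max_sram_bytes, "bytes of SRAM at most")
```

This gives 128 KiB of program space and 65280 bytes of RAM, which is the "almost 64K" mentioned above.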

I did not realize that an FPGA cannot be had with more than 64K of memory. My thought is for a pseudo-AVR with static RAM for both SRAM and most of the program memory. A small bootstrap loader would be kept in the equivalent of flash program memory. Its function would be to boot an operating system off of an SD card on power-up, much like a traditional computer. Instead of "burning" one program into flash memory, it would load them from mass storage.


PostPosted: Fri Aug 28, 2020 12:23 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
One more thing: if the FPGA is built into a PCB in the form factor of an Arduino shield, it can be easily plugged onto an Arduino Uno with the ability to stack more shields on top of it. Pin headers on the bottom of the shield would plug into the 328p socket.


PostPosted: Fri Aug 28, 2020 2:51 pm

Joined: Mon Sep 17, 2018 2:39 am
Posts: 133
Hi!
BillG wrote:
I did not realize that an FPGA cannot be had with more than 64K of memory. My thought is for a pseudo-AVR with static RAM for both SRAM and most of the program memory. A small bootstrap loader would be kept in the equivalent of flash program memory. Its function would be to boot an operating system off of an SD card on power-up, much like a traditional computer. Instead of "burning" one program into flash memory, it would load them from mass storage.

There are FPGAs with embedded SRAM.

For example the iCE40 up5k is a small FPGA with 128kB of SRAM (in addition to the 15kB of embedded RAM). The Upduino board is US$7.99 plus shipping, so this is really cheap. I made a 6502 computer at 12.5MHz with VGA output, 64kB of main RAM plus 64kB of video RAM, 256 bytes of boot ROM, serial, PS/2 keyboard and using the included SPI flash for program storage, all using one third of the available FPGA resources: https://github.com/dmsc/my6502

Have Fun!


PostPosted: Fri Aug 28, 2020 4:07 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
dmsc wrote:
For example the iCE40 up5k is a small FPGA with 128kB of SRAM (in addition to the 15kB of embedded RAM). The Upduino board is US$7.99 plus shipping, so this is really cheap. I made a 6502 computer at 12.5MHz with VGA output, 64kB of main RAM plus 64kB of video RAM, 256 bytes of boot ROM, serial, PS/2 keyboard and using the included SPI flash for program storage, all using one third of the available FPGA resources

Nice! I originally started my 65F02 concept around the iCE40, since it is cheap and small, and comes in an easily soldered package. But synthesizing Arlet's core for it suggested that the iCE40 architecture would be slower than what I was looking for; so I am actually reassured that you got to similar speeds in a real-world test.

Your design including the RAM is about as fast as I got with the CPU core only, if I recall correctly. So the routing penalty to access the RAM seems smaller than in the Spartan-6, probably because the RAM is not as scattered across the chip? (Or is it because any routing delays are dominated by logic/switching delays anyway?)

