All times are UTC




PostPosted: Thu Jul 13, 2023 7:24 pm 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
I'm thinking I'd like to have a machine that uses the 65C02, preferably at 14 MHz if possible, uses SRAM, and has a Propeller 2 attached. It seems this would help keep the total chip count down. The advantages would be not needing VIAs, not having the headaches of various sound chips that won't like the bus speed, being able to have separate video RAM, etc. Plus, there may be times to program in logic shortcuts to where P2 peripherals can talk to each other without the 65C02 being involved.

Yes, I know, it would probably be a lot easier if I just had a 6502 cog, but I'm interested in a more traditional board (and yes, the extra work). The P2 could just be this one multi-I/O controller "ASIC."

I wonder if it would be feasible to run the 6502 at 14 MHz and run the P2 at over 300 MHz, using bus-snooping as the primary data transfer strategy. The SRAM could have a window for talking to the P2 and it can copy what is relevant into the hub RAM. The video cog can get it from the hub. Really, there should be more than one video cog. I mean like have a core just for sprites, primitives, text, etc. Then it copies to the hub as it works so the display cog can use it. The sound would have 1 or more cogs. It would be nice to do the sound the same way as proposed for the video. I mean having at least 1 cog for sound generation and a cog to externally modulate it to make more complex sounds. So a sound chip and a coprocessor for it.

The idea is to do asynchronous bus-snooping where possible to send data to the P2. But for communicating back, such as loading from an SD (or even CF card if there are enough pins) or eventual math coprocessing, some other strategy such as bus-mastering DMA could be employed.

A math coprocessor idea I've had is to leave some RAM register space for one and communicate via snooping. For that to work, I'd say load the operands first then load the "opcode" last. That way, the data is in place on the other side so it can work as soon as it gets the instruction. If nothing else, DMA can be used to send the result back. I'm unsure of how to best incorporate a spinlock or whatever. I'd like to hear ideas about a good strategy to return the results. It doesn't have to be DMA, it could be memory-mapped or something.
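A minimal Python sketch of the "operands first, opcode last" idea, modelling only the P2 side of the snooping. The register addresses, the opcode numbers, and the `SnoopedCoprocessor` class are hypothetical and not taken from any real firmware. How the result gets back to the 6502 (DMA or a memory-mapped read) is left open, so the sketch just collects the result bytes in an `outbox`.

```python
# Hypothetical register layout inside the snooped window (not from the thread).
OPERAND_A = 0xFE00   # 16-bit operand A, low byte first
OPERAND_B = 0xFE02   # 16-bit operand B, low byte first
OPCODE = 0xFE04      # written last; this write starts the operation
RESULT = 0xFE06      # 32-bit result, low byte first

OP_ADD, OP_MUL = 1, 2   # made-up opcode numbers


class SnoopedCoprocessor:
    """Models the P2 side: it only ever sees (address, data) write pairs."""

    def __init__(self):
        self.regs = {}     # shadow copy of the snooped register window
        self.outbox = {}   # result bytes waiting to be returned to the 6502

    def on_bus_write(self, addr, data):
        self.regs[addr] = data & 0xFF
        if addr == OPCODE:           # operands are already in place
            self._execute(data)

    def _word(self, base):
        return self.regs.get(base, 0) | (self.regs.get(base + 1, 0) << 8)

    def _execute(self, opcode):
        a, b = self._word(OPERAND_A), self._word(OPERAND_B)
        if opcode == OP_ADD:
            result = a + b
        elif opcode == OP_MUL:
            result = a * b
        else:
            result = 0               # unknown opcode: return zero in this sketch
        for i in range(4):
            self.outbox[RESULT + i] = (result >> (8 * i)) & 0xFF
```

Because the opcode write is the trigger, the 6502 can store the operand bytes in any order and at any pace, as long as the opcode store comes last.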

Now, I've never worked with a 6502, and I am glad that the 65C02 has extra lines that I'd be interested in using that the original chip lacked.

So what is the correct way to drive SRAM from a 6502? I know it has only the RWB line for that, and SRAM has the /WE and /OE lines. I saw in one schematic where it and the inverse of it were each NANDed with the clock.

I also have general questions about SRAM. Since I'd want the P2 to possibly do DMA, in what order would it need to send the SRAM signals? The P2 wasn't specifically made to handle external memory, so I'd need to know what "protocol" to use. I mean, what can be set at the same time, and what has to be sent before the rest, etc?


PostPosted: Thu Jul 13, 2023 8:44 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
The concept of using another CPU/MCU as a "host" to the 65C02 isn't new (See BBC Micro c1981) - it's also well tried. Go for it.

I do it myself, but not with a P2 but with a much more feeble ATmega1280p. (I did initially have it generating video, but it was a bit slow). I also use it to offload floating point in my BCPL system (with a 65c816). Offloading FP for the various BASICs might be possible but expect to do a lot of work there yourself...

I have a 256 byte memory window that's shared between the 6502/816 and the host. It's at $FFxx and works in a mutually exclusive manner - ie. the ATmega can read/write the RAM or the 6502/816 but not both at the same time. This works OK, not perfect, but it's easy. A[8:15] are pulled high via a 10K resistor pack and the ATmega sees A[0:7] and D[0:7] via 2 x 8-bit ports which I can tri-state on the ATmega. It also drives the RAM /Wr and /Rd signals which are again tri-stated when the 6502/816 is running.

I use a single GAL to turn the 6502 side R/W and clock into the memory /Rd and /Wr signals and the address decoding (I have a VIA) and to handle IRQs. (The '816 board has a 2nd GAL for the upper address bits latching)

The ATmega controls the 6502/816 Reset and BE lines and listens to the Rdy line. At power-on, it holds the 6502/816 in reset, attaches itself to the RAM, pokes in a bootloader and vectors, then lets the 6502 come out of reset. The 6502 uses the WAI instruction to signal to the ATmega that it needs something - so the ATmega basically sits in a tight loop, polling the Rdy line. When it goes low, it pulls BE low, attaches itself to the bus, reads the command, does the action which may involve writing data back to the RAM, then releases itself, lets BE go high then sends an NMI to the 6502/816 and off it goes... So no need for high speed bus polling, etc. Although you might want to do that if you're using the on-board RAM in the P2 as active RAM for the 6502.
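The request/service sequence described above can be sketched as a small Python model. This is only an illustration of the ordering of events: the mailbox addresses, the `signals` dictionary, and the `handler` callback are all hypothetical, and real firmware would be toggling pins rather than dictionary entries.

```python
# Hypothetical mailbox locations inside the shared $FFxx window.
CMD_ADDR, ARG_ADDR, REPLY_ADDR = 0xFF00, 0xFF01, 0xFF02


def host_service(ram, signals, handler, log):
    """One pass of the host's polling loop; returns True if a request was served."""
    if signals["RDY"] != 0:
        return False                  # 6502 is running, nothing to do
    signals["BE"] = 0                 # 6502 releases the bus
    log.append("BE low")
    reply = handler(ram[CMD_ADDR], ram[ARG_ADDR])
    ram[REPLY_ADDR] = reply & 0xFF    # host writes the answer into shared RAM
    log.append("served")
    signals["BE"] = 1                 # hand the bus back
    log.append("BE high")
    signals["NMI"] = 0                # pulse NMI to end the WAI
    signals["NMI"] = 1
    signals["RDY"] = 1                # model only: the 6502 resumes, RDY goes high
    log.append("NMI")
    return True
```

The point of the ordering is that the host only touches the RAM between "BE low" and "BE high", so the two sides never drive the bus at the same time.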

I did think of the Propeller at one point, (P1) but having zero experience of it, decided to use the ATmega which I have shed-loads of experience with. Lazy/Easy solution, I guess.

As for shared video RAM - see threads elsewhere - the issues are/will always be things like just how much RAM and that's resolution and colour depth dependent - 9KB will get you 320x240x1bpp and it goes up from there. Then there's the cost/time for the 6502 to do the pixel poking - especially if it's not 8bpp then you need the read/modify/write cycle for each pixel and so on.
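To make the read/modify/write point concrete, here is a short Python sketch contrasting an 8bpp pixel store with a 1bpp one. The function names and the bit order (leftmost pixel in bit 7) are assumptions for illustration only.

```python
WIDTH, HEIGHT = 320, 240


def set_pixel_8bpp(fb, x, y, colour):
    """One byte per pixel: a single store, no read needed."""
    fb[y * WIDTH + x] = colour & 0xFF


def set_pixel_1bpp(fb, x, y, on):
    """Eight pixels per byte: read the byte, change one bit, write it back."""
    bit = y * WIDTH + x
    index, mask = bit // 8, 0x80 >> (bit % 8)   # leftmost pixel in bit 7 (assumed)
    value = fb[index]                           # read
    if on:
        value |= mask                           # modify: set the bit
    else:
        value &= ~mask & 0xFF                   # modify: clear the bit
    fb[index] = value                           # write


mono_fb = bytearray(WIDTH * HEIGHT // 8)   # 9600 bytes, the "9KB" figure
colour_fb = bytearray(WIDTH * HEIGHT)      # 76800 bytes at 8bpp
```

On a real 6502 the 1bpp case also needs the address and mask computed for every pixel, which is where most of the time goes.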

I decided to send high-level commands to the video system via serial (the ATmega is also the serial port) and let it do the hard work of drawing lines, characters, circles, sprites, etc. So for you, the P2 can be doing that, leaving more RAM for the 6502 - it all depends on just how much effort you want out of the 6502 side of things.

Speed-wise, I run my boards at 16 MHz. Simple double-sided PCBs.

Hope it goes well - do keep us informed.

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Fri Jul 14, 2023 12:43 am 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
I wasn't asking about shared video RAM. See, the P2 hub should work just fine for video RAM. It is an inherent feature of the P2. The hub RAM will be the slowest, but that shouldn't be a problem at the resolutions I'd be interested in.

The idea is to passively read the SRAM on the board. It seems that if one's bus is slow enough, then bus-snooping is a valid DMA strategy that will increase host CPU performance since you don't have to deal with the memory twice (ie., the program writes to RAM then bit-banging or similar reads the RAM again). You can simply read it as it writes the first time. The only downside would be needing more RAM elsewhere, and in this case, the P2 already has you covered. So have a snooper core with only the job of monitoring and writing from the bus. I don't know if operating the P2 at 22-24 times the FSB would be fast enough since you only have time for half that many instructions, which is barely enough for a hub access. But really, in the worst case, there is the RDY line.
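The arithmetic behind "22-24 times the FSB" can be written out as a tiny Python helper. The function name is made up, and the 2-clocks-per-instruction figure is the rule of thumb used in this thread rather than a measured number.

```python
def snoop_budget(p2_hz, cpu_hz, clocks_per_instruction=2):
    """Return (P2 clocks, P2 instructions) available per 6502 bus cycle."""
    clocks = p2_hz // cpu_hz
    return clocks, clocks // clocks_per_instruction


# A 14 MHz 6502 with the P2 at 24x and at 22x the bus clock.
fast = snoop_budget(336_000_000, 14_000_000)   # (24, 12)
slow = snoop_budget(308_000_000, 14_000_000)   # (22, 11)
```

So the snooper cog gets roughly 11 or 12 instructions per bus cycle, before any hub access latency is counted.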

Yes, leaving a page for communication sounds about like what I'd probably use if I were to do this.

And yes, doing primitives and/or a display list is part of the idea. For bitmaps, the idea would be to move stuff through the window in a fashion similar to the X16 and the Vera board. So software would specify the address on the P2 and maybe have an increment mode. So you don't have to do the costly address moves often since the video location registers would already be written to.
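A rough Python sketch of such a window with an auto-incrementing data port follows. The register offsets and the `VideoWindow` class are hypothetical; this is not the actual Vera register layout, just the same general idea.

```python
class VideoWindow:
    """Hypothetical register file: three address bytes, a step, and a data port."""

    ADDR_LO, ADDR_MID, ADDR_HI, STEP, DATA = range(5)

    def __init__(self, video_ram_size):
        self.video_ram = bytearray(video_ram_size)
        self.addr = 0
        self.step = 1

    def write(self, reg, value):
        value &= 0xFF
        if reg == self.ADDR_LO:
            self.addr = (self.addr & 0xFFFF00) | value
        elif reg == self.ADDR_MID:
            self.addr = (self.addr & 0xFF00FF) | (value << 8)
        elif reg == self.ADDR_HI:
            self.addr = (self.addr & 0x00FFFF) | (value << 16)
        elif reg == self.STEP:
            self.step = value
        elif reg == self.DATA:
            # Store the byte, then advance so the next write needs no address setup.
            self.video_ram[self.addr % len(self.video_ram)] = value
            self.addr = (self.addr + self.step) & 0xFFFFFF
```

After the address is set once, a run of pixels costs the 6502 one store per byte to the same `DATA` location.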

As a side thought, some bitmap compression might not be hard. Like sacrificing a color as a magic number and the next byte could be a repeat number to be expanded on the other end. But yeah, doing primitives and display list stuff would be good for games/demo where the program makes the graphics.
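A minimal Python sketch of that compression idea. The exact byte layout is an assumption: here a run is sent as the magic byte, then a repeat count, then the colour to repeat, and `0xFF` is arbitrarily chosen as the sacrificed colour.

```python
MAGIC = 0xFF   # the sacrificed colour index (assumed value)


def rle_encode(pixels):
    """Encode runs as MAGIC, count, value; short runs are sent as plain bytes."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i]
        if value == MAGIC:
            raise ValueError("the magic colour cannot appear in the image")
        run = 1
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        if run >= 4:                       # a run costs 3 bytes, so only use it when it saves space
            out += bytes([MAGIC, run, value])
        else:
            out += bytes([value]) * run
        i += run
    return bytes(out)


def rle_decode(data):
    """Expand MAGIC, count, value triples; copy every other byte unchanged."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MAGIC:
            count, value = data[i + 1], data[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

The decoder is the part that would live on the P2 side; the 6502 only has to push the already-encoded bytes through the window.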

The most daunting part would be learning P2ASM and creating the firmware. I've never coded for the P2 and so I'd certainly need to play around with one first. The easiest part, in the beginning, is to use a prototyping board. Then later, the hard part would be if I wanted a PCB just for that. That would be tougher as that would mean PCB design, SMD assembly, and other considerations (voltage domain crossing, for instance). Going that far, one might as well use a square 6502 (lower capacitance between pins).


PostPosted: Fri Jul 14, 2023 7:10 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Interesting project!

Yes, snooping should be workable - I think we've seen projects recently which use a Pico (or RP2040) to do this. For example here, although that's very much slower than 14MHz. Feels fairly likely the P2 will be fast enough, although you might be surprised that you don't have a great deal of slack. (Although, again, 14MHz is very fast for this kind of thing!)

One big variable is this: whether the 6502 system is free-running, or whether you clock it from the P2. In the latter case, you can always relax the timing. As you say, RDY is also available, and that's quite similar, although it's different because you have a strict deadline to deassert RDY in the cycle that you're expecting to need to slow down. In the case of driving the clock, you simply choose not to take the clock low until you're ready. (I say "simply" - I'm not sure how complex your P2 code might end up!)

Another variable: if your peripherals only need to detect peripheral writes, and not service peripheral reads, that's a big difference. You only need to capture the write within a cycle, and you have at least 4 cycles (I think, maybe 3) before you might see the next write. You could even put the writes into a queue.
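The write-queue idea might look something like this in Python. It is a sketch under assumptions: the `WriteQueue` class, the window base and size, and the queue depth are all invented for illustration, and on a P2 the two halves would be separate cogs rather than two methods.

```python
from collections import deque


class WriteQueue:
    def __init__(self, base, size, depth=16):
        self.base, self.size, self.depth = base, size, depth
        self.pending = deque()
        self.dropped = 0          # writes lost because the queue was full

    def capture(self, addr, data, rwb):
        """Called once per bus cycle by the snooper; must stay cheap."""
        if rwb != 0 or not (self.base <= addr < self.base + self.size):
            return                # a read, or a write outside the window
        if len(self.pending) >= self.depth:
            self.dropped += 1
            return
        self.pending.append((addr, data & 0xFF))

    def drain(self, shadow):
        """Called by slower code; applies queued writes in arrival order."""
        while self.pending:
            addr, data = self.pending.popleft()
            shadow[addr - self.base] = data
```

Counting dropped writes is a cheap way to find out, during testing, whether the consumer side is keeping up.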


PostPosted: Fri Jul 14, 2023 12:17 pm 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
Thank you, Ed. I'm still not sure if I want to build anything at all. I just lost the interest I once had. And in this project idea, is there really a need to have a separate 6502? I mean, the P2 can emulate one just fine, and sticking within healthy overclocking ranges, the P2 can emulate a 14 MHz 6502.

If nothing else, wait states could be added to writes to give more time for the snooper cog to store it.

I wasn't sure of how to do the timing. I was thinking asynchronous from each other. Maybe use an oscillator can and generate the P2's clock in firmware. It is rated for 180 and has been tested up to about 350 MHz. I'd stay under that for reliability and headroom.

I'm not sure how to handle device reads. I understand that the most important things should be reserved in the memory map. So I guess in that case, I could do DMA. What caveats are there here and how is that done? Drive RDY and BE low?

Still, I'm not 100% sure about the part about converting from the RWB line into the /WE and /OE signals that the memory needs. Is there a reason why some NAND those with the system clock? I get using an inverter there. I guess they do that to allow for strobing the lines and to help prevent erroneous reads/writes.

And I don't fully grasp the order that pin signals should be sent to SRAM.

That seems like a nice chip to use for most peripherals, and it has some FPGA features such as programmable PLL/VCO, 4 DACs per cog, so many ADCs, etc. And it has smart pins. So you can set individual pins to generate certain waveforms at certain frequencies, RNG, etc. Large mult/div operations only take 2 cycles. I mean, 32/32/64 multiplications and 64/32/32 divisions. If you need bigger than that or very complex transcendental functions, there is the CORDIC solver in the hub. That is a costly operation since it is around 56 P2 cycles. Most regular instructions take 2 cycles. So if you clock the P2 around 320 MHz, you effectively get the equivalent of 160 MHz and single-cycle instructions.

The reason the P2 would need to be clocked so high is that you'd need time to store things in the hub (though, yeah, if timing is close or a problem, split that into 2 cogs and use the LUT space to pass data to the next cog). "Twin" cogs can share LUT RAM, but the penalty is that it only moves 32 bits at a time. That is fine in this case since you can have 8 data bits and up to 24 address bits in a single LUT register. But then you'd have to copy it to work with it in 8 or 16-bit pieces. I guess that limitation comes from the addressing scheme and native memory map. Cogs are byte-addressable at a minimum, and so is the hub. You can send double words (DWs) at a time if you need to. But treating the hub as bytes means there is less room for LUT registers in the global memory map, so they are DW-addressable.
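The "8 data bits plus up to 24 address bits in one 32-bit register" packing can be sketched in Python as below. The field order (address in the upper 24 bits, data in the low byte) and the function names are assumptions; any fixed layout would do as long as both cogs agree on it.

```python
def pack_bus_event(address, data):
    """Pack a snooped write into one 32-bit value: address in bits 31..8, data in bits 7..0."""
    return ((address & 0xFFFFFF) << 8) | (data & 0xFF)


def unpack_bus_event(word):
    """Split a packed 32-bit value back into (address, data)."""
    return (word >> 8) & 0xFFFFFF, word & 0xFF
```

For example, a write of `0x2A` to address `0xFE04` packs to `0x00FE042A`, and the receiving cog recovers both fields with one shift and one mask.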


PostPosted: Fri Jul 14, 2023 8:08 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
SpottedGal wrote:
Still, I'm not 100% sure about the part about converting from the RWB line into the /WE and /OE signals that the memory needs. Is there a reason why some NAND those with the system clock? I get using an inverter there. I guess they do that to allow for strobing the lines and to help prevent erroneous reads/writes.

You'll want to go through the 6502 primer, indexed at http://wilsonminesco.com/6502primer/ .  As it says, you must have a way to make sure RAM cannot be written when Φ2 is low!  Looking at the 6502's timing diagrams in the data sheet, you will see that the address lines are not guaranteed to be valid and stable before the R/W goes low; so it is possible to write to unintended addresses.  With an extremely simple program that you might use to see if the computer is working at all, the other addresses it writes to might not be ones you're using yet; but soon they will be, and you'll start writing garbage over your variables, or your stack space, or even your program, when you still need those areas to remain intact.  The result will likely be a crash.
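As a small illustration of the gating being discussed (R/W and its inverse each NANDed with the clock), here is a Python truth-table model. It only shows the logic levels, assuming active-low /WE and /OE on the SRAM; it says nothing about propagation delays or setup and hold times, which the data sheets cover.

```python
def nand(a, b):
    """2-input NAND on logic levels 0 and 1."""
    return 0 if (a and b) else 1


def sram_strobes(phi2, rwb):
    """Return (/WE, /OE) for given clock and R/W levels; both strobes are active low."""
    not_rwb = 1 - rwb                 # the inverter on R/W
    we_n = nand(phi2, not_rwb)        # write strobe only while the clock is high and R/W is low
    oe_n = nand(phi2, rwb)            # output enable only while the clock is high and R/W is high
    return we_n, oe_n


# Print the full truth table.
for phi2 in (0, 1):
    for rwb in (0, 1):
        print(phi2, rwb, sram_strobes(phi2, rwb))
```

With the clock low, both strobes stay high whatever R/W is doing, so the RAM cannot be written while the address lines are still settling.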

Quote:
That seems like a nice chip to use for most peripherals, and it has some FPGA features such as programmable PLL/VCO, 4 DACs per cog, so many ADCs, etc. And it has smart pins. So you can set individual pins to generate certain waveforms at certain frequencies, RNG, etc. Large mult/div operations only take 2 cycles. I mean, 32/32/64 multiplications and 64/32/32 divisions. If you need bigger than that or very complex transcendental functions, there is the CORDIC solver in the hub.

If 16-bit scaled-integer is enough (as it is for many applications), you can use the big look-up tables section of my site, at http://wilsonminesco.com/16bitMathTables/ . Looking up the transcendental functions is much faster than CORDIC, because the lookup is just that: a lookup, accurate to all 16 bits, with no interpolation needed, because all the answers are there, pre-calculated. (Note that I did not say "fixed-point." Fixed-point is a limited subset of the more-powerful scaled-integer.) Memory prices and density have improved a lot since I did those, so I keep thinking I should do the tables again for 24-bit. The entire set now (if you even use the whole set) is 2MB. Going to 24-bit would require a 1 or 2GB flash, which is doable now.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Fri Jul 14, 2023 9:34 pm 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
Good point. Well, even the P2 CORDIC solvers are good enough to use alongside a 6502. If you run the P2 at 22-24 times the 6502, then it would take 3 6502 cycles to do the 54 P2 cycles needed. And if you clock the 6502 at vintage speeds, you could do it in a single 6502 cycle. And whatever overhead.
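The cycle conversion above is simple enough to check in a couple of lines of Python. The helper name is made up, and the 54-clock CORDIC figure is the one quoted in this post, not a data sheet value.

```python
def cpu_cycles_needed(p2_clocks, p2_hz, cpu_hz):
    """6502 cycles that elapse while the P2 spends p2_clocks, rounded up."""
    return (p2_clocks * cpu_hz + p2_hz - 1) // p2_hz


at_22x = cpu_cycles_needed(54, 308_000_000, 14_000_000)   # 3
at_24x = cpu_cycles_needed(54, 336_000_000, 14_000_000)   # 3
vintage = cpu_cycles_needed(54, 336_000_000, 1_000_000)   # 1
```

Any bus handshaking or hub transfer time comes on top of these figures.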


PostPosted: Fri Jul 14, 2023 9:44 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
(I plan to reply but ran out of time today)


PostPosted: Sat Jul 15, 2023 4:44 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Let me try to put together a reply...
SpottedGal wrote:
... in this project idea, is there really a need to have a separate 6502? I mean, the P2 can emulate one just fine, and sticking within healthy overclocking ranges, the P2 can emulate a 14 MHz 6502.

There's never a need for a separate 6502, unless that need comes from your motivations at the time. Sometimes you might want a fully embedded emulation, sometimes you might want a six-chip single board computer, sometimes something in between. Personally, I do see the value in projects which are a hybrid, with a real microprocessor, maybe some other real chips like RAM, but also a microcontroller to do the work of other subsystems. Personally, I see value in being able to use a real 6502 bus - whether to monitor the behaviour of the system, or to extend it, or both. Different people are interested in different things, at different times. There is no single right way.

Quote:
If nothing else, wait states could be added to writes to give more time for the snooper cog to store it.

Quite so! I think the easiest approach is for the P2 to control the clock. Another way to start easy is to start with a slow clock speed. You'll find out what the sequence of events is and what's the critical path. Many threads here discuss the relative merits of changing the clocking or using RDY - both are valid.

Quote:
I wasn't sure of how to do the timing. I was thinking asynchronous from each other. Maybe use an oscillator can and generate the P2's clock in firmware. It is rated for 180 and has been tested up to about 350 MHz. I'd stay under that for reliability and headroom.


One datapoint which might be relevant: the PiTubeDirect project uses the 500MHz VideoCore in a Raspberry Pi to interact with the 2MHz bus of the BBC Micro. (Possibly, the timing is such that we only get a half-cycle to act, so the deadlines are a little like a 4MHz bus.) In this case we need to emulate both reads and writes at full speed, and it proves to be quite difficult to meet the timing, but possible. Emulating writes (in effect, actually snooping for writes) is going to be more relaxed. But it still might be surprising how few instructions you can run in the limited time.

Quote:
That [P2] seems like a nice chip to use for most peripherals...

Yes indeed, an interesting chip with lots of facilities. It's similar with other microcontrollers - many facilities. Which means you can't use the chip's constraints to determine the direction of your project - you have to figure out what you want to do with it. (Which also means, you won't be using all of those facilities, and you need to be OK with that.) So, if you want a quick project, you do something simple, if your aim is to learn something, then shape your project accordingly, if you want something you can show to other people, or something to teach with, or something to entertain, or something battery-powered, or something very extensible, well the project changes shape as a result.

When it comes to learning something, you might be wanting to learn about the P2 specifically, or about 6502 systems, or about electronics, or logic, or programming. Lots of possibilities, and only you can decide what you want to do next.


PostPosted: Mon Jul 17, 2023 9:33 am 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
Nice reply. Thanks.

I was thinking, one thing that was nice about retro computers was that they came with BASIC in ROM. I was thinking, BASIC could be incorporated on the P2 side since the peripherals are there, but that begs a new question. What would be the role of the 6502 if the P2 did the peripherals and BASIC?


PostPosted: Mon Jul 17, 2023 9:59 am 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
SpottedGal wrote:
Nice reply. Thanks.

I was thinking, one thing that was nice about retro computers was that they came with BASIC in ROM. I was thinking, BASIC could be incorporated on the P2 side since the peripherals are there, but that begs a new question. What would be the role of the 6502 if the P2 did the peripherals and BASIC?


One thing I have in my Ruby project is a sort of "ROM" filing system where the host MCU (ATmega) can hold images of things (ROMs) that can be copied over to the 6502 RAM and run from there. I had BBC Basic, Applesoft, EhBASIC initially, but as the code in the ATmega grew and I had a good enough filing system I had to make space for just one "ROM" image - that's the Operating System.

So if there is plenty of flash space on the P2 side then loading images into the RAM of the 6502 is a nice and speedy way to bootstrap it with different BASICs for example.

Doing it this way preserves the real 6502 and gives it proper work to do with the P2 just being a somewhat clever peripheral - and if you think back to computers of that time, then the 6502 and RAM was just a small part of the whole thing anyway - the rest being made up from (usually) TTL ICs and discrete components. The Spectrum, BBC Micro (and others?) started to use programmable ICs (ULAs) to reduce the raft of TTL needed, but even in a BBC Micro or Master there are still a lot of additional ICs needed to make the thing work as a whole.

From my point of view, yes, it might be nice to do it all without the ATmega, but really I just wanted a nice little software platform for the 6502 and '816 and the ATmega (and using a GAL or 2) gives me the "glue" needed to realise that.


-Gordon



PostPosted: Thu Jul 20, 2023 1:11 pm 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
That is neat to be able to do things like let the SRAM take up the entire address space with reserved I/O pages. The ROM could be stored on the P2 side and dumped into the external SRAM via DMA before the 6502 is allowed to start. I guess all sorts of interesting things could be done, like even cold-starting the 6502 and configuring the memory map on the fly. So your idea about letting a microcontroller contain the ROMs and copy them as needed sounds like a good one.

I'd need to figure out what features I'd want it to have in terms of video modes and sound channels and strategies. At 14 MHz on the host CPU and a P2 going at maybe 336 MHz, there should be the power to use 320x240 graphics. I don't remember if the P2 has 512K or 1M of hub RAM space, but storing it on the P2 side in an 8-bit bitmap mode would take 75K of space. Depending on how much code needs to be stored in the slower hub RAM, if you can spare 300 KiB, that would be enough room for 4 320x240 pages or a single 640x480 page.
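The page-count arithmetic can be checked with a short Python helper. The function name and the idea of a fixed "budget" carved out of hub RAM are just for illustration.

```python
KIB = 1024


def pages_that_fit(budget_bytes, width, height, bits_per_pixel=8):
    """How many full bitmap pages of the given mode fit in budget_bytes."""
    page = width * height * bits_per_pixel // 8
    return budget_bytes // page


page_320 = 320 * 240 // KIB                       # 75 KiB per 8bpp page
low_res = pages_that_fit(300 * KIB, 320, 240)     # 4 pages
high_res = pages_that_fit(300 * KIB, 640, 480)    # 1 page
```

Whatever the hub size turns out to be, the same helper shows how many pages remain once the firmware's own hub usage is subtracted from the budget.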

Regarding the P2, each cog has 512 longs which gives 2K, and it can be accessed as bytes, words, or doubles. Each cog also has 2K of LUT space. The LUTs can only be addressed as doubles. The LUTs can be shared with cogs that share the upper 2 address bits. I guess, for a text mode, the character map can be placed in the LUT in one of the video cogs. That is just enough room. Assuming 256 characters, you'd probably need 64 bits or a quad, so 2K would be enough. Just store each character in 2 LUT addresses. The hub also has 512K or 1M (I forgot which). The hub RAM is the slowest due to the P2's best feature. Essentially, the hub functions as concurrent DMA. Eight-channel memory would be slow.
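A Python sketch of the "two LUT addresses per character" layout. It assumes 8x8 glyphs stored one byte per row, with the first four rows in one 32-bit long and the last four in the next, low byte first; the function names and that byte order are assumptions for illustration.

```python
def glyph_to_longs(rows):
    """Pack 8 row bytes into two 32-bit longs (rows 0-3, then rows 4-7)."""
    low = int.from_bytes(bytes(rows[0:4]), "little")
    high = int.from_bytes(bytes(rows[4:8]), "little")
    return low, high


def build_lut(font):
    """font is a list of 256 glyphs, each a list of 8 row bytes."""
    lut = []
    for rows in font:
        lut.extend(glyph_to_longs(rows))
    return lut


def glyph_row(lut, char_code, row):
    """Fetch one row of one character the way a video cog might."""
    long_value = lut[char_code * 2 + row // 4]
    return (long_value >> (8 * (row % 4))) & 0xFF
```

With 256 characters at 2 longs each, the table is exactly 512 longs, which is the 2K figure mentioned above.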

I don't know what I'd want to use for sound. I mean, I'd like better than what was available in the day, or at least more of it. For some musical scores, four channels are not enough, though you can arpeggiate them if you have to. An example would be the soundtracks in Doom. E1M1 uses 5 channels, according to the sheet music. And of course, the TI chip only provided 3 melody channels. The 4th was for noise. I don't know which tone strategy I'd want, such as PCM, FM, or some combination.


PostPosted: Wed Aug 02, 2023 12:35 am 

Joined: Sun Dec 15, 2019 12:08 am
Posts: 19
I've also wondered if I should do something with a Gigatron TTL and P2, or make something just on a P2.

Emulating sound chips on the P2 is good and you can get around the issues that machines like the X16 have to work around. A lot of old sound chips don't like doing over 4 MHz.

I've heard about machines using a 6502 and the P1 controller. The P1s are nice in that they already have VGA facilities, with 64 colors, hardware syncs, and a built-in character set. So they are good enough to use as a terminal chip. The P2s remove the character set and might not have native VGA capabilities, but you should have the time to bit-bang this anyway, and plenty have.

On something like a modified Gigatron, one can use the indirection table, the specified pages there, and the sound and keyboard locations for bus-snooping, and the P2 or whatever could do all the work and use its memory. Of course, if you go that far, you might as well redo the memory map and make your own software and toolchain. In your own firmware, you can add backward ROM compatibility and a new memory map, loader format, etc.

