Just so you have some perspective, the parallel SCSI subsystem in my POC V1.1 unit has a raw bus speed of 3.5MB/second and a theoretical throughput to/from core of ~750KB per second when handling 32KB blocks of data. During actual testing, a 32KB block loaded in ~45ms ±4ms, which works out to ~728KB/sec. I used the counter/timer in the DUART, running at 250 Hz, to track performance; the timer's 4ms tick period is where the ±4ms uncertainty comes from. In that test, the 65C816 was running at 12.5 MHz, and disk reads and writes targeted bank $00 without any use of indirection. This is being done with interrupt-driven code and some very tight loops, along with a SCSI ASIC that handles the bus protocol in hardware. If I had to do it entirely in software, I'd be lucky to get 200KB/sec.
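As a quick sanity check on those figures (a sketch only; the 32KB block size, 45ms ±4ms timing, and 250 Hz timer rate are from the test above, with KB taken as 1000 bytes):

```python
# Measured: a 32KB block loaded in ~45ms, timed with a 250 Hz
# counter/timer, so each tick is 4ms -- hence the ±4ms uncertainty.
BLOCK_BYTES = 32 * 1024        # 32KB block
TICK_S = 1 / 250               # 250 Hz DUART timer -> 4ms resolution

def throughput_kb_s(elapsed_s):
    """Throughput in KB/sec (KB = 1000 bytes, matching the text)."""
    return BLOCK_BYTES / elapsed_s / 1000

print(round(throughput_kb_s(0.045)))            # ~728 KB/sec
# One timer tick either way bounds the true figure:
print(round(throughput_kb_s(0.045 + TICK_S)))   # ~669 KB/sec
print(round(throughput_kb_s(0.045 - TICK_S)))   # ~799 KB/sec
```

So the ±4ms timer granularity puts the real throughput somewhere between roughly 670 and 800 KB/sec.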
728KB/sec may sound pretty fast, but it's only about one-fifth the speed of the SCSI bus, and I'm running the bus at its slowest rate (SCSI1-1986 speed, to be precise). There is a certain amount of unavoidable overhead, such as disk seeks, bus protocol, etc. However, the real roadblock is the 65C816 itself. It can only move bytes so fast, and that "so fast" is much slower than what the SCSI hardware is capable of. The performance solution would be a DMA controller operating on a two-clock bus cycle. That's a project for another time.
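The "one-fifth" figure comes straight from the two numbers above (3.5MB/sec raw bus rate vs. the ~728KB/sec measured):

```python
# Rough bus-utilization figure: measured throughput vs. raw bus speed.
bus_kb_s = 3500          # 3.5MB/sec raw SCSI1-1986 bus rate
measured_kb_s = 728      # from the 32KB / 45ms test

utilization = measured_kb_s / bus_kb_s
print(f"{utilization:.0%}")   # ~21%, i.e. about one-fifth
```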
Something to be aware of is that reading data from a fixed address with a 65C816, as is the case with disk I/O, will involve long indirection if the data is going into or coming out of a different bank than the one in which the I/O device is located. Indirection of any kind costs clock cycles because it involves additional internal steps in the MPU. Any 24-bit load or store will incur a one-clock-cycle penalty for each access. If the access involves a read-modify-write operation, e.g. incrementing a byte in another bank, which must be done as a load, modify and store through a long pointer, there will be multiple penalties, because such an operation requires multiple accesses. Most mass storage accesses don't involve R-M-W operations, but they still require indirection and, usually, indexing, which also adds cycles. If bytes are fetched from one bank and written to another, the indirection involves an "extended" pointer (a 24-bit pointer) on direct page to hold at least one of the addresses. So there will be the double-whammy of indirection and 24-bit access.
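The penalties can be seen in the base cycle counts for the various load addressing modes. The numbers below are the base figures from the WDC data sheet (8-bit accumulator); exact totals vary with indexing, direct-page alignment and register width, so treat this as an illustrative sketch:

```python
# Base 65C816 LDA cycle counts by addressing mode (8-bit accumulator).
cycles = {
    "LDA abs":    4,   # absolute, same bank
    "LDA long":   5,   # 24-bit absolute: +1 cycle for the extra address byte
    "LDA (dp),Y": 5,   # indirection through a 16-bit pointer on direct page
    "LDA [dp],Y": 6,   # long indirection through a 24-bit pointer
}
for mode, n in cycles.items():
    print(f"{mode}: {n} cycles")
```

Note the pattern: going from a 16-bit to a 24-bit address costs one extra cycle, and that penalty repeats on every access, which is why it adds up over a multi-kilobyte transfer.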
This is where cycle counting gets into the picture: it's how you can predict the performance of your code. When you start adding up all those twists and turns the code takes during a mass storage access, you'll wonder where the time went.
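Here's what that adding-up looks like in practice. The inner loop below is hypothetical, not my actual driver code: a minimal byte-mover reading a device register in bank $00 and storing through a 24-bit direct-page pointer into another bank, using base cycle counts at the 12.5 MHz clock from the test above:

```python
# Hedged sketch: predict a copy loop's best-case throughput by cycle
# counting. The instruction mix is a hypothetical minimal inner loop;
# cycle figures are base counts from the WDC data sheet.
CPU_HZ = 12_500_000  # 12.5 MHz, as in the test above

loop = {
    "LDA abs":    4,   # read the device's data register (bank $00)
    "STA [dp],Y": 6,   # store via 24-bit pointer into another bank
    "INY":        2,   # bump the index
    "BNE loop":   3,   # branch taken
}
cycles_per_byte = sum(loop.values())                   # 15 cycles/byte
print(round(CPU_HZ / cycles_per_byte / 1000))          # ~833 KB/sec ceiling
```

That ~833KB/sec is a ceiling for the loop alone; interrupt servicing, block bookkeeping and pointer updates at bank boundaries eat into it, which is consistent with the ~728KB/sec actually measured.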