6502.org Forum
Author Message
PostPosted: Thu Nov 11, 2021 5:31 pm 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I have seen many comments which suggest 65816 bank zero must be treated as a special case. I question this wisdom. What is the downside of mirroring I/O and ROM in all banks? For example, having 32KB RAM, 16KB I/O and 16KB ROM in every bank, where the RAM in each bank differs but the I/O and ROM are the same in every bank?

Obviously, this allocates 4MB to I/O when only a few pages may be required. Likewise, it limits total RAM to 8MB. Worse, the RAM is only available in discontiguous chunks of 32KB. This obviously affects memory allocation and block copy functions. However, it doesn't affect the upper bound of stacks allocated in bank zero. It also has the benefit that I/O may use faster 16 bit addresses (or 8 bit addresses) and the data bank value may be unmodified. Most significantly, a board design which does not treat bank zero as special has much of the speed and simplicity of a 16 bit address system. The *processor* uses bank zero for vectors, stacks and direct page but the *board* only considers bank number when accessing RAM. The board is otherwise bank agnostic.
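For concreteness, the bank-symmetric map can be modelled in a few lines of Python (a sketch only, using the region boundaries proposed above; the function name is illustrative):

```python
def decode(addr24):
    """Classify a 24-bit address under the bank-symmetric map:
    $0000-$7FFF is per-bank RAM, while $8000-$BFFF (I/O) and
    $C000-$FFFF (ROM) mirror the same devices in every bank."""
    bank = (addr24 >> 16) & 0xFF
    offset = addr24 & 0xFFFF
    if offset < 0x8000:
        return ("RAM", bank, offset)   # 32KB unique to this bank
    elif offset < 0xC000:
        return ("I/O", None, offset)   # same 16KB I/O in every bank
    else:
        return ("ROM", None, offset)   # same 16KB ROM in every bank
```

Note that the bank byte only matters on the RAM path, which is the board-level simplification being argued for: the decode of I/O and ROM never looks at A16-A23.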

I've previously suggested a four-latch scheme which provides three 6502 compatible bank windows and one 65816 bank latch. I remain fond of this arrangement but it has become apparent to me that all arrangements that treat bank zero as special may be unnecessarily slow. The four-latch scheme requires handling of cases within cases and the tiers of logic constrain the fastest operational speed. I am concerned that designs which punch an I/O hole through RAM are limited in a less obvious manner while reducing the total accessible RAM. Latches and address ranges may obscure the problem but, ultimately, address decode is required to implement an L shape of complementary OR/NAND outputs from a mix of inverted and non-inverted inputs. There are very specific constraints where this matches the speed of a bank symmetric board. Outside of these constraints, there are cases where one design may be preferable or the choice is moot.

Technically, one I/O hole makes the RAM discontinuous but it is otherwise possible to have 15MB or more of contiguous RAM. Indeed, it is possible to install 16MB RAM and make almost all RAM visible to the processor. If you need the very last address line, this is an obvious advantage. However, in systems with 4MB RAM or vastly less, advantages are less obvious. The "best" arrangement depends upon total RAM and usage patterns. Contiguous allocation may outweigh raw cycle speed and this is particularly true if "faster" hardware is then hobbled by 15 bit addressing (or similar) in software. However, itty-bitty tasks may benefit from itty-bitty allocation. For example, a multi-tasking system with tasks written in a mix of BASIC, Forth and assembly may have more tasks if memory is allocated in smaller chunks - and this is most lazily achieved by partially populating each bank.

Assuming that I use no more than 4MB RAM and mostly run toy applications written in BASIC, is there a fatal flaw if 4MB RAM is arranged as 128 banks of 32KB rather than the conventional arrangement of 64 banks of 64KB (minus a chunk in bank zero)?

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


PostPosted: Thu Nov 11, 2021 5:46 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
In my opinion, having contiguous RAM above bank $00 is most efficient. Unlike programs, which are by necessity bank-aware, data can be bank-agnostic if RAM is contiguous. From my perspective, it should be possible to have data structures straddle banks so as to get most efficient use from available RAM.
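The straddling point can be illustrated with a toy comparison (hypothetical helper names) between advancing a pointer in contiguous RAM and in a 32KB-per-bank scheme:

```python
def next_addr_contiguous(addr24):
    """Contiguous RAM: a data structure may straddle banks,
    so advancing a 24-bit pointer is a plain increment."""
    return (addr24 + 1) & 0xFFFFFF

def next_addr_banked32k(bank, offset):
    """32KB-per-bank scheme: software must detect the $8000
    boundary and bump the bank number itself."""
    offset += 1
    if offset == 0x8000:
        return bank + 1, 0x0000
    return bank, offset
```

Crossing a bank boundary costs nothing in the first scheme, but needs an explicit test-and-wrap in the second, which is exactly the bank-awareness being objected to.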

Using a CPLD, one can readily distinguish between bank $00 and higher banks, which, in my opinion, effectively makes what you are suggesting of dubious value.

Sheep64 wrote:
For example, having 32KB RAM, 16KB I/O and 16KB ROM...

16KB allocated to I/O is grossly wasteful. Again, using a CPLD, the I/O block can be reduced to 1K or less. A well-thought-out memory map will maximize RAM, especially in bank $00, since that is where direct page and the stack will reside.

In my POC V1.2 and V1.3 units, there is 48KB of bank $00 RAM—V1.3 also has 64KB of extended RAM. I/O is decoded into pages and the whole of the I/O block occupies 2KB. This arrangement is achieved with discrete logic.
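As a rough model of page-granular I/O decode within a 2KB block (the base address here is an assumption for illustration, not the actual POC memory map):

```python
IO_BASE = 0x00C000   # hypothetical bank $00 base; the real POC map may differ
IO_SIZE = 0x800      # 2KB I/O block, decoded into eight 256-byte pages

def io_page(addr24):
    """Return the I/O page number (0-7) if the address selects I/O,
    else None. Because the full 24-bit address is compared, the same
    offset in a higher bank does NOT select I/O -- the decode is
    bank-aware, unlike the mirrored scheme."""
    if IO_BASE <= addr24 < IO_BASE + IO_SIZE:
        return (addr24 - IO_BASE) >> 8
    return None
```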

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Thu Nov 11, 2021 7:55 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
I use Forth a lot on the workbench, and its excellent memory efficiency makes me think I'll never write anything complex enough to fill more than 32K of RAM with code (with the kernel in ROM); but there have been times I would like many megabytes of contiguous RAM for large data arrays. Dr Jefyll has proposed using the top address bit or two for other functions in his ultrafast-I/O schemes.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Thu Nov 11, 2021 10:02 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Sheep64 wrote:
Worse, the RAM is only available in discontiguous chunks of 32KB.
BigDumbDinosaur wrote:
data can be bank-agnostic if RAM is contiguous. From my perspective, it should be possible to have data structures straddle banks
GARTHWILSON wrote:
there have been times I would like many megabytes of contiguous RAM for large data arrays.

:!: I'll be the fourth to draw attention to the value of contiguous RAM.

I agree that itty-bitty tasks won't require contiguous RAM, but constraining oneself to itty-bitty tasks seems a high price to pay. Non-contiguous RAM may be simpler to organize where small tasks are concerned, so there is that. But I'm not convinced that simpler address decoding will allow clock rates to rise much or at all.

To manage bank-aware decoding, programmable logic is one option, as BDD noted. And the diagram below shows how familiar chips like the '688 and the '138 can have their input capacity tripled to handle the wider addresses produced by the '816. The little 'LVC1G27 and 'LVC1G332 gates you see have maximum prop delays of 3ns or less, so the arrangement is faster than you might suppose. The real bottleneck will be the '688 or '138, and in that case FCT-series logic is probably the fastest choice.
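The attached diagram is not reproduced here, but one plausible version of the idea (small single-gate parts qualifying A16-A23 so that a '138 fires only in bank $00) can be sketched as:

```python
def nor3(a, b, c):
    """Model of a 3-input NOR such as the 'LVC1G27."""
    return int(not (a or b or c))

def bank0_enable(addr24):
    """Qualify the eight high address lines with three small gates so a
    downstream '138 is only enabled in bank $00. This is an illustrative
    arrangement, not the circuit from the attachment."""
    bits = [(addr24 >> n) & 1 for n in range(16, 24)]  # A16..A23
    g1 = nor3(bits[0], bits[1], bits[2])   # A16-A18 all low?
    g2 = nor3(bits[3], bits[4], bits[5])   # A19-A21 all low?
    g3 = nor3(bits[6], bits[7], 0)         # A22-A23 all low?
    return g1 & g2 & g3                    # final AND = '138 enable
```

Because the three small gates operate in parallel, only one extra gate delay (a few ns) is added ahead of the decoder proper.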

BTW, I agree when Sheep64 points out that, "The "best" arrangement depends upon total RAM and usage patterns." Different circumstances result in different tradeoff choices. But I myself would never dream of breaking up the 816's big, beautiful, flat memory space! :shock:

-- Jeff


Attachments:
'816 IO-decode.png

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
PostPosted: Wed Feb 02, 2022 4:32 pm 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I expected one or two negative comments and a general consensus that I devised an inadvisable memory map. I did not expect the weight and speed of negative reactions. Nor did I expect Dr Jefyll to provide the most tentative response.

Regarding software, my approach is to have applications which work within a 16 bit address-space and an executive which may or may not maintain a larger address-space. Differences in operating system ABI and processor instruction dialect can be handled dynamically within affected routines.

Regarding hardware, my approach is to "flatten the logic". In particular, I have found that the following operations may run in parallel:

  • Multiplexed address latching.
  • Address segment decode using A13-A15.
  • Read/write qualification.

Empirical testing found that new 74HC138 and 74HC139 parts operating at room temperature were able to invert a 64MHz crystal. I plan to use a raw crystal to obtain both clock phases or, alternatively, one 74HC157 in one of Dr Jefyll's lesser-remembered configurations, which I call the Scotty manoeuvre. Specifically, one 74HC157 may implement a 4 bit transparent latch where each input may latch independently on a high or low input. Where the address-space is 20 bit or less, this may replace the one 74x373 or 74x573 and inverter suggested in the W65C816 datasheet. Assuming 74x157 operates at a similar speed to the decoder chips, assuming 10ns setup time, assuming 5ns RAM, assuming a tight physical arrangement, assuming low bus capacitance and assuming steady power, it should be possible to exceed 20MHz at 5V while using only 74HC series parts. A practical 30MHz design is possible with 74AC series, an asymmetric clock and/or over-volting.
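The timing claim can be sanity-checked with back-of-envelope arithmetic using the stated assumptions (a crude model; a real budget has further terms such as clock skew and bus settling):

```python
def max_clock_mhz(glue_ns, ram_ns, setup_ns):
    """Crude upper bound on clock rate for a 65xx-style bus where data
    must be valid within one half-cycle: glue (latch/decode/qualify
    running in parallel, so only the slowest counts) + RAM access +
    processor setup time."""
    half_period_ns = glue_ns + ram_ns + setup_ns
    return 1000.0 / (2 * half_period_ns)
```

With the figures above (roughly 10ns for the parallel glue, 5ns RAM, 10ns setup) the model lands at 20MHz, which matches the claim; shaving the glue to 74AC speeds pushes it past that.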

With a modicum of chip stacking, it is possible to place any two of address multiplexing, clock stretching, address decode or read/write qualification under one DIP 65816. This arrangement is unlikely to break the speed record. However, it aids fitting of dozens of DIP chips into 100mm*100mm and is therefore cheap to manufacture.

I have also discovered that half 74x74 may be used with active high 65xx peripheral chip select to implement a crude privilege system which may preclude access to I/O. The overhead of a privilege system may horrify real-time proponents. However, it is possible to connect 6522 chips with different interrupts while only one 6522 implements a privilege system. This could restrict access to a system timer and filing system but permit high speed audio sampling to RAM. Of particular note, the privilege bit does not hinder address decode speed.
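A sketch of the privilege idea (half a 74x74 ANDed into an active-high chip select; the class and method names are purely illustrative):

```python
class PrivilegeBit:
    """Model of half a 74x74 flip-flop gating the active-high chip
    select of a protected 6522. When the bit is clear, the select is
    forced inactive; the address decode path itself is untouched, so
    decode latency is unaffected, as noted above."""
    def __init__(self):
        self.q = 1                       # boot in the privileged state

    def drop_privilege(self):
        self.q = 0                       # e.g. on entering user code

    def chip_select(self, decoded_cs):
        return decoded_cs & self.q       # AND gate on the active-high CS
```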

Conceptually, I find separate consideration of the memory map and the clock stretching map to be extremely helpful. Often, they use the same circuitry - but not always. This is where the distinction is useful. As an example, it may be desirable to have A13-A15 decoded with a 74HC138 to implement 6*8KB RAM, 8KB I/O and 8KB ROM. Meanwhile, 16KB of clock stretching may be decoded with a separate chip. Ignoring the rather elastic timing of each memory region, I have found that it is possible to approximate the memory maps of the Commander X16, Commodore VIC20, Apple II, later 8 bit Acorn hardware and the W65C265 using one 74HC138. This arrangement gained a reputation for being slow and cumbersome when the NMOS 6502 PHI2 output (approximately 50ns latency) was connected to a 74LS138 input enable (approximately 25ns latency). However, when using the PHI0 input, 74HC or 74AC and separate read/write qualification, latency may be less than 15ns. In combination with 600nm CMOS processors and larger, faster SRAM, it is quite feasible to run legacy binaries at more than 10 times their original speed. I would not have considered this arrangement without the popularity of the Commander X16 project.
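The single-'138 segment decode can be modelled as follows (the ordering of the RAM, I/O and ROM segments is an assumption; any of the eight outputs could be assigned either way):

```python
def segment(addr16):
    """Decode A13-A15 with one '138 into eight 8KB segments:
    six RAM, one I/O, one ROM (hypothetical ordering)."""
    sel = (addr16 >> 13) & 0b111   # the '138's three select inputs
    if sel <= 5:
        return "RAM"               # $0000-$BFFF: 6 * 8KB
    return "I/O" if sel == 6 else "ROM"   # $C000-$DFFF / $E000-$FFFF
```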

As a bonus, the unmapped upper address lines of modern RAM and ROM may be connected to a transparent latch such that 6 bit A16-A21 may be connected directly to RAM and 2 bit A22-A23 may be connected directly to ROM. This allows any permutation of 63 banks to be used with one native and three foreign ABIs. This allows, for example, an unmodified Acorn BASIC binary to use an Acorn ABI while EhBASIC concurrently uses a VIC20/Commodore 64/Commander X16 ABI while the base system concurrently implements native 65816 vectors. Essentially, it is possible to solve the Commodore/Acorn/65816 vector conflict with fairly trivial 15ns glue logic. Unfortunately, this arrangement breaks stack introspection on page one using TSX // AND $102,X or similar. Fortunately, the most common case is an executive routine which flips a carry flag. I presume that drogon or Werner Engineering have already handled these cases. Regardless, all ABIs can be implemented using one slow, cheap 28C256 (possibly programmed with XR2801 EEPROM programmer, possibly connected to a privileged 6522).
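The bank-byte split can be sketched like so (a model of the routing only, not of the latch timing):

```python
def route_bank(bank_byte):
    """Split a latched bank byte: the low six bits (A16-A21) go
    directly to RAM, while the top two bits (A22-A23) go directly to
    ROM, selecting one of four ABI images (one native, three foreign)."""
    ram_bank = bank_byte & 0x3F          # A16-A21 -> RAM bank select
    abi_image = (bank_byte >> 6) & 0x3   # A22-A23 -> ROM image select
    return ram_bank, abi_image
```

Because the split is pure wiring from the latch outputs, it adds no decode logic in the critical path; the per-image vectors live in the ROM selected by the top two bits.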

The design can be easily implemented in CPLD or FPGA. However, I strongly advise not using more than six bits of input for address decode when using an Atmel CPLD, not more than five bits of input when using a Xilinx FPGA and not more than four bits of input when using a Lattice iCE40. Anything more is an unnecessary violation of "flatten the logic". In the general case, for contiguous regions of memory, that means not decoding blocks smaller than 4KB, definitely not decoding blocks smaller than 1KB and never processing A16-A23.

Finally, I have a quick message to one proponent of "flatten the logic" who latches A16 and then incorporates this and more than six other inputs into CPLD equations: Firmware modification will remove more than 5ns latency from address decode. This would allow, for example, POC1.3 to run faster than 20MHz. Unfortunately, it requires dividing the RAM into two or more discontiguous regions. Oh, wait. POC1.3 does that already.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

