PostPosted: Thu Jul 09, 2020 2:50 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 730
Location: Tokyo, Japan
65f02 wrote:
Along the same lines, maybe it would work to fall back into slow mode whenever a disk I/O address is accessed, and stay there until an RTS is encountered?

Well, first of all you'd need to count JSRs and RTSs or risk mishandling timing-critical code with subroutine calls in it (which sounds as if it may be expensive), and second (as I can attest from personal experience) even counting doesn't always work. Since the 6502 doesn't have an indirect JSR (unlike the 6800), it's synthesized by pushing an address onto the stack and jumping, and even with a direct JSR the stack-push technique may be cleaner when programming in continuation-passing style. And of course the developer may be using RTS tricks just because they happen to give him the exact timing he needs in the smallest number of bytes.

Quote:
Also, a good point about needing to access all addresses externally while in slow mode.

Yeah, I think that's something that escaped all of us at the start, for some reason. The "stack runs too fast" issue hadn't occurred to me; I was just realizing that there must be writes of data to the sector buffer in RWTS, and the sector buffer was obviously not in the same page....

Quote:
That will need a new flavor of bus cycles in the replica, where data are fetched from internal RAM (where they may previously have been written by some fast code elsewhere), but nevertheless an external bus cycle is started to retain the timing. Which should, of course, be quite feasible.

Yup, that's exactly how I envisioned it. I was wondering if one should actually do the write, as well, or just do some random read cycle (from any address, basically) that avoids changing the external RAM. I guess whatever's easiest, since the state of external RAM that's emulated internally is probably "undefined."

Though, thinking about that, it now occurs to me that on some systems different areas of memory will run at different speeds (e.g., ROM may have wait-states). So it probably makes sense to at least ensure that the address you're using externally is the actual address being read or written by the code.

Quote:
Had gotten sidetracked over the past days by the suggestion of a (dare I say in a 6502 forum?) CDP 1806 replica...

Oh, yeah, other processors get regular mentions in this forum. The 6800, for example, though mostly that seems to be in threads explaining how much faster the 6502 is than the 6800, poor thing. I have the RCA 1802 on my list, too, because I accidentally ordered ten of those instead of ten 6809s, and they weren't worth returning, so they went into the parts bin. But since I have them, I must use them, right?

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Thu Jul 09, 2020 9:53 am 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
cjs wrote:
65f02 wrote:
Along the same lines, maybe it would work to fall back into slow mode whenever a disk I/O address is accessed, and stay there until an RTS is encountered?

Well, first of all you'd need to count JSRs and RTSs or risk mishandling timing-critical code with subroutine calls in it (which sounds as if it may be expensive), and second (as I can attest from personal experience) even counting doesn't always work. Since the 6502 doesn't have an indirect JSR (unlike the 6800), it's synthesized by pushing an address onto the stack and jumping, and even with a direct JSR the stack-push technique may be cleaner when programming in continuation-passing style. And of course the developer may be using RTS tricks just because they happen to give him the exact timing he needs in the smallest number of bytes.

Hmm; but if we have to assume that subroutine calls can occur within the time-critical code segments, aren't those calls also bound to leave the memory page which has been earmarked as time-critical?

Counting JSRs and RTSs should not be too difficult, and may be a pragmatic way to determine when the CPU leaves the time-critical code section "upwards". I am inclined to accept the risk of missing a "stealth" subroutine call which is done via stack pushes and an indirect jump. [Edit -- I had this exactly backwards in the original post:] If I do overlook one, the effect would be that the processor leaves slow mode too early, when it encounters the RTS which corresponds to the "stealthy" subroutine call. That might indeed lead to incorrect real-time behaviour. But in the page-based approach, calling an off-the-page subroutine would do similar damage, and is more likely to occur in my view.

Alternatively, I could monitor the stack pointer. As soon as it increases above the value which it had when the first I/O access was encountered, I could end the "slow mode". But that is probably inviting other side effects, in situations where the programmer has pushed other data onto the stack...


Last edited by 65f02 on Thu Jul 09, 2020 9:21 pm, edited 1 time in total.

PostPosted: Thu Jul 09, 2020 10:04 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Personally, I wouldn't try to make this bulletproof, just try to get it to work in practice. A page either side of the page which holds the code that touches the I/O registers would probably be enough. (I say this without any evidence at all.)


PostPosted: Thu Jul 09, 2020 1:12 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 730
Location: Tokyo, Japan
65f02 wrote:
Hmm; but if we have to assume that subroutine calls can occur within the time-critical code segments, aren't those calls also bound to leave the memory page which has been earmarked as time-critical?

Well, my thinking was no: the time-critical routines are usually pretty small and kept together on a single page because having yet more pages that must be aligned in a certain way just increases the pain.

Quote:
Counting JSRs and RTSs should not be too difficult, and may be a pragmatic way to determine when the CPU leaves the time-critical code section "upwards".

Yeah, this is basically what I was trying to avoid with the "mark the whole page 'slow'" idea. Though I don't have a lot of evidence for this, I trust making a page slow more than I trust trying to track subroutines, especially when any instruction is likely to be "misused" to gain certain timing in timing-critical code.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Thu Jul 09, 2020 5:00 pm 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
cjs wrote:
65f02 wrote:
Hmm; but if we have to assume that subroutine calls can occur within the time-critical code segments, aren't those calls also bound to leave the memory page which has been earmarked as time-critical?
Well, my thinking was no: the time-critical routines are usually pretty small and kept together on a single page because having yet more pages that must be aligned in a certain way just increases the pain.
I am not sure that's what the programmers actually did. I just had a look at the commented DOS listing you linked to, and the RWTS routine calls "MSDELAY", which is several pages away, to wait for the motor to spin up. I did not look further down into part 2 of the RWTS, but this already seems like a counter-example: The FPGA can notice that something time-critical starts when the "motor on" address is accessed, but if it were to rely on the memory page, it would not realize that the delay routine itself is also time critical, as it resides elsewhere.

Quote:
Quote:
Counting JSRs and RTSs should not be too difficult, and may be a pragmatic way to determine when the CPU leaves the time-critical code section "upwards".
Yeah, this is basically what I was trying to avoid with the "mark the whole page 'slow'" idea. Though I don't have a lot of evidence for this, I trust making a page slow more than I trust trying to track subroutines, especially when any instruction is likely to be "misused" to gain certain timing in timing-critical code.
I think I'll try the JSR/RTS counting; it seems quite simple to me. Note that the implementation I have in mind would be quite simplistic, in that it does not remember anything beyond the current status. (I.e. it would not build a permanent map of slow vs. fast areas.) I would just dynamically enable "slow mode" whenever I encounter an access to an I/O address, start incrementing a counter whenever I see a JSR in slow mode, decrement it when I see an RTS, and switch back to fast mode when the counter reaches -1.
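
To make that concrete, here is a minimal stand-alone VHDL sketch of the counter logic I have in mind. The entity and signal names (sync, io_access and so on) are placeholders rather than the actual 65F02 sources, and the "counter reaches -1" condition appears here as "RTS seen while the depth is still 0":

Code:
library ieee;
use ieee.std_logic_1164.all;

entity slow_mode_tracker is
    port (
        clk       : in  std_logic;                      -- fast FPGA core clock
        reset     : in  std_logic;
        sync      : in  std_logic;                      -- '1' during an opcode fetch cycle
        opcode    : in  std_logic_vector(7 downto 0);   -- data bus contents during that fetch
        io_access : in  std_logic;                      -- '1' on any access to a time-critical I/O address
        slow_mode : out std_logic
    );
end entity slow_mode_tracker;

architecture rtl of slow_mode_tracker is
    signal active : std_logic := '0';
    signal depth  : natural range 0 to 255 := 0;        -- JSR nesting relative to the entry level
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                active <= '0';
                depth  <= 0;
            elsif active = '0' then
                if io_access = '1' then                  -- critical I/O touched: drop into slow mode
                    active <= '1';
                    depth  <= 0;
                end if;
            elsif sync = '1' then
                case opcode is
                    when x"20" =>                        -- JSR: one level deeper
                        depth <= depth + 1;
                    when x"60" =>                        -- RTS: one level back up
                        if depth = 0 then
                            active <= '0';               -- RTS from the entry level: fast mode again
                        else
                            depth <= depth - 1;
                        end if;
                    when x"40" =>                        -- RTI also ends slow mode
                        active <= '0';
                    when others =>
                        null;
                end case;
            end if;
        end if;
    end process;

    slow_mode <= active;
end architecture rtl;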


PostPosted: Fri Jul 10, 2020 1:35 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 730
Location: Tokyo, Japan
65f02 wrote:
I just had a look at the commented DOS listing you linked to, and the RWTS routine calls "MSDELAY", which is several pages away, to wait for the motor to spin up.

Oh, really! OK, I'd not realized that. Another thing that hadn't occurred to me would be time-critical but not clock-cycle-critical routines (i.e., where being a few clocks off is fine because you need to wait about a millisecond or whatever, such as for turning on a drive motor), where programmers might not worry about page-boundary considerations.

Quote:
I think I'll try the JSR/RTS counting; it seems quite simple to me. Note that the implementation I have in mind would be quite simplistic, in that it does not remember anything beyond the current status. (I.e. it would not build a permanent map of slow vs. fast areas.) I would just dynamically enable "slow mode" whenever I encounter an access to an I/O address, start incrementing a counter whenever I see a JSR in slow mode, decrement it when I see an RTS, and switch back to fast mode when the counter reaches -1.

Yes, this does now seem like the better approach. It would probably also be useful to have some sort of debugging tool that lets end users figure out what chunks of code this is being used for in an actual system, though I have no idea off-hand how to do that. (Perhaps record in a special area of memory designated by the user?)

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Sat Jul 11, 2020 4:12 pm 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Alright, I have written the code for an "automated real-time mode", which gets started by any access to a known-to-be-time-critical I/O address, and ended by an RTI or by an RTS going "up" from the level where the first critical I/O access was encountered. I can't test it before I get back home a week from now, unfortunately. I will certainly report here on my success, or lack thereof, in attempting to boot Apple DOS!

I have also implemented Ed's idea of caching screen buffers (graphics and text) in internal RAM while data are written to the host, so that read access can be fast. Curious to see how much that does for HGR drawing and text scrolling in the Apple II!

Finally, I have started on the idea of writing back internal RAM contents to external (battery-backed) RAM in the background, during idle cycles of the external bus. But I encountered an unexpected early bump in the road there: Just enabling the 9th bit in my 64k block RAM, which is meant to mark "pending" changes to data, slows down the FPGA massively -- adding nearly 2 ns to the prior 10 ns cycle time!

I had assumed that the extra bits would be "sitting there" in the RAM blocks anyway, hence declaring them to make them accessible should not change much at all in the resource utilization. I have not even connected the bits to anything faraway yet, so they should not consume routing resources either: The 9th input bits are tied to '1' and '0' respectively, on the two ports of the dual port memory. The 9th output bits are not connected at all yet.
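
For illustration, the declaration I am talking about would look roughly like this if the RAM were inferred instead of produced by the core generator -- all names are made up, and in the real design the write-back port would also rewrite each entry with the ninth bit forced to '0' once it has been flushed:

Code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram64k_dirty is
    port (
        clk     : in  std_logic;
        -- port A: used by the 65C02 core; every write marks the byte as "pending"
        a_we    : in  std_logic;
        a_addr  : in  std_logic_vector(15 downto 0);
        a_wdata : in  std_logic_vector(7 downto 0);
        a_rdata : out std_logic_vector(7 downto 0);
        -- port B: read by the background write-back engine
        b_addr  : in  std_logic_vector(15 downto 0);
        b_rdata : out std_logic_vector(7 downto 0);
        b_dirty : out std_logic
    );
end entity ram64k_dirty;

architecture rtl of ram64k_dirty is
    -- 64k entries of 9 bits: bit 8 is the "pending write-back" flag
    type ram_t is array (0 to 65535) of std_logic_vector(8 downto 0);
    signal ram : ram_t := (others => (others => '0'));
begin
    -- CPU-side port: the ninth input bit is tied to '1', i.e. any write sets the flag
    process (clk)
    begin
        if rising_edge(clk) then
            if a_we = '1' then
                ram(to_integer(unsigned(a_addr))) <= '1' & a_wdata;
            end if;
            a_rdata <= ram(to_integer(unsigned(a_addr)))(7 downto 0);
        end if;
    end process;

    -- write-back port: delivers the data plus the flag; in the full design this port
    -- would also rewrite the entry with bit 8 tied to '0' once the byte has been flushed
    process (clk)
    begin
        if rising_edge(clk) then
            b_rdata <= ram(to_integer(unsigned(b_addr)))(7 downto 0);
            b_dirty <= ram(to_integer(unsigned(b_addr)))(8);
        end if;
    end process;
end architecture rtl;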

Any ideas what could be going on there? According to the Xilinx timing analyzer, all the slow paths start with RAM data access, and it seems that it is the block memory itself (with its output multiplexers which combine the blocks into large memory) which is slowing things down. I will fiddle with the IP core generator some more, but may have to just forget about the battery backup -- it doesn't seem worth a 20% performance hit.


PostPosted: Sat Jul 11, 2020 4:21 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
That's a surprise, and not a good one. I've never ventured into 9 bits.


PostPosted: Sun Jul 12, 2020 12:03 am 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
65f02 wrote:
Any ideas what could be going on there? According to the Xilinx timing analyzer, all the slow paths start with RAM data access, and it seems that it is the block memory itself (with its output multiplexers which combine the blocks into large memory) which is slowing things down. I will fiddle with the IP core generator some more, but may have to just forget about the battery backup -- it doesn't seem worth a 20% performance hit.
I often use the Block RAMs, but I hardly ever use the IP core generator to instantiate them. I have found that the organizations which use the 9-, 18-, and 36-bit widths do not impose a speed penalty in my applications. (I'm also mostly happy with 50 MHz clock rates, but I do have projects that use the block RAMs at 125 MHz+.) These additional bits are intended for parity, but I generally use them for data, and I've never seen read/write access times suffer as a consequence of utilizing these bits.

One thing to consider is that these additional bits only exist for particular configurations of the BRAMs. For example, the 18432-bit BRAM of the Spartan-6 can be organized as 16384 x 1, 8192 x 2, 4096 x 4, 2048 x 8 or 2048 x 9, 1024 x 16 or 1024 x 18, and 512 x 32 or 512 x 36. Only an additional 2048 bits exist, and they are reachable only as the parity bits: bit 8 in the x9 organization, bits 16...17 in x18, and bits 32...35 in x36.

Review your IP core parameters to make sure that the generator isn't actually instantiating multiple BRAMs plus multiplexers built from programmable logic to yield the memory organization that you're requesting.

_________________
Michael A.


PostPosted: Sun Jul 12, 2020 8:19 am 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Thank you, Michael -- I think you nailed that one!

I had always assumed that the memory generator would construct my 64k*8 memory from 32 blocks of 2k*8 BRAM. But it turns out that it is smart enough to use the 16k*1 organization instead, which means much simpler multiplexer stages: Eight 4:1 multiplexers, one for each bit, rather than 32:1 for the 8-bit-wide blocks.
[Edit: Fixed my numbers, after eventually realizing that 64/2 = 32 and not 16...]

Just tried it: When I force the IP core generator to use 2k*8 blocks, by telling it to use 2k*8 "fixed primitives", the resulting memory performance is just as bad as what I got for the newly introduced 9-bit RAM. Memory access times take a big hit due to the complex multiplexers.

As you stated, if I want to use the 9th bit, there is no alternative to the 9-bit wide organization. When defining the individual RAM blocks as 1 bit wide, the extra memory is simply "lost" and I only get 8 sets of 1-bit memory. So this is mixed news: I now understand why the performance has dropped so much, but I also see that there is no way around that.

I might still try a poor man's solution for the battery backed-up RAM, by just cycling through all addresses known to be battery backed-up, and writing them back to the host in the background no matter whether they have changed or not. Less elegant, but with the small RAM sizes in the chess computers (1 to 8 kByte) it should still work well enough, as long as the 65C02 works largely from internal memory and leaves enough host bus cycles available.

Thanks again for the pointer in the right direction! At least I understand what's going on now...


PostPosted: Mon Jul 13, 2020 12:33 pm 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Another quick update: I have implemented the "stupid" memory backup solution now, and it does not seem too bad. I realized that my state machine for writing data back to the host in the background can wait until the very last moment where the CPU might still want to initiate an external bus cycle -- then only use the cycle when the CPU is not interested. Hence I never steal a bus cycle from the CPU, and all those redundant write-back operations of unchanged data, while not elegant, will at least not harm performance.
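
As a rough sketch (placeholder names, not the actual state machine), the "stupid" write-back boils down to something like this:

Code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity writeback_engine is
    port (
        clk           : in  std_logic;
        host_cycle    : in  std_logic;  -- '1' once per host bus cycle, at the decision point
        cpu_needs_bus : in  std_logic;  -- '1' if the 65C02 wants this external cycle itself
        ram_rdata     : in  std_logic_vector(7 downto 0);  -- internal BRAM data at wb_addr
        wb_addr       : out unsigned(15 downto 0);
        wb_data       : out std_logic_vector(7 downto 0);
        wb_strobe     : out std_logic   -- request an external write of wb_data to wb_addr
    );
end entity writeback_engine;

architecture rtl of writeback_engine is
    -- battery-backed region; the addresses here are placeholders, not a real machine's map
    constant BASE : unsigned(15 downto 0) := x"0000";
    constant LAST : unsigned(15 downto 0) := x"1FFF";   -- e.g. 8 kByte of NVRAM
    signal addr   : unsigned(15 downto 0) := BASE;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            wb_strobe <= '0';
            if host_cycle = '1' and cpu_needs_bus = '0' then
                -- the CPU does not want this host cycle: write the current byte back,
                -- whether it has changed or not, then move on (ignoring BRAM read latency here)
                wb_data   <= ram_rdata;
                wb_strobe <= '1';
                if addr = LAST then
                    addr <= BASE;                        -- wrap around and keep cycling forever
                else
                    addr <= addr + 1;
                end if;
            end if;
        end if;
    end process;

    wb_addr <= addr;
end architecture rtl;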

Lesson learned for me: Spartan 9-bit block RAM is not slower than 8-bit -- but may be a whole lot slower than 1-bit, if the latter means simpler multiplexers to combine multiple RAM blocks into a larger memory.

I have also written the VHDL to support the Apple II language card, by the way, and realized that the restrictions are a bit more severe than I thought. Excessive detail follows, for those who care:

I had overlooked that all my writes to external memory (or I/O) also write to the internal RAM at that address: My original term for enabling the RAM, which disabled it "on the fly" whenever an external address showed up on the bus, had to be simplified weeks ago. There was clearly not enough time to route that signal back to the block RAMs on the "outer rim" before the end of the clock cycle...

That was convenient for implementing Ed's idea of the internally mirrored video RAM (as the mirroring was already happening anyway). But it limits what I can do with the lower 4kByte RAM block of the language card, which exists in two overlaid versions: Contrary to what I thought, I don't have the spare 4k of BRAM from the Apple's I/O address space available, as that BRAM gets messed up by any write operation to the I/O addresses. And I can't even retain an intact copy in BRAM of the first incarnation of the language card's 4k block, as soon as the Apple starts writing to the second incarnation -- making those write accesses "external" does not change the fact that they overwrite the internal BRAM too.

I think I now have this implemented in a way which works, by falling back onto slow, external mode as soon as conflicting memory areas have been used. (At the cost of limited performance, once a more complex usage scenario arises: I can accelerate the ROM as long as no RAM has been used, and can accelerate the RAM as long as only 12 k of it have been used. But switching back from RAM to ROM means slow ROM access; and the first write to the "other" 4k RAM bank means that both 4k banks are now only available externally. Oh, and all writes to the lower 4k go to slow external RAM from the beginning, mirrored internally, to be prepared in case the second 4k bank ever gets written to, causing an invalid copy in the internal BRAM.)

I'm curious to see whether this works as planned, and what it does for the performance of UCSD Pascal, once I am back home.


That was more detail than anyone might be interested in... I just mention it here as my justification for not wanting to look into supporting more complex bank switching or memory management schemes. This simple one turned out to be messy enough! :wink:


PostPosted: Mon Jul 13, 2020 12:52 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Certainly was interested - even the smaller font didn't put me off! Quite an intricate set of constraints.

My gut feel is that while you might end up with something that's not optimally fast, the acceleration you hold onto will still be very worthwhile.


PostPosted: Sat Jul 18, 2020 8:48 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 730
Location: Tokyo, Japan
65f02 wrote:
That was more detail than anyone might be interested in...

Not at all; that's exactly the kind of detail I'm interested in.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Fri Jul 24, 2020 11:37 pm 

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
The memory mapping issue is a tough one. I ended up using 4K of RAM as a map representing controllable 16-byte windows for my solution (not FPGA, and only about 48 MHz). You can define any of these 16-byte regions as fast or slow (accelerated or cycle exact). It's programmable via a serial port or an external OLED/push buttons. Every machine uses a different memory map, often with mirrors of chips, so trying to make a device work with everything requires quite a bit of flexibility. I wanted something you could drop into any 6502 machine and use as an accelerator or (mostly) to test the contents of memory -- RAM, ROM, and I/O chips -- like for stand-up arcade machines. Nice solution you have. 100MHz is cookin'! :)
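
(For comparison with the FPGA approach, the map is just a lookup indexed by the top twelve address bits -- sketched below in VHDL, although my implementation is microcontroller firmware, so all names here are only illustrative:)

Code:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity window_map is
    port (
        clk      : in  std_logic;
        -- configuration write, e.g. from the serial port or the OLED/push-button menu
        cfg_we   : in  std_logic;
        cfg_addr : in  std_logic_vector(11 downto 0);  -- 64K / 16 = 4096 window entries
        cfg_slow : in  std_logic;                      -- '1' = cycle exact, '0' = accelerated
        -- lookup for the current 6502 bus address
        addr     : in  std_logic_vector(15 downto 0);
        slow     : out std_logic
    );
end entity window_map;

architecture rtl of window_map is
    type map_t is array (0 to 4095) of std_logic;
    signal win : map_t := (others => '1');             -- power up with everything cycle exact
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if cfg_we = '1' then
                win(to_integer(unsigned(cfg_addr))) <= cfg_slow;
            end if;
            -- the top twelve address bits select the 16-byte window
            slow <= win(to_integer(unsigned(addr(15 downto 4))));
        end if;
    end process;
end architecture rtl;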


PostPosted: Sat Jul 25, 2020 6:15 am 

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Thanks Jim -- I had overlooked your project so far! A forum search found this thread. Do you have additional information posted on the forum or elsewhere?

The computing power, and hence clock rate, available in those small processors these days is amazing. I am sure one could beat my 100 MHz with a suitable ARM processor... But I wasn't aware of any project which combined that with a "real time" 6502 bus interface to the original host. That's quite intriguing! How have you implemented this? Do the Phi0 clock, NMI, IRQ, and RDY lines trigger interrupts, or do you poll them?

Have you been able to push this beyond a 1 MHz host clock, as mentioned in the earlier thread? With the FPGA-based solution, clock delay and jitter do become noticeable for the faster 5 MHz hosts. (All incoming signals, including the Phi0 clock, get sync'd to the FPGA clock via two flip-flop stages, which add a 20 ns delay and 10 ns peak-to-peak jitter.) This still works nicely, but the 8 MHz CDP 1806 system which I might want to try next may be pushing it.
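
(The synchronizer in question is nothing more than the usual two-flip-flop arrangement, roughly as in the sketch below -- made-up names, not the actual 65F02 code. At a 100 MHz core clock the two stages account for the ~20 ns delay, and the 10 ns sampling granularity shows up as jitter on Phi0.)

Code:
library ieee;
use ieee.std_logic_1164.all;

entity sync2 is
    port (
        clk      : in  std_logic;   -- 100 MHz FPGA core clock
        async_in : in  std_logic;   -- e.g. Phi0, IRQ, NMI or RDY from the host bus
        sync_out : out std_logic
    );
end entity sync2;

architecture rtl of sync2 is
    signal ff1, ff2 : std_logic := '0';
begin
    process (clk)
    begin
        if rising_edge(clk) then
            ff1 <= async_in;        -- first stage may go metastable
            ff2 <= ff1;             -- second stage gives it a full clock period to settle
        end if;
    end process;

    sync_out <= ff2;
end architecture rtl;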

Many thanks for pointing out that cool project!

