A pool of odd ideas for speeding up a 6502 architecture
- BigDumbDinosaur
- Posts: 9428
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: A pool of odd ideas for speeding up a 6502 architecture
kc5tja wrote:
ttlworks wrote:
BTW: when trying to build a sort of game system nowadays, with a color screen resolution of 640*480 or more, I think that a 64kB address range won't do, so there is that proposal for giving a 6502 a 24-bit address bus...
Quote:
ttlworks wrote:
I'm not sure if the 6502 needs a 'multiply' instruction, as most of the code probably won't make use of it.
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: A pool of odd ideas for speeding up a 6502 architecture
I suppose I could just admit that it's impossible to have two threads, one for each of two different ideas. But as a consequence, it's very difficult to have a satisfactory discussion about either idea.
Let me just say: this here thread was intended for implementation ideas which lead to a higher performance 6502. This other thread is intended for modified add-value near-6502 ideas.
So, here's Sam's post about fused ops and cycle counts and z80s:
kc5tja wrote:
Z-80s can take 3, 4, or 5 clock cycles per M-cycle, depending on the instruction executed, and even then, can take several M-cycles too.
The reason the average is between 3 and 5 is because the chip is microcoded underneath. That is, the Z-80 and 68000 CPUs actually have a simpler, yet functionally independent CPU (of sorts) whose job it is to interpret the programmer-visible instruction set. This is how, for example, the 68020 could get away with such utterly ridiculously complicated addressing modes.
In the Z-80's case, the microcode is used to multiplex data over finite numbers of internal buses, all of which were laid out by hand back then. It's also used to enforce the bus protocol as well:
1. Put address on the bus.
2. (While waiting for data to arrive, increment the PC.)
3w. Sample the RDY or similar signal, and if negated, wait here.
3. When RDY is asserted, accept the data and terminate the bus transaction.
E.g., if RDY is asserted during step 1 above, it means nothing.
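The handshake described in those steps can be sketched as a toy model (hypothetical Python, with a wait-state count standing in for the RDY-like signal being negated; the cycle numbers are illustrative, not exact Z-80 T-state timings):

```python
# Toy model of the microcoded bus protocol above: address out, PC increment
# overlapped with the fetch, then wait states stretch the M-cycle until the
# memory is ready, and only then is the data accepted.

def bus_read(memory, addr, wait_states):
    """Model one memory-read M-cycle.  Returns (data, cycles_taken)."""
    cycles = 1                    # step 1: put the address on the bus
    cycles += 1                   # step 2: increment the PC while waiting
    while wait_states > 0:        # step 3w: signal negated -> insert a wait cycle
        cycles += 1
        wait_states -= 1
    cycles += 1                   # step 3: accept the data, end the transaction
    return memory[addr], cycles

mem = {0x1234: 0xAB}
assert bus_read(mem, 0x1234, wait_states=0) == (0xAB, 3)  # fastest case
assert bus_read(mem, 0x1234, wait_states=2)[1] == 5       # each wait adds a cycle
```

Note that asserting the ready signal during step 1 has no effect in this model either: it is only sampled inside the wait loop.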
The genius of the 6502 is that its bus was truly single phase (phi1 and phi2 are conveniences to simplify the NMOS implementation; it's not at all a requirement for the CMOS process, which is why the 65816 doesn't have them, and you drive phi2 externally). You drive the address bus and R/W lines during phase-1, and capture the data during phase-2. If the data wasn't ready, that's OK -- just repeat the current cycle. The 65816 is only marginally more sophisticated than this, due to multiplexing the bank-address byte on D7-D0.
As far as proof of 6502's performance relative to other CPUs with a normalized clock, all you need to do is count the cycles in your particular application. Yes, yes, the 68000 can pull off 2 MIPS performance at 8MHz, but know what? A 65816 at 8MHz will pull off 4 MIPS on average. At this point, the 65816 and 68000 compete head to head, with the former on average attaining close to 80% the performance of the 68000, despite having only an 8-bit wide bus. Proof: Sit an Apple IIgs on a table next to a classic Macintosh. Run a paint program on both of them. (Remember, Mac is monochrome, while IIgs is 16-color. While the Mac has more pixels, the IIgs actually has more bits to push around total). You'd totally expect the IIgs at 2.3MHz to be slower at video updates than the 8MHz Mac; however, this is not the case. Grab a brush from some picture and drag it around the screen. The Mac will have very noticeable rip and tear, while the IIgs will appear to be about as fast as a Commodore-Amiga using its blitter to draw.
As a final analysis, let's normalize bus widths and optimize our microarchitectures too (in fact, we have an existence proof: the 68008), and what you'll find is that the 68000 is abysmally sluggish compared to the 65816. The only reason the 68000 gets the performance that it does is because it has a true 16-bit wide bus. Slap a 16-bit wide bus on the 65816, changing nothing else, and I'm willing to put money that the 65816 will meet or even slightly exceed the 68000.
If we take this opportunity to really widen the data bus, then a single instruction fetch can grab a whole handful of instructions. This is quite useful thanks to something called macro-op fusion: the CPU looks not just at individual instructions to determine what to do, but at the context surrounding them. clc, lda, adc becomes a single "macro-op" instruction; sta, lda occurs entirely too frequently to miss this one too; adc, sta occurs less frequently, but it's strongly desirable for what I hope are obvious reasons. If you augment the instruction decode logic to perform macro-op fusion, your instruction timings now will look like this:
Code:
; Assuming native-mode, 16-bit accumulator
; For a core that implements macro-op fusion, I further assume a 64-bit bus.
;
;                    AS-IS      Macro-Op Fusion
CLC              ;     2        1  [1, 3]
LDA addend1L     ;     5        1  [2, 5]
ADC addend2L     ;     5        1  [2, 5]
STA resultL      ;     5        2  [3, 4, 5]
LDA addend1H     ;     5        1  [2, 5]
ADC addend2H     ;     5        2  [2, 3, 5]
STA resultH      ;     5        1  [4, 5]
;
; TOTAL:            32 cycles   9 cycles (best case)
Notes:
1 Out of context, CLC would normally take the usual 2 cycles; but, since it's recognized as part of a more complex code pattern, its behavior can be rolled into the mechanisation of the surrounding code.
2 This instruction takes 2 cycles to fetch a 16-bit word from memory.
3 There is an additional cycle overhead for instruction fetch on this byte.
4 This instruction takes 2 cycles to store a 16-bit word to memory.
5 Add 1 cycle if 16-bit operand crosses an 8-byte boundary.
According to http://oldwww.nvg.ntnu.no/amiga/MC680x0 ... ndard.HTML , a single ADD.L instruction takes 12 cycles. The code above fetches, adds, and stores a 32-bit quantity to memory, and assuming alignment with respect to the 64-bit bus, actually runs 3 cycles faster. Again, this is a hypothetical case, and don't expect to see this technology become widespread in the 65xx-community soon. All I'm saying is that it's relatively easily doable if you truly compare apples to apples.
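For illustration only — nothing like this exists in any shipping 65xx part — the fusion step could be sketched as a decoder that peeks at adjacent instructions and collapses the known patterns. The macro-op names and the pattern table below are made up for the sketch:

```python
# Toy macro-op fusion: scan an instruction stream, greedily collapsing
# recognized adjacent-instruction patterns into single internal operations.

FUSIBLE = {
    ("CLC", "LDA", "ADC"): "load-add (carry cleared)",
    ("ADC", "STA"):        "add-store",
    ("STA", "LDA"):        "store-load",
}

def fuse(stream):
    """Greedily collapse fusible patterns; return the list of macro-ops."""
    out, i = [], 0
    while i < len(stream):
        for pattern, name in FUSIBLE.items():
            if tuple(stream[i:i + len(pattern)]) == pattern:
                out.append(name)
                i += len(pattern)
                break
        else:
            out.append(stream[i])   # no pattern matched: pass through unfused
            i += 1
    return out

prog = ["CLC", "LDA", "ADC", "STA", "LDA", "ADC", "STA"]
assert fuse(prog) == ["load-add (carry cleared)", "store-load", "add-store"]
```

A real decoder would of course work on opcode bytes and operand fields, not mnemonics, and would have to respect flag visibility across the fused group; this only shows the pattern-matching shape of the idea.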
Re: A pool of odd ideas for speeding up a 6502 architecture
... and having quoted that, I've run out of energy today to engage with the ideas... sorry.
Re: A pool of odd ideas for speeding up a 6502 architecture
While my cycle counts are correct, I forgot to update the footnotes.
Since the data bus is 64-bits wide, there's no further need to span 2 cycles to read a 16-bit quantity, unless it happens to straddle the 64-bit boundary itself.
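That correction can be stated as a one-line rule. With a 64-bit bus, a 16-bit operand needs a second bus cycle only when its two bytes fall in different 8-byte groups, i.e. when its low byte occupies the last byte lane of a group (a hypothetical helper, just to pin the arithmetic down):

```python
# Cycle count for reading a 16-bit word over a 64-bit data bus: one cycle,
# unless the word straddles an 8-byte boundary.

def cycles_for_word(addr):
    """Bus cycles to fetch a 16-bit word at byte address addr."""
    straddles = (addr & 7) == 7     # low byte sits in the last lane of a group
    return 2 if straddles else 1

assert cycles_for_word(0x1000) == 1
assert cycles_for_word(0x1007) == 2   # bytes at ...7 and ...8: two groups
assert cycles_for_word(0x1008) == 1
```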
Re: A pool of odd ideas for speeding up a 6502 architecture
I think this note belongs here - one small innovation in the first ARM back in 1985 gave them 50% better memory bandwidth, which they felt was the limiting factor for performance at the time:
- http://archive.computerhistory.org/reso ... df#page=15
Quote:
Furber: We knew that the processor would meet our requirements because it was fairly easy to do the sums. One of the reasons ARM came out as a 32-bit processor and Acorn effectively went from 8 to 32 without stopping at 16 along the way was because we had this strong idea that memory bandwidth was the key to getting processors to perform, and the 32-bit processor had an easy access to twice the bandwidth of a 16-bit processor so why not use it? We were not talking high clock rates here. The original ARM processor that went out in the Archimedes product was capable of operating at about eight megahertz, and that was, you know, not too difficult to achieve; and if you achieved eight megahertz then we had a very good idea of how much bandwidth it would use. The other thing we did with the processor which, again, I thought was remarkably obvious at the time, and so we didn’t patent it which, in retrospect, might have been a mistake, was we had this idea of just exposing a little bit of the internal operations of the processor to enable the memories to operate in high-speed page mode when the processor was generating sequential memory request addresses, which it does quite a lot of the time. That gave us about 50 percent more bandwidth out of the memory and really for no additional cost. It was a very cheap thing to do. In terms of complexity it cost maybe half-a-dozen logic gates on the chip, and one pin, to give the memory controller enough information to deliver 50 percent higher bandwidth and therefore performance. The other reason we were confident the design would deliver was, being a reduced instruction set computer, it had no complicated instructions, and therefore we knew exactly how long it would run before-- the maximum time it would run-- before it would take an interrupt, and we could design that to meet our real time requirements.
Edit:
To clarify, the extra pin signals that the next access is sequential - that the address will be present-address+1, which it is for all straight-line code other than loads and stores, and also for multi-word pushes and pops. The CPU knows in the previous cycle that this is going to be so. In most cases, that means the access will be in-page for the DRAM, and so a CAS-only access is enough, and that's much faster.
(A later, bigger, chip will have an onboard cache, and its memory accesses will often be cache-line refills, and those too are sequential in nature. But Acorn had no room on the ARM chip for cache.)
(Hat-tip to Jeff for suggesting the clarification.)
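As a back-of-envelope check on that 50 percent figure (the DRAM timings below are assumed round numbers for illustration, not Acorn's measured values — a full RAS+CAS access versus a CAS-only in-page access at half the time):

```python
# Rough bandwidth model for the SEQ-pin trick: accesses flagged sequential can
# use fast CAS-only page-mode cycles; the rest pay the full random-access time.

T_RANDOM = 250e-9     # assumed: full row+column DRAM access
T_PAGE   = 125e-9     # assumed: CAS-only access within the open row

def relative_bandwidth(seq_fraction):
    """Bandwidth relative to doing every access the slow way."""
    avg = seq_fraction * T_PAGE + (1 - seq_fraction) * T_RANDOM
    return T_RANDOM / avg

# With ~75% of accesses flagged sequential, bandwidth rises by ~60%:
assert abs(relative_bandwidth(0.75) - 1.6) < 1e-12
```

So a sequential-access fraction somewhere in the two-thirds region, with these timings, lands right around Furber's 50 percent — consistent with straight-line code being the common case.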
Re: A pool of odd ideas for speeding up a 6502 architecture
There's a curious detail regarding the multi-word pushes and pops. (I don't see it mentioned in the article Ed linked, so I guess it's something I read elsewhere. I think this detail pertains to the ARM1; perhaps Ed can confirm.)
One might expect that a multi-word push causes successive writes to decreasing addresses, and a multi-word pop causes successive reads from increasing addresses. But, compared to what you'd expect, the pushes are performed backwards.
The result of the push is as you'd expect. But the implementation is twisted around so it uses successive writes to increasing addresses. They did it this way because accesses to decreasing addresses don't benefit from the bandwidth optimization described in Ed's post.
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: A pool of odd ideas for speeding up a 6502 architecture
Ah, you're right about pushes and pops being in an odd order. I'm not sure about the effect on the in-page detection - I'd hope that both incrementing and decrementing accesses would count as in-page if they are indeed sequential. But the reason for the reverse access is more interesting: it's because you have to restore something last, in order that the instruction can be interrupted... or something. I'm going to have to look this up - let's try this excellent book by Steve Furber himself...
Yes, he says that initially the order of writes to stack was as you'd expect (which means the stack pointer need only ever be incremented or decremented) but when at a late stage they decided they would want to implement an abort mechanism, they needed the PC to be updated last, so that an aborted multiple load would be restartable.
See also
http://www.heyrick.co.uk/armwiki/STM#Function
That book by Furber has lots of implementation detail about ARM1, and also documents the rationale for many decisions.
Re: A pool of odd ideas for speeding up a 6502 architecture
Thanks for the correction. So, you're saying a multi-word pop causes successive reads from decreasing addresses?
It's fascinating to hear what Furber has to say -- the stories, and the insights. Recommended!
Re: A pool of odd ideas for speeding up a 6502 architecture
I suspect I may be confused, unclear, or even both... one thing to keep in mind is that the ARM, because of this tactic for abortable multiple-register operations, must in some cases do an add (or maybe a subtract) and then access the stack in the 'wrong' order. I'm fairly sure of that. Rick's page says:
"Decrement After
This is more complicated. The first address is that of the base register (Rn) minus four times the number of registers to be written, plus four. The registers are written, the address increments."
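Rick's "decrement after" rule can be sketched directly (a hypothetical helper; 4-byte words, registers identified by number):

```python
# "Decrement after" store-multiple, per the description above: compute the
# lowest address up front (Rn - 4*n + 4), then write the registers
# lowest-first to increasing addresses, so every access after the first
# is sequential.

def stm_decrement_after(base, regs):
    """Return (address, register) pairs in the order the bus sees them."""
    first = base - 4 * len(regs) + 4
    return [(first + 4 * i, r) for i, r in enumerate(sorted(regs))]

# Pushing R0..R2 with a base address of 0x8000: lowest register lands at the
# lowest address, and the writes ascend.
assert stm_decrement_after(0x8000, [0, 1, 2]) == [
    (0x7FF8, 0), (0x7FFC, 1), (0x8000, 2),
]
```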
Re: A pool of odd ideas for speeding up a 6502 architecture
Here's what Ken Shirriff has to say - he's neither confused nor unclear!
Footnote 3:
Footnote 4:
(Thanks Jeff for prompting me to find this.)
Quote:
When using the block data operations to push and pull registers on the stack, you'd expect to push R0, R1, R2, etc and then pop in the reverse order R2, R1, R0.[3] To handle this, the priority encoder needs to provide the registers in either order, and the address incrementer needs to increment or decrement addresses depending on whether you're pushing or popping, and the chip includes this circuitry. However, there's a flaw that wasn't discovered until midway through the design of the ARM1. Register 15 (the program counter) must always be updated last, or else you can't recover from a fault during the instruction because you've lost the address.[4]
The solution used in the ARM1 is to always read or write registers starting with the lowest register and the lowest address. In other words, to pop R2, R1, R0, the ARM1 jumps into the middle of the stack and pops R0, R1, R2 in the reverse order. It sounds crazy but it works. (The bit counter determines how many words to shift the starting position.) The consequence of this redesign was that the circuitry to decrement addresses and priority encode in reverse order is never used. This circuitry was removed from the ARM2.
Quote:
[3] The block data transfer instructions work for general register copying, not just pushing and popping to a stack. It's simpler to explain the instructions in terms of a stack, though.
Quote:
[4] If an instruction encounters a memory fault (e.g. a virtual memory page is missing), you want to take an interrupt, fix the problem (e.g. load in the page), and then restart the instruction. However, if you update registers high-to-low, R15 (the program counter) will be updated first. If a fault happens during the instruction, the address of the instruction (R15) is lost, and restarting the instruction is a problem.
One solution would be to push registers high-to-low and pop low-to-high so R15 is always updated last. Apparently the ARM designers wanted the low register at the low address, even if the stack grows upwards, so popping R15 last wouldn't work. Another alternative is to have a "shadow" program counter that can restore the program counter during a fault. The ARM1 designers considered this alternative too complex. For details, see page 248 of "VLSI RISC Architecture and Organization", by Stephen Furber, one of the ARM1's designers.
Re: A pool of odd ideas for speeding up a 6502 architecture
"The solution used in the ARM1 is to always read or write registers starting with the lowest register and the lowest address."
So, it appears the multi-word pushes & pops are performed in a way that benefits from the bandwidth tweak. But benefiting from the tweak wasn't their main motivation.
Thanks for sorting this out, Ed
Re: A pool of odd ideas for speeding up a 6502 architecture
Yes, I see that I did miss something in my equivocation: if "sequential" means "sequentially increasing" then the memory system only needs to detect the all-ones condition for being at the top of a page as being an exception. That's simple. I was thinking of allowing for accesses in both directions, which would mean also detecting all-zeros for being at the bottom of a page, which is a bit worse than half as good!
(All-ones being only about seven bits or so, I think, for the part of the address which matters.)
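That asymmetry can be made concrete with a small sketch (PAGE_BITS is an assumed in-page address width, per the "about seven bits or so" estimate):

```python
# Page-crossing detection for sequential DRAM accesses.  If "sequential"
# always means +1, the controller only has to spot the all-ones case (about
# to step off the top of a page).  Supporting both directions also requires
# spotting the all-zeros case at the bottom of a page.

PAGE_BITS = 7                      # assumed width of the in-page address field
MASK = (1 << PAGE_BITS) - 1

def fast_incrementing(addr):
    """+1-only scheme: fast unless at the top of the page."""
    return (addr & MASK) != MASK

def fast_bidirectional(addr, descending):
    """Both-directions scheme: also watch the bottom of the page."""
    if descending:
        return (addr & MASK) != 0
    return (addr & MASK) != MASK

assert fast_incrementing(0x100) is True
assert fast_incrementing(0x17F) is False                    # all-ones: +1 leaves the page
assert fast_bidirectional(0x180, descending=True) is False  # all-zeros: -1 leaves the page
```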
Re: A pool of odd ideas for speeding up a 6502 architecture
BigEd wrote:
Yes, I see that I did miss something in my equivocation: if "sequential" means "sequentially increasing" then the memory system only needs to detect the all-ones condition for being at the top of a page as being an exception. That's simple. I was thinking of allowing for accesses in both directions, which would mean also detecting all-zeros for being at the bottom of a page, which is a bit worse than half as good!
(All-ones being only about seven bits or so, I think, for the part of the address which matters.)
Looking at the schematic at http://mdfs.net/Info/Comp/BBC/Circuits/Tube/armcpu.gif, you can indeed see the SEQ signal from the ARM CPU is combined with several address lines in a 20R8A PAL (IC9), which is part of a discrete DRAM memory controller.
The address lines are A2..A6 and A24..A25
The DRAMs being used are Hitachi HM50256 which have 9 bit row and column addresses.
So it seems in this case, due to limited inputs and/or product terms, the implementation is not as optimal as it could be (considering only 5 rather than 9 address bits).
I think A24..A25 are used to distinguish RAM, ROM and I/O regions.
Edit: And further, the ARM Evaluation System Hardware Reference Manual:
http://acorn.huininga.nl/pub/docs/manua ... Manual.pdf
gives the equations of IC9 on page 31, and mentions a 32-word boundary.
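Reading those equations back as logic (the function name is made up; word addresses are byte addresses shifted by two), the PAL's decision amounts to something like:

```python
# Sketch of the IC9 behaviour as described: with only A2..A6 available, a
# sequential access is treated as in-page unless those five bits are all
# ones, i.e. the fast path is refused once per 32-word run.

def pal_allows_page_mode(addr, seq):
    """seq: the ARM's SEQ pin; addr: byte address (words are 4 bytes)."""
    word_bits = (addr >> 2) & 0x1F          # A2..A6
    return seq and word_bits != 0x1F        # fall back at a 32-word boundary

# Over a long sequential run, 31 of every 32 accesses still get the fast path:
run = [pal_allows_page_mode(a * 4, seq=True) for a in range(32)]
assert sum(run) == 31
```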
Dave
Re: A pool of odd ideas for speeding up a 6502 architecture
Good detective work, Dave! It's a good point that even if only 31 out of every 32 accesses are sped up, that's still a substantial fraction - and in practice, there will usually be some non-sequential accesses in a run that long anyway. It's a diminishing-returns situation. In fact, thinking about it, 32 words is of the order of a cache-line refill, which would be the modern case.
Re: A pool of odd ideas for speeding up a 6502 architecture
Also, longer bursts risk starvation in a multi-master bus system, so these are often not desirable anyway.