Defining the 65Org32 - addressing modes (and instructions)
GARTHWILSON wrote:
My original 65Org32 proposal (it would be good to review the topic)
As we have very few FPGA implementers actively interested, it still might be worth taking a stepwise approach: first make the step from 65Org16 to something with 32data/32address with the necessary changes to the state machine, and then add the instructions, registers and execution units according to taste.
Cheers
Ed
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Quote:
I'm not sure about "minus most three byte instructions" - the instructions are there but the length is different. I think perhaps two interesting things happen:
- detecting short addresses to choose zero page versus absolute will always choose zero page
- so some of the absolute forms never get used. So they need not be implemented, which does leave room for alternate opcodes
As you also point out, that won't work for JSR or JMP. They really do have to change from emitting three bytes to just two, as would any other absolute instruction without an equivalent zero page form (no examples of which come immediately to mind). From a macro implementation point of view it's just changing
Code: Select all
JSR .macro ?addr
.byte $20
.byte <(?addr)
.byte >(?addr)
.endm
to
Code: Select all
JSR .macro ?addr
.byte $20
.byte ?addr
.endm
The one less memory fetch would make them faster, wouldn't it?
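To make the macro change concrete, here's a sketch in Python (illustrative only; these helper names aren't from any real assembler) contrasting the two JSR encodings:

```python
# Hypothetical helpers contrasting the two JSR encodings: the byte-wide
# 6502 emits three 8-bit memory cells, while a 32-bit-word machine
# emits just two 32-bit cells.

def encode_jsr_bytes(addr):
    """Classic 6502: opcode, low byte, high byte."""
    return [0x20, addr & 0xFF, (addr >> 8) & 0xFF]

def encode_jsr_words(addr):
    """Word-wide machine: opcode word, then the whole address in one cell."""
    return [0x20, addr & 0xFFFFFFFF]
```

One cell fewer to fetch per JSR, which is where the speed gain mentioned above would come from.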
Quote:
One concern: bit shifting up to 32 bits becomes quite cumbersome with only single-bit shift instructions, and because there are no 8 bit units (octets), you can't take any shortcuts by shifting per octet either.
So it seems an extension for a barrel-shift instruction would be quite useful, though it might slow down the core by quite a bit (although you could make it an N-clock-cycle instruction)
One thought regarding speed, though. Is it really such a problem? I've been told that actual sustained reading speeds of DDR2 memory chips and the like are closer to 50MHz than the 800MHz (or whatever) burst speeds they are capable of. That would seem to match what Ed's tests are showing pretty well. Much faster, and the problems of matching a faster CPU with slower memory are going to become more prominent.
Or am I totally off base here?
Oh, and one other thought: if there was an N-clock shift instruction, how would that affect interrupt latency? Or would it be interruptible, not atomic? I suppose from a programming point of view I'd care most only if I was trying to interface hardware of some kind, but I'm kind of curious anyway. I suppose the same question could be asked about a multiply or divide instruction, actually, unless they could be done in 6 cycles or less (unlikely without a lot of additional circuitry, I suspect).
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
Oh, and one other thought: if there was an N-clock shift instruction, how would that affect interrupt latency? Or would it be interruptible, not atomic? I suppose from a programming point of view I'd care most only if I was trying to interface hardware of some kind, but I'm kind of curious anyway. I suppose the same question could be asked about a multiply or divide instruction, actually, unless they could be done in 6 cycles or less (unlikely without a lot of additional circuitry, I suspect).
teamtempest
Quote:
I might still see having direct-page, absolute, and "long" addressing though, all three, even though they all cover the same 4 gigaword address range; because sometimes you want the DP offset so you use the DP addressing, sometimes the DBR or PBR offset so you use absolute addressing, and sometimes no offset so you use "long" addressing (although it's no longer than the others-- it just ignores the offsets).
But since all addresses are equivalent in the proposed processor, there's no way to distinguish them simply by examining them. To make them "un-equivalent" there would have to be something like the 8/24 "zero page" I suggested earlier.
One way out, to enable multiple address modes in the face of indistinguishable addresses, would be to change the opcode mnemonics to reflect the desired mode. Something like the Motorola 68000 series, perhaps:
Code: Select all
LDA addr
LDA.DP addr
LDA.PBR addr
GARTHWILSON wrote:
Shifting can be done all at once too, instead of one bit at a time. But yes, I expect it takes more resources.
Code: Select all
// This assumes you don't want to rely on your vendor's >> or <<
// primitive. I'd recommend using >> or << first, and switch to something
// like this if, and only if, you run out of chip area.
//
// This also assumes a 16-bit wide set of registers. Add one stage for
// each width doubling.
//
// This code implements a logical shift right. Supporting arithmetic
// shifting isn't much harder. Supporting shifting to the left is similarly
// easy to implement.
// note: 'input' and 'output' are reserved words in Verilog, so the
// operand registers are named d_in and d_out here. Blocking
// assignments are used because this block is pure combinational logic.
reg [15:0] d_in;
reg [15:0] d_out;
reg [3:0] amt;
reg [15:0] t8, t4, t2;
always @(*) begin
t8 = (amt[3] == 1) ? {8'b00000000, d_in[15:8]} : d_in;
t4 = (amt[2] == 1) ? {4'b0000, t8[15:4]} : t8;
t2 = (amt[1] == 1) ? {2'b00, t4[15:2]} : t4;
d_out = (amt[0] == 1) ? {1'b0, t2[15:1]} : t2;
end
Thanks for the shifter! That's nice and simple, and shows that it only takes one level of mux for each bit of shift distance. So five levels for a 32bit machine: this probably won't slow things down at all.
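The mux-per-stage structure can be modelled in a few lines of Python, widened to 32 bits: five conditional stages, one per bit of the 5-bit shift amount (a behavioral sketch, not taken from any posted core):

```python
# Python model of the staged shifter: each 'if' corresponds to one
# 2:1 mux level in hardware, keyed off one bit of the shift amount.

def lsr32(value, amt):
    """Logical shift right for a 32-bit word, amt in 0..31."""
    t = value & 0xFFFFFFFF
    if amt & 16: t >>= 16   # stage 1
    if amt & 8:  t >>= 8    # stage 2
    if amt & 4:  t >>= 4    # stage 3
    if amt & 2:  t >>= 2    # stage 4
    if amt & 1:  t >>= 1    # stage 5
    return t
```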
Because we need two operands for this, both ideally from registers, I'd consider a new single-purpose register and use it for all 4 existing shift/rotate instructions. We then need just a single extra instruction
TXD
to transfer to the distance D register which we initialise to 1. (In Arlet's core, this register would not be in the register file, which is single-port)
Multiplication will also be cheap because the nice FPGA people give us an efficient 18x18 primitive to build with. Four of those and we're done (in a single cycle)
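The four-primitive decomposition works like this (sketched here with 16x16 partial products; the 18x18 FPGA primitive covers 16-bit halves with room to spare):

```python
# Sketch of covering a 32x32 multiply with four smaller multipliers:
# split each operand into high and low halves, form four partial
# products, and shift them into place before summing.

def mul32(a, b):
    al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF
    bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF
    return (al * bl) + ((al * bh + ah * bl) << 16) + ((ah * bh) << 32)
```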
Division is always a bit more expensive: the choice is between
- a multi cycle opcode
- a division step operation (repeated in the source, by a loop or unrolled)
- using multiplication instead (multiply by reciprocal) (*)
You can get 2 bits of result per cycle or operation for the first two methods.
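A divide-step operation might look like the following (a plain radix-2 restoring step over a remainder/dividend register pair, one quotient bit per step; a radix-4 step, retiring two bits, would match the figure above — this is a generic sketch, not any particular core's design):

```python
# One radix-2 restoring divide step: shift a dividend bit into the
# remainder, compare against the divisor, and record a quotient bit.

def div_step(r, q, d):
    r = (r << 1) | (q >> 31)        # bring down the next dividend bit
    q = (q << 1) & 0xFFFFFFFF
    if r >= d:                      # restoring subtract
        r -= d
        q |= 1                      # record a quotient bit
    return r, q
```

Repeated 32 times (by loop or unrolled), q ends up holding the quotient and r the remainder.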
(I wouldn't worry so much about interrupt latency, for the simple reason that a cycle at 50MHz is much shorter than a cycle at 4MHz.)
Cheers
Ed
(*) See here and here (code samples for chapter 10) for discussion of converting a constant to a 'magic number' which acts like a reciprocal.
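As a worked example of the 'magic number' idea (the constant below is derived inline, not taken from the linked references): for a 32-bit unsigned n, dividing by 10 reduces to one multiply and one shift, with magic number ceil(2**35 / 10):

```python
# Division by a constant via multiplication by a scaled reciprocal:
# 0xCCCCCCCD == ceil(2**35 / 10), so (n * 0xCCCCCCCD) >> 35 == n // 10
# for all 32-bit unsigned n.

def div10(n):
    return (n * 0xCCCCCCCD) >> 35
```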
Last edited by BigEd on Sat Dec 03, 2011 11:25 am, edited 1 time in total.
For Xilinx the built-in << and >> shift operations work very well. I've used them in a CPU design, and they are quite efficient. Despite their efficiency, they still slow down the ALU, because the shifter requires more area and therefore increases routing delays. You also need extra muxes to select between all the different ALU operations.
It is also an easy option to stick an extra register after the shifter output, which slows down the instruction by a cycle, but allows the synthesis tool to pick a pipelined version of the operation. This also works well with the built-in multipliers.
GARTHWILSON
Quote:
(I wouldn't worry so much about interrupt latency, for the simple reason that a cycle at 50MHz is much shorter than a cycle at 4MHz.)
One thing I've used my workbench computer for is digitizing audio, interrupt-driven off a VIA timer, to get a set of samples to do a spectrum analysis on the aircraft noise (or in one case, off-road race-car noise) our communications products have to deal with. If I use a sharp (5th-order, not quite "brick-wall") anti-alias filter that cuts off the input just above 4kHz, my 5MHz phase-2 rate with the jitter in the interrupt rate adds an unintentional noise that is 43dB down from the highest-frequency components' level in the audio. That means it's good for about 7 bits of accuracy, which is pretty good in this application. (The dynamic range is more, because if the signal level goes down, so does the noise that is unintentionally added by jitter.) The jitter is the same regardless of the sampling rate, but has a greater effect with higher input signal frequency.
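The rough arithmetic behind the 43dB / 7-bit figure, for anyone following along: each bit of resolution is worth about 6.02dB of dynamic range (a standard rule of thumb, not something computed in the thread):

```python
# Each bit of resolution corresponds to a factor of 2 in amplitude,
# i.e. 20*log10(2) ~= 6.02 dB; so 43 dB is roughly 7 bits.
import math

def db_to_bits(db):
    return db / (20 * math.log10(2))
```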
But if I wanted high-quality audio out to 20kHz, I would want the jitter to be only about 1/20th as much (*), which it could do at 100MHz, all other factors being equal. Computers don't normally do this kind of thing interrupt-driven, but instead offload it to separate hardware and use DMA. But if the 6502 has the interrupt performance to do it without that added hardware and complication, why not. The attractive part is that we can do this stuff on a home-made computer without being computer scientists.
Related, my workbench computer can do a 2048-point complex fast Fourier transform in 5 seconds at 5MHz in Forth. Assembly would speed it up a little, but not too much because so much of the time is taken in the multiply routine. If it had a hardware multiply and also ran at 50MHz, well, you can imagine the possibilities.
(*) This comes from five times the maximum audio frequency, multiplied by 4 to get at least two more bits of accuracy. In this "audio myths" YouTube video with lots of demonstrations (and full-quality wave files downloadable from www.ethanwiner.com/aes so you don't get the effects of YouTube's lossy data compression), Ethan Winer looked for the most offensive noise he could find, and mixed it in at different levels with fine recordings. You can't even hear it until it's about 9 bits below the music level. On my cheap computer speakers on my desk, I couldn't hear it until it was 7 bits down. See 32:00-34:30, and 46:00-47:45.
Good point about the jitter. Almost certainly, a divide-step instruction would be the simplest thing to do. It was good enough for SPARC. Another possibility is the coprocessor approach, where you set off a divide and fetch the result some cycles later.
But keeping it simple is the decision most likely to lead to an implementation!
Cheers
Ed
Edit: Wirth has some code in this short paper (pdf) in Oberon, for ARM, which makes me think we don't need a divide instruction at all.
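In the spirit of that software-only approach (a generic shift-and-subtract sketch, not Wirth's actual Oberon code), a full unsigned divide needs nothing but shifts, compares, and subtracts:

```python
# Unsigned division without a divide instruction: align the divisor
# under each bit position of the dividend and subtract where it fits.

def udiv(n, d, bits=32):
    """Returns (quotient, remainder) using only shift/compare/subtract."""
    q, r = 0, n
    for i in range(bits - 1, -1, -1):
        if (d << i) <= r:
            r -= d << i
            q |= 1 << i
    return q, r
```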
GARTHWILSON
Quote:
Good point about the jitter. Almost certainly, a divide-step instruction would be the simplest thing to do. It was good enough for SPARC. Another possibility is the coprocessor approach, where you set off a divide and fetch the result some cycles later.
Arlet's tip reminding(!) us of the << and >> operators set me off on an implementation experiment.
(This has strayed from addressing modes to architectural extensions, but never mind.)
Previously we had these speeds for an unconstrained, 'Balanced' synthesis at various sizes:
BigEd wrote:
For spartan3:
- 8bit data/16bit address: 57.267MHz
- 16bit data/32bit address: 54.457MHz
- 32bit data/40bit address: 52.596MHz
I added a D register and a TXD instruction, and a barrel shifter which sits alongside the 6502-style ALU (but inside the ALU module). It's not very tidy, but here are some results:
For spartan3:
- 8bit data/16bit address: 55.066MHz
- 16bit data/32bit address: 53.217MHz
- 32bit data/40bit address: 45.850MHz
With those same tactics, the post-synth timings on that last experiment come out at 53.657MHz.
Also note that the critical path seems to be through the adder and then the zero-detection. I think Arlet has commented previously that zero-detection could be punted to a subsequent cycle.
Having got my hands dirty, I could have a go at unsigned multiply.
Cheers
Ed
Edit to add:
Targeted a spartan6 part with the same 32-bit wide design and the same improved tactics.
xc6slx9-3-csg324: post-synth 108MHz (9.24ns period), post-route clock period 9.68ns
with a -2 speed grade: 88MHz, 12.2ns, 13.6ns (this is the speed grade in the Avnet microboard)
Note that this is still an unconstrained design: in particular, the pad-to-clock and clock-to-pad timings are greater than the period. So this won't run as fast as that, but it does show that the enhanced ALU isn't critical.
Final point: see how adding long-distance shifts has knocked the clock speed down a bit. All programs are penalised a little, in theory, to help those cases where we need to shift a lot. In practice, there's only a penalty if we actually need to clock slower.
Edit to add: flawed implementation posted here
Last edited by BigEd on Fri Mar 09, 2012 5:32 am, edited 1 time in total.
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
...I added a D register and a TXD instruction, and a barrel shifter which sits alongside the 6502-style ALU (but inside the ALU module.)...
Did you make new opcodes for these instructions?...
I am realizing, after using the 65Org16 Core with an 8bit peripheral (i.e. a TFT) to do multiple conversions from 8bit to 16bit and vice-versa, that a multiply by 256 and a divide by 256 would be very nice, especially if the new opcode could complete quicker than the 8 successive ASL's/LSR's I'm currently using.
Yes, I decided to use $9B for TXD (which is TXY on other 65xx variations), and then the new D reg becomes a 'distance' for all forms of shift: ROL, ROR, ASL and LSR.
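The D-register scheme can be modelled like this (a hypothetical behavioral sketch; the names and widths are illustrative, not taken from the posted core):

```python
# Model of the D ('distance') register: TXD sets the distance, after
# which every LSR shifts by D bits instead of the classic single bit.

class ShiftUnit:
    def __init__(self):
        self.d = 1                  # D initialised to 1: classic one-bit shifts

    def txd(self, x):
        self.d = x & 0x1F           # 5 bits of distance on a 32-bit core

    def lsr(self, a):
        return (a & 0xFFFFFFFF) >> self.d
```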
It works, but I'm not completely happy with it. I'm not convinced that one often needs a 5-bit shift or a 17-bit shift. I suppose the nice thing about an FPGA is that anyone who wants it can have it, and anyone else can leave it out. Most likely, specific widths such as 16- or 24-bit shifts would see the most use. It's also not obvious that the rotate-with-carry model is always what's wanted.
The rotates were easy, but the masking and sign extension for the shifts were a bit messy: there must be better ways to code those. The speed loss at 32-bit width is disappointing.
I also threw in an MPY as $DB (which is STP elsewhere) but that's untested. I decided to implement XBA ($EB), so that we get
Code: Select all
(B,A)=A*B
which means we can multiply two numbers and get a full-sized result with the minimum of hardware change. Then the software will need to do a little organising of inputs and outputs before and after.
For your particular case, wanting to get at the high octet of a 16-bit value, I'm not sure if either of my changes is much help. The multiply and the shift are both fast, but the setup will cost:
Code: Select all
LDX #8
TXD
LSR A
LDX #1
TXD
or
Code: Select all
LDA #256
XBA
LDA something
MPY
Of course, you can leave 8 in the D register and reuse it for several shifts, so long as you remember to set it back to 1 eventually. (I've just realised this is the kind of modal change which an interrupt handler won't like - but then, saving and restoring registers is something interrupt handlers need to do.)
Cheers
Ed
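Both sequences pick out the high octet of a 16-bit value; modelled in Python (illustrative only — the multiply route works because the high half of v*256, the 'B' accumulator after MPY on a 16-bit-word core, is v >> 8):

```python
# Two routes to the high octet of a 16-bit value: a distance-8 LSR,
# or a multiply by 256 followed by taking the high 16-bit half.

def high_octet_shift(v):
    return (v & 0xFFFF) >> 8                # LSR with D = 8

def high_octet_multiply(v):
    prod = (v & 0xFFFF) * 256               # LDA #256 / XBA / LDA v / MPY
    return prod >> 16                       # high half: 'B' after MPY
```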