Defining the 65Org32 - addressing modes (and instructions)
GARTHWILSON wrote:
My original 65Org32 proposal (it would be good to review the topic)
As we have very few FPGA implementers actively interested, it still might be worth taking a stepwise approach: first make the step from 65Org16 to something with 32data/32address with the necessary changes to the state machine, and then add the instructions, registers and execution units according to taste.
Cheers
Ed
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Quote:
I'm not sure about "minus most three byte instructions" - the instructions are there but the length is different. I think perhaps two interesting things happen:
- detecting short addresses to choose zero page versus absolute will always choose zero page
- so some of the absolute forms never get used. So they need not be implemented, which does leave room for alternate opcodes
As you also point out, that won't work for JSR or JMP. They really do have to change from emitting three bytes to just two, as would any other absolute instruction without an equivalent zero page form (no examples of which come immediately to mind). From a macro implementation point of view it's just changing
Code: Select all
JSR .macro ?addr
.byte $20
.byte <(?addr)
.byte >(?addr)
.endm
to
Code: Select all
JSR .macro ?addr
.byte $20
.byte ?addr
.endm
The one less memory fetch would make them faster, wouldn't it?
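To make the macro change concrete, here's a sketch in Python (illustrative only; these helper names aren't from any real assembler) contrasting the two JSR encodings:

```python
# Hypothetical helpers contrasting the two JSR encodings: the byte-wide
# 6502 emits three 8-bit memory cells, while a 32-bit-word machine
# emits just two 32-bit cells.

def encode_jsr_bytes(addr):
    """Classic 6502: opcode, low byte, high byte."""
    return [0x20, addr & 0xFF, (addr >> 8) & 0xFF]

def encode_jsr_words(addr):
    """Word-wide machine: opcode word, then the whole address in one cell."""
    return [0x20, addr & 0xFFFFFFFF]
```

One cell fewer to fetch per JSR, which is where the speed gain mentioned above would come from.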
Quote:
One concern: bit shifting up to 32 bits becomes quite cumbersome with only single-bit shift instructions, and because there are no 8 bit units (octets), you can't take any shortcuts by shifting per octet either.
So it seems an extension for a barrel-shift instruction would be quite useful, though it might slow down the core by quite a bit (although you could make it an N-clock-cycle instruction)
One thought regarding speed, though. Is it really such a problem? I've been told that actual sustained reading speeds of DDR2 memory chips and the like are closer to 50MHz than the 800MHz (or whatever) burst speeds they are capable of. That would seem to match what Ed's tests are showing pretty well. Much faster, and the problems of matching a faster CPU with slower memory are going to become more prominent.
Or am I totally off base here?
Oh, and one other thought: if there was an N-clock shift instruction, how would that affect interrupt latency? Or would it be interruptible, not atomic? I suppose from a programming point of view I'd care most only if I was trying to interface hardware of some kind, but I'm kind of curious anyway. I suppose the same question could be asked about a multiply or divide instruction, actually, unless they could be done in 6 cycles or less (unlikely without a lot of additional circuitry, I suspect).
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
Oh, and one other thought: if there was an N-clock shift instruction, how would that affect interrupt latency? Or would it be interruptible, not atomic? I suppose from a programming point of view I'd care most only if I was trying to interface hardware of some kind, but I'm kind of curious anyway. I suppose the same question could be asked about a multiply or divide instruction, actually, unless they could be done in 6 cycles or less (unlikely without a lot of additional circuitry, I suspect).
teamtempest
Quote:
I might still see having direct-page, absolute, and "long" addressing though, all three, even though they all cover the same 4 gigaword address range; because sometimes you want the DP offset so you use the DP addressing, sometimes the DBR or PBR offset so you use absolute addressing, and sometimes no offset so you use "long" addressing (although it's no longer than the others-- it just ignores the offsets).
But since all addresses are equivalent in the proposed processor, there's no way to distinguish them simply by examining them. To make them "un-equivalent" there would have to be something like the 8/24 "zero page" I suggested earlier.
One way out, to enable multiple address modes in the face of indistinguishable addresses, would be to change the opcode mnemonics to reflect the desired mode. Something like the Motorola 68000 series, perhaps:
Code: Select all
LDA addr
LDA.DP addr
LDA.PBR addr
GARTHWILSON wrote:
Shifting can be done all at once too, instead of one bit at a time. But yes, I expect it takes more resources.
Code: Select all
// This assumes you don't want to rely on your vendor's >> or <<
// primitive. I'd recommend using >> or << first, and switch to something
// like this if, and only if, you run out of chip area.
//
// This also assumes a 16-bit wide set of registers. Add one stage for
// each width doubling.
//
// This code implements a logical shift right. Supporting arithmetic
// shifting isn't much harder. Supporting shifting to the left is similarly
// easy to implement.
// note: 'input' and 'output' are reserved words in Verilog, so the
// operand registers are named d_in and d_out here. Blocking
// assignments are used because this block is pure combinational logic.
reg [15:0] d_in;
reg [15:0] d_out;
reg [3:0] amt;
reg [15:0] t8, t4, t2;
always @(*) begin
t8 = (amt[3] == 1) ? {8'b00000000, d_in[15:8]} : d_in;
t4 = (amt[2] == 1) ? {4'b0000, t8[15:4]} : t8;
t2 = (amt[1] == 1) ? {2'b00, t4[15:2]} : t4;
d_out = (amt[0] == 1) ? {1'b0, t2[15:1]} : t2;
end
Thanks for the shifter! That's nice and simple, and shows that it only takes one level of mux for each bit of shift distance. So five levels for a 32bit machine: this probably won't slow things down at all.
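The mux-per-stage structure can be modelled in a few lines of Python, widened to 32 bits: five conditional stages, one per bit of the 5-bit shift amount (a behavioral sketch, not taken from any posted core):

```python
# Python model of the staged shifter: each 'if' corresponds to one
# 2:1 mux level in hardware, keyed off one bit of the shift amount.

def lsr32(value, amt):
    """Logical shift right for a 32-bit word, amt in 0..31."""
    t = value & 0xFFFFFFFF
    if amt & 16: t >>= 16   # stage 1
    if amt & 8:  t >>= 8    # stage 2
    if amt & 4:  t >>= 4    # stage 3
    if amt & 2:  t >>= 2    # stage 4
    if amt & 1:  t >>= 1    # stage 5
    return t
```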
Because we need two operands for this, both ideally from registers, I'd consider a new single-purpose register and use it for all 4 existing shift/rotate instructions. We then need just a single extra instruction
TXD
to transfer to the distance D register which we initialise to 1. (In Arlet's core, this register would not be in the register file, which is single-port)
Multiplication will also be cheap because the nice FPGA people give us an efficient 18x18 primitive to build with. Four of those and we're done (in a single cycle)
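The four-primitive decomposition works like this (sketched here with 16x16 partial products; the 18x18 FPGA primitive covers 16-bit halves with room to spare):

```python
# Sketch of covering a 32x32 multiply with four smaller multipliers:
# split each operand into high and low halves, form four partial
# products, and shift them into place before summing.

def mul32(a, b):
    al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF
    bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF
    return (al * bl) + ((al * bh + ah * bl) << 16) + ((ah * bh) << 32)
```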
Division is always a bit more expensive: the choice is between
- a multi cycle opcode
- a division step operation (repeated in the source, by a loop or unrolled)
- using multiplication instead (multiply by reciprocal) (*)
You can get 2 bits of result per cycle or operation for the first two methods.
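A divide-step operation might look like the following (a plain radix-2 restoring step over a remainder/dividend register pair, one quotient bit per step; a radix-4 step, retiring two bits, would match the figure above — this is a generic sketch, not any particular core's design):

```python
# One radix-2 restoring divide step: shift a dividend bit into the
# remainder, compare against the divisor, and record a quotient bit.

def div_step(r, q, d):
    r = (r << 1) | (q >> 31)        # bring down the next dividend bit
    q = (q << 1) & 0xFFFFFFFF
    if r >= d:                      # restoring subtract
        r -= d
        q |= 1                      # record a quotient bit
    return r, q
```

Repeated 32 times (by loop or unrolled), q ends up holding the quotient and r the remainder.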
(I wouldn't worry so much about interrupt latency, for the simple reason that a cycle at 50MHz is much shorter than a cycle at 4MHz.)
Cheers
Ed
(*) See here and here (code samples for chapter 10) for discussion of converting a constant to a 'magic number' which acts like a reciprocal.
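As a worked example of the 'magic number' idea (the constant below is derived inline, not taken from the linked references): for a 32-bit unsigned n, dividing by 10 reduces to one multiply and one shift, with magic number ceil(2**35 / 10):

```python
# Division by a constant via multiplication by a scaled reciprocal:
# 0xCCCCCCCD == ceil(2**35 / 10), so (n * 0xCCCCCCCD) >> 35 == n // 10
# for all 32-bit unsigned n.

def div10(n):
    return (n * 0xCCCCCCCD) >> 35
```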
Last edited by BigEd on Sat Dec 03, 2011 11:25 am, edited 1 time in total.
For Xilinx the built-in << and >> shift operations work very well. I've used them in a CPU design, and they are quite efficient. Despite their efficiency, they still slow down the ALU, because the shifter requires more area and therefore increases routing delays. You also need extra muxes to select between all the different ALU operations.
It is also an easy option to stick an extra register after the shifter output, which slows down the instruction by a cycle, but allows the synthesis tool to pick a pipelined version of the operation. This also works well with the built-in multipliers.
GARTHWILSON
Quote:
(I wouldn't worry so much about interrupt latency, for the simple reason that a cycle at 50MHz is much shorter than a cycle at 4MHz.)
One thing I've used my workbench computer for is digitizing audio, interrupt-driven off a VIA timer, to get a set of samples to do a spectrum analysis on the aircraft noise (or in one case, off-road race-car noise) our communications products have to deal with. If I use a sharp (5th-order, not quite "brick-wall") anti-alias filter that cuts off the input just above 4kHz, my 5MHz phase-2 rate with the jitter in the interrupt rate adds an unintentional noise that is 43dB down from the highest-frequency components' level in the audio. That means it's good for about 7 bits of accuracy, which is pretty good in this application. (The dynamic range is more, because if the signal level goes down, so does the noise that is unintentionally added by jitter.) The jitter is the same regardless of the sampling rate, but has a greater effect with higher input signal frequency.
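The rough arithmetic behind the 43dB / 7-bit figure, for anyone following along: each bit of resolution is worth about 6.02dB of dynamic range (a standard rule of thumb, not something computed in the thread):

```python
# Each bit of resolution corresponds to a factor of 2 in amplitude,
# i.e. 20*log10(2) ~= 6.02 dB; so 43 dB is roughly 7 bits.
import math

def db_to_bits(db):
    return db / (20 * math.log10(2))
```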
But if I wanted high-quality audio out to 20kHz, I would want the jitter to be only about 1/20th as much (*), which it could do at 100MHz, all other factors being equal. Computers don't normally do this kind of thing interrupt-driven, but instead offload it to separate hardware and use DMA. But if the 6502 has the interrupt performance to do it without that added hardware and complication, why not. The attractive part is that we can do this stuff on a home-made computer without being computer scientists.
Related, my workbench computer can do a 2048-point complex fast Fourier transform in 5 seconds at 5MHz in Forth. Assembly would speed it up a little, but not too much because so much of the time is taken in the multiply routine. If it had a hardware multiply and also ran at 50MHz, well, you can imagine the possibilities.
(*) This comes from five times the maximum audio frequency, multiplied by 4 to get at least two more bits of accuracy. In this "audio myths" YouTube video with lots of demonstrations (and full-quality wave files downloadable from www.ethanwiner.com/aes so you don't get the effects of YouTube's lossy data compression), Ethan Winer looked for the most offensive noise he could find, and mixed it in at different levels with fine recordings. You can't even hear it until it's about 9 bits below the music level. On my cheap computer speakers on my desk, I couldn't hear it until it was 7 bits down. See 32:00-34:30, and 46:00-47:45.
Good point about the jitter. Almost certainly, a divide-step instruction would be the simplest thing to do. It was good enough for SPARC. Another possibility is the coprocessor approach, where you set off a divide and fetch the result some cycles later.
But keeping it simple is the decision most likely to lead to an implementation!
Cheers
Ed
Edit: Wirth has some code in this short paper (pdf) in Oberon, for ARM, which makes me think we don't need a divide instruction at all.
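In the spirit of that software-only approach (a generic shift-and-subtract sketch, not Wirth's actual Oberon code), a full unsigned divide needs nothing but shifts, compares, and subtracts:

```python
# Unsigned division without a divide instruction: align the divisor
# under each bit position of the dividend and subtract where it fits.

def udiv(n, d, bits=32):
    """Returns (quotient, remainder) using only shift/compare/subtract."""
    q, r = 0, n
    for i in range(bits - 1, -1, -1):
        if (d << i) <= r:
            r -= d << i
            q |= 1 << i
    return q, r
```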
GARTHWILSON
Quote:
Good point about the jitter. Almost certainly, a divide-step instruction would be the simplest thing to do. It was good enough for SPARC. Another possibility is the coprocessor approach, where you set off a divide and fetch the result some cycles later.
Arlet's tip reminding(!) us of the << and >> operators set me off on an implementation experiment.
(This has strayed from addressing modes to architectural extensions, but never mind.)
Previously we had these speeds for an unconstrained, 'Balanced' synthesis at various sizes:
BigEd wrote:
For spartan3:
- 8bit data/16bit address: 57.267MHz
- 16bit data/32bit address: 54.457MHz
- 32bit data/40bit address: 52.596MHz
I added a D register and a TXD instruction, and a barrel shifter which sits alongside the 6502-style ALU (but inside the ALU module). It's not very tidy, but here are some results:
For spartan3:
- 8bit data/16bit address: 55.066MHz
- 16bit data/32bit address: 53.217MHz
- 32bit data/40bit address: 45.850MHz
With those same tactics, the post-synth timings on that last experiment come out at 53.657MHz.
Also note that the critical path seems to be through the adder and then the zero-detection. I think Arlet has commented previously that zero-detection could be punted to a subsequent cycle.
Having got my hands dirty, I could have a go at unsigned multiply.
Cheers
Ed
Edit to add:
Targeted a spartan6 part with the same 32-bit wide design and the same improved tactics.
xc6slx9-3-csg324: post-synth 108MHz (9.24ns period), post-route clock period 9.68ns
with a -2 speed grade: 88MHz, 12.2ns, 13.6ns (this is the speed grade in the Avnet microboard)
Note that this is still an unconstrained design: in particular, the pad-to-clock and clock-to-pad timings are greater than the period. So this won't run as fast as that, but it does show that the enhanced ALU isn't critical.
Final point: see how adding long-distance shifts has knocked the clock speed down a bit. All programs are penalised a little, in theory, to help those cases where we need to shift a lot. In practice, there's only a penalty if we actually need to clock slower.
Edit to add: flawed implementation posted here
Last edited by BigEd on Fri Mar 09, 2012 5:32 am, edited 1 time in total.
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
...I added a D register and a TXD instruction, and a barrel shifter which sits alongside the 6502-style ALU (but inside the ALU module.)...
Did you make new opcodes for these instructions?...
I am realizing, after using the 65Org16 Core with an 8bit peripheral (i.e. a TFT) to do multiple conversions from 8bit to 16bit and vice-versa, that a multiply by 256 and a divide by 256 would be very nice, especially if the new opcode could complete quicker than the 8 successive ASL's/LSR's I'm currently using.
Yes, I decided to use $9B for TXD (which is TXY on other 65xx variations), and then the new D reg becomes a 'distance' for all forms of shift: ROL, ROR, ASL and LSR.
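The D-register scheme can be modelled like this (a hypothetical behavioral sketch; the names and widths are illustrative, not taken from the posted core):

```python
# Model of the D ('distance') register: TXD sets the distance, after
# which every LSR shifts by D bits instead of the classic single bit.

class ShiftUnit:
    def __init__(self):
        self.d = 1                  # D initialised to 1: classic one-bit shifts

    def txd(self, x):
        self.d = x & 0x1F           # 5 bits of distance on a 32-bit core

    def lsr(self, a):
        return (a & 0xFFFFFFFF) >> self.d
```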
It works, but I'm not completely happy with it. I'm not convinced that one often needs a 5-bit shift or a 17-bit shift. I suppose the nice thing about an FPGA is that anyone who wants it can have it, and anyone else can leave it out. Most likely, specific widths such as 16- or 24-bit shifts would see the most use. It's also not obvious that the rotate-with-carry model is always what's wanted.
The rotates were easy, but the masking and sign extension for the shifts were a bit messy: there must be better ways to code those. The speed loss at 32-bit width is disappointing.
I also threw in an MPY as $DB (which is STP elsewhere) but that's untested. I decided to implement XBA ($EB), so that we get
Code: Select all
(B,A)=A*B
which means we can multiply two numbers and get a full-sized result with the minimum of hardware change. Then the software will need to do a little organising of inputs and outputs before and after.
For your particular case, wanting to get at the high octet of a 16-bit value, I'm not sure if either of my changes is much help. The multiply and the shift are both fast, but the setup will cost:
Code: Select all
LDX #8
TXD
LSR A
LDX #1
TXD
or
Code: Select all
LDA #256
XBA
LDA something
MPY
Of course, you can leave 8 in the D register and reuse it for several shifts, so long as you remember to set it back to 1 eventually. (I've just realised this is the kind of modal change which an interrupt handler won't like - but then, saving and restoring registers is something interrupt handlers need to do.)
Cheers
Ed
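Both sequences pick out the high octet of a 16-bit value; modelled in Python (illustrative only — the multiply route works because the high half of v*256, the 'B' accumulator after MPY on a 16-bit-word core, is v >> 8):

```python
# Two routes to the high octet of a 16-bit value: a distance-8 LSR,
# or a multiply by 256 followed by taking the high 16-bit half.

def high_octet_shift(v):
    return (v & 0xFFFF) >> 8                # LSR with D = 8

def high_octet_multiply(v):
    prod = (v & 0xFFFF) * 256               # LDA #256 / XBA / LDA v / MPY
    return prod >> 16                       # high half: 'B' after MPY
```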