Page 4 of 6
Posted: Fri Dec 09, 2011 10:23 am
by Arlet
I think adding shift/rotate immediate would be a natural option. So, you could do
Code: Select all
LSR #10
which would be an LSR of A by 10 bits. I think it would be fairly easy to modify the current model to allow for this, since you can reuse all the logic for the existing immediate mode.
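As a sketch of what such an instruction would compute (the mnemonic syntax and the 16-bit register width are only illustrative; the real encoding would come from the opcode table under discussion), here is the effect of an LSR-immediate in Python:

```python
def lsr_imm(a, n, width=16):
    """Logical shift right of accumulator `a` by an immediate count `n`.

    Zeros shift in from the top, and the result is masked to the
    register width. In hardware the idea is that the existing
    immediate-mode operand fetch supplies `n` to the shifter.
    """
    mask = (1 << width) - 1
    return (a & mask) >> n

# LSR #10 on a 16-bit accumulator:
print(hex(lsr_imm(0xFFFF, 10)))  # 0x3f
```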
Posted: Fri Dec 09, 2011 4:01 pm
by BigEd
Yes, now that I've a little more confidence with modifying the core, this could be a good step. It's four opcodes, but it's useful. Again, column B looks like the best choice.
Posted: Fri Dec 09, 2011 5:03 pm
by ElEctric_EyE
I like your idea, BigEd... How hard would it be to make your D register like the X & Y registers?
Posted: Fri Dec 09, 2011 5:14 pm
by BigEd
well, you've seen the source. You'd need to change or add lines everywhere that X and Y are used or written to. You'd also need to find new opcodes to match. In my case that's extra difficult because I'm still sticking with a parameterised core which builds for 8-bit purposes, so I'm not considering using the higher bits in a 16-bit opcode. (I could do it anyway, using conditional compilation, but I'm still not considering it because the design space is too large and I know I can only make progress in small steps.)
The unique thing about my D and B registers is that they are a second operand to the ALU so they can't be in the main AXYS register file. (I could fix that by making it a dual-ported register file, or by duplicating it, but it's easiest to make them standalone.)
If you were extending AXYS to more registers without needing a different path to the ALU, that problem wouldn't arise.
Adding a U and V to act specifically like X and Y might be tolerably regular - think also of the assembler, and whether you could convince anyone to support your new machine!
Cheers
Ed
Posted: Fri Dec 09, 2011 5:31 pm
by ElEctric_EyE
...The unique thing about my D and B registers is that they are a second operand to the ALU so they can't be in the main AXYS register file...
I understand that now.
A macro should be able to easily take advantage of what you have done so far.
Posted: Fri Dec 09, 2011 6:57 pm
by GARTHWILSON
so I'm not considering using the higher bits in a 16-bit opcode.
I suspect that using more op-code bits could lead to reducing the number of logic levels in the instruction-decoding logic, possibly increasing maximum speed. Has that been looked into?
Posted: Fri Dec 09, 2011 7:21 pm
by Arlet
I think that the memory -> ALU path is the longest, so I doubt it would help to reduce instruction decoding. Also, I think adding more op-code bits would make decoding slower, rather than faster.
Posted: Fri Dec 09, 2011 8:27 pm
by GARTHWILSON
I'm no processor designer, but here's what I'm thinking, as I look at how the op codes are arranged on the table. See if this reasoning has merit.
Consider, for example, all the op codes that have Y as the destination register:
Code: Select all
PLY        0111 1010
LDY#       1010 0000
LDY_ABS    1010 1100
LDY_ZP,X   1011 0100
LDY_ABS,X  1011 1100
TAY        1010 1000
INY        1100 1000
DEY        1000 1000
The only bit that's consistent is bit 0; but there are a lot of other op codes with the same value in bit 0 that don't have Y as a destination, so other bits have to be inverted, ANDed, etc., for the instruction decoder to figure it out. (It does look like it was easier before PLY came along on the CMOS 6502; but that instruction is rather valuable.)
I would think that if you have enough bits, one bit could always be used to say the destination register is Y, without having to examine other bits, another could be used to mean the destination register is A, or stack, or other memory, etc., and you could even load more than one register at once if you wanted to, like some programs have LDY#0 and LDX#0, or LDY#FF and LDX#FF, right together. Another bit could indicate a source, or a particular step in the address mode, or a particular step in ALU usage, without having to examine any other bits. In fact with enough op code bits, the instruction decoder might be reduced to almost nothing, and the various fields in the op code itself would direct traffic throughout the various cycles of an instruction.
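The contrast can be sketched as a toy decode in Python (the 8-bit opcode values are the real ones from the table above; the "wide" format with a dedicated destination-is-Y bit is purely hypothetical, including the choice of bit 12):

```python
# Dense 65C02-style encoding: "destination is Y" must be recognised
# by matching several unrelated opcode patterns.
Y_DEST_OPCODES = {0x7A, 0xA0, 0xAC, 0xB4, 0xBC, 0xA8, 0xC8, 0x88}

def y_is_dest_dense(op):
    # In hardware this set-membership test becomes a tree of
    # AND/OR/invert terms over the opcode bits.
    return op in Y_DEST_OPCODES

# Hypothetical sparse 16-bit encoding: one bit (say bit 12)
# directly means "destination register is Y".
Y_DEST_BIT = 12

def y_is_dest_wide(op16):
    # A single bit test; no other opcode bits need examining.
    return (op16 >> Y_DEST_BIT) & 1 == 1
```

The dense version needs real decode logic; the wide version reduces the decoder for that field to a single wire, which is the trade Garth is describing.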
Even if it did not speed anything up, it may reduce the FPGA resources needed, allowing more instructions, register width, or whatever, in a given size of FPGA.
Posted: Fri Dec 09, 2011 8:32 pm
by BigEd
Right - simple instruction layout and fast instruction decode is one of the principles of RISC, and if you look at the instruction encodings of ARM, say, this is what you'll find.
But ... with 8-bit opcodes there's not enough room for manoeuvre. (And there's no point at all until you've done some synthesis and seen where the critical path really is - you risk spending effort on something which simply isn't a problem. Just like optimising code, you need to measure first.)
Right now, with the quick and cheap syntheses, we're seeing the ALU being the slow path, not the instruction decode. (Arlet could still be right, for a properly constrained synthesis of a system including RAM.)
Cheers
Ed
Posted: Fri Dec 09, 2011 8:35 pm
by GARTHWILSON
But with 8-bit opcodes there's not enough room for manoeuvre.
I was thinking that going to 16 or 32 bits would add a lot of freedom. I'll take your word for it though if you say it's really not an issue. The RISC pipeline is long though, costing a lot for branches and interrupts that require flushing and re-filling the pipe. With a data bus wide enough to specify all possible physical addresses in the machine in a single grab (as opposed to low byte, high byte), and possibly an input clock that's 2 or 4 times the bus clock so dead bus cycles can be eliminated, the interrupt latency can be further reduced, as can the length of many of the instructions.
Posted: Fri Dec 09, 2011 8:43 pm
by BigEd
It's hard to say without an experiment, of course. The thing about processor performance is that nothing comes as readily as process improvement: moving from spartan3 to 6, and accordingly faster external RAM, will be the biggest and easiest win.
I was about to say that putting immediate values into the extra bits might be the biggest win, but it's not much help unless you have single-cycle operations, and that's probably quite an upheaval.
But in any case, my own focus is not on performance: it's on capability. We get a lot of performance for free, if we can get this thing working at all. (And indeed, it is working, in a minimal implementation of 65Org16.) Conversely, if it isn't working, great ideas for improvement remain only potential, as they have in many discussions over many years.
We could really do with some more Verilog hackers: anyone who doesn't know how can consider themselves an absolute beginner and have a go at starting up the learning curve. The tools are free, and we have a working core to tinker with. Simulation is free too: you don't need any hardware.
For myself, I've ordered a spartan6 board with 16bit RAM and a chipscope license - I'm hoping that'll take away some excuse for slow progress. Due to arrive Monday.
Cheers
Ed
Posted: Fri Dec 09, 2011 10:11 pm
by Arlet
The Spartan-6 has 6-input LUTs, if I'm not mistaken. This means you can do a 6-bit opcode decode in a single logic level, and most 7-8 bit decodes in 2 levels. I agree that a really sparse encoding could simplify it even more, but it's probably better to spend some more logic on keeping the instruction encoding compact. In the future, the extra bits in the instruction space could be useful for including some of the operands, which will save memory and cycles.
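As a back-of-the-envelope check (my own sketch, not from the thread): a tree of k-input LUTs needs roughly ceil(log base k of n) levels to compute a simple n-input decode term, which matches the one-level and two-level figures above:

```python
import math

def lut_levels(n_inputs, lut_size=6):
    """Rough estimate of logic levels needed to decode a product
    term of n_inputs bits using lut_size-input LUTs in a tree."""
    if n_inputs <= 1:
        return n_inputs  # 0 or 1 input: trivial
    return math.ceil(math.log(n_inputs) / math.log(lut_size))

print(lut_levels(6))   # 1: a 6-bit opcode decodes in one level
print(lut_levels(8))   # 2: a 7-8 bit decode needs two levels
```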
Posted: Sat Dec 10, 2011 11:31 pm
by ElEctric_EyE
When can we see your code BigEd? I am confounded as to how you did what you did. Consider me a Verilog newbie. Maybe even a Verilog "hacker"!
...For myself, I've ordered a spartan6 board with 16bit RAM and a chipscope license - I'm hoping that'll take away some excuse for slow progress...
What will Chipscope show you? What kind of design are you going to put in this board? And which board is it?
Posted: Sun Dec 11, 2011 12:42 am
by kc5tja
The RISC pipeline is long though, costing a lot for branches and interrupts that require flushing and re-filling the pipe.
The CISC pipeline is far worse; today's 25-plus-stage pipelines virtually guarantee massive delays on mispredicted branches and interrupts.
A 4-cycle latency for an interrupt is hardly an issue in practice. Saving and restoring 32 registers, however, is the true source of major latency with RISCs.
Also, the pipeline itself isn't an impediment to single-cycle interrupt handling. If the CPU maintains a whole separate context dedicated just for interrupt handling (e.g., does NOT fetch from CPU vectors, but has a whole different set of IP, A, X, Y, S, and P registers), you can switch contexts in a single cycle.
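The banked-context idea can be sketched as a toy Python model (the register names follow the post; the single-"cycle" bank flip is the point, not the details, and a real design would gate this on the interrupt lines):

```python
class BankedCPU:
    """Toy model of a CPU with a duplicate register context for
    interrupts: taking or leaving an interrupt just selects the
    other bank, so no registers are pushed to or pulled from memory."""

    REGS = ("IP", "A", "X", "Y", "S", "P")

    def __init__(self):
        # bank 0 = normal execution, bank 1 = interrupt context
        self.banks = [dict.fromkeys(self.REGS, 0) for _ in range(2)]
        self.active = 0

    def __getitem__(self, reg):
        return self.banks[self.active][reg]

    def __setitem__(self, reg, value):
        self.banks[self.active][reg] = value

    def interrupt(self):
        self.active = 1   # one "cycle": select the interrupt bank

    def rti(self):
        self.active = 0   # return is likewise a single bank switch

cpu = BankedCPU()
cpu["A"] = 0x42
cpu.interrupt()
cpu["A"] = 0xFF          # the handler clobbers its own A only
cpu.rti()
print(hex(cpu["A"]))     # 0x42: main context untouched
```

Note that with only two banks there is a single interrupt context, which is exactly the re-entrancy limitation Garth raises in the next post.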
Posted: Sun Dec 11, 2011 1:24 am
by GARTHWILSON
Also, the pipeline itself isn't an impediment to single-cycle interrupt handling. If the CPU maintains a whole separate context dedicated just for interrupt handling (e.g., does NOT fetch from CPU vectors, but has a whole different set of IP, A, X, Y, S, and P registers), you can switch contexts in a single cycle.
Hmmm... that does not sound re-entrant; for example, letting a very urgent but very quick-to-service interrupt cut in on the servicing of another one that is a little less urgent and takes longer to service. I can't remember for sure if I've ever done that on the 6502 outside of having NMI cut in on service of IRQ, but I am set up for it. I remember reading about the Harris RTX-2000 stack processor years ago, which handled this extremely efficiently: one to four clocks for interrupt entry and zero for return from interrupt, IIRC. And since the registers are the stack, you can nest as many levels deep as you want, as long as you don't overflow the stack.