6502.org • View topic - RTS and its inner workings

Board index » 6502.org Users Forum » Programming

All times are UTC

RTS and its inner workings

[ 7 posts ]

kakemoms

Post subject: RTS and its inner workings

Posted: Wed May 02, 2018 8:45 am

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343

Hi

I am looking at the RTS instruction, which supposedly pulls 2 bytes (the return address minus one) from the stack and then moves that value, plus one, into the program counter.

The strange thing is that while this happens, there seem to be three additional memory accesses (as taken from Visual6502.org):

Code:

1   0017   60 RTS    0017
2   0018   00         0018
3   01fb   00         0019
4   01fc   04         0019
5   01fd   00         0019
6   0004   00         0004
7   0005   4c JMP Abs 0005

The second column is the accessed address, the third is the data byte, and the last is the PC.

So after the RTS instruction has been fetched (at $0017), it looks like the next byte (at $0018) is also read? Then it accesses three addresses on the stack, of which the first ($01FB) seems unnecessary. Then it reads the address just before ($0004) the actual next instruction ($0005).

Does anyone here know more about why RTS behaves this way? Is it correct that it behaves like this on real hardware, or am I just misinterpreting the data?

BigEd

Post subject: Re: RTS and its inner workings

Posted: Wed May 02, 2018 10:42 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

I'm sure visual6502 is right. As a quick check, we know that RTS takes 6 cycles, and that's what happens here.

So, we might ask what happens in each cycle?

Here's a quick simulation showing a bit more detail:

Cycle 1: fetch RTS - the CPU can't do anything else as it hasn't seen the instruction
Cycle 2: fetch a possible operand, as it always does (and must, because it hasn't had time to decode the instruction). But it does send SP to the ALU, and also to the address bus
Cycle 3: SP to address bus, read the empty byte below the stack, not useful but it has to do something. ALU performs the increment for the next cycle
Cycle 4: SP+1 to address bus, reading PC-1 low byte, will be directed to ALU. ALU performs another increment
Cycle 5: SP+2 to address bus from the ALU, also updates the value of SP. ALU performs a NOP with PC-1 low byte, PC-1 high byte is read
Cycle 6: PC is updated with PC-1, new PC sent to address bus, byte before destination is read because something has to happen, PC will be incremented as usual
Cycle 1: fetch next opcode as per usual
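
As a rough cross-check, the address-bus side of this cycle list can be written as a small Python model and compared with the trace in the first post. This is a sketch under stated assumptions, not a description of the chip's internals: memory is a plain dict, unlisted addresses read as $00 (as they happen to in the trace), and SP is assumed to be $FB before the RTS, inferred from the $01FB access in cycle 3. The function name `rts_bus_trace` is hypothetical.

```python
def rts_bus_trace(pc, sp, mem):
    """Model the six bus reads of RTS, following the cycle list above.

    pc  -- address of the RTS opcode
    sp  -- 8-bit stack pointer before RTS (points at the empty slot)
    mem -- dict of address -> byte; unlisted addresses read as $00
    Returns (list of (address, data) reads, PC after RTS, SP after RTS).
    """
    trace = []

    def read(addr):
        data = mem.get(addr, 0x00)
        trace.append((addr, data))
        return data

    read(pc)                           # cycle 1: fetch RTS opcode
    read((pc + 1) & 0xFFFF)            # cycle 2: fetch "operand", discarded
    read(0x0100 | sp)                  # cycle 3: read empty stack slot, discarded
    sp = (sp + 1) & 0xFF
    lo = read(0x0100 | sp)             # cycle 4: pull low byte of PC-1
    sp = (sp + 1) & 0xFF
    hi = read(0x0100 | sp)             # cycle 5: pull high byte of PC-1
    pc_minus_one = (hi << 8) | lo
    read(pc_minus_one)                 # cycle 6: read byte before destination, discarded
    return trace, (pc_minus_one + 1) & 0xFFFF, sp


if __name__ == "__main__":
    # Memory contents taken from the Visual6502 trace in the first post
    mem = {0x0017: 0x60, 0x01FC: 0x04, 0x01FD: 0x00}
    trace, new_pc, sp = rts_bus_trace(0x0017, 0xFB, mem)
    for cycle, (addr, data) in enumerate(trace, start=1):
        print(f"{cycle}  {addr:04x}  {data:02x}")
    print(f"PC={new_pc:04x} SP={sp:02x}")
```

Running it reproduces the address and data columns of the trace (0017/60, 0018/00, 01fb/00, 01fc/04, 01fd/00, 0004/00) and ends with PC=0005 SP=fd. The model bumps SP in Python between reads; on the real chip those increments happen in the ALU, overlapped with the bus cycles as described above.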

Looking back through history to the 6800, we see its RTS is one cycle quicker, because the PC value is read from the stack already incremented. But its JSR takes 8 or even 9 cycles, so overall the 6502 is an improvement.
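
Putting numbers on that comparison for one call/return pair, as a minimal Python sketch: the 6502 figures (6 cycles each for JSR and RTS) are the documented ones, while the 6800 figures are taken from this post (RTS one cycle quicker, JSR 8 or 9 cycles) and not independently checked here. The names `CYCLES` and `call_return_cycles` are hypothetical.

```python
# Cycle counts per instruction. JSR is a list because the 6800 figure
# quoted above is "8 or even 9" cycles.
CYCLES = {
    "6502": {"JSR": [6], "RTS": 6},
    "6800": {"JSR": [8, 9], "RTS": 5},
}


def call_return_cycles(cpu):
    """Return the possible cycle totals for one JSR followed by one RTS."""
    counts = CYCLES[cpu]
    return [jsr + counts["RTS"] for jsr in counts["JSR"]]


if __name__ == "__main__":
    for cpu in CYCLES:
        print(cpu, call_return_cycles(cpu))
```

So a 6502 call and return costs 12 cycles against 13 or 14 on the 6800, which is the sense in which the 6502 comes out ahead overall.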

John West

Post subject: Re: RTS and its inner workings

Posted: Wed May 02, 2018 10:42 am

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 336

64doc has the following explanation. It was written long before we had Visual6502, so consider it more speculation than fact. But it's plausible.

Code:

        #  address R/W description
       --- ------- --- -----------------------------------------------
        1    PC     R  fetch opcode, increment PC
        2    PC     R  read next instruction byte (and throw it away)
        3  $0100,S  R  increment S
        4  $0100,S  R  pull PCL from stack, increment S
        5  $0100,S  R  pull PCH from stack
        6    PC     R  increment PC

The cycle after the opcode fetch is always a fetch of the next instruction byte. The opcode isn't available until the end of the first cycle, and the following memory access has to be decided by the start of the second cycle - there isn't enough time to decode the instruction. Since the next instruction byte is often needed, the 6502 fetches it anyway. In one-byte instructions, it isn't needed.

By the end of the second cycle, we know what the instruction is, and can start executing it. Since S always points to an empty location, it has to be incremented first. Then on the fourth cycle, we can start fetching PC. The fetch finishes on the fifth cycle.

JSR pushes a value one less than the address of the next instruction (I remember we had a lengthy discussion of exactly what JSR is doing, but don't remember any of the details), so the value popped has to be incremented. This can't happen until after the high 8 bits are fetched in cycle 5, so we have a dummy read in cycle 6.
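
The push/pull arrangement described here can be sketched in Python. This models only the stack traffic, with hypothetical function names `jsr` and `rts`, and assumes JSR is the three-byte absolute instruction, so the next instruction is at pc + 3 and the value pushed is pc + 2, high byte first.

```python
def jsr(pc, target, sp, mem):
    """Model JSR abs located at `pc`; returns (new_pc, new_sp).

    The pushed value is pc + 2, the address of JSR's last byte, which is
    one less than the address of the next instruction (pc + 3).
    """
    ret_minus_one = (pc + 2) & 0xFFFF
    mem[0x0100 | sp] = ret_minus_one >> 8      # push high byte first
    sp = (sp - 1) & 0xFF
    mem[0x0100 | sp] = ret_minus_one & 0xFF    # then low byte
    sp = (sp - 1) & 0xFF
    return target, sp


def rts(sp, mem):
    """Model RTS; returns (new_pc, new_sp)."""
    sp = (sp + 1) & 0xFF
    lo = mem[0x0100 | sp]                      # pull low byte
    sp = (sp + 1) & 0xFF
    hi = mem[0x0100 | sp]                      # pull high byte
    return (((hi << 8) | lo) + 1) & 0xFFFF, sp  # popped value + 1


if __name__ == "__main__":
    mem = {}
    pc, sp = jsr(0x0002, 0x0017, 0xFD, mem)    # hypothetical JSR at $0002
    print(hex(mem[0x01FC]), hex(mem[0x01FD]), hex(sp))
    pc, sp = rts(sp, mem)
    print(hex(pc), hex(sp))
```

With a hypothetical JSR at $0002 and SP=$FD, this leaves $04 at $01FC and $00 at $01FD with SP=$FB, which matches the stack contents in the Visual6502 trace, and RTS then returns to $0005.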

Edit: BigEd is a little bit quicker than me. And his post is based on what Visual6502 is doing, so definitely believe him.

BigEd

Post subject: Re: RTS and its inner workings

Posted: Wed May 02, 2018 10:47 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Indeed, we noticed here that the 6502 can manage to do without a temporary register - something the 6800 has - by using the ALU pipeline and the stack pointer itself as temporary storage. That saves on chip size and so reduces cost. As it happens, Arlet was able to use the same tactic to remove a register in his impressively compact HDL 6502.

kakemoms

Post subject: Re: RTS and its inner workings

Posted: Wed May 02, 2018 7:19 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343

BigEd wrote:

Well, there are several instructions that take more cycles than the number of separate address bus accesses, so 6 cycles didn't mean that it had to do all this. INY, for example, takes two cycles, but it only reads the opcode; nothing else happens on the address bus.

As you say, the ALU already gets SP in cycle 2. It is somewhat strange that they did not choose to increment SP right there. My guess is that they reuse the hardware and, as you said, there is no separate register storage, so it was probably done that way for improved density.
In cycle 3, SP goes out on the address bus. In effect it is using the address bus as a register to avoid the need for a separate one. Therefore it takes one extra cycle to add 1 to the value on the address bus.

For cycle 6, the stored PC-1 is used as new PC. I don't understand why they need to do this, because apparently the opcode at PC-1 is read (it appears on the databus). And why would JSR be longer if PC was stored instead of PC-1?

BigEd

Post subject: Re: RTS and its inner workings

Posted: Wed May 02, 2018 8:03 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

The 6502 can't operate on SP in cycle 2 because it's only reading it then. The address goes out near the beginning of the cycle, and the data comes back right at the end of the cycle.

The 6502 makes a bus access on every cycle because it only has a RnW control: it doesn't have separate read and write controls. So in a cycle where the memory access appears to be unnecessary, one must ask what internal operation is going on which takes a cycle. Or there could be several things going on. Cycle 3, then, is used to increment SP. The address bus value is an accident of internal bus activity and internal controls.
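
To illustrate with INY from the previous post: 64doc describes two-cycle implied-mode instructions as fetching the opcode and then reading the next instruction byte and throwing it away, so INY still produces two reads on the bus. A minimal sketch, with the hypothetical helper `implied_bus_trace` and memory as a dict:

```python
def implied_bus_trace(pc, mem):
    """Bus reads for a two-cycle implied instruction such as INY ($C8).

    Cycle 1 fetches the opcode and increments PC; cycle 2 reads the byte at
    the new PC and discards it, without incrementing PC again.
    Returns (list of (address, data) reads, PC after the instruction).
    """
    trace = [(pc, mem.get(pc, 0x00))]        # cycle 1: opcode fetch
    pc = (pc + 1) & 0xFFFF
    trace.append((pc, mem.get(pc, 0x00)))    # cycle 2: dummy read, discarded
    return trace, pc


if __name__ == "__main__":
    mem = {0x0200: 0xC8, 0x0201: 0xEA}       # INY followed by NOP
    trace, pc = implied_bus_trace(0x0200, mem)
    print([(hex(a), hex(d)) for a, d in trace], hex(pc))
```

In this example the NOP opcode at $0201 is read twice: once as INY's discarded second-cycle read, and again as the next opcode fetch.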

Cycle 6 is the earliest cycle in which the full 16 bit PC-1 is available. The only place to store that is in the 16 bit PC register, and that's also the only place where a 16 bit increment can be done in a single cycle.
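
A small worked case shows why the increment has to be a full 16-bit one: if the value pulled from the stack is $12FF, the return address is $1300, so the high byte changes too. A sketch in Python, treating the value as separate low and high bytes (the helper name `increment_16` is hypothetical):

```python
def increment_16(lo, hi):
    """Increment a 16-bit value held as two bytes; returns (lo, hi)."""
    lo = (lo + 1) & 0xFF
    if lo == 0x00:                 # low byte wrapped from $FF: carry into high byte
        hi = (hi + 1) & 0xFF
    return lo, hi


if __name__ == "__main__":
    print([hex(b) for b in increment_16(0x04, 0x00)])  # the case in the trace
    print([hex(b) for b in increment_16(0xFF, 0x12)])  # page-crossing case
```

The first call gives $0005 as in the trace; the second gives $1300, where the carry reaches the high byte, which only arrives in cycle 5.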

Why does JSR take 6 cycles and store PC-1? That's another story. I think it may have been explored somewhere here already.

Always best to approach these questions as 'why does it work like this' - the designers will have arranged things as best they can, and will have learnt from their 6800 experience.

BigEd

Post subject: Re: RTS and its inner workings

Posted: Thu May 03, 2018 3:47 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Hmm, I think I garbled my story ever so slightly there.

Anyhow, I think perhaps there are two topics here: what the 6502 does, and why, on the one hand, and on the other hand, what might a higher-performance 6502 do?

That second idea has been visited many times of course, and there was even briefly a commercial offering with single-cycle instructions, something the 6502 has never done. Both that offering, and more recent HDL projects which offer single-cycle instructions, use up part of the memory access cycle to do some pre-decoding. That might be a good tradeoff, but it could mean the improvement in cycle count comes at the cost of increased clock period - and that's something to keep a close eye on, because improved clock speeds are the simplest and most uniform way to improve performance, and conversely a reduced clock speed can lose the advantage of some bit of cleverness.
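
The tradeoff can be made concrete with simple arithmetic: run time is cycle count times clock period. The numbers below are purely hypothetical, chosen to show how a 20% cut in cycles can be wiped out by a 30% longer clock period; the function names are also hypothetical.

```python
def run_time_ns(cycles, period_ns):
    """Total execution time: cycle count multiplied by clock period."""
    return cycles * period_ns


def break_even_period_ns(base_cycles, base_period_ns, new_cycles):
    """Longest clock period at which a design using new_cycles still matches
    the baseline's run time."""
    return base_cycles * base_period_ns / new_cycles


if __name__ == "__main__":
    # Hypothetical workload: 1000 cycles at 100 ns (10 MHz) on the baseline,
    # versus 800 cycles on a design whose pre-decoding stretches the period to 130 ns.
    base = run_time_ns(1000, 100)
    fewer_cycles = run_time_ns(800, 130)
    print(base, fewer_cycles, break_even_period_ns(1000, 100, 800))
```

In this made-up case the design with fewer cycles takes 104000 ns against 100000 ns, about 4% slower overall; it would need a clock period of 125 ns or less just to break even.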

My feeling is that the 6502 ISA is not a great fit for all the techniques we've seen applied to improve IPC, but then nor is the x86 ISA, and that's been pushed very far indeed. The 6502 itself has only a minimal amount of pipelining, but to do better you're up against the many and various addressing modes. I think a great deal would be possible, but would take a lot of expertise and testing, and the sort of person who could do that is fairly likely to look at something a bit more RISCy.
