6502 redundant, missed, and suggested features
Re: 6502 redundant, missed, and suggested features
GARTHWILSON wrote:
I just checked what my 1985 Synertek book says about the NMOS '02 though, and it was still not updated to reflect the bug that was found. It was still saying, "The next memory location contains the high-order byte of the effective address which is loaded into the sixteen bits of the program counter." As we know however, it won't be the next one if the address record starts at xxFF.
I never had the Leventhal book back then btw.
Re: 6502 redundant, missed, and suggested features
Quote:
Never let an indirect address cross a page boundary, as in JMP ($31FF). Although the high-order byte of the indirect address is in the first location of the next page (in this example, memory location 3200₁₆), the CPU will fetch the high-order byte from the first location of the same page (location 3100₁₆ in our example).
Re: 6502 redundant, missed, and suggested features
Even then, determining which uP it is should be fairly straightforward, and having two suitable routines, or at least an error condition would be the way to go.
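For anyone who hasn't seen the wrap-around concretely, here is a small Python model of the indirect-jump vector fetch (a hypothetical sketch; the function name and memory contents are made up for illustration, and the 65C02 branch reflects the corrected behavior):

```python
# Hypothetical model of JMP (abs) vector fetch on NMOS 6502 vs. 65C02.
def jmp_indirect(mem, ptr, nmos=True):
    """Return the address an indirect JMP through `ptr` would land on."""
    lo = mem[ptr]
    if nmos:
        # NMOS bug: the high byte is fetched from the SAME page,
        # so the pointer wraps xxFF -> xx00 instead of crossing pages.
        hi_addr = (ptr & 0xFF00) | ((ptr + 1) & 0x00FF)
    else:
        # 65C02 fix: the pointer increments normally across the boundary.
        hi_addr = (ptr + 1) & 0xFFFF
    return (mem[hi_addr] << 8) | lo

# The $31FF example from the quote above:
mem = {0x31FF: 0x34, 0x3200: 0x12, 0x3100: 0x56}
assert jmp_indirect(mem, 0x31FF, nmos=True)  == 0x5634  # high byte from $3100
assert jmp_indirect(mem, 0x31FF, nmos=False) == 0x1234  # high byte from $3200
```

Running a vector like this through both branches also hints at how a run-time detection routine could tell the two parts apart, as suggested above.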
- White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: 6502 redundant, missed, and suggested features
Bregalad wrote:
Except the reason I encourage people not using JMP () is not because of the bug, but because using pha/pha/rts is shorter (and in some cases, faster). I would still use JMP () if for some reason this would be the optimal solution, but in all my cases where I had to use jump tables, rts happened to be a more efficient solution.
In the rare cases where using JMP () was actually a possibility, the vector has *always* sat in zero page for me, making this bug irrelevant.
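The PHA/PHA/RTS trick mentioned above works because RTS pulls an address from the stack and jumps to it *plus one*, so the dispatch table must hold each target minus one. A tiny Python model (hypothetical, purely for illustration; no real stack is simulated):

```python
# Hypothetical model of the PHA/PHA/RTS jump-table dispatch.
def rts_dispatch(table, index):
    """Return the address RTS would reach, given a table of (target - 1) words."""
    entry = table[index] & 0xFFFF
    # The code would PHA the high byte, then the low byte; RTS pulls the
    # 16-bit value back off the stack, increments it, and loads the PC.
    hi, lo = entry >> 8, entry & 0xFF
    return (((hi << 8) | lo) + 1) & 0xFFFF

handlers = [0x8000, 0x8123, 0x9ABC]            # real target addresses
table = [(a - 1) & 0xFFFF for a in handlers]   # store target - 1, as the trick requires
assert rts_dispatch(table, 1) == 0x8123
```

The off-by-one bookkeeping is the whole trick; get it wrong and every dispatch lands one byte late.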
- BigDumbDinosaur
- Posts: 9427
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: 6502 redundant, missed, and suggested features
BigEd wrote:
Never let an indirect address cross a page boundary, as in JMP ($31FF). Although the high-order byte of the indirect address is in the first location of the next page (in this example, memory location 3200₁₆), the CPU will fetch the high-order byte from the first location of the same page (location 3100₁₆ in our example).
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: 6502 redundant, missed, and suggested features
Thanks for checking BDD. If you'd be so kind as to have a quick look at the PDF I linked and confirm that it's an earlier (shorter) edition than the one you have, that would be great.
- BigDumbDinosaur
Re: 6502 redundant, missed, and suggested features
BigEd wrote:
Thanks for checking BDD. If you'd be so kind as to have a quick look at the PDF I linked and confirm that it's an earlier (shorter) edition than the one you have, that would be great.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: 6502 redundant, missed, and suggested features
I just checked Leventhal & Saville's book "6502 Assembly-Language Subroutines" (© 1982) and its wording on p.151 is different, but still covers the problem:
Quote:
IMPLEMENTATION ERRORS
Occasionally, a microprocessor's instructions simply do not work the way the designers or anyone else would expect. The 6502 has one implementation error that is, fortunately, quite rare. The instruction JMP ($XXFF), where the Xs represent any page number, does not work correctly. One would expect this instruction to obtain the destination address from memory locations XXFF and (XX+1)00. Instead, it apparently does not increment the more significant byte of the indirect address; it therefore obtains the destination address from memory locations XXFF and XX00. For example, JMP ($1CFF) will jump to the address stored in memory locations 1CFF₁₆ (LSB) and 1C00₁₆ (MSB), surely a curious outcome. Most assemblers expect the programmer to ensure that no indirect jumps ever obtain their destination addresses across page boundaries.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- BigDumbDinosaur
Re: 6502 redundant, missed, and suggested features
GARTHWILSON wrote:
I just checked Leventhal & Saville's book "6502 Assembly-Language Subroutines" (© 1982) and its wording on p.151 is different, but still covers the problem...
Re: 6502 redundant, missed, and suggested features
BigDumbDinosaur wrote:
BigEd wrote:
Thanks for checking BDD. If you'd be so kind as to have a quick look at the PDF I linked and confirm that it's an earlier (shorter) edition than the one you have, that would be great.
- BigDumbDinosaur
Re: 6502 redundant, missed, and suggested features
BigEd wrote:
BigDumbDinosaur wrote:
BigEd wrote:
Thanks for checking BDD. If you'd be so kind as to have a quick look at the PDF I linked and confirm that it's an earlier (shorter) edition than the one you have, that would be great.
Re: 6502 redundant, missed, and suggested features
Z-80s can take 3, 4, or 5 clock cycles per M-cycle, depending on the instruction executed, and a single instruction can take several M-cycles too.
The reason the average is between 3 and 5 is because the chip is microcoded underneath. That is, the Z-80 and 68000 CPUs actually have a simpler, yet functionally independent CPU (of sorts) whose job it is to interpret the programmer-visible instruction set. This is how, for example, the 68020 could get away with such utterly ridiculously complicated addressing modes.
In the Z-80's case, the microcode is used to multiplex data over finite numbers of internal buses, all of which were laid out by hand back then. It's also used to enforce the bus protocol as well:
1. Put address on the bus.
2. (While waiting for data to arrive, increment the PC.)
3w. Sample the RDY or similar signal, and if negated, wait here.
3. When RDY is asserted, accept the data and terminate the bus transaction.
E.g., if RDY is asserted during step 1 above, it means nothing.
The genius of the 6502 is that its bus was truly single phase (phi1 and phi2 are conveniences to simplify the NMOS implementation; it's not at all a requirement for the CMOS process, which is why the 65816 doesn't have them, and you drive phi2 externally). You drive the address bus and R/W lines during phase-1, and capture the data during phase-2. If the data wasn't ready, that's OK -- just repeat the current cycle. The 65816 is only marginally more sophisticated than this, due to multiplexing the bank-address byte on D7-D0.
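The two bus disciplines described above can be contrasted in a few lines of Python (a hypothetical sketch; signal names and the fixed three-T-state read are simplifications of the real timing):

```python
# Hypothetical models of one memory read on each bus discipline.
def z80_read(memory_ready):
    """Z-80 style: WAIT is sampled at a fixed point; wait states insert while negated."""
    cycles = 1                     # T1: address placed on the bus
    cycles += 1                    # T2: WAIT sampled near its end
    waits = 0
    while not memory_ready(waits):
        waits += 1                 # TW: extra wait states until memory responds
    return cycles + waits + 1      # T3: data accepted, transaction ends

def c6502_read(memory_ready):
    """6502 style: if RDY is pulled low, the whole cycle simply repeats."""
    cycles = 0
    while True:
        cycles += 1                # one full phi1/phi2 cycle: address out, data in
        if memory_ready(cycles - 1):
            return cycles

assert z80_read(lambda w: True) == 3     # fast memory: minimum 3 T-states
assert c6502_read(lambda c: True) == 1   # fast memory: a single cycle
assert c6502_read(lambda c: c >= 2) == 3 # slow memory: the cycle repeats
```

The point the post makes falls out of the model: the Z-80 pays a fixed multi-state overhead per transaction, while the 6502's "repeat the cycle" rule costs nothing when memory keeps up.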
As far as proof of 6502's performance relative to other CPUs with a normalized clock, all you need to do is count the cycles in your particular application. Yes, yes, the 68000 can pull off 2 MIPS performance at 8MHz, but know what? A 65816 at 8MHz will pull off 4 MIPS on average. At this point, the 65816 and 68000 compete head to head, with the former on average attaining close to 80% the performance of the 68000, despite having only an 8-bit wide bus. Proof: Sit an Apple IIgs on a table next to a classic Macintosh. Run a paint program on both of them. (Remember, Mac is monochrome, while IIgs is 16-color. While the Mac has more pixels, the IIgs actually has more bits to push around total). You'd totally expect the IIgs at 2.3MHz to be slower at video updates than the 8MHz Mac; however, this is not the case. Grab a brush from some picture and drag it around the screen. The Mac will have very noticeable rip and tear, while the IIgs will appear to be about as fast as a Commodore-Amiga using its blitter to draw.
As a final analysis, let's normalize bus widths and optimize our microarchitectures too (in fact, we have an existence proof: the 68008), and what you'll find is that the 68000 is abysmally sluggish compared to the 65816. The only reason the 68000 gets the performance that it does is because it has a true 16-bit wide bus. Slap a 16-bit wide bus on the 65816, changing nothing else, and I'm willing to put money that the 65816 will meet or even slightly exceed the 68000.
If we take this opportunity to really widen the data bus, then a single instruction fetch can grab a whole handful of instructions. This is quite useful thanks to something called macro-op fusion: if you augment the instruction decode logic to perform it, your instruction timings will look like this:
Code:
; Assuming native mode, 16-bit accumulator.
; For a core that implements macro-op fusion, I further assume a 64-bit bus.
;
;                     CODE AS-IS    Macro-Op Fusion
CLC            ;          2               1          [1, 3]
LDA addend1L   ;          5               1          [2, 5]
ADC addend2L   ;          5               1          [2, 5]
STA resultL    ;          5               2          [3, 4, 5]
LDA addend1H   ;          5               1          [2, 5]
ADC addend2H   ;          5               2          [2, 3, 5]
STA resultH    ;          5               1          [4, 5]
;     TOTAL              32 cycles        9 cycles (best case)
Notes:
1. Out of context, CLC normally would take the usual 2 cycles; but, since it's recognized as part of a more complex code pattern, its behavior can be rolled into the mechanisations of the surrounding code.
2. This instruction takes 2 cycles to fetch a 16-bit word from memory.
3. There is an additional cycle of overhead for instruction fetch on this byte.
4. This instruction takes 2 cycles to store a 16-bit word to memory.
5. Add 1 cycle if the 16-bit operand crosses an 8-byte boundary.
The CPU is now looking not just at individual instructions to determine what to do, but at the context surrounding them. clc, lda, adc is a single "macro-op" instruction. sta, lda occurs entirely too frequently to miss this one too. adc, sta occurs less frequently, but it's strongly desirable for what I hope are obvious reasons.
According to http://oldwww.nvg.ntnu.no/amiga/MC680x0 ... ndard.HTML , a single ADD.L instruction takes 12 cycles. The code above fetches, adds, and stores a 32-bit quantity to memory and, assuming alignment with respect to the 64-bit bus, actually runs 3 cycles faster. Again, this is a hypothetical case, and don't expect to see this technology become widespread in the 65xx community soon. All I'm saying is that it's relatively easily doable if you truly compare apples to apples.
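As a thought experiment, the pattern-recognition half of such a fuser can be sketched as a peephole pass over a decoded instruction stream (hypothetical Python; the pattern set is just the three sequences named above, and real hardware would do this in the decoder, not in software):

```python
# Hypothetical peephole "fuser" over (opcode, operand) pairs.
# Patterns are the ones called out in the post: clc/lda/adc, adc/sta, sta/lda.
FUSABLE = {("CLC", "LDA", "ADC"), ("ADC", "STA"), ("STA", "LDA")}

def fuse(stream):
    """Greedily collapse known patterns into single macro-ops, longest first."""
    out, i = [], 0
    while i < len(stream):
        for n in (3, 2):
            pat = tuple(op for op, _ in stream[i:i + n])
            if len(pat) == n and pat in FUSABLE:
                out.append(("+".join(pat), [arg for _, arg in stream[i:i + n]]))
                i += n
                break
        else:
            out.append(stream[i])
            i += 1
    return out

code = [("CLC", None), ("LDA", "addend1L"), ("ADC", "addend2L"), ("STA", "resultL")]
assert [op for op, _ in fuse(code)] == ["CLC+LDA+ADC", "STA"]
```

A greedy longest-match pass like this is only one possible policy; a real decoder would also have to respect interrupt boundaries and flag visibility between the fused instructions.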
Re: 6502 redundant, missed, and suggested features
[placeholder - with luck, we can discuss points raised by Sam's post over here.]
Re: 6502 redundant, missed, and suggested features
kc5tja wrote:
As far as proof of 6502's performance relative to other CPUs with a normalized clock, all you need to do is count the cycles in your particular application. Yes, yes, the 68000 can pull off 2 MIPS performance at 8MHz, but know what? A 65816 at 8MHz will pull off 4 MIPS on average. At this point, the 65816 and 68000 compete head to head, with the former on average attaining close to 80% the performance of the 68000, despite having only an 8-bit wide bus. Proof: Sit an Apple IIgs on a table next to a classic Macintosh. Run a paint program on both of them. (Remember, Mac is monochrome, while IIgs is 16-color. While the Mac has more pixels, the IIgs actually has more bits to push around total). You'd totally expect the IIgs at 2.3MHz to be slower at video updates than the 8MHz Mac; however, this is not the case. Grab a brush from some picture and drag it around the screen. The Mac will have very noticeable rip and tear, while the IIgs will appear to be about as fast as a Commodore-Amiga using its blitter to draw.
As a final analysis, let's normalize bus widths and optimize our microarchitectures too (in fact, we have an existence proof: the 68008), and what you'll find is that the 68000 is abysmally sluggish compared to the 65816. The only reason the 68000 gets the performance that it does is because it has a true 16-bit wide bus. Slap a 16-bit wide bus on the 65816, changing nothing else, and I'm willing to put money that the 65816 will meet or even slightly exceed the 68000.
You can also use even slower memory with the 68000; memory-bound instructions will slow down, but instructions that are not, like a series of MULs, will still run at full speed.
Re: 6502 redundant, missed, and suggested features
I agree - for historical performance comparisons, you have to normalise memory speeds. Memory was a major cost in a system as well as often being the performance limiter. It was a long time in the micro world before caches were seen - first off chip and then on chip.
The Z80 tactics, of clocking the CPU faster than memory, backfired here, because as memory sped up, the CPU would need to go uncomfortably fast to avoid being the bottleneck. Modern Z80s don't have that 4:1 ratio. Meantime, the 6502 tactics meant it was possible to run video or DMA in the half-cycles that the CPU didn't need the memory, because the memory was faster than the CPU. Later, we could have 4MHz 6502 in a system which didn't need to make the memory do double-duty. (Acorn's Master Turbo had a second 6502 at 4MHz in 1986.)
Edit: of course, different considerations apply today. You can buy WDC parts which run at 14MHz and then you have to find appropriate memory and peripherals. If you use an FPGA for your CPU you can run at 100MHz and more, you have on-chip RAM which can probably run at full speed but it's synchronous. To use off chip RAM you're back in the territory of choosing between simple SRAM which runs slower than your CPU, or something which runs faster but is best used for short sequential transfers.
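The half-cycle sharing argument boils down to simple arithmetic: the memory must complete a full access within half a CPU cycle for video to use the other half. A hypothetical back-of-envelope check in Python (this ignores address setup and hold times, so real margins are tighter):

```python
# Hypothetical feasibility check for phi1/phi2 bus sharing between CPU and video.
def can_share_bus(cpu_mhz, mem_access_ns):
    """True if memory completes an access within half a CPU clock cycle."""
    half_cycle_ns = 1000.0 / cpu_mhz / 2.0
    return mem_access_ns <= half_cycle_ns

assert can_share_bus(1.0, 450)      # 1 MHz 6502 + 450 ns RAM: half cycle is 500 ns
assert not can_share_bus(4.0, 450)  # a 4 MHz part leaves only 125 ns per half cycle
```

Which is exactly why the faster parts mentioned above stopped making the memory do double duty.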