All times are UTC
 Post subject: Re: M65C02A Core
PostPosted: Sat Aug 01, 2015 9:29 pm 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I recently needed to create a project using the T65 6502 core, and I wanted to ensure that I had a recent version of the core which passed Klaus' 6502 functional test suite. (Note: I did not allow the functional tests to execute the BCD and Binary arithmetic. Instead, I artificially looped the functional test back to itself after all other tests had been completed.) I was curious to see if the M65C02A core would demonstrate its memory cycle efficiency when running Klaus' test program. When benchmarking the M65C02A core with figForth 1.0a, it appeared that the M65C02A was roughly 40% faster.

The test bench runs the functional test simultaneously using both the T65 core and the M65C02A core. For the tests performed, the M65C02A requires 74687 cycles to execute 50484 instructions, and the T65 requires 119638 cycles to execute the same number of instructions. Thus, for the M65C02A, the CPI is 1.4794, and for the T65, the CPI is 2.3698. The CPI difference means that the M65C02A needs approximately 37.57% fewer cycles than the T65 on this application, i.e. it runs about 1.60 times as fast at the same clock rate. This result is similar to the performance advantage noted on the figForth 1.0a Sieve of Eratosthenes benchmark. (Note: the T65 core is a cycle accurate model of the 6502 microprocessor, while the M65C02A is not, and the M65C02A does not implement any of the undocumented 6502 opcodes sometimes used by games and copy protection schemes.) These results lend additional credence to the improved performance claimed for the 65CE02 as a result of the elimination of dead/redundant memory cycles.
Attachment:
File comment: Comparison of Execution Time of T65 and M65C02A cores on Klaus Dormann's 6502 functional test program.
M65C02A_T65_FunctionalTestExecutionTime.JPG
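The CPI figures quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the cycle and instruction counts from the post; the helper name `cpi` is just for illustration. It also prints the throughput ratio, since a relative reduction in CPI and a speedup factor are two different ways of stating the same result.

```python
# Cycle and instruction counts quoted in the post.
INSTRUCTIONS = 50484
CYCLES = {"M65C02A": 74687, "T65": 119638}


def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    return cycles / instructions


cpi_m65 = cpi(CYCLES["M65C02A"], INSTRUCTIONS)
cpi_t65 = cpi(CYCLES["T65"], INSTRUCTIONS)

# Relative reduction in cycles (the 37.57% figure) and the equivalent
# throughput ratio at the same clock rate.
cycle_reduction = (cpi_t65 - cpi_m65) / cpi_t65
speedup = cpi_t65 / cpi_m65

print(f"M65C02A CPI: {cpi_m65:.4f}")              # 1.4794
print(f"T65 CPI:     {cpi_t65:.4f}")              # 2.3698
print(f"Fewer cycles: {cycle_reduction:.2%}")     # 37.57%
print(f"Speedup:      {speedup:.2f}x")            # 1.60x
```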

Edit: [MichaelM] The original post listed the 65SC02 as the processor variant with a reduced number of cycles per instruction. Dr. Jefyll pointed out the error, and I've corrected the reference above and provided a link to the post in the Forth topic to which the above text was intended to refer.

_________________
Michael A.


Last edited by MichaelM on Sun Nov 15, 2015 5:35 pm, edited 2 times in total.

Post subject: Re: M65C02A Core
PostPosted: Sun Aug 02, 2015 9:28 pm
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Work continues to verify all of the implemented instructions for the M65C02A core.

Below is an instruction sequence which demonstrates the Kernel/User mode of the M65C02A core. Bit 5 of the PSW has been designated as the Mode bit (M). By default the mode bit is set after reset, all interrupts, and traps/exceptions. The kernel mode stack is used to push the user mode PC and PSW. Only an RTI instruction may modify the mode bit, and only after the PC has been pulled from the kernel mode stack.

The first marker indicates the start of the code sequence used to push the address of the user mode routine and PSW. The middle marker designates the point in the execution stream where the transition is made from the kernel mode to the user mode. Leading up to the transition, it can be seen that the stack pointer S (Sk) starts at 0x01EF and decrements as the three bytes are pushed onto the kernel mode stack. When the RTI instruction is executed, the values just pushed are pulled and Sk increments for each pull. After the last pull, when SYNC is asserted at the middle marker, the core is operating in the user mode, and the change in the stack pointer can be noted by a change in the value shown on the screen: from 0x01EF to 0x01FF. The first instruction in the user mode is a call (JSR) to a single instruction (RTS) user mode subroutine. After returning from the subroutine, the user mode stack pointer (Su) is transferred to X and compared with 0xFF. Since the test bench stops at an even address, the comparison passed.
Attachment:
File comment: M65C02A Core Mode Switch from Kernel Mode to User Mode
M65C02A_ModeSwitch_To_UserMode.JPG

The following is a hand assembly of the test routine described above.
Code:
F25A: F460F2                phw #ModeSwitch     ; push 16-bit address onto kernel mode stack
F25D: A9C7                  lda #$C7            ; NVIZC set, user mode selected
F25F: 48                    pha                 ; PSW for user mode routine
F260: 40        ModeSwitch: rti                 ; switch mode to user mode
F261: 2003F0                jsr $F003           ; call user mode subroutine
F264: BA                    tsx                 ; read user mode stack pointer
F265: E0FF                  cpx #$FF            ; Su should be $FF after return
F267: F001                  beq *+3             ; skip if equal, else stop
F269: DB                    stp                 ; fail: stops at an odd address
;
F26A: DB                    stp                 ; pass: stops at an even address
If the X register needs to be preserved, the following instruction can be used to test the stack pointer directly:
Code:
F264: 8BE0FF                cps #$FF            ; Su should be $FF after return
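As a rough illustration of the stack pointer values described for the mode switch, here is a toy Python model. It is only a sketch under stated assumptions, not the core's RTL: the class and method names are hypothetical, both stacks are assumed to live in page 1 (as the 0x01EF and 0x01FF values suggest), and the word pushed is simply the operand bytes of the `phw` in the listing.

```python
MODE_BIT = 0x20  # PSW bit 5 (M): set = kernel mode, clear = user mode


class TwoStackModel:
    """Toy model of separate kernel (Sk) and user (Su) stack pointers."""

    def __init__(self, sk: int = 0xEF, su: int = 0xFF) -> None:
        self.mem = bytearray(0x10000)
        self.sp = {"kernel": sk, "user": su}  # low bytes; page 1 assumed
        self.psw = MODE_BIT                   # mode bit is set after reset
        self.pc = 0x0000

    @property
    def mode(self) -> str:
        return "kernel" if self.psw & MODE_BIT else "user"

    def push(self, value: int) -> None:
        mode = self.mode
        self.mem[0x0100 | self.sp[mode]] = value & 0xFF
        self.sp[mode] = (self.sp[mode] - 1) & 0xFF

    def pull(self) -> int:
        mode = self.mode
        self.sp[mode] = (self.sp[mode] + 1) & 0xFF
        return self.mem[0x0100 | self.sp[mode]]

    def push_word(self, word: int) -> None:
        self.push(word >> 8)  # high byte first, so the low byte is pulled first
        self.push(word)

    def rti(self) -> None:
        new_psw = self.pull()  # all three pulls use the current (kernel) stack
        lo = self.pull()
        hi = self.pull()
        self.pc = (hi << 8) | lo
        self.psw = new_psw     # mode bit changes only after the PC is pulled


cpu = TwoStackModel()
cpu.push_word(0xF260)          # operand of the phw in the listing
cpu.push(0xC7)                 # PSW with the mode bit clear
sk_after_pushes = cpu.sp["kernel"]
cpu.rti()
print(hex(sk_after_pushes), cpu.mode, hex(cpu.sp["kernel"]), hex(cpu.sp["user"]))
# 0xec user 0xef 0xff
```

In this model a JSR/RTS pair executed after the switch only moves `sp["user"]`, which is why the test compares Su with $FF after the subroutine returns.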

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Mon Aug 03, 2015 12:12 pm
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
The following diagram shows the execution of a BRK instruction. The key takeaway is that the operating mode changes after the PC and PSW have been pushed onto the kernel mode stack: the M bit is set (PSW[5]), the D flag is cleared, and the I flag is set. The IRQ/BRK routine executed simply saves A, checks that P has D clear and I set, and then returns. After the RTI, the mode bit is clear, which means that the processor has returned to the user mode; also note that S is set back to 0x01FF as it was before the BRK instruction.
Attachment:
File comment: M65C02A Mode Switch from User to Kernel Mode Using BRK.
M65C02A_ModeSwitch_To_KernelMode.JPG
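The flag changes described for BRK can be written as a small bit-mask operation. This is only an illustrative sketch: the function name and the example PSW value are hypothetical, and the I and D bit positions are the standard 6502 ones.

```python
FLAG_I = 0x04  # interrupt disable (standard 6502 position)
FLAG_D = 0x08  # decimal mode (standard 6502 position)
FLAG_M = 0x20  # M65C02A mode bit, PSW[5]: set = kernel mode


def psw_after_brk(psw: int) -> int:
    """PSW seen by the handler: M set, I set, D cleared."""
    return (psw | FLAG_M | FLAG_I) & ~FLAG_D & 0xFF


user_psw = 0x09  # hypothetical user mode PSW with D and C set
handler_psw = psw_after_brk(user_psw)
print(hex(handler_psw))  # 0x25
```

The PSW pushed on the kernel mode stack still has the mode bit clear, so the RTI at the end of the handler returns the core to user mode.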

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Sun Jan 03, 2016 1:27 am
Joined: Mon Oct 12, 2015 5:19 pm
Posts: 255
Quote:
BigEd wrote:
These days, we'd consider an instruction cache or small instruction prefetch buffer - which would be ideal if indeed the CPU is in a tight polling loop during the DMA.

I've been thinking about using a 16-bit wide data bus to RAM/ROM, multiplexing it down to 8 bits for the CPU, and then having instruction fetches run through a two-byte single-line cache. With this, a little over half of your instruction-fetch cycles no longer need to hit the memory bus, leaving them free for DMA or whatever. And for those crazy self-modifying-code people, you could have it snoop memory write cycles.

Doing the same thing with a write-through data cache is probably overkill.


Found my way into this thread, in a roundabout fashion. Curious you wrote this on my birthday, but I had not joined your forum yet! I'll take the gift, anyway.

Anyhoo. I am desirous of solutions like this. I was shopping around on digikey, seeing what exists, and I, in my naivete bordering on cluelessness, thought something like this (or, maybe I TOTALLY misunderstand you! Very likely, so!). I saw a listing for a "dual port SRAM" (I think?) and thought that could be useful (I must make a mental note to read the pdf). Could my I/O device (a VGA image sensor), pass 8 bits, repeatedly, to one port, until the chip is half-full, and still leave the processor the other half (the other port) without any processor halt (as DMA would require)? Or doesn't it work like that?

I guess that's why it's called HARD ware!


Post subject: Re: M65C02A Core
PostPosted: Sun Jan 03, 2016 2:24 pm
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
randallmeyer2000:

randallmeyer2000 wrote:
Anyhoo. I am desirous of solutions like this. I was shopping around on digikey, seeing what exists, and I, in my naivete bordering on cluelessness, thought something like this (or, maybe I TOTALLY misunderstand you! Very likely, so!). I saw a listing for a "dual port SRAM" (I think?) and thought that could be useful (I must make a mental note to read the pdf). Could my I/O device (a VGA image sensor), pass 8 bits, repeatedly, to one port, until the chip is half-full, and still leave the processor the other half (the other port) without any processor halt (as DMA would require)? Or doesn't it work like that?

Quick answer is yes. One thing to keep in mind when using dual-port SRAM is that presenting the same address simultaneously to both ports raises a potential issue: if one port is writing and the other reading, the data may not be written correctly, and the read may return neither the previous value nor the new value being written. This behavior is related to the design of the dual-port mechanism in the IC itself.

Therefore, a carefully constructed mechanism for swapping access to each section of the dual-port SRAM will make your life easier. Other than that, dual-port SRAMs are a great tool for the problem you are attempting to solve. Good luck on your project.

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Sun Jan 03, 2016 4:05 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(Probably best to continue the dual-port SRAM conversation in a new thread, or this one: viewtopic.php?p=42914#p42914)


Post subject: Re: M65C02A Core
PostPosted: Thu Jan 14, 2016 6:14 pm
Joined: Mon Oct 12, 2015 5:19 pm
Posts: 255
Thanks MichaelM,

as Ed said, you can hop over to the other/(my?) thread, but you'll find more questions, there, than answers. This is a big world (digital world/IC world), and I know very little about the things I am getting into!

But a short response might be warranted here (despite being OT to the M65C02A).

I was reading the wiki for FIFO structures https://en.wikipedia.org/wiki/FIFO_%28computing_and_electronics%29#Electronics and saw that they mention special methods for using pointers and flags to indicate "full" and "half full" and such.

I think --and definitely not knowing--that with a constantly writing thing--like an image sensor-- that a simple counter (i.e. counter IC) of clock cycles (pclk, I think my image sensor calls the signal; but vsync or hsync signals could be used too) could count to a certain number (approx half full) and then indicate time to switch. Thus, every SRAM could have an A and a B section, and the microproc. never talks to the side that the image sensor "talks" to?

Might be a "fancy" address decoder, though? I haven't thought it all through, just yet!


Post subject: Re: M65C02A Core
PostPosted: Thu Jan 14, 2016 6:17 pm
Joined: Mon Oct 12, 2015 5:19 pm
Posts: 255
Can I call it a new invention? "Pseudo-Bank Switching"? hahahah! Or maybe acronym it? Pseudo-Bank Switching with Dual Port SRAM? PBSDPSRAM? hahaha. The mind of a child (i.e. me, little old naive me!) is an amazing thing! (again, sorry OT; I'll refrain in the future.).


Post subject: Re: M65C02A Core
PostPosted: Fri Jan 15, 2016 3:11 am
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
randallmeyer2000 wrote:
I think --and definitely not knowing--that with a constantly writing thing--like an image sensor-- that a simple counter (i.e. counter IC) of clock cycles (pclk, I think my image sensor calls the signal; but vsync or hsync signals could be used too) could count to a certain number (approx half full) and then indicate time to switch. Thus, every SRAM could have an A and a B section, and the microproc. never talks to the side that the image sensor "talks" to?

This is an appropriate way to implement the DPRAM interface you asked about earlier. The most significant address (and its complement) into your DPRAM, or some other arbitrary boundary, can be used to ensure that the microprocessor and the image sensor are not accessing the same memory cell simultaneously.
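The address-bit idea can be sketched in a few lines of Python. This is only a behavioral illustration, not a hardware design: the class and method names are hypothetical, and the tiny 3-bit address width is chosen just to keep the example readable. A counter advanced by the pixel clock supplies the low address bits, and a bank flip-flop supplies the most significant bit for the sensor port and its complement for the processor port.

```python
class PingPongAddresser:
    """Sensor writes one half of the DPRAM while the CPU reads the other."""

    def __init__(self, addr_bits: int = 11) -> None:
        self.half = 1 << (addr_bits - 1)  # locations per half
        self.count = 0                    # pixel counter (low address bits)
        self.bank = 0                     # MSB currently owned by the sensor

    def sensor_write_address(self) -> int:
        """Address for this pixel clock; swaps halves when one fills up."""
        addr = self.bank * self.half + self.count
        self.count += 1
        if self.count == self.half:
            self.count = 0
            self.bank ^= 1                # time to switch sections
        return addr

    def cpu_address(self, offset: int) -> int:
        """CPU always sees the complement of the sensor's bank bit."""
        return (self.bank ^ 1) * self.half + (offset % self.half)


buf = PingPongAddresser(addr_bits=3)       # 8 locations, 4 per half
first_half = [buf.sensor_write_address() for _ in range(4)]
cpu_view = [buf.cpu_address(i) for i in range(4)]
next_write = buf.sensor_write_address()
print(first_half, cpu_view, next_write)    # [0, 1, 2, 3] [0, 1, 2, 3] 4
```

Because the two ports always differ in the most significant address bit, they can never present the same address at the same time.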

You may want to consider using a FIFO, since it appears that you want to stream the data from the image sensor to the microprocessor. Although FIFOs are generally a bit more expensive than DPRAMs (depending on the size/depth), I find them easier to use in applications such as yours. One way to implement a FIFO is to use discrete counters and flip-flops for the read and write pointers and the empty, half full, and full flags, and to use a DPRAM as the storage element.
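A behavioral sketch of such a FIFO, written in Python rather than as discrete logic, might look like the following. The names are hypothetical, and the flags are derived from an occupancy counter, which is one common way to do it but not the only one.

```python
from typing import Optional


class DpramFifo:
    """FIFO built from a storage array plus read/write pointers and flags."""

    def __init__(self, depth: int = 8) -> None:
        self.ram = [0] * depth   # stands in for the DPRAM storage element
        self.depth = depth
        self.wr = 0              # write pointer (sensor side counter)
        self.rd = 0              # read pointer (processor side counter)
        self.count = 0           # occupancy, used to derive the flags

    @property
    def empty(self) -> bool:
        return self.count == 0

    @property
    def half_full(self) -> bool:
        return self.count >= self.depth // 2

    @property
    def full(self) -> bool:
        return self.count == self.depth

    def write(self, value: int) -> bool:
        """Store one value; returns False (value dropped) when full."""
        if self.full:
            return False
        self.ram[self.wr] = value
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def read(self) -> Optional[int]:
        """Fetch the oldest value; returns None when empty."""
        if self.empty:
            return None
        value = self.ram[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return value


fifo = DpramFifo(depth=8)
for pixel in range(4):
    fifo.write(pixel)
print(fifo.half_full, fifo.full, fifo.read())  # True False 0
```

The half full flag is the natural signal for interrupting the microprocessor so that it can drain the FIFO while the sensor keeps writing.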

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Mon Jan 18, 2016 12:55 am
Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
I noticed the CPI is mentioned as 1.4x. But I thought a micro-coded machine needed multiple clock cycles to execute a single machine cycle. Or does the micro-code machine work in parallel with the rest of the logic ?

_________________
http://www.finitron.ca


Post subject: Re: M65C02A Core
PostPosted: Mon Jan 18, 2016 1:27 pm
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I think it is a matter of choice. The architecture for the microprogram control structure that I've chosen to implement for these cores is tied to the nature of the 6502 memory cycle. In addition, I've included multiple functional units and pipelining of the microprogram and the ALU to maximize single cycle performance. If you're interested, I've provided a more comprehensive description of how the cores behave below.

With the first core, I chose to use multiple clock cycles per microcycle to improve the speed by using pipeline registers to break the combinational signal paths into shorter delays. With the second core, I chose to optimize the address path and ALU paths through logic optimization first, and to postpone the optimization of the combinational path delays with pipeline registers until the project is complete.

In both implementations, the microprogram structure was tied to the 6502 memory cycle. In other words, the objective was that the microprogram would only require as many memory cycles as any instruction would need to execute. This means that my cores do not have any dead memory cycles as is typical of the 6502/65C02 microprocessors and other cycle accurate implementations. (Caveat: the extended instructions of the M65C02A include two instructions with dead memory cycles.)

For single byte instructions, there are two categories: (1) read-only, and (2) write-only. Instructions in the first category do not write to memory, and instructions in the second category do. Instructions in the first category execute in a single memory cycle, which consists solely of the instruction fetch cycle. Instructions in the second category execute in two memory cycles: (1) instruction fetch, and (2) memory write.

To implement this behavior I've combined the fetch cycle with the instruction decode cycle. The microprogram is implemented using the synchronous Block RAM of the FPGA. Thus, the microprogram is pipelined by one cycle. The ALU registers are also synchronous, so that the ALU is also pipelined. Altogether, these cores demonstrate a two-cycle pipeline.

There are two critical signal paths in these cores: (1) address output; and (2) ALU data output. The address that the core's address generator computes is always the address of the next memory location. To put a write cycle immediately following a read cycle requires that the address and the ALU data output be computed/processed by independent units. In other words, it is not possible to provide the performance without increasing the number of adders used in the core; sharing of the ALU with the address generator is not allowed.

The delays in these two paths establish the basic performance limit of either core, but especially for the M65C02A core. To maintain compatibility with the 6502/65C02 memory cycle is difficult because the delay in each of these two signal paths is nearly equal to half of the clock period. This means that on the rising edge of Phi2, i.e. the falling edge of the core's clock, the address and control signals must be valid. Including an MMU, albeit a simple mapper, in the address output signal path increased the path delay and lowered the maximum operating frequency as expected. Adding the additional stack register to support user and kernel modes also marginally increased the address output path delays, but I traded that off against the potential provided by the two operating modes. In a similar fashion, the additional registers that I've included by implementing each standard register as a 3-deep push-down stack increased the ALU data output signal path delay.

To implement single cycle behavior with the on-chip Block RAMs, I've resorted to using both edges of the clock. In particular, the program/data BRAMs are clocked on the falling edge of the clock. This ensures that Xilinx PAR will track the address path delays so that the BRAMs can return valid instructions and data on the following rising edge. Similarly, the register write enable signals and the BCD adder in the ALU output data path also operate on the falling edge of the clock.

I've accepted these limitations to the achievable performance of the M65C02A core because I'm more interested in seeing how complex a core I can develop using the flexibility offered by microprogramming the control functions, and keeping my reference implementation within the resource limitations of my target FPGA: Xilinx Spartan 3A XC3S200A-4VQ100I. Performance naturally improves if the cores are targeted to the newer Spartan 6 and Virtex 6/7 FPGAs, but those architectures are not currently my targets. I periodically check the core against the other FPGA architectures and am encouraged that the core ports, and its performance improves, by simply re-synthesizing the source and placing and routing the resulting netlist on the newer FPGA architecture.

I've begun some work on how to improve the overall performance of the M65C02A core using some Simultaneous Multi-Threading (SMT) techniques. I've been able to reformulate the address generator to support 16 (or more) simultaneous threads with as few as 3 or 4 pipeline stages. The resulting improvement in operating speed is very encouraging. The present implementation of the M65C02A core is restricted to 24 MHz (in the target FPGA defined above), and it's primarily limited by the combinational delays in the address output signal path. Using SMT techniques, the clock rate of the address generator has been raised from 24 MHz to more than 150 MHz. (Actually, the rates are higher, and I'm scaling back the value based on past experience.)

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Tue Jan 19, 2016 10:23 am
Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
I've been experimenting with error correction recently, and I was wondering how difficult it would be to adapt the core to 10 bit bytes ?
(10 bits + 5 bits for error correction fits well into 16 bit word).

Quote:
To maintain compatibility with the 6502/65C02 memory cycle is difficult

How hard would it be to use another bus interface (eg. WISHBONE) ? WISHBONE requires ack pulses rather than a ready signal which kills the single cycle nature of the 6502 bus. But WISHBONE is freely available and there are a number of cores that use it.

I've been wanting to experiment with a micro-coded machine for a while, and it looks like the M65C02A core might be a place to start.
Do you have a tool to convert text to micro-code instructions ?

_________________
http://www.finitron.ca


Post subject: Re: M65C02A Core
PostPosted: Tue Jan 19, 2016 12:49 pm
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Rob Finch wrote:
I've been experimenting with error correction recently, and I was wondering how difficult it would be to adapt the core to 10 bit bytes ?
(10 bits + 5 bits for error correction fits well into 16 bit word).

Interesting. I've been thinking/planning along the same lines. There should be no restrictions to including error detection and correction logic with the core's functions. Although the core is designed for 8/16 bit operations, the memory interface is implemented using the standard 8-bit byte. Extending that interface to 10 or more bits should not be an issue.
Rob Finch wrote:
How hard would it be to use another bus interface (eg. WISHBONE) ? WISHBONE requires ack pulses rather than a ready signal which kills the single cycle nature of the 6502 bus. But WISHBONE is freely available and there are a number of cores that use it.

There's no reason that WISHBONE couldn't be used. As you say, the ACK responses destroy the single cycle nature of the 6502 memory cycle, but WISHBONE offers some improvements to the 6502 memory cycle that can be used to advantage. The ACK response can be mapped to an internal ready signal, and that should enable an easy port to the WISHBONE bus.
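To make the ACK-to-ready mapping concrete, here is a deliberately simple cycle-counting sketch in Python. It is a toy model and not an implementation of the WISHBONE specification: the function name and the per-access delay list are hypothetical, and each access is assumed to hold its request until the slave acknowledges.

```python
from typing import Iterable


def clocks_for_accesses(ack_delays: Iterable[int]) -> int:
    """Count clocks when the core's internal ready is driven by ACK.

    Each entry is the number of wait clocks before the slave asserts ACK
    for that access (0 = acknowledged in the first clock).
    """
    clocks = 0
    for delay in ack_delays:
        waited = 0
        while True:
            clocks += 1
            ack = waited >= delay   # slave response for this clock
            rdy = ack               # ACK mapped to the internal ready signal
            if rdy:
                break               # core advances to the next memory cycle
            waited += 1             # core holds address/data and waits
    return clocks


print(clocks_for_accesses([0, 0, 2]))  # 5
```

With slaves that acknowledge immediately, the count equals the number of memory cycles, so the single cycle behavior is only lost on accesses that actually insert wait states.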
Rob Finch wrote:
Do you have a tool to convert text to micro-code instructions ?
I have a simple (Windows) tool available on GitHub that can be used for converting microcode text files into Xilinx memory initialization files. Alternatively, it can also produce VHDL compatible ROM structures that you can drop into your code. (It can produce Verilog compatible ROM structures, but I've never focused on that because my teams use VHDL. I use/prefer Verilog and I definitely prefer to use the Verilog memory initialization methodology. Thus, the Verilog output file can be used in most designs with a few edits, most of which can be automated with editor macros.)

I am still, unfortunately, working on the M65C02A core. It's not presently on GitHub. A prior, incomplete release, which may be suitable for your initial foray into microprogramming, can be found on GitHub. It will differ in a few minor areas, but it will fully implement the W65C02S instruction set, and it has most of the M65C02A prefix instructions implemented. Not present are the single/block MOV instruction, the accumulator/memory exchange instruction, the FORTH VM, the IP relative with auto-increment instructions, the Kernel/User modes, and the co-processor interface.

_________________
Michael A.


Post subject: Re: M65C02A Core
PostPosted: Tue Jan 19, 2016 1:48 pm
Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
One thing I noticed in the MPC was that, I think, the MPC program counter only needs the three least significant bits to increment. All the micro-code, with the exception of the interrupt / break code, which is a little longer, is done in four or fewer lines. If the code were aligned on four line boundaries then the program counter doesn't need all the bits to increment. (I'm looking at the M65C02 code). This would leave the micro-code ROM more sparsely populated.

For the 651020 core with a 10 bit opcode there'd be a 1024 way branch, requiring a 2048 entry micro-code table. Plus the code would have to be 36 bits for the extra addressing. (4 block RAMs).
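The block RAM count quoted above follows from a simple capacity estimate. The sketch below assumes 18 Kbit block RAMs (an assumption about the target FPGA family, with parity bits counted as usable width) and ignores the fixed aspect ratios a real device imposes; the function name is hypothetical.

```python
import math

BRAM_BITS = 18 * 1024  # assumption: 18 Kbit block RAM, parity bits included


def brams_needed(entries: int, width_bits: int, bram_bits: int = BRAM_BITS) -> int:
    """Capacity-only estimate of the block RAMs needed for a microcode table."""
    return math.ceil(entries * width_bits / bram_bits)


print(brams_needed(2048, 36))  # 4
```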

_________________
http://www.finitron.ca


Post subject: Re: M65C02A Core
PostPosted: Wed Jan 20, 2016 2:28 am
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I think that your observation about the length of the control sequences may be used to optimize the microprogram counter. Although possible, I would not recommend it. I think that you've demonstrated a propensity to include some complex functions/operations in your processors that will likely require more than 6-8 cycles to complete. Thus, I think that you'll soon be increasing the width of the uPC counter because you've decided to implement a 65816 with 16 task registers with a cache and TLB, and need to automatically push all of this state information onto one of the stacks. :D

There were a number of things that I did with the M65C02 core that were not necessarily great things to do. First, I decided not to use any microsubroutines. Why? Just because I wanted to see if the addressing modes could be implemented as short linear sequences like those I expect the PLA-based controller of the original used. Second, I decided not to use any bits from the instruction decode ROM to select the index register usage in the address generator. Third, I decided that I would not use more than 2 BRAMs for my microprogram, and that they would be split into an instruction decode ROM and a sequence control ROM.

One recent change that I've made is to add the multi-way inputs to the uPC instead of substituting them into the least significant bits. This has allowed me to place jump tables on any boundary, and greatly improved the microprogram structure.
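The difference between the two multi-way branch schemes can be seen in a short Python sketch. The function names and the example addresses are hypothetical and only illustrate the arithmetic, not the core's actual microprogram layout.

```python
from typing import List


def target_substitute(base: int, index: int, index_bits: int) -> int:
    """Multi-way branch by replacing the low bits of the base address."""
    mask = (1 << index_bits) - 1
    return (base & ~mask) | (index & mask)


def target_add(base: int, index: int) -> int:
    """Multi-way branch by adding the multi-way input to the base address."""
    return base + index


def jump_table(base: int, ways: int, add: bool) -> List[int]:
    bits = (ways - 1).bit_length()
    if add:
        return [target_add(base, i) for i in range(ways)]
    return [target_substitute(base, i, bits) for i in range(ways)]


# A 4-way table that the microprogram would like to place at 0x135.
print([hex(a) for a in jump_table(0x135, 4, add=False)])
# ['0x134', '0x135', '0x136', '0x137']
print([hex(a) for a in jump_table(0x135, 4, add=True)])
# ['0x135', '0x136', '0x137', '0x138']
```

With substitution, the table only lands where intended if the base is aligned to the table size; with addition, any base address works, at the cost of an adder in the next-address path.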

_________________
Michael A.

