Improving the 6502, some ideas
ElEctric_EyE wrote:
I would like to see more speed. Even if no new instructions or add-ons. Just RAM/EEprom with the CPU on a single die, and 100+ MHz.
Once you look at the capability of the devices though, you might wonder why you're building a 6502 with them! A CPLD seems to me a better fit, but I don't believe you'd get the memory on-chip, and I'm not sure about 100MHz.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
I was thinking it might be good if LDA toggled between A and B. As in:
lda #1 ; A = 1, B = ?
lda #2 ; A = 2, B = 1
lda #3 ; A = 3, B = 2
The advantage is that you don't need an LDB instruction.
I guess I can see some usefulness, but OTOH, an LDB would be the same as LDx where "x" is any of the other many registers, so it doesn't really add any complexity. "B" just has a different bit pattern in the three or four bits of the op code that indicate which register, probably %0001. (A is %0000, C is %0010, X is %0011, Y is %0100, etc.)
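The register-field idea is easy to picture in code. Here's a minimal C sketch of decoding such a field; the field position (bits 4..7 of the opcode) and the enum values are assumptions made purely for illustration, not any real 65xx encoding:

```c
#include <stdint.h>

/* Hypothetical register encoding per the post above: A=%0000, B=%0001,
   C=%0010, X=%0011, Y=%0100.  Placing the field in bits 4..7 of the
   opcode is an assumption made purely for illustration. */
enum reg { REG_A = 0x0, REG_B = 0x1, REG_C = 0x2, REG_X = 0x3, REG_Y = 0x4 };

unsigned decode_reg(uint8_t opcode)
{
    return (opcode >> 4) & 0x0F;   /* extract the 4-bit register field */
}
```

The point is simply that "LDB" costs the decoder nothing extra: the same field select serves every register.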
Quote:
An on-chip register stack? See the transputer and Forth Chips
Thanks for the link. Yes, I have definitely been interested in that. I should spend some time there and see if any of the 32-bit ones are currently in production, or explore the ones with VHDL source available. I kept an ad for a VME board from 20 years ago with a Harris RTX-2000, and another for a board with an SC32. As I remember, the first one did 16 Forth MIPS at 12MHz, with a maximum 4-clock interrupt latency and 0-clock RTI. Years before that, there was also a 4MHz MicroSpeed board for the Apple II which ran Forth in hardware and let the Naval labs simulate landing an aircraft on a carrier in real time. The software that came with it fit on a single floppy too. Very impressive performance! I would also like to pursue the idea of a 32-bit 6502 though.
Quote:
Garth: perhaps a new forum for the 65Org32 on here? That way, the various subjects can have their own threads, such as basic architecture, real world interfacing, opcodes, VHDL design, etc.
That's going further than I was thinking. Good idea. I should talk to Mike about that. Even if we don't pull this one off, the subject will occasionally keep coming up.
Quote:
Memory controller: maybe a small controller circuit portion that can be updated when required?
As in programmable logic, or a small board of some sort? It would be nice if we could just use commercially available DRAM modules (2GB anyone?), if they don't force us to use bursts at the clock rates that hobbyists can build for. If they require sub-nanosecond rise time in and out though, we won't be able to use them.
Quote:
PROBLEM: nearly ALL graphics chips shift MSB first, not LSB, and are big-endian, since it yields the simplest possible logical mapping of bits to pixels on the screen. However, the CPU's shift/rotate instructions assume the opposite bit and byte ordering -- thus, using ROR/ROL/LSL/ASR on a word-sized piece of byte-sized bitmap data will result in garbage graphics!!
Although this had crossed my mind earlier, I was hoping to not have to go to this complexity: How about having a few instructions to change byte order. That would accommodate Tony's byte-packing and -unpacking scenario too.
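What such byte-order (and, for the graphics case, bit-order) instructions would compute can be sketched in C; this is only a sketch of the operation, not a proposal for actual opcodes:

```c
#include <stdint.h>

/* What a "swap byte order" instruction would compute on a 32-bit word.
   A sketch of the operation only, not a real 65xx opcode. */
uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) <<  8) |
           ((w & 0x00FF0000u) >>  8) |
           ((w & 0xFF000000u) >> 24);
}

/* Reversing the bits of a byte is what the MSB-first graphics shifting
   really wants; again, just a sketch. */
uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4));
    b = (uint8_t)(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2));
    b = (uint8_t)(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1));
    return b;
}
```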
Quote:
Even with multitasking, the image stays fixed in RAM.
I'm not sure I'm understanding your whole paragraph (or pair of them) there, but what I had in mind is what my HP-71 hand-held computer does. You can have scores of program files (as long as they're not Forth!) and one of them can even use and then delete another one at a lower memory location and everything gets scooted down in memory so all the available memory is together at the end instead of ending up fragmented, and the program continues running-- just at a different address.
Quote:
Far more important to me are instructions that add (and/or subtract) immediate values to X, Y, and S.
Without this facility, even implementations of Forth spend unnecessary amounts of time tweaking the registers instead of executing the user's program.
If I weren't advocating an all-32-bit design, I'd want double-increment and double-decrement instructions, since the '02 and '816 do a lot of INX INX and DEX DEX in Forth and other high-level languages. The situation is sometimes further complicated if you need to branch on a page-crossing condition, since you either have to make sure you'll start and end on even addresses or test after both INX's (or DEX's). A DINX followed by a single branch-on-rollover instruction would streamline things.
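The semantics a hypothetical DINX plus branch-on-rollover pair would give can be sketched in C on an 8-bit index; the single wrap test is what would replace checking after each of the two INXs:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical DINX semantics: bump an 8-bit index by two and report
   whether the add rolled over, so one branch replaces testing after
   each of two INXs.  Illustration only. */
bool dinx(uint8_t *x)
{
    uint8_t before = *x;
    *x = (uint8_t)(before + 2u);
    return *x < before;        /* true exactly when the add wrapped */
}
```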
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
ElEctric_EyE wrote:
I would like to see more speed. Even if no new instructions or add-ons. Just RAM/EEprom with the CPU on a single die, and 100+ MHz.
I like 6502 assembly, and the tools available for free to develop the software and program the EEPROM. If someone could develop a type of 65C02 "PIC" with different onboard RAM/EEPROM spec's I would pay at least $50 USD for a device that:
1: was a basic 65C02 device capable of addressing 16MB+, maybe even 1/2GIG of ondie RAM.
2: had EEPROM ondie space of 32K, maybe more or less...
3: major speed, 100+MHz to work basic (8 bit add, sub, asl, lsr) math functions on that large memory space, math that could be done on all registers, and also just a few more registers, besides a,x, and y would be nice...
man that would be a beefy 6502: I would call it the 65C002. mmmm, nice, I'm going to go to sleep now and dream it may be made by someone...
I might pay more, much more actually if this device existed.
GARTHWILSON wrote:
Quote:
Even with multitasking, the image stays fixed in RAM.
The 6502 lacks this feature completely, and the 65816 has lackluster support at best. How do you achieve that capability with the 6502? You cheat.
You write this in a text file FOO.A:
Code: Select all
LDA $BEEF
Code: Select all
0000 AD EF BE
You write this in a text file BAR.A:
Code: Select all
LDA myVariable
Code: Select all
0000 AD 00 00
But this raises a new problem: how does the loader know where to fix up addresses?
The assembler spits out an object file format that a linker and/or loader interprets (for this reason, loaders are often called interpreters too). This file is usually expressed as a recursive descent tree of objects, for that is the easiest way to parse these kinds of data. For example, here might be a high-level representation of such a file, expressed in a pseudo-LISP notation for ease:
Code: Select all
(object-file "BAR.O"
(cpu 6502)
(chunk "code"
(org 0)
(bytes 0xAD 0x00 0x00)
)
(xref "myVariable")
(chunk "code-fixups"
(absolute-address-at :section "code" :at 1 :to "myVariable")
)
)
Code: Select all
0000: 00 05 42 41 52 2E 4F
__ __ \_____________/
| | |
| | +------------- file name of the module (BAR.O)
| +---------------------- length of the filename
+------------------------- This byte stands for "module filename follows"
0007: 01 01 02
__ __ __
| | |
| | +------------------- Identifies the intended target as a 6502
| +---------------------- Only one byte for this section.
+------------------------- This byte stands for "CPU tag follows."
000A: 02 04 63 6F 64 65
__ __ \_________/
| | |
| | +--------------- Identifies the section name as "code"
| +---------------------- There are four letters in "code"
+------------------------- We're beginning a new section.
0010: 03 02 00 00
__ __ _____
| | |
| | +------------------ We place the "virtual" origin at $0000
| +---------------------- Two bytes for the origin
+------------------------- We're specifying an origin for this section.
0014: 04 03 AD 00 00
__ __ ________
| | |
| | +---------------- The actual data to be filled into the section when loaded.
| +---------------------- There are three bytes to load for this section.
+------------------------- Raw data follows.
0019: 05 0A "myVariable"
__ __ ____________
| | |
| | +--------------- The name of the symbol we MUST have defined
| | (courtesy of the loader, which means we must have
| | loaded another module first!) to load this module.
| +---------------------- The symbol name has 10 characters in it.
+------------------------- Reference a symbol.
0025: 02 0B "code-fixups"
__ __ _____________
| | |
| | +--------------- section name
| +----------------------- which has 11 characters in it
+-------------------------- A new section starts here!
0033: 06 04 "code" 01 00 0A "myVariable"
__ __ ______ _____ __ ____________
| | | | | |
| | | | | +-- the symbol we want to reference ...
| | | | +--------- which has 10 characters in its name.
| | | +-------------- The code at offset $0001 from the ...
| | +-------------------- "code" section, which
| +------------------------- has 4 characters in its name.
+---------------------------- Tell the loader to apply an absolute address fixup.
When you think things through, maintaining this kind of metadata about the code makes a lot of sense, because it allows your binary module loader to (a) determine how much memory to allocate for each module it's loading, (b) apply effective address computations at load-time, which in turn enables (c) assemblers, compilers and the like to not have to worry about explicitly making PC-relative code. Hence, you needn't even use BSR or its ilk (which incurs a runtime performance overhead anyway) if you don't want to.
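The fixup step the loader performs can be made concrete with a minimal C sketch of applying one absolute-address record of the kind shown in the "code-fixups" section above; the struct and function names are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* One absolute-address fixup record, as the "code-fixups" section above
   describes it: an offset into the loaded section, plus the address the
   loader resolved for the referenced symbol.  Names are invented. */
struct fixup {
    size_t   offset;     /* where in the section the operand lives */
    uint16_t resolved;   /* loader-resolved address of the symbol  */
};

/* Patch the 16-bit little-endian operand in place. */
void apply_abs_fixup(uint8_t *section, const struct fixup *f)
{
    section[f->offset]     = (uint8_t)(f->resolved & 0xFF);  /* low byte  */
    section[f->offset + 1] = (uint8_t)(f->resolved >> 8);    /* high byte */
}
```

For BAR.O, the record says: patch the "code" section at offset 1 with wherever the loader put myVariable, turning AD 00 00 into a complete absolute LDA.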
About code relocation:
You know that this has already been done for the 6502 with a relocatable file format?
See http://www.6502.org/users/andre/o65/index.html
The problem with relocating running code (at least in the 6502 where code is not relative) is not that the relocation tables would have to be stored - but address values sneak into variables (fixed values like jump tables are in the relocation tables too) or into the stack. And it is basically impossible to determine which part of the stack is an address and what is normal data.
So if you don't have any support from the program itself (like going into a state where no address values are stored on stack or in other variables) you can only relocate at load time.
André
kc5tja wrote:
GARTHWILSON wrote:
The 6502 lacks this feature completely, and the 65816 has lackluster support at best. How do you achieve that capability with the 6502? You cheat.
...
How does the assembler know where myVariable exists in RAM? It doesn't, and it never will. The loader (that which loads the code into RAM, and is fully aware that it's loading code and not just raw data) is the ultimate decider of where myVariable sits, for IT is responsible for allocating memory to hold the binary images.
But this raises a new problem: how does the loader know where to fix up addresses?
...
See http://www.6502.org/users/andre/o65/index.html
Quote:
When you think things through, maintaining this kind of metadata about the code makes a lot of sense, because it allows your binary module loader to (a) determine how much memory to allocate for each module it's loading, (b) apply effective address computations at load-time, which in turn enables (c) assemblers, compilers and the like to not have to worry about explicitly making PC-relative code. Hence, you needn't even use BSR or its ilk (which incurs a runtime performance overhead anyway) if you don't want to.
So if you don't have any support from the program itself (like going into a state where no address values are stored on stack or in other variables) you can only relocate at load time.
André
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
How does the calculator know that the code will "just run?" The CPU must use 100% PC-relative addressing modes for all effective addresses into code-space.
Thanks for the explanation on the loader. I figured André's related system had to be doing something like that. [Edit: as he just mentioned in his last post]
My interest in moving already-loaded programs around comes from the times I've loaded up different applications on the workbench computer, then, after being done with one of the earlier ones, wanting to delete it and make room for something else; but that would leave memory fragmented, without enough contiguous room to load the next thing. In that case, I've had to also dump and re-load what was after the part I was finished with.
Quote:
The problem with relocating running code (at least in the 6502 where code is not relative) is not that the relocation tables would have to be stored - but address values sneak into variables (fixed values like jump tables are in the relocation tables too) or into the stack. And it is basically impossible to determine which part of the stack is an address and what is normal data.
I had forgotten about the stack; but as long as you don't use preëmptive multitasking and relocate the program and its data space while it has references to outside programs (including the fixed OS) on the stack, it seems like it should be ok to have the OS do the move and adjust the values to put in the program and data bank registers for the particular program (those registers now being 32-bit and able to hold any value, not being limited to 64K boundary lines). Coöperative multitasking, if multitasking is required, should avoid the problem by allowing the switching of tasks only when it's safe. Interrupts would have to have some pointers that the OS would fix, requiring temporarily disabling interrupts for the move.
This is new thinking for me, so I won't be one bit offended if you point out other things I'm leaving out. I'm sure there are many of them. At risk of looking ignorant, I'll say that the term "garbage collection" sounds related. How does that work (without getting too complicated)?
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
-
Nightmaretony
- In Memoriam
- Posts: 618
- Joined: 27 Jun 2003
- Location: Meadowbrook
- Contact:
Although this had crossed my mind earlier, I was hoping to not have to go to this complexity: How about having a few instructions to change byte order. That would accommodate Tony's byte-packing and -unpacking scenario too.
At my present job, I work with VME boards. They use hardware byte swap chips. The implementation in VHDL is not that hard. Older logic, you are looking at 245s and 374s, with an intermediary bus. Definitely feasible.
Onboard stack is limiting as heck, plus taking up logic gate resources. The 6502 stack concept is helluva valid. An onboard would be more equivalent to a metric (biblical beast of burden)load of internal registers.
Also Garth, are you able to add in code projects on here? I emailed Mike with my table lookup routine from the pinball program but have not had any reply. Wanted to add to the code knowledge base
"My biggest dream in life? Building black plywood Habitrails"
GARTHWILSON wrote:
very seamless and bug-free in that respect
Quote:
My interest in moving already-loaded programs around comes from the times I've loaded up different applications on the workbench computer, then, after being done with one of the earlier ones, wanting to delete it and make room for something else; but that would leave memory fragmented, without enough contiguous room to load the next thing. In that case, I've had to also dump and re-load what was after the part I was finished with.
There are basically several ways of working around this problem:
(1) Make your program event-driven, and mandate that your software handle EVENT_BEGIN_RELOCATE and EVENT_END_RELOCATE events, both issued by the kernel of your OS. The idea being EVENT_BEGIN_RELOCATE gives your code the chance to serialize your program's state into a form suitable for later deserialization during EVENT_END_RELOCATE. So, your control flow is the OS realizes it needs to collapse memory. It issues EVENT_BEGIN_RELOCATE to an application, which then uses this time to prepare for the move. The OS then collapses memory by relocating the program's code and data segments. EVENT_END_RELOCATE is then sent to the newly relocated program, which restores its previous state, thus giving it a chance to fix up addresses.
(2) Mandate your program NEVER uses absolute addresses for internal memory references. This is the method that PC/GEOS took, and it not only worked extremely well, but it also enabled them to utilize a "vm file" (virtual memory file) system where it could provide raw application state dumps as data files. The result was a user experience wherein loading and saving application state was effectively instantaneous, and this on a 4.77MHz CPU! To access a chunk of memory, you'd first have to explicitly "lock" it into physical memory. While locked, the chunk never moved, giving the application(s) the opportunity to touch the memory as needed. After you're done using it, you "unlock" the memory, giving the system the chance to move (or even dump back to disk) the memory.
(3) Use a memory management unit to page memory and establish multiple, unique address spaces for each program. This is what nearly all modern OSes do today. All programs literally run at address 0. Only shared libraries are loaded at non-zero addresses, and even then, nobody can guarantee where in the address space they appear, even amongst different address spaces! The kernel is responsible for managing memory in finite units called "pages" (typically 4K each), and updating hardware memory mapping configurations. This is the most difficult to implement, but by far the fastest and most convenient approach from an application programmer's point of view. No PC-relative code needed (except for "floating" shared libraries, as discussed above), and no need to lock memory manually, and no need for handling relocation events.
(4) Use a bytecode interpreter. This is the slowest possible method, and by what you described, is what your HP calculators are actually doing, especially with BASIC. Bytecodes can be implemented however you want, including mandating that all code be virtual-PC-relative.
(5) Make all code and data PC-relative. This is, for example, how Win32Forth works.
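Option (3), the MMU approach, boils down to splitting every virtual address into a page number and an offset. Here's a toy single-level page-table walk in C, assuming 4K pages; real MMUs add permission bits, multi-level tables, and TLBs:

```c
#include <stdint.h>

/* Toy single-level page table: each entry maps a virtual page number to
   a physical frame number.  4K pages, per the text above.  Real MMUs
   add permission bits, multi-level tables, and TLBs. */
#define PAGE_SHIFT 12
#define PAGE_MASK  0xFFFu

uint32_t translate(const uint32_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
    uint32_t offset = vaddr & PAGE_MASK;     /* offset within page  */
    return (page_table[vpn] << PAGE_SHIFT) | offset;
}
```

Because every program's translation starts from its own table, every program can "run at address 0" while living anywhere in physical memory.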
Quote:
The problem with relocating running code (at least in the 6502 where code is not relative) is not that the relocation tables would have to be stored - but address values sneak into variables (fixed values like jump tables are in the relocation tables too) or into the stack. And it is basically impossible to determine which part of the stack is an address and what is normal data.
A handle, in its simplest possible incarnation, is merely a pointer to a pointer. The next most complex is an index into a table of pointers. It allows you to play all sorts of neat tricks with memory management without adversely affecting runtime performance (your inner-most loops will take the performance hit, but as discussed above, "locking" your used memory gets the performance back. Just remember to unlock your memory after you're done).
The "handle table" is a data structure known, at least in part, to both your application and your memory manager. When a chunk of memory is relocated, the memory manager is responsible for updating its corresponding handle pointer. This frees the application to refer to chunks of memory via the handle, which remains fixed throughout the entire lifetime of the program.
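The handle-table scheme can be sketched in a few lines of C. The names are illustrative; the point is that the application only ever stores the handle, so a relocation is one table update on the memory manager's side:

```c
#include <stddef.h>
#include <string.h>

/* A handle here is just an index into a table of pointers.  Only the
   memory manager writes the table; the application always goes through
   deref(), so a relocation is invisible to it.  Names are illustrative. */
enum { MAX_HANDLES = 16 };
void *handle_table[MAX_HANDLES];

void *deref(int h)
{
    return handle_table[h];
}

/* Memory-manager side: move a chunk and fix up its one table entry.
   The handle value itself never changes. */
void mm_move(int h, void *dst, size_t len)
{
    memmove(dst, handle_table[h], len);
    handle_table[h] = dst;
}
```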
Quote:
I had forgotten about the stack; but as long as you don't use preëmptive multitasking and relocate the program and its data space while it has references to outside programs
However, you can only get that with a strongly-typed language. Languages like Forth simply cannot handle this. Languages like Scheme, Smalltalk, and Lisp get around this problem either by using tagging, which I won't discuss here, or by simply not using activation frames as they're commonly understood.
Quote:
I'll say that the term "garbage collection" sounds related. How does that work (without getting too complicated)?
Code: Select all
void foo() {
char *myMemory = malloc(16); /* allocate 16 bytes of RAM */
UseItHere(myMemory); /* Do something with it */
free(myMemory); /* OK, we're done with it, so free it. */
}
Unfortunately, even without the specter of multitasking, in more complex applications it proves difficult to know when to actually release memory. For example, what actually happens in UseItHere() will determine whether or not the free(myMemory); call is actually a bug.
If UseItHere() adds that newly allocated piece of memory to a data store of some kind, then calling free() on it is a bug, because the memory might still be in use elsewhere in the program. (Let's ignore the fact that it actually is in use by the data store itself.)
So, the truly safest way to handle this is to rewrite the code as follows:
Code: Select all
void foo() {
char *myMemory = malloc(16); /* allocate 16 bytes of RAM */
UseItHere(myMemory); /* Do something with it */
}
This is called a "memory leak", and I'm positive you've heard that term before.
So, what is the solution? There are only two known solutions to this problem:
(1) Apply formal logic, problem solvers, and static program analysis to determine which chunks of memory get allocated, and when they're to be deallocated. This is the approach that functional languages like MLton use to help reduce garbage collection overhead. ML alone takes 22 code passes to perform all of its optimizations. MLton adds 13 more. Thankfully, ML compiles so fast that this hit in compile performance isn't that noticeable. Still, imagine trying to maintain a compiler with 35 passes, and compare that to your assembler or C compiler, which typically has only 2.
(2) Rely on garbage collection.
Garbage collection works because of a very simple observation: you can't use a chunk of memory if you don't have a pointer to it. Ignoring the case of synthesizing pointers (which is a type and safety violation, by the way), once you lose the last pointer to a region of memory, that's it -- it's gone forever!! You can't restore it, unless you have a pointer cached away somewhere accessible (like the data store UseItHere() might be using).
So, a garbage collector's job is to examine two things: the program state, and the memory state, and based on this mutual information, dispose of memory no longer referenced. It does this, for all intents and purposes, concurrently with the running of your program.
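That "examine program state and memory state" job can be shown with a toy mark-and-sweep pass. Here each object may reference at most one other object, a big simplification of real object graphs, and the roots stand for pointers the program still holds; everything unreachable from a root is reclaimed:

```c
#include <stdbool.h>

/* Toy mark-and-sweep over a fixed pool.  Each object may reference at
   most one other object (real collectors trace arbitrary graphs), and
   the roots stand for "pointers the program still holds". */
enum { POOL = 8 };

struct obj {
    int  ref;      /* index of the referenced object, or -1 */
    bool marked;
    bool live;     /* still allocated after the sweep */
};

void gc(struct obj *pool, const int *roots, int nroots)
{
    for (int i = 0; i < POOL; i++)
        pool[i].marked = false;

    /* Mark phase: chase references starting from each root. */
    for (int r = 0; r < nroots; r++)
        for (int i = roots[r]; i != -1 && !pool[i].marked; i = pool[i].ref)
            pool[i].marked = true;

    /* Sweep phase: anything unmarked is unreachable, hence garbage. */
    for (int i = 0; i < POOL; i++)
        pool[i].live = pool[i].marked;
}
```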
Notice that BASIC has a DIM statement to allocate memory for arrays. Notice it also can construct arbitrarily large strings. However, it completely lacks any means of disposing of allocated memory (some might argue the CLEAR command performs this job, but in reality, its purpose is more complex. Executing CLEAR inside a running program often results in catastrophe unless you know EXACTLY what you're doing). This is because BASIC is a garbage collected environment.
Notice how languages like Java or even Oberon allow you to create new objects, but have no syntactic means of getting rid of them. This is because they, too, utilize garbage collection.
The end results with garbage collected languages are:
(1) ever so slightly slower runtime performance. This is because the underlying CPU simply lacks hardware assistance for implementing "read barriers" and "write barriers", and so these things need to be implemented in software. Most of these barriers can actually be optimized away by advanced compilers, thankfully, which is why performance hits are "slight" and not "god-awful" like they were in the 60s and 70s.
(2) NEVER do you get the kind of crash that comes from following a stray pointer into freed memory. It's simply impossible, and provable mathematically.
You do get a different kind of resource leak with garbage collected languages though -- as long as you have a pointer to a chunk of memory, it'll never be discarded. Garbage collectors are not intention detectors, so when you are truly finished using a piece of data, the proper thing to do is assign its pointer to NULL (typically the value of zero, but it's best to use your language's symbolic representation for this; for Pascal/Modula-*/Oberon/Ada, use "nil" or "NIL" instead). That way, you eliminate that reference (without affecting others), giving the collector enough information to do its job.
Statistically speaking, however, comparing the memory leaks arising from not using garbage collection against the resource leaks from garbage collected systems, it is easy to see which is the lesser of the two evils. A brilliant example is the recent release of Internet Explorer, which will gladly crash itself after many hours of continual use due to leaked memory. Contrast this with Firefox, which will not leak memory nearly as fast unless you're using plugins (which Firefox can't control), such as the infamous Flash 9 or Flash 10 plugin under Linux.
Some languages, like Smalltalk or Lisp, won't even let you manipulate pointers of any kind, so it's impossible to casually leak resources in those environments. (It's doable, but you actually have to explicitly code for it!)
This raises a good point, too -- if you're planning on writing a program so that it's extensible with plugins, you can do all the static analysis and even dynamic analysis in the world, and you still won't have a reliable system. All because of those plugins, you might end up with an intractable memory leak that ultimately brings your system down.
This is why I advocate writing code in some dialect of Oberon instead of C: Oberon compiles to native code, just like C does, but has garbage collection. If you write extensible programs, then the extensions, too, are written in Oberon or an Oberon-compatible language. The end result is you get garbage collection and type-safety throughout the system. And, it compiles a hell of a lot faster too -- easily a factor of 15 on my box, and for comparable levels of optimization. For Enterprise-scale programs, that speed boost is quite significant.
I haven't learned Ada yet, but I understand Ada exhibits most, if not all, of the properties that make Oberon so appealing to me. I'll need to take that language up someday.
It's possible to use garbage collection with C, although it's a hack. Basically the garbage collector looks for anything that might be a pointer to allocated memory, although it has no way of knowing that it actually is a pointer, so you get some false positives. More dangerously, if a pointer is somehow disguised, or located somewhere the collector doesn't look, or placed at a different memory alignment than the collector expects, then the collector may free memory still in use. Still, some people swear they work.
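The "might be a pointer" scan is easy to sketch: walk a region of words (standing in for the stack) and keep alive any allocation whose address happens to appear there. A matching word that isn't really a pointer is exactly the false positive described above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conservative scan in miniature: any word in the scanned region whose
   value equals the address of an allocation keeps that allocation
   alive.  A word that merely happens to match is the false positive
   discussed above. */
bool possibly_referenced(const uintptr_t *words, size_t nwords, const void *alloc)
{
    for (size_t i = 0; i < nwords; i++)
        if (words[i] == (uintptr_t)alloc)  /* looks like a pointer to it */
            return true;
    return false;
}
```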
I definitely swear that they work as long as you write code like a normal human being. That is, no XOR-tricks to store two pointers in a single memory location, no bit-shifted addresses (I'm talking to you, Metacomco! What the HELL were you thinking when you decided to use BCPL pointers instead of normal CPU pointers for AmigaDOS data structures?!), etc.
However, they don't work as well as a fully confident garbage collector. Especially when the compiler has all the data it needs to fill out a few forms to allow the GC to do its job in a tag-free manner. C is a great language, but Oberon is better, if only because of GC (syntax is irrelevant to me for the most part, though others are utterly repulsed by its more easily readable syntax. That's always confused the heck out of me, but I digress). Thankfully, there's Java, which gives most of the advantages of Oberon with a syntax just C-like enough that people are overwhelmingly happy with it, despite the language's many flaws.
kc5tja wrote:
There are basically several ways of working around this problem:
are actually applicable to a 65xx system....
Quote:
(3) Use a memory management unit to page memory and establish multiple, unique address spaces for each program. This is what nearly all modern OSes do today. All programs literally run at address 0. Only shared libraries are loaded at non-zero addresses, and even then, nobody can guarantee where in the address space they appear, even amongst different address spaces! The kernel is responsible for managing memory in finite units called "pages" (typically 4K each), and updating hardware memory mapping configurations. This is the most difficult to implement, but by far the fastest and most convenient approach from an application programmer's point of view. No PC-relative code needed (except for "floating" shared libraries, as discussed above), and no need to lock memory manually, and no need for handling relocation events.
Quote:
Quote:
The problem with relocating running code (at least in the 6502 where code is not relative) is not that the relocation tables would have to be stored - but address values sneak into variables (fixed values like jump tables are in the relocation tables too) or into the stack. And it is basically impossible to determine which part of the stack is an address and what is normal data.
Relocating variables can still be done with a normal relocator, as long as it knows where that variable is. To access it, of course, as you write, the variable memory must be "locked" - be it with cooperative multitasking or specific lock/unlock calls around that critical section.
Quote:
Quote:
I had forgotten about the stack; but as long as you don't use preëmptive multitasking and relocate the program and its data space while it has references to outside programs
Quote:
Each of the activation frames on the stack has a corresponding "type descriptor" associated with it, which is required for its garbage collector to properly free unused memory. It can guarantee a piece of memory is
Quote:
However, you can only get that with a strongly-typed language.
Quote:
(2) Rely on garbage collection.
...
The end results with garbage collected languages are:
(1) ever so slightly slower runtime performance. This is because the underlying CPU simply lacks hardware assistance for implementing "read barriers" and "write barriers", and so these things need to be implemented in software. Most of these barriers can actually be optimized away by advanced compilers, thankfully, which is why performance hits are "slight" and not "god-awful" like they were in the 60s and 70s.
...
Read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache); a read barrier throws the cache away (everything read after the barrier comes from memory).
The Java memory model and the java.util.concurrent package provide greatly improved performance for synchronizing threads using these constructs (synchronized blocks, which are kind of thread-global, and volatile variables, which work in a local context - i.e. they might only flush the single page with the variable, not all dirty pages).
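The volatile behavior mentioned above can be demonstrated with a minimal sketch. Under the Java memory model, everything written before a volatile store is guaranteed visible after a subsequent volatile load of the same variable; the class and field names here are invented for illustration:

```java
// Minimal sketch of the happens-before guarantee a volatile variable gives:
// the plain write to 'payload' is published by the volatile write to 'ready',
// so a reader that sees ready == true is guaranteed to see payload == 42.
public class Publish {
    static int payload;              // plain field
    static volatile boolean ready;   // volatile flag acts as the "barrier"

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 42;            // (1) plain write
            ready = true;            // (2) volatile write "releases" (1)
        });
        writer.start();
        while (!ready) { Thread.yield(); }  // volatile read "acquires"
        System.out.println(payload);        // guaranteed to print 42
        writer.join();
    }
}
```

Without the volatile keyword on the flag, the reader could legally spin forever or observe a stale payload; the volatile access pair is what forces the ordering.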
Quote:
You do get a different kind of resource leak with garbage collected languages though -- as long as you have a pointer to a chunk of memory, it'll never be discarded. Garbage collectors are not intention detectors, so when you are truly finished using a piece of data, the proper thing to do is assign its pointer to NULL (typically the value of zero, but it's best to use your language's symbolic representation for this; for Pascal/Modula-*/Oberon/Ada, use "nil" or "NIL" instead). That way, you eliminate that reference (without affecting others), giving the collector enough information to do its job.
But yes, such memory leaks are a constant source of trouble. I need to remember your quote about the intention detector :-)
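The reachability rule behind that "assign it to NIL" advice can be illustrated with a toy mark-phase trace. This is not a real collector, just a hypothetical walk over a tiny object graph showing that clearing a root reference is exactly what makes memory collectable:

```java
// Toy sketch of a GC mark phase: everything reachable from the roots
// survives; the collector cannot know you are "done" with an object, it
// only knows whether a reference to it still exists.
import java.util.*;

public class ToyGc {
    static class Node {
        final String name;
        Node ref;                                  // one outgoing reference, for simplicity
        Node(String name) { this.name = name; }
    }

    // Returns the names of all nodes reachable from the given roots.
    static Set<String> reachable(List<Node> roots) {
        Set<String> marked = new TreeSet<>();      // sorted, for stable output
        Deque<Node> work = new ArrayDeque<>();
        for (Node r : roots) if (r != null) work.push(r);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (marked.add(n.name) && n.ref != null) work.push(n.ref);
        }
        return marked;
    }

    public static void main(String[] args) {
        Node a = new Node("A"), b = new Node("B"), c = new Node("C");
        a.ref = b;                                 // A -> B; C is a separate root
        List<Node> roots = new ArrayList<>(Arrays.asList(a, c));
        System.out.println(reachable(roots));      // prints [A, B, C]
        roots.set(1, null);                        // the "assign it to NIL" advice
        System.out.println(reachable(roots));      // C is now garbage: prints [A, B]
    }
}
```

As long as the root slot still points at C, no collector may reclaim it, which is precisely the "leak" being discussed: reachable but unwanted.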
Quote:
This is why I advocate writing code in some dialect of Oberon instead of C: Oberon compiles to native code, just like C does, but has garbage collection. If you write extensible programs, then the extensions, too, are written in Oberon or an Oberon-compatible language. The end result is that you get garbage collection and type-safety throughout the system. And it compiles a hell of a lot faster too -- easily a factor of 15 on my box, at comparable levels of optimization. For Enterprise-scale programs, that speed boost is quite significant.
OTOH you can write bad / slow code in any language. A compiler also is not an intention detector!
André
P.S.: I'm a Java architect for a living, so I might be biased....
kc5tja wrote:
CPLDs are generally MUCH faster than FPGAs, pin loading permitting. A 66MHz CPU programmed into an FPGA can generally be expected to run close to 200MHz on a CPLD made with the same semiconductor process.
(Personally, I'd prefer a 6502 or close relative!)
- BigDumbDinosaur
- Posts: 9425
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
This would only work as long as your program doesn't store program-referenced variables. If it did, then you can only relocate at load-time.
I'm not sure exactly what you mean by "program-referenced," but there are global and local variables and arrays, and it all stays intact. Additionally:
Quote:
Notice that BASIC has a DIM statement to allocate memory for arrays. Notice it also can construct arbitrarily large strings. However, it completely lacks any means of disposing of allocated memory (some might argue the CLEAR command performs this job, but in reality, its purpose is more complex. Executing CLEAR inside a running program often results in catastrophe unless you know EXACTLY what you're doing). This is because BASIC is a garbage-collected environment.
HP-71 BASIC does have a DESTROY instruction which allows destroying variables and arrays and freeing up the memory they used. For example, you can
Code: Select all
DIM A, B ! Create real variables, and
COMPLEX C(8192) ! a complex floating-point array of 8192 elements,
DIM D$[96], E$[10] ! then a couple of string variables.
<do stuff>
DESTROY C ! Get rid of C and free up 128K of RAM.
<do more stuff, still using A, B, D$, and E$>
Quote:
(4) Use a bytecode interpreter. This is the slowest possible method, and by what you described, is what your HP calculators are actually doing, especially with BASIC. Bytecodes can be implemented however you want, including mandating that all code be virtual-PC-relative.
I'm sure that's what it's doing. I wouldn't call the 71 a calculator though. Although it's far more powerful than the 41, it's not as practical as a calculator.
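To make (4) concrete, here is a toy bytecode interpreter. It is a hypothetical sketch only, with invented opcodes; the point is that jump targets are offsets into the code array, so the virtual PC is inherently relative and the whole program can be loaded anywhere in memory without relocation:

```java
// Minimal bytecode-interpreter sketch: all "addresses" are offsets into the
// program array, so moving the array in memory never invalidates anything.
public class TinyVm {
    static final int PUSH = 0, ADD = 1, JMP = 2, HALT = 3;

    static int run(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;                        // virtual PC, relative to code[0]
        while (true) {
            switch (code[pc++]) {
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case JMP:  pc = code[pc]; break;   // target is a code offset
                case HALT: return stack[sp - 1];   // result is top of stack
            }
        }
    }

    public static void main(String[] args) {
        int[] prog = { PUSH, 2, PUSH, 3, ADD, HALT };
        System.out.println(run(prog));             // prints 5
    }
}
```

The slowness kc5tja mentions is visible in the inner loop: every virtual instruction costs a fetch, a dispatch, and an operand access before any real work happens.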
Quote:
There are basically several ways of working around this problem:
Thank you for the explanations. I probably wasn't able to read them as fast as you typed them (Samuel types nearly 100wpm and seems to be able to talk about something else at the same time!), but they were definitely helpful. I don't understand them well enough to be able to tell someone else yet, but someday...
Quote:
Quite some good explanations. I wonder which of them
are actually applicable to a 65xx system....
perhaps none yet, but if we come up with a suitably capable one...
Quote:
That just doesn't work on 6502 hardware level...
How 'bout on a 65Org32 hardware level?
Quote:
All of this discussion is well and good, especially ideas on memory management. But, how does it "improve" the 65xx?
VBR started the topic to discuss improvements to the hardware, especially the processor's instruction set. I took it further into extending bus and register widths to get more power while actually dropping some addressing modes. After apologizing for kind of hijacking the topic, I got no complaint from VBR but others showed interest. If this would, down the road, lead to developing a more powerful 65-family processor, it can only be good to discuss things we might want it to be capable of while still making it clearly a 65-family processor.
Quote:
This is a solved problem. The solution is to never use actual addresses. Instead, you store handles to data.
A handle, in its simplest possible incarnation, is merely a pointer to a pointer. The next most complex is an index into a table of pointers. It allows you play all sorts of neat tricks with memory management without adversely affecting runtime performance...
I believe the Apple IIgs ProDOS used that. Is that correct?
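For illustration, the handle idea can be sketched in a few lines (all names are invented, and a real memory manager would also track sizes, free lists, and locking): the client keeps only an index into a table, so the manager can move the underlying block and patch one table entry without the client noticing.

```java
// Sketch of a handle as an index into a table of pointers: double
// indirection lets the memory manager relocate a block while every
// outstanding handle remains valid.
import java.util.ArrayList;
import java.util.List;

public class HandleTable {
    private final List<byte[]> table = new ArrayList<>();

    int alloc(int size) {               // returns a handle, not an address
        table.add(new byte[size]);
        return table.size() - 1;
    }

    byte[] deref(int handle) {          // handle -> table entry -> block
        return table.get(handle);
    }

    void relocate(int handle) {         // "move" the block; only the table changes
        byte[] moved = table.get(handle).clone(); // stand-in for copying to a new region
        table.set(handle, moved);
    }

    public static void main(String[] args) {
        HandleTable mm = new HandleTable();
        int h = mm.alloc(16);
        mm.deref(h)[0] = 99;
        mm.relocate(h);                 // block moved behind the client's back
        System.out.println(mm.deref(h)[0]); // prints 99, same handle
    }
}
```

The cost is the extra dereference on every access, which is the trade-off the quoted post alludes to with "without adversely affecting runtime performance" when the table lookup is cheap.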
Quote:
you can't just relocate the stack of a 6502
How about just changing the stack pointer, S? Even on the 6502, the stack space can be divided into chunks of say 64 bytes, enough that four different tasks can have plenty of stack space so they won't overrun it and start overwriting a portion reserved for another task. When you switch tasks, you store the current stack pointer in a variable, and load S from the variable for the next task. Of course the '816 allows far more with its 16-bit stack pointer. We could also make the new processor capable of natively handling more stacks.
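As a rough illustration of that scheme, here is a hypothetical simulation in Java of a 256-byte stack page split into four 64-byte regions, with the S value saved and restored on each task switch (the initial S values are made up to mark each region's top):

```java
// Simulation of per-task 6502 stacks: one 256-byte page, four 64-byte
// regions, and a saved stack-pointer variable per task. Switching tasks
// just swaps the S value, exactly as described above.
public class StackSwitch {
    static final byte[] page1 = new byte[256];               // the 6502's stack page
    static final int[] savedS = { 0xFF, 0xBF, 0x7F, 0x3F };  // tops of 4 regions
    static int s;                                            // the "S register"

    static void switchTask(int from, int to) {
        savedS[from] = s;       // store current stack pointer in a variable
        s = savedS[to];         // load S for the next task
    }

    static void push(int v) { page1[s--] = (byte) v; }       // 6502-style: S post-decrements
    static int  pull()      { return page1[++s] & 0xFF; }    // and pre-increments on pull

    public static void main(String[] args) {
        s = savedS[0];
        push(0x12);             // task 0 pushes
        switchTask(0, 1);
        push(0x34);             // task 1 uses its own 64-byte region
        switchTask(1, 0);
        System.out.println(Integer.toHexString(pull()));     // prints 12: task 0's data intact
    }
}
```

Nothing here checks for a region overflowing into its neighbor; as the post says, the scheme relies on 64 bytes being "plenty" for each task.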
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?