Storing strings in memory

APL · Post by **APL** » Sun Jan 24, 2016 12:11 am

I can see two ways to store an ASCII string in memory.

1. The string to be displayed, followed by one termination byte
e.g. 5465737400

2. A byte at the beginning of the string location describing the length of the string
e.g. 0474736554

In the first case you need to test each byte for the end-of-string terminator '00'. That test is not needed in the second case, although it might be considered a disadvantage that the string needs to be reversed in memory. For long strings or heavy string processing, I would expect the second to be quicker, although the downside may be that the string is limited in length.

Which method is recommended?

Secondly, on my project the VIA seems to run quite warm but not hot: I can lay my finger on it without too much discomfort. Does this sound OK?

GARTHWILSON · Post by **GARTHWILSON** » Sun Jan 24, 2016 12:23 am

I've used both ways, and can go either way, although I might slightly favor counted strings like Forth uses, rather than null-terminated. "Counted" means the first byte tells the number of data bytes in the string. If you had any reason to have a null byte in the string before the end, you can do it. Also, if you want to concatenate strings, or truncate a string, you don't have to go looking for the end, so it can be more efficient in some cases.
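As a rough illustration of the two layouts, here is a minimal C sketch (C rather than 6502 assembly, and not code from anyone in this thread). The names `terminated_length`, `counted_length` and `counted_truncate` are made up for the example. Note that the counted form keeps the characters in their normal order; only the length byte is added in front.

```c
#include <stddef.h>
#include <stdint.h>

/* "Test" stored both ways. */
const uint8_t test_terminated[] = { 0x54, 0x65, 0x73, 0x74, 0x00 };
const uint8_t test_counted[]    = { 0x04, 0x54, 0x65, 0x73, 0x74 };

/* Null-terminated: the length has to be found by scanning for the 0. */
size_t terminated_length(const uint8_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Counted: the length is simply the first byte (so at most 255). */
size_t counted_length(const uint8_t *s)
{
    return s[0];
}

/* Counted: truncating is a single store, with no search for the end. */
void counted_truncate(uint8_t *s, uint8_t new_len)
{
    if (new_len < s[0])
        s[0] = new_len;
}
```

Both length functions return 4 for the data above; the difference is that the first one has to touch every byte to find out.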

If the VIA is CMOS, i.e. a 65C22, there should be no discernible heating. I don't remember if the NMOS one produces discernible heating, but NMOS has other disadvantages anyway besides just taking more power, so I'd recommend using CMOS if possible.

barrym95838 · Post by **barrym95838** » Sun Jan 24, 2016 3:54 am

A third technique would take advantage of the 7-bit nature of plain-vanilla ASCII, and use the high-bit as an end-of-string marker:

PQRS --> 50 51 52 D3
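A small C sketch of reading such a string, assuming plain 7-bit ASCII as described; the function name `hibit_to_cstring` is hypothetical. One consequence worth noting is that an empty string cannot be represented this way, since there is always at least one character carrying the marker.

```c
#include <stddef.h>
#include <stdint.h>

/* "PQRS" with bit 7 set on the final character: 50 51 52 D3. */
const uint8_t pqrs[] = { 0x50, 0x51, 0x52, 0xD3 };

/* Copy a high-bit-terminated string into out as an ordinary C string.
   Returns the number of characters copied. out must have room for
   those characters plus the trailing 0. */
size_t hibit_to_cstring(const uint8_t *s, char *out)
{
    size_t i = 0;
    for (;;) {
        uint8_t b = s[i];
        out[i] = (char)(b & 0x7F);   /* strip the marker bit */
        i++;
        if (b & 0x80)                /* bit 7 set: that was the last one */
            break;
    }
    out[i] = '\0';
    return i;
}
```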

The problem I'm having with my experimental 32-bit hobby processor design is that it can't address individual 8-bit bytes, only 32-bit words. UTF-32 is a perfect fit for it, and is a nice specification, but it isn't super popular yet, and tends to carry a lot of bit-baggage for many simple usage cases. Eight-bit bytes have been around for a long time, and they are an integral part of some popular encodings, so my processor is going to have to work a bit harder than most to accommodate them. It's either that, or clutter up its native instruction set with a bunch of byte manipulators, and I am still trying very hard to avoid that.

An example is a translation of a DTC Forth based on Dr. Brad's work. Lots and lots of Forth words can be implemented in a single one-word 65m32 machine instruction (plus a one-word NEXT) ... branch + 1+ 1- 2* 2/ >R @ AND DROP DUP EXIT INVERT NEGATE OR R> SWAP UNLOOP XOR [ ] NIP RDROP 2RDROP ... I'm sure that I'm forgetting a few others. The Forth dictionary overhead is much larger than the actual code doing the work, especially if I don't pack the names. Packing and unpacking names is inefficient, so I'm thinking that my initial implementation will just use UTF-32 and "waste" about 75% of the available bits, in the interest of simplicity and expediency, even though it rubs the size-optimizing side of my personality the wrong way.
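To make the packing trade-off concrete, here is a hedged C sketch of storing four 8-bit characters per 32-bit word. It is not 65m32 code; `pack_chars` and `unpack_char` are invented names, and putting the first character in the low byte is an arbitrary choice for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n 8-bit characters into 32-bit words, four per word, first
   character in the low byte. Returns the number of words written. */
size_t pack_chars(const uint8_t *chars, size_t n, uint32_t *words)
{
    size_t nwords = (n + 3) / 4;
    for (size_t w = 0; w < nwords; w++)
        words[w] = 0;
    for (size_t i = 0; i < n; i++)
        words[i / 4] |= (uint32_t)chars[i] << (8 * (i % 4));
    return nwords;
}

/* Fetch character i back out: a divide, a shift and a truncation on
   every access, which is the overhead that one character per word avoids. */
uint8_t unpack_char(const uint32_t *words, size_t i)
{
    return (uint8_t)(words[i / 4] >> (8 * (i % 4)));
}
```

With one character per word, both routines disappear and indexing is direct, at the cost of three unused bytes in every word for 8-bit text.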

YMMV

Mike B.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Jan 24, 2016 5:27 am

APL wrote:

1. The string to be displayed, followed by one termination byte
e.g. 5465737400

2. A byte at the beginning of the string location describing the length of the string
e.g. 0474736554

Use of a null terminator is very common, thanks to the technique's proliferation in C. With the 65xx family, no special test is needed to detect the end of the string, as the act of loading $00 into a register will automatically set the Z flag in the status register.

If the string is to be prepended with a length then you have to use that length as a down-counter while processing the string, which usually means using both .X and .Y, the former to act as the counter and the latter to act as the index. With the zero terminator, only one index register need be used, as the index.

The one distinct advantage of prepending a string with its length is that the null byte has no special significance and hence may be embedded in a string without consequence. On the other hand, the programmer has to make a decision as to whether to use an eight bit or 16 bit length. BASIC implementations on the 65xx family generally use an eight bit length. Business BASIC implementations, such as BBX and Thoroughbred, use a 16 bit length and thus can handle much longer strings. Pick yer poison!
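A short C sketch of the embedded-null point and of the 8-bit versus 16-bit length choice. The function names are hypothetical, and storing the 16-bit length low byte first is an assumption made for the example (it matches the 65xx convention but nothing in this thread specifies it).

```c
#include <stddef.h>
#include <stdint.h>

/* Three data bytes, the middle one being $00. */
const uint8_t with_len8[]  = { 0x03, 'A', 0x00, 'B' };         /* 8-bit length  */
const uint8_t with_len16[] = { 0x03, 0x00, 'A', 0x00, 'B' };   /* 16-bit length */

/* 8-bit length prefix: strings of up to 255 bytes. */
size_t len8_length(const uint8_t *s)
{
    return s[0];
}

/* 16-bit length prefix, low byte first: strings of up to 65535 bytes. */
size_t len16_length(const uint8_t *s)
{
    return (size_t)s[0] | ((size_t)s[1] << 8);
}

/* What a null-terminated reader would see in the same data bytes. */
size_t visible_if_terminated(const uint8_t *data)
{
    size_t n = 0;
    while (data[n] != 0)
        n++;
    return n;
}
```

Both length-prefixed forms report all three data bytes, while a terminator-based reader pointed at the same data stops after the first one.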

commodorejohn · Post by **commodorejohn** » Tue Jan 26, 2016 8:24 pm

BigDumbDinosaur wrote:

The one distinct advantage of prepending a string with its length is that the null byte has no special significance and hence may be embedded in a string without consequence.

There's one other one as well, which is that the code won't get confused with a malformed string (i.e. one that has no terminator) and run off into the wilds reading or even trashing other parts of memory. It's this that leads some people to label Pascal (length-prepended) strings as "safer."

That said, I still prefer null-terminated for most purposes. It's always nice to get a critical operation for free, which any CPU which sets the flags on a register load will give you with null-terminated strings.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 27, 2016 3:12 am

commodorejohn wrote:

BigDumbDinosaur wrote:

The one distinct advantage of prepending a string with its length is that the null byte has no special significance and hence may be embedded in a string without consequence.

There's one other one as well, which is that the code won't get confused with a malformed string (i.e. one that has no terminator) and run off into the wilds reading or even trashing other parts of memory. It's this that leads some people to label Pascal (length-prepended) strings as "safer."

Except, of course, it is possible for the string length byte/word to be stepped on, creating a different set of problems.

My 65C816 string processing library works with null-terminated strings. I internally set the maximum string length (minus terminator) to 32,767 and abort processing if the (source) string length exceeds that—it's an easy check on the '816. There is also a check to determine if the catenation of two strings will exceed the 32KB length limit. It's not perfect, but it provides some protection.
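The kind of check described above might look roughly like this in C. This is only a sketch of the idea, not the library's actual code; `bounded_length` and `concat_fits` are made-up names, and the limit is passed in as a parameter (32767 would be the value mentioned above).

```c
#include <stddef.h>

/* Scan for the terminator, but give up once more than max_len
   characters have been seen. Returns 0 and stores the length on
   success, or -1 if no terminator was found within the limit. */
int bounded_length(const char *s, size_t max_len, size_t *len_out)
{
    for (size_t n = 0; n <= max_len; n++) {
        if (s[n] == '\0') {
            *len_out = n;
            return 0;
        }
    }
    return -1;
}

/* Would joining two strings of these lengths stay within the limit?
   Written so that the arithmetic itself cannot overflow. */
int concat_fits(size_t len_a, size_t len_b, size_t max_len)
{
    return len_a <= max_len && len_b <= max_len - len_a;
}
```

The scan still trusts that the first `max_len + 1` bytes are readable memory, so it limits the damage from a missing terminator rather than eliminating it.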

Quote:

That said, I still prefer null-terminated for most purposes. It's always nice to get a critical operation for free, which any CPU which sets the flags on a register load will give you with null-terminated strings.

A lot of language behavior has been based on how the hardware on which the language was originally developed behaved. In the case of C, which came to life on the DEC PDP-11, Ritchie was most likely taking advantage of the fact that the MOV instruction would set the Z flag in the condition code register if a null byte or word was copied into the target register, just as TXA would do the same thing in the 6502 if .X was loaded with $00.

sark02 · Post by **sark02** » Wed Jan 27, 2016 6:43 am

BigDumbDinosaur wrote:

If the string is to be prepended with a length then you have to use that length as a down-counter while processing the string, which usually means using both .X and .Y, the former to act as the counter and the latter to act as the index. With the zero terminator, only one index register need be used, as the index.

It's unconventional, but if you store strings backwards then the length acts as a starting offset and then you count down to zero. This can work well for fixed string tables, but less so for dynamic strings.

For example, given a string pointed to by ($b0, $b1), the following will emit the string to a UART (via a 'putchar' function):

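puts:  ldy #0
       lda ($b0),y     ; offset 0 holds the length byte
       beq exit        ; zero length: nothing to print
       tay             ; the length is also the offset of the last stored byte
loop:  lda ($b0),y     ; fetch characters from the highest offset downwards
       jsr putchar     ; putchar is assumed to preserve Y
       dey
       bne loop        ; stop before offset 0, which is the length byte
exit:  rts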

where ($b0, $b1) -> .byte 24, "!gnirts a si siht ,olleH"

And you can imagine how straightforwardly a string compare routine might be implemented. Other operations, such as string concatenation or building dynamic strings (e.g. reading a line of input from a terminal), similarly work backwards. Pointers to (addresses of) strings are used rather than the known start addresses of fixed buffers, because it's the end of the string in the buffer that is anchored (and that is not useful).

Use of this technique will fill your swear jar quickly.
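For anyone who finds the idea easier to follow outside assembly, here is a rough C rendering of the same loop. It writes to a buffer instead of calling a UART routine, and the name `emit_reversed` is invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Length byte (24 = 0x18) followed by the characters in reverse order. */
const uint8_t hello_reversed[] = "\x18" "!gnirts a si siht ,olleH";

/* Emit a reversed, length-prefixed string into out in reading order.
   out must have room for s[0] characters plus a trailing 0.
   Returns the number of characters emitted. */
size_t emit_reversed(const uint8_t *s, char *out)
{
    size_t n = 0;
    /* y plays the role of the Y register: it starts at the length and
       counts down, stopping before offset 0 (the length byte). */
    for (uint8_t y = s[0]; y != 0; y--)
        out[n++] = (char)s[y];
    out[n] = '\0';
    return n;
}
```

As in the assembly version, a single variable serves as both the counter and the index.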

GARTHWILSON · Post by **GARTHWILSON** » Wed Jan 27, 2016 6:59 am

Quote:

A lot of language behavior has been based on how the hardware on which the language was originally developed behaved.

I was just thinking of this in terms of the '816 with its move instructions and counted strings. If you want to append one string to the end of another, you'd use the length byte of the original one to know where the end is, without having to go look for it. That will be the destination of the second string. You'd use the length byte of the second string to set up the number of bytes to move, before doing MVP or MVN. Then there is no INX or CPX instruction in any loop. In fact, there's no loop needed. It would be similar if you wanted to put one string in the middle of another, or at the left end of the first.
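The same idea can be sketched in C, with `memcpy` standing in for the block-move instruction. This is an illustration under assumptions, not '816 code: the name `counted_append` is made up, strings use a one-byte length, and the destination buffer is assumed to have room for the full 255 characters.

```c
#include <stdint.h>
#include <string.h>

/* Append counted string src to counted string dst, in place.
   dst must be a buffer of at least 1 + 255 bytes.
   Returns 0 on success, or -1 if the result would not fit in an
   8-bit length (dst is then left unchanged). */
int counted_append(uint8_t *dst, const uint8_t *src)
{
    unsigned total = (unsigned)dst[0] + src[0];
    if (total > 255)
        return -1;
    /* The first length byte gives the destination directly and the
       second gives the byte count: one block move, no explicit loop. */
    memcpy(dst + 1 + dst[0], src + 1, src[0]);
    dst[0] = (uint8_t)total;
    return 0;
}
```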

In my last big project, I did use null-terminated strings though, because I didn't have to do such string gymnastics. The only string editing was such that the string lengths were unchanged, and beyond that, all I had to do was display them.

Tor · Post by **Tor** » Wed Jan 27, 2016 8:37 am

On a minicomputer I worked with, the Fortran compiler used a character type (strings) which was a two-word descriptor. One word for the address of the string, the other word for the length.

calculi · Post by **calculi** » Wed Jan 27, 2016 3:36 pm

APL wrote:

Secondly, on my project the VIA seems to run quite warm but not hot: I can lay my finger on it without too much discomfort. Does this sound OK?

GARTHWILSON wrote:

.......

If the VIA is CMOS, i.e. a 65C22, there should be no discernible heating. I don't remember if the NMOS one produces discernible heating, but NMOS has other disadvantages anyway besides just taking more power, so I'd recommend using CMOS if possible.

Hi Garth and APL,

I experienced the same as APL: when I used CMOS Rockwell chips in my machine, the VIA (I tried 3 different ones) was warm, and the infrared thermometer gave approximately 45 °C / 113 °F. There was no correlation with what the chip was doing (uninitialized, I/Os working or not, timers running or not). Both other chips (CPU + ACIA) stayed at room temperature. More logically, the NMOS ones all ran warm.

Needless to say, I checked and re-checked my board and connections: no trouble. Then I switched to WDC chips: they're all at room temperature (running 24/24).

Marc

White Flame · Post by **White Flame** » Wed Jan 27, 2016 7:43 pm

Tor wrote:

On a minicomputer I worked with, the Fortran compiler used a character type (strings) which was a two-word descriptor. One word for the address of the string, the other word for the length.

CBM BASIC, and presumably many other BASICs, do the same. 1 byte for length, 2 bytes for a character pointer. This allows the character array to be stored not just on the heap, but also have the option to point to the raw literal strings in the BASIC source code for statements like A$="FOO".

Each variable has a 5-byte slot to store its value, and if the variable was a string, only 3 are used for this length+pointer structure.
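A hedged C sketch of such a length-plus-pointer descriptor. It is not CBM BASIC's actual layout (a host pointer replaces the 2-byte address), and `string_desc`, `program_text` and `literal_desc` are hypothetical names. It shows how a descriptor can point straight at a literal inside the program text without copying it.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t     length;   /* number of characters, 0..255 */
    const char *chars;    /* where the characters live; not owned */
} string_desc;

/* Stand-in for program text held in memory. */
const char program_text[] = "A$=\"FOO\"";

/* Build a descriptor for the literal between the first pair of double
   quotes in text. No characters are copied; the descriptor points into
   text itself. Returns a zero-length descriptor if no literal is found. */
string_desc literal_desc(const char *text)
{
    string_desc d = { 0, NULL };
    const char *open = strchr(text, '"');
    if (open != NULL) {
        const char *close = strchr(open + 1, '"');
        if (close != NULL && close - (open + 1) <= 255) {
            d.length = (uint8_t)(close - (open + 1));
            d.chars = open + 1;
        }
    }
    return d;
}
```

Because the characters are not owned by the descriptor, the same structure can refer either to heap storage or to the source text, as described above.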

BigEd · Post by **BigEd** » Wed Jan 27, 2016 8:28 pm

BigDumbDinosaur wrote:

A lot of language behavior has been based on how the hardware on which the language was originally developed behaved. In the case of C, which came to life on the DEC PDP-11, Ritchie was most likely taking advantage of the fact that the MOV instruction would set the Z flag in the condition code register if a null byte or word was copied into the target register, just as TXA would do the same thing in the 6502 if .X was loaded with $00.

An interesting document came to light recently(+), from 1971, which is described as version zero of the documentation for Unix. It describes Unix as running on a PDP-11, with reference to the previous and first version which ran on PDP-7 and -9 (both 18 bit machines.) The version described was, I think, written in assembler(*), and includes a B compiler. (B came after BCPL and before C.) The OS call interface already makes use of NUL-terminated strings. I would guess that the same would be true of the previous version, but I suppose it's possible that that isn't so. I don't know whether or not the -7 and -9 come with a handy zero flag.

It does seem that BCPL either used counted strings, or both conventions, depending on which source you consult. Here's Dennis M Ritchie, in Development of the C Language:

Quote:

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

(+) See
http://www.tuhs.org/Archive/PDP-11/Dist ... onZero.txt

(*) By which I mean assembly language, of course! I'm just teasing.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 27, 2016 9:32 pm

BigEd wrote:

An interesting document came to light recently(+), from 1971, which is described as version zero of the documentation for Unix. It describes Unix as running on a PDP-11, with reference to the previous and first version which ran on PDP-7 and -9 (both 18 bit machines.) The version described was, I think, written in assembler(*), and includes a B compiler. (B came after BCPL and before C.) The OS call interface already makes use of NUL-terminated strings. I would guess that the same would be true of the previous version, but I suppose it's possible that that isn't so. I don't know whether or not the -7 and -9 come with a handy zero flag.

I seem to vaguely recall that the PDP-11 assembly language was an enhanced version of the PDP-7 assembly language, which means the use of the null terminator would have been a "natural" idiom on the PDP-7 as well. I knew just enough PDP-11 assembly language to be dangerous.

Incidentally, B was written by Ken Thompson in 1970 to run on UNIX on PDP-7 hardware. Thompson took everything out of BCPL that he thought could be eliminated without losing too much functionality, doing this mainly to accommodate the PDP-7's tiny address space. B's performance was lackluster and, being typeless (as was BCPL), it made handling complex data structures a painful exercise. The development of C addressed these concerns.

commodorejohn · Post by **commodorejohn** » Wed Jan 27, 2016 10:40 pm

BigDumbDinosaur wrote:

I seem to vaguely recall that the PDP-11 assembly language was an enhanced version of the PDP-7 assembly language, which means the use of the null terminator would have been a "natural" idiom on the PDP-7 as well. I knew just enough PDP-11 assembly language to be dangerous.

Nah, the PDP-7 is very much more like an 18-bit PDP-8: single-address, single-accumulator, bare-bones instruction set. (Interesting how so many of the 8-bit microprocessors looked like this, while so many of the 16/32-bitters looked more like the PDP-11.) But it did have "skip if AC is zero/nonzero" instructions, so the general point stands: null-terminated strings = free end-of-string checking.
