PostPosted: Fri Feb 20, 2015 12:12 pm 
Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
Martin_H wrote:
Because the arguments are already on the stack, a function call in Forth is really cheap compared to a language like C which builds a call frame on the stack.

It's not quite as advantageous as all that. In Forth, everything has to be parked on the stack before it can be operated on, which carries a speed penalty. In addition, Forth words (especially system words) consume the parameters passed to them on the stack, so you find yourself doing a lot of DUPs, which is not much different from what happens in C. On top of that, many C compilers will pass parameters in registers rather than on the stack, which, combined with other optimizations, makes for some very fast code (especially with the optimizing compilers we get today).
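
To give a trivial (made-up) example of the DUPs I mean: a word that still needs a value after handing it to something else has to copy it first.
Code:
: SQUARED ( n -- n*n )  DUP * ;          \ * consumes both cells, so n must be DUPed first
: REPORT  ( n -- )  DUP . SQUARED . ;    \ . consumes n, so another DUP to keep it around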

For all that, I find Forth a lot more fun. :)


PostPosted: Fri Feb 20, 2015 12:29 pm 
Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
Martin_H wrote:
I'd love to see your malloc implementation. I usually learn a thing or two when I read someone else's code.

I'm happy to oblige; the code is listed below. Of course, if you haven't read and understood the section on memory management in K&R's ANSI C, it won't mean much to you (except that it works!)

Fortunately, at the time I made lots of notes and diagrams about this and even turned them into a PDF. I am attaching the PDF to this post to help you get a grip on what I am trying to do.

Attachment:
memorymanagementk&r.pdf [117.68 KiB]
Downloaded 183 times


Regarding the code itself, you might notice that apart from MALLOC and FREE themselves, there is not a whole lot of factoring going on. None of the words I might have factored out would be useful in their own right, and after all, the code is more or less translated from C. Fortunately, that doesn't make the stackrobatics any more difficult, and it doesn't create any long-branch problems (in STC Forths, a branch is limited to +127 bytes, or about 40 words maximum).

You might also notice that in spite of the weird and wonderful things you can do with CREATE/DOES> to set up data structures, all I did was add offsets to get the address of each element in the structure. (I'm a simple guy.)
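
For anyone curious, the CREATE/DOES> approach would be a defining word along these lines (a sketch only, not part of the listing below; the Forth-2012 standard calls essentially this word +FIELD):
Code:
( sketch only: a CREATE/DOES> field definer )
: +FIELD ( offset size "name" -- offset+size )
   CREATE OVER , +  DOES> ( addr -- addr+offset ) @ + ;

( example: a two-cell structure with named field accessors )
0  1 CELLS +FIELD >first  1 CELLS +FIELD >second  CONSTANT /pair
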
Code:
: HEAPM ; ( for FORGETting )

512 CONSTANT NALLOC ( minimum size of new block allocated )

( define a "header" structure )
: >next ; IMMEDIATE ( does nothing, first element in structure )
: >size  CELL+ ;
2 CELLS CONSTANT headerSize

( global variables )
CREATE startBase 2 CELLS ALLOT
VARIABLE freep

: mallocClear ( initialize base, freep )
   startBase DUP >next !
   0 startBase >size !
   startBase freep !
;

: allocateBlock ( prevp p nunits -- address )
   ( if this function is called then p is big enough for nunits )
   OVER >size @ OVER = ( prevp p nunits p.size=nunits? )
   IF ( remove from free list )
      DROP 2DUP >next @ ( prevp p prevp p.next )
      SWAP >next ! ( prevp.next = p.next )
   ELSE ( allocate tail end of p )
      OVER >size @ OVER - >R ( prevp p nunits [ p.size-nunits ] )
      SWAP R@ OVER >size ( prevp nunits p newsize &p.size [ newsize ] ) !     
      R> + ( prevp nunits newp )
      SWAP OVER >size ! ( prevp newp )
   THEN
   SWAP freep ! headerSize +
;

: startOrEnd? ( bp p -- flag )
   DUP >next @ ( bp p p.next )
   OVER > ( bp p p.next>p? )
   IF 2DROP 0 EXIT THEN
   2DUP > -ROT ( bp>p? bp p )
   >next @ < ( bp>p? bp<p.next? )
   OR
;

: betweenBlocks? ( bp p -- flag )
  2DUP > -ROT ( bp>p? bp p )
  >next @ < ( bp>p? bp<p.next? )
  AND
;
   
: free ( addr -- )
   headerSize - freep @ ( bp p )
   BEGIN
      2DUP betweenBlocks?
      IF
         -1
      ELSE
         2DUP startOrEnd?
         IF
            -1
         ELSE
            >next @ 0
         THEN
      THEN
   UNTIL
   OVER DUP >size @ + ( bp p bp+bp.size )
   OVER >next @ = ( bp p bp+bp.size=p.next? )
   IF
      OVER >size ( bp p &bp.size )
      OVER >next @ >size @ ( bp p &bp.size p.next.size )
      SWAP +!
      2DUP >next @ >next @ ( bp p bp p.next.next )
      SWAP >next !
   ELSE
      2DUP >next @ ( bp p bp p.next )
      SWAP >next !
   THEN
   2DUP DUP >size @ + = ( bp p bp=p+p.size? )
   IF
      OVER >size @ ( bp p bp.size )
      OVER >size +!
      SWAP >next @ ( p bp.next )
      OVER >next ! ( p )
   ELSE
      SWAP OVER >next ( p bp &p.next ) !
   THEN
   freep !
;

: morecore ( nunits -- address )
   DUP NALLOC >=
   IF
      HERE OVER ALLOT ( nunits addr )
      SWAP OVER >size !
      headerSize + EXIT
   THEN
   HERE NALLOC ALLOT ( nunits addr )
   2DUP >size !
   SWAP 2DUP + ( addr nunits freeptr )
   NALLOC ROT - ( addr freeptr leftover )
   OVER >size ! ( addr freeptr )
   headerSize + free ( addr )
   headerSize +
;
: malloc ( size -- address )
   headerSize + ( nunits )
   freep @ DUP >next @ ROT ( prevp p nunits )
   BEGIN
      OVER >size @ OVER >= ( prevp p nunits p.size>=nunits? )
      IF
        allocateBlock EXIT
      THEN
      OVER freep @ = ( wrapped around? )
      IF
         NIP NIP ( nunits ) morecore ( addr ) EXIT
      THEN                 
      ROT DROP ( p nunits ) SWAP DUP >next @ ROT
   AGAIN
;

mallocClear ( starts the ball rolling )
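
If you want to give it a quick spin, something like this should exercise it (untested as pasted here, and the names are just placeholders):
Code:
( grab two blocks, then hand them back to the free list )
100 malloc CONSTANT buf1
 64 malloc CONSTANT buf2
buf2 free
buf1 free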


PostPosted: Fri Feb 20, 2015 6:42 pm 
Joined: Wed Jan 08, 2014 3:31 pm
Posts: 575
@theGSman, thanks for the heap code. I'll take a look at it later.

Good point about the incremental cost of DUP being a performance drag. Benchmarking C versus FORTH is an interesting topic because I've seen results that are all over the map. I learned C before I learned FORTH, and I'm generally more comfortable with C. But I've done a few hobby robotics projects in FORTH, and its interactive nature was helpful, although reusing C code somebody else wrote was usually the fastest way to get the job done.


PostPosted: Fri Feb 20, 2015 7:53 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
I feel my staunch Forth defender trait coming out. :D

theGSman wrote:
Martin_H wrote:
Because the arguments are already on the stack, a function call in Forth is really cheap compared to a language like C which builds a call frame on the stack.

It's not quite as advantageous as all that. In Forth, everything has to be parked on the stack before it can be operated on which creates a speed penalty.

With good planning (which, with experience, becomes natural), the things that need to be operated on are already on the stack, in the right order, requiring no additional attention, nor requiring constant re-mirroring for one "subroutine" to pass some of its inputs to yet another, which may pass one or more to yet another. There is no additional overhead like C has for that.
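
For instance (a contrived little sketch), a word can push a value and call another word that finds all of its inputs already in place, with nothing copied or shuffled:
Code:
: SCALE   ( n percent -- n' )  100 */ ;       \ */ keeps a double-cell intermediate product
: ADJUST  ( n -- n' )  75 SCALE  10 + ;       \ SCALE's two inputs are already on the stack

No frame gets built and nothing gets moved; ADJUST's own input just stays put and becomes SCALE's first input.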

Quote:
In addition, Forth words (especially system words) consume the parameters passed to them on the stack so you find yourself doing a lot of DUPs which is not much different to what happens in C.

For combinations that come up frequently, like DUP >R, I made them their own word, a primitive (i.e., written in assembly, not high level), in this case DUP>R, which is faster than even >R alone because it doesn't have to do INX INX to delete the TOS cell after copying it to the return stack.

Quote:
In addition, many C compilers will pass parameters via the registers rather than the stack which with some optimizations makes for some very fast code (especially with the optimizing compilers we get today).

The bulk of the 6502's "registers" are in ZP; but the really good way to run Forth is with a stack processor whose assembly language basically is Forth, with the stacks onboard, each with its own bus, so that external memory, the data stack, and the return stack can all be accessed in the same cycle. Even 25 years ago, the relatively simple Harris RTX2000 stack processor, with optimizations like the above, could routinely run 16 Forth MIPS at 12MHz, and 60 Forth MIPS peak. Its total interrupt latency was four clocks, and return-from-interrupt was zero clocks.

The newest interest of Chuck Moore (the inventor of Forth who turns 77 this year) seems to be http://www.greenarraychips.com/ where he has 144 stack processors on a single IC, for massively parallel embedded applications. Each of the processors on the IC runs Forth instructions in as little as 1.4ns, putting equivalent peak performance at 100GIPS.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sat Feb 21, 2015 3:20 am 
Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
GARTHWILSON wrote:
I feel my staunch Forth defender trait coming out. :D

Uh Oh! I better put the flames out fast! :mrgreen:

I was only responding to the claim that Forth is faster because the data is already on the stack when the word is called (a claim I have often seen before). I was just pointing out that you still have to get the data onto the stack in the first place. Also, a high-level language that is designed to be about as close to assembly language as you can get, and for which programmers have had decades to develop optimization strategies, is going to provide stiff competition for Forth in the speed stakes. On modern computers, even coding in assembly language is likely to result in a slower program than C would produce.

Otherwise, everything you say is correct. You can always code Forth words in assembly if speed is an issue, careful use of the stack will keep unnecessary stackrobatics to a minimum, and processors designed to implement Forth directly can blitz the competition. (Unfortunately, the x86 is the VHS of processors.)


PostPosted: Sat Feb 21, 2015 5:15 pm 
Joined: Tue Jul 24, 2012 2:27 am
Posts: 672
Forth CPU MIPS/GIPS numbers are also wonky compared to register-based processors, because of how incredibly little work an instruction does on that architecture. For instance, 5 + is two instructions on a Forth processor, while it's a single immediate-operand instruction on most register-based ones. Plus, on a register-based processor, that addition would set the N/Z/V/C flags for free, whereas it would often take additional Forth instructions to expose those for testing. Indexed addressing modes also collapse three or more Forth instructions into a single register-machine instruction. So you need to divide the IPS by the ratio of instructions required between comparable pieces of code. Also consider that these chips live in tiny address spaces, so they physically can't do a lot of general computational work (specifically regarding the GreenArrays design).

Of course, the argument over the ratio of instructions required between the two styles rages on. ;) There is instruction expansion in going to Forth, as listed above, but also reduction from judicious use of intermediate stack state. Whether the latter makes up for the former is the battle.

Stack operations also cannot be effectively parallelized as-is, whereas out-of-order execution is common with register files, since registers are independent places to work on intermediate data; the top of stack and its history are effectively a single place for intermediate data. Stack code can be somewhat parallelized in theory, but I don't know of any attempt at it in practice, and Forth coding style would have to change drastically. Stack-based VMs are typically stack-based for compactness and ease of code generation, and are then translated to register-based form for fast execution and other transformative optimizations.
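
To put rough numbers on the expansion (my own back-of-the-envelope sketch; counts ignore NEXT/call overhead):
Code:
: FIVE+   ( n -- n+5 )   5 + ;     \ two stack-machine operations (literal, add) vs. one add-immediate
: FIELD@  ( addr -- x )  8 + @ ;   \ three operations (literal, add, fetch) vs. one base+offset load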

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Sat Feb 21, 2015 7:25 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
It seems to usually take a register-based processor at least a few instructions to carry out one Forth instruction.

Here's a page I kept from 1991 on Silicon Composers' SC/Fox Cub SBC that uses the RTX2000. There are 32-bit stack processors, but I don't know if there are any 64-bit ones. This one is 16-bit. The performance is pretty impressive. Look how it compares to the 80386 in the chart on the second page! (Shucks-- I see in the review that somehow the order of the file attachments got reversed, so the second page is the first one shown.) Compared to the 80386 (plus '387 coprocessor), a register-based processor which ran at a much higher clock speed, the RTX2000 did a 64-point FFT 137 times as fast, ran the Sieve benchmark 16 times as fast and the sort algorithm 16 times as fast, and even the string move was faster. The only place it was slower was floating point, because it had no floating-point support in hardware.


Attachments:
FoxCubSBC2.jpg
FoxCubSBC2.jpg [ 374.76 KiB | Viewed 2899 times ]
FoxCubSBC.jpg
FoxCubSBC.jpg [ 363.78 KiB | Viewed 2899 times ]

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
PostPosted: Wed Feb 25, 2015 2:00 am 
Joined: Tue Jul 24, 2012 2:27 am
Posts: 672
GARTHWILSON wrote:
It seems to usually take a register-based processor at least a few instructions to carry out one Forth instruction.

And if you were to write a 6502 emulator in Forth, I'm sure it would take at least a few Forth instructions to carry out one register machine instruction, especially where ALU flags are concerned. I'm not sure that exposing one environment on the other architecture is very applicable here; I'm more pointing to generalized execution notions, like "add a number to what's in intermediate storage" or "dereference 5 nested structure pointers, each with offsets".

Regarding those benchmarks, there's no description of how they were obtained, and they come from marketing materials selling a board built around the chip. I have no idea how they got the 386 to run so slowly. Here's a more thoroughly documented benchmark, even though it's from a Harris employee: http://users.ece.cmu.edu/~koopman/forth/sigarch_92.pdf It's a much flatter comparison (with the 68020 trailing well behind the other two), and the RTX still loses to the SPARC in MHz-normalized wall-clock speed. Yes, it's competitive, but nowhere near wildly better.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Fri Jun 03, 2016 3:52 am 
Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
barrym95838 wrote:
scotws wrote:
... Do you have some mechanism to native-compile short words (so DROP is INX INX and not a jump or call)? At the moment, deciding which words to native compile in Tali is rather ad hoc ("looks short"). Once stuff is stable, I'll have to figure out something a bit more scientific based on a "acceptable percentage of overhead".


Sorry that I forgot to answer you sooner, Scot. I am writing a Forth for my 65m32 not for speed or utility, but for learning and shaking out my machine instruction set before putting a final seal on it. I know that I'm on to something good, because I always find myself fighting the urge to code high-level words as primitives, due to the almost inevitable conclusion that so many of them wind up being not just faster, but shorter as well.

In my unfinished [sigh] DTC Forth based on Dr. Brad's Camel Forth, NEXT is one machine instruction, and the following primitives are all one (or zero!) machine instructions (plus NEXT):
Code:
+ - 1+ 1- 2* >BODY @ ALIGN ALIGNED AND branch C@ CELL+ CELLS CHAR+ CHARS DROP DUP enter EXECUTE EXIT INVERT NEGATE NIP OR SWAP UNLOOP XOR [ ]

Dozens of my other primitives are only two or three machine instructions (plus NEXT), and I can imagine that I would be doing tons of in-lining in an STC system. In fact, (and this may bring some criticism down on me), I'm having trouble justifying spending any more of my limited spare time learning how to program in Forth, because programming in assembler is so much fun, even though I'm still hand-assembling everything :shock: . I am porting a version of Forth because the act of porting interests me, but I get this nagging feeling that actually programming something non-trivial in Forth will never feel "natural" to me, like programming in BASIC or C or assembler does. I wind up with something awkward like this, because "stackrobatics" don't come naturally to me (this doesn't quite work correctly, and I haven't figured it out yet):
Code:
: line                  \ plot solid line       ( 'x 'y --  )
                        \ Draw a solid line from x,y to 'x,'y
                        \ x and y are global variables, and update to 'x,'y
                        \   . means "delta "
                        \   ' means "updated value of "
                        \   ~ means "sign of delta " ... -1 if <0, +1 if >=0
                        \   | means "absolute value of delta "
  y @ -                 \ calculate .y          ( 'x .y )
  dup 0< 1 or swap      \          ~.y          ( 'x ~y .y )
  abs dup >r            \          |.y          ( 'x ~y |y )            ( R: |y )
  rot x @ -             \ calculate .x          ( ~y |y .x )
  dup 0< 1 or swap      \          ~.x          ( ~y |y ~x .x )
  abs                   \          |.x          ( ~y |y ~x |x )
  dup                   \ c is slope accum.     ( ~y |y ~x |x c )
  dup r> +              \ k is loop counter     ( ~y |y ~x |x c k )     ( R: )
  negate 1 swap do      \ main loop (           ( ~y |y ~x |x c )       ( R: 1 -k )
    plot                \   plot (x,y)
    i 0< if             \   is line finished? ( ( ~y |y ~x |x c )
      dup 0< if         \     n:  c < 0? (
        over + >r       \       y: ( c += |x    ( ~y |y ~x |x )         ( R: 1 'k 'c )
        3 pick y        \            y += ~y    ( ~y |y ~x |x ~y y )
      else              \          )
        3 pick - >r     \       n: ( c -= |y    ( ~y |y ~x |x )         ( R: 1 'k 'c )
        over x          \            x += ~x    ( ~y |y ~x |x ~x x )
      then              \          )
      +! r>             \       update x or y   ( ~y |y ~x |x 'c )      ( R: 1 'k )
    then                \   )
  loop                  \ )                     ( ~y |y ~x |x 'c )      ( R: )
  drop 2drop 2drop      \ delete temps          (  )
;                       \                       (  )


Mike B.

[Edit: Ah, I think I found something ... I should be initializing my slope accumulator to |dx|-|dy| ... not sure if that's all, but I will have to test it later.]


I replied to Michael on comp.lang.forth: https://groups.google.com/forum/#!topic ... 101-125%5D

This is what I said:
------------------------------------------------------------------------------------------------------------------------------
I skimmed over that thread. I didn't examine your code carefully.

I will say that line drawing is an example I have used before when discussing how to write Forth code.

You don't want a function with a lot of parameters such as these:

: draw-line ( ax ay bx by -- ) ... ;
: intersect-lines ( ax ay bx by dx dy ex ey -- x y intersected? )

Having a lot of parameters quickly becomes a nightmare of stack-juggling. Using locals helps, but it is still horrible design.

You are better off using structs:

0
w field .x
w field .y
constant point

0
point field .a
point field .b
constant line

point
w field .r
constant circle

: draw-line ( line -- ) ... ;
: intersect-lines ( lineA lineB -- point intersected? ) ... ;

In the above functions, line and lineA and lineB are all pointers to structs of type LINE. Structs are kept on the heap. The function INTERSECT-LINES returns a pointer to a struct of type POINT and a flag indicating if the lines intersected.

I have a lot of examples of structs in the novice-package. See LIST.4TH for some simple examples.

Here are some constructors (I haven't tested this, but it is pretty obvious and should work). Note that ALLOC is like ALLOCATE but does the error-checking itself rather than giving you a flag.

: init-point ( x y node -- node )
   >r
   r@ .y !
   r@ .x !
   r> ;

: new-point ( x y -- node )
   point alloc
   init-point ;

: init-circle ( r x y node -- node )
   init-point >r
   r@ .r !
   r> ;

: new-circle ( r x y -- node )
   circle alloc
   init-circle ;

Notice that the constructor is factored into INIT-xxx and NEW-xxx words. The purpose of this is so the INIT-xxx word can be used inside of the INIT-yyy word of a child class. In the above example, the CIRCLE class is a child of the POINT class. This is a crude form of OOP --- this is all I do in regard to OOP in the novice-package --- I have found that this works quite well for simplifying code.
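
Usage would look something like this (untested, just to show the idea):

3 10 20 new-circle constant c1      \ radius 3, centered at (10,20)
c1 .r @ .                           \ prints 3
c1 .x @ .  c1 .y @ .                \ prints 10 20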


PostPosted: Fri Jun 03, 2016 4:06 am 
Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
Thanks Hugh, and welcome. I can only understand a fraction of what you're saying, but I'm working on it, slowly (I have a thousand things on my plate, and sometimes I feel like I'm not making any progress at all).

Mike B.

