PostPosted: Fri Mar 06, 2009 6:33 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
GARTHWILSON wrote:
What goes on in PCs' video cards, sound cards, and graphics engines to take a big burden off the main processor?


This is much too open-ended a question to answer here, even succinctly.

Quote:
Samuel has occasionally touched on how GEOS (software) on the Commodore 64 computed the video for various fonts and sizes,


I did?? Where?

Quote:
I sometimes wonder how much of that work of figuring out what value to give each dot is done by the main processor and how much by the video chipset.


With today's clock speeds, it doesn't really matter. In practice, though, it's almost always done in the video chipset and rarely in the CPU.

When scrolling a window, for example, the software will call a graphics procedure to "blit" (BLock Image Transfer) a submatrix of pixels from one location to another. This task is performed by the GPU or some comparable coprocessor.
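
For concreteness, here is a minimal C sketch of what such a blit amounts to in software, assuming a 32-bit row-major framebuffer (the pixel format and helper name are illustrative only); the GPU performs the same rectangle copy in dedicated hardware, asynchronously to the CPU.

Code:
#include <stdint.h>
#include <string.h>

/* Copy a w-by-h block of 32-bit pixels from one row-major buffer to another.
   Strides are in pixels.  When scrolling within a single surface, the copy
   must proceed in the right order (or use memmove) because the source and
   destination rectangles can overlap. */
static void blit(uint32_t *dst, int dst_stride, int dx, int dy,
                 const uint32_t *src, int src_stride, int sx, int sy,
                 int w, int h)
{
    for (int row = 0; row < h; row++) {
        memcpy(&dst[(size_t)(dy + row) * dst_stride + dx],
               &src[(size_t)(sy + row) * src_stride + sx],
               (size_t)w * sizeof(uint32_t));
    }
}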

If you have transparency effects, then this won't work -- in this case, you need to compose the image from primitive images, just as a graphics artist would create a composite image. You start by blitting the background image, then you blit all foreground images, one on top of the other, until every layer is displayed. 3D hardware acceleration takes most of this logic off of the CPU's hands -- the CPU still has to tell the GPU which images to render and where, however.
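
As a rough illustration of that back-to-front composition, here is a C sketch using the standard "over" operator on non-premultiplied ARGB8888 pixels; gamma and premultiplication are ignored, and the names are mine, not any particular API's.

Code:
#include <stdint.h>

/* Blend one non-premultiplied ARGB8888 foreground pixel over a background
   pixel ("over" operator).  The result is forced opaque, which is fine when
   compositing onto an opaque background. */
static uint32_t over(uint32_t fg, uint32_t bg)
{
    uint32_t fa  = (fg >> 24) & 0xFF;
    uint32_t inv = 255 - fa;
    uint32_t r = (((fg >> 16) & 0xFF) * fa + ((bg >> 16) & 0xFF) * inv) / 255;
    uint32_t g = (((fg >>  8) & 0xFF) * fa + ((bg >>  8) & 0xFF) * inv) / 255;
    uint32_t b = (( fg        & 0xFF) * fa + ( bg        & 0xFF) * inv) / 255;
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

/* Layering: start with the opaque background, then blend each foreground
   layer in back-to-front order. */
static void compose(uint32_t *screen, const uint32_t *layers[],
                    int nlayers, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        uint32_t px = layers[0][i];           /* background           */
        for (int l = 1; l < nlayers; l++)
            px = over(layers[l][i], px);      /* blend each layer     */
        screen[i] = px;
    }
}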

Quote:
I don't even know if there's still a separate floating-point co-processor in the same package with the main processor (not that the video calculations use FP very often)


The 80487SX was Intel's last discrete FPU. From the 80486DX onwards, all FPUs have been fully integrated onto the die of the CPU.

As for floating point calculations for video: yes, in fact, they're used all the time. All OpenGL procedures accept floating point coordinates and transformation matrices. Many procedures dealing with outline fonts also use floating point coordinate systems because of the possibility of anti-aliasing and subpixel rendering.
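
As a tiny, self-contained example of the kind of single-precision arithmetic involved, this C sketch transforms one homogeneous vertex by a 4x4 matrix (column-major storage, as OpenGL conventionally uses; the matrix and vertex values are made up for illustration).

Code:
#include <stdio.h>

/* Transform a homogeneous vertex by a 4x4 column-major matrix:
   out[row] = sum over col of m[col*4 + row] * v[col]. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    for (int row = 0; row < 4; row++)
        out[row] = m[row +  0] * v[0] + m[row +  4] * v[1] +
                   m[row +  8] * v[2] + m[row + 12] * v[3];
}

int main(void)
{
    /* Translation by (2, 3, 0): identity with the offset in the last column. */
    float m[16] = { 1,0,0,0,  0,1,0,0,  0,0,1,0,  2,3,0,1 };
    float v[4]  = { 1.0f, 1.0f, 0.0f, 1.0f };
    float out[4];
    mat4_mul_vec4(m, v, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  /* 3 4 0 1 */
    return 0;
}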

Quote:
For the streaming audio and video, I have no doubt that the audio and video chipsets do a lot of buffering, maybe even several seconds' worth


Most audio chipsets I'm familiar with buffer about half a second's worth of audio. Hardware-accelerated MPEG decoders do not buffer -- they decode and render video frames in real time.

Quote:
and I suspect these may even access the sets of samples by DMA; but the control of the timing is atrocious, as witnessed by the fact that the audio may lead or lag the video by a tenth of a second or even more, which is totally unacceptable for my kind of work, to put it mildly.


I simply must take exception to this -- you are ascribing to the hardware what is clearly a fault of the encoder software.

Unsynchronized audio and video during playback can come from one of two sources only -- either the video and audio fail to synchronize in the video file itself (meaning the problem lies with the encoder), or the problem lies with the player itself.

The hardware very clearly cannot be the problem, for nobody has ever complained of synchronization issues when playing video games. And, believe me, since all the GPU and DSP functionality in video cards is market-driven by the requirements of video games these days, video games clearly place a much larger stress on the video and audio interfaces than isochronously blasting a bitmap to the video card and a simple PCM stream to the audio card does.

Quote:
This is obviously a common problem as we frequently see it even on TV, from major stations that can afford all the processing power they could want.


Again, let me reiterate -- this has ZILCH to do with CPU power, nor with DMA performance, nor with buffering, nor with the DSPs or other hardware on video or audio cards.

ANY AND ALL synchronization issues are always a software problem.

Quote:
So what can the home computer builder apply from all this in order to improve performance without losing control of timing?


I'm not sure what your question is asking.


PostPosted: Fri Mar 06, 2009 8:25 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1041
Location: near Heidelberg, Germany
kc5tja wrote:
The hardware very clearly cannot be the problem, for nobody has ever complained of synchronization issues when playing video games. And, believe me, since all the GPU and DSP functionality in video cards is market-driven by the requirements of video games these days, video games clearly place a much larger stress on the video and audio interfaces than isochronously blasting a bitmap to the video card and a simple PCM stream to the audio card does.


Just a comment - a DVD player is connected to a TV, and with 100 Hz processing, de-interlacing, and the like, the TV introduces a delay of its own. Good DVD players, for example, have a setting that can delay the sound relative to the video, to accommodate the TV's video delay.

That situation is much less likely to occur with PC monitors, as the PC already does the de-interlacing etc. itself and can automatically compensate for the delay.

André


PostPosted: Fri Mar 06, 2009 3:11 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
What kind of TVs are you talking about? Certainly no NTSC or PAL TV I'm familiar with exhibits this problem, and I've seen no literature on this effect for HDTVs either.


PostPosted: Fri Mar 06, 2009 6:29 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1041
Location: near Heidelberg, Germany
kc5tja wrote:
What kind of TVs are you talking about? Certainly no NTSC or PAL TV I'm familiar with exhibits this problem, and I've seen no literature on this effect for HDTVs either.


I don't see this on my TV, but I've seen it on a home video projector for example.

See also http://en.wikipedia.org/wiki/Audio_video_sync


PostPosted: Fri Mar 06, 2009 6:41 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I honestly have to say that I've never found any A/V sync issues with DLPs -- they've always responded as fast as any other monitor, LCD or otherwise.

I have seen A/V sync issues on digital TV, but only under circumstances where there is a transient loss of signal, and even then, only for brief durations. This always seems to be accompanied by MPEG artifacting as well, suggesting that the loss of signal results in some audio data being treated as though it were video data.

I have never found this issue with video conferencing equipment. This equipment will readily drop video frames to ensure audio sync at all times. I've even had cases where I've lost video completely (just a static image) while audio continued to function in real-time.

On clean links, however, I've never experienced this phenomenon.

Where I see this happen the most, and with supine regularity, however, is with Flash animations. For example, I regularly visit http://www.homestarrunner.com and on slow computers with high resource utilization, there'll be a loss of A/V sync. However, I blame this on Flash itself, since as soon as you add more resources to the system, or terminate other CPU-bound programs, it goes away completely.


PostPosted: Fri Mar 06, 2009 7:21 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8520
Location: Southern California
The bad audio/video sync we see frequently on TV is just with our analog TV, meaning it got transmitted that way. It's rather irritating. We're not "digital-ready," and I hope I can get away with pulling the plug when the conversion takes place. I was kind of disappointed that they didn't go through with it on Feb 17.

I'll try to reply to the other material later.


PostPosted: Fri Mar 06, 2009 9:15 pm

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
GARTHWILSON wrote:
Quote:
have you looked at the WDC65C134S?

This is kind of hastily written, and still getting away from cell processors, but while we're talking about having multiple processors instead of doing everything with a single processor, I have to ask-- What goes on in PCs' video cards, sound cards, and graphics engines to take a big burden off the main processor?
...


I did a 14-year stint in the games/videogames industry (1987 - 2000), so I feel qualified to answer this question.

On most PCs and game consoles, the graphics memory is no longer in the CPU's main memory, and is instead directly controlled by the GPU. Only part of the graphics memory is visible on the display, and the rest is used for textures and double-buffering.

On this type of system, the CPU mostly issues commands to the GPU to load textures from main memory into graphics memory, and to copy textures between the nonvisible and visible parts of graphics memory. The GPU owns the dedicated graphics memory, so operations are relatively fast and asynchronous to the CPU. The GPU can also usually convert between different pixel formats (RGB444, RGB888, HSV, etc.).
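
As an illustration of one such conversion, here is a C sketch that expands an RGB444 pixel to RGB888 by nibble replication. The packing shown -- 0x0RGB in, 0x00RRGGBB out -- is an assumption for the example, not any particular card's layout.

Code:
#include <stdint.h>

/* Expand a 12-bit RGB444 pixel (0x0RGB) to 24-bit RGB888 (0x00RRGGBB).
   Replicating each nibble into the low bits maps 0xF to 0xFF, so full
   white stays full white.  A GPU can do this kind of format conversion
   in hardware during a texture upload or copy. */
static uint32_t rgb444_to_rgb888(uint16_t p)
{
    uint32_t r = (p >> 8) & 0xF;
    uint32_t g = (p >> 4) & 0xF;
    uint32_t b =  p       & 0xF;
    return (((r << 4) | r) << 16) | (((g << 4) | g) << 8) | ((b << 4) | b);
}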

Most 3D graphics cards can rasterize perspective-correct triangles with many options (bilinear/trilinear texture filtering, ambient/point-source lighting, etc.), and some have programmable pixel pipelines. Many of them also have floating-point math hardware and can perform 3D matrix operations on vertex data.

So, if you have a 3D game, the CPU spends most of its time doing higher-level things, such as calculating collisions, determining which objects are visible on-screen, handling player input, performing I/O, etc. and the GPU handles some or most of the load of displaying the 3D objects on the screen.

Sound cards tend to differ wildly. Some of them are strictly multi-channel PCM waveform playback, and some have some type of sound generation capability, such as a PSG or FM synthesis chip. Most console machines have a separate general-purpose sound CPU to manage the sound hardware.
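
For a feel of what "multi-channel PCM playback" means at its simplest, here is a rough C sketch that mixes several 16-bit channels with saturation; real hardware adds per-channel volume, panning, and sample-rate conversion, none of which is modelled here.

Code:
#include <stdint.h>

/* Mix several 16-bit mono PCM channels into one output buffer, clamping
   the sum to the 16-bit range so loud passages clip instead of wrapping. */
static void mix_pcm(int16_t *out, const int16_t *chan[], int nchan, int nsamples)
{
    for (int i = 0; i < nsamples; i++) {
        int32_t acc = 0;
        for (int c = 0; c < nchan; c++)
            acc += chan[c][i];
        if (acc >  32767) acc =  32767;   /* clamp to 16-bit range */
        if (acc < -32768) acc = -32768;
        out[i] = (int16_t)acc;
    }
}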

Toshi


PostPosted: Fri Apr 10, 2009 10:29 am

Joined: Thu Jul 26, 2007 4:46 pm
Posts: 105
/me wakes up this slightly dead topic :P

Regarding the memory system of the Cell, it's actually rather interesting. The PPE (the POWER core) has direct access to the system's bus, but the SPEs don't. Each SPE has its own local memory, from which all code and data are accessed, running at the SPE's native clock speed.

There are two ways to get data and instructions into an SPE: either something else writes them in from outside (for example, the PPE can write in the code, or another SPE can DMA the code over), or you use the SPE's own DMA unit.

This has some useful implications: the SPEs don't need caching logic, and they never stall on their own memory (unless something else is accessing it - though it's entirely possible the SPE's memory is multi-ported). The general mechanism is: DMA in a bunch of data, run the calculations over it, then DMA it back out - to another SPE, to main memory, or to the GPU.

This also means that SPEs use RAM bandwidth more efficiently - as they grab the data via DMA, they suffer less from DRAM latencies.
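
The streaming pattern described above looks roughly like the following C sketch. The dma_in/dma_out/dma_wait helpers are placeholders, not the real Cell SDK MFC calls, and real code would double-buffer so the next transfer overlaps with computation on the current block.

Code:
/* Placeholder DMA helpers for this sketch only (not the real SPE API). */
extern void dma_in(void *local, unsigned long remote, int bytes, int tag);
extern void dma_out(unsigned long remote, const void *local, int bytes, int tag);
extern void dma_wait(int tag);
extern void process(unsigned char *buf, int bytes);

#define BLOCK 4096

void stream(unsigned long src, unsigned long dst, int nblocks)
{
    static unsigned char buf[BLOCK];        /* lives in the SPE's local store   */
    for (int i = 0; i < nblocks; i++) {
        unsigned long off = (unsigned long)i * BLOCK;
        dma_in(buf, src + off, BLOCK, 0);   /* pull a block into local store    */
        dma_wait(0);                        /* no cache: just wait for the DMA  */
        process(buf, BLOCK);                /* compute at full local-store speed */
        dma_out(dst + off, buf, BLOCK, 0);  /* push the result back out         */
        dma_wait(0);                        /* make sure buf is free to reuse   */
    }
}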

As for GPU shader processors - their method of operation is really unusual. This is all based upon how nVIDIA GPUs work - gleaned mainly from their CUDA documentation.

To execute a shader, a bunch of shader processors are ganged up into a set. There's a finite number of sets that can be constructed, and all processors in a set execute in lockstep (I'm assuming that because of this the processors in the set all share one instruction decoder). Normally this means that the shader processors accept massive amounts of data from RAM, perform operations on it, then output it to be rendered to the screen.

The interesting part is when you have conditionals. GPUs hate conditionals. Whenever a set hits one, it executes one branch of the conditional first, then the other branch, then resynchronizes the two halves.
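
A toy model of that behaviour, written in plain C: every lane evaluates both sides of the conditional, and a per-lane mask selects which result is kept, so the cost is roughly the sum of both branches. The lane count and the arithmetic are purely illustrative.

Code:
#define LANES 32

/* Model of how a lockstep group handles "if (x > 0) y = a*2; else y = b+1;". */
void divergent_select(const float x[LANES], const float a[LANES],
                      const float b[LANES], float y[LANES])
{
    int   take_then[LANES];
    float then_val[LANES], else_val[LANES];

    for (int i = 0; i < LANES; i++) take_then[i] = (x[i] > 0.0f);  /* build mask  */
    for (int i = 0; i < LANES; i++) then_val[i]  = a[i] * 2.0f;    /* "then" side */
    for (int i = 0; i < LANES; i++) else_val[i]  = b[i] + 1.0f;    /* "else" side */
    for (int i = 0; i < LANES; i++)                                /* reconverge  */
        y[i] = take_then[i] ? then_val[i] : else_val[i];
}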

This is one of the reasons writing shaders is a pain. Another is, of course, that you are effectively writing a function to be called by the GPU - whereas writing the caller is normally much easier. And the final reason is that you have no inter-call state.


PostPosted: Fri Apr 10, 2009 6:37 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8520
Location: Southern California
Quote:
/me wakes up this slightly dead topic :P

Topics on this forum don't really go dead, but may experience long periods of inactivity since projects often take many years to go from concept to reality, for various reasons. Actually there's still something above I need to respond to, and have had a window open to it for weeks.

I need a few acronyms defined:
PPE
SPE
CUDA
I'll assume "GPU" is "graphics processor unit."


PostPosted: Fri Apr 10, 2009 9:50 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
GARTHWILSON wrote:
Quote:
/me wakes up this slightly dead topic :P

Topics on this forum don't really go dead, but may experience long periods of inactivity since projects often take many years to go from concept to reality, for various reasons. Actually there's still something above I need to respond to, and have had a window open to it for weeks.

I need a few acronyms defined:
PPE
SPE
CUDA
I'll assume "GPU" is "graphics processor unit."

A PPE is a 'Power Processor Element'. It's a standard PowerPC core capable of accessing a large address space and running the main operating system.

An SPE is a 'Synergistic Processing Element'. These are the slave processors within the Cell. They have only a small amount of RAM and rely on the PPE to use DMA transfers to provide them with data and to take away the results.

IBM has some Cell processor cards that can be fitted into BladeServers. I looked into whether they could be used for some processing-intensive tasks in financial markets, but the small amount of RAM (only 256K) in the SPEs is a limiting factor for many problems.

CUDA is the 'Compute Unified Device Architecture' developed by nVidia. They provide a C API so that number-crunching applications can get access to the GPU to perform vector and matrix operations. The latest version of SETI@home uses CUDA. On my 9600 GT it has reduced the processing time for a work unit from two and a half hours to four and a half minutes.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


PostPosted: Fri Apr 10, 2009 11:52 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Quote:
Topics on this forum don't really go dead, but may experience long periods of inactivity...


Thanks for qualifying that. I've been to other forums where they rip you for responding to old threads...


PostPosted: Wed Jun 22, 2016 8:19 am

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Hi!

I want to revive this old thread, as I am working on a multiplexed 6502/65C02 system (read: shared SRAM bus) in which I see a possibility of expanding to several 65C02 chips.

The unit is currently built on an expansion port card for an old VIC-20, with a buffer for the old 6502 and using BE for the 65C02, so that each can access the SRAM within the same clock cycle. By some miracle it works, but for the most efficient SRAM sharing I run the 6502's program from its own SRAM and only use the shared SRAM for storing data (e.g. accessed with LDA and STA).

For example, STA $2000,X takes 5 cycles, but only one of them is an access to the shared $2000 memory area. So if there were 10 65C02s accessing the shared SRAM at the same time, while each was running its program from its own SRAM, the system could run at 10 times the speed without conflict. The BE signal makes it easier to multiplex the access, giving each 65C02 half a cycle on the shared SRAM.

Since this thread has been sleeping for so long, I was wondering: has anyone tried using multiple 6502s and made a system out of them?


PostPosted: Wed Jun 22, 2016 11:45 pm

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 276
Location: Placerville, CA
Well, there's the AppleCrate II, although that's not a multiprocessor system so much as a networked cluster.


PostPosted: Thu Jun 23, 2016 12:10 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
kakemoms wrote:
Hi!

I want to revive this old thread, as I am working on a multiplexed 6502/65C02 system (read: shared SRAM bus) in which I see a possibility of expanding to several 65C02 chips.

The unit is currently built on an expansion port card for an old VIC-20, with a buffer for the old 6502 and using BE for the 65C02, so that each can access the SRAM within the same clock cycle. By some miracle it works, but for the most efficient SRAM sharing I run the 6502's program from its own SRAM and only use the shared SRAM for storing data (e.g. accessed with LDA and STA).

For example, STA $2000,X takes 5 cycles, but only one of them is an access to the shared $2000 memory area. So if there were 10 65C02s accessing the shared SRAM at the same time, while each was running its program from its own SRAM, the system could run at 10 times the speed without conflict. The BE signal makes it easier to multiplex the access, giving each 65C02 half a cycle on the shared SRAM.

Since this thread has been sleeping for so long, I was wondering: has anyone tried using multiple 6502s and made a system out of them?


This assumes that every operation the 6502 performs occupies 5 clock cycles, and so maintains phase relative to its peers. This, however, won't be the case. So, with surprising frequency in fact, your 6502s will conflict with each other while accessing shared memory.

What you're describing is called NUMA (Non-Uniform Memory Access), where CPUs have local memory resources and remote memory resources mapped into a common address space. It's really the only way to scale beyond the traditional SMP limits.

However, straight message passing would actually be faster than NUMA in the general case (particularly if all the links are point-to-point, or a reasonable approximation thereof), since you pass messages directly from one node to another without having to contend for a shared resource (or, if you do, it's software-mitigated, so event-driven code can forward data at adaptively scheduled intervals -- think R-ALOHA, for instance). You could use a simple FLIT-routed network, or, if you can afford the hardware, a point-to-point mesh network between your processing elements, and never have to worry about sluggishness or synchronization.
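
As a minimal sketch of the point-to-point message passing being suggested (and only that -- it models no particular interconnect, omits memory barriers, and uses made-up names), a single-producer/single-consumer mailbox in C might look like this.

Code:
#include <stdint.h>

#define SLOTS 16                       /* must be a power of two */

/* One mailbox per directed link between two processing elements.
   Only the sender writes head; only the receiver writes tail. */
struct mailbox {
    volatile uint32_t head;
    volatile uint32_t tail;
    uint32_t msg[SLOTS];
};

static int mbox_send(struct mailbox *m, uint32_t msg)
{
    if (m->head - m->tail == SLOTS) return 0;   /* full: caller retries later */
    m->msg[m->head % SLOTS] = msg;
    m->head++;                                  /* publish the new message    */
    return 1;
}

static int mbox_recv(struct mailbox *m, uint32_t *msg)
{
    if (m->head == m->tail) return 0;           /* empty                      */
    *msg = m->msg[m->tail % SLOTS];
    m->tail++;                                  /* free the slot              */
    return 1;
}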


PostPosted: Fri Jun 24, 2016 8:18 am

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
kc5tja, I don't think that's what's being described. In this example, the write in question isn't happening to some other remote node's own local memory; it's happening to a single globally shared memory buffer, which is exposed to all nodes in the exact same way. Because there's no routing involved, there's less overhead than message passing, and direct references stay intact, with the exact same memory performance no matter which node is hitting that memory (modulo contention).

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor

