In the industry where I work, for critical applications or for redundancy (or both) we run dual or triple MPU systems.
These take one of three forms depending on requirements, but all are run well within the manufacturer's specifications. There is no overclocking, as reliability is very important.
The non-safety-critical systems typically have two independent MPU cards (each with on-board EPROM and SRAM) and a watchdog system. Both cards run the same software and both run all the time, but only one is able to control and write to the bus to the rest of the system. If the in-use MPU fails to toggle its watchdog, the other MPU is automatically promoted to being the in-use MPU and the faulty MPU loses control. We can also switch between them manually.
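As a rough illustration of the changeover idea (not the real hardware logic), here is a minimal Python sketch. The class names, the tick-based timing and the timeout value are all my own assumptions for the example, as is the rule that a standby card is only promoted if its own watchdog is healthy.

```python
# Illustrative sketch only: all names and the timeout value are assumptions.
WATCHDOG_TIMEOUT = 3  # ticks allowed without a watchdog toggle (assumed value)


class MpuCard:
    def __init__(self, name):
        self.name = name
        self.ticks_since_toggle = 0

    def toggle_watchdog(self):
        # A healthy card calls this regularly from its main loop.
        self.ticks_since_toggle = 0

    def healthy(self):
        return self.ticks_since_toggle <= WATCHDOG_TIMEOUT


class Changeover:
    def __init__(self, card_a, card_b):
        self.cards = [card_a, card_b]
        self.in_use = 0  # index of the card currently allowed to drive the bus

    def tick(self):
        # Called once per time step by the (simulated) watchdog hardware.
        for card in self.cards:
            card.ticks_since_toggle += 1
        standby = 1 - self.in_use
        # Assumption: only promote the standby card if it is itself healthy.
        if not self.cards[self.in_use].healthy() and self.cards[standby].healthy():
            self.in_use = standby

    def switch(self):
        # Manual changeover.
        self.in_use = 1 - self.in_use

    def bus_owner(self):
        return self.cards[self.in_use].name
```

Both cards keep running the whole time in this model; the only thing that changes on a failure is which one is allowed to write to the bus.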
For the safety-critical systems, two MPU systems are housed in a single cased module. Both MPUs are normally online, processing all the data. If either of them disagrees with the other on the system outputs, a comparison system blows the internal supply fuse, taking both off-line and putting all the outputs into a safe state. Note that both MPUs use the same software.
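The compare-and-fail-safe behaviour can be sketched like this. It is only a model under my own assumptions: the class name, the idea of passing each MPU's logic in as a function, and the `safe_outputs` value are all made up for the example, and the real comparison is done in hardware.

```python
# Illustrative sketch only: names and structure are assumptions.
class DualChannel:
    def __init__(self, logic_a, logic_b, safe_outputs):
        self.logic_a = logic_a          # function computing MPU A's outputs
        self.logic_b = logic_b          # function computing MPU B's outputs
        self.safe_outputs = safe_outputs
        self.fuse_intact = True

    def cycle(self, inputs):
        if not self.fuse_intact:
            # Once the fuse has blown, the module stays in the safe state.
            return self.safe_outputs
        out_a = self.logic_a(inputs)
        out_b = self.logic_b(inputs)
        if out_a != out_b:
            # Any disagreement takes both MPUs off-line.
            self.fuse_intact = False
            return self.safe_outputs
        return out_a
```

The point of the design is that a disagreement is never resolved in favour of one MPU: with only two channels there is no way to tell which one is wrong, so the only safe answer is to shut down.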
For the safety-critical systems where redundancy is important, three separate modules are used, but all are interlinked. Each module (which fits in a 19-inch rack) contains a single MPU, RAM, control circuits, custom network interfaces and power circuitry. On the front there is a slot for the EPROM module. As in the system described above, a comparison system is used, but it is far more complex. The principle is that in the event of a fault, the two ‘good’ modules will blow the internal fuse of the ‘defective’ module and take it off-line. The two remaining modules will then continue to work, but as a dual configuration system only. Again, they all use the same software.
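A simplified model of the triple arrangement is below. Again, the names and structure are assumptions for illustration. In particular, what happens on a three-way disagreement is my own guess for the sketch (everything goes to the safe state); the real comparison system is far more involved than this.

```python
# Illustrative sketch only: names and the three-way-disagreement rule are assumptions.
class TripleSystem:
    def __init__(self, modules, safe_output):
        self.online = dict(modules)     # module name -> function computing outputs
        self.safe_output = safe_output

    def cycle(self, inputs):
        if len(self.online) < 2:
            return self.safe_output
        results = {name: fn(inputs) for name, fn in self.online.items()}
        values = list(results.values())
        if all(v == values[0] for v in values):
            return values[0]
        if len(results) == 3:
            for name, value in results.items():
                others = [v for n, v in results.items() if n != name]
                if others[0] == others[1] and value != others[0]:
                    # The two agreeing modules take the odd one out off-line.
                    del self.online[name]
                    return others[0]
        # Dual disagreement (or, by assumption, three-way disagreement):
        # nothing can be trusted, so everything goes off-line.
        self.online.clear()
        return self.safe_output
```

After the first fault the system carries on with two modules and behaves like the dual system: a second disagreement shuts it down.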
The software used in the safety-critical systems has been carefully designed. Because it processes data in a predefined, fixed and limited format, the same program can be used in many systems, even though the real world data is different. Each system therefore contains EPROMs holding a table that the program uses to process the real world data. The data in the table determines which logic is applied in response to the real world inputs, in order to decide what (if any) response will be output back to the real world.
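To show what I mean by the same program driven by different tables, here is a small sketch. The table format, the operation names and the input/output names are entirely hypothetical; the real table format is different, but the idea is the same.

```python
# Illustrative sketch only: the table format and all names are hypothetical.
# Each row is (output name, operation, list of input names).
OPERATIONS = {"AND": all, "OR": any}


def evaluate(table, inputs):
    # The program never changes; only the table differs between installations.
    return {
        output: OPERATIONS[op](inputs[name] for name in names)
        for output, op, names in table
    }


SITE_A_TABLE = [
    ("pump_enable", "AND", ["level_ok", "door_closed"]),
    ("alarm", "OR", ["overheat", "overpressure"]),
]

SITE_B_TABLE = [
    ("valve_open", "AND", ["pressure_ok", "operator_request"]),
]
```

Calling `evaluate(SITE_A_TABLE, inputs)` and `evaluate(SITE_B_TABLE, inputs)` runs exactly the same code on two different installations. Because the program is fixed, only the table needs checking for each new site.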
This approach allows simulation systems, running on far more capable hardware, to be used to test the logic of the data in the tables, with all ‘real world’ inputs and outputs also being simulated.
The downside is that these various systems are not particularly fast.
Mark