Abstract

System diagnosis is concerned with locating the malfunctioning subsystems within a digital system. Interest in this topic is motivated by the need for highly available systems that can continue essential operations when failures occur. Many real-time applications, such as air traffic control, manned spaceflight, hospital patient monitoring, and on-line process control, place severe demands on system availability. These demands are satisfied by maximizing the system reliability. The traditional approach to achieving reliable computing systems has been based largely on fault avoidance (or fault intolerance). This approach requires the acquisition of the most reliable components, the use of thoroughly refined techniques for the interconnection of components and the assembly of subsystems, and comprehensive testing to eliminate design faults. However, occasional system failures are accepted as a necessary evil, and manual maintenance is provided for their correction. There are several situations in which the fault avoidance approach does not suffice. These include situations where the frequency and duration of repairs are unacceptable. An alternative to fault avoidance is fault tolerance. In this approach the causes of unreliability are expected to be present and to induce errors, but their disrupting effects are automatically counteracted. One reason for using this approach is to achieve a reliability or availability that cannot be attained by the fault avoidance approach. A second reason may be to attain a reliability that matches that of the fault avoidance approach, but at a lower overall cost of implementation. A third reason is the psychological support given to users, who know that provisions have been made to handle faults automatically.
The techniques for achieving fault tolerance comprise strategies for error detection, fault treatment, damage assessment, and error recovery. Fault treatment is essential to avoid the fault during further operation of the system. It can be accomplished in two ways. One method is to provide standby spares that can be switched in to replace faulty elements. The other method is to design into the system a capability for graceful degradation. In this scheme, rather than replacing a faulty element, the system is reconfigured to continue operation at reduced capacity without that element. Regardless of the fault treatment mechanism, the fault must be located to within a component of a size that is acceptable for the treatment mechanism. This is system-level fault diagnosis. System-level fault diagnosis is also of interest in cases where manual repair is performed, since initial diagnosis to the level of large replaceable modules can reduce system downtime.