Real-Time Systems: Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition.
How long does it take to restart a component after a failure? Focus on the fast recovery from any kind of a single fault – a single Byzantine fault [Dri03]. The zero-fault case takes care of itself, and the case of two or more independent Byzantine faults is expensive, unlikely to occur, and unlikely to succeed. How complex is the recovery?
6. Are the normal operating functions and the safety functions implemented in different components, such that they are in different FCUs?
7. How stable is the message interface with respect to anticipated change requirements? What is the probability and impact of changes of a component on the rest of the cluster?
Energy and Power. Energy consumption is a critical non-functional parameter of a mobile device. Power control helps to reduce the silicon die temperature and consequently the failure rate of devices:
1. What is the energy budget of each component?
2. What is the peak power dissipation? How will peak power affect the temperature and the reliability of the device?
3. Do different components of an FCU have different power sources to reduce the possibility of common-mode failures induced by the power supply? Is there a possibility of a common-mode failure via the grounding system (e.g., lightning stroke)? Are the FCUs of an FTU electrically isolated?
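To make the temperature argument concrete (an illustrative rule-of-thumb model added here, not a figure from the text): semiconductor failure rates grow roughly exponentially with junction temperature, often approximated by an Arrhenius-type relation in which every 10°C rise in die temperature roughly doubles the failure rate:

$$\lambda(T) \;\approx\; \lambda(T_0)\cdot 2^{(T-T_0)/10\,^{\circ}\mathrm{C}}$$

Under this approximation, a power-management scheme that lowers the die temperature from, say, 85°C to 65°C reduces the expected failure rate by a factor of about four.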
Physical Characteristics. There are many possibilities to introduce common-mode failures by a careless physical installation. The following list of questions should help to check for these:
1. Are the mechanical interfaces of the replaceable units specified, and do these mechanical boundaries of replaceable units coincide with the diagnostic boundaries?
2. Are the FCUs of an FTU (see Sect. 6.4.2) mounted at different physical locations, such that spatial proximity faults (e.g., a common-mode external fault such as water, EMI, or mechanical damage in the case of an accident) will not destroy more than one FCU?
3. What are the cabling requirements? What are the consequences of transient faults caused by EMI via the cabling or by bad contacts?
4. What are the environmental conditions (temperature, shock, and dust) of the component? Are they in agreement with the component specifications?

11.4 Design of Safety-Critical Systems

The economic and technological success of embedded systems in many applications leads to an increased deployment of computer systems in domains where a computer failure can have severe consequences. A computer system becomes safety-critical (or hard real-time) when a failure of the computer system can have catastrophic consequences, such as the loss of life, extensive property damage, or disastrous damage to the environment.

Example: Some examples of safety-critical embedded systems are: a flight-control system in an airplane, an electronic-stability program in an automobile, a train-control system, a nuclear reactor control system, medical devices such as heart pacemakers, the control of the electric power grid, or a control system of a robot that interacts with humans.

11.4.1 What Is Safety?

Safety can be defined as the probability that a system will survive a given time-span without the occurrence of a critical failure mode that can lead to catastrophic consequences.
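Expressed formally (a standard reliability-theory reading of this definition, added here for illustration): if critical failures of the system occur at a constant rate $\lambda_c$, then

$$S(t) \;=\; \Pr\{\text{no critical failure in } [0,t]\} \;=\; e^{-\lambda_c t}, \qquad \mathrm{MTTF} = 1/\lambda_c .$$

The 10^9 h requirement quoted below is thus equivalent to demanding $\lambda_c \le 10^{-9}$ critical failures per hour.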
In the literature [Lal94] the magical number of 10^9 h, i.e., about 115,000 years, is the MTTF (mean time to failure) that is associated with safety-critical operations. Since the hardware reliability of a VLSI component is less than 10^9 h, a safety-aware design must be based on hardware-fault masking by redundancy. It is impossible to achieve confidence in the correctness of the design to the level of the required MTTF in safety-critical applications by testing only – extensive testing can establish confidence in an MTTF on the order of 10^4 to 10^5 h [Lit93]. A formal reliability model must be developed in order to establish the required level of safety, considering the experimental failure rates of the subsystems and the redundant structure of the system.
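As a sketch of such a reliability model (an illustrative calculation with assumed numbers, not figures from the text): consider a triple-modular-redundant (TMR) configuration of three FCUs with independent, exponentially distributed failures and a perfect voter. If each component survives a mission of duration $t$ with probability $R(t) = e^{-\lambda t}$, the TMR system survives as long as at least two components survive:

$$R_{TMR}(t) \;=\; 3R(t)^2 - 2R(t)^3, \qquad 1 - R_{TMR}(t) \;\approx\; 3(\lambda t)^2 \ \text{ for } \lambda t \ll 1 .$$

With an assumed component failure rate $\lambda = 10^{-5}$/h and a mission time $t = 10$ h, the probability of system failure per mission is about $3\cdot(10^{-4})^2 = 3\cdot 10^{-8}$, i.e., roughly $3\cdot 10^{-9}$ per hour – in the region of the required $10^{-9}$/h, which a single component with $\lambda = 10^{-5}$/h could never reach.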
Mixed-Criticality Architectures. Safety is a system property – the overall system design determines which subsystems are safety-relevant and which subsystems can fail without any serious consequences on the remaining safety margin. In the past, many safety-critical functions have been implemented on dedicated hardware, physically separated from the rest of the system. Under these circumstances, it is relatively easy to convince a certification authority that any unintended interference of safety-critical and non-safety-critical system functions is barred by design. However, as the number of interacting safety-critical functions grows, a sharing of communication and computational resources becomes inevitable.
This results in a need for mixed-criticality architectures, where applications of different criticality can coexist in a single integrated architecture and any unintended interference among these different-criticality applications, both in the value domain and in the temporal domain, must be excluded by architectural mechanisms. If mixed-criticality partitions are established by software on a single CPU, the partitioning system software, e.g., a hypervisor, is assigned the highest criticality level of any application software module that is executed on this system.
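To illustrate temporal partitioning, here is a minimal sketch assuming a hypothetical static dispatch table in the style of ARINC 653 time partitioning; all names, slice lengths, and the timer interface are invented for illustration and are not from the text. The hypervisor cyclically grants the CPU to each partition for a pre-assigned time slice, so a misbehaving low-criticality partition cannot steal processing time from a safety-critical one.

```c
#include <stdint.h>

/* Hypothetical static time-partition table: each partition owns a
 * fixed slice of the major frame, enforced by a hardware timer.
 * The table is configured off-line and never changed at run time. */
typedef struct {
    const char *name;         /* partition identifier                */
    uint32_t    slice_ms;     /* guaranteed CPU time per major frame */
    void      (*entry)(void); /* partition entry point               */
} partition_t;

static void flight_control(void) { /* highest-criticality work */ }
static void cabin_comfort(void)  { /* low-criticality work     */ }

static const partition_t schedule[] = {
    { "flight_control", 8, flight_control },
    { "cabin_comfort",  2, cabin_comfort  },
};

/* Stub: on real hardware this would arm a timer interrupt that
 * preempts the running partition unconditionally when its slice
 * expires, regardless of what the partition is doing. */
static void start_slice_timer(uint32_t ms) { (void)ms; }

/* Invoked by the hypervisor once per major frame (10 ms here). */
void dispatch_major_frame(void)
{
    for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++) {
        start_slice_timer(schedule[i].slice_ms); /* temporal firewall */
        schedule[i].entry();                     /* run the partition */
    }
}
```

Because the schedule is static, the temporal behavior of each partition can be certified in isolation; the hypervisor itself, as noted above, must be developed to the highest criticality level present in the table.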
Fail-Safe Versus Fail-Operational. In Sect. 1.5.2 a fail-safe system has been defined as a system where the application can be put into a safe state in case of a failure. At present, the majority of industrial systems that are safety-relevant fall into this category.

Example: In most scenarios, a robot is in a safe state when it ceases to move. A robot control system is safe if it either produces correct results (both in the domain of value and time) or no results at all, i.e., the robot comes to a standstill. The safety requirement of a robot control system is thus a high error-detection coverage (see Sect. 6.1.2).

In a number of applications, there exists a basic mechanical or hydraulic control system that keeps the application in a safe state in case of a failure of the computer control system that optimizes the performance.
In this case it is sufficient if the computer system is guaranteed to fail cleanly (see Sect. 6.1.3), i.e., to inhibit its outputs when a failure is detected.

Example: The ABS system in a car optimizes the braking action, depending on the surface condition of the road. If the ABS system fails cleanly, the conventional hydraulic brake system is still available to bring the car to a safe stop.
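A minimal sketch of this fail-clean (fail-silent) pattern, with all names and checks invented for illustration: the result of a duplicated computation reaches the actuator only if the error-detection checks pass, so a detected failure silences the component instead of letting it emit a wrong value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical duplicated computation of the control value: running
 * the control law on two channels and comparing the results is one
 * simple error-detection mechanism (assumed here for illustration). */
static int32_t control_law_channel_a(int32_t sensor) { return sensor / 2; }
static int32_t control_law_channel_b(int32_t sensor) { return sensor / 2; }

static void actuator_write(int32_t v) { printf("output %d\n", v); }
static void actuator_disable(void)    { printf("outputs inhibited\n"); }

/* Fail-silent output stage: a result reaches the actuator only if the
 * two channels agree and the value is plausible; otherwise the component
 * inhibits its outputs and the mechanical/hydraulic backup takes over. */
void output_result(int32_t sensor)
{
    int32_t a = control_law_channel_a(sensor);
    int32_t b = control_law_channel_b(sensor);
    bool plausible = (a >= -1000 && a <= 1000);  /* range check */

    if (a == b && plausible)
        actuator_write(a);   /* correct in value and time */
    else
        actuator_disable();  /* fail cleanly: no output at all */
}
```

The safety of this pattern rests entirely on the error-detection coverage of the checks (see Sect. 6.1.2): an undetected value error defeats the output gate.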
There exist safety-relevant embedded applications where the physical system requires continuous computer control in order to maintain a safe state. A total loss of computer control may cause a catastrophic failure of the physical system. In such an application, which we call fail-operational, the computer must continue to provide an acceptable level of service if failures occur within the computer system.

Example: In a modern airplane, there is no mechanical or hydraulic backup to the computer-based flight-control system. Therefore the flight-control system must be fail-operational.

Fail-operational systems require the implementation of active redundancy (as discussed in Sect. 6.4) to mask component failures; a minimal voter sketch is given after the following list.

In the future, it is expected that the number of fail-operational systems will increase for the following reasons:
1. The cost of providing two subsystems based on different technologies – a basic mechanical or hydraulic backup subsystem for basic safety functions and an elaborate computer-based control system to optimize the process – will become prohibitive.
The aerospace industry has already demonstrated that it is possible to provide fault-tolerant computer systems that meet challenging safety requirements.
2. If the difference between the functional capabilities of the computer-based control system and the basic mechanical safety system increases further, and the computer system is available most of the time, then the operator may no longer have any experience in controlling the process safely with the basic mechanical safety system.
3. In some advanced processes, computer-based non-linear control strategies are essential for the safe operation of the process.
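As referenced above, here is a minimal sketch of active redundancy: a bit-exact two-out-of-three majority voter over the outputs of three replica-determinate FCUs (the interface is an assumption for illustration, not the book's API). The voter masks an arbitrary failure of any single replica.

```c
#include <stdbool.h>
#include <stdint.h>

/* Triple-modular redundancy: exact two-out-of-three majority voting
 * over the outputs of three replica-determinate components. */
bool tmr_vote(int32_t a, int32_t b, int32_t c, int32_t *out)
{
    if (a == b || a == c) { *out = a; return true; }
    if (b == c)           { *out = b; return true; }
    return false;  /* no majority: more than one replica has failed */
}
```

If tmr_vote returns false, more than one replica has failed and the system must fall back to a safe shutdown or an emergency strategy; under the independence assumption, this corresponds to the rare two-fault case estimated in the reliability sketch of Sect. 11.4.1 above.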