Real-Time Systems: Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition (811374), page 49
Inexact voting must be used if the replica determinism of the replicated components cannot be guaranteed. The selection of an appropriate interval for an inexact voter is a delicate task: if the interval is too large, erroneous values will be accepted as correct; if the interval is too small, correct values will be rejected as erroneous. Irrespective of the criterion defined to determine the sameness of two results, there seem to be difficulties.

Example: Lala [Lal94] reports about the experiences with inexact voting in the Air Force’s F-16 fly-by-wire control system that uses four loosely synchronized redundant computational channels: The consensus at the outputs of these channels caused considerable headaches during the development program in setting appropriate comparison thresholds in order to avoid nuisance false alarms and yet not miss any real faults.

Byzantine Resilient Fault-Tolerant Unit.
If no assumption about the failure mode of an FCU can be made and no fault-tolerant global time base is available, four components are needed to form a fault-tolerant unit (FTU) that can tolerate a single Byzantine (or malicious) fault. These four components must execute a Byzantine-resilient agreement protocol to agree on a malicious failure of a single component. Theoretical studies [Pea80] have shown that these Byzantine agreement protocols have the following requirements to tolerate the Byzantine failures of k components:

1. An FTU must consist of at least 3k + 1 components.
2. Each component must be connected to all other components of the FTU by k + 1 disjoint communication paths.
3. To detect the malicious components, k + 1 rounds of communication must be executed among the components. A round of communication requires every component to send a message to all the other components.

An example of an architecture that tolerates Byzantine failures of the components is given in Hopkins et al. [Hop78].

6.4.3 The Membership Service

The failure of an FTU must be reported in a consistent manner to all operating FTUs with a low latency. This is the task of the membership service. A point in real-time when the membership of a component can be established is called a membership point of the component. A small temporal delay between the membership point of a component and the instant when all other components of the ensemble are informed in a consistent manner about the current membership is critical for the correct operation of many safety-relevant applications.

Fig. 6.8 Example of an intelligent ABS in a car (labels in the figure: four ABS nodes and one Brake node)

The consistent activation of a never-give-up (NGU) strategy in case the fault hypothesis is violated is another important function of the membership service.

Example: Consider an intelligent antilock braking system (ABS) in a car with a node of a distributed computer system placed at each wheel.
A distributed algorithm in each of the four nodes, one at each wheel, calculates the brake-force distribution to the wheels (Fig. 6.8), depending on the position of the brake pedal actuated by the driver. If a wheel node fails or the communication to a wheel computer is lost, the hydraulic brake-force actuator at this wheel autonomously transits to a defined state, e.g., one in which the wheel is free running. If the other nodes learn about the computer failure at this wheel within a short latency, e.g., a single control loop cycle of about 2 ms, then the brake force can be redistributed to the three functioning wheels, and the car can still be controlled. If, however, the loss of a node is not recognized with such a low latency, then the brake-force distribution to the wheels, based on the assumption that all four wheel computers are operational, is wrong and the car will go out of control.

ET Architecture.
In an ET architecture, messages are sent only when a significant event happens at a component. Silence of a component in an ET architecture means that either no significant event has occurred at the component, or a fail-silent failure has occurred (the loss of communication or the fail-silent shutdown of the component). Even if the communication system is assumed to be perfectly reliable, it is not possible in an ET architecture to distinguish the absence of activity at a component from a silent component failure. An additional time-triggered service, e.g., a periodic watchdog service (see Sect. 9.7.4), must be implemented in an ET architecture to solve the membership problem.

TT Architecture. In a TT architecture, the periodic message-send times are the membership points of the sender. Let us assume that a failed component remains out-of-service for an interval with duration greater than the maximum time interval between two membership points. Every receiver knows a priori when a message of a sender is supposed to arrive and interprets the arrival of the message as a life sign at the membership point of the sender [Kop91].
It is then possible to conclude, from the arrival of the expected messages at two consecutive membership points, that the component was alive during the complete interval delimited by these two membership points (there is a tacit assumption that a transiently failed node does not recover within this interval). The membership of the FTUs in a cluster at any point in time can thus be established with a delay of one round of information exchange.
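The life-sign reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a protocol from the text: the slot schedule, the microsecond time unit, the tolerance window, and the names `membership_after_round` and `alive_during_interval` are all assumptions made for the example.

```python
from typing import Dict, Set

def membership_after_round(schedule: Dict[str, int],
                           arrivals: Dict[str, int],
                           tolerance_us: int = 10) -> Set[str]:
    """Return the senders that gave a life sign in one TT round.

    schedule:     a priori known membership point of each sender
                  (microseconds after the start of the round; assumed layout).
    arrivals:     observed arrival instant of each sender's message in this
                  round; a sender without an entry stayed silent.
    tolerance_us: assumed acceptance window around the membership point.
    """
    alive = set()
    for sender, expected in schedule.items():
        observed = arrivals.get(sender)
        # A message inside the window around the expected instant is a life sign.
        if observed is not None and abs(observed - expected) <= tolerance_us:
            alive.add(sender)
    return alive

def alive_during_interval(previous_round: Set[str],
                          current_round: Set[str]) -> Set[str]:
    """Senders with life signs at two consecutive membership points.

    Under the tacit assumption stated in the text (a transiently failed node
    does not recover within the interval), these senders were alive for the
    complete interval between the two points.
    """
    return previous_round & current_round

# Hypothetical 2 ms round with four wheel nodes sending in fixed slots.
schedule = {"FL": 0, "FR": 500, "RL": 1000, "RR": 1500}
round_1 = membership_after_round(schedule, {"FL": 0, "FR": 500, "RL": 1000, "RR": 1500})
round_2 = membership_after_round(schedule, {"FL": 2, "FR": 498, "RR": 1500})  # RL is silent
print(sorted(alive_during_interval(round_1, round_2)))
```

Because every receiver evaluates the same a priori schedule, the silence of a node is detected no later than one round after its last membership point, which is where the one-round delay comes from.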
Because the delay of one round of information exchange is known a priori in a TT architecture, it is possible to derive an a priori bound for the temporal accuracy of the membership service.

6.5 Robustness

6.5.1 The Concept of Robustness

In the domain of embedded systems, we consider a system to be robust if the severity of the consequences of a fault is inversely proportional to the probability of fault occurrence, i.e., faults that are expected to occur frequently should have only a minor effect on the quality of service of the system.
Irrespective of the concrete type and source of a fault, a robust embedded system will try to recover from the effects of a fault as quickly as possible in order to minimize the impact of the fault on the user. As noted above in Sect. 6.1, the immediate consequence of a fault is an error, i.e., an unintended state. If we detect and correct the error before it has had a serious effect on the quality of service, we have increased the robustness of the system. Design for robustness is not concerned with finding the detailed cause of a failure (this is the task of the diagnostic subsystem) but rather with the fast restoration of the normal system service after a fault has occurred.

The inherent periodicity of many real-time control systems and multimedia systems helps in the design for robustness.
Due to the constrained physical power of most actuators, a single incorrect output in a control cycle will, in most cases, not result in an abrupt change of a physical set point. If we can detect and correct the error within the next control cycle, the effect of the fault on the control application will be small. Similar arguments hold for multimedia systems. If a single frame contains some incorrect pixels, or even if a complete frame is lost, but the next frame in sequence is correct again, then the impact of a fault on the quality of the multimedia experience is limited.

6.5.2 Structure of a Robust System

A robust system consists of at least two subsystems (Fig. 6.9) implemented as independent FCUs: one operational component that performs the planned operations and controls the physical environment, and a second monitoring component that checks whether the results and the g-state of the operational component are in agreement with the intentions of the user [Tai03].

Fig. 6.9 Structure of a robust system

In a periodic application such as a control application, every control cycle starts with reading the g-state and the input data, then the control algorithm is calculated, and finally the new set points and the new g-state are produced (see Fig. 3.9). A transient fault in one control cycle can only propagate to the next control cycle if the g-state has been contaminated by the fault.
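To make the role of the g-state concrete, the following sketch models one control cycle as a pure function of the g-state and the inputs. The integrating controller, its gains, and the names `GState` and `control_cycle` are assumptions chosen for illustration; the text does not prescribe any particular control algorithm.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class GState:
    """State carried from one control cycle to the next (assumed: one integrator)."""
    integral: float

def control_cycle(g_state: GState, setpoint: float, measured: float) -> Tuple[float, GState]:
    """One cycle: read g-state and inputs, compute, emit set point and new g-state."""
    error = setpoint - measured
    new_integral = g_state.integral + error
    output = 0.5 * error + 0.1 * new_integral  # assumed PI gains
    return output, GState(new_integral)

g = GState(0.0)
_, g = control_cycle(g, setpoint=10.0, measured=8.0)
_, g = control_cycle(g, setpoint=10.0, measured=9.0)

# Case 1: a transient fault corrupted only the output of the previous cycle.
# The g-state is intact, so the next cycle computes a normal output.
out_clean, _ = control_cycle(g, setpoint=10.0, measured=9.5)

# Case 2: the transient fault contaminated the g-state itself.
# The error is carried into the next cycle and distorts its output.
out_contaminated, _ = control_cycle(GState(1e6), setpoint=10.0, measured=9.5)

print(out_clean, out_contaminated)
```

In the first case the wrong output lasts for one cycle only; in the second case the contaminated integrator keeps distorting every later output until the g-state is repaired.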
In a robust system, the operational component must externalize its g-state in every control cycle such that the monitoring component can check the plausibility of the g-state and perform a corrective action in case a severe anomaly has been detected in the g-state. The corrective action can consist of resetting the operational component and restarting it with a repaired g-state.

In a safety-critical application, this two-channel approach (one channel produces a result and the other channel, the safety monitor, monitors whether the result is plausible) is absolutely essential. Even if the software has been proven correct, it cannot be assumed that there will be no transient faults during the execution of the hardware. The IEC 61508 standard on functional safety requires such a two-channel approach, one channel for the normal function and another independent channel to ensure the functional safety of a control system (see also Sect. 11.4).

In a fail-safe application, the safety monitor has no other authority than to bring the application to the safe state. A fail-silent failure of the safety monitor will result in a loss of the safety monitoring function, while a non-fail-silent failure of the safety monitor will cause a reduction of the availability but will not impact the safety.

In a fail-operational application, a non-fail-silent failure of the safety monitor has an impact on the safety of the application. Therefore the safety monitor itself must be fault-tolerant or at least self-checking in order to eliminate non-fail-silent failures.

6.6 Component Reintegration

Most computer system faults are transient, i.e., they occur sporadically for a very short interval, corrupt the state, but do not permanently damage the hardware. If the service of the system can be reestablished quickly after a transient fault has occurred, then in most cases the user will not be seriously affected by the consequences of the fault.
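The quick restoration of service after a transient fault can be illustrated with the two-channel structure of Sect. 6.5.2: an operational component externalizes its g-state in every cycle, and an independent monitor checks its plausibility and restarts the component with a repaired g-state on a severe anomaly. The sketch below is a rough illustration under stated assumptions: the tank-level example, the plausibility bounds, the repair policy (fall back to the last plausible g-state), and all class and method names are hypothetical.

```python
from typing import Optional

class OperationalComponent:
    """Performs the planned operation; its g-state is a single level estimate."""

    def __init__(self, level: float) -> None:
        self.level = level

    def cycle(self, inflow: float) -> float:
        """Run one control cycle and externalize the new g-state."""
        self.level += inflow
        return self.level

    def restart(self, repaired_level: float) -> None:
        """Reset the component and restart it with a repaired g-state."""
        self.level = repaired_level


class Monitor:
    """Independent channel that checks the plausibility of the g-state."""

    def __init__(self, low: float, high: float) -> None:
        self.low, self.high = low, high          # assumed plausibility bounds
        self.last_plausible: Optional[float] = None

    def check(self, component: OperationalComponent, g_state: float) -> bool:
        """Return True if a corrective action was taken in this cycle."""
        if self.low <= g_state <= self.high:
            self.last_plausible = g_state
            return False
        # Assumed repair policy: fall back to the last plausible g-state.
        repaired = self.last_plausible if self.last_plausible is not None else self.low
        component.restart(repaired)
        return True


component = OperationalComponent(level=10.0)
monitor = Monitor(low=0.0, high=100.0)

corrected = monitor.check(component, component.cycle(inflow=1.0))      # normal cycle
component.level = -5000.0                                              # simulated transient fault
corrected_after_fault = monitor.check(component, component.cycle(1.0))
print(corrected, corrected_after_fault, component.level)
```

The monitor here has no authority over the set points themselves; it only repairs the g-state, which keeps the effect of the transient fault confined to a single cycle.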