[Fig. 6.4: Grey zone between intended and erroneous states]

… specific contexts in order to be able to detect anomalies more effectively. In a real-time control system that exhibits periodic behavior, the analysis of the time series of real-time data is a very effective technique for anomaly detection. An excellent survey of anomaly detection techniques is contained in Chandola et al. [Cha09].
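As an illustration of such a time-series check, the following sketch flags a sample of a periodic signal as anomalous when it deviates from a sliding-window mean by more than a fixed multiple of the observed standard deviation. The window size, threshold, and signal values are assumptions chosen for the example, not taken from the book.

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

#define WINDOW    16       /* number of past samples kept (assumed) */
#define THRESHOLD 3.0      /* deviation limit in standard deviations (assumed) */

/* Returns true if the new sample deviates from the sliding-window mean
   by more than THRESHOLD standard deviations. */
static bool is_anomalous(const double *window, int n, double sample)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        mean += window[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        var += (window[i] - mean) * (window[i] - mean);
    double stddev = sqrt(var / n);
    return fabs(sample - mean) > THRESHOLD * stddev;
}

int main(void)
{
    double window[WINDOW];
    for (int i = 0; i < WINDOW; i++)
        window[i] = 10.0 + 0.1 * (i % 4);   /* fill the window with normal samples */

    /* Periodic signal with one injected outlier (25.0). */
    double samples[] = { 10.1, 10.2, 10.0, 10.3, 25.0, 10.1 };
    for (int i = 0; i < 6; i++) {
        if (is_anomalous(window, WINDOW, samples[i]))
            printf("anomaly detected: sample %.1f\n", samples[i]);
        /* shift the window and append the new sample */
        for (int j = 0; j < WINDOW - 1; j++)
            window[j] = window[j + 1];
        window[WINDOW - 1] = samples[i];
    }
    return 0;
}
```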
The anomaly detection subsystem should be separated from the subsystem that performs the operational functions for the following reasons:

- The anomaly detection subsystem should be implemented as an independent fault-containment unit, such that a failure in the anomaly detection subsystem will have no direct effect on the operational subsystem and vice versa.
- Anomaly detection is a well-defined task that must be performed independently from the operational subsystem. Two different engineering groups should work on the operational subsystem and the anomaly detection subsystem in order to avoid common mode effects.

The multi-cast message primitive, introduced in Sect. 4.1.1, provides a means to make the g-state of a component accessible to an independent anomaly-detection subsystem without inducing a probe effect. The anomaly detection subsystem classifies the observed anomalies on a severity scale and reports them either to an off-line diagnostic system or to an on-line integrity monitor. The integrity monitor can take immediate corrective action in case the observed anomaly points to a safety-relevant incident.

Example: It is an anomaly if a car keeps accelerating while the brake pedal is being pressed. In such a situation, an on-line integrity monitor should autonomously discontinue the acceleration.
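A minimal sketch of such a plausibility check inside an on-line integrity monitor might look as follows; the signal names, the threshold, and the corrective action are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Observed vehicle state, delivered periodically to the integrity monitor
   (field names are hypothetical). */
struct vehicle_state {
    bool   brake_pedal_pressed;
    double acceleration_mps2;   /* longitudinal acceleration in m/s^2 */
};

/* Plausibility check: braking and sustained positive acceleration
   must not occur at the same time. */
static bool acceleration_anomaly(const struct vehicle_state *s)
{
    return s->brake_pedal_pressed && s->acceleration_mps2 > 0.5;
}

/* Hypothetical corrective action taken by the integrity monitor. */
static void cut_throttle(void)
{
    printf("integrity monitor: throttle request overridden\n");
}

int main(void)
{
    struct vehicle_state s = { .brake_pedal_pressed = true,
                               .acceleration_mps2   = 1.2 };
    if (acceleration_anomaly(&s))
        cut_throttle();   /* autonomously discontinue the acceleration */
    return 0;
}
```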
All detected anomalies should be documented in an anomaly database for further on-line or off-line analysis. The depth of investigation into an anomaly depends on the severity of the anomaly: the more severe the anomaly, the more information about its occurrence should be recorded. The off-line analysis of the anomaly database can expose valuable information about weak spots of a system that can be corrected in a future version.

In a safety-critical system, every single observed anomaly must be scrutinized in detail until the final cause of the anomaly has been unambiguously identified.

6.3.2 Failure Detection

A failure can only be detected if the observed behavior of a component can be judged in relation to the intended behavior. Failure detection within a system is only possible if the system contains some form of redundant information about the intended behavior. The coverage of the failure detector, i.e., the probability that a failure will be detected if it is present, increases as the information about the intended behavior becomes more detailed. In the extreme case, where every failure in the behavior of a component must be detected, a second component that provides the basis for the comparison – a golden reference component – is needed, i.e., the redundancy is 100%.

Knowledge about the regularity in the activity pattern of a computation can be used to detect temporal failures. If it is a priori known that a result message must arrive every second, the non-arrival of such a message can be detected within 1 s. If it is known that the result message must arrive exactly at every full second, and a global time is available at the receiver, then the failure-detection latency is given by the precision of the clock synchronization. Systems that tolerate jitter have a longer failure-detection latency than systems without jitter. The extra time gained from an earlier failure detection can be significant for initiating a mitigation action in a safety-critical real-time system.
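A temporal failure detector of this kind could be sketched as follows, assuming a synchronized global time in microseconds; the period, the precision value, and the function names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PERIOD_US    1000000ULL  /* result message expected every full second */
#define PRECISION_US 100ULL      /* precision of the clock synchronization (assumed) */

/* Decide, at global time 'now', whether the result message that was due at
   'deadline' has failed to arrive. 'arrived' is the receiver's arrival flag.
   The decision can only be made once the deadline plus the synchronization
   precision has passed, which bounds the failure-detection latency. */
static bool temporal_failure(uint64_t now, uint64_t deadline, bool arrived)
{
    if (now < deadline + PRECISION_US)
        return false;            /* too early to judge */
    return !arrived;
}

int main(void)
{
    uint64_t deadline = 5 * PERIOD_US;   /* message due at the fifth full second */

    /* Message missing: detected as soon as the precision window has elapsed. */
    if (temporal_failure(deadline + PRECISION_US, deadline, false))
        printf("temporal failure detected %llu us after the deadline\n",
               (unsigned long long)PRECISION_US);
    return 0;
}
```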
In real-time systems, the worst-case execution time (WCET, see Sect. 10.2) of all real-time tasks must be known in advance in order to find a viable schedule for the task execution. This WCET can be used by the operating system to monitor the execution time of a task. If a task has not terminated before its WCET expires, a temporal failure of the task has been detected.
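The following sketch indicates how an operating system could supervise the execution-time budget of a task; the task structure and the time values are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Per-task bookkeeping kept by the operating system (fields are hypothetical). */
struct task {
    const char *name;
    uint64_t    wcet_us;       /* worst-case execution time from the WCET analysis */
    uint64_t    start_us;      /* time at which the current activation started */
    bool        terminated;    /* set by the task when it completes */
};

/* Invoked periodically by the OS: if a task is still running after its WCET
   has expired, a temporal failure of that task has been detected. */
static bool wcet_overrun(const struct task *t, uint64_t now_us)
{
    return !t->terminated && (now_us - t->start_us) > t->wcet_us;
}

int main(void)
{
    struct task control = { "control_loop", 2000, 10000, false };

    uint64_t now = 12500;    /* 2500 us after activation, budget was 2000 us */
    if (wcet_overrun(&control, now))
        printf("temporal failure: task %s exceeded its WCET\n", control.name);
    return 0;
}
```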
6.3.3 Error Detection

As mentioned before, an error is an incorrect data structure, e.g., an incorrect state or an incorrect program. We can only detect an error if we have some redundant information about the intended properties of the data structure under investigation. This information can be part of the data structure itself, such as a CRC field, or it can come from some other source, such as a priori knowledge expressed in the form of assertions or a golden channel that provides a result that acts as a golden reference data structure.

Syntactic knowledge about the code space. The code space is subdivided into two partitions, one partition encompassing the syntactically correct values, the other containing the detectably erroneous code words. This a priori knowledge about the syntactic structure of valid code words can be used for error detection. One plus the maximum number of bit errors that can be detected in a codeword is called the Hamming distance of the code. Examples of the use of error-detecting codes are parity bits in memory, CRC polynomials in data transmission, and check digits at the man-machine interface. Such codes are very effective in detecting the corruption of a value.

Example: Consider the scenario where each symbol of an alphabet of 128 symbols is encoded using a single byte. Because only seven bits (2^7 = 128) are needed to encode a symbol, the eighth bit can be used as a parity bit to distinguish a valid codeword from an invalid codeword among the 256 code words in the code space. This code has a Hamming distance of two.
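A sketch of this parity encoding in C, assuming an even-parity convention and placing the parity bit in the most significant bit position (both choices are assumptions made for the example):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Encode a 7-bit symbol (0..127) into one byte by placing an even-parity
   bit into the most significant bit position. */
static uint8_t encode(uint8_t symbol)
{
    uint8_t parity = 0;
    for (int i = 0; i < 7; i++)
        parity ^= (symbol >> i) & 1u;       /* XOR of the seven data bits */
    return (uint8_t)((parity << 7) | (symbol & 0x7Fu));
}

/* A codeword is valid if the XOR over all eight bits is zero (even parity).
   Any single-bit corruption is detected: the code has Hamming distance 2. */
static bool is_valid(uint8_t codeword)
{
    uint8_t parity = 0;
    for (int i = 0; i < 8; i++)
        parity ^= (codeword >> i) & 1u;
    return parity == 0;
}

int main(void)
{
    uint8_t cw = encode('A');               /* 'A' = 65, a 7-bit symbol */
    printf("codeword valid: %d\n", is_valid(cw));          /* 1 */
    printf("after bit flip: %d\n", is_valid(cw ^ 0x04u));  /* 0: error detected */
    return 0;
}
```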
Duplicate channels. If two independent deterministic channels calculate two results using the same input data, we can compare the results to detect a failure but cannot decide which one of the two channels is wrong. Fault-injection experiments [Arl03] have shown that the duplicate execution of application tasks at different times is an effective technique for the detection of transient hardware faults. This technique can be applied to increase the failure-detection coverage, even if it cannot be guaranteed that all task instances can be completed twice in the available time interval.

There are many different possible combinations of hardware, software, and time redundancy that can be used to detect different types of failures by performing the computations twice. Of course, both computations must be replica determinate; otherwise, many more discrepancies are detected between the redundant channels than those that are actually caused by faults. The problems in implementing replica-determinate fault-tolerant software have already been discussed in Sect. 5.6.
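A sketch of time-redundant duplicate execution with a subsequent comparison; the task function and its inputs are hypothetical placeholders.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A replica-determinate application task: the same input must always yield
   exactly the same output (the function itself is hypothetical). */
static int32_t control_task(int32_t setpoint, int32_t measurement)
{
    return setpoint - measurement;           /* simple control-error computation */
}

/* Execute the task twice at different times and compare the results.
   A mismatch indicates a (most likely transient) fault in one execution,
   but it cannot tell which of the two results is the correct one. */
static bool duplicate_execution_ok(int32_t setpoint, int32_t measurement,
                                   int32_t *result)
{
    int32_t first  = control_task(setpoint, measurement);
    int32_t second = control_task(setpoint, measurement);  /* second, later execution */
    *result = first;
    return first == second;
}

int main(void)
{
    int32_t out;
    if (!duplicate_execution_ok(100, 97, &out))
        printf("discrepancy detected: discard the result and signal a fault\n");
    else
        printf("results agree: %d\n", out);
    return 0;
}
```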
Golden reference. If one of the channels acts as a golden reference that is considered correct by definition, we can determine whether the result produced by the other channel is correct or faulty. Alternatively, we need three channels with majority voting to identify the single faulty channel, under the assumption that all three channels are synchronized.
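A minimal two-out-of-three voter over the results of three synchronized channels might look like the following sketch; the channel values are assumed for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Two-out-of-three majority voter over the results of three synchronized
   channels. Returns 0 and writes the majority value if at least two channels
   agree; returns -1 if all three results differ. */
static int vote(int32_t a, int32_t b, int32_t c, int32_t *majority)
{
    if (a == b || a == c) { *majority = a; return 0; }
    if (b == c)           { *majority = b; return 0; }
    return -1;    /* no majority: more than one channel has failed */
}

int main(void)
{
    int32_t result;
    /* Channel b delivers a faulty value (17); the two-out-of-three vote masks it. */
    if (vote(42, 17, 42, &result) == 0)
        printf("majority result: %d\n", result);
    return 0;
}
```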
Example: David Cummings reports about his experience with error detection in the software for NASA's Mars Pathfinder spacecraft [Cum10]:

Because of Pathfinder's high reliability requirements and the probability of unpredictable hardware errors due to the increased radiation effects in space, we adopted a highly “defensive” programming style. This included performing extensive error checks in the software to detect the possible side effects of radiation-induced hardware glitches and certain software bugs. One member of our team, Steve Stolper, had a simple arithmetic computation in his software that was guaranteed to produce an even result (2, 4, 6 and so on) if the computer was working correctly. Many programmers would not bother to check the result of such a simple computation. Stolper, however, put in an explicit test to see if the result was even. We referred to this test as his “two-plus-two-equals-five check.” We never expected to see it fail. Lo and behold, during software testing we saw Stolper's error message indicating the check had failed. We saw it just once. We were never able to reproduce the failure, despite repeated attempts over many thousands if not millions of iterations. We scratched our heads. How could this happen, especially in the benign environment of our software test lab, where radiation effects were virtually nonexistent? We looked carefully at Stolper's code, and it was sound.
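A defensive check of this kind might look like the sketch below; the arithmetic computation and the error handling shown here are hypothetical stand-ins, not Stolper's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical arithmetic computation that, on a correctly working computer,
   is guaranteed to produce an even result. */
static long compute_even_value(long x)
{
    return 2 * (x * x + 3 * x + 1);
}

int main(void)
{
    long result = compute_even_value(17);

    /* "Two-plus-two-equals-five check": the result must be even; an odd
       result points to a hardware glitch or a software bug. */
    if (result % 2 != 0) {
        fprintf(stderr, "defensive check failed: unexpected odd result %ld\n", result);
        exit(EXIT_FAILURE);
    }
    printf("result: %ld\n", result);
    return 0;
}
```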
What can we learn from this example? We should never build a safety-critical system that relies on the results of a single channel only.

6.4 Fault Tolerance

The design of any fault-tolerant system starts with the precise specification of a fault hypothesis. The fault hypothesis states what types of faults must be tolerated by the fault-tolerant system and divides the fault space into two domains: the domain of normal faults (i.e., the faults that must be tolerated) and the domain of rare faults, i.e., faults that are outside the fault hypothesis and are assumed to be rare events.