Real-Time Systems: Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition (811374), page 42
5.9, considering that the nodes are connected by an AFDX protocol with the temporal parameters of Table 7.2.

Chapter 6
Dependability

Overview It is said that Nobel Laureate Hannes Alfvén once remarked that in Technology Paradise no acts of God can be permitted and everything happens according to the blueprints. The real world is no technology paradise: components can fail and blueprints (software) can contain design errors. This is the subject of this chapter. The chapter introduces the notions of fault, error, and failure and discusses the important concept of a fault-containment unit. It then proceeds to investigate the topic of security and argues that a security breach can compromise the safety of a safety-critical embedded system.
The direct connection of many embedded systems to the Internet, known as the Internet of Things (IoT), makes it possible for a distant attacker to search for vulnerabilities and, if the intrusion is successful, to exercise remote control over the physical environment. Security is thus becoming a prime concern in the design of embedded systems that are connected to the Internet. The following section deals with the topic of anomaly detection. An anomaly is an out-of-norm behavior that indicates that some exceptional scenario is evolving. Anomaly detection can help to detect the early effects of a random failure or the activities of an intruder who tries to exploit system vulnerabilities. Whereas an anomaly lies in the grey zone between correct behavior and failure, an error is an incorrect state that requires immediate action to mitigate the consequences of the error.
Error detection is based on knowledge about the intended state or behavior of a system. This knowledge can stem either from a priori established regularity constraints and known properties of the correct behavior of a computation, or from the comparison of the results that have been computed by two redundant channels. Different techniques for the detection of temporal failures and value errors are discussed. The following two sections deal with the design of fault-tolerant systems that are capable of masking faults that are contained in the given fault hypothesis.
The most important fault-tolerance strategy is triple modular redundancy (TMR), which requires a deterministic behavior of replicated components and a deterministic communication infrastructure. Robustness, which is discussed next, is a system property that tries to provide an acceptable level of service despite unforeseen perturbations.

H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, Real-Time Systems Series, DOI 10.1007/978-1-4419-8237-7_6, © Springer Science+Business Media, LLC 2011

[Fig. 6.1 Faults, errors, and failures. Labels: subsystem under consideration; FAULT: cause of error (and failure); ERROR: unintended state; FAILURE: deviation of actual service from intended service.]

6.1 Basic Concepts

The seminal paper by Avizienis et al. [Avi04] establishes the fundamental concepts in the field of dependable computing. The core concepts of this paper are: fault, error, and failure (Fig. 6.1).

Computer systems are provided to offer a dependable service to system users. A user can be a human user or another computer system. Whenever the behavior of a system (see Sect. 4.1.1), as seen by the user of the system, deviates from the intended service, the system is said to have failed. A failure can be pinned down to an unintended state within the system, which is called an error. An error is caused by an adverse phenomenon, which is called a fault.

We use the term intended to state the correct state or behavior of a system. Ideally, this correct state or behavior is documented in a precise and complete specification.
However, sometimes the specification itself is wrong or incomplete. In order to include specification errors in our model, we introduce the word intended to establish an abstract reference for correctness.

If we relate the terms fault, error, and failure to the levels of the four-universe model (Sect. 2.3.1), then the term fault refers to an adverse phenomenon at any level of the model, while the terms error and failure are reserved for adverse phenomena at the digital logic level, the informational level, or the external level. If we assume that a sparse global time base is available, then any adverse phenomenon at the digital logic level and above can be identified by a specific bit pattern in the value domain and by an instant of occurrence on the sparse global time base.
This cannot be done for phenomena occurring at the physical level.

6.1.1 Faults

We assume that a system is built out of components. A component is a fault-containment unit (FCU) if the direct effect of a single fault influences only the operation of a single component. Multiple FCUs should fail independently. Figure 6.2 depicts a classification of faults.

[Fig. 6.2 Classification of faults. Space: internal (physical (hardware), design (software)) or external (physical (environment), input data). Time: transient or permanent.]

Fault Space. It is important to distinguish between faults that are related to a deficiency internal to the FCU and faults that are related to some adverse phenomenon occurring external to the FCU. An internal fault of a component, i.e., a fault within the FCU, can be a physical fault, such as the random break of a wire, or a design fault either in the software (a program error) or in the hardware (an erratum).
An external fault can be a physical disturbance, e.g., a lightning stroke causing spikes in the power supply or the impact of a cosmic particle. The provision of incorrect input data is another class of external fault. Fault containment refers to design and engineering efforts that ensure that the immediate consequences of a fault are limited to a single FCU. Many reliability models make the tacit assumption that FCUs fail independently, i.e., there is no single fault that can affect more than one FCU. This FCU independence assumption must be justified by the design of the system.

Example: The physical separation of the FCUs of a fault-tolerant system reduces the probability of spatial proximity faults, such that a fault at a single location (e.g., impact in case of an accident) cannot destroy more than a single FCU.

Fault Time. In the temporal domain a fault can be transient or permanent.
Whereas physical faults can be transient or permanent, design faults (e.g., software errors) are always permanent.

A transient fault appears for a short interval at the end of which it disappears without requiring any explicit repair action.
A transient fault can lead to an error, i.e., the corruption of the state of an FCU, but leaves the physical hardware undamaged (by definition). We call a transient external physical fault a transitory fault. An example of a transitory fault is the impact of a cosmic particle that corrupts the state of an FCU. We call a transient internal physical fault an intermittent fault. Examples of intermittent faults are oxide defects, corrosion, or other fault mechanisms that have not yet developed to a stage where the hardware fails permanently (refer to Table 8.1).
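The fault classification described so far (space, cause, and time, plus the names transitory and intermittent) can be written down as a small data type. The following is only a minimal sketch: the names `Space`, `Cause`, `Time`, `Fault`, and `fault_name` are hypothetical and chosen for illustration, and the rule that design faults are internal and input-data faults are external is read off the classification in Fig. 6.2.

```python
from dataclasses import dataclass
from enum import Enum


class Space(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class Cause(Enum):
    PHYSICAL = "physical"
    DESIGN = "design"
    INPUT_DATA = "input data"


class Time(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Fault:
    """One fault, classified in the space and time domains (hypothetical type)."""
    space: Space
    cause: Cause
    time: Time

    def __post_init__(self):
        # Design faults are always permanent, as stated in the text.
        if self.cause is Cause.DESIGN and self.time is not Time.PERMANENT:
            raise ValueError("design faults are always permanent")
        # Assumption taken from Fig. 6.2: design faults are internal to the FCU,
        # incorrect input data is external to the FCU.
        if self.cause is Cause.DESIGN and self.space is not Space.INTERNAL:
            raise ValueError("design faults are internal to the FCU")
        if self.cause is Cause.INPUT_DATA and self.space is not Space.EXTERNAL:
            raise ValueError("input-data faults are external to the FCU")


def fault_name(fault: Fault) -> str:
    """Return the special name of a transient physical fault, else its time class."""
    if fault.cause is Cause.PHYSICAL and fault.time is Time.TRANSIENT:
        if fault.space is Space.EXTERNAL:
            return "transitory"    # e.g., impact of a cosmic particle
        return "intermittent"      # e.g., oxide defect, corrosion
    return fault.time.value


cosmic_particle = Fault(Space.EXTERNAL, Cause.PHYSICAL, Time.TRANSIENT)
oxide_defect = Fault(Space.INTERNAL, Cause.PHYSICAL, Time.TRANSIENT)
print(fault_name(cosmic_particle), fault_name(oxide_defect))
```

Rejecting a transient design fault at construction time keeps the constraint from the text in one place instead of relying on every caller to remember it.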
According to Constantinescu [Con02], a substantial number of the transient faults observed in the field are intermittent faults. Whereas the failure rate of transitory faults is constant, the failure rate for intermittent faults increases as a function of time. An increasing intermittent failure rate of an electronic hardware component is an indication of the wear-out of the component.
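The contrast between a constant and an increasing failure rate can be illustrated numerically. The sketch below uses the Weibull hazard function, which is an assumption made here for illustration and is not a model given in the text: with shape parameter 1 the rate is constant (as for transitory faults), and with a shape parameter above 1 the rate grows with time (as for intermittent faults in a component that is wearing out). The function name and all parameter values are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull model at time t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Illustrative parameters only (assumed values, not measured data).
SCALE_HOURS = 1000.0

for t in (100.0, 200.0, 400.0):
    # shape = 1: constant rate, independent of t
    transitory = weibull_hazard(t, shape=1.0, scale=SCALE_HOURS)
    # shape = 2: rate grows linearly with t
    intermittent = weibull_hazard(t, shape=2.0, scale=SCALE_HOURS)
    print(f"t={t:5.0f} h  transitory={transitory:.4f}/h  intermittent={intermittent:.4f}/h")
```

In practice the trend would have to be estimated from observed fault counts over time. The closed-form model here only shows what "constant" and "increasing" mean for the two classes.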
Such an increase in the intermittent failure rate suggests that preventive maintenance, i.e., the replacement of the faulty component, should be performed in order to avoid a permanent fault of the component.

A permanent fault is a fault that remains in the system until an explicit repair action has taken place that removes the fault. An example of a permanent external fault is a lasting breakdown of the power supply.
A permanent internal fault can be in the physical embodiment of the hardware (e.g., a break of an internal wire) or in the design of the software or hardware. The mean time it takes to repair a system after the occurrence of a permanent fault is called MTTR (mean time to repair).

6.1.2 Errors

The immediate consequence of a fault is an incorrect state in a component. We call such an incorrect state, i.e., a wrong data element in the memory, a register, or in a flip-flop circuit of a CPU, an error.
As time progresses, an error is activated by a computation, detected by some error detection mechanism, or wiped out.

An error is activated if a computation accesses the error. From this instant onwards, the computation itself becomes incorrect. If a fault impacts the contents of a memory cell or a register, the consequent error will be activated when this memory cell is accessed by a computation. There can be a long time interval between error occurrence and error activation (the dormancy of an error) if a memory cell is involved. If a fault impacts the circuitry of the CPU, an immediate activation of the fault may occur and the current computation will be corrupted. As soon as an incorrect computation writes data into the memory, this part of memory becomes erroneous as well.

We distinguish between two types of software errors, called Bohrbugs and Heisenbugs [Gra85].
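The life cycle of an error in a memory cell described above (dormant after its occurrence, then either activated by a computation or wiped out by an overwrite) can be made concrete with a toy model. This is a sketch under stated assumptions: the class `ToyMemory`, its methods, and the integer tick counter standing in for time are all hypothetical, and error detection is left out.

```python
class ToyMemory:
    """Toy memory that records which cells hold an error (hypothetical model)."""

    def __init__(self, size: int):
        self.cells = [0] * size
        self.now = 0            # current instant, in abstract ticks
        self.error_since = {}   # address -> instant at which the error occurred

    def advance(self, ticks: int) -> None:
        self.now += ticks

    def inject_fault(self, addr: int, bit: int = 0) -> None:
        # A fault flips one bit; the cell now holds an error, which stays dormant.
        self.cells[addr] ^= 1 << bit
        self.error_since[addr] = self.now

    def write(self, addr: int, value: int) -> None:
        # Overwriting a cell with fresh, correct data wipes out a dormant error.
        self.cells[addr] = value
        self.error_since.pop(addr, None)

    def increment_into(self, src: int, dst: int):
        """Computation dst = src + 1. Returns the dormancy if an error was activated."""
        dormancy = None
        if src in self.error_since:
            # The computation accesses the error: the error is activated.
            dormancy = self.now - self.error_since[src]
        self.cells[dst] = self.cells[src] + 1
        if dormancy is not None:
            # The incorrect result makes the destination cell erroneous as well.
            self.error_since[dst] = self.now
        else:
            self.error_since.pop(dst, None)
        return dormancy


mem = ToyMemory(4)
mem.write(0, 10)
mem.inject_fault(0)                    # error occurs at tick 0
mem.advance(5)                         # error stays dormant for 5 ticks
dormancy = mem.increment_into(0, 1)    # activation: cell 1 becomes erroneous too
print("dormancy:", dormancy, "erroneous cells:", sorted(mem.error_since))
```

A cell that is overwritten before it is read never influences a computation, which is the "wiped out" case in the text.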