Real-Time Systems. Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition (811374), page 50
In many embedded applications, the fast reintegration of a failed component is thus of paramount importance and must be supported by proper architectural mechanisms.

6.6.1 Finding a Reintegration Point

While a failure can occur at an arbitrary moment outside the control of the system designer, the system designer can plan the proper point of reintegration of a repaired component. The key issue during the reintegration of a component in a real-time system is to find a future point in time when the state of the component is in synchrony with the component’s environment, i.e., the other components of the cluster and the physical plant. Because real-time data are invalidated by the passage of time, rolling back to a past checkpoint can be futile: it is possible and probable that the progression of time has already invalidated the checkpoint information (see also Table 4.1).

Reintegration is simplified if the state that must be reloaded into the reintegrating component is of small size and fits into a single message. Since the size of the state has a relative minimum immediately after the completion of an atomic operation, this is an ideal instant for the reintegration of a component. In Sect. 4.2.3 we have introduced the notion of the g-state (ground state) to refer to the state at the reintegration instant. In cyclic systems – many embedded control and multimedia systems are cyclic – an ideal reintegration instant of a component is at the beginning of a new cycle. The temporal distance between two consecutive reintegration instants, the reintegration cycle, is then identical to the duration of the control cycle. If the g-state is empty at the reintegration instant, then the reintegration of a repaired component is trivial at this moment. In many situations, however, there is no instant during the lifetime of a component when its g-state is completely empty.

6.6.2 Minimizing the Ground-State

After a cyclic reintegration instant has been established, the g-state at this selected instant must be analyzed and minimized to simplify the reintegration procedure.

In a first phase, all system data structures within the component must be investigated to locate any hidden state. In particular, all variables that must be initialized must be identified and the state of all semaphores and operating system queues at the reintegration instant must be checked.
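As a small illustration of the cyclic reintegration instant from Sect. 6.6.1, the following Python sketch computes the start of the next control cycle after a given point in time. It assumes that time is available as an integer count of microseconds and that cycles start at a known offset; the function and parameter names (`next_reintegration_instant`, `now_us`, `cycle_us`, `first_cycle_start_us`) are hypothetical and do not come from the text.

```python
def next_reintegration_instant(now_us: int, cycle_us: int,
                               first_cycle_start_us: int = 0) -> int:
    """Return the start of the first control cycle strictly after now_us.

    Assumption (not from the text): cycles start at
    first_cycle_start_us + k * cycle_us for integer k.
    """
    if cycle_us <= 0:
        raise ValueError("cycle duration must be positive")
    elapsed = now_us - first_cycle_start_us
    # Floor division gives the index of the cycle that contains now_us.
    current_cycle = elapsed // cycle_us
    return first_cycle_start_us + (current_cycle + 1) * cycle_us


# A repaired component that becomes ready 12.5 ms into a 10 ms control
# cycle would wait for the cycle boundary at 20 ms.
print(next_reintegration_instant(12_500, 10_000))
```

With this convention, the distance between two consecutive results is always `cycle_us`, which mirrors the statement that the reintegration cycle equals the duration of the control cycle.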
It is good programming practice to output the g-state of a task in a special output message when a task with g-state is detected, and to re-read the g-state of the task when the task is reactivated. This identifies the g-state and makes it possible to pack all g-states of all tasks of a component into a g-state message particular to this component.

In a second phase, the identified g-state must be analyzed and minimized. Figure 6.10 displays a suggested division of the g-state information into three parts:

1. The first part of the g-state consists of input data that can be retrieved from the instrumentation in the environment. If the instrumentation is state-based and sends the absolute values of the RT entities (state messages) rather than their relative values (event messages), a complete scan of all the sensors in the environment can establish a set of current images in the reintegrating component and thus resynchronize the component with the external world.
2. The second part of the g-state consists of output data that are in the control of the computer and can be enforced on the environment. We call the set of the output data a restart vector. In a number of applications, a restart vector can be defined at development time. Whenever a component must be reintegrated, this restart vector is enforced on the environment to achieve agreement with the outside world. If different process modes require different restart vectors, a set of restart vectors can be defined at development time, one for each mode.
3. The third part of the g-state contains g-state data that do not fall into category (1) or category (2). This part of the g-state must be recovered from some component-external source: from a replicated component of a fault-tolerant system, from the monitoring component, or from the operator. In some situations, a redesign of the process instrumentation may be considered to transform g-state of category (3) into g-state of category (1).

Fig. 6.10 Partitioning of the g-state

Example: When a traffic control system is restarted, it is possible to enforce a restart vector on the traffic lights that sets all cross-road lights first to yellow, and then to red, and finally turns the main street lights to green.
This is a relatively simple way to achieve synchronization between the external world and the computer system. The alternative, which involves the reconstruction of the current state of all traffic lights from some log file that recorded the output commands up to the point of failure, would be more complicated.

In a system with replicated components in an FTU, the g-state data that cannot be retrieved directly from the environment must be communicated from one component of the FTU to the other components of the FTU by means of a g-state message. In a TT system, sending such a g-state message should be part of the standard component cycle.

6.6.3 Component Restart

The restart of a component after a failure has been detected by a monitoring component (Fig. 6.9) can proceed as follows:

(1) The monitoring component sends a trusted reset message to the TII interface of the operational component to enforce a hardware reset.
(2) After the reset, the operational component performs a self-test and verifies the correctness of its core image (the job) by checking the provided signatures in the core image data structures. If the core image is erroneous, a copy of the static core image must be reloaded from stable storage.
(3) The operational component scans all sensors and waits for a cluster cycle to acquire all available current information about its environment. After an analysis of this information, the operational component decides the mode of the controlled object, and selects the restart vector that must be enforced on the environment.
(4) Finally, after the operational component has received the g-state information that is relevant at the next reintegration instant from the monitoring component, the operational component starts its tasks in synchrony with the rest of the cluster and its physical environment.

Depending on the hardware performance and the characteristics of the real-time operating system, the time interval between the arrival of the reset message and the arrival of the g-state information message can be significantly longer than the duration of a reintegration cycle. In this case, the monitoring component must perform a far-reaching state estimation to establish a relevant g-state at the proper reintegration point.

Points to Remember

• A fault is the adjudged cause of an error or failure.
• An error is that part of the state of a system that deviates from the intended (correct) state.
• A failure is an event that denotes a deviation of the actual service from the intended service, occurring at a particular point in real time.
• The failure rate for permanent failures of an industrial-quality chip is in a range between 10 and 100 FITS. The failure rate for transient failures is orders of magnitude higher.
• Information security deals with the authenticity, integrity, confidentiality, privacy and availability of information and services that are provided by computer systems. The main security concerns in embedded systems are the authenticity and integrity of data.
• A vulnerability is a deficiency in the design or operation of a computer system that can lead to a security incident. We call the successful exploitation of a vulnerability an intrusion.
• The typical attacker proceeds according to the following three phases: access to the selected subsystem, search for and discovery of a vulnerability, and finally intrusion and control of the selected subsystem.
• It is widely acknowledged in security research and practice that many security incidents are caused by human rather than technical failures.
• The basic cryptographic primitives that must be supported in any security architecture are symmetric key encryption, public key encryption, hash functions, and random number generation.
• An anomaly is a system state that lies in the grey zone between correct and erroneous. The detection of anomalies is important, since the occurrence of an anomaly is an indication that some atypical scenario that may require immediate corrective action is developing (e.g., the intrusion by an adversary).
• In a safety-critical system, every single observed anomaly must be scrutinized in detail until the final cause of the anomaly has been unambiguously identified.
• Failure detection within a system is only possible if the system contains some form of redundant information about the intended behavior.
• The fault hypothesis states what types of faults must be tolerated by a fault-tolerant system and divides the fault-space into two domains, the domain of normal faults (i.e., the faults that must be tolerated) and the domain of rare faults, i.e., faults that are outside the fault hypothesis and are assumed to be rare events.
• A rare fault will bring the system into a state that is outside the specified fault hypothesis and therefore will not be covered by the provided fault-tolerance mechanisms.
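Returning to the restart procedure of Sect. 6.6.3 and the three-part g-state of Sect. 6.6.2, the following Python sketch shows one way steps (2) to (4) might be arranged in software. It is an illustration under stated assumptions, not the book's implementation: the hardware reset of step (1), the self-test, and the wait for a cluster cycle are left out, and every name (`GState`, `restart`, the callables passed in, the mode name `"day"`) is hypothetical. The restart vector is modeled as an ordered list of commands so that the traffic-light example (cross-road lights to yellow, then red, then main street lights to green) can be expressed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Command = Tuple[str, str]  # (actuator name, commanded value), hypothetical format


@dataclass
class GState:
    """G-state split into the three categories of Sect. 6.6.2."""
    current_images: Dict[str, float] = field(default_factory=dict)  # (1) re-read from sensors
    restart_vector: List[Command] = field(default_factory=list)     # (2) enforced on environment
    external: Dict[str, int] = field(default_factory=dict)          # (3) from an external source


def restart(
    core_image_ok: Callable[[], bool],
    reload_core_image: Callable[[], None],
    scan_sensors: Callable[[], Dict[str, float]],
    decide_mode: Callable[[Dict[str, float]], str],
    restart_vectors: Dict[str, List[Command]],
    enforce: Callable[[Command], None],
    receive_gstate_message: Callable[[], Dict[str, int]],
) -> GState:
    """Sketch of steps (2)-(4); assumes the hardware reset has already happened."""
    # Step (2): check the core image; reload it from stable storage if erroneous.
    if not core_image_ok():
        reload_core_image()

    # Step (3): scan the sensors, decide the mode, select and enforce
    # the restart vector defined for that mode at development time.
    images = scan_sensors()
    mode = decide_mode(images)
    vector = restart_vectors[mode]
    for command in vector:
        enforce(command)

    # Step (4): obtain the remaining (category 3) g-state, e.g., from the
    # monitoring component, before the tasks are started.
    external = receive_gstate_message()
    return GState(images, list(vector), external)


# Usage with stand-in callables for the traffic-light example.
enforced_commands: List[Command] = []
vectors = {"day": [("cross_road", "yellow"), ("cross_road", "red"),
                   ("main_street", "green")]}
state = restart(
    core_image_ok=lambda: True,
    reload_core_image=lambda: None,
    scan_sensors=lambda: {"loop_detector": 1.0},
    decide_mode=lambda images: "day",
    restart_vectors=vectors,
    enforce=enforced_commands.append,
    receive_gstate_message=lambda: {"phase_counter": 7},
)
print(enforced_commands)
```

Passing the platform-specific operations in as callables keeps the sketch independent of any particular operating system or communication interface; in a real component these would be bound to the actual sensor, actuator, and message interfaces.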