Real-Time Systems: Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition.
While the problems of random physical hardware faults can be solved by applying redundancy (see Sect. 6.4), no generally accepted procedure to deal with the problem of design (software) errors has emerged. The techniques that have been developed for handling hardware faults are not directly applicable to the field of software, because there is no physical process that causes the aging of the software. Software errors are design errors that have their root in the unmanaged complexity of a design. In [Boe01] the most common software errors are analyzed. Because many hardware functions of a complex VLSI chip are implemented in microcode that is stored in a ROM, the possibility of a design error in the hardware must be considered in a safety-critical system.

The issue of a single design error that is replicated in the software of all nodes of a distributed system warrants further consideration. It is conceivable that an FTU built from nodes based on the same hardware and using the same system software exhibits common-mode failures caused by design errors in the software or in the hardware (micro-programs).

11.5.1 Diverse Software Versions

The three major strategies to attack the problem of unreliable software are the following:

1. To improve the understandability of a software system by introducing a structure of conceptual integrity and by simplifying programming paradigms. This is, by far, the most important strategy and has been supported throughout this book.
2. To apply formal methods in the software development process so that the specification can be expressed in a rigorous form. It is then possible to verify formally – within the limits of today's technology – the consistency between a high-level specification expressed in a formal specification language and the implementation.
3. To design and implement diverse versions of the software such that a safe level of service can be provided even in the presence of design faults.

In our opinion, these three strategies are not contradictory, but complementary. An understandable and well-structured software system is a prerequisite for the application of either of the other two techniques, i.e., program verification and software diversity.
In safety-critical real-time systems, all three strategies should be followed to reduce the number of design errors to a level that is commensurate with the requirement of ultra-high dependability.

Design diversity is based on the hypothesis that different programmers using different programming languages and different development tools do not make the same programming errors. This hypothesis has been tested in a number of controlled experiments, with results that are only partially encouraging [Avi85]. Design diversity increases the overall reliability of a system.
It is, however, not justified to assume that the errors in diverse software versions developed from the same specification are uncorrelated [Kni86].

The detailed analysis of field data of large software systems reveals that a significant number of system failures can be traced to flaws in the system specification. To be more effective, the diverse software versions should therefore be based on different specifications. This, however, complicates the design of the voting algorithm, since versions built from different specifications cannot be expected to deliver bit-identical results.
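To illustrate what a non-exact voting scheme involves, the following minimal sketch in C accepts the results of three diverse versions as agreeing when they lie within a tolerance band instead of requiring bit-identical values. All names, the tolerance, and the sample values are chosen purely for illustration and are not taken from any of the systems discussed here.

```c
#include <math.h>
#include <stdio.h>

#define N_VERSIONS 3
#define TOLERANCE  0.01   /* results within 1% of each other are taken to agree */

/* Return 1 and store a voted result if at least two of the three results
 * agree within TOLERANCE; return 0 if no such majority exists.            */
static int inexact_vote(const double r[N_VERSIONS], double *voted)
{
    for (int i = 0; i < N_VERSIONS; i++) {
        for (int j = i + 1; j < N_VERSIONS; j++) {
            if (fabs(r[i] - r[j]) <= TOLERANCE * fabs(r[i])) {
                *voted = (r[i] + r[j]) / 2.0;  /* e.g., average of the agreeing pair */
                return 1;
            }
        }
    }
    return 0;  /* the versions disagree: report the failure to a higher level */
}

int main(void)
{
    /* Results computed by three diverse versions from the same input. */
    double results[N_VERSIONS] = { 12.504, 12.498, 13.900 };
    double out;

    if (inexact_vote(results, &out))
        printf("voted result: %f\n", out);
    else
        printf("no majority, raise an error\n");
    return 0;
}
```

The choice of the tolerance is itself delicate: if it is too tight, correct but slightly differing results are rejected; if it is too loose, genuine errors are masked.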
Practical experience with non-exact voting schemes has not been encouraging [Lal94].

What place does software diversity have in safety-critical real-time systems? The following case study of a fault-tolerant railway signaling system that is installed in a number of European train stations to increase the safety and reliability of the train service is a good example of the practical utility of software diversity.

11.5.2 An Example of a Fail-Safe System

The VOTRICS train signaling system that has been developed by Alcatel [Kan95] is an industrial example of the application of design diversity in a safety-critical real-time environment.
The objective of a train signaling system is to collect data about the state of the tracks in train stations, i.e., the current positions and movements of the trains and the positions of the switches, and to set the signals and shift the switches such that the trains can move safely through the station according to the timetable entered by the operator.
The safe operation of the train system is of utmost concern.

The VOTRICS system is partitioned into two independent subsystems. The first subsystem accepts the commands from the station operators, collects the data from the tracks, and calculates the intended position of the switches and signals so that the trains can move through the station according to the desired plan. This subsystem uses a TMR architecture to tolerate a single hardware fault.

The second subsystem, called the safety bag, monitors the safety of the state of the station. It has access to the real-time database and to the intended output commands of the first subsystem.
It dynamically evaluates safety predicates that are derived from the traditional "rule book" of the railway authority. If it cannot dynamically verify the safety of an intended output state, it has the authority to block the outputs to the switches and signals, or even to activate an emergency shutdown of the complete station, setting all signals to red and stopping all trains. The safety bag is also implemented on a TMR hardware architecture.

The interesting aspect of this architecture is the substantial independence of the two diverse software versions.
The versions are derived from completely different specifications. Subsystem one takes the operational requirements as the starting point for its software specification, while subsystem two takes the established safety rules as its starting point. Common-mode specification errors can thus be ruled out. The implementation is also substantially different: subsystem one is built according to a standard programming paradigm, while subsystem two is based on expert-system technology.
If the rule-based expert system does not come up with a positive answer within a pre-specified time interval, a violation of a safety condition is assumed. It is thus not necessary to analytically establish a WCET for the expert system (which would be very difficult).

The system has been operational in different railway stations over a number of years. No case has been reported where an unsafe state remained undetected.
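The safety-bag pattern described above can be summarized in a minimal sketch, assuming hypothetical command types, safety rules, and a deadline; the actual VOTRICS safety bag derives its predicates from the railway rule book, is implemented with expert-system technology, and runs on TMR hardware. The essential behavior is that an intended output command is released only if every safety predicate gives a positive answer before the deadline expires; a timeout is treated as a safety violation.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct {            /* intended output command of the primary subsystem */
    int switch_id;
    int requested_pos;      /* e.g., 0 = straight, 1 = diverging                */
} command_t;

typedef bool (*safety_rule_t)(const command_t *);

/* Two illustrative placeholders for safety predicates derived from the rule book. */
static bool route_is_locked(const command_t *c)    { return c->switch_id >= 0; }
static bool no_train_on_switch(const command_t *c) { return c->requested_pos <= 1; }

static const safety_rule_t rules[] = { route_is_locked, no_train_on_switch };
enum { N_RULES = sizeof rules / sizeof rules[0] };

/* The command is approved only if every rule gives a positive answer before
 * the deadline expires; a timeout is treated as a safety violation.          */
static bool safety_bag_approves(const command_t *cmd, double deadline_s)
{
    clock_t start = clock();
    for (int i = 0; i < N_RULES; i++) {
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed > deadline_s)
            return false;          /* no answer in time: assume a violation */
        if (!rules[i](cmd))
            return false;          /* an explicit rule violation            */
    }
    return true;
}

int main(void)
{
    command_t cmd = { .switch_id = 7, .requested_pos = 1 };

    if (safety_bag_approves(&cmd, 0.050))     /* 50 ms time budget */
        printf("command released to switches and signals\n");
    else
        printf("command blocked: fail-safe reaction\n");
    return 0;
}
```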
The independent safety verification by the safety bag also has a positive effect during the commissioning phase, because failures in subsystem one are immediately detected by subsystem two.

From this and other experiences we can derive the general principle that, in a safety-critical system, the execution of every safety-critical function must be monitored by a second independent channel based on a diverse design. There should not be any safety-critical function on a single-channel system.

11.5.3 Multilevel System

The technique described above can also be applied to fail-operational applications that are controlled by a two-level computer system (Fig. 11.3). The higher-level computer system provides full functionality and has a high error-detection coverage. If the high-level computer system fails, an independent and differently designed lower-level computer system with reduced functionality takes over. The reduced functionality must be sufficient to guarantee safety.

[Fig. 11.3 Multilevel computer system with diverse software: a high-level cluster and a lower-level cluster with limited functionality, implemented on diverse hardware and diverse software, connected by real-time buses and a field bus to the sensors and actuators of the controlled object.]

Such an architecture has been deployed in the computer system for the space shuttle [Lee90, p. 297].
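The following minimal sketch illustrates the two-level pattern of Fig. 11.3 with hypothetical control functions: the high-level controller is used as long as its built-in error detection reports success; otherwise a diversely designed low-level controller with reduced functionality drives the controlled object to a safe state. The control laws and the plausibility check are placeholders, not the design of any particular system.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct { double setpoint; } actuation_t;

/* Full-function control law of the high-level cluster (placeholder).
 * Returns false when its built-in error detection flags a failure.       */
static bool high_level_control(double sensor, actuation_t *out)
{
    if (sensor < 0.0 || sensor > 100.0)
        return false;               /* implausible input: declare a failure */
    out->setpoint = 0.8 * sensor;   /* some optimized control action        */
    return true;
}

/* Reduced-functionality control law of the diversely designed low-level
 * cluster: it only guarantees a safe state, not optimal performance.     */
static void low_level_control(double sensor, actuation_t *out)
{
    (void)sensor;
    out->setpoint = 0.0;            /* e.g., drive the plant to a safe state */
}

int main(void)
{
    double sensor_values[] = { 42.0, 57.3, 240.0 };  /* the last value is implausible */
    actuation_t act;

    for (int i = 0; i < 3; i++) {
        if (!high_level_control(sensor_values[i], &act))
            low_level_control(sensor_values[i], &act);   /* fallback path */
        printf("cycle %d: setpoint = %.1f\n", i, act.setpoint);
    }
    return 0;
}
```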
In the space shuttle, along with a TMR system that uses identical software, a fourth computer with diverse software is provided in case a design error causes the correlated failure of the complete TMR system. Diversity is deployed in a number of existing safety-critical real-time systems, as in the Airbus fly-by-wire system [Tra88] and in railway signaling [Kan95].

11.6 Design for Maintainability

The total cost of ownership of a product is not only the cost of the initial acquisition of the product, but the sum of the acquisition cost, the cost of operation, the expected maintenance cost over the product life, and finally, at the end of the product lifetime, the cost of product disposal.
Design for maintainability tries to reduce the expected maintenance cost over the product lifetime. The cost of maintenance, which can be higher than the cost of the initial acquisition of the product, is strongly influenced by the product design and the maintenance strategy.

11.6.1 Cost of Maintenance

In order to analyze the cost structure of a maintenance action, it is necessary to distinguish between two types of maintenance actions: preventive maintenance and on-call maintenance.

Preventive maintenance (sometimes also called scheduled or routine maintenance) refers to a maintenance action that is scheduled to take place periodically at planned intervals, when the plant or machine is intentionally shut down for maintenance. Based on knowledge about the increasing failure rate of components and on the results of the analysis of the anomaly detection database (see Sect. 6.3), components that are expected to fail in the near future are identified and replaced during preventive maintenance. An effective scheduled maintenance strategy needs extensive component instrumentation in order to continually observe component parameters and to learn about the imminent wear-out of components by statistical techniques.

On-call maintenance (sometimes also called reactive maintenance) refers to a maintenance action that is started after a product has failed to provide its service. By its nature it is unplanned. In addition to the direct repair cost, the on-call maintenance costs comprise the cost of maintenance readiness (to ensure the immediate availability of a repair team in case a failure occurs) and the cost of the unavailability of service during the interval starting with the occurrence of the failure and ending when the repair action has been completed.
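As a purely illustrative back-of-the-envelope model (the numbers and the model itself are invented and not taken from the book), the following sketch shows how the cost terms named above enter an expected yearly cost comparison between an on-call strategy (maintenance readiness, direct repair, and unavailability of service) and a preventive strategy (planned replacements plus condition-monitoring instrumentation).

```c
#include <stdio.h>

int main(void)
{
    /* On-call (reactive) maintenance: readiness + direct repair + downtime.   */
    double failures_per_year  = 4.0;
    double readiness_per_year = 20000.0;   /* standby repair team               */
    double repair_cost        = 1500.0;    /* direct cost per repair action     */
    double downtime_hours     = 8.0;       /* per failure, until repair is done */
    double downtime_cost_hour = 3000.0;    /* cost of unavailability of service */

    double on_call = readiness_per_year
                   + failures_per_year * (repair_cost
                                          + downtime_hours * downtime_cost_hour);

    /* Preventive maintenance: planned replacements during scheduled shutdowns
     * plus the (annualized) instrumentation needed for condition monitoring.  */
    double planned_actions = 6.0;
    double cost_per_action = 2000.0;
    double instrumentation = 10000.0;

    double preventive = instrumentation + planned_actions * cost_per_action;

    printf("expected on-call cost per year:    %.0f\n", on_call);
    printf("expected preventive cost per year: %.0f\n", preventive);
    return 0;
}
```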