Real-Time Systems: Design Principles for Distributed Embedded Applications. Hermann Kopetz. Second Edition.
Concerning the type of evidence presented in a safety case, it is commonly agreed that:

1. Deterministic evidence is preferred over probabilistic evidence (see Sect. 5.6).
2. Quantitative evidence is preferred over qualitative evidence.
3. Direct evidence is preferred over indirect evidence.
4. Product evidence is preferred over process evidence.

Computer systems can fail for external and internal reasons (refer to Sect. 6.1). External reasons are related to the operational environment (e.g., mechanical stress, external electromagnetic fields, temperature, wrong input) and to the system specification. The two main internal reasons for failure are:

1. The computer hardware fails because of a random physical fault. Section 6.4 presented a number of techniques for detecting and handling random hardware faults by redundancy. The effectiveness of these fault-tolerance mechanisms must be demonstrated as part of the safety case, e.g., by fault injection (Sect. 12.4).
2. The design, which consists of the software and hardware, contains residual design faults. The elimination of design faults and the validation that a design (software and hardware) is fit for purpose is one of the great challenges of the scientific and engineering community. No single validation technology can provide the required evidence that a computer system will meet ultra-high dependability requirements.

Whereas standard fault-tolerance techniques, such as the replication of components for the implementation of triple-modular redundancy, are well established to mask the consequences of random hardware failures, there is no comparable standard technique for the mitigation of errors in the design of software or hardware.
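As a minimal sketch of such a masking mechanism, the following C fragment (all names and values are hypothetical) shows a bit-wise two-out-of-three majority voter, together with the kind of simple fault-injection check by which the safety case can demonstrate its effectiveness (Sect. 12.4):

    #include <assert.h>
    #include <stdint.h>

    /* Bit-wise two-out-of-three majority voter: a single faulty
       replica of a fault-containment unit is outvoted bit by bit. */
    static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    int main(void)
    {
        uint32_t correct = 0x5A5Au;
        uint32_t faulty  = correct ^ (1u << 7);  /* injected single-bit fault */
        /* The voter masks the injected fault, as a fault-injection
           experiment for the safety case would have to demonstrate. */
        assert(tmr_vote(correct, faulty, correct) == correct);
        return 0;
    }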
Properties of the Architecture. It is a common requirement of a safety-critical application that no single fault that is capable of causing a catastrophic failure may exist in the whole system. This implies that in a fail-safe application, every critical error of the computer must be detected within such a short latency that the application can be forced into the safe state before the consequences of the error affect the system behavior. In a fail-operational application, a safe system service must be provided even after a single fault in any one of the components has occurred.

Fault-Containment Unit (FCU). At the architectural level, it must be demonstrated that every single fault can only affect a defined FCU and that it will be detected at the boundaries of this FCU. The partitioning of the system into independent FCUs is thus of utmost concern.
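A minimal sketch of such a fail-safe reaction, assuming a hypothetical heartbeat interface and an application-specific latency bound:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool safe_state = false;

    static void enter_safe_state(void)  /* e.g., de-energize all actuators */
    {
        safe_state = true;
        puts("safe state entered");
    }

    /* Called periodically: 'alive' is the heartbeat of the monitored
       task, 'latency_us' the time since the heartbeat was last
       refreshed. The monitoring period plus the reaction time must stay
       below the tolerated error-detection latency of the application. */
    static void monitor(bool alive, uint32_t latency_us)
    {
        const uint32_t MAX_LATENCY_US = 500;  /* assumed application bound */
        if (!alive || latency_us > MAX_LATENCY_US)
            enter_safe_state();
    }

    int main(void)
    {
        monitor(true, 100);   /* healthy heartbeat: no reaction      */
        monitor(false, 100);  /* missing heartbeat: force safe state */
        return safe_state ? 0 : 1;
    }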
Experience has shown that there are a number of sensitive points in a design that can lead to a common-mode failure of all components within a distributed system:

1. A single source of time, such as a central clock.
2. A babbling component that disrupts the communication among the correct components in a communication system with shared resources (e.g., a bus system); see the sketch after this list.
3. A single fault in the power supply or in the grounding system.
4. A single design error that is replicated when the same hardware or system software is used in all components.
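A classic countermeasure against the babbling component of item 2 is an independent bus guardian that opens a node's transmit gate only during the node's own TDMA slot. A minimal sketch, assuming hypothetical slot parameters rather than the guardian design of any particular protocol:

    #include <stdbool.h>
    #include <stdint.h>

    #define SLOTS    4    /* assumed number of nodes sharing the bus */
    #define SLOT_US  250  /* assumed slot length in microseconds     */

    /* The guardian derives the current slot from the global time base
       and closes the gate outside the node's own slot, so even a
       babbling host cannot disturb the transmissions of other nodes. */
    static bool transmit_allowed(uint64_t global_time_us, unsigned my_slot)
    {
        unsigned current_slot = (unsigned)(global_time_us / SLOT_US) % SLOTS;
        return current_slot == my_slot;
    }

    int main(void)
    {
        /* At t = 600 us, slot 2 is active: only node 2 may transmit. */
        return (transmit_allowed(600, 2) && !transmit_allowed(600, 0)) ? 0 : 1;
    }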
Design Faults. A disciplined software-development process with inspections and design reviews reduces the number of design faults that are introduced into the software during initial development. Experimental evidence from testing, which by itself is insufficient to demonstrate the safety of the software in the ultra-dependable region, must be combined with structural arguments about the partitioning of the system into autonomous fault-containment units. The credibility can be further augmented by presenting results from the formal analysis of critical properties and the experienced dependability of previous generations of similar systems. Experimental data about field-failure rates of critical components form the input to reliability models of the architecture that demonstrate that the system will mask random component failures with the required high probability. Finally, diverse mechanisms play an important role in reducing the probability of common-mode design failures.
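As an illustration of how field-failure rates enter such a reliability model, assuming independent failures and a perfect voter: if each fault-containment unit fails with probability F = 1 - e^(-lambda*t) during a mission of duration t, a triple-modular configuration fails only if at least two units fail,

    F_TMR = 3F^2(1 - F) + F^3 = 3F^2 - 2F^3, which is approximately 3F^2 for small F.

With an assumed field-failure rate of lambda = 10^-5/h and a mission of t = 10 h, F is about 10^-4 and F_TMR about 3 * 10^-8. The estimate stands or falls with the independence assumption, which is why the common-mode failure sources listed above must be excluded at the architectural level.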
Composable Safety Argument. Composability is another important architectural property that helps in designing a convincing safety case (see also Sect. 2.4.3). Assume that the components of a distributed system can be partitioned into two groups: one group of components that is involved in the implementation of safety-critical functions and another group of components that is not. If it can be shown at the architectural level that no error in any one of the not-involved components can affect the proper operation of the components that implement the safety-critical function, then the not-involved components can be excluded from further consideration in the safety-case analysis.

11.4.4 Safety Standards

The increasing use of embedded computers in diverse safety-critical applications has prompted the appearance of many domain-specific safety standards for the design of embedded systems.
This is a topic of concern, since differing safety standards are roadblocks to the deployment of a cross-domain architecture and tools. A standardized, unified approach to the design and certification of safety-critical computer systems would alleviate this concern.

In the following, we discuss two safety standards that have achieved wide attention in the community and have been used in practice in the design of safety-relevant embedded systems.

IEC 61508. In 1998, the International Electrotechnical Commission (IEC) developed a standard for the design of Electric/Electronic and Programmable Electronic (E/E/PE) safety-related systems, known as the IEC 61508 standard on functional safety.
The standard is applicable to any safety-related control or protection system that uses computer technology. It covers all aspects of the software/hardware design and operation of safety systems that operate on demand, also called protection systems, and of safety-relevant control systems that operate in continuous mode.

Example: An example of a safety system that operates on demand (a protection system) is an emergency shutdown system in a nuclear power plant.

Example: An example of a safety-relevant control system is a control system in a chemical plant that keeps a continuous chemical process within safe process parameters.

The cornerstone of IEC 61508 is the accurate specification and design of the safety functions that are needed to reduce the risk to a level as low as reasonably practicable (ALARP) [Bro00].
The safety functions should be implemented in an independent safety channel. Within defined system boundaries, the safety functions are assigned to Safety-Integrity Levels (SIL), depending on the tolerated probability of a failure on demand for protection systems and the tolerated probability of a failure per hour for safety-relevant control systems (Table 11.2).

The IEC 61508 standard addresses random physical faults in the hardware, design faults in hardware and software, and failures of communication in a distributed system.
IEC 61508-2 deals with the contribution of fault tolerance to the dependability of the safety function. In order to reduce the probability of design faults in hardware and software, the standard recommends adherence to a disciplined software development process and the provision of mechanisms that mitigate the consequences of remaining design faults during the operation of a system. It is interesting to note that dynamic reconfiguration mechanisms are not recommended in systems above SIL 1.

Table 11.2 Safety integrity levels (SIL) of safety functions

Safety integrity level | Average tolerated probability of failure per demand | Average tolerated probability of failure per hour
SIL 4 | 10^-5 to <10^-4 | 10^-9 to <10^-8
SIL 3 | 10^-4 to <10^-3 | 10^-8 to <10^-7
SIL 2 | 10^-3 to <10^-2 | 10^-7 to <10^-6
SIL 1 | 10^-2 to <10^-1 | 10^-6 to <10^-5
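A tool could encode the quantitative targets of Table 11.2 directly; a minimal sketch for the continuous-mode (per hour) column, with hypothetical names that are not part of the standard:

    #include <stdbool.h>

    /* Upper bounds on the average tolerated probability of failure per
       hour (continuous mode) from Table 11.2; index 1..4 corresponds
       to SIL 1..SIL 4. */
    static const double pfh_upper_bound[5] = {
        0.0,   /* index 0 unused */
        1e-5,  /* SIL 1 */
        1e-6,  /* SIL 2 */
        1e-7,  /* SIL 3 */
        1e-8   /* SIL 4 */
    };

    /* true if the demonstrated probability of failure per hour meets
       the target SIL for a safety-relevant control system. */
    static bool meets_sil(double pfh, int sil)
    {
        return sil >= 1 && sil <= 4 && pfh < pfh_upper_bound[sil];
    }

    int main(void)
    {
        /* 3*10^-8 per hour satisfies SIL 3 but not SIL 4. */
        return (meets_sil(3e-8, 3) && !meets_sil(3e-8, 4)) ? 0 : 1;
    }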
IEC 61508 is the foundation for a number of domain-specific safety standards, such as the emerging ISO 26262 standard for automotive applications, EN ISO 13849 for the machinery and off-highway industry, and IEC 60601 and IEC 62304 for medical devices.

Example: [Lie10] gives an example of the assignment of automotive safety integrity levels (ASIL) according to ISO 26262 to the two tasks of an electronic gas pedal (EGAS) implementation, the functional task and the monitoring task. If a certified monitoring task that detects an unsafe state and is guaranteed to bring the system into a safe state is independent of the functional task, then the functional task does not have to be certified.
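A minimal sketch of this monitoring pattern, with a hypothetical plausibility rule and invented names (the actual rule of [Lie10] is not reproduced here):

    #include <stdbool.h>

    /* Hypothetical plausibility rule: the throttle opening must not
       exceed the driver's pedal demand by more than a small tolerance. */
    static bool torque_plausible(double pedal_pos, double throttle_pos)
    {
        return throttle_pos <= pedal_pos + 0.05;
    }

    /* The monitoring task runs independently of the functional task and
       forces the safe state (e.g., cutting throttle actuation) whenever
       the plausibility condition is violated. */
    static void monitoring_task(double pedal_pos, double throttle_pos,
                                void (*force_safe_state)(void))
    {
        if (!torque_plausible(pedal_pos, throttle_pos))
            force_safe_state();
    }

    static int tripped;
    static void cut_throttle(void) { tripped = 1; }

    int main(void)
    {
        monitoring_task(0.20, 0.80, cut_throttle);  /* implausible: trips */
        return tripped ? 0 : 1;
    }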
RTCA/DO-178B and DO-254. Over the past decades, safety-relevant computer systems have been deployed widely in the aircraft industry, which therefore has extensive experience in the design and operation of such systems. The document RTCA/DO-178B: Software Considerations in Airborne Systems and Equipment Certification [ARI92] and the related document RTCA/DO-254: Design Assurance Guidance for Airborne Electronic Hardware [ARI05] contain standards and recommendations for the design and validation of the software and hardware of airborne safety-relevant computer systems.
These documents have been developed by a committee consisting of representatives of the major aerospace companies, airlines, and regulatory bodies and thus represent an international consensus view on a reasonable and practical approach to producing safe systems. Experience with the use of this standard has been gained in a number of major projects, such as the application of RTCA/DO-178B in the design of the Boeing 777 aircraft and follow-on aircraft.

The basic idea of RTCA/DO-178B is a two-phase approach: in the first phase, the planning phase, the structure of the safety case, the procedures that must be followed in the execution of the project, and the documentation to be produced are defined. In the second phase, the execution phase, it is checked that all procedures established in the first phase are precisely adhered to in the execution of the project. The criticality of the software is derived from the criticality of the software-related function that has been identified during safety analysis and is classified according to Table 11.1.
The rigor of the software development process increases with the criticality level of the software. The standard contains tables and checklists that suggest the design, validation, documentation, and project-management methods that must be followed when developing software for a given criticality level. At higher criticality levels, the inspection procedures must be performed by personnel that are independent of the development group. For the highest criticality level, level A, the application of formal methods is recommended, but not demanded.

When it comes to the elimination of design faults, both standards, IEC 61508 and RTCA/DO-178B, demand a rigorous software development process, in the hope that software developed according to such a process will be free of design faults. From a certification point of view, an evaluation of the software product would be more appealing than an evaluation of the development process, but we must recognize that there are fundamental limitations concerning the validation of a software product by testing [Lit93].

Recently, the new standard RTCA/DO-297 Integrated Modular Avionics (IMA) Development Guidance and Certification Considerations has been published, which addresses the role of design methodologies, architectures, and partitioning methods in the certification of modern integrated avionics systems in commercial aircraft. This standard also considers the contribution of time-triggered partitioning mechanisms in the design of safety-relevant distributed systems.

11.5 Design Diversity

Field data on the observed reliability of many large computer systems indicate that a significant and increasing number of computer system failures are caused by design errors in the software and not by physical faults of the hardware.