A. Wood - Software Reliability Growth Models
To avoid confidentiality issues, the specific products and releases are not identified, and the test data has been suitably transformed. The literature has very little real data from commercial applications, possibly due to confidentiality concerns. We hope this transformation technique will stimulate other software reliability practitioners to provide similarly transformed data that can be used for model development and testing by theoreticians.

The test data collected included three representations of the amount of testing and two representations of defects, as described in Section 2.1.
For each of the software releases, we evaluated the test data using the software reliability growth models described in Section 2.2, the statistical techniques described in Section 2.3, and the model evaluation criteria described in Section 2.4. This section describes the results of those evaluations. Section 3.1 contains the test data, Section 3.2 contains the basic results, and Sections 3.3-3.8 contain results obtained by varying a model parameter or evaluation technique.

3.1 Test Data

We collected data from four separate software releases.
As shown in Table 3-1, we artificially set the system test time for Release 1 to 10,000 hours and the number of defects discovered in Release 1 to 100. All data was ratioed proportionately, e.g., all test hours were multiplied by 10,000 and divided by the real number of test hours from Release 1. As mentioned in Section 2.3.5, the predicted number of total defects scales by the same amount as the defect data and is unaffected by the test time scaling. The releases were tested for different lengths of time (both calendar and execution), as shown in the table.
The data in Table 3-1 is shown graphically in Figure 3-1. All the data exhibits the shape of the concave models, e.g., Figure 1-1.

Test   Release 1            Release 2            Release 3            Release 4
Week   Exec hrs   Defects   Exec hrs   Defects   Exec hrs   Defects   Exec hrs   Defects
 1        519        16        384        13        162         6        254         1
 2        968        24      1,186        18        499         9        788         3
 3      1,430        27      1,471        26        715        13      1,054         8
 4      1,893        33      2,236        34      1,137        20      1,393         9
 5      2,490        41      2,772        40      1,799        28      2,216        11
 6      3,058        49      2,967        48      2,438        40      2,880        16
 7      3,625        54      3,812        61      2,818        48      3,593        19
 8      4,422        58      4,880        75      3,574        54      4,281        25
 9      5,218        69      6,104        84      4,234        57      5,180        27
10      5,823        75      6,634        89      4,680        59      6,003        29
11      6,539        81      7,229        95      4,955        60      7,621        32
12      7,083        86      8,072       100      5,053        61      8,783        32
13      7,487        90      8,484       104         -          -      9,604        36
14      7,846        93      8,847       110         -          -     10,064        38
15      8,205        96      9,253       112         -          -     10,560        39
16      8,564        98      9,712       114         -          -     11,008        39
17      8,923        99     10,083       117         -          -     11,237        41
18      9,282       100     10,174       118         -          -     11,243        42
19      9,641       100     10,272       120         -          -     11,305        42
20     10,000       100         -          -         -          -         -          -

Note: all data has been scaled by artificially setting the execution time in Release 1 to 10,000 hours and the number of defects discovered in Release 1 to 100 and ratioing all other data proportionately.

Table 3-1. Test Data
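The transformation described in the note can be reproduced with simple proportional arithmetic. The sketch below is only an illustration: raw_hours and raw_defects are hypothetical placeholders for the unpublished Release 1 measurements, and the function shows the ratioing step rather than any part of the report's own tooling.

```python
# Minimal sketch of the confidentiality transformation described in the note above.
# raw_hours / raw_defects are hypothetical placeholders; the real, untransformed
# Release 1 measurements are not published.

def scale_release(raw_hours, raw_defects, ref_total_hours, ref_total_defects,
                  target_hours=10_000, target_defects=100):
    """Ratio all observations so the reference release (Release 1) ends at exactly
    target_hours execution hours and target_defects discovered defects."""
    hour_factor = target_hours / ref_total_hours        # e.g. 10,000 / real Release 1 hours
    defect_factor = target_defects / ref_total_defects  # e.g. 100 / real Release 1 defects
    scaled_hours = [h * hour_factor for h in raw_hours]
    scaled_defects = [d * defect_factor for d in raw_defects]
    return scaled_hours, scaled_defects
```

Because every release is divided by the same Release 1 factors, the shapes of the growth curves are preserved; as noted in Section 2.3.5, the time scaling does not affect the predicted total number of defects, while the defect scaling rescales it by the same factor as the defect data.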
[Figure 3-1. Test Data for All Releases: cumulative defects (0-120) versus test hours (0-12,000) for Releases 1-4.]

The execution hours for Releases 1-3 are obtained from the product QA groups testing the release subsets used in this report. Other QA and development groups reported defects against the release subsets, but the test effort was reasonably synchronized across all groups, so we feel that the product QA test hours fairly represent the software test effort. For Release 4, the test effort was not well synchronized, and a larger portion of the defects were reported by groups other than the product QA groups.
Therefore, we added the test hours from product QA groups that were not directly testing the release subset but were reporting many defects. We feel this is a better representation of the software test effort.

3.2 Results From the Standard Model

As will be shown in the following sections, we achieved the best results using execution time to measure the amount of testing, defect data rather than problem reports, and the G-O (exponential) software reliability growth model. We gathered data weekly and used the alternative least squares technique described in Section 2.3.3 to estimate the parameters. The most important parameter is the predicted total number of defects, from which we can determine the predicted number of residual defects.
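As a rough sketch of how such estimates can be produced, the fragment below fits the G-O exponential mean value function mu(t) = a(1 - exp(-b t)) to the cumulative Release 1 defect data of Table 3-1 with a generic nonlinear least-squares routine. This is an approximation only: the report's own estimates use the alternative least-squares technique of Section 2.3.3, which is not reproduced here, so the fitted numbers may differ somewhat from the tables that follow.

```python
# Hedged sketch: fit the G-O (exponential) growth model to cumulative defect counts
# and derive the predicted total and residual defects. A generic nonlinear
# least-squares routine stands in for the report's alternative least-squares technique.
import numpy as np
from scipy.optimize import curve_fit

def go_mean_value(t, a, b):
    """G-O mean value function: expected cumulative defects after t execution hours."""
    return a * (1.0 - np.exp(-b * t))

# Release 1 data from Table 3-1 (execution hours, cumulative defects, weeks 1-20).
hours = np.array([519, 968, 1430, 1893, 2490, 3058, 3625, 4422, 5218, 5823,
                  6539, 7083, 7487, 7846, 8205, 8564, 8923, 9282, 9641, 10000], dtype=float)
defects = np.array([16, 24, 27, 33, 41, 49, 54, 58, 69, 75,
                    81, 86, 90, 93, 96, 98, 99, 100, 100, 100], dtype=float)

(a_hat, b_hat), _ = curve_fit(go_mean_value, hours, defects,
                              p0=[defects[-1], 1e-4], maxfev=10000)
predicted_total = a_hat                    # fitted "a" is the predicted total number of defects
predicted_residual = a_hat - defects[-1]   # predicted residual defects at the end of test
print(f"predicted total = {predicted_total:.0f}, predicted residual = {predicted_residual:.0f}")
```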
A few of these calculations are shown in Tables 3-2 through 3-5. As can be seen from the tables, the total defect parameter becomes stable (meaning the week-to-week variance is small) after approximately 60% of calendar test time and 70% of execution time. It took longer for Release 3 to stabilize, probably because there is less total data.

Test   Execution   Percent of         No. of    Predicted Total   Predicted
Week   Hours       Execution Hours    Defects   No. of Defects    Residual Defects
10      5,823       58%                 75         98               23
11      6,539       65%                 81        107               26
12      7,083       71%                 86        116               30
13      7,487       75%                 90        123               33
14      7,846       78%                 93        129               36
15      8,205       82%                 96        129               33
16      8,564       86%                 98        134               36
17      8,923       89%                 99        139               40
18      9,282       93%                100        138               38
19      9,641       96%                100        135               35
20     10,000      100%                100        133               33

Table 3-2. Release 1 Results

Test   Execution   Percent of         No. of    Predicted Total   Predicted
Week   Hours       Execution Hours    Defects   No. of Defects    Residual Defects
10      6,634       65%                 89        203              114
11      7,229       70%                 95        192               97
12      8,072       79%                100        179               79
13      8,484       83%                104        178               74
14      8,847       86%                110        184               74
15      9,253       90%                112        184               72
16      9,712       95%                114        183               69
17     10,083       98%                117        182               65
18     10,174       99%                118        183               65
19     10,272      100%                120        184               64

Table 3-3. Release 2 Results

Test   Execution   Percent of         No. of    Predicted Total   Predicted
Week   Hours       Execution Hours    Defects   No. of Defects    Residual Defects
 8      3,574       71%                 54        163              109
 9      4,234       84%                 57        107               50
10      4,680       93%                 59         93               34
11      4,955       98%                 60         87               27
12      5,053      100%                 61         84               23

Table 3-4. Release 3 Results

Test   Execution   Percent of         No. of    Predicted Total   Predicted
Week   Hours       Execution Hours    Defects   No. of Defects    Residual Defects
10      6,003       53%                 29         84               55
11      7,621       67%                 32         53               21
12      8,783       78%                 32         44               12
13      9,604       85%                 36         45                9
14     10,064       89%                 38         46                8
15     10,560       93%                 39         48                9
16     11,008       97%                 39         48                9
17     11,237       99%                 41         50                9
18     11,243       99%                 42         51                9
19     11,305      100%                 42         52               10

Table 3-5. Release 4 Results
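The stability observation above (that the predicted total settles down after roughly 60% of calendar time and 70% of execution time) can be checked mechanically. Continuing the earlier sketch, and reusing its go_mean_value function and Table 3-1 arrays, the fragment below refits the model week by week and reports the first week at which the prediction changes by less than a chosen tolerance; the 10% threshold is an illustrative assumption, not a figure taken from this report.

```python
# Hedged sketch: refit the G-O model week by week and flag when the predicted total
# number of defects stops changing by more than `tol` from one week to the next.
# go_mean_value() and the (hours, defects) arrays are those of the earlier sketch.
from scipy.optimize import curve_fit

def weekly_totals(hours, defects, first_week=6):
    totals = {}
    for week in range(first_week, len(hours) + 1):
        try:
            popt, _ = curve_fit(go_mean_value, hours[:week], defects[:week],
                                p0=[float(defects[week - 1]), 1e-4], maxfev=10000)
            totals[week] = popt[0]        # fitted "a" = predicted total defects this week
        except RuntimeError:
            totals[week] = None           # the curve fit did not converge this week
    return totals

def first_stable_week(totals, tol=0.10):  # 10% week-to-week tolerance: illustrative choice
    weeks = sorted(w for w, v in totals.items() if v is not None)
    for prev, curr in zip(weeks, weeks[1:]):
        if abs(totals[curr] - totals[prev]) <= tol * totals[prev]:
            return curr
    return None
```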
Tables 3-2 through 3-5 demonstrate that the predicted total number of defects becomes stable for the simple exponential model, which is the first criterion for a useful model. The second criterion is that the predicted residual defects reasonably approximate field use. Table 3-6 compares the predicted residual defects with the first year of field experience. All of the predictions are surprisingly close to field experience and well within the confidence limits except for Release 2.
A two-stage model, combining Releases 2 and 3, did a better job of predicting residual defects and is described later in this section. One criticism of the results in Table 3-6 might be that we had to modify the simple model to obtain them, i.e., the two-stage model for Releases 2 and 3 and the additional test hours from parallel test groups for Release 4.
However, these modifications were made as the models were being developed, because the differences among releases were evident during the QA test phase rather than in hindsight. Having settled on the basic model structure, it was easy to make these types of model modifications.

Release   Predicted Residual Defects   Defects in First Year
1                    33                          34
2                    64                           8
3                    23                          20
4                    10                           9
2/3                  33                          28

Table 3-6. Model Predictions vs. Field Experience

The defects in Table 3-6 include defects found by customers and defects found through internal usage, as long as the defects found internally were not part of the next major QA test cycle. The defects found by customers tend to be in configurations that are difficult to replicate in QA, e.g., a very large system running continuously for months.
The third column in Table 3-6 includes known defects and TPRs that were still open for analysis at the end of the first year. Additional defect data gathered for some releases shows that the number of defects found after the first year is balanced by the number of open TPRs that turn out to be rediscoveries or non-defects. Therefore, the number of defects for the first year shown in Table 3-6 is expected to be close to the total number of defects that will be attributed to that release.

Two-Stage Model Results

Since the Release 2 model greatly overestimated the number of residual defects, we examined the details of this release. Release 2 was a preliminary release used by very few customers. Release 3 was very similar to Release 2 with some functionality and performance enhancements, and the Release 2 and Release 3 testing overlapped (Release 2 test week 17 was the same as Release 3 test week 1).
Therefore, Releases 2 and 3 can really be treated as a single release that was tested from Release 2 RQA until Release 3 QAOK, in which Release 3 RQA corresponds to the release of additional functionality into the test process. This is the classic setup for the two-stage model described in Section 2.2. Figure 3-2 shows what the data looks like for the two-stage model. This figure shows that the data has the shape of the two-stage model shown in Figure 2-3. Note that the data has an inflection point at about 9,700 hours, which was Release 3 RQA.
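Exactly how the report merges the overlapping Release 2 and Release 3 test weeks, and the precise two-stage formulation of Section 2.2, are not reproduced in this excerpt. Purely as a hedged sketch, the fragment below concatenates the two data sets from Table 3-1 (offsetting Release 3 by the Release 2 totals at its week 16, since Release 2 week 17 coincided with Release 3 week 1) and fits an assumed piecewise mean value function in which a second exponential stage opens at the roughly 9,700-hour inflection point. It illustrates the shape of such a fit, not the report's exact numbers.

```python
# Hedged sketch of a two-stage fit for the combined Release 2/3 data.
# The piecewise form below (a second exponential stage opening at t0) is an
# illustrative assumption, not necessarily the exact model of Section 2.2.
import numpy as np
from scipy.optimize import curve_fit

def two_stage_mean_value(t, a1, b1, a2, b2, t0=9700.0):
    """First stage a1*(1-exp(-b1*t)); a second stage a2*(1-exp(-b2*(t-t0))) opens at t0."""
    t = np.asarray(t, dtype=float)
    return a1 * (1.0 - np.exp(-b1 * t)) + a2 * (1.0 - np.exp(-b2 * np.maximum(t - t0, 0.0)))

# Combined data (Table 3-1): Release 2 weeks 1-16, then Release 3 weeks 1-12 appended
# with its hours and defects offset by the Release 2 totals at week 16. How the report
# itself combines the overlapping weeks is an assumption here.
r2_hours = [384, 1186, 1471, 2236, 2772, 2967, 3812, 4880, 6104, 6634,
            7229, 8072, 8484, 8847, 9253, 9712]
r2_defects = [13, 18, 26, 34, 40, 48, 61, 75, 84, 89, 95, 100, 104, 110, 112, 114]
r3_hours = [162, 499, 715, 1137, 1799, 2438, 2818, 3574, 4234, 4680, 4955, 5053]
r3_defects = [6, 9, 13, 20, 28, 40, 48, 54, 57, 59, 60, 61]

hours = np.array(r2_hours + [r2_hours[-1] + h for h in r3_hours], dtype=float)
defects = np.array(r2_defects + [r2_defects[-1] + d for d in r3_defects], dtype=float)

params, _ = curve_fit(two_stage_mean_value, hours, defects,
                      p0=[120.0, 2e-4, 80.0, 2e-4], maxfev=20000)
print("predicted total defects (a1 + a2) =", round(params[0] + params[2]))
```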
When we evaluate this data using the two-stage model techniques described in Section 2.2, the predicted total number of defects is 214. From Table 3-1, the total number of defects in Releases 2 and 3 is 181, so the predicted number of residual defects is 33. From Table 3-6, there were 28 defects in the first year for Releases 2 and 3 combined, which compares favorably with the prediction of 33.

[Figure 3-2. Combined Data for Releases 2 and 3: cumulative defects (0-200) versus test hours (0-16,000) for the combined data set.]

3.3 Results for Different Representations of Test Time

All previously presented results have been calculated using execution time to represent the amount of testing, rather than calendar time or number of test cases. The reason for this is that our results using calendar time and number of test cases have been poor. Tables 3-2 through 3-5 show that execution time does not correlate well with calendar time, meaning that the testing effort is not spread uniformly throughout the test period.
There are times when major defects or schedule conflicts may prevent test execution. Calendar time accumulates during these periods while execution time does not, which is one reason that calendar-time models do not seem to produce credible results. Table 3-7 shows the results of fitting the Release 4 defects to calendar time. We were unable to get a result until week 15 because the curve fit did not converge.
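The convergence problem is easy to illustrate in outline by giving the same fit calendar time (the test-week index) as its measure of testing and guarding against failure. The sketch below does this for the Release 4 defect counts of Table 3-1. Because it uses a generic least-squares routine rather than the report's alternative least-squares technique, its week-by-week outputs will not necessarily match Table 3-7; the point is only the mechanics of swapping the time measure and trapping a fit that does not converge.

```python
# Hedged sketch: fit the G-O model against calendar time (test weeks) for Release 4,
# trapping the non-convergence described in the text.
import numpy as np
from scipy.optimize import curve_fit

def go_mean_value(t, a, b):                  # same form as in the earlier sketch
    return a * (1.0 - np.exp(-b * t))

# Release 4 cumulative defects by test week (Table 3-1); calendar time = week index.
r4_defects = np.array([1, 3, 8, 9, 11, 16, 19, 25, 27, 29,
                       32, 32, 36, 38, 39, 39, 41, 42, 42], dtype=float)
weeks = np.arange(1, len(r4_defects) + 1, dtype=float)

for week in range(10, len(weeks) + 1):
    try:
        popt, _ = curve_fit(go_mean_value, weeks[:week], r4_defects[:week],
                            p0=[r4_defects[week - 1], 0.05], maxfev=5000)
        print(f"week {week}: predicted total defects = {popt[0]:.0f}")
    except RuntimeError:
        print(f"week {week}: curve fit did not converge")
```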
After week 15, the prediction was very unstable, especially in comparison to the very stable execution-time results, as can be seen from Table 3-7. Similar results with the other releases indicate that execution time is a much better measure of the amount of testing than calendar time in our environment.

Test   Execution   Percent of         No. of    Predicted Total No. of      Predicted Total No. of
Week   Hours       Execution Hours    Defects   Defects (Execution Time)    Defects (Calendar Time)
10      6,003       53%                 29        84                         No Prediction
11      7,621       67%                 32        53                         No Prediction
12      8,783       78%                 32        44                         No Prediction
13      9,604       85%                 36        45                         No Prediction
14     10,064       89%                 38        46                         No Prediction
15     10,560       93%                 39        48                         457
16     11,008       97%                 39        48                         178
17     11,237       99%                 41        50                         125
18     11,243       99%                 42        51                         101
19     11,305      100%                 42        52                          85

Table 3-7. Release 4 Results for Calendar Time

We also had poor results using the number of test cases to represent the amount of testing. Table 3-8 shows the test case data and results for Release 3. The total number of test cases has been translated to 10,000. Note that the number of test cases increases faster than the execution hours.