We make an assumption that the chance of obtaining a success remains constant [2*, c3s6].
• Poisson distribution: used to model the count of occurrences of some event over time or space [2*, c3s9].
• Normal distribution: used to model continuous random variables, or discrete random variables that take a very large number of values [2*, c4s6].

Concept of parameters. A statistical distribution is characterized by some parameters. For example, the probability of success in any given trial is the only parameter characterizing a binomial distribution (for a fixed number of trials). Similarly, the Poisson distribution is characterized by a rate of occurrence. A normal distribution is characterized by two parameters: namely, its mean and standard deviation. Once the values of the parameters are known, the distribution of the random variable is completely known and the chance (probability) of any event can be computed. The probabilities for a discrete random variable can be computed through the probability mass function, called the pmf. The pmf is defined at discrete points and gives the point mass, that is, the probability that the random variable will take that particular value. Likewise, for a continuous random variable, we have the probability density function, called the pdf. The pdf is very much like density and needs to be integrated over a range to obtain the probability that the continuous random variable lies between certain values. Thus, if the pdf or pmf is known, the chance of the random variable taking a certain set of values may be computed theoretically.
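As an illustration, the short Python sketch below evaluates the pmf of a binomial and a Poisson distribution and the pdf of a normal distribution directly from their defining formulas; the parameter values used (n = 10 trials with success probability p = 0.2, a Poisson rate of 3, and a standard normal with mean 0 and standard deviation 1) are arbitrary choices made only for the example.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate (mean) lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_pdf(x, mu, sigma):
    """Density at x of a normal random variable with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Example parameter values (chosen arbitrarily for illustration).
print(binomial_pmf(2, n=10, p=0.2))        # chance of exactly 2 successes in 10 trials
print(poisson_pmf(4, lam=3.0))             # chance of exactly 4 events when the rate is 3
print(normal_pdf(1.0, mu=0.0, sigma=1.0))  # density (not a probability) at x = 1
```

Note that the pmf values are probabilities in their own right, whereas a pdf value is only a density and must be integrated over an interval to yield a probability.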
Concept of estimation [2*, c6s2, c7s1, c7s3]. The true values of the parameters of a distribution are usually unknown and need to be estimated from the sample observations. The estimates are functions of the sample values and are called statistics. For example, the sample mean is a statistic and may be used to estimate the population mean. Similarly, the rate of occurrence of defects estimated from the sample (rate of defects per line of code) is a statistic and serves as the estimate of the population rate of defects per line of code. The statistic used to estimate some population parameter is often referred to as the estimator of the parameter.

A very important point to note is that the results of the estimators themselves are random. If we take a different sample, we are likely to get a different estimate of the population parameter. In the theory of estimation, we need to understand different properties of estimators, particularly how much the estimates can vary across samples and how to choose between different alternative ways to obtain the estimates. For example, if we wish to estimate the mean of a population, we might use as our estimator a sample mean, a sample median, a sample mode, or the midrange of the sample. Each of these estimators has different statistical properties that may impact the standard error of the estimate.
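To make this sampling variability concrete, the following illustrative Python sketch computes three alternative point estimators of the population mean (the sample mean, the sample median, and the midrange) on two hypothetical samples assumed to come from the same population; the data values are invented for the example.

```python
import statistics

def midrange(data):
    """Midrange estimator: the average of the smallest and largest observations."""
    return (min(data) + max(data)) / 2

# Two hypothetical samples from the same population (illustrative values only).
sample_1 = [4, 7, 3, 8, 5, 6, 4, 9, 5, 6]
sample_2 = [5, 6, 2, 7, 6, 5, 8, 4, 6, 7]

for sample in (sample_1, sample_2):
    print(
        "mean =", statistics.mean(sample),
        "median =", statistics.median(sample),
        "midrange =", midrange(sample),
    )
# The three estimators give different values, and each changes from sample to sample;
# their statistical properties (for example, their standard errors) differ as well.
```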
Types of estimates [2*, c7s3, c8s1]. There are two types of estimates: namely, point estimates and interval estimates. When we use the value of a statistic to estimate a population parameter, we get a point estimate. As the name indicates, a point estimate gives a point value of the parameter being estimated. Although point estimates are often used, they leave room for many questions. For instance, we are not told anything about the possible size of the error or the statistical properties of the point estimate. Thus, we might need to supplement a point estimate with the sample size as well as the variance of the estimate. Alternatively, we might use an interval estimate. An interval estimate is a random interval whose lower and upper limits are functions of the sample observations as well as the sample size. The limits are computed on the basis of some assumptions regarding the sampling distribution of the point estimate on which the limits are based.
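The sketch below, again using invented data, turns the sample mean into an interval estimate for the population mean with the common normal-approximation formula (sample mean ± z · s/√n, with z ≈ 1.96 for 95% confidence). For a sample this small a t-based interval would normally be preferred; the normal approximation is used only to keep the example short.

```python
import math
import statistics

# Hypothetical sample of defect counts per module (illustrative data only).
sample = [4, 7, 3, 8, 5, 6, 4, 9, 5, 6]
n = len(sample)

point_estimate = statistics.mean(sample)     # point estimate of the population mean
s = statistics.stdev(sample)                 # sample standard deviation (n - 1 divisor)

z = 1.96                                     # approximate two-sided 95% normal critical value
half_width = z * s / math.sqrt(n)
lower, upper = point_estimate - half_width, point_estimate + half_width

print(f"point estimate = {point_estimate:.2f}")
print(f"95% interval estimate = ({lower:.2f}, {upper:.2f})")
# A different sample would yield different limits: the interval itself is random.
```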
Properties of estimators. Various statistical properties of estimators are used to decide about the appropriateness of an estimator in a given situation. The most important properties are that an estimator is unbiased, efficient, and consistent with respect to the population parameter being estimated.

Tests of hypotheses [2*, c9s1]. A hypothesis is a statement about the possible values of a parameter. For example, suppose it is claimed that a new method of software development reduces the occurrence of defects. In this case, the hypothesis is that the rate of occurrence of defects has been reduced. In tests of hypotheses, we decide, on the basis of sample observations, whether a proposed hypothesis should be accepted or rejected. For testing hypotheses, the null and alternative hypotheses are formed. The null hypothesis is the hypothesis of no change and is denoted as H0. The alternative hypothesis is written as H1. It is important to note that the alternative hypothesis may be one-sided or two-sided. For example, if we have the null hypothesis that the population mean is not less than some given value, the alternative hypothesis would be that it is less than that value, and we would have a one-sided test. However, if we have the null hypothesis that the population mean is equal to some given value, the alternative hypothesis would be that it is not equal, and we would have a two-sided test (because the true value could be either less than or greater than the given value).

In order to test some hypothesis, we first compute some statistic. Along with the computation of the statistic, a region is defined such that, if the computed value of the statistic falls in that region, the null hypothesis is rejected. This region is called the critical region (also known as the rejection region). In tests of hypotheses, we need to accept or reject the null hypothesis on the basis of the evidence obtained. We note that, in general, the alternative hypothesis is the hypothesis of interest. If the computed value of the statistic does not fall inside the critical region, then we cannot reject the null hypothesis. This indicates that there is not enough evidence to believe that the alternative hypothesis is true.

As the decision is being taken on the basis of sample observations, errors are possible; the types of such errors are summarized in the following table.

                        Nature
Statistical Decision    H0 is true                        H0 is false
Accept H0               OK                                Type II error (probability = β)
Reject H0               Type I error (probability = α)    OK

In tests of hypotheses, we aim at maximizing the power of the test (the value of 1 − β) while ensuring that the probability of a Type I error (the value of α) is kept within a particular value, typically 5 percent.

It is to be noted that the construction of a test of hypothesis includes identifying the statistic(s) to estimate the parameter(s) and defining a critical region such that, if the computed value of the statistic falls in the critical region, the null hypothesis is rejected.
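The following illustrative Python sketch walks through a one-sided test of the claim above. It uses a normal approximation (a z-type statistic) purely to keep the example self-contained; with a sample this small, a t-test would normally be used. The data, the hypothesized rate of 4.5 defects per KLOC, and the one-sided 5% critical value of -1.645 are all assumptions made for the example.

```python
import math
import statistics

# Hypothetical data: defects per KLOC observed after adopting the new method.
sample = [4.1, 3.6, 4.4, 3.9, 3.2, 4.0, 3.7, 3.5, 4.2, 3.8, 3.4, 3.9]
mu0 = 4.5            # H0: the mean defect rate is still at least 4.5 (no improvement)
alpha = 0.05         # acceptable probability of a Type I error
z_critical = -1.645  # one-sided 5% critical value of the standard normal distribution

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)

# Test statistic: standardized distance of the sample mean from the hypothesized mean.
z = (xbar - mu0) / (s / math.sqrt(n))

# Critical region for the one-sided alternative H1 (mean < 4.5) is {z < -1.645}.
if z < z_critical:
    print(f"z = {z:.2f}: reject H0 at the {alpha:.0%} level (evidence of improvement)")
else:
    print(f"z = {z:.2f}: cannot reject H0 at the {alpha:.0%} level")
```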
2.2. Concepts of Correlation and Regression
[2*, c11s2, c11s8]

A major objective of many statistical investigations is to establish relationships that make it possible to predict one or more variables in terms of others. Although it is desirable to predict a quantity exactly in terms of another quantity, this is seldom possible, and in many cases we have to be satisfied with estimating average or expected values.

The relationship between two variables is studied using the methods of correlation and regression. Both of these concepts are explained briefly in the following paragraphs.

Correlation. The strength of the linear relationship between two variables is measured using the correlation coefficient. While computing the correlation coefficient between two variables, we assume that these variables measure two different attributes of the same entity. The correlation coefficient takes a value between -1 and +1. The values -1 and +1 indicate a situation where the association between the variables is perfect; that is, given the value of one variable, the other can be estimated with no error. A positive correlation coefficient indicates a positive relationship: if one variable increases, so does the other. On the other hand, when the variables are negatively correlated, an increase in one leads to a decrease in the other.

It is important to remember that correlation does not imply causation. Thus, if two variables are correlated, we cannot conclude that one causes the other.

Regression. Correlation analysis only measures the degree of relationship between two variables. The analysis used to find the form of the relationship between two variables is called regression analysis. The strength of the relationship between the two variables is measured using the coefficient of determination. This is a value between 0 and 1. The closer the coefficient is to 1, the stronger the relationship between the variables. A value of 1 indicates a perfect relationship.
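As an illustration of both concepts, the following Python sketch computes the correlation coefficient, fits a least-squares regression line, and reports the coefficient of determination (which, for simple linear regression, equals the square of the correlation coefficient). The paired data, module size in KLOC against defect count, are invented for the example.

```python
import math

# Hypothetical paired observations: module size (KLOC) and defect count for the same modules.
x = [1.2, 2.5, 0.8, 3.1, 2.0, 1.7, 2.8, 0.9]
y = [5,   11,  4,   15,  9,   8,   13,  4]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Pearson correlation coefficient (between -1 and +1).
r = sxy / math.sqrt(sxx * syy)

# Least-squares regression line y = a + b * x.
b = sxy / sxx
a = mean_y - b * mean_x

# Coefficient of determination; for simple linear regression it equals r squared.
r_squared = r ** 2

print(f"r = {r:.3f}, regression line: y = {a:.2f} + {b:.2f} x, r^2 = {r_squared:.3f}")
```

Even a coefficient of determination close to 1 would not, by itself, establish that module size causes defects; it would only describe the strength of the observed relationship.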
3. Measurement
[4*, c3s1, c3s2] [5*, c4s4] [6*, c7s5] [7*, p442–447]

Knowing what to measure and which measurement method to use is critical in engineering endeavors. It is important that everyone involved in an engineering project understand the measurement methods and the measurement results that will be used.

Measurements can be physical, environmental, economic, operational, or some other sort of measurement that is meaningful for the particular project. This section explores the theory of measurement and how it is fundamental to engineering. Measurement starts as a conceptualization and then moves from abstract concepts to definitions of the measurement method to the actual application of that method to obtain a measurement result. Each of these steps must be understood, communicated, and properly employed in order to generate usable data. In traditional engineering, direct measures are often used. In software engineering, a combination of both direct and derived measures is necessary [6*, p273].

The theory of measurement states that measurement is an attempt to describe an underlying real, empirical system. Measurement methods define activities that allocate a value or a symbol to an attribute of an entity. Attributes must then be defined in terms of the operations used to identify and measure them; that is, the measurement methods.
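As a small illustration of this distinction, the sketch below treats a counted defect total and a measured size as direct measures and computes defect density as a derived measure; the entity, attribute names, and values are hypothetical.

```python
# Direct measures: obtained by directly counting or observing an attribute of an entity.
defects_found = 42          # counted during inspection of a hypothetical component
lines_of_code = 12_400      # measured size of the same component

# Derived measure: computed from one or more direct measures.
defect_density = defects_found / (lines_of_code / 1000)   # defects per KLOC

print(f"Defect density: {defect_density:.2f} defects/KLOC")
```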