14.4 Contingency Table Analysis of Two Distributions

In this section, and the next two sections, we deal with measures of association for two distributions. The situation is this: Each data point has two or more different quantities associated with it, and we want to know whether knowledge of one quantity gives us any demonstrable advantage in predicting the value of another quantity. In many cases, one variable will be an “independent” or “control” variable, and another will be a “dependent” or “measured” variable. Then, we want to know if the latter variable is in fact dependent on or associated with the former variable. If it is, we want to have some quantitative measure of the strength of the association. One often hears this loosely stated as the question of whether two variables are correlated or uncorrelated, but we will reserve those terms for a particular kind of association (linear, or at least monotonic), as discussed in §14.5 and §14.6.

Notice that, as in previous sections, the different concepts of significance and strength appear: The association between two distributions may be very significant even if that association is weak, if the quantity of data is large enough.

It is useful to distinguish among some different kinds of variables, with different categories forming a loose hierarchy.

• A variable is called nominal if its values are the members of some unordered set. For example, “state of residence” is a nominal variable that (in the U.S.) takes on one of 50 values; in astrophysics, “type of galaxy” is a nominal variable with the three values “spiral,” “elliptical,” and “irregular.”

• A variable is termed ordinal if its values are the members of a discrete, but ordered, set. Examples are: grade in school, planetary order from the Sun (Mercury = 1, Venus = 2, ...), number of offspring. There need not be any concept of “equal metric distance” between the values of an ordinal variable, only that they be intrinsically ordered.

• We will call a variable continuous if its values are real numbers, as are times, distances, temperatures, etc. (Social scientists sometimes distinguish between interval and ratio continuous variables, but we do not find that distinction very compelling.)

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5). Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software. Permission is granted for internet users to make one paper copy for their own personal use.
Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books, diskettes, or CDROMs, visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to trade@cup.cam.ac.uk (outside North America).

            |  1. red                |  2. green                |  ...  |
1. male     |  # of red males N11    |  # of green males N12    |  ...  |  # of males N1·
2. female   |  # of red females N21  |  # of green females N22  |  ...  |  # of females N2·
...         |  ...                   |  ...                     |  ...  |  ...
            |  # of red N·1          |  # of green N·2          |  ...  |  total # N

Figure 14.4.1. Example of a contingency table for two nominal variables, here sex and color. The row and column marginals (totals) are shown. The variables are “nominal,” i.e., the order in which their values are listed is arbitrary and does not affect the result of the contingency table analysis. If the ordering of values has some intrinsic meaning, then the variables are “ordinal” or “continuous,” and correlation techniques (§14.5-§14.6) can be utilized.

A continuous variable can always be made into an ordinal one by binning it into ranges.
If we choose to ignore the ordering of the bins, then we can turn it into a nominal variable. Nominal variables constitute the lowest type of the hierarchy, and therefore the most general. For example, a set of several continuous or ordinal variables can be turned, if crudely, into a single nominal variable, by coarsely binning each variable and then taking each distinct combination of bin assignments as a single nominal value. When multidimensional data are sparse, this is often the only sensible way to proceed.

The remainder of this section will deal with measures of association between nominal variables.
For any pair of nominal variables, the data can be displayed as a contingency table, a table whose rows are labeled by the values of one nominal variable, whose columns are labeled by the values of the other nominal variable, and whose entries are nonnegative integers giving the number of observed events for each combination of row and column (see Figure 14.4.1). The analysis of association between nominal variables is thus called contingency table analysis or crosstabulation analysis.

We will introduce two different approaches.
The first approach, based on the chi-square statistic, does a good job of characterizing the significance of association, but is only so-so as a measure of the strength (principally because its numerical values have no very direct interpretations). The second approach, based on the information-theoretic concept of entropy, says nothing at all about the significance of association (use chi-square for that!), but is capable of very elegantly characterizing the strength of an association already known to be significant.
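As an illustration of the contingency table of Figure 14.4.1, the following minimal Python sketch tallies paired nominal observations into a table of counts and computes the row and column totals. The function names `contingency_table` and `marginals`, and the sample data, are hypothetical and chosen only for this example; sorting the labels is an arbitrary choice, since the order of nominal values does not affect the analysis.

```python
from collections import Counter


def contingency_table(pairs):
    """Tally (row_value, column_value) observations into a table of counts.

    Returns (row_labels, col_labels, table), where table[i][j] is the number
    of observations with the i-th row label and the j-th column label.
    Label order is arbitrary for nominal variables; sorting is only a
    convenience (an assumption of this sketch).
    """
    pairs = [tuple(p) for p in pairs]
    row_labels = sorted({x for x, _ in pairs})
    col_labels = sorted({y for _, y in pairs})
    counts = Counter(pairs)  # missing combinations count as 0
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


def marginals(table):
    """Return (row_totals, column_totals, grand_total) of a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


# Hypothetical sample: five observations of (sex, color).
observations = [
    ("male", "red"), ("male", "green"), ("female", "red"),
    ("female", "red"), ("male", "red"),
]
rows, cols, table = contingency_table(observations)
row_totals, col_totals, total = marginals(table)
print(rows, cols)                     # ['female', 'male'] ['green', 'red']
print(table)                          # [[0, 2], [1, 2]]
print(row_totals, col_totals, total)  # [2, 3] [1, 4] 5
```

A combination that never occurs simply gets a zero entry, which is why the table is built from the full set of labels rather than from the observed pairs alone.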
Measures of Association Based on Chi-Square

Some notation first: Let $N_{ij}$ denote the number of events that occur with the first variable $x$ taking on its $i$th value, and the second variable $y$ taking on its $j$th value. Let $N$ denote the total number of events, the sum of all the $N_{ij}$'s. With a dot marking an index that has been summed over, we have

$$N_{i\cdot} = \sum_j N_{ij} \qquad N_{\cdot j} = \sum_i N_{ij} \qquad N = \sum_i N_{i\cdot} = \sum_j N_{\cdot j} \tag{14.4.1}$$

$N_{\cdot j}$ and $N_{i\cdot}$ are sometimes called the row and column totals or marginals, but we will use these terms cautiously since we can never keep straight which are the rows and which are the columns!

The null hypothesis is that the two variables $x$ and $y$ have no association. In this case, the probability of a particular value of $x$ given a particular value of $y$ should be the same as the probability of that value of $x$ regardless of $y$. Therefore, in the null hypothesis, the expected number for any $N_{ij}$, which we will denote $n_{ij}$, can be calculated from only the row and column totals,

$$\frac{n_{ij}}{N_{\cdot j}} = \frac{N_{i\cdot}}{N} \qquad \text{which implies} \qquad n_{ij} = \frac{N_{i\cdot} N_{\cdot j}}{N} \tag{14.4.2}$$

Notice that if a column or row total is zero, then the expected number for all the entries in that column or row is also zero; in that case, the never-occurring bin of $x$ or $y$ should simply be removed from the analysis.

The chi-square statistic is now given by equation (14.3.1), which, in the present case, is summed over all entries in the table,

$$\chi^2 = \sum_{i,j} \frac{(N_{ij} - n_{ij})^2}{n_{ij}} \tag{14.4.3}$$

The number of degrees of freedom is equal to the number of entries in the table (product of its row size and column size) minus the number of constraints that have arisen from our use of the data themselves to determine the $n_{ij}$. Each row total and column total is a constraint, except that this overcounts by one, since the total of the column totals and the total of the row totals both equal $N$, the total number of data points. Therefore, if the table is of size $I$ by $J$, the number of degrees of freedom is $IJ - I - J + 1$. Equation (14.4.3), along with the chi-square probability function (§6.2), now gives the significance of an association between the variables $x$ and $y$.

Suppose there is a significant association. How do we quantify its strength, so that (e.g.) we can compare the strength of one association with another? The idea here is to find some reparametrization of $\chi^2$ which maps it into some convenient interval, like 0 to 1, where the result is not dependent on the quantity of data that we happen to sample, but rather depends only on the underlying population from which the data were drawn.
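The chi-square test of association described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's routine: the names `upper_gamma_q` and `chi_square_association` are hypothetical, and the chi-square probability function of §6.2 is stood in for by a standard series and continued-fraction evaluation of the regularized upper incomplete gamma function $Q(a, x)$, with the significance taken as $Q(\nu/2, \chi^2/2)$ for $\nu$ degrees of freedom.

```python
import math


def upper_gamma_q(a, x, eps=1e-12, max_iter=1000):
    """Regularized upper incomplete gamma function Q(a, x) (sketch)."""
    if a <= 0.0 or x < 0.0:
        raise ValueError("need a > 0 and x >= 0")
    if x == 0.0:
        return 1.0
    prefactor = math.exp(-x + a * math.log(x) - math.lgamma(a))
    if x < a + 1.0:
        # Series for P(a, x); then Q = 1 - P.
        ap, term = a, 1.0 / a
        total = term
        for _ in range(max_iter):
            ap += 1.0
            term *= x / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * prefactor
    # Continued fraction for Q(a, x), evaluated by the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return prefactor * h


def chi_square_association(table):
    """Return (chi2, dof, prob) for a table of counts N_ij.

    prob is the probability of a chi-square at least this large under the
    null hypothesis of no association; small values indicate significance.
    """
    # Remove never-occurring bins (zero row or column totals).
    rows = [row for row in table if sum(row) > 0]
    keep = [j for j in range(len(table[0])) if sum(r[j] for r in rows) > 0]
    counts = [[row[j] for j in keep] for row in rows]
    row_tot = [sum(row) for row in counts]          # N_i.
    col_tot = [sum(col) for col in zip(*counts)]    # N_.j
    n = sum(row_tot)                                # N
    n_rows, n_cols = len(row_tot), len(col_tot)
    dof = n_rows * n_cols - n_rows - n_cols + 1     # IJ - I - J + 1
    if dof < 1:
        raise ValueError("need at least a 2 x 2 table after removing empty bins")
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # n_ij, eq. (14.4.2)
            chi2 += (observed - expected) ** 2 / expected  # eq. (14.4.3)
    return chi2, dof, upper_gamma_q(0.5 * dof, 0.5 * chi2)


# Hypothetical 2 x 3 table of counts.
chi2, dof, prob = chi_square_association([[10, 20, 30], [20, 20, 20]])
print(chi2, dof, prob)
```

For this sample table the expected counts in each row are 15, 20, and 25, so $\chi^2 = 16/3$ with 2 degrees of freedom. For 2 degrees of freedom the probability reduces to $e^{-\chi^2/2}$, about 0.07, so the association would not be judged significant at the 5% level.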