14.6 Nonparametric or Rank Correlation

It is precisely the uncertainty in interpreting the significance of the linear correlation coefficient $r$ that leads us to the important concepts of nonparametric or rank correlation.
As before, we are given $N$ pairs of measurements $(x_i, y_i)$. Before, difficulties arose because we did not necessarily know the probability distribution function from which the $x_i$'s or $y_i$'s were drawn.

The key concept of nonparametric correlation is this: If we replace the value of each $x_i$ by the value of its rank among all the other $x_i$'s in the sample, that is, $1, 2, 3, \ldots, N$, then the resulting list of numbers will be drawn from a perfectly known distribution function, namely uniformly from the integers between 1 and $N$, inclusive. Better than uniformly, in fact, since if the $x_i$'s are all distinct, then each integer will occur precisely once. If some of the $x_i$'s have identical values, it is conventional to assign to all these "ties" the mean of the ranks that they would have had if their values had been slightly different.
This midrank will sometimes be an integer, sometimes a half-integer. In all cases the sum of all assigned ranks will be the same as the sum of the integers from 1 to $N$, namely $\frac{1}{2}N(N+1)$.

Of course we do exactly the same procedure for the $y_i$'s, replacing each value by its rank among the other $y_i$'s in the sample.

Now we are free to invent statistics for detecting correlation between uniform sets of integers between 1 and $N$, keeping in mind the possibility of ties in the ranks. There is, of course, some loss of information in replacing the original numbers by ranks.
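The midrank convention described above can be illustrated with a minimal sketch. The function name `midranks`, the 0-based `double` arrays, and the quadratic counting loop are all assumptions made for brevity here; they are not the book's `crank` routine, which works on sorted data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the book's crank): assign midranks to v[0..n-1],
   store them in rank[0..n-1], and return the sum of all assigned ranks.
   An element with `less` smaller values and `equal` equal values (itself
   included) shares ranks less+1 .. less+equal, whose mean is
   less + (equal + 1)/2. Cost is O(n^2); a sort-based version is faster. */
double midranks(const double v[], double rank[], size_t n)
{
    double sum = 0.0;
    assert(v != NULL && rank != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t less = 0, equal = 0;
        for (size_t j = 0; j < n; j++) {
            if (v[j] < v[i]) less++;
            else if (v[j] == v[i]) equal++;   /* counts element i itself */
        }
        rank[i] = (double)less + 0.5 * ((double)equal + 1.0);
        sum += rank[i];
    }
    return sum;
}
```

For the values 3, 7, 7, 10 this assigns the ranks 1, 2.5, 2.5, 4, whose sum is 10, the same as $\frac{1}{2}N(N+1)$ for $N = 4$.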
We could construct some rather artificial examples where a correlation could be detected parametrically (e.g., in the linear correlation coefficient $r$), but could not be detected nonparametrically. Such examples are very rare in real life, however, and the slight loss of information in ranking is a small price to pay for a very major advantage: When a correlation is demonstrated to be present nonparametrically, then it is really there! (That is, to a certainty level that depends on the significance chosen.) Nonparametric correlation is more robust than linear correlation, more resistant to unplanned defects in the data, in the same sort of sense that the median is more robust than the mean. For more on the concept of robustness, see §15.7.

As always in statistics, some particular choices of a statistic have already been invented for us and consecrated, if not beatified, by popular use. We will discuss two, the Spearman rank-order correlation coefficient ($r_s$), and Kendall's tau ($\tau$).

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5). Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software. Permission is granted for internet users to make one paper copy for their own personal use. Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books, diskettes, or CDROMs visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to trade@cup.cam.ac.uk (outside North America).

Spearman Rank-Order Correlation Coefficient

Let $R_i$ be the rank of $x_i$ among the other $x$'s, $S_i$ be the rank of $y_i$ among the other $y$'s, ties being assigned the appropriate midrank as described above. Then the rank-order correlation coefficient is defined to be the linear correlation coefficient of the ranks, namely,

$$ r_s = \frac{\sum_i (R_i - \overline{R})(S_i - \overline{S})}{\sqrt{\sum_i (R_i - \overline{R})^2}\,\sqrt{\sum_i (S_i - \overline{S})^2}} \tag{14.6.1} $$

The significance of a nonzero value of $r_s$ is tested by computing

$$ t = r_s \sqrt{\frac{N - 2}{1 - r_s^2}} \tag{14.6.2} $$

which is distributed approximately as Student's distribution with $N - 2$ degrees of freedom. A key point is that this approximation does not depend on the original distribution of the $x$'s and $y$'s; it is always the same approximation, and always pretty good.

It turns out that $r_s$ is closely related to another conventional measure of nonparametric correlation, the so-called sum squared difference of ranks, defined as

$$ D = \sum_{i=1}^{N} (R_i - S_i)^2 \tag{14.6.3} $$

(This $D$ is sometimes denoted $D^{**}$, where the asterisks are used to indicate that ties are treated by midranking.)

When there are no ties in the data, then the exact relation between $D$ and $r_s$ is

$$ r_s = 1 - \frac{6D}{N^3 - N} \tag{14.6.4} $$

When there are ties, then the exact relation is slightly more complicated: Let $f_k$ be the number of ties in the $k$th group of ties among the $R_i$'s, and let $g_m$ be the number of ties in the $m$th group of ties among the $S_i$'s. Then it turns out that

$$ r_s = \frac{1 - \dfrac{6}{N^3 - N}\left[ D + \dfrac{1}{12}\sum_k (f_k^3 - f_k) + \dfrac{1}{12}\sum_m (g_m^3 - g_m) \right]}{\left[ 1 - \dfrac{\sum_k (f_k^3 - f_k)}{N^3 - N} \right]^{1/2} \left[ 1 - \dfrac{\sum_m (g_m^3 - g_m)}{N^3 - N} \right]^{1/2}} \tag{14.6.5} $$

holds exactly. Notice that if all the $f_k$'s and all the $g_m$'s are equal to one, meaning that there are no ties, then equation (14.6.5) reduces to equation (14.6.4).

In (14.6.2) we gave a $t$-statistic that tests the significance of a nonzero $r_s$. It is also possible to test the significance of $D$ directly.
The expectation value of $D$ in the null hypothesis of uncorrelated data sets is

$$ \overline{D} = \frac{1}{6}(N^3 - N) - \frac{1}{12}\sum_k (f_k^3 - f_k) - \frac{1}{12}\sum_m (g_m^3 - g_m) \tag{14.6.6} $$

its variance is

$$ \mathrm{Var}(D) = \frac{(N - 1)N^2(N + 1)^2}{36} \left[ 1 - \frac{\sum_k (f_k^3 - f_k)}{N^3 - N} \right] \left[ 1 - \frac{\sum_m (g_m^3 - g_m)}{N^3 - N} \right] \tag{14.6.7} $$

and it is approximately normally distributed, so that the significance level is a complementary error function (cf. equation 14.5.2). Of course, (14.6.2) and (14.6.7) are not independent tests, but simply variants of the same test. In the program that follows, we calculate both the significance level obtained by using (14.6.2) and the significance level obtained by using (14.6.7); their discrepancy will give you an idea of how good the approximations are. You will also notice that we break off the task of assigning ranks (including tied midranks) into a separate function, crank.

    #include <math.h>
    #include "nrutil.h"

    void spear(float data1[], float data2[], unsigned long n, float *d, float *zd,
        float *probd, float *rs, float *probrs)
    /* Given two data arrays, data1[1..n] and data2[1..n], this routine returns their
       sum-squared difference of ranks as D, the number of standard deviations by which
       D deviates from its null hypothesis expected value as zd, the two-sided significance
       level of this deviation as probd, Spearman's rank correlation rs as rs, and the
       two-sided significance level of its deviation from zero as probrs. */