y (in which case the two variables are associated!):

    H(y|x) - H(y) = -\sum_{i,j} p_{ij} \ln\frac{p_{ij}/p_{i\cdot}}{p_{\cdot j}}
                  = \sum_{i,j} p_{ij} \ln\frac{p_{i\cdot}\,p_{\cdot j}}{p_{ij}}
                  \le \sum_{i,j} p_{ij} \left( \frac{p_{i\cdot}\,p_{\cdot j}}{p_{ij}} - 1 \right)
                  = \sum_{i,j} p_{i\cdot}\,p_{\cdot j} - \sum_{i,j} p_{ij} = 0

If the two variables are completely dependent, then H(x) = H(y) = H(x, y), so (14.4.16) equals unity. In fact, you can use the identities (easily proved from equations 14.4.9–14.4.12)

    H(x, y) = H(x) + H(y|x) = H(y) + H(x|y)        (14.4.18)

to show that

    U(x, y) = \frac{H(x)\,U(x|y) + H(y)\,U(y|x)}{H(x) + H(y)}        (14.4.19)

i.e., that the symmetrical measure is just a weighted average of the two asymmetrical measures (14.4.15) and (14.4.16), weighted by the entropy of each variable separately.
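The algebra behind (14.4.19) is short. As a sketch, write the two asymmetrical measures in the form the routine below computes them, U(y|x) = [H(y) - H(y|x)]/H(y) (eq. 14.4.15) and U(x|y) = [H(x) - H(x|y)]/H(x) (eq. 14.4.16). Then

    H(x) U(x|y) + H(y) U(y|x) = [H(x) - H(x|y)] + [H(y) - H(y|x)]
                              = H(x) + H(y) - [H(x, y) - H(y)] - [H(x, y) - H(x)]     (by 14.4.18)
                              = 2 [H(x) + H(y) - H(x, y)]

and dividing by H(x) + H(y) gives exactly the symmetrical measure U(x, y) of equation (14.4.17).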
Here is a program for computing all the quantities discussed, H(x), H(y), H(x|y), H(y|x), H(x, y), U(x|y), U(y|x), and U(x, y):

#include <math.h>
#include "nrutil.h"
#define TINY 1.0e-30                        /* A small number. */

void cntab2(int **nn, int ni, int nj, float *h, float *hx, float *hy,
    float *hygx, float *hxgy, float *uygx, float *uxgy, float *uxy)
/* Given a two-dimensional contingency table in the form of an integer array nn[i][j], where i
labels the x variable and ranges from 1 to ni, j labels the y variable and ranges from 1 to nj,
this routine returns the entropy h of the whole table, the entropy hx of the x distribution, the
entropy hy of the y distribution, the entropy hygx of y given x, the entropy hxgy of x given y,
the dependency uygx of y on x (eq. 14.4.15), the dependency uxgy of x on y (eq. 14.4.16),
and the symmetrical dependency uxy (eq. 14.4.17). */
{
    int i,j;
    float sum=0.0,p,*sumi,*sumj;

    sumi=vector(1,ni);
    sumj=vector(1,nj);
    for (i=1;i<=ni;i++) {                   /* Get the row totals. */
        sumi[i]=0.0;
        for (j=1;j<=nj;j++) {
            sumi[i] += nn[i][j];
            sum += nn[i][j];
        }
    }
    for (j=1;j<=nj;j++) {                   /* Get the column totals. */
        sumj[j]=0.0;
        for (i=1;i<=ni;i++)
            sumj[j] += nn[i][j];
    }
    *hx=0.0;                                /* Entropy of the x distribution, */
    for (i=1;i<=ni;i++)
        if (sumi[i]) {
            p=sumi[i]/sum;
            *hx -= p*log(p);
        }
    *hy=0.0;                                /* and of the y distribution. */
    for (j=1;j<=nj;j++)
        if (sumj[j]) {
            p=sumj[j]/sum;
            *hy -= p*log(p);
        }
    *h=0.0;                                 /* Total entropy: loop over both x and y. */
    for (i=1;i<=ni;i++)
        for (j=1;j<=nj;j++)
            if (nn[i][j]) {
                p=nn[i][j]/sum;
                *h -= p*log(p);
            }
    *hygx=(*h)-(*hx);                       /* Uses equation (14.4.18), */
    *hxgy=(*h)-(*hy);                       /* as does this. */
    *uygx=(*hy-*hygx)/(*hy+TINY);           /* Equation (14.4.15). */
    *uxgy=(*hx-*hxgy)/(*hx+TINY);           /* Equation (14.4.16). */
    *uxy=2.0*(*hx+*hy-*h)/(*hx+*hy+TINY);   /* Equation (14.4.17). */
    free_vector(sumj,1,nj);
    free_vector(sumi,1,ni);
}
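As a usage note (not from the book), a minimal calling sketch might look like the following. It assumes the Numerical Recipes nrutil routines imatrix and free_imatrix are available for unit-offset integer matrices, and the 2 x 3 table is made up purely for illustration.

#include <stdio.h>
#include "nrutil.h"

void cntab2(int **nn, int ni, int nj, float *h, float *hx, float *hy,
    float *hygx, float *hxgy, float *uygx, float *uxgy, float *uxy);

int main(void)
{
    /* A hypothetical 2 x 3 contingency table: rows index x, columns index y. */
    static int data[2][3] = { {10, 5, 2}, {3, 8, 12} };
    int **nn,i,j;
    float h,hx,hy,hygx,hxgy,uygx,uxgy,uxy;

    nn=imatrix(1,2,1,3);                    /* Unit-offset copy, as cntab2 expects. */
    for (i=1;i<=2;i++)
        for (j=1;j<=3;j++)
            nn[i][j]=data[i-1][j-1];
    cntab2(nn,2,3,&h,&hx,&hy,&hygx,&hxgy,&uygx,&uxgy,&uxy);
    printf("H(x,y)=%g  H(x)=%g  H(y)=%g  H(y|x)=%g  H(x|y)=%g\n",h,hx,hy,hygx,hxgy);
    printf("U(y|x)=%g  U(x|y)=%g  U(x,y)=%g\n",uygx,uxgy,uxy);
    free_imatrix(nn,1,2,1,3);
    return 0;
}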
CITED REFERENCES AND FURTHER READING:

Dunn, O.J., and Clark, V.A. 1974, Applied Statistics: Analysis of Variance and Regression (New York: Wiley).
Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985, SPSS-X Advanced Statistics Guide (New York: McGraw-Hill).

Fano, R.M. 1961, Transmission of Information (New York: Wiley and MIT Press), Chapter 2.

14.5 Linear Correlation

We next turn to measures of association between variables that are ordinal or continuous, rather than nominal. Most widely used is the linear correlation coefficient. For pairs of quantities (xi, yi), i = 1, . . . , N, the linear correlation coefficient r (also called the product-moment correlation coefficient, or Pearson's r) is given by the formula

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}        (14.5.1)

where, as usual, x̄ is the mean of the xi's, ȳ is the mean of the yi's.
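For concreteness, here is a minimal C sketch (not the book's routine) that evaluates r directly from equation (14.5.1) for n pairs stored in ordinary zero-offset arrays; the small guard added to the denominator is an assumption of this sketch, to avoid division by zero when one variable is constant.

#include <math.h>

/* Pearson's r, computed straight from equation (14.5.1). */
float pearson_r(const float x[], const float y[], int n)
{
    int i;
    float xbar=0.0,ybar=0.0,sxy=0.0,sxx=0.0,syy=0.0,dx,dy;

    for (i=0;i<n;i++) {                     /* Means of the x's and of the y's. */
        xbar += x[i];
        ybar += y[i];
    }
    xbar /= n;
    ybar /= n;
    for (i=0;i<n;i++) {                     /* The three sums in (14.5.1). */
        dx=x[i]-xbar;
        dy=y[i]-ybar;
        sxy += dx*dy;
        sxx += dx*dx;
        syy += dy*dy;
    }
    return sxy/(sqrt(sxx)*sqrt(syy)+1.0e-30);   /* Guard against zero variance. */
}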
The value of r lies between −1 and 1, inclusive. It takes on a value of 1, termed "complete positive correlation," when the data points lie on a perfect straight line with positive slope, with x and y increasing together. The value 1 holds independent of the magnitude of the slope. If the data points lie on a perfect straight line with negative slope, y decreasing as x increases, then r has the value −1; this is called "complete negative correlation." A value of r near zero indicates that the variables x and y are uncorrelated.

When a correlation is known to be significant, r is one conventional way of summarizing its strength.
In fact, the value of r can be translated into a statement about what residuals (root mean square deviations) are to be expected if the data are fitted to a straight line by the least-squares method (see §15.2, especially equations 15.2.13–15.2.14). Unfortunately, r is a rather poor statistic for deciding whether an observed correlation is statistically significant, and/or whether one observed correlation is significantly stronger than another. The reason is that r is ignorant of the individual distributions of x and y, so there is no universal way to compute its distribution in the case of the null hypothesis.

About the only general statement that can be made is this: If the null hypothesis is that x and y are uncorrelated, and if the distributions for x and y each have enough convergent moments ("tails" die off sufficiently rapidly), and if N is large (typically > 500), then r is distributed approximately normally, with a mean of zero and a standard deviation of 1/√N.
In that case, the (double-sided) significance of the correlation, that is, the probability that |r| should be larger than its observed value in the null hypothesis, is

    \mathrm{erfc}\!\left( \frac{|r|\sqrt{N}}{\sqrt{2}} \right)        (14.5.2)

where erfc(x) is the complementary error function, equation (6.2.8), computed by the routines erffc or erfcc of §6.2. A small value of (14.5.2) indicates that the two variables are significantly correlated.
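As an illustration (again not the book's code), the two-sided significance (14.5.2) under this large-N approximation can be evaluated with the standard C library's erfc in place of the book's erffc or erfcc routines:

#include <math.h>

/* Equation (14.5.2): probability of |r| at least this large under the
   null hypothesis of no correlation, valid only in the large-N normal
   approximation described above. */
double r_significance(double r, int n)
{
    return erfc(fabs(r)*sqrt((double)n)/sqrt(2.0));
}

For example, r_significance(0.1, 1000) is about 0.0016, small enough to reject the null hypothesis of no correlation at conventional levels.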















