Then, we have

(18.20)   $C_k = U'_k \Sigma'_k (V'_k)^T$,

where $\Sigma'_k$ is the square k × k submatrix of $\Sigma_k$ with the singular values $\sigma_1, \ldots, \sigma_k$ on the diagonal. The primary advantage of using (18.20) is to eliminate a lot of redundant columns of zeros in U and V, thereby explicitly eliminating multiplication by columns that do not affect the low-rank approximation; this version of the SVD is sometimes known as the reduced SVD or truncated SVD and is a computationally simpler representation from which to compute the low-rank approximation.

Exercise: For the matrix C in Example 18.3, write down both $\Sigma_2$ and $\Sigma'_2$.

18.4 Latent semantic indexing

We now discuss the approximation of a term-document matrix C by one of lower rank using the SVD.
The low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query-document similarity scores in this low-rank representation. This process is known as latent semantic indexing (generally abbreviated LSI). But first, we motivate such an approximation.
Recall the vector space representation of documents and queries introduced in Section 6.3 (page 120). This vector space representation enjoys a number of advantages, including the uniform treatment of queries and documents as vectors, the induced score computation based on cosine similarity, the ability to weight different terms differently, and its extension beyond document retrieval to such applications as clustering and classification.
The vector space representation suffers, however, from its inability to cope with two classic problems arising in natural languages: synonymy and polysemy. Synonymy refers to a case where two different words (say car and automobile) have the same meaning. Because the vector space representation fails to capture the relationship between synonymous terms such as car and automobile, according each a separate dimension in the vector space, the computed similarity $\vec{q} \cdot \vec{d}$ between a query $\vec{q}$ (say, car) and a document $\vec{d}$ containing both car and automobile underestimates the true similarity that a user would perceive. Polysemy, on the other hand, refers to the case where a term such as charge has multiple meanings, so that the computed similarity $\vec{q} \cdot \vec{d}$ overestimates the similarity that a user would perceive.
Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems?

Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis (LSA)), we use the SVD to construct a low-rank approximation $C_k$ to the term-document matrix, for a value of k that is far smaller than the original rank of C.
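As a concrete, if schematic, illustration of this construction, the following sketch uses Python with numpy (an assumption of this presentation, not something prescribed by the text; the names low_rank_approximation, Uk, sigma_k and Vtk are illustrative only) to form $C_k$ from the reduced SVD of Equation (18.20). For a realistically large, sparse term-document matrix one would instead use an iterative (for example, Lanczos-based) SVD routine, as in the experiments cited at the end of this section.

    import numpy as np

    def low_rank_approximation(C, k):
        """Return the rank-k approximation C_k and the truncated SVD factors."""
        # Reduced SVD: U is M x r, sigma holds the r singular values in
        # descending order, Vt is r x N, where r = min(M, N).
        U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
        # Keep the k largest singular values and the matching singular vectors,
        # i.e. the factors U'_k, Sigma'_k, (V'_k)^T of Equation (18.20).
        Uk, sigma_k, Vtk = U[:, :k], sigma[:k], Vt[:k, :]
        # Still an M x N matrix, but of rank at most k.
        Ck = Uk @ np.diag(sigma_k) @ Vtk
        return Ck, Uk, sigma_k, Vtk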
In the experimental work cited later in this section, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of $CC^T$ and $C^TC$. Note that the matrix $C_k$ is itself still an M × N matrix, irrespective of k.

Next, we use the new k-dimensional LSI representation as we did the original representation – to compute similarities between vectors. A query vector $\vec{q}$ is mapped into its representation in the LSI space by the transformation

(18.21)   $\vec{q}_k = \Sigma_k^{-1} U_k^T \vec{q}$.

Now, we may use cosine similarities as in Section 6.3.1 (page 120) to compute the similarity between a query and a document, between two documents, or between two terms.
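Continuing the sketch above, a query (or any vector over the terms) can be mapped into the k-dimensional LSI space by Equation (18.21) and then compared against the documents by cosine similarity. The function names, and the use of the columns of Vtk as document coordinates, are again illustrative assumptions; some presentations scale the document coordinates by the singular values as well, which does not change the ranking produced by cosine scores on the query side.

    import numpy as np

    def to_lsi_space(q, Uk, sigma_k):
        # Equation (18.21): q_k = Sigma_k^{-1} U_k^T q
        return (Uk.T @ q) / sigma_k

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def lsi_scores(q, Uk, sigma_k, Vtk):
        """Cosine similarity of query q against every document, in LSI space."""
        qk = to_lsi_space(q, Uk, sigma_k)
        # Column j of Vtk is taken here as the k-dimensional representation of document j.
        return [cosine(qk, Vtk[:, j]) for j in range(Vtk.shape[1])]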
Note especially that Equation (18.21) does not in any way depend on $\vec{q}$ being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be “folded in” to this representation using Equation (18.21). This allows us to incrementally add documents to an LSI representation.
Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added and will eventually require a recomputation of the LSI representation.

The fidelity of the approximation of $C_k$ to C leads us to hope that the relative values of cosine similarities are preserved: if a query is close to a document in the original space, it remains relatively close in the k-dimensional space. But this in itself is not sufficiently interesting, especially given that the sparse query vector $\vec{q}$ turns into a dense query vector $\vec{q}_k$ in the low-dimensional space. This has a significant computational cost, when compared with the cost of processing $\vec{q}$ in its native form.

Example 18.4: Consider the term-document matrix C =

            d1   d2   d3   d4   d5   d6
ship         1    0    1    0    0    0
boat         0    1    0    0    0    0
ocean        1    1    0    0    0    0
voyage       1    0    0    1    1    0
trip         0    0    0    1    0    1

Its singular value decomposition is the product of three matrices as below.
First we have U, which in this example is:

             1       2       3       4       5
ship     −0.44   −0.30    0.57    0.58    0.25
boat     −0.13   −0.33   −0.59    0.00    0.73
ocean    −0.48   −0.51   −0.37    0.00   −0.61
voyage   −0.70    0.35    0.15   −0.58    0.16
trip     −0.26    0.65   −0.41    0.58   −0.09

When applying the SVD to a term-document matrix, U is known as the SVD term matrix. The singular values are Σ =

2.16   0.00   0.00   0.00   0.00
0.00   1.59   0.00   0.00   0.00
0.00   0.00   1.28   0.00   0.00
0.00   0.00   0.00   1.00   0.00
0.00   0.00   0.00   0.00   0.39

Finally we have $V^T$, which in the context of a term-document matrix is known as the SVD document matrix:

         d1      d2      d3      d4      d5      d6
1     −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2     −0.29   −0.53   −0.19    0.63    0.22    0.41
3      0.28   −0.75    0.45   −0.20    0.12   −0.33
4      0.00    0.00    0.58    0.00   −0.58    0.58
5     −0.53    0.29    0.63    0.19    0.41   −0.22

By “zeroing out” all but the two largest singular values of Σ, we obtain $\Sigma_2$ =

2.16   0.00   0.00   0.00   0.00
0.00   1.59   0.00   0.00   0.00
0.00   0.00   0.00   0.00   0.00
0.00   0.00   0.00   0.00   0.00
0.00   0.00   0.00   0.00   0.00

From this, we compute $C_2$ =

         d1      d2      d3      d4      d5      d6
1     −1.62   −0.60   −0.44   −0.97   −0.70   −0.26
2     −0.46   −0.84   −0.30    1.00    0.35    0.65
3      0.00    0.00    0.00    0.00    0.00    0.00
4      0.00    0.00    0.00    0.00    0.00    0.00
5      0.00    0.00    0.00    0.00    0.00    0.00

Notice that the low-rank approximation, unlike the original matrix C, can have negative entries.

Examination of $C_2$ and $\Sigma_2$ in Example 18.4 shows that the last 3 rows of each of these matrices are populated entirely by zeros.
This suggests that the SVD product $U \Sigma V^T$ in Equation (18.18) can be carried out with only two rows in the representations of $\Sigma_2$ and $V^T$; we may then replace these matrices by their truncated versions $\Sigma'_2$ and $(V')^T$. For instance, the truncated SVD document matrix $(V')^T$ in this example is:

         d1      d2      d3      d4      d5      d6
1     −1.62   −0.60   −0.44   −0.97   −0.70   −0.26
2     −0.46   −0.84   −0.30    1.00    0.35    0.65

Figure 18.3 illustrates the documents in $(V')^T$ in two dimensions. Note also that $C_2$ is dense relative to C.

We may in general view the low-rank approximation of C by $C_k$ as a constrained optimization problem: subject to the constraint that $C_k$ have rank at most k, we seek a representation of the terms and documents comprising C with low Frobenius norm for the error $C - C_k$. When forced to squeeze the terms/documents down to a k-dimensional space, the SVD should bring together terms with similar co-occurrences.
This intuition suggests, then, that not only should retrieval quality not suffer too much from the dimension reduction, but in fact may improve.

[Figure 18.3: The documents of Example 18.4 reduced to two dimensions in $(V')^T$, plotted against axes dim 1 and dim 2.]

Dumais (1993) and Dumais (1995) conducted experiments with LSI on TREC documents and tasks, using the commonly used Lanczos algorithm to compute the SVD.
At the time of their work in the early 1990s, the LSI computation on tens of thousands of documents took approximately a day on one machine. In these experiments, they achieved precision at or above that of the median TREC participant. On about 20% of TREC topics their system was the top scorer, and LSI at about 350 dimensions was reportedly slightly better on average than standard vector spaces.
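To close the loop on Example 18.4, here is a short numpy sketch (again an assumption of this presentation rather than part of the original text) that rebuilds the 5 × 6 term-document matrix, zeroes out all but the two largest singular values, and checks the Frobenius norm of the error against the discarded singular values. Because the SVD is determined only up to the signs of the singular vectors, the computed factors may differ from the tabulated ones by a sign flip in some columns or rows.

    import numpy as np

    # Term-document matrix C of Example 18.4 (rows: ship, boat, ocean, voyage, trip).
    C = np.array([[1, 0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0],
                  [1, 0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 0, 1]], dtype=float)

    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    print(np.round(sigma, 2))        # approximately [2.16, 1.59, 1.28, 1.00, 0.39]

    k = 2
    sigma2 = np.concatenate([sigma[:k], np.zeros(len(sigma) - k)])  # "zeroing out"
    C2 = U @ np.diag(sigma2) @ Vt                                   # rank-2 approximation of C

    # Two-dimensional document coordinates, matching the values tabulated for the
    # truncated document matrix in the example (up to the sign indeterminacy of the SVD).
    doc_coords = np.diag(sigma[:k]) @ Vt[:k, :]
    print(np.round(doc_coords, 2))

    # The Frobenius error of the best rank-k approximation equals the square root of
    # the sum of the squares of the discarded singular values (about 1.67 here).
    print(round(float(np.linalg.norm(C - C2, "fro")), 2))
    print(round(float(np.sqrt(np.sum(sigma[k:] ** 2))), 2))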