An Introduction to Information Retrieval. Manning, Raghavan and Schütze (2009)
What is the overall complexity of single-link clustering for a set of points on a line?

Exercise 17.11
Prove that single-link, complete-link, and group-average clustering are monotonic in the sense defined on page 378.

Exercise 17.12
For N points, there are ≤ K^N different flat clusterings into K clusters (Section 16.2, page 356). What is the number of different hierarchical clusterings (or dendrograms) of N documents? Are there more flat clusterings or more hierarchical clusterings for given K and N?
18 Matrix decompositions and latent semantic indexing

On page 123 we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decomposition.
In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains, including collections of text documents (Section 16.6, page 372). Understanding its full potential remains an area of active research.

Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.

18.1 Linear algebra review

We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative.
The rank of a matrix is the number of linearly independent rows (or columns) in it; thus, rank(C) ≤ min{M, N}. A square r × r matrix all of whose off-diagonal entries are zero is called a diagonal matrix; its rank is equal to the number of non-zero diagonal entries. If all r diagonal entries of such a diagonal matrix are 1, it is called the identity matrix of dimension r and represented by I_r.
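To make these rank facts concrete, here is a small NumPy sketch (an illustration added in editing, not from the book); the matrices are arbitrary examples:

    import numpy as np

    # A diagonal matrix: its rank equals the number of
    # non-zero entries on the diagonal.
    D = np.diag([5.0, 2.0, 0.0])
    print(np.linalg.matrix_rank(D))    # 2

    # For a general M x N matrix, rank(C) <= min(M, N).
    C = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])    # second row = 2 * first row
    print(np.linalg.matrix_rank(C))    # 1, which is <= min(2, 3)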
For a square M × M matrix C and a vector \vec{x} that is not all zeros, the values of λ satisfying

(18.1)  C\vec{x} = λ\vec{x}

are called the eigenvalues of C. The M-vector \vec{x} satisfying Equation (18.1) for an eigenvalue λ is the corresponding right eigenvector. The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector. In a similar fashion, the left eigenvectors of C are the M-vectors \vec{y} such that

(18.2)  \vec{y}^T C = λ\vec{y}^T.

The number of non-zero eigenvalues of C is at most rank(C).

The eigenvalues of a matrix are found by solving the characteristic equation, which is obtained by rewriting Equation (18.1) in the form (C − λI_M)\vec{x} = 0. The eigenvalues of C are then the solutions of |C − λI_M| = 0, where |S| denotes the determinant of a square matrix S. The equation |C − λI_M| = 0 is an Mth order polynomial equation in λ and can have at most M roots, which are the eigenvalues of C. These eigenvalues can in general be complex, even if all entries of C are real.
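As a concrete illustration (added here, not part of the original text), the following NumPy sketch finds the eigenvalues of a small arbitrary matrix both ways: by solving the characteristic polynomial and with a standard eigenvalue routine:

    import numpy as np

    C = np.array([[3.0, 1.0],
                  [1.0, 3.0]])

    # np.poly applied to a square matrix returns the coefficients of
    # the characteristic polynomial |λI - C|: here λ^2 - 6λ + 8.
    coefficients = np.poly(C)
    print(np.roots(coefficients))    # [4. 2.]

    # The same eigenvalues from a standard solver
    # (same set; the order may differ).
    print(np.linalg.eigvals(C))      # [4. 2.]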
We now examine some further properties of eigenvalues and eigenvectors, to set up the central idea of singular value decompositions in Section 18.2 below. First, we look at the relationship between matrix-vector multiplication and eigenvalues.

✎ Example 18.1: Consider the matrix

S = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 1 \end{pmatrix}.

Clearly the matrix has rank 3, and has 3 non-zero eigenvalues λ_1 = 30, λ_2 = 20 and λ_3 = 1, with the three corresponding eigenvectors \vec{x}_1 = (1, 0, 0)^T, \vec{x}_2 = (0, 1, 0)^T and \vec{x}_3 = (0, 0, 1)^T. For each of the eigenvectors, multiplication by S acts as if we were multiplying the eigenvector by a multiple of the identity matrix; the multiple is different for each eigenvector.
Now, consider an arbitrary vector, such as \vec{v} = (2, 4, 6)^T. We can always express \vec{v} as a linear combination of the three eigenvectors of S; in the current example we have

\vec{v} = (2, 4, 6)^T = 2\vec{x}_1 + 4\vec{x}_2 + 6\vec{x}_3.

Suppose we multiply \vec{v} by S:

(18.3)  S\vec{v} = S(2\vec{x}_1 + 4\vec{x}_2 + 6\vec{x}_3)
               = 2S\vec{x}_1 + 4S\vec{x}_2 + 6S\vec{x}_3
               = 2λ_1\vec{x}_1 + 4λ_2\vec{x}_2 + 6λ_3\vec{x}_3
               = 60\vec{x}_1 + 80\vec{x}_2 + 6\vec{x}_3.

Example 18.1 shows that even though \vec{v} is an arbitrary vector, the effect of multiplication by S is determined by the eigenvalues and eigenvectors of S. Furthermore, it is intuitively apparent from Equation (18.3) that the product S\vec{v} is relatively unaffected by terms arising from the small eigenvalues of S; in our example, since λ_3 = 1, the contribution of the third term on the right-hand side of Equation (18.3) is small. In fact, if we were to completely ignore the contribution in Equation (18.3) from the third eigenvector corresponding to λ_3 = 1, then the product S\vec{v} would be computed to be (60, 80, 0)^T rather than the correct product, which is (60, 80, 6)^T; these two vectors are relatively close to each other by any of various metrics one could apply (such as the length of their vector difference).

This suggests that the effect of small eigenvalues (and their eigenvectors) on a matrix-vector product is small.
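The following NumPy sketch (an added illustration, not from the book) reproduces this computation and confirms that dropping the term for the smallest eigenvalue changes the product only slightly:

    import numpy as np

    S = np.diag([30.0, 20.0, 1.0])
    v = np.array([2.0, 4.0, 6.0])

    exact = S @ v                      # [60. 80. 6.]

    # Ignore the contribution of the smallest eigenvalue λ_3 = 1
    # by zeroing it out before multiplying.
    S_truncated = np.diag([30.0, 20.0, 0.0])
    approx = S_truncated @ v           # [60. 80. 0.]

    # The length of the vector difference is small relative to
    # the length of the exact product (about 100).
    print(np.linalg.norm(exact - approx))   # 6.0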
We will carry forward this intuition when studying matrix decompositions and low-rank approximations in Section 18.2. Before doing so, we examine the eigenvectors and eigenvalues of special forms of matrices that will be of particular interest to us. For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real.

✎ Example 18.2: Consider the real, symmetric matrix

(18.4)  S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.

From the characteristic equation |S − λI| = 0, we have the quadratic (2 − λ)^2 − 1 = 0, whose solutions yield the eigenvalues 3 and 1. The corresponding eigenvectors (1, 1)^T and (1, −1)^T are orthogonal.
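As a numerical check (an illustration added in editing), NumPy's symmetric eigensolver confirms both properties for this matrix:

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    # eigh is specialized for symmetric matrices; it returns real
    # eigenvalues in ascending order and unit-length eigenvectors
    # as the columns of the second return value.
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    print(eigenvalues)    # [1. 3.], all real

    # Eigenvectors for distinct eigenvalues are orthogonal:
    # their dot product is (numerically) zero.
    print(np.dot(eigenvectors[:, 0], eigenvectors[:, 1]))   # ~0.0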
18.1.1 Matrix decompositions

In this section we examine ways in which a square matrix can be factored into the product of matrices derived from its eigenvectors; we refer to this process as matrix decomposition. Matrix decompositions similar to the ones in this section will form the basis of our principal text-analysis technique in Section 18.3, where we will look at decompositions of non-square term-document matrices. The square decompositions in this section are simpler and can be treated with sufficient mathematical rigor to help the reader understand how such decompositions work. The detailed mathematical derivation of the more complex decompositions in Section 18.2 is beyond the scope of this book.

We begin by giving two theorems on the decomposition of a square matrix into the product of three matrices of a special form. The first of these, Theorem 18.1, gives the basic factorization of a square real-valued matrix into three factors.
The second, Theorem 18.2, applies to square symmetric matrices and is the basis of the singular value decomposition described in Theorem 18.3.

Theorem 18.1. (Matrix diagonalization theorem) Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

(18.5)  S = UΛU^{−1},

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order,

(18.6)  Λ = diag(λ_1, λ_2, . . . , λ_M),  λ_i ≥ λ_{i+1}.

If the eigenvalues are distinct, then this decomposition is unique.

To understand how Theorem 18.1 works, we note that U has the eigenvectors of S as columns,

(18.7)  U = (\vec{u}_1 \vec{u}_2 · · · \vec{u}_M).

Then we have

SU = S(\vec{u}_1 \vec{u}_2 · · · \vec{u}_M)
   = (λ_1\vec{u}_1 λ_2\vec{u}_2 · · · λ_M\vec{u}_M)
   = (\vec{u}_1 \vec{u}_2 · · · \vec{u}_M) Λ.

Thus, we have SU = UΛ, or S = UΛU^{−1}.

We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors.
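A short NumPy sketch (an added illustration; the matrix is an arbitrary example with M = 2 linearly independent eigenvectors) verifies the factorization numerically:

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [0.0, 3.0]])

    # The columns of U are the (right) eigenvectors of S. Note that
    # np.linalg.eig does not sort the eigenvalues; any consistent
    # column order works for the reconstruction.
    eigenvalues, U = np.linalg.eig(S)
    Lambda = np.diag(eigenvalues)

    # Reconstruct S as U Λ U^{-1}.
    S_reconstructed = U @ Lambda @ np.linalg.inv(U)
    print(np.allclose(S, S_reconstructed))   # True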