Here are some conclusions on LSI first suggested by their work, and subsequently verified by many other experiments.

• The computational cost of the SVD is significant; at the time of this writing, we know of no successful experiment with over one million documents. This has been the biggest obstacle to the widespread adoption of LSI. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection, following which the remaining documents are “folded in” as detailed in Equation (18.21); a small computational sketch of this fold-in appears after this list.

• As we reduce k, recall tends to increase, as expected.

• Most surprisingly, a value of k in the low hundreds can actually increase precision on some query benchmarks. This appears to suggest that for a suitable value of k, LSI addresses some of the challenges of synonymy.

• LSI works best in applications where there is little overlap between queries and documents.
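To make the truncation and fold-in concrete, here is a minimal NumPy sketch. The toy matrix and the new document vector are invented for illustration; only the fold-in formula itself comes from Equation (18.21), which maps a vector q into the reduced space as Sigma_k^{-1} U_k^T q.

    import numpy as np

    # Toy term-document matrix (rows: terms, columns: documents),
    # invented for this sketch; entries are raw term frequencies.
    C = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
    k = 2  # number of dimensions retained

    # C = U Sigma V^T; truncate to the k largest singular values.
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, sigma_k, V_k = U[:, :k], sigma[:k], Vt[:k, :].T

    # Best rank-k approximation of C in the Frobenius norm (Theorem 18.4).
    C_k = U_k @ np.diag(sigma_k) @ V_k.T

    # Folding in: map a new document (or query) vector q into the
    # k-dimensional space via Equation (18.21): q_k = Sigma_k^{-1} U_k^T q.
    q = np.array([1.0, 0.0, 1.0, 1.0])  # invented new document vector
    q_k = np.diag(1.0 / sigma_k) @ (U_k.T @ q)
    print(q_k)

Because folded-in documents do not change U_k or Sigma_k, adding documents in this way stays cheap, at the price of a basis computed only from the sampled subset.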
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations. Most notably (and perhaps obviously), LSI shares two basic drawbacks of vector space retrieval: there is no good way of expressing negations (find documents that contain german but not shepherd), and no way of enforcing Boolean conditions.

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.

18.5 References and further reading

Strang (1986) provides an excellent introductory overview of matrix decompositions, including the singular value decomposition. Theorem 18.4 is due to Eckart and Young (1936).
The connection between information retrieval and low-rank approximations of the term-document matrix was introduced in Deerwester et al. (1990), with a subsequent survey of results in Berry et al. (1995). Dumais (1993) and Dumais (1995) describe experiments on TREC benchmarks giving evidence that at least on some benchmarks, LSI can produce better precision and recall than standard vector-space retrieval. http://www.cs.utk.edu/˜berry/lsi++/ and http://lsi.argreenhouse.com/lsi/LSIpapers.html offer comprehensive pointers to the literature and software of LSI.
Schütze and Silverstein (1997) evaluate LSI and truncated representations of centroids for efficient K-means clustering (Section 16.4). Bast and Majumdar (2005) detail the role of the reduced dimension k in LSI and how different pairs of terms get coalesced together at differing values of k.
Applications of LSI to cross-language information retrieval (where documents in two or more different languages are indexed, and a query posed in one language is expected to retrieve documents in other languages) are developed in Berry and Young (1995) and Littman et al. (1998). LSI (referred to as LSA in more general settings) has been applied to a host of other problems in computer science, ranging from memory modeling to computer vision.

Hofmann (1999a;b) provides an initial probabilistic extension of the basic latent semantic indexing technique. A more satisfactory formal basis for a probabilistic latent variable model for dimensionality reduction is the Latent Dirichlet Allocation (LDA) model (Blei et al. 2003), which is generative and assigns probabilities to documents outside of the training set. This model is extended to a hierarchical clustering by Rosen-Zvi et al. (2004). Wei and Croft (2006) present the first large-scale evaluation of LDA, finding it to significantly outperform the query likelihood model of Section 12.2 (page 242), but not to perform quite as well as the relevance model mentioned in Section 12.4 (page 250), though the latter does additional per-query processing, unlike LDA. Teh et al. (2006) generalize further by presenting Hierarchical Dirichlet Processes, a probabilistic model which allows a group (for us, a document) to be drawn from an infinite mixture of latent topics, while still allowing these topics to be shared across documents. A brief illustrative sketch of fitting an LDA model follows.
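The text above only cites LDA; as a hedged illustration of what fitting such a model looks like in practice, here is a minimal sketch using scikit-learn's LatentDirichletAllocation (a third-party implementation, not anything used by the works cited). The corpus and all parameter choices are invented for this sketch.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Tiny invented corpus for illustration only.
    docs = ["hello and welcome", "open house",
            "hola y bienvenido", "mi casa"]

    # LDA is defined over raw term counts rather than tf-idf weights;
    # the token pattern keeps single-letter words such as "y".
    counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

    # Fit a two-topic model. Because LDA is generative, transform() can
    # also assign topic mixtures to documents outside the training set.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # one topic mixture per document
    print(doc_topics)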
Exercise 18.11
Assume you have a set of documents, each of which is in either English or Spanish. The collection is given in Figure 18.4.

  DocID   Document text
  1       hello
  2       open house
  3       mi casa
  4       hola Profesor
  5       hola y bienvenido
  6       hello and welcome

◮ Figure 18.4 Documents for Exercise 18.11.

Figure 18.5 gives a glossary relating the Spanish and English words above for your own information. This glossary is NOT available to the retrieval system.

  Spanish      English
  mi           my
  casa         house
  hola         hello
  profesor     professor
  y            and
  bienvenido   welcome

◮ Figure 18.5 Glossary for Exercise 18.11.

1. Construct the appropriate term-document matrix $C$ to use for a collection consisting of these documents. For simplicity, use raw term frequencies rather than normalized tf-idf weights.
Make sure to clearly label the dimensions of your matrix.

2. Write down the matrices $U_2$, $\Sigma_2'$ and $V_2$, and from these derive the rank-2 approximation $C_2$.

3. State succinctly what the $(i, j)$ entry in the matrix $C^T C$ represents.

4. State succinctly what the $(i, j)$ entry in the matrix $C_2^T C_2$ represents, and why it differs from that in $C^T C$. (A small computational sketch follows this exercise.)
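For readers who want to check their hand computations by machine, here is a minimal NumPy sketch for this exercise; the ordering of the terms and all variable names are choices made for this illustration, and the sketch only computes the quantities the exercise asks about rather than stating the answers.

    import numpy as np

    terms = ["hello", "open", "house", "mi", "casa", "hola",
             "profesor", "y", "bienvenido", "and", "welcome"]
    # The six documents of Figure 18.4, lowercased so that
    # "Profesor" and "profesor" coincide.
    docs = ["hello", "open house", "mi casa", "hola profesor",
            "hola y bienvenido", "hello and welcome"]

    # Item 1: term-document matrix C with raw term frequencies;
    # rows are terms, columns are documents.
    C = np.array([[d.split().count(t) for d in docs] for t in terms],
                 dtype=float)

    # Item 2: SVD of C and the rank-2 approximation C_2.
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    C2 = U[:, :2] @ np.diag(sigma[:2]) @ Vt[:2, :]

    # Items 3 and 4: document-document inner products before and
    # after the rank-2 reduction.
    print(C.T @ C)
    print(C2.T @ C2)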
19 Web search basics

In this and the following two chapters, we consider web search engines. Sections 19.1–19.4 provide some background and history to help the reader appreciate the forces that conspire to make the Web chaotic, fast-changing and (from the standpoint of information retrieval) very different from the “traditional” collections studied thus far in this book.
Sections 19.5–19.6 deal with estimating the number of documents indexed by web search engines, and the elimination of duplicate documents in web indexes, respectively. These two latter sections serve as background material for the following two chapters.

19.1 Background and history

The Web is unprecedented in many ways: unprecedented in scale, unprecedented in the almost-complete lack of coordination in its creation, and unprecedented in the diversity of backgrounds and motives of its participants. Each of these contributes to making web search different – and generally far harder – than searching “traditional” documents.

The invention of hypertext, envisioned by Vannevar Bush in the 1940’s and first realized in working systems in the 1970’s, significantly precedes the formation of the World Wide Web (which we will simply refer to as the Web) in the 1990’s.
Web usage has shown tremendous growth to the point where it now claims a good fraction of humanity as participants, by relying on a simple, open client-server design: (1) the server communicates with the client via a protocol (the http or hypertext transfer protocol) that is lightweight and simple, asynchronously carrying a variety of payloads (text, images and – over time – richer media such as audio and video files) encoded in a simple markup language called HTML (for hypertext markup language); (2) the client – generally a browser, an application within a graphical user environment – can ignore what it does not understand. Each of these seemingly innocuous features has contributed enormously to the growth of the Web, so it is worthwhile to examine them further.

The basic operation is as follows: a client (such as a browser) sends an http request to a web server.
The browser specifies a URL (for Uniform Resource Locator) such as http://www.stanford.edu/home/atoz/contact.html. In this example URL, the string http refers to the protocol to be used for transmitting the data. The string www.stanford.edu is known as the domain and specifies the root of a hierarchy of web pages (typically mirroring a filesystem hierarchy underlying the web server). In this example, /home/atoz/contact.html is a path in this hierarchy with a file contact.html that contains the information to be returned by the web server at www.stanford.edu in response to this request. The HTML-encoded file contact.html holds the hyperlinks and the content (in this instance, contact information for Stanford University), as well as formatting rules for rendering this content in a browser. Such an http request thus allows us to fetch the content of a page, something that will prove to be useful to us for crawling and indexing documents (Chapter 20). The sketch below illustrates this decomposition and the basic fetch operation.
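As an illustrative sketch (not part of the original text), the following Python fragment uses only the standard library to decompose the example URL into the pieces just described and to issue the basic http request; the example URL is the one from the text and may of course no longer be live.

    from urllib.parse import urlparse
    from urllib.request import urlopen

    url = "http://www.stanford.edu/home/atoz/contact.html"

    parts = urlparse(url)
    print(parts.scheme)  # 'http': the protocol used to transmit the data
    print(parts.netloc)  # 'www.stanford.edu': the domain
    print(parts.path)    # '/home/atoz/contact.html': path to the file

    # The basic operation: send an http request and read back the
    # HTML-encoded content, as a crawler would (Chapter 20).
    with urlopen(url) as response:
        html = response.read()
    print(len(html))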
The designers of the first browsers made it easy to view the HTML markup tags on the content of a URL. This simple convenience allowed new users to create their own HTML content without extensive training or experience; rather, they learned from example content that they liked. As they did so, a second feature of browsers supported the rapid proliferation of web content creation and usage: browsers ignored what they did not understand.
This did not, as one might fear, lead to the creation of numerous incompatible dialects of HTML. What it did promote was amateur content creators who could freely experiment with and learn from their newly created web pages without fear that a simple syntax error would “bring the system down.” Publishing on the Web became a mass activity that was not limited to a few trained programmers, but rather open to tens and eventually hundreds of millions of individuals. For most users and for most information needs, the Web quickly became the best way to supply and consume information on everything from rare ailments to subway schedules.

The mass publishing of information on the Web is essentially useless unless this wealth of information can be discovered and consumed by other users. Early attempts at making web information “discoverable” fell into two broad categories: (1) full-text index search engines such as Altavista, Excite and Infoseek and (2) taxonomies populated with web pages in categories, such as Yahoo! The former presented the user with a keyword search interface supported by inverted indexes and ranking mechanisms building on those introduced in earlier chapters.
The latter allowed the user to browse through a hierarchical tree of category labels. While this is at first blush a convenient and intuitive metaphor for finding web pages, it has a number of drawbacks: first, accurately classifying web pages into taxonomy tree nodes is for the most part a manual editorial process, which is difficult to scale with the size of the Web.