An Introduction to Information Retrieval. Manning, Raghavan (2009)
Here the query A320 returns algorithmic search results about the Airbus aircraft, together with advertisements for various non-aircraft goods numbered A320, that advertisers seek to market to those querying on this query. The lack of advertisements for the aircraft reflects the fact that few marketers attempt to sell A320 aircraft on the web.

retrieval and microeconomics, and is beyond the scope of this book. For advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).

The inherently economic motives underlying sponsored search give rise to attempts by some participants to subvert the system to their advantage. This can take many forms, one of which is known as click spam.
There is currently no universally accepted definition of click spam. It refers (as the name suggests) to clicks on sponsored search results that are not from bona fide search users. For instance, a devious advertiser may attempt to exhaust the advertising budget of a competitor by clicking repeatedly (through the use of a robotic click generator) on that competitor’s sponsored search advertisements.
Search engines face the challenge of discerning which of the clicks they observe are part of a pattern of click spam, to avoid charging their advertiser clients for such clicks.

Exercise 19.5
The Goto method ranked advertisements matching a query by bid: the highest-bidding advertiser got the top position, the second-highest the next, and so on.
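The bid-ordered ranking just described can be sketched in a few lines. This is a minimal illustration, not Goto's actual implementation; the `Ad` record, the advertiser names, and the bid values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    advertiser: str  # hypothetical advertiser name
    bid: float       # amount bid for the query keyword (made-up values)


def rank_by_bid(ads):
    """Order ads from highest bid to lowest; nothing but the bid is considered."""
    return sorted(ads, key=lambda ad: ad.bid, reverse=True)


# Hypothetical advertisers matching one query.
ads = [
    Ad("aircraft-parts", 0.40),
    Ad("unrelated-shop", 1.25),
    Ad("flight-sim", 0.75),
]

ranking = [ad.advertiser for ad in rank_by_bid(ads)]
print(ranking)
```

Note that the ordering depends only on `bid`: nothing in this sketch looks at the content of the advertisement or at how users respond to it.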
What can go wrong with this scheme when the highest-bidding advertiser places an advertisement that is irrelevant to the query? Why might an advertiser with an irrelevant advertisement bid high in this manner?

Exercise 19.6
Suppose that, in addition to bids, we had for each advertiser their click-through rate: the ratio of the historical number of times users click on their advertisement to the number of times the advertisement was shown. Suggest a modification of the Goto scheme that exploits this data to avoid the problem in Exercise 19.5 above.

Online edition (c) 2009 Cambridge UP

19.4 The search user experience

It is crucial that we understand the users of web search as well.
This is again a significant change from traditional information retrieval, where users were typically professionals with at least some training in the art of phrasing queries over a well-authored collection whose style and structure they understood well. In contrast, web search users tend to not know (or care) about the heterogeneity of web content, the syntax of query languages and the art of phrasing queries; indeed, a mainstream tool (as web search has become) should not place such onerous demands on billions of people.
A range of studies has concluded that the average number of keywords in a web search is somewhere between 2 and 3. Syntax operators (Boolean connectives, wildcards, etc.) are seldom used, again a result of the composition of the audience – “normal” people, not information scientists.

It is clear that the more user traffic a web search engine can attract, the more revenue it stands to earn from sponsored search.
How do search engines differentiate themselves and grow their traffic? Here Google identified two principles that helped it grow at the expense of its competitors: (1) a focus on relevance, specifically precision rather than recall in the first few results; (2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements. The effect of the first was simply to save users time in locating the information they sought.
The effect of the second is to provide a user experience that is extremely responsive, or at any rate not bottlenecked by the time to load the search query or results page.

19.4.1 User query needs

There appear to be three broad categories into which common web search queries can be grouped: (i) informational, (ii) navigational and (iii) transactional. We now explain these categories; it should be clear that some queries will fall in more than one of these categories, while others will fall outside them.

Informational queries seek general information on a broad topic, such as leukemia or Provence. There is typically not a single web page that contains all the information sought; indeed, users with informational queries typically try to assimilate information from multiple web pages.

Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines.
In such cases, the user’s expectation is that the very first search result should be the home page of Lufthansa. The user is not interested in a plethora of documents containing the term Lufthansa; for such a user, the best measure of user satisfaction is precision at 1.

A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file or making a reservation. In such cases, the search engine should return results listing services that provide form interfaces for such transactions.

Discerning which of these categories a query falls into can be challenging.
The category not only governs the algorithmic search results, but also the suitability of the query for sponsored search results (since the query may reveal an intent to purchase). For navigational queries, some have argued that the search engine should return only a single result or even the target web page directly. Nevertheless, web search engines have historically engaged in a battle of bragging rights over which one indexes more web pages.
Does the user really care? Perhaps not, but the media does highlight estimates (often statistically indefensible) of the sizes of various search engines. Users are influenced by these reports and thus search engines do have to pay attention to how their index sizes compare to competitors’. For informational (and to a lesser extent, transactional) queries, the user does care about the comprehensiveness of the search engine.

Figure 19.7 shows a composite picture of a web search engine including the crawler, as well as both the web page and advertisement indexes. The portion of the figure under the curved dashed line is internal to the search engine.

19.5 Index size and estimation

To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes – some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, http://www.yahoo.com/any_string returns a valid HTML page rather than an error, politely informing the user that there is no such page at Yahoo! Such a "soft 404 error" is only one example of many ways in which web servers can generate an infinite number of valid web pages.
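To make the soft 404 behaviour concrete, the following toy sketch models a server that answers every path with a success status. It is an illustration only and does not describe any real site's implementation; `REAL_PAGES`, `soft_404_server`, and the page contents are hypothetical names and data invented for the example.

```python
# Hypothetical site content: only these paths correspond to real pages.
REAL_PAGES = {"/": "<html><body>Home</body></html>"}


def soft_404_server(path):
    """Return (status, body) for a path.

    Unknown paths get status 200 and a polite "no such page" body,
    rather than a 404 error status.
    """
    if path in REAL_PAGES:
        return 200, REAL_PAGES[path]
    body = "<html><body>Sorry, there is no page at " + path + "</body></html>"
    return 200, body


# Two arbitrary strings, neither of which is a real page.
status_a, body_a = soft_404_server("/any_string")
status_b, body_b = soft_404_server("/another_string")
print(status_a, status_b, body_a != body_b)
```

Judged by status code alone, every one of the unboundedly many possible paths looks like a valid page, and here each even has a distinct body, which is why the fraction of such a site that has been indexed is hard to define.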
Indeed, some of these are malicious spider traps devised to cause a search engine’s