Crawling AJAX by Inferring User Interface State Changes (2008)

Mesbah et al. – Crawling AJAX by Inferring User Interface State Changes (SERG, TUD-SERG-2008-022)

C4 is an AJAX site that can function as a tool for comparing the visual impression of different typefaces. C3 (online shop), C5 (sport center), and C6 (Gucci) are all single-page commercial sites with many clickables and states.

Table 1. Case objects and examples of their clickable elements.

Case: AJAX site, followed by its example clickable elements

C1: spci.st.ewi.tudelft.nl/demo/aowe/
    <span id="testspan2" class="testing">testing span 2</span>
    <a onclick="nav('l2'); return false;" href="#">Second link</a>
    <a title="Topics" href="#Topics" class="remoteleft left">Topics of Interest</a>

C2: PETSTORE
    <a class="accordionLink" href="#" id="feline01" onmouseout="this.className='accordionLink';" onmouseover="this.className='accordionLinkHover';">Hairy Cat</a>

C3: www.4launch.nl
    <div onclick="setPrefCookies('Gaming', 'DESTROY', 'DESTROY');loadHoofdCatsTree('Gaming', 1, '')"><a id="uberCatLink1" class="ubercat" href="javascript:void(0)">Gaming</a></div>
    <td onclick="open url('..producteninfo.php?productid=037631',..)">Harddisk Skin</td>

C4: www.blindtextgenerator.com
    <input type="radio" value="7" name="radioTextname" class="js-textname iradio" id="idRadioTextname-EN-li-europan"/>
    <a id="idSelectAllText" title="Select all" href="#">

C5: site.snc.tudelft.nl
    <div class="itemtitlelevel1 itemtitle" id="menuitem 189 e">organisatie</div>
    <a href="#" onclick="ajaxNews('524')">...</a>

C6: www.gucci.com (a)
    <a onclick="Shop.selectSort(this); return false" class="booties" href="#">booties</a>
    <div id="thumbnail 7" class="thumbnail highlight"><img src="...001 thumb.jpg" /><div class="darkening"/></div>

(a) http://www.gucci.com/nl/uk-english/nl/spring-summer-08/womens-shoes/

5.2 Experimental Design

Our goals in conducting the experiment include:

G1 Effectiveness: evaluating the effectiveness of obtaining high-quality results in retrieving relevant clickables, including the ones dynamically injected into the DOM,
G2 Correctness: assessing the quality and correctness of the states and static pages automatically generated,
G3 Performance: analyzing the overall performance of our approach in terms of input size versus time,
G4 Scalability: examining the capability of CRAWLJAX on real sites used in practice and the scalability in crawling sites with thousands of dynamic states and clickables.

Environment & Tool Configuration

We use a laptop with an Intel Pentium M 765 processor (1.73 GHz), 1 GB RAM, and Windows XP to run CRAWLJAX.

Configuring CRAWLJAX itself is done through a simple crawljax.properties file, which can be used to set the URL of the site to be analyzed, the tag elements CRAWLJAX should look for, the depth level, and the similarity threshold. There are also a number of other configuration parameters that can be set, such as the directory in which the generated pages should be saved.

Output

We determine the average DOM string size, number of candidate elements, number of detected clickables, number of detected states, number of generated static pages, and performance measurements for crawling and generating pages separately for each experiment object.

The actual generated linked static pages also form part of the output.

Method of Evaluation

Since other comparable tools and methods are currently not available to conduct similar experiments as with CRAWLJAX, it is difficult to define a baseline against which we can compare the results. Hence, we manually inspect the systems under examination and determine which expected behavior should form our reference baseline.

G1: For the experiment we have manually added extra clickables in different states of C1, especially in the delta updates, to explore whether clickables dynamically injected into the DOM can be found by CRAWLJAX. A reference model was created manually by clicking through the different states in a browser. In total 16 clickables were noted, of which 10 were on the top level, i.e., the index state. To constrain the reference model for C2, we chose two product categories, namely CATS and DOGS, from the five available categories. We annotated 36 elements (product items) by modifying a JavaScript method which turns the items retrieved from the server into clickables on the interface. For the four external sites (C3–C6), which have many states, it is very difficult to manually inspect and determine, for instance, the number of expected clickables and states. Therefore, for each site, we randomly selected 10 clickables in advance by noting their tag name, attributes, and XPath expression. After each crawling process, we checked the presence of the 10 elements among the list of detected clickables.

G2: After the generation process, the generated HTML files and their content are manually examined to see whether the pages are the same as the corresponding DOM states in AJAX in terms of structure, style, and content. Also, the internal linking of the static pages is manually checked. To test the clone detection ability, we have intentionally introduced a clone state into C1.

G3: We measure the time in milliseconds taken to crawl each site. We expect the crawling performance to be directly proportional to the input size, which is comprised of the average DOM string size, number of candidate elements, and number of detected clickables and states. We also measure the generation performance, which is the period taken to generate the static HTML pages from the inferred state-flow graph.

G4: To test the capability of our method in crawling real sites and coping with unknown environments, we run CRAWLJAX on four external cases C3–C6. We run CRAWLJAX with depth level 2 on C3 and C5, each having a huge state space, to examine the scalability of our approach in analyzing tens of thousands of candidate clickables and finding clickables.

5.3 Results and Evaluation

Table 2 presents the results obtained by running CRAWLJAX on the subject systems. The measurements were all read from the log file produced by CRAWLJAX at the end of each process.

G1: As can be seen in Table 2, for C1 CRAWLJAX finds all the 16 expected clickables and states with a precision and recall of 100%.

For C2, 33 elements were detected from the annotated 36. One explanation behind this difference could be the way some items are shown to the user in PETSTORE. PETSTORE uses a Catalog Browser to show a set of the total number of the product items.

The 3 missing product items could be the ones that were never shown on the interface because of the navigational flow, i.e., the order of clickables.

CRAWLJAX was able to find 95% of the expected 10 clickables (noted initially) for each of the four external sites C3–C6.

G2: The clone state introduced in C1 is correctly detected, and that is why we see 16 states being reported instead of 17. Inspection of the static pages in all cases shows that the generated pages correspond correctly to the DOM state.

G3: When comparing the results for the two internal sites, we see that it takes CRAWLJAX 14 and 26 seconds to crawl C1 and C2, respectively. As can be seen, the DOM in C2 is 5 times and the number of candidate elements 3 times higher.

In addition to the increase in DOM size and the number of candidate elements, CRAWLJAX cannot rely on the browser Back method when crawling C2. This means for every state change on the browser CRAWLJAX has to reload the application and click through to the previous state to go further. This reloading and clicking through has a negative effect on the performance. The generation time also doubles for C2 due to the increase in the input size. It is clear that the running time of CRAWLJAX increases linearly with the size of the input. We believe that the execution time of a few minutes to crawl and generate a mirror multi-page instance of an AJAX application automatically without any human intervention is very promising.
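
The G1 detection figures quoted above for the two internal sites can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the DetectionResult class and its field names are hypothetical and are not part of CRAWLJAX, and the counts are the ones stated in the text for C1 and C2. Only recall is computed, since the text does not report the number of false positives for C2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionResult:
    """Hypothetical record: reference clickables vs. those the crawler found."""
    case: str
    expected: int  # clickables in the manually built reference model
    detected: int  # reference clickables present in the crawler's output

    @property
    def recall(self) -> float:
        # Fraction of the reference clickables that were actually detected.
        return self.detected / self.expected


# Counts taken from the G1 results in the text (C1: 16 of 16, C2: 33 of 36).
results = [
    DetectionResult("C1", expected=16, detected=16),
    DetectionResult("C2", expected=36, detected=33),
]

for r in results:
    print(f"{r.case}: {r.detected}/{r.expected} clickables, recall {r.recall:.1%}")
```

Under these counts, recall for C2 comes out at roughly 92%, which corresponds to the 3 product items that were not detected.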
