Crawling AJAX by Inferring User Interface State Changes (2008), page 3


Moreover, the current state pointer of the state machine is also updated to this newly added state at that moment (line 22).

3.6 Processing Document Tree Deltas

After a clickable has been identified, and its corresponding state created, the crawl procedure is recursively called (line 23) to find new possible states in the changes made to the DOM tree.

Upon every new (recursive) entry into the crawl procedure, the first thing done (line 12) is computing the differences between the previous document tree and the current one, by means of an enhanced Diff algorithm [6, 19]. Such “delta updates” may be due, for example, to a server request call that injects new elements into the DOM.
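To make the notion of a delta update concrete, here is a minimal sketch that lists the elements present in the current tree but absent from the previous one. It is not the enhanced Diff algorithm of [6, 19]: it is a plain set difference over element signatures, it assumes well-formed XML input, and the function names `element_signatures` and `added_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET


def element_signatures(root):
    """Index every element of a tree by a (path, tag, attributes) signature."""
    signatures = {}

    def walk(node, path):
        key = (path, node.tag, tuple(sorted(node.attrib.items())))
        signatures[key] = node
        counts = {}  # per-tag sibling counters for positional path steps
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return signatures


def added_elements(previous_xml, current_xml):
    """Return the paths of elements found in the current tree only."""
    before = element_signatures(ET.fromstring(previous_xml))
    after = element_signatures(ET.fromstring(current_xml))
    return sorted(key[0] for key in after if key not in before)


# Hypothetical DOM snapshots taken before and after firing an event.
previous = "<body><div id='menu'><a>News</a></div></body>"
current = (
    "<body><div id='menu'><a>News</a></div>"
    "<div id='content'><span class='more'>Read more</span></div></body>"
)
delta = added_elements(previous, current)
```

In this sketch text content is ignored, so an element whose text changes while its tag, attributes and position stay the same would not show up in the delta.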

The resulting delta updates are used to find new candidate clickables (line 13), which are then further processed in a depth-first manner.

It is worth mentioning that in order to avoid a loop, a list of visited elements is maintained to exclude already checked elements in the recursive algorithm. We use the tag name, the list of attribute names and values, and the XPath expression of each element to conduct the comparison. Additionally, a depth number can be defined to constrain the depth level of the recursive function (not shown in the algorithm).

SERG  Mesbah et al. – Crawling AJAX by Inferring User Interface State Changes

3.7 Navigating the States

Upon completion of the recursive call, the browser should be put back into the state it was in before the call. Unfortunately, navigating (back and forth) through an AJAX site is not as easy as navigating a classical web site. A dynamically changed DOM state does not register itself with the browser history engine automatically, so triggering the ‘Back’ function of the browser does not bring us to the previous state. This complicates traversing the application when crawling AJAX. We distinguish two situations:

Browser History Support. It is possible to programmatically register each state change with the browser history through frameworks such as the jQuery history/remote plugin⁴ or the Really Simple History library⁵. If an AJAX application has support for the browser history (line 25), then for changing the state in the browser, we can simply use the built-in history back functionality to move backwards (line 26).

Click Through From Initial State. In case the browser history is not supported, which is the case with many AJAX applications currently, the only way to get to a previous state is by saving information about the elements and the order in which their execution results in reaching a particular state.

Once we have such information, we can reload the application (line 28) and follow and execute the elements from the initial state to the desired state. As an optimization step, we use Dijkstra’s shortest path algorithm [10] to find the shortest element execution path on the graph to a certain state (line 29).

We initially considered using the ID attribute of a clickable element to locate it again after a reload of the page. When we reload the application in the browser, all the internal objects are replaced by new ones, and the ID attribute would be a way to follow the path to a certain state by clicking on those elements whose IDs have been saved in the state machine. Soon we realized that, firstly, not all AJAX sites assign ID attributes to the elements and, secondly, if IDs are provided, they are not always persistent, i.e., they are dynamically set and can change with each reload.

To overcome these challenges, we adopt XPath to provide a better, more reliable, and persistent element identification mechanism.
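The click-through strategy described above can be sketched as a shortest-path search over the state-flow graph, where each edge carries the XPath expression of the element whose execution causes the transition. This is an illustrative sketch only: the graph layout, the state names, the unit cost per element execution, and the function `shortest_click_path` are assumptions, not the tool's actual data structures.

```python
import heapq


def shortest_click_path(edges, start, target):
    """Dijkstra's algorithm over a state-flow graph.

    edges maps a state name to a list of (xpath, next_state) pairs; every
    element execution is given a cost of 1 (an assumption of this sketch).
    Returns the XPath expressions to fire in order, or None if unreachable.
    """
    best = {start: 0}
    queue = [(0, start, [])]  # (cost so far, state, clicks taken)
    while queue:
        cost, state, clicks = heapq.heappop(queue)
        if state == target:
            return clicks
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        for xpath, next_state in edges.get(state, []):
            new_cost = cost + 1
            if new_cost < best.get(next_state, float("inf")):
                best[next_state] = new_cost
                heapq.heappush(queue, (new_cost, next_state, clicks + [xpath]))
    return None


# Hypothetical state-flow graph; "index" is the state reached after a reload.
state_flow = {
    "index": [("/HTML/BODY/DIV[1]/A[1]", "s1"), ("/HTML/BODY/SPAN[3]", "s2")],
    "s1": [("/HTML/BODY/DIV[2]/A[1]", "s3")],
    "s2": [("/HTML/BODY/DIV[2]/SPAN[1]", "s1")],
    "s3": [],
}
replay = shortest_click_path(state_flow, "index", "s3")
```

After reloading the application, a crawler would locate and fire the elements in `replay` one by one to return to state `s3`.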

For each state changing element, we reverse engineer the XPath expression of that element, which gives us its exact location on the DOM (line 18). We save this expression in the state machine (line 19) and use it to find the element after a reload, persistently (line 31).

Note that because of side effects of the element execution, there is no guarantee that we reach the exact same state when we traverse a path a second time. It is, however, as close as we can get.

⁴ http://stilbuero.de/jquery/history/
⁵ http://code.google.com/p/reallysimplehistory/

    crawl MyAjaxSite {
      url: http://spci.st.ewi.tudelft.nl/aowe/;
      navigate Nav1 {
        event: type=mouseover xpath=/HTML/BODY/SPAN[3];
        event: type=click id=headline;
        ···
      }
      navigate Nav2 {
        event: type=click
               xpath="//DIV[contains(.,"Interviews")]";
        event: type=input id=article "john doe";
        event: type=click id=search;
      }
      ···
    }

Figure 4. An instance of CASL.

3.8 CASL: Crawling AJAX Specification Language

To give users control over which candidate clickables to select, we have developed a Domain Specific Language (DSL) [9] called Crawling AJAX Specification Language (CASL). Using CASL, the developer can define the elements (based on IDs and XPath expressions) to be clicked, along with the exact order in which the crawler should crawl the AJAX application.
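As a rough illustration of the information such a specification carries, the Nav2 block of Figure 4 can be written down as an ordered list of event records. The `Event` class and its validation rules are hypothetical and only mirror what the figure shows; they are not CASL's grammar or the tool's internal model.

```python
from dataclasses import dataclass
from typing import Optional

# Event types that appear in the CASL instance of Figure 4.
EVENT_TYPES = {"click", "mouseover", "input"}


@dataclass(frozen=True)
class Event:
    type: str                    # one of EVENT_TYPES
    id: Optional[str] = None     # element located by its ID attribute ...
    xpath: Optional[str] = None  # ... or by an XPath expression
    text: Optional[str] = None   # value typed in by an input event

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unsupported event type: {self.type}")
        if (self.id is None) == (self.xpath is None):
            raise ValueError("exactly one of id or xpath must be given")


# Nav2 from Figure 4, in the order in which the events should be fired.
nav2 = [
    Event("click", xpath='//DIV[contains(.,"Interviews")]'),
    Event("input", id="article", text="john doe"),
    Event("click", id="search"),
]
```

Keeping the events in a list preserves the exact firing order, which is the property the specification is meant to control.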

CASL accepts different types of events. The event types currently include click, mouseover, and input.

Figure 4 shows an instance of CASL. Nav1 tells our crawler to crawl by first firing an event of type mouseover on the element with XPath /HTML/BODY/SPAN[3] and then clicking on the element with ID headline, in that order. Nav2 commands the crawler to crawl to the Interviews state, then insert the text ‘john doe’ into the input element with ID article, and afterward click on the search element. Using this DSL, the developer can take control of the way an AJAX site should be crawled.

3.9 Generating Indexable Pages

After the AJAX crawling process is finished, the created state-flow graph can be passed to the generation process, corresponding to the bottom part of Figure 3.

The first step is to establish links for the DOM states by following the outgoing edges of each state in the state-flow graph.
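Following the outgoing edges to establish links might look like the sketch below, which maps every clickable of a state (identified by its XPath) to the static page of the state it leads to. The graph layout, the `<state>.html` naming scheme, and the function `link_targets` are assumptions made for illustration.

```python
def link_targets(state_flow, directory="generated"):
    """For every state, map each outgoing clickable (by XPath) to the
    static page generated for the state that the click leads to."""
    links = {}
    for state, outgoing in state_flow.items():
        links[state] = {
            xpath: f"/{directory}/{target}.html" for xpath, target in outgoing
        }
    return links


# Hypothetical state-flow graph: state -> list of (clickable XPath, target state).
state_flow = {
    "index": [("/HTML/BODY/DIV[1]/A[1]", "state1"), ("/HTML/BODY/SPAN[3]", "state2")],
    "state1": [("/HTML/BODY/DIV[2]/SPAN[1]", "state2")],
    "state2": [],
}
links = link_targets(state_flow)
```

Each resulting value is the link target that the corresponding clickable should point to in the generated page of its source state.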

For each clickable, the element type must be examined. If the element is a hypertext link (an a-element), the href attribute is updated. In case of other types of clickables (e.g., div, span) we replace the element by a hypertext link element. The href attribute in both situations represents the link to the name and location of the generated static page.

After the linking process, each DOM object in the state-flow graph is transformed into the corresponding HTML string representation and saved on the file system in a dedicated directory (e.g., /generated/). Each generated static file represents the style, structure, and content of the AJAX application as seen in the browser, in exactly its specific state at the time of crawling.

Here, we can adhere to the Sitemap Protocol⁶, generating a valid instance of the protocol automatically after each crawling session consisting of the URLs of all generated static pages.

TUD-SERG-2008-022

4 Tool Implementation

We have implemented the concepts presented in this paper in a tool called CRAWLJAX. CRAWLJAX is released under the open source BSD license and is available for download. More information about the tool can be found on our website http://spci.st.ewi.tudelft.nl/crawljax/.

CRAWLJAX is implemented in Java. We have engineered a variety of software libraries and web tools to build and run CRAWLJAX. Here we briefly mention the main modules and libraries.

The embedded browser interface has two implementations: IE-based on Watij⁷ and Mozilla-based on XULRunner⁸. Webclient⁹ is used to access the run-time DOM and the browser history mechanism in the Mozilla browser. For the Mozilla version, the Robot component makes use of the java.awt.Robot class to generate native system input events on the embedded browser. The IE version uses an internal Robot to simulate events.

The generator uses JTidy¹⁰ to pretty-print DOM states and Xerces¹¹ to serialize the objects to HTML. In the Sitemap Generator, XMLBeans¹² generates Java objects from the Sitemap Schema¹³ which, after being used by CRAWLJAX to create new URL entries, are serialized to the corresponding valid XML instance document.

The state-flow graph is based on the JGrapht¹⁴ library. The grammar of CASL is implemented in ANTLR¹⁵. ANTLR is used to generate the necessary parsers for CASL. In addition, StringTemplate¹⁶ is used for generating the source-code from CASL. Log4j is used to optionally log various steps in the crawling process, such as the identification of DOM changes and clickables. CRAWLJAX is entirely based on Maven¹⁷ to generate, compile, test (JUnit), release, and run the application.

5 Case Studies

In order to evaluate the effectiveness, correctness, performance, and scalability of the proposed crawling method for AJAX, we have conducted a number of case studies, which are described in this section, following Yin’s guidelines for conducting case studies [24].

5.1 Subject Systems

We have selected 6 AJAX sites for our experiment as shown in Table 1. The case ID, the actual site, and a number of real clickables to illustrate the type of the elements can be seen for each case object.

Our selection criteria include the following: sites that use AJAX to change the state of the application by using JavaScript, assigning events to HTML elements, asynchronously retrieving delta updates from the server and performing partial updates on the DOM.

The first site C1 in our case study is an AJAX test site developed internally by our group using the jQuery AJAX library. Although the site is small, it is representative by having different types of dynamically set clickables as shown in Figure 1 and Table 1.

Our second case object, C2, is Sun’s Ajaxified PETSTORE 2.0¹⁸ which is built on the Java ServerFaces, and the Dojo AJAX toolkit. This open-source web application is designed to illustrate how the Java EE Platform can be used to develop an AJAX-enabled Web 2.0 application and adopts many advanced rich AJAX components.

The other four cases are all external AJAX sites and we have no access to their source-code.
