Note that the performance is also dependent on the CPU and memory of the machine CRAWLJAX is running on, as well as the speed of the server and network properties of the case site. C6, for instance, is slow in reloading and retrieving updates from its server, and that increases the performance measurement numbers in our experiment.

G4 CRAWLJAX was able to run smoothly on the external sites. Except for a few minor adjustments (see Section 6) we did not witness any difficulties. C3 with depth level 2 was crawled successfully in 83 minutes, resulting in 19247 examined candidate elements, 1101 detected clickables, and 1071 detected states. The generation process for the 1071 states took 13 minutes.

For C5, CRAWLJAX was able to finish the crawl process in 107 minutes on 32365 candidate elements, resulting in 1554 detected clickables and 1234 states. The generation process took 13 minutes. As expected, in both cases, increasing the depth level from 1 to 2 expands the state space greatly.

6 Discussion

6.1 Back Implementation

CRAWLJAX assumes that if the Browser Back functionality is implemented, then it is implemented correctly. An interesting observation was the fact that even though Back is implemented for some states, it is not correctly implemented, i.e., calling the Back method brings the browser to a different state than expected, which naturally confuses CRAWLJAX.

This implies that the Back method to go to a previous state is not reliable, and that using the reload and click-through method is much safer.

6.2 Constantly Changing DOM

Another interesting observation in C2 at the beginning of the experiment was that every element was seen as a clickable. This phenomenon was caused by the banner.js script, which constantly changed the DOM with textual notifications. Hence, we had to either disable this banner to conduct our experiment or use a higher similarity threshold so that the textual changes were not seen as a relevant state change for detecting clickables.
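To illustrate what such a threshold means in practice, the sketch below compares two DOM snapshots with a Levenshtein-style edit distance and treats them as the same state when the normalized similarity exceeds the threshold. The class and method names are illustrative only, not CRAWLJAX's actual API; with a threshold close to 1.0, small textual changes such as the banner notifications no longer count as new states.

```java
// Illustrative sketch (not CRAWLJAX's actual API): deciding whether a DOM change is a
// relevant state change by comparing two DOM strings against a similarity threshold.
public class DomStateComparator {

    private final double threshold; // e.g., 0.95: higher values ignore small textual changes

    public DomStateComparator(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true if the two DOM snapshots should be treated as the same state. */
    public boolean isSameState(String domBefore, String domAfter) {
        int distance = levenshtein(domBefore, domAfter);
        int maxLen = Math.max(domBefore.length(), domAfter.length());
        double similarity = maxLen == 0 ? 1.0 : 1.0 - (double) distance / maxLen;
        return similarity >= threshold;
    }

    /** Classic dynamic-programming edit distance between two strings. */
    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```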

6.3 Cookies

Cookies can also cause some problems in crawling AJAX applications. C3 uses Cookies to store the state of the application on the client. With Cookies enabled, when CRAWLJAX reloads the application to navigate to a previous state, the application does not start in the expected initial state. In this case, we had to disable Cookies to perform a correct crawling process.
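A minimal sketch of this workaround, assuming an HtmlUnit-driven embedded browser (an assumption for illustration; this page does not state which browser CRAWLJAX embeds, and the URL is a placeholder): with cookies disabled, every reload starts from the same initial state.

```java
// Sketch only: disable cookies so that client-side state stored in Cookies does not
// survive the reloads used to navigate back to previous states.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CookieFreeReload {
    public static void main(String[] args) throws Exception {
        WebClient browser = new WebClient();
        // Reloads now always start from the application's clean initial state.
        browser.getCookieManager().setCookiesEnabled(false);
        HtmlPage index = browser.getPage("http://example.com/ajax-app"); // placeholder URL
        System.out.println("Initial DOM length: " + index.asXml().length());
    }
}
```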

6.4 State Space

The set of found states and generated HTML pages is by no means complete, i.e., CRAWLJAX generates a static instance of the AJAX application but not necessarily the instance. This is partly inherent in dynamic web applications. Any crawler can only crawl and index a snapshot instance of a dynamic web application at a point in time. The order in which clickables are chosen could generate different states. Even executing the same clickable twice from a state could theoretically produce two different DOM states, depending on, for instance, server-side factors.

The number of possible states in the state space of almost any realistic web application is huge and can cause the well-known state explosion problem [23]. Just like a traditional web crawler, CRAWLJAX provides the user with a set of configurable options to constrain the state space, such as the maximum search depth level, the similarity threshold, the maximum number of states per domain, the maximum crawling time, and the option of ignoring external links and links that match some pre-defined set of regular expressions, e.g., mail:*, *.ps, *.pdf.
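The constraints listed above could be captured in a configuration object along the following lines. This is a hypothetical sketch for illustration only, not CRAWLJAX's actual configuration API, and the default values are made up.

```java
// Hypothetical crawl configuration bounding the explored state space (illustrative only).
import java.util.List;
import java.util.regex.Pattern;

public class CrawlConfiguration {
    int maxSearchDepth = 2;              // maximum click depth from the index state
    double similarityThreshold = 0.95;   // DOM similarity above which two states are merged
    int maxStatesPerDomain = 3000;       // hard cap on the number of stored states
    long maxCrawlTimeMillis = 3_600_000; // stop crawling after one hour
    boolean followExternalLinks = false; // ignore links leaving the domain
    List<Pattern> ignoredLinks = List.of(// e.g., mail links and links to .ps/.pdf documents
            Pattern.compile("^mailto:.*"),
            Pattern.compile(".*\\.ps$"),
            Pattern.compile(".*\\.pdf$"));

    /** Returns true if a candidate link should be skipped during crawling. */
    boolean isIgnored(String href) {
        return ignoredLinks.stream().anyMatch(p -> p.matcher(href).matches());
    }
}
```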

The current implementation of CRAWLJAX keeps the DOM states in memory, which can lead to a state explosion and out-of-memory exceptions with approximately 3000 states on a machine with 1 GB of RAM. As an optimization step, we intend to abstract and serialize the DOM state into a database and only keep a reference in memory. This saves much space in memory and enables us to handle many more states. With a cache mechanism, the essential states for analysis can be kept in memory while the other ones can be retrieved from the database when needed at a later stage.
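A minimal sketch of this intended optimization, assuming a JDBC-accessible database and a small LRU-style cache; the class name, schema, and the H2-style MERGE statement are illustrative choices, not part of the current CRAWLJAX implementation.

```java
// Sketch: persist serialized DOM states in a database, keep only a bounded cache in memory.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

public class DomStateStore {

    private final Connection db;

    // Simple LRU cache: keeps at most 100 DOM strings in memory.
    private final Map<String, String> cache =
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > 100;
                }
            };

    public DomStateStore(Connection db) throws SQLException {
        this.db = db;
        try (Statement s = db.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS dom_state (id VARCHAR(64) PRIMARY KEY, dom CLOB)");
        }
    }

    /** Persist a DOM snapshot; elsewhere only the state identifier needs to stay in memory. */
    public void save(String stateId, String dom) throws SQLException {
        // H2-style upsert; the exact SQL depends on the database used.
        try (PreparedStatement ps = db.prepareStatement("MERGE INTO dom_state (id, dom) VALUES (?, ?)")) {
            ps.setString(1, stateId);
            ps.setString(2, dom);
            ps.executeUpdate();
        }
        cache.put(stateId, dom);
    }

    /** Retrieve a DOM snapshot, hitting the in-memory cache first and the database otherwise. */
    public String load(String stateId) throws SQLException {
        String cached = cache.get(stateId);
        if (cached != null) {
            return cached;
        }
        try (PreparedStatement ps = db.prepareStatement("SELECT dom FROM dom_state WHERE id = ?")) {
            ps.setString(1, stateId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                String dom = rs.getString(1);
                cache.put(stateId, dom);
                return dom;
            }
        }
    }
}
```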

[Table 2. Results of running CRAWLJAX on 6 AJAX applications. Columns: Case, DOM string size (byte), Candidate Elements, Detected Clickables, Detected States, Generated Static Pages, Crawl Performance (ms), Generation Performance (ms), Depth, Tags.]

7 Applications

As mentioned in the introduction, we believe that the crawling and generating capabilities of our approach have many applications for AJAX sites.

We believe that the crawling techniques that are part of our solution can serve as a starting point and be adopted by general search engines to be able to crawl AJAX sites. General web search engines, such as Google and Yahoo!, cover only a portion of the web called the publicly indexable web, which consists of the set of web pages reachable purely by following hypertext links, ignoring forms [4] and client-side scripting.

The pages not reached this way are referred to as the hidden-web, which is estimated to comprise several millions of pages [4]. With the wide adoption of AJAX techniques that we are witnessing today, this figure will only increase. Although there has been extensive research on crawling and exposing the data behind forms [4, 8, 14, 21, 22], crawling the hidden-web induced as a result of client-side scripting in general and AJAX in particular has gained very little attention so far.

Consequently, while AJAX techniques are very promising in terms of improving rich interactivity and responsiveness [20, 5], AJAX sites themselves may very well be ignored by the search engines.

There are some industrially proposed techniques that assist in making a modern AJAX website more accessible and discoverable by general search engines. In web engineering terms, the concept behind Graceful Degradation [12] is to design and build for the latest and greatest user-agent and then add support for less capable devices, i.e., focus on the majority in the mainstream and add some support for outsiders. Graceful Degradation allows a web site to ‘step down’ in such a way as to provide a reduced level of service rather than failing completely.

A well-known example is the menu bar generated by JavaScript, which would normally be totally ignored by search engines. By using HTML list items with hypertext links inside a noscript tag, the site can degrade gracefully. The term Progressive Enhancement¹⁹ has been used as the opposite of Graceful Degradation. This technique aims for the lowest common denominator, i.e., a basic markup HTML document, and begins with a simple version of the web site, then adds enhancements and extra rich functionality for the more advanced user-agents using CSS and JavaScript.

Another way to expose the hidden-web content behind AJAX applications is by making the content available to search engines at the server side by providing it in an accessible style.

The content could, for instance, be exposed through RSS feeds. In the spirit of Progressive Enhancement, an approach called Hijax²⁰ involves building a traditional multi-page website first. Then, using unobtrusive event handlers, links and form submissions are intercepted and routed through the XMLHttpRequest object. Generating and serving both the AJAX and the multi-page version depending on the visiting user-agent is yet another approach. Another option is the use of XML/XSLT to generate indexable pages for search crawlers [3]. In these approaches, however, the server-side architecture will need to be quite modular, capable of returning delta changes as required by AJAX, as well as entire pages.

The Graceful Degradation and Progressive Enhancement approaches mentioned above constrain the use of AJAX and have limitations in the degree to which content can be exposed.

It is very hard to imagine a single-page desktop-style AJAX application that degrades into a plain HTML website using the same markup and client-side code. The more complex the AJAX functionality, the higher the cost of weaving advanced and accessible functionality into the components²¹. The server-side generation approaches increase the complexity, development costs, and maintenance effort as well. We believe our proposed solution can assist the web developer in the automatic generation of the indexable version of their AJAX application, thus significantly reducing the cost and effort of making AJAX sites more accessible to search engines.

Such an automatically built mirror site can also improve the accessibility²² of the application towards user-agents that do not support JavaScript.

¹⁹ http://hesketh.com/publications/progressive_enhancement_paving_way_for_future.html
²⁰ http://www.domscripting.com/blog/display/41
²¹ http://blogs.pathf.com/agileajax/2007/10/accessibility-a.html
²² http://bexhuff.com/node/165

When it comes to states that need textual input from the user (e.g., input forms), CASL can be very helpful to crawl and generate the corresponding state. The Full Auto Scan, however, does not have the knowledge to provide such input automatically. Therefore, we believe a combination of the three modes, taking the best of each, could provide us with a tool not only for crawling but also for automatic testing of AJAX applications.

The ability to automatically exercise all the executable elements of an AJAX site gives us a powerful test mechanism.

The crawler can be utilized to find abnormalities in AJAX sites. As an example, while conducting the case study, we noticed a number of 404 errors and exceptions on the C3 and C4 sites. Such errors can easily be detected and traced back to the elements and states causing the error state in the inferred state-flow graph. The asynchronous interaction in AJAX can cause race conditions [20] between requests and responses, and the dynamic DOM updates can also introduce new elements which can be sources of faults. Detecting such conditions can also be assisted by analyzing the generated state machine and static pages. In addition, testing AJAX sites for compatibility on different browsers (e.g., IE, Mozilla) can be automated using CRAWLJAX.
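As a sketch of the kind of abnormality detection described above: given an inferred state-flow graph, each state's DOM can be scanned for error markers and traced back to the clickable that led there. The types below are hypothetical stand-ins, not CRAWLJAX's state-flow graph API, and the error markers are examples only.

```java
// Hypothetical sketch: flag states whose DOM shows an error and report how they were reached.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AbnormalityDetector {

    /** A transition in the inferred state-flow graph: clicking 'clickable' in fromState led to toState. */
    record Edge(String fromState, String clickable, String toState) {}

    /** Scans each state's DOM for error markers and traces error states back to their causing clickables. */
    public static List<String> findErrorStates(Map<String, String> stateDoms, List<Edge> edges) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, String> state : stateDoms.entrySet()) {
            String dom = state.getValue();
            if (dom.contains("404 Not Found") || dom.contains("Exception")) { // example markers only
                for (Edge e : edges) {
                    if (e.toState().equals(state.getKey())) {
                        report.add("Error state " + e.toState() + " reached by clicking '"
                                + e.clickable() + "' in state " + e.fromState());
                    }
                }
            }
        }
        return report;
    }
}
```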

The crawling methods and the produced state machine can be applied in conducting state machine testing [1] for automatic test case derivation, verification, and validation based on pre-defined conditions for AJAX applications.

8 Related Work

The concept behind CRAWLJAX goes in the opposite direction of our earlier work RETJAX [19], in which we try to reverse-engineer a traditional multi-page website to AJAX. The work of Memon et al. [17, 18] on GUI Ripping for testing purposes is related to our work in terms of how they reverse-engineer an event-flow graph of desktop GUI applications by applying dynamic analysis techniques.

There are some industrially proposed approaches for improving the accessibility and discoverability of AJAX, as discussed in Section 7.

There has been extensive research on crawling the hidden-web behind forms [4, 7, 8, 14, 21, 22]. This is in sharp contrast with the hidden-web induced as a result of client-side scripting in general and AJAX in particular, which has gained very little attention so far. As far as we know, there are no academic research papers on crawling AJAX at the moment.

9 Concluding Remarks

Crawling AJAX is the process of turning a highly dynamic, interactive web-based system into a static mirror site, a process that is important to improve searchability, testability, and accessibility of AJAX applications.
