Crawling AJAX by Inferring User Interface State Changes (2008) (topic: web crawlers), page 4
Mesbah et al. – Crawling AJAX by Inferring User Interface State Changes (SERG, TUD-SERG-2008-022)

C4 is an AJAX site that can function as a tool for comparing the visual impression of different typefaces. C3 (online shop), C5 (sport center), and C6 (Gucci) are all single-page commercial sites with many clickables and states.

6 http://www.sitemaps.org/protocol.php
7 http://watij.com
8 http://developer.mozilla.org/en/docs/XULRunner/
9 http://www.mozilla.org/projects/blackwood/webclient/
10 http://jtidy.sourceforge.net
11 http://xerces.apache.org/xerces-j/
12 http://xmlbeans.apache.org
13 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
14 http://jgrapht.sourceforge.net
15 http://www.antlr.org
16 http://www.stringtemplate.org
17 http://maven.apache.org
18 http://java.sun.com/developer/releases/petstore/

5.2 Experimental Design

Our goals in conducting the experiment include:

G1 Effectiveness: evaluating the effectiveness of obtaining high-quality results in retrieving relevant clickables, including the ones dynamically injected into the DOM,
G2 Correctness: assessing the quality and correctness of the states and static pages automatically generated,
G3 Performance: analyzing the overall performance of our approach in terms of input size versus time,
G4 Scalability: examining the capability of CRAWLJAX on real sites used in practice and the scalability in crawling sites with thousands of dynamic states and clickables.

Table 1. Case objects and examples of their clickable elements.

Case C1, AJAX site: spci.st.ewi.tudelft.nl/demo/aowe/
    <span id="testspan2" class="testing">testing span 2</span>
    <a onclick="nav('l2'); return false;" href="#">Second link</a>
    <a title="Topics" href="#Topics" class="remoteleft left">Topics of Interest</a>

Case C2, AJAX site: PETSTORE
    <a class="accordionLink" href="#" id="feline01" onmouseout="this.className='accordionLink';" onmouseover="this.className='accordionLinkHover';">Hairy Cat</a>

Case C3, AJAX site: www.4launch.nl
    <div onclick="setPrefCookies('Gaming', 'DESTROY', 'DESTROY');loadHoofdCatsTree('Gaming', 1, '')"><a id="uberCatLink1" class="ubercat" href="javascript:void(0)">Gaming</a></div>
    <td onclick="open_url('..producteninfo.php?productid=037631',..)">Harddisk Skin</td>

Case C4, AJAX site: www.blindtextgenerator.com
    <input type="radio" value="7" name="radioTextname" class="js-textname iradio" id="idRadioTextname-EN-li-europan"/>
    <a id="idSelectAllText" title="Select all" href="#">

Case C5, AJAX site: site.snc.tudelft.nl
    <div class="itemtitlelevel1 itemtitle" id="menuitem_189_e">organisatie</div>
    <a href="#" onclick="ajaxNews('524')">...</a>

Case C6, AJAX site: www.gucci.com (a)
    <a onclick="Shop.selectSort(this); return false" class="booties" href="#">booties</a>
    <div id="thumbnail_7" class="thumbnail highlight"><img src="...001_thumb.jpg" /><div class="darkening"/></div>

(a) http://www.gucci.com/nl/uk-english/nl/spring-summer-08/womens-shoes/

Environment & Tool Configuration

We use a laptop with an Intel Pentium M 765 processor (1.73 GHz), 1 GB RAM, and Windows XP to run CRAWLJAX.

Configuring CRAWLJAX itself is done through a simple crawljax.properties file, which can be used to set the URL of the site to be analyzed, the tag elements CRAWLJAX should look for, the depth level, and the similarity threshold. There are also a number of other configuration parameters that can be set, such as the directory in which the generated pages should be saved.

Output

We determine the average DOM string size, number of candidate elements, number of detected clickables, number of detected states, number of generated static pages, and performance measurements for crawling and generating pages separately for each experiment object.
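As an illustration of the crawljax.properties settings described under Environment & Tool Configuration (site URL, tag elements, depth level, similarity threshold, output directory), the following minimal Python sketch parses a properties-style text into a typed configuration. The key names, the `CrawlConfig` class, the tag list, and the threshold value are all assumptions made for this example; the paper does not list the actual property names or values used by CRAWLJAX.

```python
from dataclasses import dataclass

# Hypothetical file contents. Key names and the threshold value are
# illustrative placeholders, not CRAWLJAX's documented settings.
EXAMPLE_PROPERTIES = """\
# site under analysis (case C1 from Table 1; the http:// scheme is assumed)
site.url = http://spci.st.ewi.tudelft.nl/demo/aowe/
crawl.tags = a, div, span, td, input
crawl.depth = 2
similarity.threshold = 0.9
output.dir = generated-pages
"""


@dataclass
class CrawlConfig:
    url: str
    tags: list[str]
    depth: int
    similarity_threshold: float
    output_dir: str


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def load_config(text: str) -> CrawlConfig:
    """Convert the raw string properties into typed configuration values."""
    props = parse_properties(text)
    return CrawlConfig(
        url=props["site.url"],
        tags=[tag.strip() for tag in props["crawl.tags"].split(",")],
        depth=int(props["crawl.depth"]),
        similarity_threshold=float(props["similarity.threshold"]),
        output_dir=props["output.dir"],
    )


config = load_config(EXAMPLE_PROPERTIES)
print(config)
```

Keeping these values in one file means an experiment on another case object only needs a different URL and depth level, not a code change.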
The actual generated linked static pages also form part of the output.

Method of Evaluation

Since other comparable tools and methods are currently not available to conduct similar experiments as with CRAWLJAX, it is difficult to define a baseline against which we can compare the results. Hence, we manually inspect the systems under examination and determine which expected behavior should form our reference baseline.

G1: For the experiment we have manually added extra clickables in different states of C1, especially in the delta updates, to explore whether clickables dynamically injected into the DOM can be found by CRAWLJAX. A reference model was created manually by clicking through the different states in a browser. In total 16 clickables were noted, of which 10 were on the top level, i.e., the index state. To constrain the reference model for C2, we chose two product categories, namely CATS and DOGS, from the five available categories. We annotated 36 elements (product items) by modifying a JavaScript method which turns the items retrieved from the server into clickables on the interface. For the four external sites (C3–C6), which have many states, it is very difficult to manually inspect and determine, for instance, the number of expected clickables and states. Therefore, for each site, we randomly selected 10 clickables in advance by noting their tag name, attributes, and XPath expression. After each crawling process, we checked the presence of the 10 elements among the list of detected clickables.

G2: After the generation process the generated HTML files and their content are manually examined to see whether the pages are the same as the corresponding DOM states in AJAX in terms of structure, style, and content. Also the internal linking of the static pages is manually checked. To test the clone detection ability we have intentionally introduced a clone state into C1.

G3: We measure the time in milliseconds taken to crawl each site. We expect the crawling performance to be directly proportional to the input size, which consists of the average DOM string size, number of candidate elements, and number of detected clickables and states. We also measure the generation performance, which is the period taken to generate the static HTML pages from the inferred state-flow graph.

G4: To test the capability of our method in crawling real sites and coping with unknown environments, we run CRAWLJAX on four external cases C3–C6. We run CRAWLJAX with depth level 2 on C3 and C5, each having a huge state space, to examine the scalability of our approach in analyzing tens of thousands of candidate clickables and finding clickables.

5.3 Results and Evaluation

Table 2 presents the results obtained by running CRAWLJAX on the subject systems. The measurements were all read from the log file produced by CRAWLJAX at the end of each process.

G1 As can be seen in Table 2, for C1 CRAWLJAX finds all the 16 expected clickables and states with a precision and recall of 100%.

For C2, 33 elements were detected from the annotated 36. One explanation behind this difference could be the way some items are shown to the user in PETSTORE. PETSTORE uses a Catalog Browser to show a set of the total number of the product items.
The 3 missing product items could be the ones that were never shown on the interface because of the navigational flow, i.e., the order of clickables.

CRAWLJAX was able to find 95% of the expected 10 clickables (noted initially) for each of the four external sites C3–C6.

G2 The clone state introduced in C1 is correctly detected, and that is why we see 16 states being reported instead of 17. Inspection of the static pages in all cases shows that the generated pages correspond correctly to the DOM state.

G3 When comparing the results for the two internal sites, we see that it takes CRAWLJAX 14 and 26 seconds to crawl C1 and C2 respectively. As can be seen, the DOM in C2 is 5 times and the number of candidate elements 3 times higher. In addition to the increase in DOM size and the number of candidate elements, CRAWLJAX cannot rely on the browser Back method when crawling C2. This means that for every state change in the browser, CRAWLJAX has to reload the application and click through to the previous state to go further. This reloading and clicking through has a negative effect on the performance. The generation time also doubles for C2 due to the increase in the input size. It is clear that the running time of CRAWLJAX increases linearly with the size of the input. We believe that the execution time of a few minutes to crawl and generate a mirror multi-page instance of an AJAX application automatically, without any human intervention, is very promising.
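The reload-and-click-through fallback described for C2 can be illustrated with a small Python sketch. This is not CRAWLJAX's code: `RecordingBrowser`, `return_to_state`, `explore`, the URL, and the element names are hypothetical stand-ins that only log actions, so that the extra work caused by the missing Back method becomes visible as a count of browser actions.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingBrowser:
    """Hypothetical stand-in for an embedded browser; it only logs actions."""
    log: list = field(default_factory=list)

    def load(self, url: str) -> None:
        self.log.append(("load", url))

    def click(self, element: str) -> None:
        self.log.append(("click", element))


def return_to_state(browser, url, click_path):
    """Reach a known state again without Back: reload, then replay the clicks."""
    browser.load(url)
    for element in click_path:
        browser.click(element)


def explore(browser, url, path_to_state, candidates):
    """Click each candidate in a state, restoring the state by replay afterwards."""
    for element in candidates:
        browser.click(element)
        return_to_state(browser, url, path_to_state)


# Placeholder values for illustration only.
url = "http://example.org/app"
path_to_state = ["menu", "cats"]           # clicks leading from the index state
candidates = ["item1", "item2", "item3"]   # candidate clickables in that state

browser = RecordingBrowser()
return_to_state(browser, url, path_to_state)        # 1 load + 2 clicks
explore(browser, url, path_to_state, candidates)    # 3 x (1 click + 1 load + 2 clicks)
print(len(browser.log))  # 15 browser actions in total
```

In this sketch each candidate click in a state reached after d clicks costs d + 1 extra actions to restore the state, whereas a working Back method would need a single step, which is consistent with the negative performance effect noted for C2.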