реферат (1185422), страница 2
Текст из файла (страница 2)
The first method splits PDG to weaklyconnected components. The second method splits PDG to subgraphs, where everypair has less than N common nodes. These two methods have two basicdisadvantages: sizes of subgraphs might have big variation; corresponding sourcecode line for one subgraph might be located far from each other. To escape thesedisadvantages, the third new method is purposed. Edges of PDG are considered asintervals.
For PDG’s corresponding source code lines, the numbers of intersectedintervals are considered. Source code is split based on those lines, which haveminimal number of intersection intervals. Corresponding subgraphs for split codefragments are considered for clone detection. Experimental results shown that thismethod of splitting allows detect about 1.5 – 2 times more clones.4.2.
Fast checksThese algorithms have liner complexity and try to prove that the pair of PDGsdoes not have big enough isomorphic subgraphs. Two nodes of PDG are similar iftheir types are the same. Fast check algorithms compare PDG’s nodes based on theirtypes. If algorithm was not able to detect enough pair of similar nodes incorresponding graphs, these graphs can’t have big enough isomorphic subgraphs.The first algorithm store PDG’s node to hash set, the key for the set is node’stype.
If the size of intersection, for the sets of corresponding pair of PDGs, is not bigenough then this pair of PDGs does not have desired isomorphic subgraphs.The second algorithm compute characteristic vector for every PDG. Elementsof this vector are count of nodes with specific type. If the Euclidean distance forcorresponding vectors of considered pair of PDGs, is to big then they do not haveenough big isomorphic subgraphs.4.3.
Metrics based clone detectionFor every node of PDG characteristic vector is constructed. The length of thevector is equal 2 * M, where M is number of all possible different types of nodes,7values are 0 or 1. Definition 1: For two vectors V1 and V2 with length N, V1 & V2 is a newvector V with length N where V[i] = V1[i] & V2[i] (1 & 1 = 1, otherwise 0), 1 ≤ i ≤ N. Definition 2: For two vectors V1 and V2 with length N, V1 | V2 is a new vectorV with length N where V[i] = V1[i] | V2[i] (0 | 0 = 0, otherwise 1), 1 ≤ i ≤ N. Definition 3: For two vectors V1 and V2 with equal length, and with elements0 or 1, (1, 2) is a number of 1 in V1 & V2 vector. Definition 4: For two vectors V1 and V2 with equal length, and with elements 0or 1, (1, 2) is a number of 1 in V1 | V2 vector. Definition 5: The similarity for two vectors V1 and V2 with equal length, andwithelements0or1,isnumber1–(1,2),where(1,2) = (1, 2) − (1, 2)1 + (1, 2) is metric for the setof N length vectors with values {0, 1}, and 0 ≤ (1, 2) ≤ 1.For the pair of PDGs, similar subgraphs are detected based in summary ofcharacteristic vectors for corresponding nodes.
Algorithm considers all pairs ofnodes for corresponding PDGs. Maximal similar nodes are added to correspondingisomorphic subgraphs and the process is continued for remaining nodes.4.4. Slice based clone detectionFor considered pair of PDGs candidate pairs of nodes are constructed.
The firstnode in the pair is from the first PDG, the second one from the second PDG. Forevery pair of nodes backward and forward slices are applied to construct isomorphicsubgraphs. Maximal isomorphic subgraphs are selected from constructed set of pairsof isomorphic subgraphs.Two approaches are developed for candidate set construction. The firstapproach for every node of the first PDG chooses the most similar node from thesecond PDG. Metrics are used for similar nodes detection. The second approach8considers vertices with maximal number of neighbors from the first PDG and triesto find identical vertices (with neighbors) from the second PDG.4.5.
Tree based clone detectionPDG graph is transformed to tree. During transformation new vertices andedges are added to save as much information from PDG as possible. Then clones aredetected as isomorphic subtrees. This approach allows apply accurate algorithms ofmaximal isomorphic subtrees detection. Tree transformation method has two basicsteps. The first step is topological sort of PDG graph. Every PDG graph containsstart vertices (vertices which do not have incoming edges). Topological sort startsfrom these vertices, back forwarded edges are removed during sorting. The secondstage considers levels of nodes, obtained by topological sort, in reverse order. Thosenodes which have more than one incoming edges are transformed.
Correspondingtrees for isomorphic PDG graphs are isomorphic.4.6. Differences of clone detection methodsPurposed three methods of code clone detection have some differences.Metrics based method has lowest computational complexity. This method hasrelatively low accuracy, some strongly modified fragments of code are not detected,but it works much faster. Slice based method is the most accurate one, it capable todetect strong modifications in the source code, but it has highest complexity. Treebased method accurately detects T1 and T2 types of clones. T3 type of clones isdetected with low accuracy. The type of algorithm for clone detection depends onspecific task.
If required clones are T1 and T2 types,thentreebasedmethod effectively detects them. For accurate detection of all clones slice basedmethod should be applied.4.7. FiltrationThe last stage in the process of code clone detection is filtration of somedetected pairs of isomorphic subgraphs. The need for a filter arises from the fact thatthe concept of code clone is defined for source code of the program, but isomorphicsubgraphs are considered as clones. Code clone must be certain sequence of lines in9the file (not necessarily consecutive, but not highly dispersed).
The purpose offiltering is to verify that the source code for corresponding isomorphic subgraphs isnot much scattered.5. AUTOMATIC CLONE GENERATION FOR TESTINGTwo approaches are purposed for automatic generation of code clones. The firstmethod uses standard transformation and optimization passes of LLVM. For everyfunction of LLVM bitcode two PDGs are constructed. The first PDG is original, itconstructed based on LLVM bitcode generated by clang. The second PDG is clone,it constructed based on transformed bitcode. Standard passes of LLVM are appliedto bitcode for transformation. The second method merge original list of PDGs for theproject to generate code clones.
Three methods are applied for PDGs’ merge. Thefirst method unions two PDGs without adding of extra edges and vertices. Thesecond method unions pair of PDGs and add extra random edges between nodes ofcorresponding graphs. The third method considers nodes of the first PDG and try tofind similar nodes in the second PDG. If similar node is detected, then all neighborsof this node are added to first PDG with their corresponding edges.
To checkcorrectness of realized clone detection algorithms original and cloned PDGs arecompared. Number of detected clones describes correctness of algorithm.6. CONCLUSIONDescribed methods have been tested on large projects such as Android OS,Firefox web-browser and Linux Kernel. The resulting tool proved scalability, itallows find places of code cloning in projects with millions lines of code. During myresearch the tool was used and improved. The first stage is almost the same as inoriginal tool except for a few important optimizations, which allows construct PDGfor very big source files. The second stage is PDG’s generation for project with somevulnerabilities. Specific defects can be found on the Internet in different databases.Each defect has an identifier that can be used to find a patch for it.
The third stage is10creating intermediate representation of defect using the information in patch aboutthe changed, deleted or added lines. Next the tool is comparing each defect’s PDGwith all PDGs in the analyzed project. Comparison method is based ongraph-subgraph isomorphism algorithm. Comparing is the search of subgraphs inlarge PDG of project’s function which are isomorphic to small defect’s PDG. This isthe NP problem, but there are algorithms having a quadratic complexity from thenumber of vertices of a larger graph.
Since the second graph is small enough the toolworks quickly. The last stage repeats last stage from original tool. Results haveshown that method of searching code clones can be applied in the field of softwaresecurity.11BIBLIOGRAPHY1. Chanchal K. Roy, James R. Cordy, Rainer Koschke Comparison andevaluation of code clone detection techniques and tools: A qualitativeapproach // Science of Computer Programming. 2009. 74. N 7. P. 470-495.2.
Susan Horwitz, Thomas Reps, David Binkley Interprocedural Slicing UsingDependence Graphs // ACM Transactions on Programming Languages andSystems (TOPLAS). 1990. 12. N 1. P. 26-60.3. Jens Krinke Identifying Similar Code with Program Dependence Graphs //WCRE '01 Proceedings of the Eighth Working Conference on ReverseEngineering (WCRE'01). DC, USA: IEEE Computer Society Washington,2001. P. 301.4. L. P.
Cordella, P. Foggia, C. Sansone, M. Vento A (Sub)Graph IsomorphismAlgorithm for Matching Large Graphs // IEEE Transactions on PatternAnalysis and Machine Intelligence. 2004. 26. N 10. P. 1367-1372.12.