Lomonosov Moscow State University
Faculty of Computational Mathematics and Cybernetics
Department of Algorithmic Languages

Report
Code clone detection and its use

Asiryan Alexander Kamoevich
Group 524
Moscow, 2017

Contents
INTRODUCTION
BACKGROUND
  Clone types
  Code clone detection approaches
  Program Dependence Graph
PDG GENERATION
CLONE DETECTION
  PDGs' splitting
  Fast checks
  Metrics based clone detection
  Slice based clone detection
  Tree based clone detection
  Differences of clone detection methods
  Filtration
AUTOMATIC CLONE GENERATION FOR TESTING
CONCLUSION
BIBLIOGRAPHY

1. INTRODUCTION
Software developers often reuse the same fragments of code many times, making small modifications. Tight deadlines usually increase copy-paste activity, which increases the number of code clones. Code cloning can lead to many semantic errors. For example, a developer can forget to rename a variable after a copy-paste. Software that contains many clones is likely to contain many mistakes and to be of low quality.
According to different studies, up to 20 percent of the source code of a software project can consist of clones. Clone detection tools are widely used:
- during software development, to avoid mistakes and improve quality;
- for automatic refactoring;
- for code size optimization;
- for detection of semantic mistakes.

The goal of the research is to introduce an LLVM-based code clone detection framework.
In the first stage of the tool's work, PDGs are generated for each function in each source file of the project. They are constructed from the intermediate representation of LLVM bitcode at compilation time of the project. This approach makes it possible to generate the PDGs of the project very quickly and without analyzing the source code a second time. The second stage is responsible for splitting the PDGs into small subgraphs. The third stage analyzes the PDGs to detect code clones.
It contains a number of new algorithms for detecting similar subgraphs. Thanks to the use of combined algorithms, the tool scales to the analysis of millions of lines of source code. The last stage is filtering of the results.

2. BACKGROUND

2.1. Clone types
There are three basic types of clones. The first type (T1) is code fragments that are identical except for variations in whitespace (possibly also in layout) and comments. The second type (T2) is structurally/syntactically identical code fragments that differ only in identifiers, literals, types, layout and comments. The third type (T3) is copied fragments of code with further modifications: statements can be changed, added or removed, in addition to variations in identifiers, literals, types, layout and comments.

2.2. Code clone detection approaches
There are five basic approaches to code clone detection.
1. Methods based on the textual approach treat the source code as text and try to find equal substrings. These substrings are clones. When all clones are found, clones located next to each other can be combined into one. Mostly T1 clones are detected.
2. In the lexical approach the source code is parsed into a sequence of tokens. Then the longest common subsequence is determined. There are a few efficient clone detection algorithms based on the parameterized suffix tree. Another interesting method transforms Java code into an intermediate representation and compares the transformed code instead of the original source. Algorithms of this kind mostly find T1 and T2 clones.
3. The next is the syntactic approach.
The algorithms work on the Abstract Syntax Tree (AST). In this case clones are matching subtrees of the AST. Some algorithms directly compare two ASTs to find common subtrees. Another algorithm constructs vectors for AST subtrees and compares them. Algorithms based on this approach find all three types of clones.
4. Metrics-based algorithms are widely used for clone detection. Algorithms based on this method compute a number of metrics for a code fragment and compare them. Usually these metrics are computed for the AST and the Program Dependence Graph (PDG).
Another method clusters the computed metrics using neural networks. Metrics-based algorithms have better performance than AST or PDG comparison algorithms, but lower accuracy.
5. The last is the semantic approach. The source code is parsed into a PDG. The nodes of the PDG are program instructions. The edges of the PDG are dependences between instructions. PDG-based algorithms try to find maximal isomorphic subgraphs for a pair of PDGs. All such algorithms are approximate, because the maximal isomorphic subgraph detection problem is NP-hard. PDG-based methods have high accuracy but low performance.

2.3. Program Dependence Graph
The program is represented as a program dependence graph (PDG). Like the Abstract Syntax Tree (AST) and the Control Flow Graph (CFG), it is one of the most common code representations. It shows the dependencies between statements and predicates as a directed graph. The advantage of the PDG over other standard representations such as the CFG and the AST is that the PDG shows data dependencies explicitly, in contrast to the CFG, whereas in the AST control flow is present only implicitly.

3. PDG GENERATION
PDGs for the project are generated from the LLVM intermediate representation, called bitcode. A separate LLVM pass is added to generate these graphs.
This has several advantages. The graphs are generated at compilation time of the project, which makes it possible to construct graphs efficiently for large-scale projects (millions of lines of source code). The vertices of a PDG are LLVM bitcode instructions. The edges are obtained from the LLVM use-def, alias and control flow analyses. Vertices that have no edges are removed, after which the optimized PDGs are stored to files.

Edges indicate dependencies between instructions and are of two kinds: data and control. Edges responsible for control flow are control-dependence edges; they are directed from the instruction that transfers control to the instruction to which control is transferred. They can be built in three ways:
1. Only edges that represent transitions between basic blocks.
2. Edges are constructed not only between basic blocks but also within them, between successive instructions.
3. Edges are constructed between basic blocks, but are drawn to all the instructions of the basic block, not just the first.

Edges showing data dependence are constructed in two ways:
1. use-def analysis: the edges represent the relation between an LLVM instruction and its operands.
2. alias analysis: there are three types of relationships:
- True dependence: the first instruction writes to memory that the second instruction then reads.
- Anti-dependence: the first instruction reads memory that the second instruction then writes.
- Output dependence: the first instruction writes to memory that the second instruction also writes.
In addition, there must be no statement that overwrites the memory used by the first instruction and that may lie on an execution path between the first and second instructions.
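
The three memory dependence types can be illustrated with a minimal Python sketch. It is not the tool's LLVM pass. It assumes straight-line code (a single execution path) and replaces alias analysis with exact equality of symbolic location names. The names `Access` and `memory_dependences` are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One memory access in a straight-line instruction sequence (hypothetical model)."""
    kind: str      # "read" or "write"
    location: str  # symbolic location; equality stands in for alias analysis


# (kind of first access, kind of second access) -> dependence type.
# A read followed by a read creates no dependence.
DEPENDENCE = {
    ("write", "read"): "true",
    ("read", "write"): "anti",
    ("write", "write"): "output",
}


def memory_dependences(accesses):
    """Return edges (first_index, second_index, type), directed first -> second."""
    edges = []
    for i, first in enumerate(accesses):
        for j in range(i + 1, len(accesses)):
            second = accesses[j]
            if second.location != first.location:
                continue
            dep = DEPENDENCE.get((first.kind, second.kind))
            if dep is not None:
                edges.append((i, j, dep))
            if second.kind == "write":
                # This write overwrites the location, so no later access
                # may be linked back to `first`.
                break
    return edges


example = [
    Access("write", "a"),  # 0
    Access("read", "a"),   # 1
    Access("write", "a"),  # 2
    Access("read", "a"),   # 3
    Access("write", "b"),  # 4
]
print(memory_dependences(example))
# [(0, 1, 'true'), (0, 2, 'output'), (1, 2, 'anti'), (2, 3, 'true')]
```

No edge links access 0 to access 3, because the write at index 2 overwrites the location in between. This is the "no overwriting statement in between" condition stated above.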
All these edges are directed from the first instruction to the second.

LLVM provides a compiler API and has a large set of optimization libraries. Because of this, many programming languages provide translation of source code to LLVM bitcode, which makes it possible to apply the developed tool to all these languages. The PDG is uniform across all supported languages, which makes it possible to detect code clones across different languages.

4. CLONE DETECTION
Clone detection is a multistage process. First the generated PDGs are loaded into memory, then four basic steps are performed. The first step is splitting the PDGs into subgraphs.
These subgraphs are considered potential clones of each other. The second step is the application of fast check algorithms. These algorithms have linear complexity and try to prove that the considered pair of PDGs cannot have sufficiently large isomorphic subgraphs. The third step is maximal isomorphic subgraph detection. New algorithms, based on slices, metrics and trees, are proposed for maximal isomorphic subgraph detection. The fourth step is filtration of the obtained pairs of maximal isomorphic subgraphs. The last step is printing the source code that corresponds to the isomorphic subgraphs as clones.

4.1. PDGs' splitting
Three methods are implemented for splitting.