An Integrated Framework For Constraint-Based Mining Of Source Code

Posted on: 2013-06-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Shaheen Khatoon
Full Text: PDF
GTID: 1228330371980808
Subject: Computer Application Technology
Abstract/Summary:
Data mining is rapidly gaining popularity because it aims to extract useful knowledge hidden in large datasets. It is widely used in applications such as product analysis, consumer research and marketing, e-commerce, demand and supply analysis, direct marketing, the health industry, stocks and real estate, customer relationship management (CRM), the telecommunication industry, and financial-sector investment trends.
Software organizations also produce large volumes of data during the software development process, including source code, change history, execution traces, bug reports, and open source packages. Data mining can efficiently discover useful information in these large volumes of Software Engineering (SE) data, and that information can play an important role in improving software quality and productivity.
This thesis proposes an approach that applies data mining techniques to source code in order to extract useful information and use it to improve various SE tasks.
Source code contains many structural features that embody latent information which, if identified, can help software engineers develop quality software in less time. For example, many programming rules are hidden in sets of function calls, variable usage, data accesses within functions, object interactions, and so on, and they seldom exist outside the minds of developers. Violations of such rules may introduce semantic bugs that are difficult to uncover, report to bug-tracking systems, and fix unless the rules are explicitly documented and made available to the development team.
This problem motivates us to apply strong analysis techniques, combined with data mining techniques, to source code in order to find latent programming patterns that are potentially useful for identifying programming rules, detecting duplicated code and bugs, and recovering standard API usage patterns.
The approach is first demonstrated by automatically inferring programming rules. Our observation is that various program entities are inherently correlated and need to be accessed together with their correlated peers. Given a source file of a software system, the proposed approach applies association rule mining to find interesting associations within the program that can be used to form rules. In a subsequent step, these rules are used to find semantic bugs that, if left undiscovered, may seriously affect the system.
The approach is further enriched by automatic detection of cloned code. Given a source file, the proposed framework applies sequential mining to detect frequent code clones. Clones are inevitable because of the common practice of copying and pasting code to reduce programming effort. Hence a method is devised for finding duplicates and preparing reports about them. An efficient algorithm is also proposed to find another class of semantic bugs, those caused by copying and pasting code.
To further demonstrate the approach, the thesis also proposes a method for automatically retrieving relevant code from the web, to reduce programming effort and speed up software development. The programmer provides the required lines of code as a search context, and the approach automatically retrieves relevant code samples (relevant APIs) from the web or from previous projects. This process consists of two phases. In the first phase, a database is built by transforming code samples into Abstract Syntax Trees (ASTs); the mining algorithm computes all the patterns and forms the pattern database. In the second phase, developers search the pattern database for the usage pattern specific to a given task.
To make the mining process more effective and user-focused, we use constraint-based mining. In many real-life applications, users have certain phenomena in mind on which to focus the mining (for example, they may want to find a specific business rule). Without user focus, association rule mining and sequential mining generate thousands of patterns from a given dataset, most of which are of little or no interest to the user. A particular user is interested only in the small subset of patterns that satisfy some user-specified constraints, so efficient mining should produce only those patterns. In the proposed approach, the user specifies constraints on the form of the rules to be mined and on the values of interestingness measures. This combination is applied to reveal system-specific characteristics: constraint-based mining helps the mining process capture the specific system characteristics targeted by the user.
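The rule-inference and constraint steps summarized above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: each function is reduced to the set of calls in its body, only pairwise rules are mined (not a full Apriori search), and the names `mine_rules`, `find_violations`, `must_contain`, and the toy call sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(functions, min_support, min_confidence, must_contain=None):
    """Mine pairwise rules "lhs is called => rhs is called".

    functions maps a function name to the set of calls in its body.
    must_contain is a user constraint on rule form: when given, a rule
    is kept only if it mentions at least one of these call names.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for calls in functions.values():
        items = sorted(calls)
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue  # constraint on the support measure
        if must_contain and not ({a, b} & must_contain):
            continue  # constraint on the form of the rule
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


def find_violations(functions, rules):
    """Report functions that call a rule's lhs but never its rhs."""
    violations = []
    for name, calls in sorted(functions.items()):
        for lhs, rhs, _support, _confidence in rules:
            if lhs in calls and rhs not in calls:
                violations.append((name, lhs, rhs))
    return violations


# Hypothetical call sets, one per function in an imagined code base.
functions = {
    "read_config": {"open", "read", "close"},
    "write_log": {"open", "write", "close"},
    "copy_file": {"open", "read", "write", "close"},
    "load_cache": {"lock", "open", "read", "close", "unlock"},
    "update_cache": {"lock", "write", "unlock"},
    "dump_state": {"open", "write"},
}

rules = mine_rules(functions, min_support=3, min_confidence=0.8,
                   must_contain={"close"})
violations = find_violations(functions, rules)
print(rules)
print(violations)
```

On this toy input the user constraint restricts mining to rules that involve `close`, and the only function flagged is `dump_state`, which calls `open` without `close`. A real implementation would extract the call sets from parsed source and would likely mine larger itemsets; the sketch only shows how constraints on rule form and on interestingness measures prune the rules before violations are checked.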
Keywords/Search Tags: data mining, software engineering data mining, source code mining, code analysis, programming patterns, code clones, API usage patterns