Source Code Search Engine Based On Semantic Network And Big Data Mining

Posted on: 2015-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: X Hu
Full Text: PDF
GTID: 2428330476952931
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer science and technology, big data mining has become one of the hottest topics in the field. It attracts wide attention and application because efficient, reliable analysis methods can extract valuable information from volumes of data too large for humans to process by hand. With the rise of open source project hosting services such as SourceForge, Google Code, and GitHub, it is no longer difficult to collect large amounts of source code written in good coding style. Analyzing source code whose identifiers carry strong semantic information, in combination with a semantic network, and recommending usable code to users who describe their needs in natural-language words therefore has high research and application value.

Traditional source code recommendation takes two forms. The first locates code within a specified project from a natural-language description. The second matches keywords against a database. The first is limited to the current project, so it cannot recommend code from outside that project. The second can return all potentially usable code, provided the user knows the right keywords, but it has three main problems. First, a single keyword matches too much code, while several keywords together match too little. Second, abbreviated identifiers in the code do not match the keywords entered. Third, many of the results are interfaces or declarations that do not contain enough meaningful code. The keyword-matching code search of GitHub has all three problems.

To address these problems, this paper proposes an analysis and search method. In an open source project whose source code is well organized, identifiers often carry clear semantic information. By analyzing the source code, we can obtain the usage dependencies between variables. Identifiers in the source code can be split into natural-language tokens by the analysis algorithm introduced in this paper.
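To make the idea of splitting identifiers into natural-language tokens concrete, here is a minimal Python sketch. It is not the algorithm from the thesis, which the abstract does not describe. It assumes only the common Java naming conventions (camelCase, PascalCase, and UPPER_SNAKE_CASE), and the function name `split_identifier` is hypothetical.

```python
import re
from typing import List

# Assumption: identifiers follow common Java naming conventions.
# The pattern matches, in order of preference:
#   1. a run of capitals not followed by a lowercase letter (acronyms, constants)
#   2. an optional capital followed by lowercase letters (ordinary words)
#   3. a run of digits
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split a Java-style identifier into lowercase word tokens.

    Hypothetical helper for illustration; underscores and other
    separators are simply skipped because the pattern never matches them.
    """
    return [word.lower() for word in _WORD.findall(identifier)]
```

For example, `split_identifier("parseHTTPResponse")` yields `["parse", "http", "response"]`. A purely lexical split like this does not expand abbreviations such as `cfg` or `msg`; the abstract assigns that task to a separate step that analyzes the context of each identifier.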
Under the input specification defined in this paper, the user's input is parsed into structured natural-language tokens. Tokens are related to identifiers, the structure of the tokens is related to the dependencies between identifiers, and identifiers are related to source code. Together, these three relations make it possible to search source code from natural-language input.

The method proceeds in four steps. First, a large amount of source code is collected from open source hosting sites; this paper takes Java as its main programming language and mainly collects Java code. Second, identifiers are expanded by analyzing their context in the source code. Third, the usage relationships between variables are established and combined with the semantic network. Finally, given input of the form {argument keywords, method keywords, affected keywords}, the engine locates matching source code.

Finally, an experiment is designed to validate the proposed method, using test cases of varying difficulty and measuring search performance.
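The structured input {argument keywords, method keywords, affected keywords} can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `MethodEntry` record, the `RELATED` word map standing in for a semantic network, the overlap-count scoring, and the three sample entries. The abstract does not specify the engine's actual index or ranking.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class MethodEntry:
    """Hypothetical index record: word tokens per structural role."""
    name: str
    argument_tokens: FrozenSet[str]
    method_tokens: FrozenSet[str]
    affected_tokens: FrozenSet[str]


# Toy stand-in for a semantic network: each word maps to related words.
RELATED: Dict[str, FrozenSet[str]] = {
    "read": frozenset({"load", "get"}),
    "file": frozenset({"path"}),
}


def expand(word: str) -> Set[str]:
    """Return the lowercase word together with its related words."""
    word = word.lower()
    return {word} | RELATED.get(word, frozenset())


def score(entry: MethodEntry, query: Dict[str, Iterable[str]]) -> int:
    """Count query keywords that match the entry in the same role."""
    roles = (
        ("argument", entry.argument_tokens),
        ("method", entry.method_tokens),
        ("affected", entry.affected_tokens),
    )
    return sum(
        1
        for role, tokens in roles
        for word in query.get(role, ())
        if expand(word) & tokens
    )


def search(index: List[MethodEntry], query: Dict[str, Iterable[str]]) -> List[str]:
    """Rank entries by score (ties broken by name); drop non-matches."""
    scored = sorted(((score(e, query), e.name) for e in index),
                    key=lambda pair: (-pair[0], pair[1]))
    return [name for points, name in scored if points > 0]


# Made-up sample data, not taken from the thesis.
INDEX = [
    MethodEntry("loadConfig", frozenset({"file", "path"}),
                frozenset({"load", "config"}), frozenset({"settings"})),
    MethodEntry("writeLog", frozenset({"message"}),
                frozenset({"write", "log"}), frozenset({"file"})),
    MethodEntry("readLine", frozenset({"reader"}),
                frozenset({"read", "line"}), frozenset({"buffer"})),
]

QUERY = {"argument": ["file"], "method": ["read"], "affected": ["settings"]}
RESULT = search(INDEX, QUERY)  # ['loadConfig', 'readLine']
```

In this toy data, `writeLog` is not returned even though it contains the word `file`, because the word appears in the affected role rather than the argument role. This is the kind of structural constraint that plain keyword matching cannot express.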
Keywords/Search Tags: big data analysis, code recommendation, program analysis, semantic network