Research On Key Technologies Of Web Data Extraction And Mining On Open Source Community

Posted on: 2012-11-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y X Zhu
GTID: 1268330392973881
Subject: Computer Science and Technology
Abstract/Summary:
An open source community is a platform where groups of people with common interests publish their source code under various open source licenses. With the vigorous development of Open Source Software (OSS), the Internet, and Web technology, Web-based open source communities have become the data and resource centers of OSS. Web data in open source communities contains a wealth of knowledge about software. Mining it is critical to understanding software architecture, functionality, complexity, and the rules of software evolution, and to comprehending the organization and human resource distribution of development teams. However, the characteristics of Web data, such as large scale, high heterogeneity, pervasive dynamics, diversified user groups, and abundant information, make effective data extraction and knowledge mining in open source communities a big challenge. In this context, this dissertation addresses four core issues, covering data acquisition, data mining, and knowledge application in turn, and proposes a solution for each. The research is validated on real data extracted from open source communities. The primary research tasks and results are:

(1) We study the extraction of list information from a single Web page and propose an algorithm based on indent shape. We first define the indent shape model, which consists of the indent value and the first tag of each line of HTML code. The model simplifies the document model while retaining its repeated patterns. The algorithm detects repeated patterns indirectly by identifying tandem repeated waves in the indent shape. Finally, it uses a classic HTML parsing method to extract the data. Our experimental results indicate that the algorithm achieves better extraction efficiency than existing work.

(2) We study information extraction from heterogeneous Web pages and propose a redundancy-based information extraction algorithm.
The algorithm first constructs a seed set of attribute names and their enumerable values. Then, using training pages, it locates the position of each seed attribute in each target Web page and selects the positions with the highest support as the extraction rule. Finally, it extracts data from other pages of the same target site via both the matched positions and relative tag nodes. Our experimental results indicate that the algorithm makes good use of data redundancy among different websites in the same domain and can extract multiple entity attribute values from several heterogeneous websites using only one seed attribute set. It achieves better efficiency than related work.

(3) We study entity ranking in developer collaborative networks and propose a topic- and time-sensitive algorithm. It first projects a collaborative network based on the topic. It then calculates transition probabilities based on a time factor over a classical Markov chain, and finally ranks entities by each node's ranking result. Experiments on real data from open source communities show that our algorithm is more accurate than traditional approaches and that it enables finer-grained search applications, supporting entity ranking for different technical topics in different time intervals.

(4) We study the automatic classification of open source software and propose an automatic classification algorithm based on an online evolving topic model. The online evolving topic model builds on the traditional LDA topic model and Gibbs sampling, and constructs the topic model of a software document stream online and incrementally, one time slice at a time. It obtains the topic-word distribution and the document-topic distribution via parameter estimation. Each topic, denoted by a topic label, consists of keywords and the distribution of these words. Each software document is assigned to each topic with a certain probability.
Then, the algorithm annotates each topic with semantics using a predefined topic ontology, which classifies each software document into topics with explicit semantics. We test our algorithm on ten years of software documents from open source communities, and the results indicate that it achieves better precision than related work. Our method can also serve as a reference for judging the effectiveness of topic clustering and for analyzing patterns of topic evolution. It strongly supports software search applications based on topic taxonomy.

To test the validity of our algorithms in practice, we develop an experimental platform called INFLUX for searching resources in open source communities. The platform crawls open source project index pages from open source community websites on a global scale. Using data extraction and data integration technology, it archives various open source project attributes into a local database, which is integrated into a relatively full-scale open source software information repository with source code and development history data. It supports specific mining tasks for different experiment or application settings, and the mining results can be used as supporting data for different application services. Currently, it mainly supports two core application services: developer search and software resource search. Developer search results are ranked by entity ranking in collaborative networks, which pushes developers who match the user's search requirements to the front of the result list. The software search service is mainly based on the automatic classification results of open source projects. It automatically clusters software into different functional classes, which helps users browse and search software through a taxonomy.
In the meantime, it supports the discovery of new functional classes, which helps adjust the software classes.

In conclusion, this dissertation studies the characteristics of Web data in open source communities, delves into key technologies of Web data extraction and mining, and proposes several novel algorithms supported by extensive practical experiments. It has theoretical and practical significance for the analysis of open source communities and software technology in the Internet era. It also has significant practical value in applications such as open source software search, developer search, and software topic evolution.
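
The indent shape idea in task (1) can be illustrated with a minimal Python sketch. The abstract only states that the model records the indent value and the first tag of each line and that repeated patterns appear as tandem repeats, so everything else here is an assumption: the input is taken to be pretty-printed HTML, and the function names `indent_shape` and `find_tandem_repeats`, the greedy scan, and the `min_copies` parameter are hypothetical and are not the dissertation's actual algorithm.

```python
import re

# First tag on a line, including closing tags such as "/li" (an assumption).
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")


def indent_shape(html):
    """Return one (indent, first_tag) pair per non-blank line of HTML."""
    shape = []
    for line in html.splitlines():
        stripped = line.lstrip()
        if not stripped:
            continue
        indent = len(line) - len(stripped)
        match = TAG_RE.match(stripped)
        shape.append((indent, match.group(1).lower() if match else None))
    return shape


def find_tandem_repeats(shape, min_copies=2):
    """Greedily find blocks of the shape that repeat back to back.

    Returns (start, period, copies) tuples. At each start position the
    repeat covering the most lines wins; ties go to the shorter period.
    """
    repeats = []
    i, n = 0, len(shape)
    while i < n:
        best = None  # (covered_length, period, copies)
        for period in range(1, (n - i) // 2 + 1):
            unit = shape[i:i + period]
            copies = 1
            while shape[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            covered = copies * period
            if copies >= min_copies and (best is None or covered > best[0]):
                best = (covered, period, copies)
        if best is None:
            i += 1
        else:
            repeats.append((i, best[1], best[2]))
            i += best[0]
    return repeats


sample = "\n".join([
    "<ul>",
    "  <li>",
    "    <a href='/p/1'>alpha</a>",
    "  </li>",
    "  <li>",
    "    <a href='/p/2'>beta</a>",
    "  </li>",
    "  <li>",
    "    <a href='/p/3'>gamma</a>",
    "  </li>",
    "</ul>",
])
print(find_tandem_repeats(indent_shape(sample)))  # -> [(1, 3, 3)]
```

Each reported tuple marks a candidate list region: here the three-line block starting at shape index 1 repeats three times, which corresponds to the three `<li>` records. A real extractor would then hand that region to an HTML parser, as the abstract describes.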
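
The support-based rule selection in task (2) can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the dissertation's method: each page is modeled as a plain dictionary from a tag path to its text, the path strings and the seed values are invented for the example, and the "relative tag nodes" step mentioned in the abstract is left out. The names `learn_rule` and `extract` are hypothetical.

```python
from collections import Counter


def learn_rule(training_pages, seeds):
    """For each attribute, pick the tag path that most often holds a seed value.

    training_pages: list of {tag_path: text} dictionaries from one site.
    seeds: {attribute_name: set of known lowercase values}.
    """
    support = {attr: Counter() for attr in seeds}
    for page in training_pages:
        for path, text in page.items():
            for attr, values in seeds.items():
                if text.strip().lower() in values:
                    support[attr][path] += 1
    # Keep the position with the highest support for every matched attribute.
    return {attr: counts.most_common(1)[0][0]
            for attr, counts in support.items() if counts}


def extract(page, rule):
    """Apply the learned positions to an unseen page of the same site."""
    return {attr: page[path].strip()
            for attr, path in rule.items() if path in page}


seeds = {
    "license": {"gpl", "mit", "apache"},
    "language": {"java", "python", "c"},
}
training_pages = [
    {"body/div[2]/span[1]": "MIT", "body/div[2]/span[2]": "Python",
     "body/div[3]/p": "Written in Python"},
    {"body/div[2]/span[1]": "GPL", "body/div[2]/span[2]": "Java"},
    {"body/div[2]/span[1]": "Apache", "body/div[2]/span[2]": "C",
     "body/div[3]/p": "C"},
]
rule = learn_rule(training_pages, seeds)
new_page = {"body/div[2]/span[1]": "BSD", "body/div[2]/span[2]": "Rust"}
print(extract(new_page, rule))
```

The point of the example is that the seed set is only needed to locate positions. Once a position has enough support, it yields values such as "BSD" or "Rust" that were never in the seed set.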
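
The three steps of task (3), topic projection, time-based transition probabilities, and ranking on a Markov chain, can be sketched as follows. The abstract does not give the time factor or the chain's exact form, so this sketch assumes an exponential decay with a `half_life` in years and a PageRank-style damped random walk solved by power iteration. The function name `rank_developers`, the event tuple layout, and all parameter values are hypothetical.

```python
from collections import defaultdict


def rank_developers(events, topic, now, half_life=2.0, damping=0.85,
                    iterations=50):
    """Rank developers for one topic, favoring recent collaborations.

    events: iterable of (dev_a, dev_b, topic, year) collaboration records.
    """
    # Step 1: topic projection. Keep only collaborations on the given topic,
    # weighting each one by an assumed exponential time decay.
    weights = defaultdict(lambda: defaultdict(float))
    for dev_a, dev_b, event_topic, year in events:
        if event_topic != topic:
            continue
        w = 0.5 ** ((now - year) / half_life)
        weights[dev_a][dev_b] += w
        weights[dev_b][dev_a] += w

    nodes = sorted(weights)
    if not nodes:
        return []

    # Step 2 and 3: normalize weights into transition probabilities and run
    # a damped random walk (power iteration) until the scores settle.
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            for v, w in weights[u].items():
                nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


events = [
    ("ann", "bob", "web", 2012),
    ("ann", "cat", "web", 2012),
    ("bob", "cat", "web", 2004),  # old collaboration, heavily decayed
    ("dan", "eve", "db", 2012),   # different topic, projected away
]
for name, score in rank_developers(events, "web", now=2012):
    print(name, round(score, 3))
```

Changing `topic` or `now` re-ranks the same event log, which is the kind of per-topic, per-interval query the abstract describes.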
Keywords/Search Tags: Open source software, Open source community, Web Data Mining, Web Data Extraction, Information Network Analysis, Automatic Software Classification, Resource Search