
Research And Design On Open Source Community Data Mining Key Technologies

Posted on: 2013-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2298330422974027
Subject: Computer Science and Technology
Abstract/Summary:
With the swift spread of the Internet and the rapid progress of the open source movement, more and more open source software (OSS) has achieved remarkable success. In software engineering research, data mining on OSS has become increasingly important. Current OSS mining tasks focus mainly on software repositories, while the mining of open source (OS) community data has yet to form its own research direction. Lacking a unified data mining platform, existing OS community data mining tasks have to rebuild their infrastructure time after time, and they face a shortage of reserved experimental data and difficulties in comparative study. To counter these challenges, this thesis has carried out the following work:

(1) This thesis has designed INFLUX, an OS community data mining platform with a four-layer architecture: Data Storage, Information Extraction, Data Mining and Concept Application. It has implemented the three kernel infrastructure modules of INFLUX, i.e., Web Crawling, Software Entity Information Extraction and Data Indexing. Currently, the INFLUX platform has retrieved web data for nearly 700,000 OSS projects from five OS communities and provides a search service.

(2) This thesis has designed the core algorithm of INFLUX's Software Entity Information Extraction module. At present, the module uses an observation-based web extraction method, which requires manually scanning each OS community and each OSS entity attribute in that community, and designing a distinct extraction strategy for each of them. This thesis proposes an automatic method for extracting OSS entities. It exploits the redundancy of OSS entity attributes across different OS community websites to automatically induce the webpage templates of a newly added website. Tag path notation and attribute name verification are used to achieve high-accuracy OSS entity extraction. The extraction experiment is built on data provided by INFLUX, and the results indicate that only a small seed database is needed to achieve high-performance automatic extraction.

(3) This thesis has designed the core algorithm of INFLUX's Software Tag Taxonomy Mining service. The service sits in the data mining layer of INFLUX and aims to induce a tag taxonomy by analyzing OS community tag data, thereby providing a better way to organize and manage software tags. Its core algorithm is AHCTC. Among tag taxonomy generation methods, AHCTC is the first to use the classic agglomerative hierarchical clustering framework, which deliberately avoids the bottleneck of computing tag generality. Using the dataset provided by INFLUX, this thesis has carried out several comparative experiments. The qualitative evaluation shows that AHCTC can disclose new semantic structures that supplement the output of previous approaches. The quantitative evaluation demonstrates that AHCTC's taxonomies are of higher quality.
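The abstract does not spell out how AHCTC works, so the following is only a minimal sketch of the classic agglomerative hierarchical clustering framework it builds on, applied to software tags. The toy data, the Jaccard similarity over tagged projects, the average-linkage rule and all function names are assumptions made for illustration, not details taken from the thesis.

```python
from itertools import combinations

# Hypothetical toy data: tag -> set of project ids carrying that tag.
TAG_PROJECTS = {
    "python": {1, 2, 3, 4},
    "django": {1, 2},
    "flask": {3, 4},
    "java": {5, 6, 7},
    "spring": {5, 6},
}


def jaccard(a, b):
    """Jaccard similarity of two project-id sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_linkage(tags_a, tags_b, tag_projects):
    """Mean pairwise similarity between the tags of two clusters."""
    total = sum(
        jaccard(tag_projects[s], tag_projects[t]) for s in tags_a for t in tags_b
    )
    return total / (len(tags_a) * len(tags_b))


def agglomerate(tag_projects):
    """Repeatedly merge the two most similar clusters until one tree remains.

    Each cluster is (tree, members): tree is a tag name or a nested pair of
    subtrees, members is the flat list of tags under that tree.
    """
    clusters = [(tag, [tag]) for tag in sorted(tag_projects)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = average_linkage(clusters[i][1], clusters[j][1], tag_projects)
            if best is None or sim > best[0]:
                best = (sim, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def leaves(tree):
    """Flatten a nested-pair tree back into its list of tag names."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


if __name__ == "__main__":
    print(agglomerate(TAG_PROJECTS))
```

Note that the merge loop only ever compares tags with each other, so no per-tag generality score is needed. The result is a binary merge tree; turning such a tree into a labelled taxonomy would require further steps that this sketch does not attempt.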
Keywords/Search Tags: Open Source Software, Open Source Community, Data Mining Platform, Web Information Extraction, Social Tag, Taxonomy Generation