Research And Design On Open Source Community Data Mining Key Technologies

Posted on:2013-11-09

Degree:Master

Type:Thesis

Country:China

Candidate:X Li

GTID:2298330422974027

Subject:Computer Science and Technology

Abstract/Summary:

With the swift spread of the Internet and the rapid progress of the open source movement, more and more open source software (OSS) has achieved remarkable success. In software engineering research, data mining on OSS has become increasingly important. Current OSS mining tasks focus mainly on software repositories, while the mining of open source (OS) community data has yet to form its own research direction. Lacking a unified data mining platform, existing OS community data mining tasks have to rebuild infrastructure time after time, and they face a shortage of experimental data and difficulty in conducting comparative studies. To address these challenges, this thesis carries out the following work:

(1) This thesis designs INFLUX, an OS community data mining platform with a four-layer architecture: Data Storage, Information Extraction, Data Mining and Concept Application. It implements the three kernel infrastructure modules of INFLUX, i.e., Web Crawling, Software Entity Information Extraction and Data Indexing. To date, INFLUX has retrieved the web data of nearly 700,000 OSS projects from five OS communities and provides a search service over them.

(2) This thesis designs the core algorithm of INFLUX's Software Entity Information Extraction module. At present, the module uses an observation-based web extraction method, which requires manually inspecting each OS community and each OSS entity attribute in that community, and designing a distinct extraction strategy for each of them. This thesis proposes an automatic method for extracting OSS entities. It exploits the redundancy of OSS entity attributes across different OS community websites to automatically induce the webpage templates of a newly added website. Tag path notation and attribute name verification are used to achieve high-accuracy OSS entity extraction. The extraction experiment is built on data provided by INFLUX, and the results indicate that only a small seed database is needed to achieve high-performance automatic extraction.

(3) This thesis designs the core algorithm of INFLUX's Software Tag Taxonomy Mining service. The service sits in the data mining layer of INFLUX and aims to induce a tag taxonomy by analyzing OS community tag data, thereby providing a better way to organize and manage software tags. Its core algorithm is AHCTC. Among tag taxonomy generation methods, AHCTC is the first to use the classic agglomerative hierarchical clustering framework, which deliberately avoids the bottleneck of computing tag generality. Using the dataset provided by INFLUX, this thesis carries out several comparative experiments. The qualitative evaluation shows that AHCTC can disclose new semantic structures that supplement the output of previous approaches. The quantitative evaluation demonstrates that AHCTC's taxonomies are of higher quality.
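The abstract does not spell out how AHCTC works, so the following is only a minimal sketch of the generic agglomerative hierarchical clustering framework it names, applied to software tags. The toy data in PROJECT_TAGS, the use of Jaccard similarity over the sets of projects sharing a tag, and the average-linkage merge rule are all assumptions made for illustration; none of them is taken from the thesis.

```python
from itertools import combinations

# Hypothetical toy data (not from the thesis): project name -> set of tags.
PROJECT_TAGS = {
    "proj_a": {"python", "web", "django"},
    "proj_b": {"python", "web", "flask"},
    "proj_c": {"python", "web", "django"},
    "proj_d": {"java", "build", "maven"},
    "proj_e": {"java", "build", "gradle"},
    "proj_f": {"java", "build", "maven"},
}


def tag_to_projects(project_tags):
    """Invert the mapping: tag -> set of projects carrying that tag."""
    index = {}
    for project, tags in project_tags.items():
        for tag in tags:
            index.setdefault(tag, set()).add(project)
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaves(node):
    """Return the tags under a node; a node is a tag or a (left, right) pair."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def average_linkage(c1, c2, index):
    """Mean pairwise tag similarity between two clusters."""
    pairs = [(x, y) for x in leaves(c1) for y in leaves(c2)]
    return sum(jaccard(index[x], index[y]) for x, y in pairs) / len(pairs)


def agglomerate(project_tags):
    """Repeatedly merge the two most similar clusters into a binary tree."""
    index = tag_to_projects(project_tags)
    clusters = sorted(index)  # every tag starts as its own cluster
    while len(clusters) > 1:
        # Pick the most similar pair; on ties, max() keeps the first one seen.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], index),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


if __name__ == "__main__":
    print(agglomerate(PROJECT_TAGS))
```

The result is a nested tuple in which tags that co-occur on the same projects are merged first. Note that no per-tag generality score is computed anywhere, which is the property the abstract attributes to the clustering framework; how AHCTC turns such a tree into a taxonomy is not described in the abstract and is not attempted here.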

Keywords/Search Tags:

Open Source Software, Open Source Community, Data Mining Platform, Web Information Extraction, Social Tag, Taxonomy Generation

Related items

1	Research On Key Technologies Of Web Data Extraction And Mining On Open Source Community
2	An Approach Of Automatic Fork Summary Generation In Open Source Community Based On Feature Extraction
3	The Key Technology Research And Implementation Of The Software Big Data Continuous Aggregation Platform For Open Source Community
4	Research On Technologies Of Web Data Extraction On Open Source Community
5	Research On Software Mining Technology For Open Source Community
6	Research On Software Recommendation Method Based On Open Source Community And User Behavior
7	Research On Relationship Between Code Quality And Software Defects For Open Source Software
8	Research On The Key Technologies Of Massive Open Source Resources Location For Software Reuse
9	Research And Implementation Of Open Source Software Systems Situation Analysis
10	Measuring The Contribution Of Developers In Open Source Software Community