
Research On Technologies Of Web Data Extraction On Open Source Community

Posted on: 2018-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhang
Full Text: PDF
GTID: 2428330569499065
Subject: Software engineering
Abstract/Summary:
The rapid development of the Internet has accelerated change in the software industry, and the open source movement is advancing vigorously around the world. Stakeholders ranging from individual developers and companies to the leading IT giants have joined the movement and become committed users of open source software. Open source means that software development shifts from closed to open and that ownership of software products shifts from exclusive to shared. Open source software is created and grows in open source communities, where collective knowledge is pooled. The exponential growth of open source software has in turn driven the growth of these communities, which have differentiated into two forms: collaborative-development communities oriented toward group development, and knowledge-sharing communities oriented toward the exchange of ideas. Both contain massive amounts of open source data, and the connection and fusion of these data constitute a well-functioning open source ecosystem.

The heterogeneity of open source communities and the fragmentation of their massive data resources pose great challenges for Web data extraction. Based on an in-depth analysis of the characteristics of open source community data and a study of information extraction techniques, this thesis presents a data-block-based algorithm for generating Web data extraction rules. The algorithm automatically generates extraction rule templates based on XPath and regular expressions, which can effectively extract Web data from open source communities. The thesis then constructs a general Web data extraction framework that supports Web page preprocessing, data extraction, and persistence. A prototype Web data extraction system for open source communities is implemented and integrated into OSSEAN, an open source software retrieval and analysis platform. Finally, the validity of the system is verified: it effectively extracts Web data from open source communities and provides strong data support for the development and operation of OSSEAN as an open source big data service platform based on deep data mining.
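
To make the idea of an extraction rule template concrete, the following is a minimal sketch of how a rule that combines XPath and regular expressions might be applied to a page. It is not the thesis's algorithm or rule format, neither of which is given in the abstract. The names RULE_TEMPLATE, extract, and SAMPLE_PAGE, the field layout, and the sample markup are all hypothetical, and the rule is written by hand here rather than generated automatically. The sketch also assumes well-formed markup and uses only the limited XPath subset of Python's standard xml.etree.ElementTree module.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule template (an assumption, not the thesis's format):
# "block" is an XPath selecting one data block per record; each field pairs
# an XPath relative to the block with an optional refining regular expression.
RULE_TEMPLATE = {
    "block": ".//div[@class='post']",
    "fields": {
        "title": {"xpath": "./h2", "regex": None},
        "author": {"xpath": "./span[@class='meta']", "regex": r"by (\w+)"},
        "votes": {"xpath": "./span[@class='meta']", "regex": r"(\d+) votes"},
    },
}

# Hypothetical, well-formed sample page standing in for a community web page.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><h2>How to parse XML?</h2><span class="meta">by alice, 12 votes</span></div>
  <div class="post"><h2>Git rebase question</h2><span class="meta">by bob, 3 votes</span></div>
</body></html>
"""


def extract(page, rule):
    """Apply a rule template to a page and return one dict per data block."""
    root = ET.fromstring(page)
    records = []
    for block in root.findall(rule["block"]):
        record = {}
        for name, spec in rule["fields"].items():
            node = block.find(spec["xpath"])
            # Collect all text under the selected node; empty if nothing matched.
            text = "".join(node.itertext()).strip() if node is not None else ""
            if spec["regex"]:
                # Keep only the first capture group of the refining regex.
                match = re.search(spec["regex"], text)
                text = match.group(1) if match else ""
            record[name] = text
        records.append(record)
    return records


if __name__ == "__main__":
    for row in extract(SAMPLE_PAGE, RULE_TEMPLATE):
        print(row)
```

In this sketch the XPath locates the data block and the node that holds each field, while the regular expression isolates the value inside that node's text. Real community pages are rarely well-formed XML, so a practical system would need an HTML-tolerant parser in the preprocessing step.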
Keywords/Search Tags: open source community, Web data extraction, extraction rules, rule generation