Research On Web Forums Information Extraction System Based On Distributed Architecture

Posted on: 2013-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: M H Yao
GTID: 2248330371990250
Subject: Computer application technology
Abstract/Summary:
With the rapid development and large-scale application of computer and Internet technology, we have entered the information age. Demands for accessing and processing information are becoming increasingly diverse and integrated. Web pages on the Internet contain a large amount of semi-structured data, and as research on and applications of semi-structured data deepen, the demand for automatically extracting valuable information from such data keeps growing.

Web forums have become an important data source on the network, providing people with a great deal of valuable knowledge and information. Because large numbers of users exchange views and discuss issues in Web forums, vast amounts of information accumulate there. Web forums vary widely in style and are complex in content, so extracting information effectively from semi-structured Web forums has become an important research direction in information extraction.

Owing to the rapid growth of Web forums, the data sets that an information extraction algorithm must process are very large, and a stand-alone machine has difficulty completing the extraction task. At the same time, computers on the network have ample resources that are not used effectively. How to organize idle machines effectively to complete the extraction task is therefore a difficult problem in information extraction. After a detailed analysis of frequent subtree mining algorithms and distributed system architectures, we propose combining the two to overcome the deficiencies of stand-alone information extraction.

In this thesis we design and implement a semi-structured Web forum information extraction system based on frequent subtree mining and a master-slave distributed architecture.
According to the requirements analysis, the system adopts a layered architecture consisting of a presentation layer, a control layer, and a data processing layer. The presentation layer displays the extraction results, the control layer distributes extraction tasks, and the data processing layer performs the extraction. The thesis analyzes the basic principles of the system's functional modules: the communication module for the distributed nodes, which uses ACE middleware; the task distribution module, which uses a consistent hashing algorithm; the frequent pattern extraction module, which uses a frequent subtree mining algorithm; and the information extraction module, which uses a largest common sub-tree matching algorithm.

The system is in the trial-run stage. We selected a total of 660 post pages from 10 of the most representative forums as the experimental data source, and compared and analyzed the performance of the extraction system. The results show that the system runs stably and is safe, practical, and easy to operate. It greatly improves extraction efficiency compared with a stand-alone semi-structured Web forum information extraction system and reduces the human and material resources needed for data maintenance, so it has good prospects for development and application.
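
The abstract states only that the task distribution module uses a consistent hashing algorithm; it gives no implementation details. As a rough illustration of how such a module might assign forum pages to slave nodes, here is a minimal Python sketch. The class name `ConsistentHashRing`, the node names, the number of virtual nodes, and the use of MD5 are all assumptions made for this example and are not taken from the thesis.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hypothetical hash ring mapping task keys (e.g. page URLs) to worker nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per worker (assumed value)
        self._keys = []           # sorted hash positions on the ring
        self._owner = {}          # hash position -> worker name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place `replicas` virtual points for this worker on the ring.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            if h not in self._owner:
                bisect.insort(self._keys, h)
            self._owner[h] = node

    def remove_node(self, node):
        # Remove every virtual point that still belongs to this worker.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            if self._owner.get(h) == node:
                del self._owner[h]
                self._keys.remove(h)

    def get_node(self, task_key):
        # Walk clockwise to the first virtual point after the key's hash.
        if not self._keys:
            raise ValueError("no worker nodes registered")
        idx = bisect.bisect_right(self._keys, self._hash(task_key)) % len(self._keys)
        return self._owner[self._keys[idx]]
```

With a ring like this, a master node could call `get_node(url)` for each page to pick a slave. When a slave joins or leaves, only the pages that mapped to that slave are reassigned, which is the usual reason for choosing consistent hashing over a plain `hash(url) % n` scheme.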
Keywords/Search Tags: semi-structured Web forum, distributed system, frequent pattern, largest common sub-tree, information extraction