The Research On Web Information Extraction Technology

Posted on: 2015-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: L L Jia
GTID: 2298330467962382
Subject: Signal and Information Processing
Abstract/Summary:
In recent decades, the rapid development of the Internet has changed the way people obtain information, and finding valuable information online has become essential. Against this background, web information extraction technology has emerged, with the primary goal of accurately extracting information from large pools of semi-structured data. This thesis studies how to extract structured data from a large number of web pages accurately and efficiently. The details are as follows:

1. An incremental, unified information extraction system is built on regular expressions. The system incrementally crawls forums, blogs, and news sites, and uses a unified architecture to extract information from different web sites. The regular expressions are stored in a table named template, so adding a new site requires only one new seed and one new template entry rather than changes to the whole program. This makes the system simple and convenient to build, significantly reduces cost, and improves scalability.

2. A library information collection system is built as a further application of the regular-expression-based extraction system. After analyzing the structure and data formats of the libraries, I divided them into four groups and overcame the downloading difficulties of each group in turn. In total, more than seventeen million records were extracted.

3. To ensure accuracy, I propose a BBS comment extraction algorithm based on visual segmentation of web pages, which also reduces development cost. First, a page segmentation method based on information theory is proposed to remove noise. Second, because BBS comments are structurally similar to one another, an algorithm is proposed that computes DOM tree similarity based on depth. BBS comments are then extracted from the denoised page using this DOM tree similarity algorithm.
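The abstract does not give the formula behind the depth-based DOM tree similarity, so the following is only an illustrative sketch under stated assumptions. It treats each subtree as a multiset of (depth, tag) pairs, weights shallow nodes more heavily with a hypothetical `decay` factor, and scores two subtrees with a weighted Jaccard ratio. The function names and the sample markup are invented for illustration and are not taken from the thesis.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def depth_tag_counts(node, depth=0, counts=None):
    """Count (depth, tag) pairs over the subtree rooted at node."""
    if counts is None:
        counts = Counter()
    counts[(depth, node.tag)] += 1
    for child in node:
        depth_tag_counts(child, depth + 1, counts)
    return counts


def dom_similarity(a, b, decay=0.5):
    """Weighted Jaccard similarity in [0, 1]; shallow nodes weigh more.

    The weighting scheme is an assumption, not the thesis's formula.
    """
    ca, cb = depth_tag_counts(a), depth_tag_counts(b)
    shared = total = 0.0
    for key in ca.keys() | cb.keys():
        weight = decay ** key[0]  # key[0] is the node depth
        shared += weight * min(ca[key], cb[key])
        total += weight * max(ca[key], cb[key])
    return shared / total


# Two comment-like blocks with nearly the same layout, plus an unrelated list.
post_a = ET.fromstring("<div><span/><p><a/></p></div>")
post_b = ET.fromstring("<div><span/><p><a/><img/></p></div>")
sidebar = ET.fromstring("<ul><li/><li/><li/></ul>")

sim_posts = dom_similarity(post_a, post_b)   # high: structures nearly match
sim_other = dom_similarity(post_a, sidebar)  # zero: no shared (depth, tag)
```

Under this scoring, sibling subtrees whose pairwise similarity stays above a chosen threshold would be grouped as candidate comments; the threshold itself would have to be tuned per data set.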
The vision-based extraction approach reduces the manual effort involved in developing a web information extraction system. The two proposed algorithms extract information from the web accurately and efficiently. The methods have good prospects and high reference value for information extraction in public opinion analysis and search engines.
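The template-table design described in item 1 can be illustrated with a small sketch. It assumes a table named template with hypothetical columns `site`, `field`, and `pattern`, held in an in-memory SQLite database; the site names, patterns, and the `extract` helper are invented for illustration and are not the thesis's actual schema or code.

```python
import re
import sqlite3

# Hypothetical schema: one row per (site, field) holding a regular expression.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE template (site TEXT, field TEXT, pattern TEXT)")
conn.executemany(
    "INSERT INTO template VALUES (?, ?, ?)",
    [
        ("example-forum", "title", r'<h1 class="topic">(.*?)</h1>'),
        ("example-forum", "author", r'<span class="author">(.*?)</span>'),
    ],
)


def extract(site, html):
    """Apply every stored pattern for `site`; return {field: first match}."""
    rows = conn.execute(
        "SELECT field, pattern FROM template WHERE site = ?", (site,)
    ).fetchall()
    record = {}
    for field, pattern in rows:
        match = re.search(pattern, html, re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


page = '<h1 class="topic">Hello</h1><span class="author"> alice </span>'
result = extract("example-forum", page)

# Supporting a new site is a data change only: add rows, not code.
conn.execute(
    "INSERT INTO template VALUES (?, ?, ?)",
    ("example-news", "title", r"<title>(.*?)</title>"),
)
news = extract("example-news", "<title>Breaking story</title>")
```

The extraction code never names a site, which is the property the abstract relies on: onboarding a source only touches the template table and the seed list.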
Keywords/Search Tags: Information Extraction, Regular Expression, Page Segmentation, DOM Tree, Similarity