The Research On Web Information Extraction Technology

Posted on:2015-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:L L Jia

Full Text:PDF

GTID:2298330467962382

Subject:Signal and Information Processing

Abstract/Summary:

In recent decades, the rapid development of the Internet has changed the way people obtain information, and finding valuable information on the Internet has become essential. Web information extraction technology arose in this context; its central goal is to extract information accurately from semi-structured data. This paper studies how to extract structured data from a large number of web pages accurately and efficiently. The details are as follows:

1. An incremental, unified information extraction system is built on regular expressions. The system incrementally crawls forums, blogs, and news web sites, and uses a unified architecture to extract information from different sites. The regular expressions are stored in a table named template, so adding a new site requires only one new seed and one new template rather than changes to the whole program. This makes the system simple and convenient to build, significantly reduces cost, and improves scalability.

2. A library information collection system is built as a further application of the regular-expression-based extraction system. After analyzing the structure and data formats of the libraries, I divided them into four groups and overcame the downloading difficulties of each group in turn. In the end, more than seventeen million pieces of data were extracted.

3. To ensure accuracy while reducing development cost, I put forward an algorithm for extracting BBS comments based on visual web page segmentation. First, this paper proposes a page segmentation method based on information theory to remove noise. Second, because BBS comments are structurally similar to one another, this paper proposes an algorithm that calculates DOM tree similarity based on depth. BBS comments are then extracted with the DOM tree similarity algorithm from pages whose noise has already been removed. This reduces the manual effort involved in developing a web information extraction system.

The two proposed algorithms can extract information from the web accurately and efficiently. The methods have good prospects and high reference value for information extraction in public opinion analysis and search engines.
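
The template-table idea in item 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Template` fields, the site keys, the seed URLs, and the patterns below are all hypothetical, and an in-memory dictionary stands in for the table named template. The sketch only shows why supporting a new site can come down to adding one seed and one pattern while the extraction code stays unchanged; incremental crawling is not covered.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """One row of a hypothetical template table: a seed URL plus a pattern."""
    seed_url: str
    pattern: str  # regular expression with named groups for each field


# Stand-in for the template table. Adding a site means adding one entry here;
# the keys, URLs, and patterns are made up for illustration.
TEMPLATES = {
    "example-forum": Template(
        seed_url="http://forum.example.com/",
        pattern=r'<div class="post">\s*<h2>(?P<title>.*?)</h2>\s*<p>(?P<body>.*?)</p>',
    ),
    "example-news": Template(
        seed_url="http://news.example.com/",
        pattern=r"<h1>(?P<title>.*?)</h1>.*?<article>(?P<body>.*?)</article>",
    ),
}


def extract(site: str, html: str) -> list[dict[str, str]]:
    """Apply the stored pattern for `site` and return one dict per match."""
    regex = re.compile(TEMPLATES[site].pattern, re.DOTALL)
    return [match.groupdict() for match in regex.finditer(html)]


sample = (
    '<div class="post"><h2>Hello</h2><p>First post</p></div>'
    '<div class="post"><h2>Re: Hello</h2><p>Reply</p></div>'
)
records = extract("example-forum", sample)
for record in records:
    print(record["title"], "|", record["body"])
```

Because `extract` looks up its pattern by site key, the same function serves forums, blogs, and news sites alike; in a real system the dictionary would presumably be replaced by a database query against the template table.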

Keywords/Search Tags:

Information Extraction, Regular Expression, Page Segmentation, DOM Tree, Similarity

Related items

1	The Research And Implementation Of Web Information Extraction System Based On The Regular Expression
2	The Design And Implementation Of Regular Expression Engines Based On Deterministic Finite Automata
3	The Research Of Web Information Extraction Technique And Application Based On NFA Regular Matching
4	The Application And Research Of Regular Expression In Webpage Extration
5	Research On Multi-dimensional Regular Expression Matching Algorithm For Network Security
6	Study On Automaton-Based Regular Expression Matching Algorithms
7	Research And System Realization Of Key Technology Of Information Extraction Optimization
8	Research And Implementation Of A Generic Web Information Extraction System
9	A Web-based News And Information Extraction System Design And Realization
10	The Application Research Of Regular Expression In Telecommunication Services Processing