
The Research And Implementation Of Web Information Extraction System Based On The Regular Expression

Posted on: 2012-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Yang
GTID: 2248330395955696
Subject: Computer technology
Abstract/Summary:
With the rapid development of research on Web information extraction in recent years, extraction based on regular expressions has become an active topic in data mining. Building on a close study of this technology and on classic Web information extraction methods, this paper proposes an information extraction technique based on the REIE (Regular Expression Information Extraction) algorithm.

First, this paper introduces the background and structure of Web information extraction technology. By analyzing and comparing several classic information extraction methods, it proposes an extraction technique based on REIE and gives evaluation criteria for information extraction systems. Second, it introduces the relevant background on the Web and regular expressions in detail. Next, it describes in detail the parsing method and extraction principles of HTMLParser and, by analyzing Web text, shows the HTMLParser data structure. Finally, following the principles of regular expression extraction, it presents the core algorithm of the system, REIE.

Based on regular expressions, this paper then implements a Web content extraction system that mainly extracts news headlines, hyperlinks, body text, and similar content from Web pages. The system extracts from Web pages in real time and displays the results to users, and its validity is checked experimentally. The experiments indicate that the proposed method extracts information comprehensively and accurately and improves the timeliness and accuracy of Web information extraction.
Keywords/Search Tags:HTMLParser, Regular Expression, Information Extraction, REIE Algorithm
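
The abstract does not give the REIE rules themselves, so the following is only a minimal sketch of the general idea it describes: using regular expressions to pull a page title, headlines, and hyperlinks out of HTML. The patterns, the function names `strip_tags` and `extract`, and the sample page are all assumptions made for illustration, not the thesis's algorithm, and the sketch uses Python's standard `re` module rather than HTMLParser.

```python
import re

# Hypothetical patterns for illustration only; the actual REIE rules
# are not stated in the abstract.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
HEADLINE_RE = re.compile(r"<h([1-3])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(
    r"""<a\s[^>]*?href\s*=\s*["']([^"']*)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove nested tags and collapse whitespace in an HTML fragment."""
    return " ".join(TAG_RE.sub(" ", fragment).split())


def extract(html):
    """Return the title, h1-h3 headlines, and (href, anchor text) pairs."""
    title_match = TITLE_RE.search(html)
    return {
        "title": strip_tags(title_match.group(1)) if title_match else None,
        "headlines": [strip_tags(m.group(2)) for m in HEADLINE_RE.finditer(html)],
        "links": [
            (m.group(1), strip_tags(m.group(2))) for m in LINK_RE.finditer(html)
        ],
    }


# Made-up sample page used only to exercise the patterns.
SAMPLE = """<html><head><title>Daily News</title></head>
<body><h1>Market <b>rallies</b></h1>
<p>Read the <a href="/story/1">full story</a> or
<a class="x" href='/archive'>archive</a>.</p>
</body></html>"""

result = extract(SAMPLE)
print(result)
```

Patterns like these are brittle on malformed or deeply nested markup, which is presumably one reason the thesis pairs regular expressions with an HTML parser; a real system would likely parse the page into nodes first and apply expressions to the node text.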