Design And Implementation Of News Web Data Extraction

Posted on: 2012-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
GTID: 2218330362452722
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has accelerated the release of news, and the Internet has become the most comprehensive place to gather news. With the sharp increase in daily news, quickly finding news of interest on the Internet has become an important problem. To meet this need, the major search engines have launched dedicated news sections, but relying on these general-purpose services alone makes it difficult to meet users' information needs in a particular field. This thesis proposes a news information extraction system for the cultural field that meets the university's needs. Based on an analysis of the structure of a large number of news pages, it combines rule-based and statistical methods to extract information from news pages. Web pages are collected from several cultural-field websites, and information is then extracted from them. The method has three steps: web page clustering, rule construction, and data integration. First, pages with similar structure are assigned the same label. Second, an extraction rule is built from one page of each group of same-label pages; the thesis mainly introduces the rules for extracting the news text and the title. Building on earlier work that uses full stop density to locate the news text section, the thesis uses a method that determines the scope of the news text more accurately, and it applies a Chinese word segmenter to the page title to determine the title extraction rule. Finally, the information is extracted into the database using these rules. The thesis also supports custom content extraction rules: users select the content they are interested in, and the system automatically generates the corresponding extraction rules, which are constructed with regular expressions. A news information extraction system was designed on the basis of these methods, and tests on several cultural-field news websites gave good results.
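The full stop heuristic mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define how the density is computed, so this example assumes the simplest stand-in: each candidate text block is scored by the number of full stops it contains, and the highest-scoring block is taken as the news text. The function names, the set of full stop characters, and the tie-breaking by length are all hypothetical and are not taken from the thesis.

```python
# Characters counted as full stops: the Chinese "。" and the Western ".".
# Assumption for illustration; note that "." in decimals such as "3.5"
# is also counted by this naive approach.
FULL_STOPS = ("。", ".")


def full_stop_count(text):
    """Return how many full stop characters appear in the text."""
    return sum(text.count(mark) for mark in FULL_STOPS)


def pick_body_block(blocks):
    """Return the block with the most full stops (longest block on ties).

    `blocks` is a list of strings, e.g. the text of candidate DOM nodes.
    Returns an empty string when no blocks are given.
    """
    if not blocks:
        return ""
    return max(blocks, key=lambda block: (full_stop_count(block), len(block)))


if __name__ == "__main__":
    candidates = [
        "Home | News | Culture | Contact",
        "The museum opened a new exhibition today. Curators expect "
        "many visitors. Tickets are free.",
        "Copyright 2012. All rights reserved.",
    ]
    print(pick_body_block(candidates))
```

Navigation bars and link lists rarely contain sentence-ending punctuation, which is why even this crude count tends to separate them from article text. A real system would need a more careful measure, since the thesis states that it refines the scope of the news text beyond the basic heuristic.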
Keywords/Search Tags: news page, DOM tree, clustering, similarity, Chinese word segmenter