
Research On The Intelligent Acquisition Of Web-Based News Contents

Posted on: 2017-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: X A Chen
GTID: 2348330485488151
Subject: Information and Communication Engineering
Abstract/Summary:
Advances in Web technology have gradually turned the Web into a platform for producing and consuming online content, and the Internet now holds countless information sources in the form of Web pages. The continued development of applications and research in search engines, network monitoring, and personalized news push has made the collection of Internet news text a hot topic among researchers in China and abroad. Besides the body content that users care about, a Web page contains a large amount of noise, such as advertisements, navigation, and related recommendations, which makes body text extraction one of the key problems in news text collection. Traditional template-based extraction requires a template to be configured manually for each website and cannot adapt in real time to changes in page structure, so later maintenance is costly. The heterogeneity of Web pages also poses new challenges for existing wrapper techniques based on rule learning. This thesis studies body text parsing for news pages. Combining the structural characteristics of news pages, the characteristics of body text tags, and the requirements of Web news text collection, it proposes a body text extraction method that adapts intelligently to changes in page structure and applies generally to major portals and news sites.

Our work mainly consists of three parts, listed as follows.

(1) A Web text extraction method based on mining body text tag features. The method mainly mines the tree structure of the page, the centrality and continuity of body text tags, the tag hierarchy of the body, and HTML modifier (formatting) features. It clusters the tags with a hierarchical clustering algorithm, then computes weights for the tag clusters and adjusts them empirically to determine the final body text tag cluster. To ensure that the text gathered during collection comes, as far as possible, from genuine news pages, the thesis also proposes, on top of this method, a self-identification method for non-news pages. The body text is then extracted according to the body text tag cluster.

(2) A news page text extraction method based on intelligent templates. This method avoids the tedious process of configuring templates manually. Instead, drawing on the structural characteristics of portal and news pages, it applies the tag feature mining method to a large number of news pages from a website, automatically generates a parsing template for that site, and then extracts the body text of the site's pages according to the template.

In summary, the proposed extraction methods were evaluated experimentally on real Web pages. The experimental results show that the approach is feasible and highly accurate in an intelligent Web news text collection system, and they verify its generality and intelligence in Web page text extraction.
Keywords/Search Tags: Page Text Extraction, Tag Features, Intelligent Templates, Non-news Page Self-identification, Machine Learning
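
The tag-feature idea in part (1) of the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: in place of hierarchical clustering with empirically adjusted weights, it uses a much simpler grouping, in which text blocks that share the same tag path and appear close together (continuity) form a cluster, and each cluster is weighted by its amount of non-link text. The names `BlockCollector`, `extract_main_text`, and `max_gap`, and the choice of inline tags, are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Tags with no closing tag, tags whose text is never content,
# and inline tags ignored when building a block's tag path (all assumptions).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}
INLINE = {"a", "b", "em", "i", "span", "strong"}


class BlockCollector(HTMLParser):
    """Record every text block with its tag path and a link flag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.blocks = []  # (path, text, is_link_text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed tags: pop back to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIP & set(self.stack):
            return
        path = tuple(t for t in self.stack if t not in INLINE)
        self.blocks.append((path, text, "a" in self.stack))


def extract_main_text(html, max_gap=2):
    """Return the text of the heaviest cluster of same-path, nearby blocks."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()

    clusters = []
    latest = {}  # tag path -> most recent cluster with that path
    for order, (path, text, is_link) in enumerate(collector.blocks):
        cluster = latest.get(path)
        # Start a new cluster when the path is new or continuity is broken.
        if cluster is None or order - cluster["last"] > max_gap:
            cluster = {"last": order, "texts": [], "weight": 0}
            clusters.append(cluster)
            latest[path] = cluster
        cluster["last"] = order
        cluster["texts"].append(text)
        if not is_link:  # link text (navigation, recommendations) adds no weight
            cluster["weight"] += len(text)

    if not clusters:
        return ""
    best = max(clusters, key=lambda c: c["weight"])
    return " ".join(best["texts"])
```

Calling `extract_main_text(page_html)` on a page whose article paragraphs share one tag path returns those paragraphs joined into a single string. Navigation lists made of links receive zero weight and are not selected. A realistic system would need the richer features and the weight adjustment that the abstract describes.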