
Research And Implementation Of Web News Extraction Method Based On Tag Path And Keyword Features

Posted on: 2022-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhang
GTID: 2518306605968599
Subject: Master of Engineering
Abstract/Summary:
With the development of the Internet and HTML technology, online news has become the main channel of news distribution. However, redundant information in web pages, such as navigation bars, advertisements, and record information, interferes with readers' access to the news content. To give users clean news content, the main news information, such as the title and body text, must be extracted from the web page. To this end, this thesis designs a Web news extraction system that extracts the main news information from web pages, classifies the extracted news, and stores it in a database.

For the problem of extracting content from news web pages, this thesis designs an extraction algorithm based on tag paths and keyword features, building on existing extraction algorithms based on statistical information. The algorithm has three key points. The first is to partition the web page into blocks according to tag path, which reduces the amount of computation. The third is to replace each picture with the text of its context and then compute the importance of the picture from that text. This equivalent-substitution idea is not limited to pictures: other content whose importance cannot be measured directly, such as audio and video, can be converted into text in the same way. Finally, after a feature value has been computed for each web page block, a support vector machine classifier is applied to all blocks to identify the content blocks.

Based on these ideas, this thesis designs and implements a Web news extraction system with eight modules: the download module, preprocessing module, feature value calculation module, node classification module, news classification module, storage module, proxy pool module, and log module. The proxy pool module acquires and manages proxies in real time and supplies available proxies to the crawler. The log module implements a configurable logging component that records the system's operating status. The download module downloads web page source documents and extracts links. The preprocessing module performs node fusion and partial noise filtering on the source document. The feature value calculation module computes the feature value of each node. The node classification module classifies nodes as content or noise according to their feature values. The news classification module classifies the extracted news, and the storage module stores the news, together with its category information, in a Redis database.

During operation, the system records the attributes of each noise block, and when it processes the next news web page it can filter out some nodes according to the recorded attribute information. As the system processes more news, the accuracy of news content extraction increases. Finally, the entire system is tested, and the test results show that it operates stably.
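The tag-path blocking step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact blocking rule, so this example assumes the simplest reading: text nodes that share the same root-to-node tag path are grouped into one block. The names `TagPathBlocker` and `block_by_tag_path` are hypothetical, and the sketch uses only Python's standard `html.parser` module.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these elements is never visible page content.
SKIP_TAGS = {"script", "style"}


class TagPathBlocker(HTMLParser):
    """Hypothetical helper: groups text nodes by their tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, root first
        self.blocks = defaultdict(list)  # tag path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            self.blocks["/".join(self.stack)].append(text)


def block_by_tag_path(html):
    """Return a dict mapping each tag path to the text found under it."""
    parser = TagPathBlocker()
    parser.feed(html)
    parser.close()
    return dict(parser.blocks)


if __name__ == "__main__":
    page = ("<html><body><div><ul><li>Home</li><li>Sports</li></ul></div>"
            "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
            "</body></html>")
    for path, texts in block_by_tag_path(page).items():
        print(path, texts)
```

In this reading, navigation links and body paragraphs fall into different blocks because their tag paths differ, so later steps (feature value calculation and classification) operate on a handful of blocks rather than on every node.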
Keywords/Search Tags:Tag Path, Key Words, Content Extraction, News Classification, Web Crawler