
Design And Implementation Of Web Automatic News Acquisition System

Posted on: 2018-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Zhang
Full Text: PDF
GTID: 2348330542971911
Subject: Software engineering
Abstract/Summary:
In an era of rapid Internet development, online media, with their speed and wide reach, have become a new window through which people understand the outside world. To deliver the latest news and important events to users, web editors are often on duty from early morning until late at night. Yet because news is highly time-sensitive and editors' working hours and personal energy are limited, missed stories are inevitable, which damages a media outlet's credibility. An automatic news acquisition system is therefore urgently needed. In addition, this thesis finds that although acquisition products already exist on the market, their results are unsatisfactory: most suffer from repeated news items, incomplete web page parsing, inaccurate classification of news into channels, and other problems. The core of a Web automatic news acquisition system is the configuration of the acquisition strategy and the subsequent processing of the collected text. Based on user needs, the system provides automatic news collection, duplicate removal, classification, and other functions.

The main contents of this thesis are as follows:
(1) The application of the core technologies of automatic news acquisition systems in China and abroad is analyzed, and the types of web crawlers and their crawling strategies are introduced. The characteristics of text classification, word segmentation, feature selection, and feature extraction are discussed.
(2) The requirements of the Web automatic news acquisition system are analyzed, including functional and performance requirements, and the design goals and principles are discussed. On this basis, the overall architecture and functional modules of the system are designed.
(3) The news acquisition module and the text processing module are designed and implemented in detail, as is the application layer of the system. The deployment environment of the system is also designed.
(4) The system deployment environment is described; functions such as acquisition, duplicate removal, and classification are tested and demonstrated; and the system's running time and the number of websites it can support for collection are tested.

Drawing on the actual practice of news gathering and editing on the Internet, the system discovers and obtains real-time news from vertical industry websites, applies duplicate removal, noise reduction, and other preprocessing to the crawled pages, and stores the results in the news library. These functions help news sites publish important news promptly, further improve their credibility, and thereby attract traffic.
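
The abstract does not say which algorithm the system uses for duplicate removal. As an illustration only, the following minimal Python sketch shows one common approach: compare articles by the Jaccard similarity of their word shingles and keep the first article of each near-duplicate group. The function names, the shingle size `k`, and the `threshold` value are assumptions made for this example, not details from the thesis. The simple `\w+` tokenizer suits space-delimited text; Chinese news text would first need the word segmentation step the abstract mentions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(articles, threshold=0.8):
    """Keep an article only if it is not a near-duplicate of one already kept.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    kept = []
    kept_shingles = []
    for text in articles:
        s = shingles(text)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    news = [
        "The central bank raised interest rates by a quarter point on Monday.",
        "The central bank raised interest rates by a quarter point on Monday!",
        "Local team wins the championship after a dramatic final match.",
    ]
    for article in deduplicate(news):
        print(article)
```

This pairwise comparison is quadratic in the number of kept articles, so a system collecting from many sites would likely need a fingerprinting or hashing scheme instead; the sketch only shows the basic idea.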
Keywords/Search Tags:news gathering, web crawler, removal of duplication, text classification, page analysis