
The Design And Implementation Of Network News Gathering System Based On Topics And Categories

Posted on: 2018-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2348330518468431
Subject: Engineering
Abstract/Summary:
With the development of the Internet, network news has become one of the most important sources of information. Network news spreads quickly, has wide impact, and reaches a broad social audience, but some of it is false or of low quality, and this uneven quality degrades the user's reading experience. In addition, network news has to some extent become both a source and a transmission channel of online public opinion, so collecting authentic, accurate, structured news data from the massive volume of network news has become a focus of network public opinion research. This paper focuses on topic-based and category-based network news. It addresses the problems of collecting news by topic and by category, and further improves the performance of the system once its basic functions are in place.

This paper introduces the concepts of the topic crawler (focused crawler) and the SVM classifier, together with XPath and multi-threading technology. Based on this theory and technology, it designs and implements a network news gathering system based on topics and categories. The system collects and stores both topic network news and category network news. In topic-based gathering, the system builds a crawling priority queue by calculating the similarity of each page to the topic, and then uses XPath to extract the title, URL, publication time, publication source, body text, and other content of each topic news item. The collected topic news data is then stored in the system database. In category-based gathering, the system uses the LIBSVM package to train and build the classifier, and then uses XPath to extract the title, URL, publication time, publication source, and body text of news in each category, namely society, entertainment, finance, and sports. The classified news data is likewise stored in the system database.

The paper is organized as follows. First, it introduces the research background and significance of network news gathering, and reviews research on focused crawlers and classifiers at home and abroad. Second, it introduces the theory and technology involved in network news gathering, including the Robots protocol, the general-purpose web crawler, the support vector machine, topic crawler search strategies, and XPath. Third, it analyzes the system requirements and designs the system architecture and its modules: the news site seed injection module, the web page source code acquisition module, the web page parsing module, the classification module, the topic filtering module, the URL scheduling module, the URL deduplication module, the web page information extraction module, and the database storage module. Fourth, following the detailed design, it implements these modules by calling the ICTCLAS package and the LIBSVM package, thereby realizing both topic-based and category-based network news gathering. Finally, it lists the hardware and software environments required to run the system, and tests the system's functions and performance separately. The test results meet the system requirements, although many areas still need improvement. The system is implemented in C# on the 32-bit Windows 7 operating system and supports both topic-based and category-based collection. It is robust, efficient, persistent, and stable, and can collect and store topic-based and category-based network news data accurately, promptly, and effectively.
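
As an illustration of the topic-based scheduling summarized in the abstract (a crawling priority queue ordered by page similarity, combined with URL deduplication), the following is a minimal Python sketch. It is not the thesis implementation, which is written in C# and segments Chinese text with ICTCLAS. The sketch assumes cosine similarity over whitespace-tokenized term counts, and the names `cosine_similarity` and `TopicFrontier` are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts.

    Assumption: whitespace tokenization stands in for a real word segmenter.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TopicFrontier:
    """Crawl frontier: URLs most similar to the topic are popped first."""

    def __init__(self, topic):
        self.topic = topic
        self._heap = []      # entries are (-score, insertion_order, url)
        self._seen = set()   # URL deduplication
        self._order = 0      # tie-breaker so equal scores pop in FIFO order

    def push(self, url, context_text):
        """Queue a URL scored by its context text; skip URLs already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        score = cosine_similarity(self.topic, context_text)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1
        return True

    def pop(self):
        """Return (url, score) for the most topic-relevant queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler built on this sketch would call `push` for every link found on a fetched page, passing the anchor text or surrounding text as `context_text`, and call `pop` to choose the next page to fetch.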
Keywords/Search Tags: Network news, subject crawler, support vector machine, information extraction