
Design And Implementation Of Information Acquisition System Based On News And Forums

Posted on: 2015-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Kong
GTID: 2268330425995993
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, modern society has entered an era of information explosion: anyone can post any message over the network at any time and from any place. Without doubt, the network has reached into every aspect of our lives. Faced with the complex information on the Internet, how to process and make effective use of this huge amount of data has become a major challenge. Consequently, online information collection, analysis, publishing, and processing have increasingly become a focus of scholars and institutions at home and abroad, and research on information acquisition systems has great significance and practical value.

Based on a review of the literature, this thesis analyzes the current state and development trends of information acquisition systems and describes their research significance and practical value in detail. It also studies the technologies behind such systems in detail, including web crawlers, proxy servers, seed URLs, URL extraction and normalization, regular expressions, and Chinese word segmentation. These are the key technologies of an information acquisition system, and studying them plays an important role in the design of the news- and forum-based information acquisition system presented here.

The system is a comprehensive information acquisition system for news sites and forums, developed in the C# programming language. It collects from Sina News, Tencent News, Sohu News, Netease, the Tianya forum, and the Mop forum. Unlike an information collection system built for a single site, it can collect from multiple sites without affecting acquisition speed or accuracy. Acquisition channels can be added or deleted at any time according to users' needs, which increases the flexibility of the system.
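The URL extraction and normalization step named above can be illustrated with a minimal sketch. The thesis system is written in C# and its actual rules are not given in this abstract, so the Python below is only an assumption-laden illustration: the regular expression `HREF_RE`, the function names `normalize_url` and `extract_links`, and the specific normalization rules (resolve relative links, lowercase the host, drop default ports and fragments) are all hypothetical choices.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical pattern: captures the href value of an <a> tag, stopping
# before any "#fragment" part. Not taken from the thesis.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def normalize_url(base_url, link):
    """Resolve a link against its page and reduce it to one canonical form."""
    absolute = urljoin(base_url, link.strip())
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar links
    host = parts.hostname or ""  # hostname is already lowercased
    default_port = 80 if parts.scheme == "http" else 443
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def extract_links(base_url, html):
    """Return the distinct normalized links of a page, in order of appearance."""
    links = []
    for match in HREF_RE.finditer(html):
        url = normalize_url(base_url, match.group(1))
        if url and url not in links:
            links.append(url)
    return links
```

Normalizing before de-duplicating matters for a crawler: two spellings of the same address (for example with and without `:80`) would otherwise be fetched twice. A regular expression is used here only because the abstract lists it as one of the system's techniques; a real HTML parser would be more robust.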
The system uses a MySQL database named MSD0, which has three main data tables: final, news, and AdminInfo.

The overall structure of the system comprises five main modules: the system login interface, the data capture module, the data access module, the data processing module, and the add-URL module. After introducing the design of the data acquisition system, this thesis describes in detail the design and implementation of the data processing module, the add-URL module, and the information collection module. The core of the system is the information collection module, which collects information from different sites according to the user's choice of collection source and sampling depth and displays the collected results at the same time. The data processing module provides word segmentation and part-of-speech tagging according to users' needs. The add-URL module lets users add or delete URLs at any time.

This thesis also uses Sina News, Tencent News, Netease, and Sohu News as examples for a detailed demonstration, and tests and analyzes the system using these four news websites as test sites and “primary and secondary school textbooks” as the acquisition theme. The tests examine the system's acquisition speed and accuracy rate and show that the system captures general static web pages well and at a relatively fast speed.
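To make the idea of a user-chosen collection source and sampling depth concrete, here is a small sketch of a depth-limited crawl. It is not the thesis implementation (which is in C#); the breadth-first order, the function name `crawl`, and the caller-supplied `fetch_links` callback are assumptions made for illustration, since the abstract does not say how the collection module traverses pages.

```python
from collections import deque


def crawl(seed_urls, max_depth, fetch_links):
    """Breadth-first crawl that stops following links at max_depth.

    fetch_links is a hypothetical caller-supplied function that downloads
    one page and returns the URLs found on it.
    Returns the visited (url, depth) pairs in visiting order.
    """
    visited = set()
    queue = deque()
    for url in seed_urls:  # seeds are the user-selected sources, depth 0
        if url not in visited:
            visited.add(url)
            queue.append((url, 0))

    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append((url, depth))
        if depth >= max_depth:
            continue  # page is collected, but its links are not followed
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return collected
```

Passing the page fetcher in as a parameter keeps the traversal logic separate from networking, so the same loop can be exercised against an in-memory link graph or against live sites.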
Keywords/Search Tags: Information Acquisition System, Web Crawler, Data Processing