
Research And Realization Of Preprocess System Based On Microblog Data

Posted on: 2017-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: C H Zhou
Full Text: PDF
GTID: 2428330596490066
Subject: Software engineering
Abstract/Summary:
In today's highly connected society, the Internet has become the most important channel for exchanging information. Information on the network covers politics, science, education, and many other fields, and different groups need different parts of it. Analyzing and processing network data has therefore become an important topic, and data collection and preprocessing, as prerequisites for data analysis, have become key issues. Based on the characteristics and requirements of microblog data, and on an analysis of the shortcomings and omissions of existing methods, this thesis designs and implements a data preprocessing system. The system covers an automated flow from data fetching and persistence through preprocessing, and provides a highly automated, highly specialized preprocessing system for microblog data and for data sets with similar characteristics. The work mainly includes the following aspects.

Firstly, starting from network information capture, the thesis describes the basic existing crawling methods, classifies crawling tools, analyzes the methods used to filter information during crawling, and sets out the application scenarios considered in this work.

Secondly, to address the shortcomings of general-purpose web crawling methods, the thesis proposes an algorithm for locating specific parts of web pages. Within a single page, the algorithm searches for critical content based on a canonical value, by analyzing the type distribution of the main tags and the position and characteristics of the most important information. For outbound links between pages, where similar data must be fetched after following a link, a characteristic curve based on the tag distribution is drawn for each page, and pages are filtered by the degree of fit between curves, so that the crawler avoids jumping to irrelevant pages and keeps only similar ones. The feasibility and validity of the algorithm are verified experimentally on several mainstream websites.

Thirdly, the thesis analyzes the word segmentation algorithms commonly used in Chinese text processing and their performance in practical applications. Building on word segmentation, it studies the text features and processing methods of data similar to microblogs. It explains the importance and optimization effect of data normalization in preprocessing and incorporates normalization into the preprocessing flow as the de-noising step. For the main part of preprocessing, existing algorithms are studied and their characteristics analyzed, and the Simhash algorithm is selected for further experiments and improvement. Optimizations of distance detection and of duplicate-entry search in Simhash are proposed and verified experimentally.

Finally, according to the requirements of fetching and preprocessing microblog and related data, the thesis proposes the architecture of the microblog data preprocessing system. The system is divided into an online fetching part, a persistence part, and a preprocessing computation part. The three parts are loosely coupled and can work independently, which accommodates varied practical requirements. The implemented system lets users extract and process data from various data sources, view and save intermediate results, and configure the parameters of the main deduplication algorithm. Implementation and verification show that the system can effectively capture, persist, and preprocess network data, that the crawling algorithm is clearly effective in application, and that the effect and efficiency of preprocessing are better than those of the original algorithm. The work therefore has reference value and application prospects.
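As a point of reference for the deduplication step, the following is a minimal sketch of the standard Simhash fingerprint and Hamming-distance check, not the improved version proposed in the thesis. It assumes the text has already been segmented into tokens; the function names, the MD5-based 64-bit token hash, the use of token frequency as weight, and the threshold of 3 bits are all illustrative assumptions rather than details taken from the thesis.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint for a list of tokens.

    Assumption: token frequency is used as the feature weight, and the
    per-token hash is the first `bits` bits of an MD5 digest.
    """
    weights = Counter(tokens)
    vector = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (token_hash >> i) & 1:
                vector[i] += weight
            else:
                vector[i] -= weight
    fingerprint = 0
    for i in range(bits):
        # A positive column sum sets the corresponding fingerprint bit.
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, threshold=3):
    """Treat two token lists as near-duplicates if their fingerprints
    differ in at most `threshold` bits (the threshold is an assumed value)."""
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

In a pipeline like the one described, the token lists would come from the word segmentation and normalization steps, and the threshold would be one of the user-configurable deduplication parameters.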
Keywords/Search Tags: Microblog, data fetching, text preprocessing, text deduplication, Simhash