
Research On Key Approaches Of Similar Detecting Based On Massive Text Data Set

Posted on: 2017-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H T Wang
Full Text: PDF
GTID: 1108330482994775
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet and Internet of Things industries, data is growing at an unprecedented scale and has become a strategic resource as important as natural or human resources; the ability to master data resources now represents a nation's digital capability. Consequently, the collection, storage, processing, and analysis of massive data, together with the resulting information services, have become the mainstream of global information technology development, and research and applications of big data have become an important driving force for industrial upgrading and the emergence of new industries.

As commercial capital and a strategic resource, big data brings not only momentum but also challenges. How to extract valuable information from massive data is the most important task facing researchers. However, massive data sets also contain a great deal of duplicate or near-duplicate content, which not only wastes storage resources and slows transmission but also directly degrades the overall performance of search engines and adds to the burden of search.

The goal of big data processing is to mine deep, valuable knowledge from data sets and to offer high-value-added applications and services for a given field through effective information technologies and computing methods. How to quickly and efficiently discover potential knowledge, reasonably classify it, and accurately locate these data resources through duplicate- and similar-content detection and elimination technologies is therefore a current hot research topic. Accordingly, this dissertation proposes a similarity detection approach for massive text data sets and applies it to de-duplication, drawing on research in data classification and mining, feature selection, similarity checking, and the MapReduce computing model.
Specifically, the main research work and innovations are as follows.

First, the dissertation studies the relevant theory and technology of text categorization. The task of text categorization is to build decision formulas and rules from the labeled sample data of each class, and then assign an unlabeled text to a specific class using those formulas and rules. The categorization pipeline consists of text preprocessing, feature selection, text representation, and categorization algorithms; this dissertation emphasizes classifier design and the evaluation criteria of categorization. This work lays the theoretical foundation for the similarity detection that follows.

Second, to address the low precision of categorization, the dissertation presents a multi-threshold categorization approach for massive text sets that combines Bayesian classification with association rule mining, improving precision for text documents. Bayesian classification has the advantage of a simple calculation process, but it ignores the mutual linkages between texts. By mining association rules and setting an appropriate confidence threshold for each text-class relation, the classifier achieves higher categorization precision, remedying this defect of Bayesian classification. The method converts preprocessed text data into association rules using the CBA-RG algorithm, then checks whether the categorization precision of the first rule exceeds the confidence threshold of the specific rule: if so, the data covered by that rule's class is removed from the training set and the rule is stored in the rule set; otherwise, the rule is eliminated from the classifier. This process repeats until all ordered rules have been examined, yielding a set of association rules whose support all exceeds the minimum threshold.
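The combination described above can be sketched as follows. This is a minimal, hypothetical illustration (all function names and data are invented for this sketch, and the thesis's exact CBA-RG rule-generation procedure is not reproduced): a naive Bayes classifier provides the default decision, and a mined association rule overrides it only when the rule's confidence clears the threshold.

```python
import math
from collections import Counter, defaultdict

def train_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns a naive Bayes model."""
    priors, word_counts, totals, vocab = Counter(), defaultdict(Counter), Counter(), set()
    for tokens, label in docs:
        priors[label] += 1
        for t in tokens:
            word_counts[label][t] += 1
            totals[label] += 1
            vocab.add(t)
    return priors, word_counts, totals, vocab

def bayes_predict(model, tokens):
    """Pick the class with the highest log-posterior under naive Bayes."""
    priors, word_counts, totals, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / n)
        for t in tokens:
            # Laplace-smoothed word likelihood
            lp += math.log((word_counts[label][t] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

def classify(model, rules, tokens, threshold=0.9):
    """rules: list of (feature_set, label, confidence). A matching rule whose
    confidence clears the threshold overrides the Bayes decision."""
    for feats, label, conf in rules:
        if conf >= threshold and feats <= set(tokens):
            return label
    return bayes_predict(model, tokens)
```

With a toy training set and one high-confidence rule, documents covered by the rule are classified by it directly, while everything else falls back to the Bayes decision.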
Experiments show that, compared with a single classifier, the proposed approach achieves higher categorization precision and recall.

Third, to address the low accuracy of feature vector selection, the dissertation proposes a word-frequency text feature selection method based on mutual information (FSTM), applied to feature vector construction and text feature extraction. The method takes as input the set of text classes and the number of times each word occurs in each category. First, the input text is segmented into words and indexed, and the words of each text are traversed repeatedly; then, for each class in the training set, the method counts the texts in which a candidate feature word occurs at least min times and computes the average number of occurrences of the feature word across all texts; finally, it computes the mutual information value of each word over all classes, adds the highest-valued word to the feature set, and repeats until the number of feature words reaches the threshold, completing feature selection.

Fourth, to address the difficulties of similarity detection over massive web pages, such as complicated parallel design, low work efficiency, and huge data volume, the dissertation presents a MapReduce-based similarity detection approach that uses the SimHash algorithm with paragraph and long-sentence weighting to obtain paragraph fingerprints and then compute similarity. First, using the feature selection method proposed in the preceding chapter, the approach obtains a paragraph fingerprint for each web page, then sorts and indexes the fingerprints; next, it looks up the fingerprint of the page under test in the index of existing web data to retrieve pages that are likely duplicates of, or similar to, the existing ones; finally, based on the lookup results, it computes the similarity and decides whether the page under test is similar to an existing page.
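The frequency-thresholded feature selection described above can be sketched roughly as follows. This is an assumption-laden illustration, not the thesis's FSTM algorithm: it treats a word as "present" in a document only when its frequency reaches min_tf, scores each word by its maximum pointwise mutual information with any class, and keeps the top-k words.

```python
import math
from collections import Counter, defaultdict

def select_features(docs, k, min_tf=1):
    """docs: list of (tokens, label) pairs. Returns the top-k feature words,
    ranked by max pointwise mutual information with a class (illustrative)."""
    n_docs = len(docs)
    class_docs = Counter(label for _, label in docs)
    term_docs = Counter()                  # docs where term freq >= min_tf
    term_class_docs = defaultdict(Counter)
    for tokens, label in docs:
        for term, freq in Counter(tokens).items():
            if freq >= min_tf:
                term_docs[term] += 1
                term_class_docs[term][label] += 1
    scores = {}
    for term in term_docs:
        best = float("-inf")
        for label, joint in term_class_docs[term].items():
            p_tc = joint / n_docs                 # P(term, class)
            p_t = term_docs[term] / n_docs        # P(term)
            p_c = class_docs[label] / n_docs      # P(class)
            best = max(best, math.log(p_tc / (p_t * p_c)))
        scores[term] = best
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

On a toy corpus, class-specific words (high mutual information with one class) are selected ahead of words spread evenly across classes.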
On a Hadoop experimental platform, data sets of three different scales were used to test the feasibility of the design, measuring running time and speedup ratio respectively. The experimental results show that running time and efficiency improve remarkably for similar-web-page detection; in particular, as the data volume and the Hadoop cluster size increase, the efficiency gains of the algorithm and its advantages for similarity detection over massive data sets become even more pronounced.
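The SimHash fingerprinting at the core of the detection step above can be sketched minimally as follows. The sketch assumes 64-bit fingerprints and unit weights per feature word; the thesis's paragraph and long-sentence weighting, and the MapReduce parallelization, are omitted.

```python
import hashlib

def simhash(tokens, bits=64):
    """Build a SimHash fingerprint: each token's hash votes per bit position,
    and the sign of the accumulated vote fixes that fingerprint bit."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_similar(fp1, fp2, k=3):
    # Fingerprints within Hamming distance k are declared near-duplicates;
    # k is a tunable threshold, not a value taken from the thesis.
    return hamming(fp1, fp2) <= k
```

Because near-duplicate texts produce fingerprints that differ in only a few bits, candidate pages can be retrieved from an index by fingerprint and confirmed with a cheap Hamming-distance check.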
Keywords/Search Tags: Big Data, Similar Detection, Categorization, Feature Selection, Cloud Computing, Text Set