
Research on Key Technologies of Network Writeprint Recognition Based on MapReduce

Posted on: 2013-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wei
GTID: 2247330371491466
Subject: Education Technology
Abstract/Summary:
Network writeprint refers to the set of features that characterize a user's unique writing style in network text (such as word usage habits and grammatical structure). Like a fingerprint, a network writeprint can serve as a unique identifier of an author's writing characteristics. As the research has deepened, the number of authors under study has increased, the total amount of data to be processed has grown, and the time spent on data processing has begun to hinder the progress of the study. In addition, we found that resources such as memory and CPU are not fully utilized while the program runs. This paper therefore studies the parallelization of the key data processing algorithms, in order to make full use of computer resources and improve the efficiency of data processing.

Ngram refers to extracting, from a given sequence of text, contiguous subsequences of fixed or variable length. Research shows that Ngram feature extraction is an important technique for constructing the individual feature set of a network writeprint, and improving data processing efficiency is an important part of the Ngram feature extraction process. In this paper, we design the Hadoop-Ngram algorithm and implement it on the Hadoop platform. The experimental results show that, compared with Ngram feature extraction that has not been parallelized, Hadoop-Ngram processes data more efficiently and makes fuller use of computer resources such as CPU and memory. In the experiments, we also compare data processing efficiency under different settings of the Hadoop generic parameters.
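As a rough illustration of how Ngram extraction fits the map and reduce pattern, the following Python sketch simulates the map, shuffle, and reduce phases in a single process. It is not the Hadoop-Ngram algorithm from the thesis, whose details the abstract does not give; the function names, the use of character-level n-grams, the fixed length `n`, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_ngrams(author, text, n=2):
    """Map step: emit ((author, ngram), 1) for each contiguous character n-gram."""
    for i in range(len(text) - n + 1):
        yield (author, text[i:i + n]), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each (author, ngram) key."""
    return {key: sum(values) for key, values in grouped.items()}


def ngram_counts(documents, n=2):
    """Run map, shuffle, and reduce over (author, text) pairs in one process."""
    mapped = chain.from_iterable(map_ngrams(a, t, n) for a, t in documents)
    return reduce_counts(shuffle(mapped))


# Hypothetical sample input: (author, text) pairs.
docs = [("alice", "abab"), ("bob", "abc")]
counts = ngram_counts(docs, n=2)
```

Because each call to `map_ngrams` depends only on its own document, the map step can in principle run on many documents at once, which is the property a parallel implementation would rely on.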
The experimental results show that the running efficiency of the algorithm can be further improved by configuring the Hadoop generic parameters flexibly according to the size and characteristics of the processing task.

Ensemble learning based on feature selection is a key technique for constructing the classification model of network writeprint. It first selects features to remove redundant and ineffective ones, then divides the feature set into feature subsets by an algorithm; each feature subset is assigned to an individual classifier for processing, and the results of the individual classifiers are combined to obtain the final classification model or classification result. When constructing the classification model with ensemble feature selection, the growing amount of data and low data processing efficiency are again problems. For this reason, this paper designs Hadoop_F_Ensemble based on MapReduce. The results show that with Hadoop_F_Ensemble the efficiency of building the classification model increases, system resources are utilized more fully, and performance improves further when the Hadoop generic parameters are adjusted. This shows that applying MapReduce to network writeprint recognition research is meaningful.
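The ensemble procedure described above (select features, split them into subsets, train one classifier per subset, combine the results) can be sketched in a few lines of Python. This is not Hadoop_F_Ensemble itself: the abstract does not specify the selection criterion, the partitioning algorithm, the base classifier, or the combination rule, so the variance filter, round-robin split, nearest-centroid classifier, majority vote, and toy data below are all assumptions chosen only to keep the example small.

```python
from collections import Counter
from statistics import mean, pvariance


def select_features(X, min_variance=1e-9):
    """Keep indices of features whose variance exceeds a threshold (assumed filter)."""
    return [j for j in range(len(X[0]))
            if pvariance([row[j] for row in X]) > min_variance]


def split_features(indices, k):
    """Partition the selected feature indices into k subsets, round-robin."""
    return [indices[i::k] for i in range(k)]


def train_centroids(X, y, subset):
    """Individual classifier: per-class centroid restricted to one feature subset."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append([row[j] for j in subset])
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_class.items()}


def predict_one(centroids, subset, row):
    """Predict the class whose centroid is nearest on this feature subset."""
    point = [row[j] for j in subset]

    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))

    return min(centroids, key=lambda label: dist(centroids[label]))


def train_ensemble(X, y, k=2):
    """Train one individual classifier per non-empty feature subset."""
    subsets = [s for s in split_features(select_features(X), k) if s]
    return [(s, train_centroids(X, y, s)) for s in subsets]


def predict(ensemble, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_one(c, s, row) for s, c in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: features 1 and 2 are constant and get filtered out.
X = [[1.0, 0.0, 5.0, 0.1],
     [0.9, 0.0, 5.0, 0.2],
     [0.1, 0.0, 5.0, 0.9],
     [0.0, 0.0, 5.0, 1.0]]
y = ["alice", "alice", "bob", "bob"]
ensemble = train_ensemble(X, y, k=2)
```

Each individual classifier is trained on its own feature subset independently of the others, so under a MapReduce design the training of the subsets could plausibly be distributed as separate tasks with the vote performed afterwards; whether the thesis divides the work this way is not stated in the abstract.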
Keywords/Search Tags: Network Writeprint, Ngram, MapReduce, Ensemble learning based on feature selection