
Online Writeprint Identification Based On Ensemble Feature Selection

Posted on: 2012-09-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Sun
Full Text: PDF
GTID: 1228330368480748
Subject: Education Technology
Abstract/Summary:
The ubiquity and anonymity of the Internet are diminishing users’ sense of responsibility, and the Internet increasingly faces a crisis of security and credibility. Internet users can be identified individually by analyzing the textual identity cues they leave behind in their online messages, i.e., their online writeprint.

From a machine learning point of view, writeprint identification can be seen as a multi-class, single-label text classification problem. This study addresses two major issues for Chinese online messages: feature extraction and the identification technique. Character N-gram features, an ensemble feature selection technique, and a dynamic ensemble selection strategy are employed to enhance performance and to improve the scalability and interpretability of the identification model.

In view of the characteristics of the Chinese language, variable-length character N-gram features are applied in this study. The performance of character N-gram features and the optimal value of N in the Chinese language context are investigated through experiments. A Three-stage Tandem Combined N-grams Extract Method (TTCNEM) is then presented to deal with the high dimensionality, redundancy, and sparsity of character N-gram features. Its three stages are dimension reduction based on feature frequency and distribution information, redundancy removal using the LocalMaxs algorithm, and sparsity reduction based on the individual-level feature set.
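As a rough illustration of variable-length character N-gram extraction with frequency-based pruning, the following Python sketch counts all contiguous character N-grams in a set of messages and keeps those that pass two thresholds. It is not the TTCNEM procedure itself: it covers only a simplified version of the first stage, and the function names, the range of N, and the thresholds `min_count` and `min_docs` are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every contiguous character n-gram with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def extract_features(messages, n_min=1, n_max=3, min_count=2, min_docs=2):
    """Keep n-grams that are frequent enough and spread over enough messages.

    min_count and min_docs are illustrative thresholds (assumptions),
    standing in for the frequency and distribution criteria.
    """
    total = Counter()     # occurrences over all messages
    doc_freq = Counter()  # number of messages containing the n-gram
    for msg in messages:
        grams = list(char_ngrams(msg, n_min, n_max))
        total.update(grams)
        doc_freq.update(set(grams))
    return {
        g: total[g]
        for g in total
        if total[g] >= min_count and doc_freq[g] >= min_docs
    }


# Toy messages, invented for this example.
messages = ["今天天气好", "今天很忙", "天气不错"]
features = extract_features(messages)
```

On the toy messages, only N-grams that occur at least twice and in at least two messages survive, which shows how a frequency and distribution filter shrinks the feature space before any redundancy removal is attempted.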
Lastly, based on short words, the most common form of words in the Chinese language, a non-contiguous character N-gram feature is proposed to represent an author’s writing style, and an integrated feature extraction strategy is carried out that includes both contiguous and non-contiguous character N-gram features.

For online writeprint identification model construction, the individual-level feature set type and an identification technique based on ensemble feature selection are applied. The aim is to train as many base classifiers as there are authors to be identified, and to give each base classifier locally optimal identification performance on its corresponding author. Giving priority to performance and effectiveness, Hybrid Genetic Algorithm based Ensemble Feature Selection (HGAEFS) and Semi Random Subspace based Ensemble Feature Selection (SRSEFS) are presented. HGAEFS builds on the simple genetic algorithm framework and uses feature weight information from the individual author-level feature set to guide the search process: one chromosome in the initial population is recoded, and the crossover and mutation operators are modified using heuristics from the feature weight information. The fitness function of HGAEFS is designed according to the diversity theory of ensemble learning, and the diversity among base classifiers is measured by the Kappa statistic. In SRSEFS, Kuncheva’s probability model is first used to determine the size of the feature subsets and the number of important features. Feature weight information based on the individual-level feature set is then used to guide the partition of the feature space, turning random subspace partition into semi-random subspace partition.

On the basis of the models built by the HGAEFS and SRSEFS algorithms, Hybrid Dynamic Selection based on Oracle (HDSORA) is presented to apply a dynamic selective ensemble strategy based on local performance evaluation.
HDSORA incorporates two types of dynamic strategy, Dynamic Classifier Selection (DCS) and Dynamic Ensemble Selection (DES). First, K-Nearest Neighbor (KNN) and Behavior Knowledge Space (BKS) are combined to determine the local neighborhood of the test sample in the feature space. Then either DCS or DES is chosen according to the credibility of the optimal local classifier. Oracle and Local Class Accuracy (LCA) are applied in the DES ensemble process, so that class information is effectively used in choosing the base classifier subset.

To verify the methods above, data from 50 users was collected from a campus bulletin board system. Topic, time, and other elements were eliminated from the data set to reduce their interference with writing style recognition. The experimental results are as follows:
(1) The character N-gram feature is effective in Chinese online writeprint identification, with performance equivalent to that of the basic feature set. The optimal value of N is 2 for fixed-length character N-gram features. TTCNEM is effective in dimension reduction and sparsity reduction without diminishing identification performance. The non-contiguous character N-gram feature enriches feature information and improves identification performance.
(2) Compared with SVM and EDS, two typical techniques in the current online writeprint identification area, HGAEFS obtains a significant performance enhancement, and SRSEFS performs better than SVM when the number of authors is relatively large. Moreover, HGAEFS and SRSEFS improve the scalability of the identification model. HGAEFS is better than SRSEFS in terms of interpretability.
(3) Compared with SMV (Simple Majority Voting), DCS, DES, and several other typical ensemble integration methods, HDSORA improves identification performance and, to some extent, enhances the scalability and interpretability of the identification model.

Finally, the methods proposed in this study are applied to the problem of online behavior subject identification in a National Key Technology R&D Program of the 12th Five-Year Plan, and a prototype system has also been developed and implemented.
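To make the idea of dynamic classifier selection based on local performance more concrete, here is a minimal Python sketch of the generic approach: find the K nearest validation samples to a test sample, score each base classifier on that neighborhood, and pick the locally best one. This is not HDSORA. It leaves out the BKS step, the credibility-based switch between DCS and DES, and the Oracle and LCA criteria. The function name, the Euclidean distance, and the toy classifiers and data are all assumptions made for the example.

```python
import math


def select_by_local_accuracy(classifiers, val_X, val_y, x, k=3):
    """Return (index of the locally best classifier, per-classifier local accuracy).

    classifiers: callables mapping a feature tuple to a label (hypothetical stand-ins
    for trained base classifiers).
    val_X, val_y: labelled validation samples used to estimate local performance.
    """
    # Indices of the k validation samples closest to x (Euclidean distance).
    neighbors = sorted(range(len(val_X)), key=lambda i: math.dist(val_X[i], x))[:k]
    scores = []
    for clf in classifiers:
        hits = sum(1 for i in neighbors if clf(val_X[i]) == val_y[i])
        scores.append(hits / len(neighbors))
    # Ties are resolved in favour of the classifier with the lowest index.
    best = max(range(len(classifiers)), key=scores.__getitem__)
    return best, scores


# Toy setup, invented for this example: two trivial classifiers, 1-D features.
def always_a(features):
    return "a"


def always_b(features):
    return "b"


val_X = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
val_y = ["a", "a", "a", "b", "b", "b"]

near_a = select_by_local_accuracy([always_a, always_b], val_X, val_y, (0.05,))
near_b = select_by_local_accuracy([always_a, always_b], val_X, val_y, (1.05,))
```

In the toy setup a different classifier is selected depending on where the test sample falls, which is the behavior a local performance evaluation is meant to provide.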
Keywords/Search Tags: Online Writeprint, Ensemble Feature Selection, Character N-gram, Genetic Algorithm, Dynamic Ensemble Selection