Research On Text Classification Method Based On Transfer Learning

Posted on: 2020-12-21   Degree: Master   Type: Thesis
Country: China   Candidate: L Chen   Full Text: PDF
GTID: 2428330572473715   Subject: Information security
Abstract/Summary:
With the rapid development of mobile internet technology, text information has grown explosively. This growth has driven rapid progress in information security research based on text, including mail filtering, network security event tracking, and network public opinion analysis. The passage of time, changes in data acquisition conditions, and other factors cause the data distribution to shift continuously. As a result, traditional machine learning, which assumes that "training and test data share a common feature space and the same data distribution", is severely limited in real-world scenarios. Transfer learning lets a system adapt to learning tasks whose data distribution changes by appropriately processing data, models, and so on, and thus better handles text classification tasks in practical information security applications.

Depending on whether the target domain and the source domain share the same feature space, transfer learning is divided into homogeneous transfer learning and heterogeneous transfer learning. This paper focuses on how heterogeneous transfer learning can extract related knowledge across heterogeneous domains in order to improve learning of text classification tasks in the target domain.

First, this paper proposes a feature space construction method based on semantic relevance. word2vec is used to build vectors for the feature vocabulary of the two domains, and the similarity between features is measured by the cosine similarity of their vectors. The homogeneous feature space consists of the feature words that the two domains have in common, together with the feature pairs whose similarity exceeds a preset threshold. This method increases classification accuracy by 3% on average.

Second, this paper proposes a locality-preserving segmented heterogeneous projection algorithm. The traditional linear discriminant analysis projection reduces the feature dimension to (the number of classification categories - 1), so applying it to text causes a large loss of useful information. With segmented projection, the feature dimension after projection is controllable, which better suits the text field and its high-dimensional feature space. Introducing a locality preserving projection algorithm solves the problem that the projection destroys the local structure between samples. This paper also proposes weighting the labeled samples of the target domain to compensate for the excessively large difference in proportion between the two domains. The experimental results show that the classification accuracy obtained by this method is nearly 10% higher than that of the traditional machine learning method, and that it also improves significantly on the projection method based on the original linear discriminant analysis algorithm.

Finally, this paper combines the two contributions above and expands the sources of training data in the target domain by connecting to an Internet API, forming a text classification system based on heterogeneous transfer learning that can be applied in real scenarios.
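
The feature space construction described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy three-dimensional vectors (standing in for word2vec embeddings), and the threshold of 0.8 are all assumptions made for illustration, and the sketch assumes both vocabularies are embedded in the same vector space.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def build_feature_space(source_vecs, target_vecs, threshold=0.8):
    """Return (shared_words, matched_pairs).

    shared_words: feature words that appear in both domain vocabularies.
    matched_pairs: (source_word, target_word, similarity) for words that are
    not shared but whose cosine similarity exceeds the threshold.
    """
    shared = sorted(set(source_vecs) & set(target_vecs))
    source_only = sorted(set(source_vecs) - set(shared))
    target_only = sorted(set(target_vecs) - set(shared))

    pairs = []
    for s_word in source_only:
        for t_word in target_only:
            sim = cosine_similarity(source_vecs[s_word], target_vecs[t_word])
            if sim > threshold:
                pairs.append((s_word, t_word, sim))
    return shared, pairs


# Hypothetical toy vectors; real word2vec vectors would have far more dimensions.
source_vecs = {
    "virus": np.array([1.0, 0.0, 0.0]),
    "email": np.array([0.0, 1.0, 0.0]),
    "attack": np.array([0.9, 0.1, 0.0]),
}
target_vecs = {
    "virus": np.array([1.0, 0.0, 0.0]),
    "malware": np.array([0.95, 0.05, 0.0]),
    "forum": np.array([0.0, 0.0, 1.0]),
}

shared, pairs = build_feature_space(source_vecs, target_vecs, threshold=0.8)
print("shared features:", shared)
print("matched pairs:", [(s, t, round(sim, 3)) for s, t, sim in pairs])
```

In this toy data, "virus" is shared by both vocabularies, while "attack" and "malware" are paired because their vectors point in almost the same direction. The nested loop is quadratic in vocabulary size; for real vocabularies a vectorized similarity matrix would be the more practical choice.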
Keywords/Search Tags: Heterogeneous transfer learning, Feature space construction, Dirichlet projection, Locality preserving