Parallel Multi-Label Text Classification Based On Word2vec

Posted on: 2019-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: S K Yan
Full Text: PDF
GTID: 2428330590465802
Subject: Computer technology
Abstract/Summary:
In recent years, the volume of data on the Internet has grown explosively with the rapid development of information technology. This data takes many forms, such as text, images and audio. Text uses fewer resources than the other forms, and a large share of online information is presented as text. To organize, manage and use this textual data effectively and to discover valuable information in it, automatic text classification based on machine learning has received much attention.

Supervised learning can be divided into two categories according to the number of labels per sample: single-label learning and multi-label learning. Multi-label text classification belongs to multi-label learning: each document may carry one or more labels. Multi-label learning has made great progress over the past ten years, but little of the literature addresses multi-label learning algorithms for text data, and the performance of multi-label text classification is still unsatisfactory. The main problems are as follows: (1) a high-dimensional feature space contains many redundant features; (2) a high-dimensional output space makes the learning task harder and raises the complexity of the multi-label learning algorithm, especially when the dimension reaches 100,000.

To address these problems, the work in this thesis is divided into two parts:

1. This thesis proposes a weighted multi-label k-Nearest Neighbor (wMLkNN) method for multi-label text classification, which introduces Word2vec into the classic multi-label classification algorithm ML-kNN. The method uses Word2vec to compute the correlation between features and labels. During training of the ML-kNN model, it increases the weight of features that are strongly associated with labels and reduces the weight of redundant features that are weakly associated with labels, in order to improve the accuracy of multi-label text classification.

2. This thesis also studies an MPI-based parallel ML-kNN algorithm. The method first modifies the distance formula in ML-kNN, without affecting the accuracy of the algorithm, so that it parallelizes more efficiently. It then parallelizes the modified ML-kNN algorithm with MPI to improve the efficiency of multi-label text classification. Notably, for high-dimensional text data, a parallelization scheme that also supports partitioning the dataset by features is more efficient than one that partitions the dataset by samples only.

Experiments on multi-label text datasets demonstrate the effectiveness and superiority of the proposed model.
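The two ideas above, feature weighting and partitioning by features, can be illustrated with a small sketch. It is not the thesis's implementation, and it rests on several assumptions: the weight vector `w` is hypothetical (in the thesis the weights come from Word2vec-based feature-label correlation, which is not reproduced here); the distance is taken to be a weighted squared Euclidean distance, on the guess that dropping the square root is the kind of accuracy-preserving change the abstract refers to; and a plain loop over feature slices stands in for MPI ranks. All function names are illustrative.

```python
from typing import List, Sequence, Tuple


def feature_slices(n_features: int, n_workers: int) -> List[Tuple[int, int]]:
    """Split feature indices into contiguous, near-equal (start, stop) slices."""
    base, extra = divmod(n_features, n_workers)
    slices, start = [], 0
    for rank in range(n_workers):
        stop = start + base + (1 if rank < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices


def partial_sq_dist(x: Sequence[float], y: Sequence[float],
                    w: Sequence[float], lo: int, hi: int) -> float:
    """Weighted squared distance restricted to features lo..hi-1.

    In an MPI setting (assumption) each rank would compute this for the
    feature slice it owns.
    """
    return sum(w[j] * (x[j] - y[j]) ** 2 for j in range(lo, hi))


def weighted_sq_dist(x: Sequence[float], y: Sequence[float],
                     w: Sequence[float], n_workers: int = 1) -> float:
    """Sum the per-slice partial distances.

    The sum plays the role of a reduction across workers. Because the
    squared distance is additive over features, the result does not
    depend on how the features are partitioned.
    """
    return sum(partial_sq_dist(x, y, w, lo, hi)
               for lo, hi in feature_slices(len(x), n_workers))


def k_nearest(query: Sequence[float], train: Sequence[Sequence[float]],
              w: Sequence[float], k: int, n_workers: int = 1) -> List[int]:
    """Return indices of the k training rows closest to `query`.

    Ranking by squared distance gives the same neighbors as ranking by
    the distance itself, since the square root is monotonic.
    """
    dists = [(weighted_sq_dist(query, row, w, n_workers), i)
             for i, row in enumerate(train)]
    return [i for _, i in sorted(dists)[:k]]


if __name__ == "__main__":
    train = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 3], [2, 2, 2, 2]]
    query = [0, 0, 0, 0]
    # Hypothetical weights: feature 0 is boosted, feature 3 is treated as redundant.
    print(k_nearest(query, train, w=[2, 1, 1, 0], k=2, n_workers=3))
    # Uniform weights for comparison.
    print(k_nearest(query, train, w=[1, 1, 1, 1], k=2, n_workers=2))
```

With the hypothetical weights, the third training row becomes a nearest neighbor because its only difference from the query lies in the down-weighted feature; with uniform weights it does not. The neighbor sets found this way would then feed the usual ML-kNN label-counting step, which is omitted here.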
Keywords/Search Tags: multi-label learning, multi-label text classification, ML-kNN, Word2vec