
Dynamic Weighting Of Word Embedding And Distributed Learning Strategies

Posted on: 2019-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Xu
Full Text: PDF
GTID: 2348330542487541
Subject: Computer technology
Abstract/Summary:
The increasing popularity and rapid development of the Internet have generated large-scale unstructured data, of which text data is an important category. How to mine effective information from massive text data is currently an important research topic. Within text mining, text classification is a central problem with a wide range of application scenarios.

Text classification is a challenging problem. First, the traditional text representation model treats a document as a collection of words and uses one-hot encoding over a feature dictionary to represent each document as a high-dimensional sparse vector; this sparsity limits classification accuracy. Second, most traditional text classification approaches directly apply basic machine learning classifiers, and the results are limited because the models are too simple. To address these challenges, this thesis first proposes a new text representation model and then, based on it, constructs an efficient text classification model using classifier integration, achieving more accurate text classification.

This thesis studies the principles and steps of text classification. First, the Chinese documents are preprocessed, including word segmentation and stop-word removal. Then feature selection is performed on the words. Based on the selected feature words, a text representation model using dynamically weighted Word2Vec word vectors is proposed, which fully accounts for the differing importance of semantic features across documents. On top of this representation model, the thesis presents a text classification algorithm based on classifier integration, which improves the accuracy of text classification. In the proposed text representation model, the Word2Vec algorithm is used to convert each
document feature word into a fixed-dimension word vector. Then the TF-IDF value of each feature word in the document is computed and used as that word's weight, and the word vectors of all feature words are combined under this dynamic weighting. The resulting representation makes full use of the semantic information carried by different feature words in different documents, together with the word vectors themselves, to represent documents effectively. The experimental results show that the proposed Word2Vec-based representation describes text features better than traditional representations.

Combining the idea of ensemble classification with the proposed representation, this thesis puts forward a new SVM integration method built on the dynamically weighted word vector model via the Bagging algorithm, and compares ensembles of different base classifiers against a single SVM classifier. The experiments show that the classifier integration algorithm based on dynamically weighted word vectors classifies efficiently, and the optimal number of base classifiers is also determined experimentally.

Finally, the above models are validated on a real dataset of WeChat public account articles. A distributed text classification system for WeChat public articles is also designed and implemented; its functions include crawling WeChat public articles, automatically labeling documents, and classifying texts, applying text categorization to real-world scenarios.
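The dynamically weighted representation described above can be sketched as follows, assuming the Word2Vec vectors and TF-IDF weights have already been computed; the toy vectors, weights, and vector dimension here are purely illustrative, not taken from the thesis:

```python
import numpy as np

def document_vector(doc_words, word_vectors, tfidf):
    """TF-IDF-weighted average of the Word2Vec vectors of a document's feature words."""
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = np.zeros(dim)
    weight_total = 0.0
    for w in doc_words:
        if w in word_vectors and w in tfidf:
            # Each word vector is scaled by that word's TF-IDF value (the dynamic weight).
            weighted_sum += tfidf[w] * np.asarray(word_vectors[w])
            weight_total += tfidf[w]
    return weighted_sum / weight_total if weight_total > 0 else weighted_sum

# Toy 3-dimensional word vectors and TF-IDF weights for illustration.
vectors = {"sports": [1.0, 0.0, 0.0], "finance": [0.0, 1.0, 0.0]}
weights = {"sports": 0.8, "finance": 0.2}
vec = document_vector(["sports", "finance"], vectors, weights)
```

Each document thus becomes a single dense vector of the same dimension as the word vectors, with semantically important words (high TF-IDF) contributing more to it.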
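The Bagging-based SVM integration can be sketched as below, using scikit-learn's SVC as the base classifier; the synthetic two-cluster data and the number of base classifiers are illustrative assumptions (the thesis determines the optimal number of base classifiers experimentally):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class "document vectors": two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

def bagging_svm(X, y, n_estimators=7):
    """Train SVM base classifiers on bootstrap resamples (the Bagging step)."""
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), len(X))  # sample with replacement
        models.append(SVC(kernel="linear").fit(X[idx], y[idx]))
    return models

def predict(models, X):
    """Majority vote over the base classifiers' predictions."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

models = bagging_svm(X, y)
accuracy = (predict(models, X) == y).mean()
```

Bootstrap resampling gives each base SVM a slightly different view of the training set, so the majority vote is more robust than any single classifier trained on all the data at once.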
Keywords/Search Tags: Text Classification, Text Representation, Word Embedding, Classifier Integration