Research On Text Classification Method Based On Feature Vector Construction

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q Gu
GTID: 2428330596479671
Subject: Computer software and theory
Abstract/Summary:
Text is a source of diverse information, but because it is unstructured, extracting insights from it takes considerable effort and is relatively difficult. Text categorization, a classic topic in natural language processing, is the process of assigning predefined labels or categories to text based on its content. Neural networks, a research trend in the massive-data environment, offer an automated method of predictive analysis. However, text representations in neural-network-based models tend to be highly sparse, and classification models often perform poorly in specific situations. To address these problems, this paper carries out the following research:

(1) Text representation. The GloVe model includes a large number of irrelevant words when training word vector representations. To address this, this paper proposes a word vector weighting model based on WT-GloVe. First, feature words are extracted with a feature weighting algorithm based on word spacing and inter-class contribution degree. Second, to address this shortcoming of the GloVe model, a method for filtering irrelevant words is proposed to improve the quality of word vector training. Finally, the feature weighting algorithm based on word spacing and inter-class distribution is combined with GloVe after irrelevant-word filtering to generate a weighted word vector model that captures both the importance and the semantic information of feature words. Compared with other models in the same environment, the WT-GloVe-based word vector weighting model effectively improves classification performance.

(2) Text classification. When the fastText model is applied to Chinese text classification, the word information obtained by its sub-word embedding method brings little benefit, and a large number of redundant terms are generated. To address this, this paper proposes a text classification model based on STL-fastText. First, building on the TF-IDF algorithm, a correlation-based low-frequency word weighting algorithm is proposed. Second, the whole corpus is used as the input of an LDA model, which performs topic analysis on the text content to learn the distribution of its topic words; the result is supplemented with the low-frequency, highly discriminative features. Finally, the dictionary at the input layer of the fastText model is reconstructed, and the new dictionary built from these features is used as the input of the hidden layer, completing the construction of the STL-fastText model. Compared with other models in the same environment, the experimental results show that the STL-fastText-based text classification model effectively improves the classification of Chinese short texts.
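As a rough illustration of the general idea behind a weighted word vector model, the sketch below scales each word's embedding by an importance weight before averaging. It is not the WT-GloVe algorithm from the thesis: plain TF-IDF stands in for the feature weighting based on word spacing and inter-class contribution, and a tiny hand-written embedding table stands in for trained GloVe vectors. The function names `tfidf_weights` and `weighted_doc_vector` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    Assumption: plain TF-IDF is used here as a stand-in for the
    thesis's own feature weighting algorithm, which is not specified
    in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


def weighted_doc_vector(doc_weights, embeddings, dim):
    """Average word embeddings, weighting each word by its importance.

    Words with zero weight or without an embedding are skipped, which
    loosely mirrors the idea of filtering out irrelevant words.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        vec = embeddings.get(word)
        if vec is None or weight == 0.0:
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy data: hypothetical 2-dimensional embeddings, not real GloVe vectors.
docs = [["cat", "sat"], ["dog", "sat"]]
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, -1.0], "sat": [0.0, 1.0]}

weights = tfidf_weights(docs)
doc_vectors = [weighted_doc_vector(w, embeddings, 2) for w in weights]
```

In this toy corpus "sat" appears in every document, so its TF-IDF weight is zero and it does not contribute to either document vector. A classifier would then be trained on `doc_vectors` instead of on an unweighted average.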
Keywords/Search Tags: Neural network, Text classification, TF-IDF, WT-GloVe, STL-fastText