Text Classification Based On TF-IDF Matrix And CapsNet

Posted on: 2019-12-20  Degree: Master  Type: Thesis
Country: China  Candidate: G Zhang  Full Text: PDF
GTID: 2428330626452414  Subject: Software engineering
Abstract/Summary:
Text classification is a classical topic in Natural Language Processing. It is also a complicated process that requires considerable professional knowledge of content filtering and feature extraction. With the rapid development of networks and multimedia, abundant data is transmitted over the network. This valuable textual data is full of information about users' reading behavior. In practice, textual data is usually classified into many data sets according to different needs, which makes it easier to extract deeper information in subsequent steps. Studying an effective text classification method therefore has both theoretical importance and practical applications. A text contains many words, some of which are weakly relevant and play little role in text categorization. TF-IDF (Term Frequency-Inverse Document Frequency) is an effective algorithm for analyzing word frequency in the field of information retrieval. In information retrieval and text mining, the TF-IDF weight, a statistical measure, is often used to calculate how important a word is to a document in a collection or corpus. The CNN (Convolutional Neural Network), a typical neural network structure, is very common in text classification. However, limitations in the BP (Back Propagation) of CNNs affect the classification results to a certain extent. To eliminate these drawbacks, this paper uses the dynamic routing between capsules in CapsNet (Capsule Network). The main work and innovations are as follows.
(1) According to the characteristics of textual data, an algorithm based on the TF-IDF matrix is proposed to remove weakly relevant vocabulary. The algorithm removes words that have little effect on text categorization in order to cut down the number of features. This helps reduce the size of the text embedding vector and improves algorithm efficiency.
(2) CapsNet is used for classification after the weakly relevant words have been removed from the text. Dynamic routing helps avoid the limitations of BP and thereby improves the accuracy of text classification.
(3) Experiments verify the effectiveness of the removal algorithm based on Term Frequency-Inverse Document Frequency. In addition, this paper summarizes and analyzes the deficiencies of the CNN-based text classification algorithm and discusses future prospects.
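The abstract does not state the exact rule used to decide which words are weakly relevant, so the following is only a minimal sketch of the general idea. It builds a TF-IDF matrix and drops every term whose highest TF-IDF weight across all documents falls below a cutoff. The function names (`tfidf_matrix`, `weak_words`, `remove_weak_words`), the "maximum weight below threshold" criterion, the `threshold` value, and the toy corpus are all assumptions made for illustration, not the thesis's actual algorithm.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[d][t] is the TF-IDF weight of
    vocab[t] in document d. Each document is a list of tokens."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for doc in docs for w in set(doc))
    # Plain (unsmoothed) IDF: terms present in every document get 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        matrix.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, matrix


def weak_words(docs, threshold):
    """Hypothetical criterion: a term is 'weak' if its maximum TF-IDF
    weight over all documents is below `threshold`."""
    vocab, matrix = tfidf_matrix(docs)
    return {
        w for j, w in enumerate(vocab)
        if max(row[j] for row in matrix) < threshold
    }


def remove_weak_words(docs, threshold):
    """Drop weak terms from every document, preserving token order."""
    weak = weak_words(docs, threshold)
    return [[w for w in doc if w not in weak] for doc in docs]


# Toy corpus (illustrative only).
docs = [
    "the match ended with a late goal".split(),
    "the market closed with a small gain".split(),
    "the team scored a goal in the match".split(),
]
filtered = remove_weak_words(docs, threshold=0.05)
```

In this toy corpus, words that occur in every document have an IDF of zero and are removed, while topic-bearing words such as "goal" and "market" survive. The shortened token lists would then be embedded and passed to the classifier, which is where the reduction in embedding size mentioned in point (1) would come from.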
Keywords/Search Tags: Weak Key Words, TF-IDF, Capsule Network, Text Classification