Text Classification Method Based On Maximum Frequent Sequence Pattern

Posted on: 2019-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: C J Li
Full Text: PDF
GTID: 2348330542955287
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the big data era, the volume of text data is growing explosively, and processing the information it carries costs considerable time and effort. Managing this information calls for attention to each stage of text classification: text preprocessing, text feature selection/extraction, and classification itself. Classical text preprocessing methods handle text data directly, which largely restricts the performance of the algorithm. They generally treat terms as the basic units, so the relationships between terms are not taken into account. Traditionally, feature selection and feature extraction are carried out separately, so original features and new features cannot be considered at the same time. Most text classification models, which follow rules such as “term-based”, “theme-based” and “emotion-based”, divide documents into two categories, positive and negative, and ignore the fuzziness of the classification boundary.

To address these problems, this thesis makes the following contributions. First, it introduces a term frequency statistics method that explores the term distribution using Zipf's Law and Butz's Law and keeps terms with a suitable term frequency, with the aim of improving the performance of traditional text preprocessing. Second, it presents a text preprocessing method that takes patterns, rather than single terms, as the basic units, based on the theory of frequent itemsets. This method captures the association between terms: it removes useless information and also obtains frequent patterns that carry more useful information. Third, it proposes a way to produce compound features, which contain both original features and new features, by combining feature selection with feature extraction. Finally, it proposes a text classification technique built on the centroid solution and three-way decision to handle the fuzziness of the classification boundary. The technique determines the categories of documents in the main area and in the boundary area, which completes the classification.

The thesis studies text classification based on maximum frequent sequential patterns. Four algorithms are put forward: Term Statistical Method based on Law of Term Distribution (TSMLTD), Pattern Mining Method based on Maximum Frequent Sequence (PMMMFS), Feature Selection Method based on Maximum Frequent Sequential Pattern (FSMMFSP) and Text Classification Method based on Maximum Frequent Sequential Pattern (TCMMFSP). Maximum frequent sequential patterns are used to perform feature selection based on semi-features and R-features, and to perform text classification based on the centroid solution and three-way decision. Together the four algorithms cover the stages of the text classification process: term statistics, pattern mining, feature selection and text classification.

Docker and PaddlePaddle are used as the experiment environment. A large number of experiments are carried out on 20Newsgroups, Reuters-21578 and RCV1. A variety of comparison models are included, with accuracy, precision, recall and F1 as the evaluation criteria. The experimental results show that the new approaches outperform the conventional models in terms of accuracy, precision, recall and F1.

The four methods are summarized as follows:

(1) Term Statistical Method based on Law of Term Distribution (TSMLTD). The method discovers the law of term distribution and chooses terms with an appropriate term frequency, together with data cleaning.

(2) Pattern Mining Method based on Maximum Frequent Sequence (PMMMFS). The method uses frequent itemset algorithms to obtain frequent sequential patterns, and extracts maximum frequent sequential patterns according to the requirements for maximum frequent itemsets.

(3) Feature Selection Method based on Maximum Frequent Sequential Pattern (FSMMFSP). The method combines feature selection and feature extraction in order to define semi-features and R-features, and generates compound features based on maximum frequent sequential patterns.

(4) Text Classification Method based on Maximum Frequent Sequential Pattern (TCMMFSP). The method computes four centers over all documents using the centroid solution, and then computes the distance from each document to the four centers. Comparing the distance with a threshold decides the category of a single document. Following three-way decision, the fuzziness of the classification boundary is taken into account, which supports the judgment of document categories in the boundary area. Combining the two processes accomplishes text classification.
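The classification step described for TCMMFSP can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not define its four centers, its distance measure or its thresholds, so the sketch assumes one centroid per class, Euclidean distance, and a single symmetric margin `threshold`. All function names and the toy vectors are hypothetical. Documents whose two centroid distances differ by less than the margin are placed in a boundary region instead of being forced into a class, which is the core idea of three-way decision.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance (an assumed choice; the abstract names no metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def three_way_classify(doc, pos_center, neg_center, threshold):
    """Return 'positive', 'negative' or 'boundary' for one document vector.

    `threshold` is a hypothetical margin: the document is accepted into a
    class only if it is closer to that class centroid by more than the
    margin; otherwise the decision is deferred to the boundary region.
    """
    margin = euclidean(doc, neg_center) - euclidean(doc, pos_center)
    if margin > threshold:
        return "positive"
    if margin < -threshold:
        return "negative"
    return "boundary"


# Toy 2-D feature vectors, made up purely for illustration.
pos_docs = [[1.0, 0.0], [0.8, 0.2]]
neg_docs = [[0.0, 1.0], [0.2, 0.8]]
pos_center = centroid(pos_docs)
neg_center = centroid(neg_docs)

test_docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
labels = [
    three_way_classify(d, pos_center, neg_center, threshold=0.3)
    for d in test_docs
]
print(labels)
```

In this sketch the third document is equidistant from both centroids, so it falls into the boundary region. A full system would then apply a second, finer judgment to boundary documents, as the abstract describes, which this example leaves out.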
Keywords: term frequency, frequent pattern, semi-feature, R-feature, centroid solution, three-way decision, text classification