
Text Classification Method Based On The Longest Closed Frequent Sequential Patterns

Posted on: 2018-06-06    Degree: Master    Type: Thesis
Country: China    Candidate: Y X Chi    Full Text: PDF
GTID: 2348330515974732    Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, it is a great challenge to screen useful information out of mass content as text information grows explosively. There are two families of approaches in the field of text classification: term-based methods and pattern-based methods. Term-based approaches are favored for their simplicity, but problems with synonymy, polysemy and noisy information restrict their performance. Pattern-based approaches, by contrast, can address these problems by exploiting more useful information. This paper proposes the algorithm Text Classification Method Based on the Longest Closed Frequent Sequential Patterns (TCMLCFSP). The algorithm tackles the problems arising at each stage of text classification based on the longest closed frequent sequential patterns, including data preprocessing, longest closed frequent sequential pattern mining, feature selection and text classification. The main work is as follows.

(1) Text Data Preprocessing Based on Term Frequency Statistics Rules

Feature terms in text mining clearly face the challenge of being high-dimensional and sparse, so careful data preprocessing before mining is crucial. To improve the preprocessing stage, the algorithm Data Preprocessing Based on Term Frequency Statistics Rules (DPTFSR) is proposed. First, an expression for terms with identical frequency is derived from Zipf's Law and the Rule of Maximum Area. Next, the distribution regularities of terms with identical frequency are explored; we find that low-frequency terms account for up to two thirds of the terms in the documents, yet there is little relevance among them. Finally, the data are processed according to these term frequency statistics, removing low-frequency terms and reducing the feature dimension. The term frequency distribution regularities are verified on the Reuters-21578 and 20-Newsgroups data sets. Experimental results show that running time is shortened markedly while classification performance is preserved, so the efficiency of text mining is significantly enhanced.

(2) Longest Closed Frequent Sequential Patterns Mining Method

Term-based methods face many problems that pattern-based methods can solve, so pattern-based methods are superior to term-based ones. Because a large number of mined patterns have low usefulness, extracting the most discriminative and representative patterns from the large-scale pattern set becomes the focus of this study. The algorithm Longest Closed Frequent Sequential Patterns Mining Method (LCFSPMM) is proposed. First, a pruning model removes noise patterns. Next, a frequent sequential pattern extension model and a frequent sequential pattern suffix-set extraction model are introduced and used to mine the longest frequent sequential patterns. Third, the algorithm Filtration of Redundancy for Frequent Patterns Based on Inclusion Degree Theory (FRFPIDT) is presented; it measures the similarity of frequent patterns by inclusion degree and removes sub-patterns and cross-patterns with a high similarity degree, and cutting out these redundant patterns improves the performance of frequent pattern mining. Finally, the non-redundant longest closed frequent sequential patterns are mined. Experimental results on the Reuters-21578 and 20-Newsgroups data sets confirm the effectiveness of the method.
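The abstract does not spell out LCFSPMM in detail, so the following is only a minimal, brute-force Python sketch of the last two steps described above: keeping the closed frequent sequential patterns (those with no frequent super-sequence of equal support) and then retaining only the longest ones. The enumeration strategy, the min_support and max_len parameters, and the function names are illustrative assumptions, not the thesis's pruning, extension and suffix-set models.

```python
from itertools import combinations

def is_subsequence(sub, seq):
    """True if `sub` occurs in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(term in it for term in sub)

def support(pattern, documents):
    """Number of documents containing `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, doc) for doc in documents)

def longest_closed_frequent_patterns(documents, min_support=2, max_len=4):
    """Brute-force sketch: enumerate frequent subsequences, keep the closed
    ones (no frequent super-sequence with equal support), then keep only
    those of maximal length."""
    frequent = {}
    for doc in documents:
        for length in range(1, min(max_len, len(doc)) + 1):
            for positions in combinations(range(len(doc)), length):
                cand = tuple(doc[i] for i in positions)
                if cand not in frequent:
                    s = support(cand, documents)
                    if s >= min_support:
                        frequent[cand] = s
    closed = {p: s for p, s in frequent.items()
              if not any(len(q) > len(p) and t == s and is_subsequence(p, q)
                         for q, t in frequent.items())}
    longest = max(len(p) for p in closed)   # assumes at least one closed pattern
    return {p: s for p, s in closed.items() if len(p) == longest}

# Toy usage on three tiny "documents" (term sequences after preprocessing).
docs = [["data", "mining", "text", "classification"],
        ["text", "data", "mining", "method"],
        ["data", "mining", "method", "text"]]
print(longest_closed_frequent_patterns(docs))
```

A practical miner would use a PrefixSpan-style search rather than exhaustive enumeration; the sketch only fixes what "closed" and "longest" mean in this context.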
(3) Text Feature Selection Based on the Longest Closed Frequent Sequential Patterns

Searching for text features is one of the critical issues in the field of text mining. Ensuring the quality of the features mined from texts is a great challenge because of the presence of large-scale words and patterns, and pattern mining faces the further problem that long patterns usually have a low matching degree, which makes it hard to select and use useful patterns. The algorithm Text Feature Selection Based on the Longest Closed Frequent Sequential Patterns (TFSLCFSP) is introduced. First, the Feature Words in the Longest Closed Frequent Sequential Patterns Weighting Model (FWLCFSPWM) is proposed, which overcomes the low matching degree of long patterns: it calculates the support degree of terms from their distribution regularities in the longest closed frequent sequential patterns, and this support serves as the initial weight for feature selection. Then the Specificity-based Feature Words Classification and Weight Update Model (SFWCWUM) is proposed: a specificity function is defined from the distribution regularities of terms over the positive and negative document sets; terms are divided into two groups, positive and ordinary, according to this function; and the feature weights are finally updated based on specificity. In the end, a feature set and the corresponding weights are obtained from the longest closed frequent sequential patterns. Experimental results on the data sets indicate that TFSLCFSP is superior to widely used feature selection methods.

(4) Text Classification Method Based on the Longest Closed Frequent Sequential Patterns

Similarity measurement between documents is a widely used means in text classification. The algorithm Text Classification Method Based on the Longest Closed Frequent Sequential Patterns (TCMLCFSP) is put forward. First, the algorithm Text Similarity Measure Based on the Inexistence-Existence of the Feature Words (TSMIEFW) is proposed, which measures the similarity of documents through the subjection relationship between feature words and documents. TSMIEFW divides words into total subjection word sets, partial subjection word sets and none subjection word sets according to this relationship, and defines a subjection function over the three word sets. Words in the total subjection set belong to both of the two documents, and their subjection degree decreases as the difference between their weights grows. Words that belong to only one of the two documents form the partial subjection set, whose subjection degree is a fixed value. The subjection degree of the none subjection set is zero, because its words belong to neither document. Between two documents of the same category, total subjection feature words greatly outnumber partial subjection feature words, and higher similarity between two documents corresponds to more similar feature weights; on the contrary, between documents of different categories, partial subjection feature words outnumber total subjection feature words, and the weight differences among the total subjection feature words are obvious. Second, TSMIEFW is extended to measure the similarity between a document and a document set. Third, the features selected by TFSLCFSP are applied in TSMIEFW. Finally, each document is assigned to the category with the highest similarity degree. Experimental results on the data sets indicate that the proposed approaches outperform both classical and recent methods.
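The subjection functions themselves are not given in this abstract, so the fragment below is only a hypothetical sketch of how a three-way split into total, partial and none subjection word sets might be turned into a similarity score. The exponential decay with weight difference, the fixed partial_value for partial subjection words, the normalisation, and all names are assumptions for illustration, not the TSMIEFW definitions.

```python
import math

def subjection_similarity(doc_a, doc_b, partial_value=0.1):
    """Illustrative similarity between two documents given as
    {feature_word: weight} dictionaries, following a three-way split
    into total / partial / none subjection word sets."""
    words_a, words_b = set(doc_a), set(doc_b)
    total_set = words_a & words_b      # words present in both documents
    partial_set = words_a ^ words_b    # words present in exactly one document
    # Words in neither document ("none subjection") contribute 0,
    # so they are simply not enumerated here.

    score = 0.0
    for w in total_set:
        # Subjection degree decreases as the weight difference grows.
        score += math.exp(-abs(doc_a[w] - doc_b[w]))
    # Partial subjection words each contribute a fixed, small value.
    score += partial_value * len(partial_set)

    # Normalise so documents with many features are not unduly favoured.
    return score / max(len(words_a | words_b), 1)

# Toy usage: feature weights as they might come from a TFSLCFSP-style selector.
d1 = {"data": 0.9, "mining": 0.8, "pattern": 0.4}
d2 = {"data": 0.85, "mining": 0.75, "classification": 0.5}
d3 = {"football": 0.9, "match": 0.7, "data": 0.1}
print(subjection_similarity(d1, d2))   # same-topic pair: higher score
print(subjection_similarity(d1, d3))   # different-topic pair: lower score
```

In a classifier built this way, each document would be compared against each category's document set (the extension mentioned above) and assigned to the category with the highest score.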
The novelty of this paper lies in the following aspects. The algorithm Data Preprocessing Based on Term Frequency Statistics Rules (DPTFSR) reduces the feature dimension effectively. The algorithm Longest Closed Frequent Sequential Patterns Mining Method (LCFSPMM) improves the performance of frequent pattern mining by removing redundant patterns. The algorithm Text Feature Selection Based on the Longest Closed Frequent Sequential Patterns (TFSLCFSP) takes both the distribution regularities and the specificity of feature words into account, which makes the relationships among documents clearer. The algorithm Text Similarity Measure Based on the Inexistence-Existence of the Feature Words (TSMIEFW) assigns different contribution grades according to the membership relationship between features and documents, so the partition of feature categories is more reasonable and classification accuracy is clearly improved. Feeding the feature words selected by TFSLCFSP into TSMIEFW highlights the relevance between feature words and documents, distinguishes document categories more clearly, and further improves classification accuracy.
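To make the "distribution regularities plus specificity" idea in this recap concrete, here is a small hypothetical sketch in which a term's specificity is a smoothed log-ratio of its document frequency in the positive versus the negative document set, terms above a threshold form the positive group, and only their weights are boosted. The formula, the threshold and the boost factor are assumptions for illustration, not the SFWCWUM definitions.

```python
import math
from collections import Counter

def specificity_scores(positive_docs, negative_docs):
    """Illustrative specificity: smoothed log-ratio of a term's document
    frequency in the positive set versus the negative set."""
    pos_df = Counter(w for doc in positive_docs for w in set(doc))
    neg_df = Counter(w for doc in negative_docs for w in set(doc))
    scores = {}
    for w in set(pos_df) | set(neg_df):
        p = (pos_df[w] + 1) / (len(positive_docs) + 2)   # smoothed positive rate
        n = (neg_df[w] + 1) / (len(negative_docs) + 2)   # smoothed negative rate
        scores[w] = math.log(p / n)
    return scores

def update_weights(initial_weights, scores, threshold=0.0, boost=2.0):
    """Split terms into 'positive' (specificity above threshold) and 'ordinary'
    groups, then boost the weights of the positive group only."""
    return {w: weight * boost if scores.get(w, 0.0) > threshold else weight
            for w, weight in initial_weights.items()}

# Toy usage: initial weights would come from pattern-based support in practice.
pos = [["data", "mining", "pattern"], ["pattern", "mining", "text"]]
neg = [["football", "match", "data"], ["football", "goal"]]
scores = specificity_scores(pos, neg)
print(update_weights({"pattern": 0.4, "data": 0.5, "football": 0.3}, scores))
```
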
Keywords/Search Tags: data mining, text classification, data pre-processing, frequent pattern, feature selection, similarity measure