Research On Text Representation And Feature Extraction Methods Based On Conditional Co-occurrence Degree

Posted on: 2019-05-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Wei
Full Text: PDF
GTID: 1368330545469077
Subject: Management Science and Engineering
Abstract/Summary:
As the main carrier of information, textual data is an important source of information and knowledge. How to obtain the information and knowledge people need from massive textual data quickly and effectively is a pressing problem. Text representation and feature extraction are basic and important tasks in the whole text mining process, and they provide data processing methods and technical support for subsequent text mining tasks in various areas. However, with the development of data science, text mining technologies face higher requirements, especially for text semantic mining. After analyzing the advantages and disadvantages of existing research, we study text representation and feature extraction methods combined with text semantic modeling, approaching feature extraction from two aspects: word ranking and topic discovery. We then apply the proposed methods to policy text data mining. The contributions of this research are summarized as follows.

(1) In order to obtain intuitive, comprehensive and easy-to-understand text representations, we propose a novel text representation method, called the conditional co-occurrence degree matrix, based on word co-occurrence. The method represents a document by a matrix formed from the conditional co-occurrence degree between any two words. The calculation of the conditional co-occurrence degree between two words is based on semantic field theory in linguistics, and it simultaneously considers the size of the semantic structure in which the document is organized and the semantic relevance and conditional dependency of the two words within the same semantic structure. Besides retaining the statistical information of words in the text, this representation also distinguishes clearly between the co-occurrence information of two words, so as to highlight the semantic information that the original text focuses on. The method preserves more semantic and structural information of the original text and is an effective improvement on the existing word co-occurrence representation. Its effectiveness is verified by a series of numerical experiments on several public datasets, in comparison with several other text representation methods.

(2) In order to rank feature words using both textual statistics and structural information, we propose a word ranking method based on the conditional co-occurrence degree word network of a text. Words in a text are organized according to a certain structure so as to convey specific semantic topic information, so the relationships between words in a natural language text form a potential manifold structure. The method takes the term frequency statistics of words as the initial weights and the initial ranking. Following the idea of manifold ranking, we construct the conditional co-occurrence degree word network of the text, which reflects its semantic and structural information, and treat this network as the potential manifold structure. The weights and ranking of the words are then re-evaluated and optimized through similarity learning between words, using graph learning theory. Numerical experiments on both public datasets and a supplementary corpus verify the effectiveness of the proposed feature selection method. In addition, the method broadens the application of graph learning theory in text mining and provides a new strategy for word ranking within a single document.

(3) In order to solve the problems that traditional topic models face, such as loss of semantics, indistinct topic concepts, and the crossover and coverage of topic semantics, we propose a text topic discovery method based on the conditional co-occurrence degree. First, the document is divided into several sub-documents according to its semantic structure and a set of independence determination rules, so that every sub-document describes a single topic. Second, combinatorial feature words with strong semantic relevance are extracted from the sub-documents based on the value of the conditional co-occurrence degree within each sub-document, and new sub-documents are formed by feature expansion and content reconstruction based on these combinatorial feature words. Third, topic modeling on the new sub-document set yields all of the "topic-words" distributions of the set and the "document-topic" distribution of each new sub-document. Finally, the "document-topic" distribution of each original document is obtained by merging the "document-topic" distributions of its new sub-documents according to certain merging strategies. A series of experimental results show that the proposed method can improve the efficiency of topic discovery for a corpus. In addition, the generated combinatorial feature words can effectively avoid the problem of polysemy, and they can also assist semantic induction and topic summarization.

(4) We apply the above three algorithms to feature extraction from policy text and to research on social transformation. We use the government work reports of the State Council from 1954 to 2018 as the policy text corpus and carry out policy text data mining. First, we propose several feature selection methods to identify and extract the common issues, key content, hot topics and emerging content in policy text, and we study the change of social vitality from the perspective of the emerging content. Second, we propose a diachronic document clustering method, which divides the entire time period of the policy into several stages according to the policy content; the resulting division is the same as that of previous studies. Third, we discover the different sequence patterns of features in the policy corpus by combining complex network theory with the results of the time division. Finally, we carry out topic discovery on the policy corpus and study the topic evolution rules over the whole period, taking the time variable into account.

By integrating statistical, semantic and structural information from the text, the conditional co-occurrence degree-based methods enable text representation, word ranking and topic discovery to analyze and process complex textual data, which can effectively improve the quality of text mining and provide new technologies and tools for it. By extracting common issues, key content, hot topics and emerging content from policy text, and by exploring changes in social vitality, time period segmentation, time series patterns of words and topic evolution based on text content, the mining results can be used to improve the efficiency of knowledge acquisition and provide appropriate decision support for policy makers and policy researchers.
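
As an illustration of the word ranking idea in contribution (2), the following is a minimal sketch of manifold ranking over a word co-occurrence network, with term frequencies as the initial scores. The abstract does not give the formula for the conditional co-occurrence degree, so the edge weight used here (each shared sentence contributes 1 divided by the number of distinct words in that sentence) is only an assumed stand-in. The function names, the value alpha = 0.85 and the iteration count are likewise illustrative choices, not taken from the dissertation.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_weights(sentences):
    """Return symmetric edge weights for words that share a sentence.

    ASSUMPTION: a shared sentence with n distinct words adds 1/n to the
    weight of every word pair in it, so smaller units count more. This is
    a placeholder for the conditional co-occurrence degree.
    """
    weights = {}
    for sentence in sentences:
        words = sorted(set(sentence))
        if len(words) < 2:
            continue
        for a, b in combinations(words, 2):
            weights[(a, b)] = weights.get((a, b), 0.0) + 1.0 / len(words)
    return weights


def manifold_rank(sentences, alpha=0.85, iterations=100):
    """Rank words by iterating f <- alpha * S f + (1 - alpha) * y.

    y holds normalized term frequencies (the initial ranking) and
    S = D^(-1/2) W D^(-1/2) is the normalized co-occurrence network.
    """
    tf = Counter(word for sentence in sentences for word in sentence)
    vocab = sorted(tf)
    total = sum(tf.values())
    y = {w: tf[w] / total for w in vocab}

    # Build a symmetric adjacency map from the pairwise weights.
    adj = {w: {} for w in vocab}
    for (a, b), weight in cooccurrence_weights(sentences).items():
        adj[a][b] = weight
        adj[b][a] = weight
    degree = {w: sum(adj[w].values()) for w in vocab}

    f = dict(y)
    for _ in range(iterations):
        updated = {}
        for w in vocab:
            # Scores received from neighbours through the normalized edges.
            spread = sum(
                weight / math.sqrt(degree[w] * degree[v]) * f[v]
                for v, weight in adj[w].items()
            )
            updated[w] = alpha * spread + (1 - alpha) * y[w]
        f = updated

    ranking = sorted(vocab, key=lambda w: (-f[w], w))
    return ranking, f


if __name__ == "__main__":
    toy = [["text", "mining"], ["text", "representation"], ["topic"]]
    ranking, scores = manifold_rank(toy)
    for word in ranking:
        print(word, round(scores[word], 4))
```

In this sketch a word that co-occurs with many others receives score from its neighbours on top of its own term frequency, while an isolated word keeps only the (1 - alpha) share of its initial score. Substituting a different edge weight only requires changing `cooccurrence_weights`.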
Keywords/Search Tags: Text Mining, Text Representation, Word Ranking, Topic Discovery, Policy Text Data Mining