Construction Of Hierarchical Semantic Graph And Its Application In Text Mining

Posted on: 2020-11-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T T Zhang
Full Text: PDF
GTID: 1528306725974569
Subject: Management Science and Engineering (Major in Information Management Engineering)
Abstract/Summary:
With the rapid development and growing popularity of computer networks, the amount of text on the internet is growing exponentially. Much of this text is unstructured and may hide valuable knowledge. How to mine unstructured text effectively, so as to discover the useful knowledge and latent patterns behind it, has become a focus of extensive research and application, and text mining emerged against this background. Text mining is the process of exploring and analyzing large amounts of unstructured text data with software that can identify concepts, patterns, topics, keywords and other attributes in the data. It draws on many technologies, including natural language processing, information extraction, information retrieval, statistical analysis and data mining.

Although text mining research has made great progress, existing techniques still face practical problems caused by the contextual complexity of unstructured data. At present, text representation methods in text mining mostly capture word-based semantic and grammatical content, and cannot effectively reveal the deep structural relations between words. In addition, the relationship between words is usually measured with a single type of relation, which cannot represent the multiple, complex correlations between words. These problems limit the effectiveness and accuracy of knowledge acquisition, for example the accuracy of keyword extraction, the effectiveness of topic mining, and the completeness of semantic duplication checking. It is therefore necessary to improve on existing techniques and explore more effective forms of information representation, which can raise the effectiveness and accuracy of knowledge extraction and further promote the development of text mining technology.

This thesis follows the research idea of "text representation-knowledge extraction". First, text representation in text mining is studied in order to construct a hierarchical semantic graph that effectively represents both the content and the structure of a text. Then, the hierarchical semantic graph is applied to knowledge extraction to improve the accuracy of text mining, which is analyzed from three aspects: text keyword extraction, text topic mining and semantic duplication checking.

(1) Construction of the hierarchical semantic graph. To represent the knowledge in a text effectively, this study constructs a text semantic graph based on the hierarchical extraction of feature terms and the multiple relationships between them. First, the content correlation between feature terms is calculated from term co-occurrence. Then, the semantic graph is constructed by hierarchical extraction of feature terms. Finally, the semantic graph is refined using word embedding information. The resulting hierarchical semantic graph not only reveals the hierarchical structure among feature terms, but also gives the extracted feature terms a high degree of conditional dependence. This graph provides a basis for the subsequent experimental research on the application problems in text mining.

(2) Text keyword extraction based on the hierarchical semantic graph. This part extracts the keywords of a single document from the constructed hierarchical semantic graph. First, according to the content and structure relationships in the graph, the important feature terms are selected to generate a candidate keyword network. Then, based on this network, the joint probability of each candidate keyword set is calculated, and the set with the largest joint probability is selected as the extracted keywords. Experimental results show that the proposed method improves the accuracy of keyword extraction. It overcomes the shortcomings of literal matching in traditional keyword extraction algorithms and reveals the inner relevance of keyword sets by mining the deep hidden structure between feature terms.

(3) Multidimensional topic mining based on the hierarchical semantic graph. This part explores the multiple topics of a document corpus based on the constructed semantic graph. First, the semantic graph is partitioned into subgraphs by spectral clustering to generate preliminary topic mining results. Then, based on the structural score of the feature terms, the contribution of each feature term to each subgraph is calculated, which yields multidimensional topic mining. Experimental results show that the proposed method can mine topics from massive document corpora with higher accuracy and recall than existing topic mining techniques. By analyzing the structure of the feature terms in each topic, it can quantitatively describe the many-to-many mapping between topics and feature terms.

(4) Text semantic duplication checking based on the hierarchical semantic graph. This part checks two texts for semantic duplication based on their semantic graphs. First, the maximum common subgraph of the two text semantic graphs is extracted. Then, the similarity between the two texts is calculated with the proposed graph similarity measure. Finally, an appropriate similarity threshold is selected to check and analyze the texts. Experimental results show that the proposed method clearly outperforms existing duplication checking methods. It detects duplication not only from the semantics of feature terms but also from the structural information between them. Comparing hierarchical semantic graphs at different depths yields the similarity of texts at different granularities.

In this thesis, key issues of text mining are studied with the proposed effective and robust algorithms. Traditional text mining mainly focuses on revealing text content. This study, in contrast, reveals the hierarchical relationships between feature terms through deep structural analysis, which enriches the theory and research methods of text mining. Applying the hierarchical semantic graph to practical text mining can improve its validity and accuracy in keyword extraction, topic mining and semantic duplication checking. Admittedly, many problems remain in this research, such as the setting of model parameters, the selection of sample size and the construction of the semantic model, which need further exploration.
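As a rough illustration of two ideas mentioned in the abstract (content correlation from term co-occurrence, and comparing two texts through the overlap of their graphs), the following minimal Python sketch may help. It is not the thesis's method: the function names `cooccurrence_graph` and `graph_similarity`, the sentence-level co-occurrence window, the whitespace tokenization and the Jaccard edge overlap are all assumptions chosen for brevity, and the hierarchical extraction and word-embedding refinement steps are omitted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences, min_count=1):
    """Build an undirected term graph from a list of sentences.

    Edge weight = number of sentences in which both terms appear.
    This is a simple stand-in for 'content correlation' (assumption).
    """
    edges = Counter()
    for sentence in sentences:
        # Naive whitespace tokenization; sorting gives each pair a canonical order.
        terms = sorted(set(sentence.lower().split()))
        for a, b in combinations(terms, 2):
            edges[(a, b)] += 1
    return {pair: weight for pair, weight in edges.items() if weight >= min_count}


def graph_similarity(g1, g2):
    """Jaccard overlap of the two edge sets: shared edges / all edges.

    A crude approximation of a common-subgraph comparison (assumption),
    not the similarity measure proposed in the thesis.
    """
    e1, e2 = set(g1), set(g2)
    if not e1 and not e2:
        return 0.0
    return len(e1 & e2) / len(e1 | e2)


# Hypothetical toy documents, each given as a list of sentences.
doc_a = ["text mining finds patterns", "graph mining finds structure"]
doc_b = ["text mining finds patterns", "topic models group words"]

graph_a = cooccurrence_graph(doc_a)
graph_b = cooccurrence_graph(doc_b)

print(graph_a[("finds", "mining")])          # co-occurrence count of one term pair
print(round(graph_similarity(graph_a, graph_b), 3))
```

A duplication check in this simplified setting would compare the returned score against a chosen threshold; how that threshold is selected is not specified here.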
Keywords/Search Tags: Text mining, Hierarchy, Semantic graph, Text representation, Keyword extraction, Topic mining, Semantic duplication checking