
The Research On Graph Structure Representation Method Based Chinese Text Clustering

Posted on: 2010-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Liu
GTID: 2178360275457860
Subject: Systems Engineering
Abstract/Summary:
With the rapid development and spread of information technology, the volume of electronic text keeps growing, and people have moved from an era of information scarcity to one of information abundance. Faced with massive information resources, people can hardly find the information they need quickly and effectively. How to organize and manage document information rationally and effectively has therefore become an important research task in the field of information processing. In recent years, text representation, as the prerequisite for the quality of text mining methods, has attracted growing attention from scholars. Starting from text representation, this research applies graph theory to text mining and puts forward a new graph-structure-based text representation method. Compared with the traditional vector-based representation, a graph structure is better suited to representing the structural information of a text. While retaining the characteristics of the text, it can also describe relationships between terms, such as their positions and strength of association. The research consists mainly of the following parts.

First, a text representation model is proposed. Based on an analysis of traditional text representation models, a graph-structure-based representation model for Chinese text documents is put forward. The text is represented as a graph whose nodes are the selected terms and whose edges are the relationships between them. This stores more semantic and ordering information among terms, as well as the structural information of the text.

Second, a similarity measuring algorithm is introduced. Built on the semantic graph structure model and used for text classification, the algorithm measures the maximum common subgraph (mcs) between each pair of semantic graphs. The mcs similarity measure considers not only the content similarity but also the structural similarity of texts, which makes it more comprehensive. It assumes that the larger the common part of two graphs, the greater their similarity, and therefore uses the characteristics of the mcs to measure the similarity of graphs.

Third, an improved clustering algorithm is proposed. An improved K-means algorithm is used for clustering, and the concept of a median graph is introduced to measure the distance between a single graph and a set of graphs, which adapts the clustering algorithm to graph-structure-based text document clustering.

Finally, the method is verified experimentally. Test data with category tags are clustered, and three indexes, precision, recall and F-Score, are used to evaluate the clustering results.
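The graph representation, the mcs-based similarity and the median-graph K-means step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and everything in it is an assumption made for illustration: texts are taken to be already segmented into term lists, edges link terms that are adjacent in the text, each term labels exactly one node (so the maximum common subgraph reduces to intersecting node and edge sets), graph size counts nodes plus edges, similarity is |mcs| / max(|G1|, |G2|), and the cluster center is the set median (the member graph closest to all others). The function names `build_graph`, `mcs_similarity`, `median_graph` and `kmeans_graphs` are hypothetical, and the loop is plain K-means rather than the improved variant the thesis proposes.

```python
def build_graph(terms):
    """Build a graph from a list of already segmented terms (assumption).

    Nodes are the unique terms; a directed edge (a, b) records that
    term b immediately follows term a somewhere in the text.
    """
    nodes = set(terms)
    edges = {(a, b) for a, b in zip(terms, terms[1:]) if a != b}
    return nodes, edges


def size(graph):
    """Graph size, assumed here to be node count plus edge count."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs(g1, g2):
    """Maximum common subgraph under the unique-node-label assumption.

    With one node per term, the mcs is simply the shared nodes and the
    shared edges of the two graphs.
    """
    return g1[0] & g2[0], g1[1] & g2[1]


def mcs_similarity(g1, g2):
    """Similarity in [0, 1]: size of the mcs over the size of the larger graph."""
    denom = max(size(g1), size(g2))
    if denom == 0:
        return 1.0  # two empty graphs are treated as identical
    return size(mcs(g1, g2)) / denom


def median_graph(graphs):
    """Set median: the member with the smallest total distance to the set."""
    return min(
        graphs,
        key=lambda g: sum(1.0 - mcs_similarity(g, h) for h in graphs),
    )


def kmeans_graphs(graphs, k, max_iterations=10):
    """Plain K-means over graphs, with median graphs as cluster centers.

    Centers are initialised to the first k graphs, which is an
    arbitrary choice made to keep the sketch deterministic.
    """
    centers = list(graphs[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for g in graphs:
            best = max(range(k), key=lambda i: mcs_similarity(g, centers[i]))
            clusters[best].append(g)
        new_centers = [
            median_graph(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return clusters, centers


# Hypothetical pre-segmented documents, two per topic.
docs = [
    ["graph", "text", "clustering"],
    ["graph", "text", "mining"],
    ["stock", "market", "price"],
    ["stock", "market", "crash"],
]
graphs = [build_graph(d) for d in docs]
clusters, centers = kmeans_graphs(graphs, k=2)
for cluster in clusters:
    print([sorted(nodes) for nodes, _ in cluster])
```

Dropping the unique-label assumption would turn the mcs step into a general maximum common subgraph search, which is far more expensive, so this simplification is the main design choice in the sketch.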
Keywords/Search Tags: Graph Structure, Text Representation, Text Similarity Computing, Max-Common Subgraph, Text Clustering