SOM-Based Textual Clustering Model Research

Posted on: 2012-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
GTID: 2248330392958248
Subject: Software engineering
Abstract/Summary:
Thanks to the rapid development of information technology, people today have access to more information of every kind than ever before. Search engines have become an extremely important tool in daily life, because people have to find what they need in this huge amount of resources. Although they are helpful and powerful, today's search engines focus only on content analysis and lack a deep understanding of semantic information. This limits their ability, so they may return a lot of useless data. One possible way to improve performance is to re-analyze the results returned by the search engine in order to uncover their latent topic information.

The Self-Organizing Map (SOM) is a kind of artificial neural network. It can cluster documents accurately and effectively, and it visualizes the results in a low-dimensional form that is more intuitive. SOM can discover the latent patterns in a corpus and represent each of them with a stable neuron in the network. Documents that are close to each other in the original space tend to remain close in the low-dimensional space, so the original topological structure is approximately preserved. However, SOM does not take the semantic information of the documents into consideration. To solve this problem, we introduce Latent Dirichlet Allocation (LDA), a probabilistic topic model. LDA is a typical Bayesian model that gives a detailed description of the procedure by which a corpus is generated. We can use LDA to obtain the topic proportions of each document, where each topic is a distribution over words. These topic proportions then serve as the input of the SOM. When the network converges, we obtain clusters of documents based on their semantic information. At that point, the vector associated with each neuron is a distribution over topics for that particular neuron, and the topic with the highest proportion is its main topic.
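The second stage described above (topic proportions fed into a SOM, then reading each neuron's main topic) can be illustrated with a minimal sketch. It is not the thesis implementation. The LDA step is not shown: make_topic_proportions is a hypothetical stand-in that fabricates per-document topic proportions for illustration only, and the 3x3 grid, learning rate, neighbourhood width and epoch count are all assumed values. Only the Python standard library is used.

```python
import math
import random


def make_topic_proportions():
    """Hypothetical stand-in for LDA output: 9 documents over 3 topics."""
    docs = []
    for main in range(3):
        for minor in (0.05, 0.10, 0.15):
            vec = [minor] * 3
            vec[main] = 1.0 - 2 * minor  # remaining mass goes to the main topic
            docs.append(vec)
    return docs


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def train_som(docs, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM (assumed hyperparameters)."""
    rng = random.Random(seed)
    sigma0 = max(rows, cols) / 2.0
    # Initialise every neuron with a copy of a randomly chosen document.
    weights = {
        (r, c): list(rng.choice(docs)) for r in range(rows) for c in range(cols)
    }
    total = epochs * len(docs)
    step = 0
    for _ in range(epochs):
        order = list(docs)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # linearly decaying rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                # Move the neuron towards x; a convex step keeps w on the simplex.
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def main_topic(weight_vector):
    """Index of the topic with the highest proportion in a neuron's vector."""
    return max(range(len(weight_vector)), key=lambda k: weight_vector[k])


if __name__ == "__main__":
    docs = make_topic_proportions()
    som = train_som(docs)
    for doc in docs:
        pos = best_matching_unit(som, doc)
        print(doc, "->", pos, "main topic", main_topic(som[pos]))
```

Because every update is a convex combination of the neuron's current vector and a document's topic proportions, each neuron's vector stays a valid distribution over topics. This is what allows its largest entry to be read as the neuron's main topic. In practice the input vectors would come from a fitted LDA model instead of the fabricated ones used here.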
Keywords: Self-Organizing Maps, Latent Dirichlet Allocation, Textual Clustering, Topic Model