
Web Text Mining Based On Latent Semantic Indexing

Posted on: 2014-12-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wu
GTID: 2268330401454743
Subject: Computer application technology
Abstract/Summary:
The arrival of the Internet era has greatly changed everyday life, and the Internet has become an excellent assistant for learning, working, and resting. However, the Internet also has negative effects. For example, with the explosion of online information, people are overwhelmed by it. Therefore, how to extract helpful information from all this textual information has become a difficult problem facing researchers.

In recent years, web text mining has become a new research focus in the field of information technology. When preprocessing web text, we should first analyze the structure of the web pages and understand what the noise consists of. We can then preprocess the pages using the proposed cleaning algorithm, which is based on HTML tags. In this way, impurities are removed while the main text related to the topic is preserved. In this paper we redefine the method for computing term weights so that it takes the weights of the HTML tags into account. Finally, the original 'text-feature words' matrix is obtained.

Considering that Chinese words have characteristics such as synonymy and polysemy, the Latent Semantic Indexing (LSI) and Probabilistic Latent Semantic Indexing (PLSI) models are introduced. An optimized PLSI model is then obtained through the probabilization of LSI. Based on this model, a new algorithm for the probabilistic latent semantic analysis of web pages, named WPLSI, is proposed.
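The tag-based cleaning and tag-aware weighting described in this abstract might look roughly like the following sketch. It is not the thesis's actual algorithm: the tag weights, the set of noise tags, the names `TAG_WEIGHTS`, `NOISE_TAGS`, `weighted_terms` and `build_matrix`, and the simple English-only tokenizer are all assumptions made for illustration (Chinese text would need a word segmenter instead).

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag weights and noise tags; the thesis's real values are not given.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "b": 1.2, "strong": 1.2}
NOISE_TAGS = {"script", "style", "nav", "footer"}


class WeightedTextParser(HTMLParser):
    """Accumulates tag-weighted term counts while skipping noise tags."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.weights = defaultdict(float)  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in NOISE_TAGS for t in self.stack):
            return  # text inside a noise container is discarded
        # Use the largest weight among enclosing tags, defaulting to 1.0.
        w = max([TAG_WEIGHTS.get(t, 1.0) for t in self.stack], default=1.0)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.weights[term] += w


def weighted_terms(html):
    """Return a {term: weighted count} mapping for one page."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def build_matrix(pages):
    """Return (vocabulary, rows): one tag-weighted row per page."""
    docs = [weighted_terms(p) for p in pages]
    vocab = sorted({t for d in docs for t in d})
    rows = [[d.get(t, 0.0) for t in vocab] for d in docs]
    return vocab, rows
```

Calling `build_matrix` on a list of HTML strings yields a small dense matrix whose rows are pages and whose columns are feature words, which is one plausible reading of the 'text-feature words' matrix.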
The new algorithm projects the vector space of the 'text-feature words' matrix into another space, called the Space of the Page's Probabilistic Latent Semantic Vectors (PLSVS).

In this low-dimensional probabilistic semantic space, we can compute the semantic similarity between different text vectors. We can then cluster all the semantic vectors using the proposed HAK-mediods algorithm in order to reduce the dimension of the semantic features a second time.

Finally, an experimental platform for clustering web text is built on a folksonomy system and on mining users' interests while they browse web pages. The experimental results of the partitioning method, the hierarchical method, and the HAK-mediods algorithm proposed in this paper are compared and analyzed. The results demonstrate that the proposed algorithm clusters better than the other two methods. Furthermore, the proposed algorithm helps mine users' interests and thus provide every user with much more accurate personalized recommendations.
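To make the similarity and clustering steps concrete, here is a minimal sketch of plain k-medoids over cosine distance. It is a stand-in, not the HAK-mediods algorithm, whose details the abstract does not give. The function names `cosine_distance` and `k_medoids`, the farthest-first initialisation, and the assumption that each page is already represented as a low-dimensional numeric vector are all illustrative choices.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def k_medoids(vectors, k, max_iter=100):
    """Plain alternating k-medoids; returns (medoid indices, cluster labels)."""
    n = len(vectors)
    dist = [[0.0 if i == j else cosine_distance(vectors[i], vectors[j])
             for j in range(n)] for i in range(n)]
    # Deterministic farthest-first initialisation (an illustrative choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    labels = []
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimises total in-cluster distance.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels
```

Because medoids are always actual data points, each cluster can be summarised by one representative page vector, which is one way to read the abstract's remark about reducing the semantic features a second time.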
Keywords/Search Tags: Latent Semantic Indexing model, an optimization model of Probabilistic Latent Semantic Indexing, interest points mining, Web text clustering, Folksonomy