Web Text Mining Based On Latent Semantic Indexing

Posted on:2014-12-10

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Wu

GTID:2268330401454743

Subject:Computer application technology

Abstract/Summary:

The arrival of the Internet era has greatly changed everyday life, and the Internet has become an excellent assistant for learning, working, and resting. It also has a negative side: with the explosion of online information, people are overwhelmed. How to extract helpful information from all this textual information has therefore become a difficult problem for researchers.

In recent years, web text mining has become a new research focus in the field of information technology. To preprocess web text, we first analyse the structure of web pages and identify what constitutes noise. We then preprocess the pages with the proposed cleaning algorithm, which is based on HTML tags. In this way, impurities are removed while the main text related to the topic is preserved. In this paper we redefine the term-weighting method to take the weights of HTML tags into account. Finally, the original 'text-feature words' matrix is obtained.

Because Chinese words have distinctive characteristics such as synonymy and polysemy, the Latent Semantic Indexing (LSI) and Probabilistic Latent Semantic Indexing (PLSI) models are introduced. An optimization model of PLSI is obtained through the probabilization of LSI. With this model, a new algorithm for the probabilistic latent semantic analysis of web pages, named WPLSI, is proposed. The new algorithm projects the vector space of the 'text-feature words' matrix into another space, called the Space of the Page's Probabilistic Latent Semantic Vectors (PLSVS).

In the low-dimensional probabilistic semantic space, we can compute the semantic similarity between text vectors. We then cluster all the semantic vectors using the proposed HAK-mediods algorithm in order to reduce the dimension of the semantic features a second time.

Finally, an experimental platform for clustering web text is built on a folksonomy system and on mining users' interests as they browse web pages. The results of the partitioning method, the hierarchical method, and the HAK-mediods algorithm proposed in this paper are compared and analysed. The experimental results demonstrate that the proposed algorithm clusters better than the other two methods. Furthermore, the proposed algorithm helps mine users' interests so as to provide every user with more accurate personalized recommendations.
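
The abstract does not give the details of the tag-based cleaning and weighting step, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes that text inside certain tags is noise and that each term's weight is scaled by the highest-weighted tag enclosing it. The names TAG_WEIGHTS, NOISE_TAGS, VOID_TAGS, TagWeightedTerms, term_weights and cosine, and all numeric weights, are hypothetical. Terms are split with a simple regular expression for English text; Chinese pages would need a word segmenter instead.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: illustrative values, not taken from the thesis.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5}
# Assumed noise containers whose text is discarded during cleaning.
NOISE_TAGS = {"script", "style", "nav", "footer"}
# Tags that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, scaling each term by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, outermost first
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in NOISE_TAGS for t in self.stack):
            return  # text inside a noise container is dropped
        # Use the largest weight among the enclosing tags (default 1.0).
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.weights[term] += weight


def term_weights(html):
    """Return a Counter mapping each term of one page to its weight."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.weights


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


page = (
    "<html><head><title>Latent semantic</title>"
    "<script>var x = 1;</script></head>"
    "<body><p>semantic indexing</p></body></html>"
)
row = term_weights(page)  # one row of a 'text-feature words' matrix
```

Stacking one such row per page would give a matrix of the kind the abstract calls 'text-feature words'. The latent semantic projection and the clustering described above would then operate on that matrix; they are not shown here.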

Keywords/Search Tags:

Latent Semantic Indexing model, an optimization model of Probabilistic Latent Semantic Indexing, interest points mining, Web text clustering, Folksonomy

Related items

1	A Latent Semantic Indexing Differences Model And Its Application
2	Research On Document Clustering Technology Based On Latent Semantic Indexing
3	Research On Text Classification Based On Ontology And Latent Semantic Indexing Algorithm
4	The Research Of Optimization Technology In Latent Semantic Indexing Based On Pseudo Text
5	Research On Text Clustering Algorithm Based On Latent Semantic Indexing
6	Text clustering using latent semantic indexing
7	Research On Text Classification Filtering Technology Based On Latent Semantic Indexing And Support Vector Machine
8	Research And Improvement Of Latent Semantic Indexing Classification Model
9	Research Of Chinese-Text Retrieval Based On Latent Semantic Indexing
10	Text Classification Based On Latent Semantic Indexing