Application Researches On Independent Component Analysis Based Semantic Clustering In Information Retrieval

Posted on: 2011-08-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Pu
Full Text: PDF
GTID: 1118360308465854
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of computers and the Internet, the volume of data is growing at an extraordinary pace. Although information resources are abundant, people increasingly depend on external help to locate what they need in this vast sea of information. How to obtain relevant information quickly and accurately has therefore become an important research question. Information retrieval techniques help people in different domains find what they want in very large collections of data, including text, images and sound. Web search engines, which rely on information retrieval techniques, have become the most commonly used tools for finding information on the Internet, and their enormous user base gives them considerable market value and economic benefit. Information retrieval has been actively studied in the literature for many years.
The study in this dissertation combines statistical signal processing techniques with information retrieval techniques. The combination rests on the view that a document can be regarded as a mixture of different topic signals. Independent component analysis (ICA) is used to analyze documents, and the semantic structure of a document is represented by the demixed independent components. On this basis, the dissertation studies a theoretical model of ICA semantic clustering and its applications, including the estimation of relevance models and query models based on ICA semantic clustering, supported by extensive experiments. The methodologies used include probabilistic models, information theory, linear algebra and related statistical methods.
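The view of a document as a mixture of topic signals can be illustrated with a toy two-source example. The following is a minimal sketch, not the dissertation's method: it assumes two hypothetical non-Gaussian "topic" signals, a made-up 2x2 mixing matrix, and a simple kurtosis-based rotation search in place of a full ICA algorithm such as FastICA. All function names (`whiten`, `ica_2d`, `abs_corr`) are illustrative.

```python
import math
import random


def whiten(xs, ys):
    """Center two observed signals, then decorrelate and scale them to unit variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    xs = [x - mx for x in xs]
    ys = [y - my for y in ys]
    cxx = sum(x * x for x in xs) / n
    cyy = sum(y * y for y in ys) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    u = [c * x + s * y for x, y in zip(xs, ys)]
    v = [-s * x + c * y for x, y in zip(xs, ys)]
    su = math.sqrt(sum(a * a for a in u) / n)
    sv = math.sqrt(sum(b * b for b in v) / n)
    return [a / su for a in u], [b / sv for b in v]


def excess_kurtosis(z):
    """Excess kurtosis of a zero-mean signal; zero for a Gaussian."""
    n = len(z)
    m2 = sum(a * a for a in z) / n
    m4 = sum(a ** 4 for a in z) / n
    return m4 / (m2 * m2) - 3.0


def ica_2d(xs, ys, steps=180):
    """Toy 2-source ICA: whiten, then pick the rotation maximizing non-Gaussianity."""
    u, v = whiten(xs, ys)
    best = None
    for k in range(steps):
        phi = (math.pi / 2.0) * k / steps
        c, s = math.cos(phi), math.sin(phi)
        a = [c * p + s * q for p, q in zip(u, v)]
        b = [-s * p + c * q for p, q in zip(u, v)]
        score = abs(excess_kurtosis(a)) + abs(excess_kurtosis(b))
        if best is None or score > best[0]:
            best = (score, a, b)
    return best[1], best[2]


def abs_corr(a, b):
    """Absolute Pearson correlation between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / math.sqrt(va * vb)


random.seed(0)
n = 2000
# Hypothetical independent topic signals (assumption: one uniform, one Laplace-like).
topic_1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
topic_2 = [random.choice((-1.0, 1.0)) * random.expovariate(1.0) for _ in range(n)]
# Hypothetical mixing: each observed signal blends both topics.
obs_1 = [0.7 * s1 + 0.3 * s2 for s1, s2 in zip(topic_1, topic_2)]
obs_2 = [0.4 * s1 + 0.6 * s2 for s1, s2 in zip(topic_1, topic_2)]

rec_a, rec_b = ica_2d(obs_1, obs_2)
match = [[abs_corr(r, s) for s in (topic_1, topic_2)] for r in (rec_a, rec_b)]
print(match)
```

Each recovered component should correlate strongly with exactly one of the original topic signals, up to sign and order, which ICA cannot determine. A real term-document setting has many more dimensions and would use an established ICA implementation.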
The main contributions of this dissertation are as follows:
1. Proposed the concept of activation of ICA semantic clusters, and proved that the semantic topic estimated from the documents in an activated semantic cluster is closer to the true semantic topic than the topic estimated from all feedback documents. Query-guided activation ensures that the documents in the activated cluster are semantically related to the query, which reduces the risk of topic drift in the feedback documents. Because the number of documents in an activated cluster is already determined, query expansion and language model estimation no longer need a parameter for the number of feedback documents, so the expanded query and the estimated model become more robust.
2. Proposed the concept of semantic smoothing, which uses ICA semantic clustering information together with the collection model to smooth a document model when estimating a relevance model or a query model. Semantic keyword cluster models were employed to improve the estimate of the topic model of the feedback documents. Traditional smoothing with the collection model alone assumes that an unseen term has the same probability of occurring in different documents. This unreasonable assumption is remedied by adding the probability that the unseen term occurs in a semantic keyword cluster.
3. Instead of assuming that all documents are uniformly distributed in the collection, each document is given a prior determined by the probability that it belongs to a semantic cluster when a language model is estimated. The contribution of each document to the estimate of the relevance model or the query model can thus be differentiated. A further advantage of estimating a language model with the semantic clustering model is that it extends the universe of language models beyond the single-document models used in traditional model estimation.
4. Established a dynamic semantic mapping between user interests and documents using semantic clusters. Documents and users can be organized into the same group and tightly linked by this mapping, on which an information recommendation system can proactively find new information in the same semantic cluster for users.
5. Based on the principle that term co-occurrence can be discovered in a latent semantic space, latent semantic indexing and independent component analysis were used together so that term co-occurrence offers a solution to the low term-overlap problem in short documents. As a result, the classification accuracy of short documents is improved in the ICA semantic space.
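Contributions 2 and 3 can be made concrete with a small sketch. The abstract does not give the formulas, so everything below is an assumption: a three-way linear interpolation of document, keyword-cluster and collection models stands in for semantic smoothing, and hypothetical cluster-membership probabilities replace the uniform document prior in a standard relevance-model sum, P(w|R) proportional to the sum over documents of P(d) P(Q|d) P(w|d). The toy documents, weights and function names are illustrative only.

```python
from collections import Counter


def ml_model(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_prob(word, doc_model, cluster_model, collection_model,
                  lam_cluster=0.3, lam_coll=0.2):
    """Assumed form of semantic smoothing: interpolate document, cluster, collection."""
    return ((1.0 - lam_cluster - lam_coll) * doc_model.get(word, 0.0)
            + lam_cluster * cluster_model.get(word, 0.0)
            + lam_coll * collection_model.get(word, 0.0))


def relevance_model(query, docs, priors, cluster_model, collection_model):
    """Relevance model with a per-document prior instead of a uniform one."""
    vocab = set(collection_model)
    scores = {w: 0.0 for w in vocab}
    for tokens, prior in zip(docs, priors):
        dm = ml_model(tokens)
        query_likelihood = 1.0
        for q in query:
            query_likelihood *= smoothed_prob(q, dm, cluster_model, collection_model)
        for w in vocab:
            scores[w] += (prior * query_likelihood
                          * smoothed_prob(w, dm, cluster_model, collection_model))
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


# Toy feedback documents (hypothetical).
d1 = "ica separates mixed signals".split()
d2 = "ica clusters documents by topic".split()
d3 = "football match results".split()
docs = [d1, d2, d3]

# Assumption: d1 and d2 fall in the activated semantic cluster, d3 mostly does not.
cluster_model = ml_model(d1 + d2)
collection_model = ml_model(d1 + d2 + d3)
priors = [0.45, 0.45, 0.10]  # hypothetical cluster-membership probabilities

query = ["ica", "topic"]
rm = relevance_model(query, docs, priors, cluster_model, collection_model)

# Two terms unseen in d1: one from the semantic cluster, one from outside it.
d1_model = ml_model(d1)
p_topic_in_d1 = smoothed_prob("topic", d1_model, cluster_model, collection_model)
p_football_in_d1 = smoothed_prob("football", d1_model, cluster_model, collection_model)
print(p_topic_in_d1, p_football_in_d1)
```

With collection-only smoothing, the two unseen terms would receive the same probability in `d1` because they are equally frequent in the collection. The cluster term raises the probability of "topic", which occurs in the same semantic cluster, above that of "football".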
Keywords/Search Tags: semantic clustering, semantic space, independent component analysis, pseudo relevance feedback, query expansion, language model, relevance model, query model