Research On Web Text Clustering And Retrieval Technology

Posted on: 2010-01-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X J Meng
Full Text: PDF
GTID: 1118360332457773
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of text-based information grows rapidly every day, and people urgently need effective ways to access it. Text mining tasks try to solve this problem of "information overload". Text is the semantic representation of natural language, so adopting natural language processing (NLP) techniques in the text mining process to handle the semantic features of text can be expected to improve text mining algorithms. This thesis focuses on applications of NLP techniques in text clustering and information retrieval. For text mining tasks in the Web and search engine environment, it proposes a series of NLP-based methods to improve the quality of text clustering algorithms and the relevance of search results to the user's query in Web-based information retrieval systems. The major contents of this thesis comprise the following four parts.

Firstly, this thesis proposes an NLP-based semantic feature reduction method for text clustering. Unlike supervised text categorization, text clustering is an unsupervised data mining method, and few effective feature reduction methods exist for it. The different kinds of features that affect the quality of clustering results are hard to control, and if the dimension of the feature space is too large, the accuracy of clustering results is easily degraded by noise features. This thesis proposes a feature reduction method based on lexical analysis that selects noun-related features, which significantly reduces the dimension of the feature space while preserving most of its discriminative power. However, many nouns are synonymous, that is, different words share the same meaning, which can cause inaccuracy in the document similarity measure.
To solve this problem, this thesis uses a semantic dictionary to map each remaining feature to its upper semantic category, which yields a smaller feature space and at the same time improves the accuracy of clustering results.

To address the deficiencies of the ranked result list returned by a search engine, search results can be clustered, which is a more suitable way to present them. The content of a search result is simple and concise, but short. Similarity measures based on such short texts usually perform poorly because the feature space is sparse. This thesis uses a tolerance rough set to extend the original feature space to its semantically approximate upper feature space based on word co-occurrences. In the new feature space, the latent similarity between documents is strengthened. This thesis also presents a new label-based search results clustering algorithm built on the correlation between words, which transforms the problem of search results clustering into query sense disambiguation. This method generates more descriptive and discriminative labels for each cluster while keeping the documents in each cluster consistent in content. Experiments show that this clustering method helps users find the different senses of their queries in the search results and easily locate the subset of results that matches their information needs.

The VSM (Vector Space Model) is usually adopted as the text representation in text clustering, and it assumes that features are independent. This assumption causes a lot of useful information to be lost when measuring similarity between documents. Compared with single independent features, frequent wordsets that occur in many documents indicate similarity between documents more strongly. This thesis measures the similarity between documents based on contextually constrained closed frequent wordsets, which are a more suitable feature unit for reflecting the latent relations in documents.
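The idea of comparing documents through shared frequent wordsets rather than independent terms can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm from the thesis: wordsets are limited to pairs, the contextual constraint is modelled as a simple token window, the closedness filter is omitted, and the names `windowed_pairs`, `frequent_wordsets`, and `wordset_similarity` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def windowed_pairs(tokens, window):
    """Unordered word pairs whose positions are at most `window` apart.

    The token window is an assumed stand-in for the contextual
    proximity constraint mentioned in the abstract.
    """
    pairs = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs.add(frozenset((word, other)))
    return pairs


def frequent_wordsets(docs, min_support, window):
    """For each document, return the size-2 wordsets it contains that
    occur (within the window) in at least `min_support` documents."""
    per_doc = [windowed_pairs(tokenize(doc), window) for doc in docs]
    support = Counter(pair for pairs in per_doc for pair in pairs)
    frequent = {pair for pair, count in support.items() if count >= min_support}
    return [pairs & frequent for pairs in per_doc]


def wordset_similarity(a, b):
    """Jaccard overlap between two documents' frequent wordsets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


docs = [
    "rough set theory extends the feature space",
    "the feature space of rough set models",
    "search engine ranks web pages",
]
wordsets = frequent_wordsets(docs, min_support=2, window=2)
```

In this toy collection the first two documents share every frequent pair, such as {rough, set} and {feature, space}, while the third shares none, so the wordset-based similarity separates them even though word order differs.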
Frequent itemset mining is a technique adopted from data mining, where it is used for association analysis in structured transaction databases. In this thesis, it is modified for text clustering and constrained by different contextual proximities to make the wordsets more semantically consistent. The experimental results show that the clustering algorithm based on this new document similarity measure produces more accurate clustering results.

Ranking search results by relevance is a very important topic in information retrieval. Unlike traditional text documents, Web pages contain a lot of noise information that strongly affects the relevance of results. In this thesis, Web pages are therefore first purified through page analysis and a content extraction method based on the concept of the content unit, which reduces the impact of noise information at the structural level of Web pages. Most information retrieval systems base their relevance computation on full-text analysis, but Web pages also contain inconsistent, topic-irrelevant content that can degrade the relevance of results. This thesis proposes a summarization-based method for improving relevance, which computes the relevance between the query and a summary of the page instead of its full text. A summary is the core of the full-text document and represents its topic more consistently; it is concise, accurate, and clear. Experiments show that the summarization-based relevance computation leads to more accurate relevance ranking of search results.
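The effect of scoring a query against a summary instead of the full page can be illustrated with a small sketch. The abstract does not say how summaries are produced or which relevance function is used, so both are assumptions here: `lead_summary` is a hypothetical placeholder that keeps the first sentences, and relevance is plain cosine similarity over raw term counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lead_summary(sentences, k=2):
    """Placeholder summarizer (an assumption): keep the first k sentences."""
    return " ".join(sentences[:k])


def relevance(query, text):
    """Relevance of `text` to `query` as cosine over raw term counts."""
    return cosine(Counter(tokenize(query)), Counter(tokenize(text)))


page = [
    "Text clustering groups similar documents.",
    "Clustering quality depends on the text features.",
    "Subscribe to our newsletter for weekly deals.",
    "Copyright notice and site navigation links.",
]
query = "text clustering"
full_text_score = relevance(query, " ".join(page))
summary_score = relevance(query, lead_summary(page, k=2))
```

Because the off-topic sentences only add terms that the query does not contain, they lengthen the full-text vector without adding to the dot product, so `summary_score` comes out higher than `full_text_score` for this page.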
Keywords/Search Tags: natural language processing, text clustering, information retrieval, tolerance rough set, frequent wordset mining
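The tolerance rough set expansion named in the keywords and described in the abstract can be sketched with the commonly used tolerance rough set model, in which two terms are tolerant of each other when they co-occur in at least `theta` documents, and the upper approximation of a document contains every term whose tolerance class overlaps the document. Whether the thesis uses exactly this formulation is an assumption; the function names and the threshold `theta` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class.

    `docs` is a list of term sets. A term's class holds the term itself
    plus every term that co-occurs with it in at least `theta` documents.
    """
    cooccurrence = defaultdict(int)
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1

    classes = {term: {term} for terms in docs for term in terms}
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc_terms, classes):
    """All terms whose tolerance class shares a term with the document."""
    return {term for term, cls in classes.items() if cls & doc_terms}


snippets = [
    {"jaguar", "car"},
    {"jaguar", "car", "speed"},
    {"car", "engine"},
    {"engine", "car"},
    {"jaguar", "cat"},
]
classes = tolerance_classes(snippets, theta=2)
expanded = upper_approximation({"engine"}, classes)
```

Here a snippet containing only "engine" is expanded with "car", because the two terms co-occur in two snippets. Short, sparse snippets therefore gain overlapping features they did not literally contain, which is the strengthening of latent similarity the abstract refers to.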