Research On Web Information Retrieval Technology Based On Text Categorization

Posted on: 2009-03-09; Degree: Master; Type: Thesis
Country: China; Candidate: J Wang; Full Text: PDF
GTID: 2178360272963239; Subject: Computer application technology
Abstract/Summary:
With the rapid growth of information on the Internet, information processing has become a necessary tool for reaching useful information. Traditional search engines cannot meet the growing demand for specialized search services. Topic-oriented search engines based on text categorization have emerged in recent years to provide more precise search.

Text is the main carrier of information on the Internet, and text categorization is the key technique behind a topic-oriented search engine. An automatic text categorization system can organize and manage text information effectively, locate information accurately and quickly, and support information extraction.

This thesis studies in detail text classification based on the vector space model. It introduces and analyzes the three key parts of a text classification system: term weighting, term selection, and classifier construction. For term weighting, it discusses the traditional TF-IDF algorithm and presents the introduction of a factor into the term weight. For feature selection, it explains why mutual information yields low classification precision and proposes an improved algorithm.

A new weight adjustment method is proposed based on an analysis of TF-IDF. Because TF-IDF does not weight words properly, the new method introduces a feature evaluation function into the feature weight computation and adjusts each feature's contribution. Experimental results show that the improved algorithms outperform the traditional methods in classification precision.

Finally, the Lucene package and the open-source tools used by the search engine test system are introduced, and the three basic components (Crawler, Indexer, and Searcher) are implemented in Java.
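As a point of reference for the term weighting the abstract discusses, here is a minimal Java sketch of the standard TF-IDF baseline, computed as raw term frequency times ln(N/df). It is not the thesis's adjusted weighting, because the abstract does not specify its feature evaluation function. The class name `TfIdfSketch` and the method `weights` are hypothetical, and documents are assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Baseline TF-IDF weighting (hypothetical helper; not the thesis's adjusted scheme). */
final class TfIdfSketch {

    /** Returns one term-to-weight map per document, with weight = tf * ln(N / df). */
    static List<Map<String, Double>> weights(List<List<String>> docs) {
        // Document frequency: the number of documents that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }

            // Combine term frequency with inverse document frequency.
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            result.add(w);
        }
        return result;
    }
}
```

In this baseline, a term that occurs in every document gets weight zero regardless of how well it separates categories. Shortcomings of this kind are what a category-aware adjustment, such as the one the abstract describes, would aim to address.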
Keywords/Search Tags:Search Engine, Text Categorization, Spider, Lucene