Research On Essential Technologies Of Automatic Classification In Search Engine

Posted on: 2007-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: G Yu
Full Text: PDF
GTID: 2178360212995479
Subject: Computer application technology
Abstract/Summary:
Search engines are an important tool for retrieving information on the web, but existing search engines return so many results that users struggle to find the material they actually want. Improving search engine precision is therefore an urgent problem. Automatic text classification is an important application area of natural language processing and an efficient, increasingly necessary replacement for laborious manual classification. Introducing automatic text classification into a search engine can effectively improve retrieval precision and give users high-quality, highly relevant results.

First, the vector space model and its related techniques are introduced in detail. Based on a thorough analysis of web page characteristics, the TF-IDF feature weighting formula is improved by incorporating the structural features of web pages, so that the weights reflect how differently individual feature words in web text contribute to classification.

Second, feature selection, an essential technology in automatic text classification, is studied. The steps of feature selection and the commonly used feature selection algorithms are analyzed. On this basis, the mutual information algorithm is analyzed in depth and then improved to account for differing category proportions and for negative values, both of which strongly influence feature information; the improvement makes the similarity computed between texts more accurate.

Third, the categorization algorithm, the core of an automatic text categorization system, is studied in depth. The commonly used categorization algorithms are introduced. Then, based on an analysis of the k-nearest neighbor algorithm and its shortcomings, an improved k-nearest neighbor classification algorithm is proposed that merges words making the same contribution to classification and takes into account factors such as the co-occurrence of feature words.

Finally, the techniques above are evaluated experimentally using the 20_Newsgroups test collection and the LIBSVM system, and the experimental results are analyzed. The thesis closes with an outlook on future research.
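
For readers unfamiliar with the two baseline techniques the abstract builds on, the following is a minimal sketch of standard TF-IDF weighting and standard pointwise mutual information between a term and a category. It does not reproduce the thesis's improved formulas, which the abstract does not state. The function names, the toy documents, and the choice of relative term frequency with an unsmoothed logarithmic IDF are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Textbook TF-IDF (illustrative, not the thesis's improved weighting).

    docs: list of non-empty token lists.
    Returns one dict per document mapping term -> tf * idf, where
    tf is relative term frequency and idf = log(N / document frequency).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def mutual_information(docs, labels, term, category):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c))).

    Probabilities are estimated from document counts. The score is negative
    when the term appears in the category less often than chance predicts,
    and -inf when the term never appears in the category (or at all).
    """
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)
    n_c = sum(1 for label in labels if label == category)
    n_tc = sum(1 for d, label in zip(docs, labels) if term in d and label == category)
    if n_tc == 0:
        return float("-inf")
    return math.log(n_tc * n / (n_t * n_c))


# Hypothetical toy corpus: two categories, two documents each.
docs = [
    ["ball", "goal"],
    ["ball", "team"],
    ["stock", "market"],
    ["stock", "bank"],
]
labels = ["sport", "sport", "finance", "finance"]

vectors = tfidf_vectors(docs)
mi_ball_sport = mutual_information(docs, labels, "ball", "sport")
mi_ball_finance = mutual_information(docs, labels, "ball", "finance")
```

In this toy corpus "goal" receives a higher TF-IDF weight than "ball" in the first document because it appears in fewer documents. The non-positive scores that pointwise mutual information can produce are the kind of negative-value case the abstract says the improved algorithm addresses.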
Keywords/Search Tags: Search engine, Automatic text classification, Vector space model, Feature selection, Categorization algorithm