Research And Implementation Of Online Classifying And Clustering Systems Of Web Search Results

Posted on: 2008-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: L Wang  Full Text: PDF
GTID: 2178360212497016  Subject: Computer software and theory
Abstract/Summary:
Text categorization is both a research field where artificial intelligence meets information retrieval and a core technology for content-based automatic information management. Networks are developing rapidly, and a vast amount of information now flows online. How can useful information be selected from this flow and useless information discarded? One answer is automatic text categorization. Text categorization uses a computer to assign a text to one of a given set of candidate categories. Text clustering, by contrast, is unsupervised learning: the samples carry no category labels, and the goal is to partition a collection of texts into clusters so that texts in the same cluster are as similar as possible and texts in different clusters are as dissimilar as possible. Clustering is therefore also called unsupervised classification. Text categorization is widely applied in spam filtering, network information analysis, and real-time news classification. With the development of agricultural informatization in China, many agricultural websites have appeared, but there is still no effective agricultural search engine. At the same time, general search engines do not classify their search results, so users have to click through page after page to find the pages they want. Moreover, no single search engine covers all web pages; even the largest indexes at most about a third of them.
To help users find agricultural and other information quickly, this thesis implements an online system for classifying and clustering web search results, which is another everyday application of text categorization.
The thesis first introduces the general process of text categorization and text clustering, then describes text preprocessing, feature extraction, and several classification methods, then discusses the string processing techniques used in the clustering system, and finally presents the core of the thesis: the design and implementation of the online classifying and clustering systems for web search results.
Text categorization proceeds as follows. First, sample texts are collected and preprocessed, which includes word segmentation for English and Chinese, representation of each text in the vector space model, and feature extraction. Next, models are trained on the samples. Finally, the models are used to classify unlabeled samples. There are many classification methods, for example vector space distance, KNN, Naïve Bayes, decision trees, neural networks, and support vector machines (SVM). Naïve Bayes is a simple and effective algorithm based on the assumption that features are independent of one another given the class. SVM is a statistical learning method based on structural risk minimization; because its solution is a global optimum, it has clear advantages over other classification methods. SVM also has disadvantages, such as the need to select a kernel function and slow training.
The online classifying system for web search results consists of two processes: training and categorization.
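The two processes, training and categorization, can be sketched with the Naïve Bayes method mentioned above. This is a minimal illustration, not the thesis's implementation: the function names (`tokenize`, `train`, `classify`), the whitespace tokenizer, the add-one smoothing, and the toy samples are all assumptions made for the example. Chinese text would first need a word segmenter, and the thesis's feature extraction step is omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text needs a segmenter first.
    return text.lower().split()


def train(samples):
    """Training phase. samples is a list of (text, label) pairs."""
    doc_counts = Counter()              # number of training texts per label
    word_counts = defaultdict(Counter)  # per-label token frequencies
    vocab = set()
    for text, label in samples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def classify(model, text):
    """Categorization phase: return the label with the highest log posterior."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok in model["vocab"]:  # tokens unseen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only.
samples = [
    ("rice planting season sowing", "planting"),
    ("wheat planting field sowing seeds", "planting"),
    ("pig breeding feed", "breeding"),
    ("cattle breeding livestock feed", "breeding"),
]
model = train(samples)
print(classify(model, "sowing rice seeds"))
print(classify(model, "pig feed"))
```

In an online system the model would be trained once offline and only `classify` would run per query, which is why the two phases are kept as separate functions here.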
To improve classification performance, the thesis proposes a new method, the Vote Support Vector Machine (VSVM), which combines text features with semantic features by voting over the results of two SVMs, one trained on text features and the other on semantic features. To compare VSVM with other categorization methods, the thesis tests four methods on the same agricultural dataset: Naïve Bayes with class-discriminating words as features, SVM with class-discriminating words as features, SVM with semantic features, and VSVM. VSVM achieved the best precision, recall, and F value. The thesis applies VSVM to the classification of web search results. The user enters search keywords; the system retrieves the results from the search engine, classifies them into the categories scientific research units, experts, utilities, experience in getting rich, agricultural news, medical encyclopedia, breeding technologies, planting technologies, prevention and control of plant diseases and insect pests, and supply and demand information, and displays the categories as a tree in the browser. In this way the thesis both applies its experimental results to a real project and makes searching more convenient: users can quickly locate the pages they want without repeatedly clicking through to the next page.
The online clustering system for web search results uses a novel method. It first uses suffix arrays and the longest common prefix to extract, from the search result snippets, complete strings that may serve as class labels. It then assigns web pages to the corresponding categories according to the semantic association between snippets and classes, which completes the classification of the results.
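The label extraction step can be illustrated with a word-level suffix array and its longest-common-prefix (LCP) array. The sketch below is a simplified stand-in rather than the thesis's algorithm: it reports every phrase shared by two adjacent suffixes and does not check whether a string is "complete". The function names, the `min_words` threshold, the naive sorting-based construction, and the sample snippets are assumptions made for the example.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Adequate for a handful of snippets; not efficient for large inputs.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_array(tokens, sa):
    # lcp[k] = number of leading tokens shared by suffixes sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def candidate_labels(snippets, min_words=2):
    """Return phrases of at least min_words words that occur more than once."""
    tokens = []
    for idx, snippet in enumerate(snippets):
        tokens.extend(snippet.lower().split())
        # A unique sentinel per snippet keeps phrases from spanning snippets.
        tokens.append(f"\x00{idx}")
    sa = build_suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    phrases = set()
    for k in range(1, len(sa)):
        if lcp[k] >= min_words:
            start = sa[k]
            phrases.add(" ".join(tokens[start:start + lcp[k]]))
    return sorted(phrases)


# Toy snippets, invented for illustration only.
snippets = [
    "rice planting technology guide",
    "new rice planting technology",
    "pest control for rice",
]
print(candidate_labels(snippets))
# → ['planting technology', 'rice planting technology']
```

Because suffixes that share a prefix sit next to each other in sorted order, repeated phrases show up as runs of non-zero LCP values. A completeness filter, omitted here, would be the natural place to discard sub-phrases such as "planting technology" that only ever occur inside a longer repeated phrase.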
Traditional clustering methods usually cluster the texts first, then split and merge clusters to form categories, and only at the end extract class labels; the resulting labels are often inaccurate and ambiguous. In contrast, the new clustering method produces less ambiguous class labels, which are pivotal for information selection. The thesis applies design patterns in implementing the clustering system, which makes the system easier and more flexible to restructure.
Some systems for classifying web search results already exist, but they are oriented to English, and existing agricultural search engines cover only a small number of agricultural web pages. The classifying and clustering systems described in this thesis serve both Chinese- and English-speaking users. They collect results from multiple search engines and either classify them into agricultural categories or cluster them, which enlarges the result set, helps users find the pages they want quickly, and makes searching for agricultural information on the internet more convenient. The online classifying system for web search results is now in trial use as part of the Jilin Agriculture Information Project.
Keywords/Search Tags:Implementation