
The Research Of Main Technologies In Chinese Clustering Search Engine

Posted on: 2010-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: P Chen
Full Text: PDF
GTID: 2178360302966039
Subject: Computer system architecture
Abstract/Summary:
The 21st century is the century of knowledge, high technology, information, and the Internet. In this age of information explosion, whoever can extract accurate knowledge from vast amounts of information will be the biggest winner in the Internet industry. The problem is not a lack of data but a lack of strong analysis tools: how to find the information we need quickly and accurately within this mass of information is the problem to be solved. Current mainstream search engines all rank search results on the basis of hyperlink analysis. However, with the development of artificial intelligence, data mining, and neural networks, hyperlink analysis alone no longer represents the state of the art in information science. New search engine algorithms have therefore attracted great attention in the IT industry, and the clustering search engine is one such new type of engine. This thesis discusses two major techniques applied to a Chinese search engine: Chinese word segmentation and clustering.

A search engine is defined as a system that uses a certain strategy and specific computer programs to gather information from the Internet, organizes and processes that information, and provides search services. Given a user's query, it uses a certain algorithm to find matching information in the index data and returns it to the user.

The biggest difference between a clustering search engine and a traditional search engine is that the clustering search engine post-processes the search results of a traditional engine.
Clustering, a data mining technique, is applied to the search results: instead of simply listing result pages in PageRank order as a traditional search engine does, the engine groups them into several categories according to the content of the pages. The user can then decide which category the needed information belongs to and browse only the pages in that category, which improves retrieval efficiency and accuracy.

This thesis studies the parts of a Chinese clustering search engine that differ from a traditional search engine. The two most important are the Chinese word segmentation algorithm and the clustering algorithm that processes the search results.

Chinese word segmentation is a branch of natural language processing. People can judge whether a string is a word from their own knowledge and understanding, but how can a computer do the same? Word segmentation algorithms address this question. Existing Chinese segmentation algorithms fall into three categories: string-matching-based methods, understanding-based methods, and statistical methods.

Chinese and English web pages must be analyzed differently because of a clear difference between the two languages: English words are separated by spaces, whereas Chinese words have no separator. Before a Chinese web page can be analyzed, its sentences must first be cut into a sequence of words. Chinese word segmentation is therefore a necessary prerequisite for clustering, and its results directly affect the performance and accuracy of the clustering algorithm.
In this thesis we are primarily concerned with the speed and accuracy of automatic Chinese word segmentation. Accuracy matters to a search engine, but if segmentation is too slow it is unusable no matter how accurate it is. Because search engines must process hundreds of millions of web pages, they demand online, real-time performance; segmentation that takes too long seriously slows the rate at which the engine can update its content. How to further improve both the accuracy and the speed of Chinese word segmentation is therefore our main research topic.

To that end, we improve on existing Chinese word segmentation technology. On top of traditional string-matching segmentation, we introduce a suffix array to recognize repeated phrases. The suffix array can do the work that statistical segmentation does, but it is simpler to implement, is more efficient and accurate, and avoids the complexity that statistical methods incur when handling ambiguity. Adding a suffix array to Chinese word segmentation thus effectively improves efficiency and accuracy. In addition, the suffix array recognizes not only words but also many phrases. Phrases carry more semantic information than single words, which helps the later clustering stage extract accurate text feature vectors and improves the accuracy of the clustering results.

We also made improvements to the string-matching part itself to further raise the effectiveness and speed of segmentation. First, we improved the storage structure of the segmentation dictionary.
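Before turning to the dictionary, the suffix-array step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the naive sort-based construction, and the `min_len`, `max_len`, and `min_count` parameters are all assumptions made for illustration. The idea it shows is that, once suffixes are sorted, every repeated substring appears as a run of adjacent suffixes sharing the same prefix.

```python
from itertools import groupby


def build_suffix_array(text):
    """Return suffix start positions, sorted by the suffix each one begins.

    Naive construction by sorting slices: fine for a sketch, too slow for
    a large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def repeated_phrases(text, min_len=2, max_len=4, min_count=2):
    """Map each substring of min_len..max_len characters that occurs at
    least min_count times to its number of occurrences.

    The length bounds and the count threshold are hypothetical parameters.
    """
    suffixes = build_suffix_array(text)
    found = {}
    for length in range(min_len, max_len + 1):
        # Suffixes that share a prefix are adjacent in sorted order, so equal
        # prefixes form consecutive runs; the run length is the phrase count.
        prefixes = (text[s:s + length] for s in suffixes
                    if s + length <= len(text))
        for phrase, run in groupby(prefixes):
            count = sum(1 for _ in run)
            if count >= min_count:
                found[phrase] = count
    return found
```

Candidates found this way would still need filtering (for example, dropping fragments that are not meaningful words) before being added to a segmentation vocabulary; the abstract does not say how the thesis does this.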
String-matching methods must look words up in the dictionary very frequently, and because segmentation feeds the later clustering and information retrieval stages, our segmentation algorithm must be as fast as possible. An efficient dictionary organization and word lookup mechanism is therefore crucial to the speed of the system. A Chinese dictionary generally has more than 10,000 entries and occupies about 2 MB of memory. If the whole dictionary were searched for every match, matching would be slow. In this thesis we design a dictionary mechanism based on double-word hashing, which is simple and efficient and contributes to segmentation performance. We also do not rely on a single matching method; instead, we combine forward and reverse matching. Each sentence is scanned twice. The first pass runs from left to right (forward matching) and marks the words that match dictionary entries; the second pass runs from right to left (reverse matching) and marks its matches in the same way. The two segmentation results are then compared, and a set of rules we define decides which result to select.

The main idea of text clustering is to use an algorithm to divide texts into groups according to their similarities and differences, so that texts in the same group are as similar as possible and texts in different groups are as different as possible. The main process is as follows. First, the texts to be clustered are preprocessed: words are segmented, stop words are removed, and term frequencies are counted. Each text is then mapped into a feature vector space, features are extracted to form the feature vector that represents the text, and the similarity between texts is computed.
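As a rough illustration of the segmentation and preprocessing steps described so far, the Python sketch below combines a hashed dictionary, forward and reverse maximum matching, stop-word removal, term counting, and a similarity measure. Several points are assumptions, not details from the thesis: the "double word hash" is read here as bucketing entries by their first two characters; `MAX_WORD_LEN` is an arbitrary bound; the selection rule (fewer words, then fewer single-character words, otherwise the reverse result) is a common heuristic standing in for the thesis's own unspecified rule set; and cosine similarity is just one possible similarity measure.

```python
import math
from collections import Counter, defaultdict

MAX_WORD_LEN = 4  # assumed longest dictionary word, in characters


def build_index(words):
    """Bucket dictionary words by their first two characters (assumed scheme)."""
    index = defaultdict(set)
    for word in words:
        index[word[:2]].add(word)
    return index


def in_dictionary(index, piece):
    # Only the bucket for the first two characters is searched.
    return piece in index.get(piece[:2], ())


def forward_match(sentence, index):
    """Left-to-right maximum matching; unmatched characters stand alone."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or in_dictionary(index, piece):
                words.append(piece)
                i += size
                break
    return words


def backward_match(sentence, index):
    """Right-to-left maximum matching; unmatched characters stand alone."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or in_dictionary(index, piece):
                words.append(piece)
                j -= size
                break
    return words[::-1]


def segment(sentence, index):
    """Run both passes and pick one result with a hypothetical rule."""
    fwd = forward_match(sentence, index)
    bwd = backward_match(sentence, index)

    def cost(ws):
        # Fewer words first, then fewer single-character words.
        return (len(ws), sum(len(w) == 1 for w in ws))

    return fwd if cost(fwd) < cost(bwd) else bwd


def term_vector(sentence, index, stop_words):
    """Segment, drop stop words, and count term frequencies."""
    return Counter(w for w in segment(sentence, index) if w not in stop_words)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

With a dictionary containing 研究, 研究生, 生命, and 起源, the sentence 研究生命的起源 is split as 研究生 / 命 / 的 / 起源 by the forward pass and as 研究 / 生命 / 的 / 起源 by the reverse pass; the rule above selects the reverse result because it leaves fewer single characters.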
Then a model is selected and tested, the clustering quality is evaluated, and finally the results are returned to the user.

In this thesis we use text clustering to process search engine results. Because of the complexity of the web and of its data, and the linear nature of web queries, the clustering algorithm should be semantic, online, and tree-structured.

Based on these requirements, we analyzed the advantages and disadvantages of the major existing clustering methods and ultimately chose an improved K-means algorithm to process the search results. K-means works as follows. First, k objects are selected at random, each representing the initial mean, or center, of a cluster. Each remaining object is assigned to the nearest cluster according to its distance from the cluster centers. The mean (i.e., the centroid) of each cluster is then recalculated. This process is repeated until the criterion function converges. Our improvements to K-means concern the calculation of term weights in documents and the selection of k. For term weights, we improved the classical TF-IDF formula by increasing the weight of terms that appear in important positions; for the selection of k, we used a genetic algorithm, which finds a suitable value of k more quickly and so improves clustering performance.

Experiments show that our improved Chinese word segmentation and clustering algorithms achieve better results than before. We also tested the complete clustering search engine that we built: its clustered search results are not yet perfect, but they can already be divided correctly into several categories.
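The K-means procedure and the position-weighted TF-IDF idea described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's improved algorithm: the abstract does not give the modified TF-IDF formula, so `title_boost` is a hypothetical factor that simply counts title occurrences more heavily; the loop stops when cluster assignments stop changing, used here as a simple stand-in for convergence of the criterion function; and the genetic-algorithm search for k is not shown.

```python
import math
import random


def tfidf(docs, title_boost=2.0):
    """Weight terms by TF-IDF, counting title occurrences more heavily.

    docs is a list of (title_terms, body_terms) pairs of lists.
    title_boost is a hypothetical factor, not the thesis's formula.
    """
    vocab = sorted({t for title, body in docs for t in title + body})
    n = len(docs)
    df = {t: sum(t in title or t in body for title, body in docs)
          for t in vocab}
    vectors = [
        [(body.count(t) + title_boost * title.count(t)) * math.log(n / df[t])
         for t in vocab]
        for title, body in docs
    ]
    return vocab, vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng=None, max_iter=100):
    """Plain K-means: random initial centers, assign, recompute, repeat."""
    rng = rng or random.Random(0)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer move
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

In use, the vectors returned by `tfidf` would be passed to `kmeans` as `points`, with k chosen separately.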
Keywords/Search Tags: Search Engine, Clustering Search, Chinese word segmentation, Text Clustering