The Realization Of Chinese And English Clustering Engine Based On The Improved Suffix Tree Algorithm

Posted on: 2009-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: H L Hu
GTID: 2178360242980119
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of information on the Internet, the Web has developed into a huge, dynamic information service network that hosts a wide variety of resources on sites around the world. It gives users a valuable source of information, and search engines play an increasingly important role in reaching it. However, every search engine has its own index database, its own features and usage, and its own intended focus. No single search engine can satisfy everyone's needs, nor can it satisfy all of one person's needs, so people often run the same query on several search engines and then compare and filter the results. Traditional search engines therefore struggle to meet users' requirements for overall retrieval accuracy. How to reduce the user's burden of learning and operating several engines, and how to exploit the combined resources and retrieval capabilities of multiple search engines, has become an important issue limiting network optimization and the development of information retrieval technology. Clustering-search technology offers a solution to this problem. It focuses on processing and integrating query results. A clustering-search engine flexibly selects representative, high-performance independent search engines as its members, which ensures the authority and reliability of the meta-search results. It also draws on the strengths of each independent search engine in its own area to compensate for the limited coverage of any single engine.
A clustering-search engine deduplicates, clusters, and ranks the search results and returns a more intuitive result to the user. Our study of traditional clustering methods found that the clustering algorithms widely used in data mining are not suitable for an online clustering engine. The document clustering methods in common use today are based on document content. Applied to a Web search engine, such a method clusters the full text of the documents linked from the result lists that the member search engines return, and it does so recursively. Although this can sort the search results accurately into multiple categories, it needs a great deal of time and space to complete the full-text retrieval and the recursive clustering. Experiments show that the time required exceeds what a user will tolerate, so the approach cannot be applied to efficient Web search. In meta-search, clustering technology is a way of processing the results returned by the member search engines. In essence it exists for the convenience of users browsing the results, so it serves as a visual presentation of Web retrieval results. Under normal circumstances, a clustering engine clusters only the title and summary of each result, which makes browsing the list of documents returned by the search engines very convenient. We propose a new approach to these problems. On the basis of this analysis, we simplified and improved the feature extraction, the index model, the similarity calculation, and the cluster formation steps of the search-result clustering process. We use an interactive method in place of the recursive algorithm to achieve linear time complexity and improve the efficiency of the search engine. We designed a clustering method that uses phrases as features and handles both Chinese and English text.
Our results show that using the improved suffix tree algorithm for real-time interactive clustering is feasible and highly effective. The improved suffix tree algorithm has the following advantages over the traditional suffix tree algorithm:
1. The string associated with a node is stored in the node itself rather than on the path from the root to that node. A node definition is more flexible than a path definition: when defining the node structure we added the indexes of the documents and terms, so that a node holds all the information needed to build the suffix tree and to perform the clustering operations. This makes a level-by-level traversal of the suffix tree possible.
2. Strings are compared directly with node contents. With this rule, neither building nor traversing the suffix tree requires a full depth-first traversal. When building the tree, we only need to find the insertion position among the nodes of the corresponding layer and do not have to read the contents of the whole tree. When cluster labels are extracted, if a node does not meet the extraction requirements then none of its child nodes can meet them either, so there is no need to traverse and test them, which saves traversal time.
The advantage of interactive clustering is that the user chooses which cluster to refine, instead of the engine clustering recursively in the traditional way. Initially only the first layer of clusters is computed. For example, suppose a class label S covers N documents. A recursive algorithm clusters those N documents again, and repeats until no further clustering is possible, before showing the whole tree to the user. Interactive clustering does not continue clustering the N documents; it only returns the class label S, so the user initially sees a tree whose nodes have no children. Only when the user wants to see the children of S and interacts with it does the engine cluster the N documents again and show the child labels.
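The two ideas above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it uses a simple word-level tree of truncated suffixes, and the names `Node`, `build_tree`, `extract_labels`, `expand`, `max_len`, and `min_docs` are hypothetical choices made for this example. Each node stores its own string and document indexes, label extraction skips the subtree of any node that fails the threshold, and child labels are computed only when a label is opened.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node; it stores its own word rather than an edge path."""
    text: str = ""
    docs: set = field(default_factory=set)        # indexes of documents containing the phrase
    children: dict = field(default_factory=dict)  # next word -> child Node


def build_tree(docs, max_len=3):
    """Insert every word-level suffix (truncated to max_len words) of every document."""
    root = Node()
    for doc_id, words in enumerate(docs):
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                # Only the children of the current layer are compared with the word.
                node = node.children.setdefault(word, Node(text=word))
                node.docs.add(doc_id)
    return root


def extract_labels(root, min_docs=2):
    """Collect phrases shared by at least min_docs documents, pruning failed subtrees."""
    labels = {}
    stack = [(child, (child.text,)) for child in root.children.values()]
    while stack:
        node, phrase = stack.pop()
        if len(node.docs) < min_docs:
            continue  # descendants cover a subset of these documents, so skip them
        labels[" ".join(phrase)] = set(node.docs)
        stack.extend((c, phrase + (c.text,)) for c in node.children.values())
    return labels


def expand(label, labels, docs, min_docs=2):
    """Interactive step: re-cluster only the documents under the label the user opened."""
    subset_ids = sorted(labels[label])
    sub_root = build_tree([docs[i] for i in subset_ids])
    sub_labels = extract_labels(sub_root, min_docs)
    # Map local indexes back to the original ones; drop labels that cover the
    # whole subset, because they do not split it any further.
    return {
        name: {subset_ids[i] for i in members}
        for name, members in sub_labels.items()
        if len(members) < len(subset_ids)
    }
```

In this sketch `extract_labels` is called once per query, and `expand` runs only for the label a user actually opens, so no time is spent clustering subtrees that are never viewed.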
This method skips clustering work whose results the user does not want to see, which shortens the response time and improves query efficiency. Based on the improved suffix tree algorithm and interactive clustering, we implemented the IC-Engine algorithm and designed the IC-Engine framework. In the IC-Engine algorithm, each document is first decomposed into a number of segments by a rough segmentation before the suffix tree is built. (This step does not spend time on Chinese word segmentation. The algorithm does not require that level of accuracy; the rough segmentation only raises the probability that the components of the suffix tree are words.) When building the feature matrix, we use phrases and documents rather than words and documents as the elements. This effectively sidesteps Chinese word segmentation, saves the time it would take, and also improves the readability of Chinese labels. The system distinguishes Chinese characters from English words, so it can handle Chinese and English at the same time. The IC-Engine algorithm still has shortcomings and needs improvement. For example, when generating class labels we did not take further steps to improve the labels' coverage. Latent semantic indexing could be used to cluster documents at the semantic level, which would improve the accuracy and coverage of the classes.
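The rough segmentation and the phrase-document feature matrix described above might look roughly like the following Python sketch. Everything in it is an assumption made for illustration, not the IC-Engine code: the punctuation set in `BREAK`, the use of the basic CJK Unified Ideographs range to recognize Chinese characters, fixed-length phrases of `n` tokens, and the function names `rough_segments`, `join_phrase`, and `phrase_document_matrix`.

```python
import re

# Assumption: Chinese characters (basic CJK block) are emitted one at a time;
# runs of ASCII letters and digits are emitted as whole English words.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")
# Rough segmentation: split on common Chinese and English punctuation only.
BREAK = re.compile(r"[,.;:!?，。；：！？、]+")


def rough_segments(text):
    """Split a snippet into token lists without running a Chinese word segmenter."""
    segments = []
    for chunk in BREAK.split(text):
        tokens = [t.lower() for t in TOKEN.findall(chunk)]
        if tokens:
            segments.append(tokens)
    return segments


def join_phrase(tokens):
    """Join tokens: no space between Chinese characters, a space otherwise."""
    out = tokens[0]
    for prev, cur in zip(tokens, tokens[1:]):
        both_cjk = "\u4e00" <= prev[-1] <= "\u9fff" and "\u4e00" <= cur[0] <= "\u9fff"
        out += cur if both_cjk else " " + cur
    return out


def phrase_document_matrix(snippets, n=2):
    """Map each n-token phrase to the set of snippet indexes that contain it."""
    matrix = {}
    for doc_id, text in enumerate(snippets):
        for tokens in rough_segments(text):
            for i in range(len(tokens) - n + 1):
                matrix.setdefault(join_phrase(tokens[i:i + n]), set()).add(doc_id)
    return matrix
```

Because the rows are phrases rather than segmented words, a Chinese label is a contiguous run of characters taken from the text, which is one plausible reading of why phrase features improve label readability.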
Keywords/Search Tags: Realization