Research Of Key Technologies In Chinese Search Engine

Posted on: 2008-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: H N Jiang
Full Text: PDF
GTID: 2178360215480823
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the search engine has become an indispensable tool for obtaining information from the Internet. How to find useful information quickly and accurately among vast amounts of content remains a problem for Internet users. This thesis focuses on making better use of that information and on giving users a more effective and efficient way to search, both of which are actively discussed topics in the field of search engine technology.

The thesis centers on the key technologies of a Chinese search engine system. It addresses the following points:

I. A new algorithm for new word recognition is put forward that uses webpage hyperlink texts as the corpus. After the parsed hyperlink texts are segmented and word frequencies are counted, the Mutual Information (MI) of each pair of neighboring words is calculated. If the MI value is higher than a defined threshold, the combination of the two neighboring words is considered a new word. Erroneous words are then excluded by automatic and manual methods.

II. An algorithm for eliminating duplicated webpages is presented, based on extracting the key words of each webpage. After the key words of the webpage title (TIKW: key words in title) are extracted, the other key words of the text (TEKW: key words in text) are found by means of window searching. The TEKW are closely relevant to the TIKW. Once all the key words have been found, the repetition rate of the key words of two texts is calculated. If the repetition rate exceeds a defined threshold, the two texts are considered duplicates.

III. A sort algorithm for the search engine is designed. The weights of words are calculated from structure information. The search engine system offers two search modes: AND search and OR search.

IV. Two text classification algorithms (PSOSVM and PSOKNN) are put forward, based on the Particle Swarm Optimization (PSO) method, which has a random yet directed global search ability.
The core problem of SVM text categorization is a high-dimensional constrained optimization problem. PSOSVM searches for the optimal solution using PSO, which reduces computational time and improves the training speed. In PSOKNN, while searching for the K nearest neighbors of a test sample, the particle swarm moves in random jumps, which greatly reduces the classification time. PSOKNN has the same classification performance as the KNN classification algorithm.
Keywords/Search Tags:Search Engine, New Word Recognition, Duplicated Webpage Deletion, Text Classification, Particle Swarm Optimization
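
The new word recognition step in point I of the abstract (score neighboring words by Mutual Information and keep pairs above a threshold) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `candidate_new_words`, the parameters `mi_threshold` and `min_count`, and the use of pointwise mutual information with base-2 logarithms are all assumptions made for illustration. The input is assumed to be hyperlink texts that have already been segmented into word lists.

```python
import math
from collections import Counter


def candidate_new_words(segmented_texts, mi_threshold=3.0, min_count=2):
    """Return {joined_word: MI} for neighboring word pairs with high MI.

    Hypothetical helper: `segmented_texts` is a list of token lists,
    e.g. hyperlink texts after word segmentation. The threshold values
    are illustrative assumptions, not values from the thesis.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in segmented_texts:
        unigrams.update(tokens)
        # Count each pair of neighboring words within one hyperlink text.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    candidates = {}
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to give a reliable estimate
        p_xy = count / total_bigrams
        p_x = unigrams[left] / total_unigrams
        p_y = unigrams[right] / total_unigrams
        # Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )
        mi = math.log2(p_xy / (p_x * p_y))
        if mi >= mi_threshold:
            candidates[left + right] = mi
    return candidates
```

Calling `candidate_new_words(texts, mi_threshold=2.0)` on a list of segmented hyperlink texts returns the joined word pairs whose score clears the threshold. As the abstract notes, such candidates would still need automatic and manual filtering before being accepted as new words.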