Theoretical Study of a Search Engine Based on the Digital Organism Database

Posted on: 2009-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: L F Ceng
GTID: 2208360245961709
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet and the dramatic growth in demand for useful information, search engine technology has made great progress in the last decade. Most commercial search engines, such as Google and Yahoo, focus only on hypertext and lack broad coverage of other information resources. Because databases play an essential role in storing and accessing information, search engines for databases have become an attractive field of computer science in recent years.

This thesis designs a database search engine based on the Digital Organism Database System, a new generation of distributed database developed by our research laboratory. The Digital Organism Database System is designed to manage the distribution of databases and to dispatch retrieval requests across a wide-area network made up of multiple server nodes. The search engine built on it allows users to retrieve relevant records stored in multiple databases by entering a series of keywords.

Building on techniques that are common in traditional search engines, such as word segmentation, text classification and information compression, this thesis improves several algorithms and engineering methods to raise the performance of the database search engine. It highlights the innovations and improvements we contributed to the theory and engineering of search engines for databases. The major work includes:

1. An improved Chinese word segmentation algorithm for large-scale Chinese information processing, which is the basic phase of building a Chinese search engine. Using a prefix tree and dynamic programming, the algorithm speeds up Chinese word segmentation while maintaining relatively high precision. It also provides a flexible approach to handling out-of-vocabulary words such as person names, place names and organization names.

2. A traditional SVM-based text classifier needs abundant labeled training documents, both positive and negative. To address the lack of negative training data, this thesis proposes an effective approach that integrates the Rocchio method and K-means clustering to obtain adequate negative training data for building the classifier. Experiments show that the new method improves the accuracy of the document classifier.

3. A well-defined software architecture called distributed thread pool technology, which is essential to task dispatching among distributed server nodes.

Finally, rigorous experiments were conducted to verify the performance of the algorithms proposed in this thesis and the functions of the search engine based on the Digital Organism Database System.
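
The abstract does not give the details of the segmentation algorithm, so the following is only a minimal sketch of the general idea of combining a prefix tree with dynamic programming. It assumes a unigram model in which each dictionary word is scored by the logarithm of its relative frequency, and it treats any character not covered by the dictionary as a single-character unknown word with a fixed penalty. The names `TrieNode`, `build_trie`, `segment` and `unknown_log_prob`, as well as the toy word counts, are hypothetical and do not come from the thesis.

```python
import math


class TrieNode:
    """One node of the prefix tree (hypothetical helper, not from the thesis)."""
    __slots__ = ("children", "log_prob")

    def __init__(self):
        self.children = {}
        self.log_prob = None  # set only on nodes that end a dictionary word


def build_trie(word_counts):
    """Build a prefix tree from a {word: count} dictionary (assumed input format)."""
    total = sum(word_counts.values())
    root = TrieNode()
    for word, count in word_counts.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.log_prob = math.log(count / total)
    return root


def segment(text, root, unknown_log_prob=-20.0):
    """Return the highest-scoring segmentation of text under a unigram model.

    best[i] holds the best log-probability of any segmentation of text[:i];
    back[i] holds the start index of the last word in that segmentation.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        # Fallback (assumption): treat text[i] as a one-character unknown word.
        if best[i] + unknown_log_prob > best[i + 1]:
            best[i + 1] = best[i] + unknown_log_prob
            back[i + 1] = i
        # Walk the trie once to find every dictionary word starting at i.
        node = root
        for j in range(i, n):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.log_prob is not None:
                score = best[i] + node.log_prob
                if score > best[j + 1]:
                    best[j + 1] = score
                    back[j + 1] = i
    # Recover the words by following the back pointers from the end.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


# Toy dictionary with made-up counts, for illustration only.
toy_counts = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "的": 10, "起源": 3}
toy_root = build_trie(toy_counts)
print(segment("研究生命的起源", toy_root))  # ['研究', '生命', '的', '起源']
```

In this sketch the prefix tree lets all dictionary words starting at a given position be found in a single walk, and the dynamic program chooses among overlapping candidates such as 研究 and 研究生. A real system would replace the fixed unknown-word penalty with a dedicated model for person, place and organization names.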
Keywords/Search Tags: Digital Organism Database System, search engine, text classification