Research And Implementation Of Integrated Risk Vertical Search Engine

Posted on: 2008-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhou
Full Text: PDF
GTID: 2178360215464861
Subject: Computer application technology
Abstract/Summary:
With the explosive growth and diversification of information on the Internet, general-purpose search engines can no longer provide adequate specialized services, and vertical search engines are gradually attracting as much attention as general search engines. This thesis analyzes current models and algorithms for web information retrieval and discusses a number of key issues in a text-classifier-based vertical search engine for integrated risk information. The engine has three core modules: integrated risk information classification, information crawling, and information indexing. The research includes:
1. A web text classifier model combining a genetic algorithm with a support vector machine (SVM) is proposed. The vector space model is constructed with HTML tag weights, which offset differences in the distribution of text terms. A genetic algorithm with an improved crossover operator is used for feature selection, which lowers the vector dimensionality. The advantages of the support vector machine are analyzed, and an SVM is used for web text classification. The experimental results show the effectiveness of this model.
2. The Fish algorithm, the dynamic search algorithm used by the integrated risk crawler, is improved. The middle part of the Fish algorithm is eliminated and the URL sorting algorithm is updated. The crawler design is based on the Strategy pattern, which improves its extensibility.
3. A single-Chinese-character indexing database is established. Indexing models based on word segmentation and on single characters are analyzed. Because the vocabulary of integrated risk is updated rapidly and the word-segmentation indexing model has drawbacks, the indexing database is built on the single-Chinese-character indexing model and inverted list technology. Clients query information with the method "the first character determines the position, then the whole word is found", which improves query efficiency.
4. An integrated risk vertical search engine is designed and implemented, providing professional risk information query services with excellent extensibility.
The research work is supported by a key national science and technology project of the "11th Five-year" plan, "Key technology research and demonstration of Integrated Risk Guardians" (No. 2006BAD20B02).
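The single-character indexing scheme in item 3 can be illustrated with a minimal sketch. The abstract does not give the thesis's actual data structures, so everything here is an assumption made for illustration: the class name SingleCharIndex, its add and search methods, and the choice to store postings as character -> document -> set of positions are all hypothetical. The sketch only shows the general idea that the first character of the query locates candidate positions, and the remaining characters are then checked at the following offsets.

```python
from collections import defaultdict


class SingleCharIndex:
    """Hypothetical positional inverted index keyed by single characters."""

    def __init__(self):
        # character -> document id -> set of positions of that character
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text):
        """Index every non-whitespace character of text with its position."""
        for pos, ch in enumerate(text):
            if not ch.isspace():
                self.postings[ch][doc_id].add(pos)

    def search(self, word):
        """Return {doc_id: sorted start positions} where word occurs."""
        if not word or word[0] not in self.postings:
            return {}
        hits = {}
        # Step 1: the first character determines the candidate positions.
        for doc_id, starts in self.postings[word[0]].items():
            # Step 2: keep a candidate only if every following character
            # appears at the expected offset in the same document.
            matched = sorted(
                p for p in starts
                if all(
                    p + i in self.postings.get(ch, {}).get(doc_id, ())
                    for i, ch in enumerate(word[1:], start=1)
                )
            )
            if matched:
                hits[doc_id] = matched
        return hits


# Illustrative documents (made up for this example, not from the thesis).
index = SingleCharIndex()
index.add(1, "洪水风险评估")
index.add(2, "风险管理与洪水预警")

print(index.search("风险"))      # {1: [2], 2: [0]}
print(index.search("洪水风险"))  # {1: [0]}
print(index.search("地震"))      # {}
```

Because no word segmentation is needed at indexing time, a newly coined term becomes searchable as soon as a document containing it is indexed, which matches the motivation given above for preferring single-character indexing in a fast-changing vocabulary.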
Keywords/Search Tags: Vertical Search, Integrated Risk, Genetic Algorithm, Support Vector Machine