The Theory And Application Research On Intelligent Search Engine

Posted on: 2004-01-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z P Chen
Full Text: PDF
GTID: 1118360122466972
Subject: Control theory and control engineering
Abstract/Summary:
With the development of the Internet, the amount of web information is increasing exponentially, and how to process this huge volume of information automatically has become a very important research topic. Because the information returned by a traditional search engine varies widely, the results of a user's query may include a mass of irrelevant information, which degrades precision. To improve query precision, the domain-based intelligent search engine, which combines topic-driven search with intelligent techniques, has become a new research trend. It uses machine learning to guide dynamic collection of web information, automatic information extraction, automatic text classification, and so on, so that the efficiency and precision of information processing can be improved. In this dissertation, following research on the theory, algorithms and technology used in intelligent search engines, several new algorithms are presented for the spider, information extraction, text preprocessing, information retrieval, and so on. Furthermore, a simulation platform and a prototype system are set up. Theoretical analysis and experimental results show improved performance.

How to get information from the Internet is the primary problem to be solved in an intelligent search engine. Based on a survey of current intelligent search engines and combined with reinforcement learning techniques, the distribution characteristics of similar web pages are used to present a heuristic search algorithm based on simulated annealing. According to the relevance between web pages and the topic, this algorithm first divides web pages into two types, the topic-relevant web page cluster and the transitional web page cluster, which are determined by the simulated annealing algorithm. When searching topic-relevant web pages, immediate reward is used as the evaluation criterion for exploitation, to speed up the mining of information. When searching transitional web pages, future reward is used as the evaluation criterion for exploration, to speed up locating relevant pages. Experiments on four university web sites show higher search efficiency.

How to efficiently extract relevant information, such as the title, author's name, abstract and references, from web pages for querying is one of the main tasks of an intelligent search engine. Recent research has demonstrated the strong performance of hidden Markov models applied to information extraction. However, information extraction based on a hidden Markov model generally takes a token as the basic extraction unit, and information about format and list separators is not taken into account. Based on the natural structure of text, a block-based hidden Markov model is proposed. Experiments show that the new algorithm performs better than the original one.

Text information is usually represented using the vector space model, so the text should be preprocessed to reduce the number of words. Following a study of two common methods, feature filtering and feature selection, a new algorithm based on minimum class difference is proposed. By observing the distribution and contribution of each feature in each class, features can be divided into three types: single-class features, multi-class features and general features. According to their distribution across classes, general features that differ little among classes are filtered out. The experimental results show that the precision of text classification is improved by efficiently filtering out a large number of irrelevant or weak features.

Information retrieval is the query interface of an intelligent search engine. Using the characteristics of how Web information is represented, an N-level vector space model is proposed. It partitions a document into N text paragraphs according to their position, and then calculates their similarities separately. Both theoretical analysis and experimental results show that the new model achieves higher recall and precision than the traditional vector space model.

Due to the huge volume of Web information and limited storage, it is impossibl...
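
As an illustration of the N-level vector space model described in the abstract, the following is a minimal sketch, not the dissertation's implementation. It assumes each document has already been split into N position-based segments (for example title and body), scores each segment against the query with cosine similarity, and combines the per-level scores with a weighted sum. The function names, the whitespace tokenizer, and the weights are all hypothetical; the abstract does not say how the per-level similarities are combined.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def n_level_similarity(query, levels, weights):
    """Weighted sum of per-level cosine similarities.

    levels:  N text segments of one document, ordered by position.
    weights: hypothetical per-level weights (an assumption; the abstract
             does not specify the combination rule).
    """
    if len(levels) != len(weights):
        raise ValueError("need exactly one weight per level")
    q = term_vector(query)
    return sum(w * cosine(q, term_vector(seg))
               for seg, w in zip(levels, weights))


# Two-level example: level 1 is a title, level 2 is body text.
doc = ["intelligent search engine", "this page lists campus news and events"]
score = n_level_similarity("search engine", doc, weights=[0.7, 0.3])
```

In this example only the title level shares terms with the query, so the score comes entirely from the first level scaled by its weight. A flat vector space model would instead dilute that match across the whole document.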
Keywords/Search Tags: Search engine, text classification, information extraction, reinforcement learning, Hidden Markov Model, Naive Bayesian