
The Study Of Key Technologies For Chinese Domain-Oriented Search Engine

Posted on: 2007-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: L L Cheng
Full Text: PDF
GTID: 2178360212979988
Subject: Computer application technology
Abstract/Summary:
Domain-specific search engines have become an important branch of information retrieval research and have developed rapidly in recent years. However, several issues still need further study to make them more practical, effective and efficient. This thesis examines three of these issues in detail: crawling policies, text keyword extraction and text classification.

Information crawling is the foundation of a search engine. The thesis first studies crawling policies and strategies, then analyzes several common crawling algorithms in detail, and finally proposes an improved algorithm based on the Shark algorithm.

Keyword extraction is an important step in text pre-processing. Based on the Naïve Bayes theorem, this thesis builds an effective keyword extraction model that uses three features of each candidate word in a text: its traditional weight, the position of its first occurrence and the average deviation of its spacing. Experimental results show that this model achieves higher accuracy than the traditional keyword extraction method based on word weight alone. In addition, to reduce the adverse effect of discretizing the feature values, the thesis re-adjusts the relative importance of the three features by assigning each a different correction factor, which further improves the model's accuracy.

Text classification is an important technique for grouping Web documents to support effective information retrieval in search engines. This thesis improves the traditional Naïve Bayes classification model by taking document length and structure into account when modifying the classifier's formula. In addition, it provides an effective feature selection algorithm that considers several factors, including the frequency, centralization and decentralization of words in a document. Experiments show that, compared with the traditional model, the improved model achieves better precision, recall and F-measure.
Keywords/Search Tags: Search Engine, Crawling, Keyword Extraction, Text Classification, Naïve Bayes Theorem
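
The keyword extraction model summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the class name `KeywordNB`, the helper `discretize`, the number of bins, the cut points and the Laplace smoothing are all assumptions made for illustration. Each candidate word is represented by three discretized features (traditional weight, first-occurrence position, average deviation of spacing), and Bayes' theorem with the usual independence assumption yields the probability that the word is a keyword. The correction factors mentioned in the abstract are not modeled here.

```python
from collections import defaultdict


def discretize(value, cut_points):
    """Map a continuous feature value to a bin index.

    cut_points is a sorted list of thresholds (hypothetical values);
    the result is the number of thresholds the value exceeds.
    """
    return sum(value > c for c in cut_points)


class KeywordNB:
    """Naive Bayes over discretized features (illustrative sketch only)."""

    def __init__(self, n_bins, alpha=1.0):
        self.n_bins = n_bins      # bins per feature (assumed equal for all)
        self.alpha = alpha        # Laplace smoothing constant (assumption)
        self.class_counts = {True: 0, False: 0}
        self.feat_counts = {True: defaultdict(int), False: defaultdict(int)}

    def fit(self, samples):
        """samples: iterable of (bins, is_keyword), bins = tuple of bin indices."""
        for bins, label in samples:
            self.class_counts[label] += 1
            for i, b in enumerate(bins):
                self.feat_counts[label][(i, b)] += 1

    def _joint(self, bins, label):
        # P(class) * product over features of P(feature bin | class)
        total = sum(self.class_counts.values())
        p = self.class_counts[label] / total
        denom = self.class_counts[label] + self.alpha * self.n_bins
        for i, b in enumerate(bins):
            p *= (self.feat_counts[label][(i, b)] + self.alpha) / denom
        return p

    def keyword_probability(self, bins):
        """Posterior probability that a candidate word is a keyword."""
        yes = self._joint(bins, True)
        no = self._joint(bins, False)
        return yes / (yes + no)


# Made-up training data: (weight bin, first-position bin, spacing bin).
training = [
    ((2, 0, 1), True), ((2, 0, 0), True), ((1, 0, 1), True),
    ((0, 2, 0), False), ((0, 1, 0), False),
    ((1, 2, 1), False), ((0, 2, 1), False),
]
model = KeywordNB(n_bins=3)
model.fit(training)
```

With this toy data, a candidate with a high weight bin and an early first occurrence, such as `(2, 0, 1)`, receives a higher keyword probability than a low-weight, late-occurring one such as `(0, 2, 0)`. Candidates in a new document would be ranked by `keyword_probability` and the top few taken as keywords.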