The Study Of Key Technologies For Chinese Domain-Oriented Search Engine

Posted on:2007-09-04

Degree:Master

Type:Thesis

Country:China

Candidate:L L Cheng

GTID:2178360212979988

Subject:Computer application technology

Abstract/Summary:

Domain-specific search engines have become an important research branch of information retrieval and have developed rapidly in recent years. However, several issues still need further study before such engines become more practical, effective and efficient. This thesis examines three of them in detail: crawling policies, text keyword extraction and text classification.

Information crawling is the foundation of a search engine. The thesis first studies crawling policies and strategies, then analyzes several common crawling algorithms in detail, and finally proposes an improved algorithm based on the Shark algorithm.

Keyword extraction is an important step in text pre-processing. Based on the Naïve Bayes theorem, this thesis establishes a keyword extraction model that uses three features of each candidate word in a text: its traditional weight, the position of its first occurrence, and the average deviation of its spacing. Experimental results show that this model achieves higher accuracy than the traditional keyword extraction method based on word weight alone. In addition, to reduce the adverse effect of discretizing the feature values, the thesis readjusts the relative importance of the three features by assigning a different correction factor to each, which further improves the model's accuracy.

Text classification is an important technique for grouping Web documents for effective information retrieval in search engines. This thesis improves the traditional Naïve Bayes classification model by taking document length and structure into account in the classifier's formula. In addition, considering factors such as the frequency, centralization and decentralization of words in a document, it provides an effective feature selection algorithm. Experiments show that, compared with the traditional model, the improved model achieves better precision, recall and F-measure.
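
The three candidate-word features named in the abstract (traditional weight, first occurring position, average deviation of spacing) can be illustrated with a minimal sketch. The abstract does not give their formulas, so every definition below is an assumption made for illustration: relative term frequency stands in for the traditional weight, the first position is normalized by text length, and the spacing deviation is taken as the mean absolute deviation of the gaps between consecutive occurrences. The function name `candidate_features` is hypothetical.

```python
from statistics import mean


def candidate_features(tokens, word):
    """Return (weight, first_position, spacing_deviation) for `word`.

    All three definitions are illustrative assumptions, not the
    formulas used in the thesis.
    """
    # Indices at which the candidate word occurs in the token list.
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        raise ValueError(f"{word!r} does not occur in the text")

    n = len(tokens)

    # Assumed stand-in for the "traditional weight": relative term frequency.
    weight = len(positions) / n

    # Assumed "first occurring position": index of first hit, scaled to [0, 1).
    first_position = positions[0] / n

    # Assumed "average deviation of spacing": mean absolute deviation of the
    # gaps between consecutive occurrences (0.0 when there are no gaps).
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if gaps:
        avg_gap = mean(gaps)
        spacing_deviation = mean([abs(g - avg_gap) for g in gaps])
    else:
        spacing_deviation = 0.0

    return weight, first_position, spacing_deviation


if __name__ == "__main__":
    text = "a b a c c a d a".split()
    print(candidate_features(text, "a"))
    print(candidate_features(text, "d"))
```

In a Naïve Bayes setting such as the one the abstract describes, continuous values like these would typically be discretized into bins before the per-feature likelihoods are estimated. That binning step is the discretization the abstract's correction factors are said to compensate for.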

Keywords/Search Tags:

Search Engine, Crawling, Keyword Extraction, Text Classification, Naïve Bayes Theorem

Related items

1	Classification System Based On The Theme Of Information Acquisition In The Pages
2	Research On The Topical Search Engine Based On Semantic
3	Research On Keyword Extraction Technology Oriented To Conversational Text
4	Web Text Mining Research Based On Subject-oriented Search Engine
5	Research And Implementation Of The Strategy-Extensible Search Engine
6	Research On Web Crawling Technology In Image Search Engine
7	The Theory And Application Research On Intelligent Search Engine
8	The Design And Research Of Topic Web Crawler In Vertical Search Engine
9	Applications Of Hierarchical Keyword Extraction And Automated Text Classification In Bulletin Board System
10	Chinese Keyword Extraction Method Based On Word Span And Its Application In Text Classification