Research And Realization On Correlation Techniques Of Topic Search-Specific Engine

Posted on: 2011-12-15 | Degree: Master | Type: Thesis
Country: China | Candidate: X Q Zhou | Full Text: PDF
GTID: 2178360305488673 | Subject: Computer application technology
Abstract/Summary:
In recent years, search engines have fundamentally changed the way people obtain information and have brought great convenience to daily life, study, and work. Yet as people rely more and more on search engines to find information, the results returned are often less than satisfactory, and search engine development has reached a bottleneck. Serving users quickly and accurately has become the goal driving search engines forward, and the topic-specific search engine is emerging as a new direction of development. This thesis focuses on the techniques underlying topic-specific search engines and on how to improve search quality.

The thesis first introduces the history, classification, and development trends of search engines. It then describes in detail the architecture and workflow of general-purpose search engines and of topic-specific search engines, and points out the deficiencies of general-purpose search engines. Finally, following the key technologies of the topic-specific search engine, it studies in detail the topic search strategy and web page crawling techniques of the topic spider, together with text preprocessing, feature extraction, and classification algorithms in text categorization.

While gathering information, the topic spider must judge how relevant each text is to the topic, which relies on text categorization techniques; these are the focus of the research here. Text categorization comprises web page purification, Chinese word segmentation, feature extraction, and the categorization algorithm. The key research object of this thesis is feature extraction in text categorization.
This thesis proposes a comprehensive feature extraction algorithm for words and phrases, obtained by improving commonly used feature extraction methods and combining them with Chinese grammar rules. The improved algorithm takes the connections between semantic units into account and partially remedies the deficiencies of existing methods; its feasibility is verified by experiment.

Drawing on research into the various areas of text categorization, the thesis designs and implements a complete text categorization system, which uses ICTCLAS, the Chinese lexical analysis and word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. The system first builds a classifier by learning from a training corpus and then classifies the test documents. Experiments on the system with the KNN (K Nearest-Neighbor) algorithm show how different values of K affect the performance of the categorization system, leading to the conclusion that the best K value depends on the specific categorization system. Three further groups of experiments then classify text using IG (information gain), MI (mutual information), and the combined word-and-phrase method proposed in this thesis as the feature extraction method. The experimental data indicate that the proposed algorithm shows an advantage in the word segmentation results and achieves the goal of the study.
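As a rough illustration of two of the building blocks named in the abstract, the sketch below scores terms by information gain and classifies a document with KNN for several values of K. It is not the thesis's system or its word-and-phrase method: the toy corpus, the labels, the function names, and the use of cosine similarity over raw term counts are all assumptions made for this example, and the documents are assumed to be already segmented into tokens (the thesis uses ICTCLAS for that step).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term's presence or absence with respect to the class labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(train_vectors, train_labels, query, k):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, pre-segmented toy corpus (not data from the thesis).
docs = [
    ["match", "team", "goal", "news"],
    ["team", "coach", "match"],
    ["stock", "market", "bank", "news"],
    ["bank", "loan", "market"],
    ["market", "stock", "price"],
]
labels = ["sports", "sports", "finance", "finance", "finance"]
vectors = [Counter(d) for d in docs]
query = Counter(["team", "goal"])

# "team" separates the classes well; "news" appears in both and scores low.
for term in ("team", "news"):
    print(term, round(information_gain(docs, labels, term), 3))

# The predicted label can change with K.
for k in (1, 3, 5):
    print(k, knn_classify(vectors, labels, query, k))
```

On this toy corpus the query is labeled "sports" for K=1 and K=3 but "finance" for K=5, because with K equal to the corpus size the vote simply reflects the majority class. This is one simple way the choice of K can change a KNN classifier's output.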
Keywords/Search Tags:Topic Search-Specific Engine, Web Spider, text categorization, feature extraction, KNN