Design And Implementation Of The Character Classification System Used In Search Engine

Posted on: 2012-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: R B Xu
Full Text: PDF
GTID: 2218330362958140
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, Web resource repositories have grown so large that finding information quickly and accurately has become increasingly difficult. Meeting different search demands efficiently therefore requires specialized retrieval methods and search engines that provide personalized services. Based on a study of the structure and workflow of character (person) search engines, this thesis presents a character classification system for use in a search engine. The system performs information extraction and clustering analysis on a set of texts. The thesis describes how the system is implemented, examines its two key technologies, Web information extraction and text clustering, and confirms through system testing that the key technologies are practical and the system is effective.

Web information extraction aims to extract useful information from Web documents automatically. This thesis proposes an information extraction algorithm that extracts, from Web documents about a character, the high-frequency vocabulary and the character's important attributes (birth year, occupation, place, institution, etc.), and describes the design and implementation of the algorithm in detail.

Text clustering is one of the core technologies of text mining. It aims to partition a text set into several clusters so that similarity between clusters is low and similarity among texts within the same cluster is high. This thesis analyzes the key techniques of the clustering process, namely the vector space model, feature weighting, and text similarity, which lay the groundwork for the clustering algorithm that follows.
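The vector space model, feature weighting, and text similarity mentioned above can be illustrated with a minimal sketch. The abstract does not state which weighting or similarity measure the thesis uses; the TF-IDF weighting (with a smoothed IDF) and cosine similarity below are assumptions chosen because they are the common defaults, and the function names `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times a smoothed IDF,
        # so terms shared by every document still keep a nonzero weight.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, made-up token lists standing in for extracted character descriptions.
docs = [
    ["actor", "film", "born"],
    ["actor", "film", "award"],
    ["physicist", "university", "born"],
]
vecs = tfidf_vectors(docs)
sim_actor_actor = cosine_similarity(vecs[0], vecs[1])
sim_actor_physicist = cosine_similarity(vecs[0], vecs[2])
```

In this toy input the two actor descriptions share two terms and score higher than the actor and physicist pair, which share only one. A clustering step would rely on exactly this kind of ordering.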
Analysis of the conventional K-Means clustering algorithm shows that its main disadvantage is that the initial number of clusters must be chosen manually. This thesis therefore presents an adaptive K-Means algorithm that automatically selects the cluster centers and determines the optimal number of clusters K, avoiding the serious effects of choosing the cluster number blindly. To a certain extent, this optimizes the K-Means algorithm.

Finally, the thesis reviews and summarizes the key technologies of the character classification system and outlines further research on optimizing them.
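The idea of choosing K automatically can be sketched as follows. The abstract does not describe the thesis's actual selection rule, so this example substitutes a common stand-in: run K-Means for each candidate K and keep the K with the highest mean silhouette score. The deterministic farthest-first seeding and the names `kmeans`, `silhouette`, and `choose_k` are likewise assumptions made for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    # Assumed seeding: start from the first point, then repeatedly add the
    # point farthest from all centers chosen so far (deterministic).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain K-Means; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette score; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, k_max):
    """Return (k, score, labels) for the candidate K with the best silhouette."""
    best = None
    for k in range(2, k_max + 1):
        labels = kmeans(points, k)
        if len(set(labels)) < 2:
            continue  # silhouette is undefined for a single cluster
        score = silhouette(points, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Made-up 2-D points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_k, best_score, best_labels = choose_k(points, 5)
```

On this toy data the search settles on three clusters. In a text setting the points would be document vectors and the distance would typically be derived from text similarity rather than Euclidean distance.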
Keywords/Search Tags: Named entity, Vector space model, Text similarity, K-Means, Adaptive clustering