Design And Implementation Of The Character Classification System Used In Search Engine

Posted on:2012-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:R B Xu

Full Text:PDF

GTID:2218330362958140

Subject:Software engineering

Abstract/Summary:

With the rapid development of Internet technology and the steady growth of large Web resource repositories, finding information quickly and accurately has become increasingly difficult. Meeting different search demands efficiently therefore requires specialized retrieval methods and search engines that provide personalized services.

Based on a study of the structure and working process of character search engines, this thesis presents a character classification system for use in a search engine. The system performs information extraction and clustering analysis on a text collection. The thesis describes how the system is implemented, studies its two key technologies, Web information extraction and text clustering, and uses system tests to confirm the practicality of these technologies and the effectiveness of the system.

Web information extraction aims to automatically extract the useful information in Web documents. The thesis proposes an information extraction algorithm that extracts high-frequency vocabulary from Web documents about characters (people), together with important attributes of those characters (birth year, occupation, place, institution, etc.), and describes the design and implementation of the algorithm in detail.

Text clustering is one of the core technologies of text mining. It aims to partition a text collection into several clusters so that similarity between clusters is low and similarity among texts within the same cluster is high. The thesis analyzes the key techniques of the clustering process, namely the vector space model, feature weighting and text similarity, which lay the groundwork for the subsequent clustering algorithm.
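
The abstract names the vector space model, feature weights and text similarity without giving formulas. The following is a minimal sketch of one common way to combine them, assuming TF-IDF weighting and cosine similarity; the function names and the toy documents are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = relative term frequency * log inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, already-tokenized documents (assumed data for illustration only).
docs = [
    "physicist born 1879 professor university".split(),
    "physicist professor university lecture".split(),
    "singer album concert tour".split(),
]
vecs = tf_idf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

In this sketch the two documents that share vocabulary score higher than the pair with no terms in common, which is the property a clustering step relies on.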
Analysis of the conventional K-Means clustering algorithm shows that its main disadvantage is that the initial number of clusters must be chosen manually. The thesis therefore presents an adaptive K-Means algorithm that automatically selects the cluster centers and determines the optimal cluster number K, avoiding the serious impact of choosing the cluster number blindly. To a certain extent, this optimizes the K-Means algorithm.

Finally, the thesis reviews and summarizes the key technologies of the character classification system and outlines further research on optimizing them.
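
The abstract does not say which criterion the adaptive algorithm uses to determine K or how it selects the centers. As an illustration only, the sketch below swaps in two standard techniques: deterministic farthest-first seeding for the initial centers, and the mean silhouette coefficient for choosing K. All names and the sample points are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k):
    # Assumed seeding rule: start from the first point, then repeatedly
    # add the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def k_means(points, k, max_iter=100):
    """Plain K-Means; returns the cluster label of each point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, k_max):
    """Try K = 2..k_max and keep the clustering with the highest silhouette."""
    best_k, best_labels, best_score = None, None, float("-inf")
    for k in range(2, k_max + 1):
        labels = k_means(points, k)
        if len(set(labels)) < 2:
            continue  # silhouette is undefined for a single cluster
        score = silhouette(points, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


# Three well-separated toy groups (assumed data for illustration only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_k, best_labels = choose_k(points, 5)
```

On these three well-separated groups the sketch selects K = 3. On real text data the points would be document vectors and the distance would typically be derived from text similarity.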

Keywords/Search Tags:

Named entity, Vector space model, Text similarity, K-Means, Adaptive clustering

Related items

1	Study On Similarity-based Text Clustering Algorithm And It's Application
2	Research Of Automatic Summarization Based On Named Entity
3	Entity Linking Model Base On Integrated Training
4	Text Similarity Computing Theory And Applied Research
5	Research On English Text Clustering Method Based On Vector Space
6	Text Classification Based On Word Vector And Topic Vector
7	Study Of Chinese Text Clustering On Improved K-means Algorithm
8	The Research Of Clustring Analysis's Application In Web Text Mining
9	Research And Implementation Of Chinese Text Clustering Algorithms
10	Research And Implementation Of Named Entity Retrieval Based On Ontology