Research And Implementation Of Text Clustering And Classification Based On Subject Search Engine

Posted on:2015-02-12

Degree:Master

Type:Thesis

Country:China

Candidate:M L Wu

GTID:2298330467463032

Subject:Signal and Information Processing

Abstract/Summary:

With the growth of the Internet and the arrival of big data, available information has taken on new characteristics. Its content and forms have diversified, and its volume has increased sharply. Short texts with brief, condensed content, such as titles and microblog posts, have also become common. In this era of information overload, general search engines cannot meet users' requirements for relevance and accuracy in information retrieval, so more and more subject-oriented, specialized vertical search engines have emerged. Whether a search engine is general or targeted, text mining plays an important role in it, especially text classification and clustering, which can route users to the information they need, recognize similar texts, and organize text information in a highly standardized and modular way. A subject search engine that handles short, highly random text well through text clustering and classification can achieve twice the result with half the effort.

In view of the above, the main work and contributions of this thesis are as follows:

First, existing title text classification methods require complex semantic analysis, a complete professional glossary, or an additional training corpus. To address these problems, the thesis proposes a classification algorithm based on unsupervised feature selection with the LDA model. The algorithm overcomes the above problems in title text categorization, achieves better classification results, and is highly practical.

Second, the K-means clustering algorithm depends heavily on its initial values: different randomly selected initial cluster seeds lead to unstable clustering results. To address this, the thesis proposes a K-means clustering algorithm with optimized initial centers, based on the refined feature selection matrix. Experiments on a corpus show that the algorithm converges to more accurate and stable results in fewer iterations.

Third, the thesis designs and implements a tender subject search engine system and applies the above text classification and clustering algorithms to the system's classification module. The system's main function is to collect tender and bid information from the web, starting from high-quality tender website seeds, and then to extract useful information from these pages, such as the bidding time, tender title, tender contacts, and tender body. Finally, it classifies the extracted information according to criteria such as industry or region.
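
The seed sensitivity of K-means mentioned in the second contribution can be illustrated with a minimal sketch. The abstract does not describe how the thesis chooses its initial centers, so the code below uses farthest-first seeding purely as a generic, deterministic stand-in; the function names and the toy two-dimensional data are hypothetical and are not taken from the thesis.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def farthest_first_seeds(points, k):
    """Deterministic seeding (an assumed stand-in, not the thesis's method).

    Start from the point closest to the global mean, then repeatedly add
    the point whose distance to its nearest chosen seed is largest.
    """
    center = mean(points)
    seeds = [min(points, key=lambda p: dist(p, center))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd's algorithm; returns (centers, labels, iterations)."""
    centers = [list(s) for s in seeds]
    labels = None
    for it in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            return centers, labels, it
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return centers, labels, max_iter


# Toy data: three well-separated groups standing in for document vectors.
points = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.3],
    [9.0, 0.2], [9.1, 0.0], [8.8, 0.1],
]
seeds = farthest_first_seeds(points, 3)
centers, labels, iterations = kmeans(points, seeds)
print(labels, iterations)
```

Because the seeds are chosen deterministically and spread across the data, repeated runs give the same partition, whereas randomly sampled seeds can land in the same group and produce a different result from run to run.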

Keywords/Search Tags:

title classification, feature selection, K-means clustering, subject search engine

Related items

1	Clustering Methods And Applications For High-dimensional Data Based On K-harmonic Means
2	Title Classification Research Of Collected Documents Based On Subject Matching
3	Image Classification Using Convolutional Neural Network Based On Feature Selection By Means Of Clustering
4	Preliminary Research On Classification And Clustering Of Chinese Web Page Involved In Intelligent Search
5	The Research And Application Of Clustering Feature Selection Methods
6	A Study Of Subject Web Classification Algorithm Based On Machine Learning
7	Research On Feature Selection Based K-means Algorithm In Text Classification
8	Research On Robust Fuzzy Clustering Algorithm Based On Feature Selection
9	Research On Network Traffic Classification Based On Clustering Analysis
10	Network Flow Classification Study Based On Model Clustering And Feature Selection Strategy