Research And Implementation Of Text Clustering And Classification Based On Subject Search Engine

Posted on: 2015-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: M L Wu
GTID: 2298330467463032
Subject: Signal and Information Processing
Abstract/Summary:
With the growth of the Internet and the arrival of big data, the information available online has taken on new characteristics. Its content and forms are increasingly diverse, and its volume has risen sharply. Short texts with compact, condensed content, such as titles and microblog posts, have also become common. In this era of information overload, general-purpose search engines cannot meet users' demands for relevant and accurate retrieval, so more and more subject-oriented, specialized vertical search engines have emerged. Whether a search engine is comprehensive or targeted, text mining plays an important role in it, especially text classification and clustering, which can route users to the information they need, recognize similar texts, and organize text information in a highly standardized and modular form. A subject search engine that handles short, highly irregular text well through text clustering and classification can achieve twice the result with half the effort.

In view of this situation, the main work and contributions of this thesis are as follows:

First, existing title classification methods require complex semantic analysis, a complete professional glossary, or an additional training corpus. To address these problems, the thesis proposes a classification algorithm based on unsupervised feature selection with the LDA model. The algorithm overcomes the above problems in title text categorization, achieves better classification results, and is highly practical.

Second, the K-means clustering algorithm depends heavily on its initial values: different randomly selected initial seeds lead to unstable clustering results. To address this, the thesis proposes a K-means clustering algorithm with optimized initial centers, based on the refined feature selection matrix. Experiments on a corpus show that the algorithm converges to more accurate and stable results in fewer iterations.

Third, a subject search engine system for tenders is designed and implemented, and the above classification and clustering algorithms are applied in its classification module. The system's main function is to collect tender and bid web pages starting from seed URLs of reputable tender websites, and then to extract useful information from these pages, such as bidding time, tender title, tender contacts, and tender body. Finally, the extracted information is classified according to criteria such as industry or region.
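
The abstract does not say how the optimized initial centers are chosen. Purely as an illustration of why initialization matters in K-means, the sketch below contrasts random seeding with a deterministic farthest-first seeding. The farthest-first rule is a stand-in assumption, not the method proposed in the thesis, and all function names and the toy 2-D data are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def random_init(points, k, seed):
    """Baseline: pick k distinct points at random as initial centers."""
    return random.Random(seed).sample(points, k)


def farthest_first_init(points, k):
    """Assumed stand-in for an optimized initialization (not the thesis's).

    Start from the point closest to the global mean, then repeatedly add
    the point farthest from every center chosen so far. No randomness.
    """
    centroid = mean(points)
    centers = [min(points, key=lambda p: dist(p, centroid))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers, iterations used)."""
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            return labels, centers, iteration
        labels = new_labels
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean(members) if members else centers[j])
        centers = new_centers
    return labels, centers, max_iter


# Toy data: three well-separated groups of 2-D points.
points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
    [20.0, 0.0], [20.0, 1.0], [21.0, 0.0],
]

# Deterministic seeding: the same centers, hence the same result, every run.
labels, centers, iters = kmeans(points, farthest_first_init(points, 3))
print("farthest-first:", labels, "iterations:", iters)

# Random seeding: labels and iteration counts may vary with the seed.
for seed in range(3):
    r_labels, _, r_iters = kmeans(points, random_init(points, 3, seed))
    print("random seed", seed, ":", r_labels, "iterations:", r_iters)
```

Because the deterministic seeding involves no random draw, repeated runs start from identical centers, which removes one source of the instability described above. Whether it also improves accuracy or iteration count depends on the data.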
Keywords/Search Tags: title classification, feature selection, K-means clustering, subject search engine