Research on Short Text Classification for Tender Project Names

Posted on: 2018-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: H H Shi  Full Text: PDF
GTID: 2348330542468708  Subject: Computer software and theory
Abstract/Summary:
Short texts such as messages, online product reviews, and Weibo posts are growing explosively, and short text plays an important role in information transmission. Long text has rich semantic features. Short text lacks them, and the resulting sparse feature matrix makes classification and deeper data mining difficult. Topic models are relatively mature in data mining for long text, but short text processing still largely remains within the framework of long text processing. Recent work expands short text with relevant external information, topic models included. This approach does not generalize well once the difficulty of finding a relevant corpus for short text and the dependence on the quality of the related information are considered.

A corpus of bidding project names is a typical Chinese short text data set. In recent years, bidding websites that rely on manual collection and processing have been unable to keep up with an increasingly competitive market, so automated bidding websites are urgently needed. The website associated with this thesis automatically acquires, processes, and analyzes bidding project names. This thesis focuses on the classification problem. Compared with a long text corpus, a short text data set made up of bidding project names from a wide variety of websites is sparse. The details of the experimental processing are given in the thesis.

First, TF-IDF and IG are chosen as feature selection methods and combined with a Bayes classifier, and the classification results are evaluated by F value. Second, this thesis proposes rule-based feature selection methods: using the whole phrase, deleting all words before the first keyword, and weighting all words in the phrase. Weight assignment performs best among the three rules; precision increases, though recall decreases. Finally, this thesis improves LDA: the results of IG and LDA are fused, and both precision and recall improve. The validity of this method is verified in this thesis, and the method can be applied more broadly to the classification of Chinese short text data sets.
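
As an illustration only, and not the thesis's own implementation, the two quantities named above can be sketched with their standard definitions: information gain (IG) of a term with respect to the class labels, and the F value computed from precision and recall. The toy project names, class labels, and function names below are hypothetical, and the balanced F1 form is assumed because the abstract does not say which F value is used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t).

    docs is a list of token sets, one per short text.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def f_measure(true_labels, predicted_labels, positive):
    """Balanced F value (F1) for one class, from precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical, already-tokenized project names and their classes.
docs = [
    {"road", "construction"},
    {"bridge", "construction"},
    {"server", "purchase"},
    {"laptop", "purchase"},
]
labels = ["engineering", "engineering", "goods", "goods"]

# Rank terms by IG; a feature selector would keep the top-ranked ones.
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranked[:2])
```

In this toy data, "construction" and "purchase" each separate the two classes perfectly, so they receive the highest IG, while a term such as "road" that appears in only one name scores lower.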
Keywords/Search Tags: LDA topic model, TF-IDF, IG, naive Bayes, feature selection