
Research On Vector Representation And Classification Of Bidding Project Names

Posted on: 2020-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Feng
GTID: 2428330578483460
Subject: Computer software and theory
Abstract/Summary:
Short text classification is a common requirement of Internet information systems, and both academia and industry have done a great deal of research and practical work on it. This thesis focuses on the classification of bidding project names. Although such short, title-style texts directly reflect the content of a project, they span a wide range of fields and often contain interfering words and polysemous words whose meaning depends heavily on context. As a result, existing short text processing techniques run into problems when applied to bidding project names.

Text representation has a direct impact on classification results. Because the bag-of-words (BOW) representation is sparse, high-dimensional, and lacks semantic information, it is not suitable for representing bidding project names. The mean word2vec model, being built on word2vec, overcomes these shortcomings and expresses the semantic information of a text better. However, because of the interfering words and polysemy in bidding project names, the mean word2vec model still cannot represent them well. This thesis therefore improves on the mean word2vec model according to the characteristics of bidding project names.

First, to address the interfering words, the thesis proposes a TF-IDF weighted word2vec model, which uses the TF-IDF algorithm to raise the weight of keywords and lower the weight of interfering words, making the weight assignment more reasonable. Because TF-IDF gives higher weight to words that are rare in the text collection, the thesis then proposes a TF-IDF-CDW weighted word2vec model that incorporates the category information of feature words. It extends TF-IDF with two indicators that characterize how feature words are distributed across categories, Distribution Degree (DD) and Concentration Degree (CD), thereby overcoming TF-IDF's unreasonable weighting of rare words.

Second, bidding project names contain many words with multiple meanings, which causes a polysemy problem and reduces classification accuracy. To address this, the thesis builds on the TF-IDF-CDW weighted word2vec model and further proposes a TF-IDF-CDW weighted word2vec model spliced (concatenated) with LDA topic vectors. LDA is a principal technique for extracting topic information from text. Splicing the topic vector of a text onto its representation combines the latent topic information with the semantic information, which alleviates the polysemy problem to some extent. However, LDA performs poorly on title-style short texts, because they carry too little document-level word co-occurrence information for LDA to work effectively. The thesis therefore finally uses a high-dimensional lexicon mapping model to solve the polysemy problem. This model greatly enriches the semantic distribution of short texts through a high-dimensional lexicon and handles the problem well.

Finally, the thesis compares the classification performance of these text representation models on a dataset of bidding project names and verifies the effectiveness of the proposed methods.
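
The weighting and splicing ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy project names, the 2-dimensional embeddings, the placeholder topic vector, and the function names `idf_table`, `weighted_mean_vector`, and `splice` are all assumptions made for illustration. The IDF used here is one common smoothed variant, and the category-based DD and CD terms are left out because the abstract does not give their formulas.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1, for each token in tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_mean_vector(tokens, embeddings, idf, dim):
    """TF-IDF weighted average of the word vectors of one tokenized name."""
    tf = Counter(t for t in tokens if t in embeddings)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = (count / len(tokens)) * idf.get(tok, 1.0)
        weight_sum += w
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known tokens: fall back to the zero vector
    return [x / weight_sum for x in total]


def splice(word_part, topic_part):
    """Concatenate the weighted word vector with a topic vector."""
    return list(word_part) + list(topic_part)


# Hypothetical toy corpus: "project" occurs in every name, so it acts as
# an interfering word and receives the lowest IDF.
docs = [
    ["road", "construction", "project"],
    ["hospital", "equipment", "project"],
    ["road", "repair", "project"],
]
# Hypothetical 2-d embeddings standing in for trained word2vec vectors.
embeddings = {
    "road": [1.0, 0.0],
    "construction": [0.0, 1.0],
    "project": [1.0, 1.0],
    "hospital": [0.5, 0.5],
    "equipment": [0.2, 0.8],
    "repair": [0.9, 0.1],
}
idf = idf_table(docs)
vec = weighted_mean_vector(docs[0], embeddings, idf, dim=2)

# Placeholder topic distribution; in practice it would come from a topic model.
topic_vec = [0.7, 0.2, 0.1]
features = splice(vec, topic_vec)
```

In this toy setting a plain mean of the three word vectors would give equal components, while the weighted version leans toward "construction", the rarest and therefore highest-weighted word in the name. The spliced vector simply has the word dimensions followed by the topic dimensions.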
Keywords/Search Tags: word2vec, TF-IDF, LDA, High-dimensional lexicon