Research On Vector Representation And Classification Of Bidding Project Names

Posted on:2020-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y Feng

Full Text:PDF

GTID:2428330578483460

Subject:Computer software and theory

Abstract/Summary:

Short text classification is a common requirement of Internet information systems, and a great deal of research and practical work has been done on it in both academia and industry. This thesis focuses on the classification of bidding project names. Although such short, headline-style texts directly reflect the content of a project, they span a wide range of fields and often contain interfering words and polysemous words whose meaning depends heavily on context. As a result, existing short text processing techniques run into problems when applied to bidding project names.

Text representation has a direct impact on classification results. Because it is sparse, high-dimensional, and lacks semantic information, the Bag-of-Words (BOW) model is not suitable for representing bidding project names. The mean word2vec model, which is built on word2vec, overcomes the shortcomings of Bag-of-Words and expresses the semantic information of a text better. However, because of the interfering words and polysemy in bidding project names, the mean word2vec model still cannot represent these names well. This thesis therefore improves on the mean word2vec model according to the characteristics of bidding project names.

First, to address interfering words, the thesis proposes a TF-IDF weighted word2vec model, which uses the TF-IDF algorithm to raise the weight of keywords and lower the weight of interfering words, so that weights are assigned more reasonably. Because the TF-IDF algorithm gives higher weight to words that are rare in the text set, the thesis further proposes a TF-IDF-CDW weighted word2vec model that incorporates the category information of feature words. It adds to TF-IDF two indicators that characterize how a feature word is distributed across categories, Distribution Degree (DD) and Concentration Degree (CD), thereby overcoming the unreasonable weights that TF-IDF assigns to rare words.

Second, bidding project names contain many words with multiple meanings, which causes a polysemy problem and reduces classification accuracy. To address this, the thesis builds on the TF-IDF-CDW weighted word2vec model and proposes a variant that splices (concatenates) LDA topic vectors onto the weighted text vector. LDA is a principal technique for extracting topic information from text. Splicing in the topic vector combines the latent topic information of a text with its semantic information and alleviates the polysemy problem to some extent. However, LDA performs poorly on title-type short texts, because there is too little document-level word co-occurrence information for it to work well. The thesis therefore finally uses a high-dimensional lexicon mapping model to address polysemy. This model greatly enriches the semantic distribution of short texts through a high-dimensional lexicon and handles the problem well.

Finally, the thesis compares the classification performance of these text representation models on a dataset of bidding project names and verifies the effectiveness of the proposed methods.
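The representations named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word vectors, the tokenized project names, the topic vector, and all function names are hypothetical toy values chosen for illustration, and the IDF used is the plain log(N/df) form. The DD and CD indicators of the TF-IDF-CDW model are not shown, because the abstract does not give their formulas.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional vectors standing in for trained word2vec embeddings.
VECTORS = {
    "hospital": [0.9, 0.1, 0.0],
    "equipment": [0.7, 0.3, 0.1],
    "road": [0.0, 0.9, 0.2],
    "repair": [0.1, 0.8, 0.3],
    "project": [0.4, 0.4, 0.4],  # occurs in every name: an "interfering" word
}

# Hypothetical tokenized bidding project names.
CORPUS = [
    ["hospital", "equipment", "project"],
    ["road", "repair", "project"],
    ["hospital", "repair", "project"],
]

DIM = 3


def idf_table(corpus):
    """Plain IDF, log(N / df): a word present in every document gets weight 0."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def mean_vector(tokens, vectors):
    """Mean word2vec: unweighted average of the vectors of known words."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def tfidf_weighted_vector(tokens, vectors, idf):
    """Average of word vectors weighted by TF-IDF, normalized by total weight."""
    term_freq = Counter(tokens)
    weights = {
        t: (count / len(tokens)) * idf.get(t, 0.0)
        for t, count in term_freq.items()
        if t in vectors
    }
    total = sum(weights.values())
    if total == 0:
        # Every word had zero weight; fall back to the plain mean.
        return mean_vector(tokens, vectors)
    return [
        sum(w * vectors[t][i] for t, w in weights.items()) / total
        for i in range(DIM)
    ]


def splice(text_vector, topic_vector):
    """Concatenate a text vector with a topic vector (e.g. from LDA)."""
    return list(text_vector) + list(topic_vector)


idf = idf_table(CORPUS)
name = CORPUS[0]
plain = mean_vector(name, VECTORS)
weighted = tfidf_weighted_vector(name, VECTORS, idf)
# Hypothetical 2-topic distribution; a real one would come from a trained LDA model.
spliced = splice(weighted, [0.8, 0.2])

print("mean:    ", plain)
print("weighted:", weighted)
print("spliced: ", spliced)
```

In this toy corpus "project" occurs in every name, so its IDF is zero and it no longer pulls the weighted vector toward a generic direction, which is the effect the abstract attributes to TF-IDF weighting. The same example hints at the rare-word issue: "equipment" occurs in only one name and dominates the weighted vector.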

Keywords/Search Tags:

word2vec, TF-IDF, LDA, High dimensional lexicon

Related items

1	Research On Sentiment Classification Of Weibo Based On Word2vec And SVM
2	Research On The Construction Method Of Domain Sentiment Lexicon In The Field Of Chinese Social Media Comments: Based On Conditional Random Fields And Ensemble Learning Rules
3	Research On Fake Comment Recognition Method Based On LDA And PW-Word2vec
4	Construction And Application Of A Chinese Emotion Lexicon
5	Research On Text Sentiment Classification Of Hotel Field
6	Research On Emotion Classification Features Based On Keyword Weighting
7	Research On Word2vec Algorithm Based On Context Distance
8	Study Of High-Dimensional Modulation Mapping Technology Based On Subset Selection
9	Research On Bilingual Lexicon Construction Between Chinese And English From Comparable Corpora
10	Expanding the interaction lexicon for three-dimensional graphics