Research On Feature Extraction And Text Classification Of Online Commodity Based On Word2Vec

Posted on: 2020-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Zhang
GTID: 2428330578461536
Subject: Applied Mathematics
Abstract/Summary:
With the growth of data in the information industry, disseminated content is becoming increasingly fragmented, and information overload has made it harder to acquire knowledge. By automatically identifying the feature information of a text, text classification can quickly extract the core content from a huge amount of text data and improve the efficiency of information retrieval. Automatic text classification is an effective way to handle unstructured data, and it has gradually become an important research field in data mining.

Most text data in e-commerce takes a short form, such as product titles and product reviews. The management of online commodities depends first on their category attributes. When new products are launched, differences in users' domain knowledge cause some products to be placed in the wrong category. This not only disorders the online retail market but also damages the interests of businesses. To maintain order in the online sales market and raise management efficiency, a new approach to short text categorization based on Word2Vec text representation is proposed.

Text with unbalanced categories is common in practice. This thesis trains a skip-gram model on online commodity title text to obtain word vector representations and to construct semantic connections between contexts. To address the shortcomings of traditional feature selection methods on unbalanced samples, an improved information gain algorithm is proposed. The improvement takes into account the distribution factor of the categories and the distribution factor of the features over skewed categories. After its effectiveness is verified by experimental comparison on the Sina news corpus, the improved algorithm is applied to the title corpus to classify online commodities automatically.

The TextRank keyword algorithm is introduced in this thesis to obtain category topics. It constructs a graph model from the co-occurrence matrix of words in online commodity title text. Because the traditional TextRank algorithm does not consider the importance of nodes in the network, a new method named S-TextRank is proposed. The method introduces the Revealed Comparative Advantage measure to quantify the importance of each node, and fuses in the word vector clustering result to obtain the transition probability matrix of the nodes. Keyword weights are then obtained by iterating the algorithm. We apply the new method to online commodity classification, and the experimental results show that S-TextRank improves the performance of feature extraction and classification.
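
The baseline that S-TextRank builds on can be sketched as follows. This is a minimal illustration of the traditional TextRank iteration over a word co-occurrence graph, not the thesis's S-TextRank: it omits the Revealed Comparative Advantage node weighting and the word vector clustering fusion, whose exact formulas the abstract does not give. The function names, the `window` size, the damping factor of 0.85, and the toy titles are all assumptions made for the example.

```python
from collections import defaultdict


def build_cooccurrence_graph(titles, window=3):
    """Build an undirected weighted graph from tokenized titles.

    Words are nodes; an edge weight counts how often two distinct words
    fall inside the same sliding window of `window` tokens (assumed size).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in titles:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    graph[word][other] += 1.0
                    graph[other][word] += 1.0
    return graph


def textrank(graph, damping=0.85, iterations=50, tol=1e-6):
    """Run the plain weighted TextRank update until scores converge."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {node: 1.0 for node in nodes}
    # Total edge weight leaving each node, used to normalize transitions.
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # The graph is undirected, so a node's neighbours are also
            # exactly the nodes that pass score to it.
            incoming = sum(
                graph[nbr][node] / out_weight[nbr] * scores[nbr]
                for nbr in graph[node]
            )
            new_scores[node] = (1.0 - damping) + damping * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy, hypothetical commodity titles (already tokenized).
titles = [
    ["red", "cotton", "dress"],
    ["blue", "cotton", "shirt"],
    ["cotton", "summer", "dress"],
]
scores = textrank(build_cooccurrence_graph(titles))
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this baseline the transition probabilities depend only on co-occurrence counts. An S-TextRank-style variant would presumably replace `graph[nbr][node] / out_weight[nbr]` with a transition matrix that also reflects node importance and cluster membership, but that substitution is left out here because the abstract does not specify it.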
Keywords/Search Tags: word vector, e-commerce, unbalanced corpus, information gain, TextRank, revealed comparative advantage