Live-streaming e-commerce is an industry that has emerged in China in recent years. It has greatly promoted the development of the national economy, brought great convenience to consumers, and given businesses a more favorable sales platform. However, the industry has grown very rapidly, and in pursuit of large profits the platforms have neglected a shortcoming in data processing: the lack of automation. Data in this field are still processed manually, which greatly reduces processing efficiency. On this basis, this thesis studies the category classification of product titles generated by the live-streaming e-commerce industry. The contents are as follows:

(1) Using web crawler technology, product information was captured from the live-stream channels of the top ten anchors on Taobao and TikTok in 2021, yielding 31,237 records in total. The data set covers four categories of product titles: 10,035 "clothing" items, 8,954 "beauty" items, 6,845 "life" items and 5,403 "food" items. The full data set is randomly divided into three subsets, data sets I, II and III, containing 10,000, 10,000 and 11,237 records respectively, with training/test proportions of 60%/40%, 70%/30% and 80%/20% respectively.

(2) The data are preprocessed with the Jieba tool, that is, denoised and segmented into words. Because of the characteristics of the data, stop-word removal is not needed. TF-IDF and word2vec word vector models are then used separately to extract text features, and the resulting features are fed into the machine learning models and the LSTM-self-attention hybrid model for classification.

(3) Four machine learning models, namely decision tree, random forest, naive Bayes and XGBoost, are used to classify the products. The experiments show that classification accuracy on data set II is higher than on the other two data sets, and that XGBoost performs best, with an accuracy of 90.89%.

(4) An LSTM-self-attention hybrid model is constructed. The model consists of three layers: a self-attention weighting layer, a long short-term memory (LSTM) network classification layer and a softmax normalization layer. The self-attention weighting layer assigns attention weights to the word vectors produced by the word2vec model and passes them to the LSTM classification layer. Finally, the softmax layer normalizes the output to give the final category of the product title. Compared with the machine learning models, the LSTM-self-attention hybrid model classifies better, with an accuracy of 92.09% on data set II.
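
The self-attention weighting and softmax normalization steps described in (4) can be illustrated with a minimal sketch. The abstract does not specify the exact form of attention, so this example assumes plain scaled dot-product self-attention without learned projections; the function names, the toy 3-dimensional word vectors and the four example class scores are all hypothetical, and the LSTM layer between the two steps is omitted.

```python
import math


def softmax(scores):
    """Normalize a list of scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Re-weight each word vector by its similarity to every word in the title.

    Assumption: scaled dot-product attention with queries, keys and values
    all equal to the input vectors (no learned projection matrices).
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    weighted = []
    for query in vectors:
        # Similarity of this word to every word in the title.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        # Attention-weighted average of all word vectors.
        context = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                   for i in range(dim)]
        weighted.append(context)
    return weighted


# Toy stand-ins for word2vec vectors of a three-word product title.
title_vectors = [[0.2, 0.1, 0.0], [0.9, 0.4, 0.3], [0.1, 0.8, 0.5]]
attended = self_attention(title_vectors)  # would be fed to the LSTM layer

# Hypothetical raw scores for clothing, beauty, life and food.
class_probs = softmax([1.2, 0.3, -0.5, 0.1])
predicted = class_probs.index(max(class_probs))
```

In a full model the attended vectors would pass through the LSTM layer, and the scores given to `softmax` would come from that layer's output rather than being fixed by hand.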