
Research On Undersampling Algorithm Based On Word2vec And Vector Space Model

Posted on: 2023-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Dong
Full Text: PDF
GTID: 2530307031459064
Subject: Mathematics
Abstract/Summary:
Classification of imbalanced data has long been an important research topic. Generally, there are two lines of research: data-level methods and algorithm-level methods. Imbalanced data in many fields, such as economics, health care, and industry, contain textual attributes. At present, research on imbalanced data with text attributes generally uses simple numeric encoding or one-hot encoding, which cannot reflect the differences between features well and therefore biases the classification results to some degree. To better handle the classification of mixed data sets containing text attributes, this thesis combines Word2vec, a text mining tool, with undersampling of imbalanced data to explore the similarity of mixed data with text attributes. The specific contents are as follows:

1. Clean the data and separate the text data from the vector data in the imbalanced data set. To better reflect the correlation between text data, the text data is converted into vector data using the skip-gram model of Word2vec. The original vector data is normalized and combined with the converted text data.
2. For the combined imbalanced data, cosine similarity, Euclidean distance, Manhattan distance, and Chebyshev distance in the vector space model are used to seek the optimal similarity of the majority samples, and undersampling is carried out to obtain a relatively balanced data set.
3. A logistic regression model is used to analyze the effect of undersampling and compare model accuracy, and an empirical analysis is conducted on different data sets.

The results show that, when the text data is converted into vector data by the skip-gram model of Word2vec, the classification accuracy of the models obtained by undersampling with cosine similarity, Euclidean distance, and Manhattan distance decreases as the category proportion increases, whereas the classification accuracy of the model obtained by undersampling with Chebyshev distance varies little across category proportions. The results also show that this method works well on imbalanced data containing text attributes when the category proportion is low.

Figures: 9; Tables: 39; References: 55.
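
The combination and undersampling steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the similarity scores select majority samples, so the rule used here (keep the majority rows closest to the majority-class centroid) is an assumption, as are min-max normalization and the function names `min_max_normalize`, `dissimilarity`, and `undersample_majority`. The skip-gram text vectors are replaced by a random placeholder array.

```python
import numpy as np


def min_max_normalize(X):
    """Scale each column to [0, 1]; constant columns map to 0 (assumed scheme)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span


def dissimilarity(X, ref, metric):
    """Dissimilarity of each row of X to the vector ref (smaller = more similar)."""
    diff = X - ref
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=1))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=1)
    if metric == "chebyshev":
        return np.abs(diff).max(axis=1)
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity; zero vectors get distance 1.
        num = X @ ref
        den = np.linalg.norm(X, axis=1) * np.linalg.norm(ref)
        return 1.0 - num / np.where(den == 0, 1.0, den)
    raise ValueError(f"unknown metric: {metric}")


def undersample_majority(X, y, majority_label, n_keep, metric="cosine"):
    """Keep the n_keep majority rows closest to the majority centroid.

    The selection rule is an assumption made for illustration only.
    All non-majority rows are kept unchanged.
    """
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    centroid = X[maj_idx].mean(axis=0)
    scores = dissimilarity(X[maj_idx], centroid, metric)
    keep = maj_idx[np.argsort(scores, kind="stable")[:n_keep]]
    selected = np.sort(np.concatenate([keep, min_idx]))
    return X[selected], y[selected]


# Toy data: 9 majority rows (label 0) and 3 minority rows (label 1).
rng = np.random.default_rng(0)
numeric = rng.normal(size=(12, 3))    # original numeric attributes
text_vecs = rng.normal(size=(12, 4))  # placeholder for skip-gram text vectors
X = np.hstack([min_max_normalize(numeric), text_vecs])
y = np.array([0] * 9 + [1] * 3)

X_bal, y_bal = undersample_majority(X, y, majority_label=0, n_keep=3,
                                    metric="chebyshev")
```

In this sketch the balanced arrays `X_bal` and `y_bal` would then be passed to a classifier such as logistic regression, repeating the run for each of the four metrics to compare accuracy as the abstract describes.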
Keywords/Search Tags: imbalanced data, Word2vec model, vector space model, undersampling, logistic regression