
Research On Chinese Short Text Classification Based On Improved FastText

Posted on: 2019-11-23  Degree: Master  Type: Thesis
Country: China  Candidate: B H Qu  Full Text: PDF
GTID: 2428330545954763  Subject: Computer application technology
Abstract/Summary:
With the rise of Web 2.0 and the rapid spread of mobile smart devices, Chinese short text data such as microblogs and online comments is growing on a large scale. Accurately classifying this data is valuable for research on user interests and habits. Traditional Chinese short text classification methods select sentiment words as features and build classifiers on bag-of-words models. These methods ignore the relationships between terms and their semantic information, and classifiers built on the vector space model are poorly suited to short texts, which are sparse and non-standard. To compensate for sparseness, it is common practice to supplement short texts with an external knowledge base. However, because external knowledge base corpora are large and cover a wide range of topics, the computational overhead is high and the performance of the algorithm suffers. To address these problems, this thesis proposes a Chinese short text classification method based on an improved FastText, which combines TF-IDF with an LDA topic model that uses a time window.

Existing classification methods based on semantic rules have high classification cost and poor scalability, methods based on machine learning still leave room for improvement in accuracy, and methods based on deep learning consume a lot of time and computing resources during training. This thesis focuses on these deficiencies. It studies Chinese short text classification based on an improved FastText and proposes the TL-FastText method. The method first improves the LDA model with time influence factors and variable time windows, yielding the TIF-LDA topic model, and uses the variable time window to select the most valuable data for model training. In the input stage of the FastText model, the TF-IDF values of the words in the dictionary generated by the n-grams are calculated. Words with high frequency and low discrimination are filtered out, words with low frequency and high discrimination are retained, and meaningless words that do not appear in the documents are also screened out. The result is a reserved dictionary. The words in the reserved dictionary are then compared with the TIF-LDA result: if a word appears in both, the topic word sequence containing that word in the TIF-LDA result is added to the reserved dictionary, which is thereby supplemented and reconstructed. As a result, when TL-FastText computes the mean of the input word sequence vectors, the mean shifts toward highly discriminative terms, making the model better suited to Chinese short text classification.

The thesis compares TL-FastText with mainstream text classification models, including the classic FastText classification model, in terms of classification accuracy, model training time, and classification accuracy under multi-class settings. The experimental results show that the proposed method achieves better classification performance.
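
The reserved-dictionary step described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the max-over-documents TF-IDF score, the single `threshold` cutoff, and the toy data are all assumptions made for illustration. The tokens are assumed to be already word-segmented (English glosses stand in for Chinese words), and the `topics` list stands in for the topic word sequences that a TIF-LDA model would produce.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_scores(docs, n_max=2):
    """Score each n-gram by its maximum TF-IDF value over all documents."""
    grams_per_doc = [ngrams(d, n_max) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    scores = {}
    for grams in grams_per_doc:
        for g, count in Counter(grams).items():
            value = (count / len(grams)) * math.log(n_docs / df[g])
            scores[g] = max(scores.get(g, 0.0), value)
    return scores

def build_reserved_dictionary(docs, topics, threshold, n_max=2):
    """Keep discriminative n-grams, then add whole topic word sequences
    that share at least one word with the kept n-grams.

    `threshold` is a hypothetical cutoff; the abstract does not state how
    the filtering boundary is chosen.
    """
    scores = tfidf_scores(docs, n_max)
    kept = {g for g, s in scores.items() if s > threshold}
    reserved = set(kept)
    for topic_words in topics:
        if kept & set(topic_words):
            reserved |= set(topic_words)
    return reserved

# Toy, already-segmented short texts (assumed data).
docs = [
    ["phone", "screen", "very", "good"],
    ["phone", "battery", "very", "bad"],
    ["movie", "very", "nice"],
]
# Stand-in for topic word sequences from a topic model (assumed data).
topics = [["screen", "resolution", "pixel"], ["football", "match"]]

reserved = build_reserved_dictionary(docs, topics, threshold=0.2, n_max=1)
print(sorted(reserved))
```

In this toy run, "very" occurs in every document and so receives a TF-IDF of zero, which removes it. "screen" survives the filter and pulls its whole topic word sequence into the dictionary. The topic about football shares no word with the kept terms and is left out.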
Keywords/Search Tags:FastText, TF-IDF, LDA, short text classification, time window