Research On Chinese Short Text Classification Based On Improved FastText

Posted on:2019-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:B H Qu

Full Text:PDF

GTID:2428330545954763

Subject:Computer application technology

Abstract/Summary:

With the rise of Web 2.0 and the rapid spread of mobile smart devices, Chinese short text data such as microblogs and online comments is growing on a large scale. Classifying this data accurately is valuable for research on user interests and habits. Traditional Chinese short text classification methods select sentiment words from the short text data as features and build classifiers on bag-of-words models. These methods ignore the relations between terms and their semantic information, and classifiers built on the vector space model do not suit short texts, which are sparse and irregular. A common remedy for sparseness is to supplement short texts with an external knowledge base. However, because the external corpus is large and its topics are widely distributed, the computational overhead is high and the performance of the algorithm suffers.

Existing approaches have further shortcomings. Classification methods based on semantic rules have a high classification cost and poor scalability. Methods based on machine learning still need to improve in accuracy. Methods based on deep learning consume a lot of time and computing resources when training the model. To address these problems, this thesis studies Chinese short text classification based on an improved FastText and proposes the TL-FastText method, which improves FastText with TF-IDF and an LDA topic model with a time window.

The method first improves the LDA model with time influence factors and variable time windows, yielding the TIF-LDA topic model, and uses the variable time window to select the most valuable data for model training. In the input stage of the FastText model, the TF-IDF values of the words in the n-gram dictionary are calculated. Words with high frequency and low discrimination are filtered out, words with low frequency and high discrimination are retained, and meaningless words that do not appear in the document are also screened out. The result is the reserved dictionary. The words in the reserved dictionary are then compared with the TIF-LDA result. If a word appears in both, the topic word sequence containing that word in the TIF-LDA result is added to the reserved dictionary, which is thereby supplemented and reconstructed. As a result, when TL-FastText calculates the mean of the input word sequence vectors, the mean shifts toward highly discriminative terms, which makes the model more suitable for Chinese short text classification.

This thesis compares the proposed method with mainstream text classification models, including the classic FastText classification model, in terms of classification accuracy, model training time, and classification accuracy in multi-class settings. The experimental results show that the proposed method has better classification performance.
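The reserved-dictionary step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`ngrams`, `tfidf_scores`, `build_reserved_dictionary`), the corpus-level TF-IDF formula, and the `min_score` threshold are all assumptions made for illustration, and the topic word lists stand in for the output of TIF-LDA, whose details the abstract does not give. The English tokens stand in for pre-segmented Chinese words.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_scores(docs, n_max=2):
    """Corpus-level TF-IDF per term (an assumed formula, not the thesis's):
    relative term frequency times log(number of docs / document frequency)."""
    term_lists = [ngrams(doc, n_max) for doc in docs]
    tf = Counter(t for terms in term_lists for t in terms)
    df = Counter(t for terms in term_lists for t in set(terms))
    total = sum(tf.values())
    n_docs = len(docs)
    return {t: (tf[t] / total) * math.log(n_docs / df[t]) for t in tf}


def build_reserved_dictionary(docs, topics, min_score=0.0, n_max=2):
    """Keep terms scoring above min_score (a hypothetical threshold), then add
    every topic word list that shares at least one word with the kept terms."""
    scores = tfidf_scores(docs, n_max)
    base = {t for t, s in scores.items() if s > min_score}
    reserved = set(base)
    for topic_words in topics:
        # Compare against the filtered terms only, so one topic cannot pull
        # in another topic indirectly.
        if base.intersection(topic_words):
            reserved.update(topic_words)
    return reserved


# Toy corpus: each document is an already segmented short text.
docs = [
    ["phone", "screen", "very", "good"],
    ["phone", "battery", "very", "bad"],
    ["phone", "delivery", "very", "fast"],
]
# Placeholder topic word lists standing in for a TIF-LDA result.
topics = [["screen", "resolution", "display"], ["price", "discount"]]

reserved = build_reserved_dictionary(docs, topics, n_max=1)
print(sorted(reserved))
```

In this toy corpus, "phone" and "very" occur in every document, so their inverse document frequency is zero and they are filtered out. The first topic shares "screen" with the retained terms, so its whole word list is added, while the second topic has no overlap and is ignored.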

Keywords/Search Tags:

FastText, TF-IDF, LDA, short text classification, time window

Related items

1	Research On The Method And Its Application Of Short Text Classification Based On FastText
2	Research On Chinese Text Classification Based On Improved FastText
3	Research On FastText Text Classification Algorithm Based On TF-IDF
4	Research On Group Classification Technology Based On Chat Content
5	Research On Fast And Precise Classification Algorithm Of Long Text Based On FastText
6	Research On FastText-based Classification Of News Texts And Its Application In Agricultural News
7	Research On Text Classification Based On Improved TF-IDF And FastText Algorithm
8	Research On Short Text Classification Method For Intelligence Analysis
9	Research On Short Text Sentiment Tendency Analysis Algorithm
10	Research On Text Classification Method Based On Feature Vector Construction