Research On Language Identification Of Social Media Short Text Based On N-Gram Vector Feature

Posted on: 2021-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: B Ding
Full Text: PDF
GTID: 2518306308470124
Subject: Computer Science and Technology
Abstract/Summary:
For social media platforms, identifying and tagging text in specific languages is of great practical value for research such as sentiment classification, trend extraction, and movie rating prediction. Traditional language identification methods are mostly designed for long, grammatically well-formed text, and they perform poorly on short text such as social media posts. This thesis takes social media short text as its research object and studies methods for corpus annotation and language identification. The main research contents can be summarized as follows:

(1) Research on a maximized automatic annotation method combined with domain-related information. In this part, we propose and implement a maximized automatic annotation method that combines domain-related information with base classifier voting. The method first studies how domain-related information in social media text correlates with language and, based on an assumption about user-specific language, proposes a way to calculate user-specific language weights that is simple and effectively improves the accuracy of corpus annotation. Then, using existing language identification methods as base classifiers, the prediction probability of each classifier is calculated, weighted voting is carried out, and the corpus is automatically labeled using the user-specific language weights. Finally, the method is tested on open datasets and real social media text datasets. The experimental results show that, compared with a single language identification method and with simple voting annotation, the method achieves higher accuracy, precision, recall, and F1-score in automatic corpus annotation.

(2) Research on a language identification method based on N-Gram vector features. In this part, we propose a language identification method based on N-Gram vector features, which improves the cascade forest method and uses a text feature representation based on N-Gram weighted frequency and sentence vectors to improve the accuracy of language identification. Experiments are carried out on real social media short text datasets. The experimental results show that this method identifies the language of social media short text more accurately. Compared with five existing language identification methods (TextCat, Langid.py, Language Detector, Language Detect, and FastText), the method achieves significantly higher accuracy, precision, recall, and F1-score in two experimental settings: using pre-trained language models and using retrained language models.

Finally, this thesis designs and implements a language identification system and tests it on an open short text dataset. The test results show that the proposed corpus annotation and language identification methods are effective on social media short text.
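The weighted voting step described in part (1) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function name `weighted_vote`, the per-classifier weights, and the choice to apply the user-specific language weight as a multiplicative boost on the summed scores are all hypothetical and are not taken from the thesis.

```python
def weighted_vote(classifier_probs, classifier_weights, user_language_weights=None):
    """Pick a language label by weighted soft voting.

    classifier_probs: {classifier_name: {language: probability}}
    classifier_weights: {classifier_name: weight} (missing names default to 1.0)
    user_language_weights: optional {language: weight in [0, 1]} describing how
        strongly a user is associated with a language (hypothetical form).
    """
    scores = {}
    # Sum each base classifier's probabilities, scaled by that classifier's weight.
    for name, probs in classifier_probs.items():
        weight = classifier_weights.get(name, 1.0)
        for language, probability in probs.items():
            scores[language] = scores.get(language, 0.0) + weight * probability

    # Assumption: the user-specific weight boosts the score multiplicatively.
    if user_language_weights:
        for language in scores:
            scores[language] *= 1.0 + user_language_weights.get(language, 0.0)

    best_language = max(scores, key=scores.get)
    return best_language, scores


# Two hypothetical base classifiers that disagree on a short post.
probs = {
    "classifier_a": {"en": 0.6, "es": 0.4},
    "classifier_b": {"en": 0.3, "es": 0.7},
}
weights = {"classifier_a": 1.0, "classifier_b": 1.0}

label_without_user, _ = weighted_vote(probs, weights)
label_with_user, _ = weighted_vote(probs, weights, {"en": 0.5})
```

In this toy input the plain vote favors `es`, while a user who is assumed to write mostly in English tips the label to `en`. This shows how domain-related information could resolve disagreement between base classifiers on short text.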
Keywords/Search Tags:language identification, corpus annotation, social media short text, N-Gram, cascade forest