Research On Language Identification Of Social Media Short Text Based On N-Gram Vector Feature

Posted on:2021-06-12

Degree:Master

Type:Thesis

Country:China

Candidate:B Ding

GTID:2518306308470124

Subject:Computer Science and Technology

Abstract/Summary:

For social media platforms, identifying and tagging text in specific languages is of great practical significance for research such as emotion classification, trend extraction, and movie rating prediction. Traditional language identification methods mostly target long, grammatically well-formed text, and their performance on short text such as social media posts is unsatisfactory. In this paper, social media short text is taken as the research object, and methods for corpus annotation and language identification are studied. The main research contents can be summarized as follows:

(1) Research on a maximized automatic annotation method combined with domain-related information. In this part, we propose and implement a maximized automatic annotation method that combines domain-related information with base classifier voting. The method first studies the linguistic relevance of domain-related information in social media text and, based on the user-specific language assumption, proposes a method for calculating user-specific language weights; this calculation is simple and effectively improves the accuracy of corpus annotation. Then, with existing language identification methods as base classifiers, the prediction probability of each classifier is calculated, weighted voting is carried out, and the corpus is automatically labeled in combination with the user-specific language weights. Finally, the method is tested on open datasets and real social media text datasets. The experimental results show that, compared with a single language identification method and with simple voting annotation, the method achieves higher accuracy, precision, recall, and F1-score in automatic corpus annotation.

(2) Research on a language identification method based on N-Gram vector features. In this part, we propose a language identification method based on N-Gram vector features, which improves the cascade forest method and uses a text feature representation based on N-Gram weighted frequency and sentence vectors to improve the accuracy of language identification. Experiments are carried out on real social media short text datasets. The experimental results show that this method identifies the language of social media short text better. Compared with five existing language identification methods (TextCat, Langid.py, Language Detector, Language Detect, and FastText), the method achieves significantly higher accuracy, precision, recall, and F1-score under two experimental settings: using pre-trained language models and using retrained language models.

Finally, this paper designs and implements a language identification system and tests it on an open short text dataset. The test results show that the corpus annotation and language identification methods proposed in this paper are effective for social media short text.
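
The annotation step in part (1) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names `user_language_weights` and `weighted_vote`, the definition of the user weight as the relative frequency of languages in a user's earlier labeled posts, and the blending parameter `alpha`.

```python
from collections import Counter, defaultdict


def user_language_weights(previous_labels):
    """Hypothetical user-specific language weight: the share of each
    language among a user's earlier labeled posts (assumed definition)."""
    counts = Counter(previous_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in counts.items()}


def weighted_vote(classifier_probs, classifier_weights, user_weights, alpha=0.5):
    """Combine base-classifier probabilities by weighted voting, then blend
    the result with the user-specific language weights.

    classifier_probs:   list of {language: probability}, one dict per classifier
    classifier_weights: list of floats, one weight per classifier
    user_weights:       {language: weight} for the author of the text
    alpha:              assumed blending factor between votes and user weights
    """
    total_weight = sum(classifier_weights)
    votes = defaultdict(float)
    for probs, weight in zip(classifier_probs, classifier_weights):
        for lang, prob in probs.items():
            votes[lang] += (weight / total_weight) * prob

    languages = set(votes) | set(user_weights)
    combined = {
        lang: (1 - alpha) * votes.get(lang, 0.0) + alpha * user_weights.get(lang, 0.0)
        for lang in languages
    }
    best = max(combined, key=combined.get)
    return best, combined


# Two base classifiers disagree; the user's posting history breaks the tie.
history = user_language_weights(["nl", "nl", "nl", "en"])
label, scores = weighted_vote(
    [{"en": 0.6, "nl": 0.4}, {"en": 0.4, "nl": 0.6}],
    [1.0, 1.0],
    history,
)
```

In this sketch the two classifiers split evenly between `en` and `nl`, and the user weight tips the label toward the language the user has mostly written in before, which is the intuition behind the user-specific language assumption.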

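The N-Gram weighted frequency feature in part (2) can be sketched in a similar spirit. The abstract does not define the weighting scheme, the n-gram unit, or how the sentence vector and the improved cascade forest are combined, so this sketch covers only an assumed character-level variant: `char_ngrams`, `weighted_ngram_frequencies`, and the per-order weights in `order_weights` are hypothetical names and choices.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of a text, padded with one space per side."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def weighted_ngram_frequencies(text, orders=(1, 2, 3), order_weights=None):
    """Map each character n-gram to its relative frequency within its order,
    scaled by an assumed per-order weight (default 1.0 for every order)."""
    if order_weights is None:
        order_weights = {n: 1.0 for n in orders}
    features = {}
    for n in orders:
        grams = char_ngrams(text, n)
        if not grams:
            continue
        total = len(grams)
        for gram, count in Counter(grams).items():
            features[gram] = order_weights[n] * count / total
    return features


# Short social-media-style text; the resulting dict could be vectorized and
# passed to any classifier.
features = weighted_ngram_frequencies("lol ok", orders=(1, 2))
```

Character n-grams are a common choice for short, noisy text because they remain informative when words are misspelled or abbreviated; whether the thesis uses character or word n-grams is not stated in the abstract.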
Keywords/Search Tags:

language identification, corpus annotation, social media short text, N-Gram, cascade forest

Related items

1	Research On N-gram Based Hierarchical Text Language Identification
2	Categorization Corpus Construction And Research On Classification Method For Short Text
3	Language Independent Text Categorization
4	Research On Language Independent Text Categorization
5	Research And Application Of Multilingual Text Embedding Model
6	Study On Short Text Data Mining Based On Social Media
7	Research On Active Learning Based Automatic Corpus Annotation
8	Research On Short Text Emotion Classification Method Based On Word2Vec And N-Gram
9	A Study For Classifying Short Text In Social Media
10	Chinese New Word Identification Based On Large-scale Corpus