
Chinese Lexical Analysis Research For Social Media

Posted on: 2019-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: S H He
GTID: 2438330551956365
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of Internet technology, people can obtain more and more information online. A growing number of websites and forums focus on social interaction, such as Weibo, Twitter, and Facebook, and people have begun to express their emotions, attitudes, and feelings on them. As a result, social media has grown in both size and number of users, and vast amounts of information are generated and spread on the Internet. No effective lexical analysis system currently exists for such a massive media corpus, and manual annotation alone cannot process it effectively. This thesis therefore addresses two main lexical analysis tasks in social media: Chinese word segmentation and named entity recognition. The first two chapters introduce the related and basic techniques for these problems in detail. To address the shortcomings of existing research, the third and fourth chapters propose one solution for each task:

(1) For Chinese word segmentation in social media, a feature engineering scheme based on dictionary resources and new word detection is proposed, designed from a study of the characteristics of social media. An LSTM-CRF algorithm is then employed for character-based Chinese word segmentation, which takes advantage of both feature engineering and deep neural networks and solves this problem effectively. Compared with the traditional CRFs algorithm, the proposed method achieves a significant improvement in OOV-Recall (the recall rate of out-of-vocabulary words in the test dataset).

(2) For named entity recognition in social media, an improved conditional random fields algorithm with embedding representation (EMB-CRF) is proposed. The character-position embedding, a compromise between character embedding and word embedding, is used as a feature in EMB-CRF. The algorithm treats the embedding representation as a dense real-valued vector feature with the same status as the traditional feature functions. In addition, based on statistics and analysis of named entities in the corpus, a pre-tag feature based on point-wise mutual information (PMI) is proposed. With the PMI pre-tag feature, the proposed EMB-CRF algorithm achieves better results for named entity recognition. The method proposed in this thesis achieves a slight improvement on the related dataset.
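The OOV-Recall metric described in (1) can be illustrated with a minimal Python sketch. This is not the thesis's evaluation code. It assumes the common convention that an out-of-vocabulary word counts as recalled only when the predicted segmentation reproduces both of its boundaries exactly. The function names `to_spans` and `oov_recall`, the toy vocabulary, and the example sentences are all hypothetical.

```python
def to_spans(words):
    """Convert a segmented sentence (list of words) into (start, end) character spans."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def oov_recall(gold_sents, pred_sents, train_vocab):
    """Share of gold words absent from train_vocab that the prediction
    segments with exactly the same start and end positions."""
    oov_total = 0
    oov_hit = 0
    for gold, pred in zip(gold_sents, pred_sents):
        pred_spans = set(to_spans(pred))
        for word, span in zip(gold, to_spans(gold)):
            if word in train_vocab:
                continue  # in-vocabulary words do not count toward OOV-Recall
            oov_total += 1
            if span in pred_spans:
                oov_hit += 1
    return oov_hit / oov_total if oov_total else 0.0


# Toy data (assumed for illustration only).
train_vocab = {"我", "喜欢", "看", "电影", "的"}
gold = [["我", "喜欢", "刷", "微博"], ["给力", "的", "电影"]]
pred = [["我", "喜欢", "刷微", "博"], ["给力", "的", "电影"]]

score = oov_recall(gold, pred, train_vocab)
print(f"OOV-Recall = {score:.3f}")
```

In this toy example the gold OOV words are "刷", "微博", and "给力". Only "给力" is segmented correctly, so one of the three OOV words is recalled.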
Keywords/Search Tags: Social media, Chinese word segmentation, named entity recognition, CRFs, Deep learning