
Application Research Of User Typing Behaviors In Chinese Word Segmentation

Posted on: 2019-09-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D K Zhang
GTID: 1488306470493594
Subject: Computer software and theory
Abstract/Summary:
User typing behaviors comprise not only the operations users perform when typing text with Chinese input methods (IME), such as keystrokes, modification, search, selection, and confirmation, but also the general regularities behind these operations. These behaviors carry valuable annotation information that has so far not been saved or put to good use. This thesis studies user typing behaviors and reveals the word segmentation annotations hidden in them. This annotation information can easily be used to build word segmentation corpora in real time, and it offers new perspectives and ideas for solving existing problems in Chinese word segmentation. The main contributions of this thesis are as follows.

(1) We propose the concept of Natural Typing Annotation and generalize three User Typing Patterns. From a large body of collected experimental texts, we find that user typing behaviors provide valuable word segmentation annotations. When typing Chinese text in Pinyin, people usually need to press a numeric key or the space key to choose among homophones, and each such choice can be viewed as a segmentation cue. The traditional text saving format does not retain these annotations, so we design a new text saving format that records them, which we call Natural Typing Annotations (NTAs) text. From all collected NTAs texts we abstract three User Typing Patterns: the Discrete Pattern, the Adhesive Pattern, and the Acceptable Pattern. Text that follows the Acceptable Pattern provides concrete cues for word segmentation. Users whose NTAs texts follow the Acceptable Pattern are our primary concern; we call them Good Typing Users (GTU) because they have good and stable typing habits. Good typing users provide high-quality NTAs texts, which can easily be used as a training corpus for a word segmenter.

(2) We study why high-quality NTAs texts arise naturally. This thesis examines the factors that influence user typing habits from three aspects: language development, theoretical derivation, and practical operation. We prove that the Acceptable Pattern is “the most economical and efficient” way to type text, and that good typing users follow exactly this way when they type. This explains why high-quality NTAs texts arise naturally and why they can be treated as a stable segmentation corpus.

(3) We design an algorithm to select high-quality NTAs texts. The algorithm is based on the idea of Collaborative Filtering; it identifies high-quality NTAs texts and thereby finds good typing users. To obtain a large and continuously growing segmentation corpus, we constantly track and collect the NTAs texts that good typing users publish.

(4) We design a new word segmentation architecture that takes full advantage of NTAs texts. The segmenter in this architecture is built on “representation learning” and a Bi-LSTM network. The most significant feature of the architecture is that users participate in the evolution of the segmenter. The segmenter therefore sits in a cycle of “users → data → algorithm → system → users” and can become a truly self-evolving system.
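The idea in contribution (1), that each candidate selection marks a segment boundary, can be illustrated with a minimal Python sketch. The abstract does not specify the NTAs text format, so the event record used here (a Pinyin string paired with the text committed by a space or numeric key) and the function names `segments_from_commits` and `bmes_tags` are hypothetical, chosen only for illustration. The B/M/E/S labels are the common character-tagging scheme for segmentation training data, not necessarily the one used in the thesis.

```python
from typing import List, Tuple

# Hypothetical record of one IME commit: (pinyin typed, text chosen by the user).
# The thesis's actual NTAs text format is not given in the abstract.
CommitEvent = Tuple[str, str]


def segments_from_commits(events: List[CommitEvent]) -> List[str]:
    """Treat each commit (space key or numeric key) as one candidate segment."""
    return [text for _pinyin, text in events if text]


def bmes_tags(segments: List[str]) -> List[str]:
    """Turn segments into per-character labels: Begin, Middle, End, Single."""
    tags: List[str] = []
    for seg in segments:
        if len(seg) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return tags


# Illustrative input: a user types a sentence in four commits.
events = [("women", "我们"), ("xihuan", "喜欢"), ("chi", "吃"), ("pingguo", "苹果")]
segments = segments_from_commits(events)
labels = bmes_tags(segments)
print(" / ".join(segments))
print(list(zip("".join(segments), labels)))
```

Under these assumptions, the commit boundaries alone yield character-level training labels without any manual annotation, which is the property the abstract attributes to texts typed in the Acceptable Pattern.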
Keywords/Search Tags:user typing behaviors, Natural Typing Annotations (NTAs), Chinese word segmentation, segmentation corpus, typing pattern