Syntactic analysis of Cambodian has considerable theoretical significance and practical value for core linguistic research on Cambodian, for NLP research, and for teaching practice. From a linguistic point of view, syntactic analysis is the key link between lexical analysis above it and semantic analysis below it, and is the hub connecting surface grammatical structure to deep semantic structure. From the perspective of NLP, the quality of Cambodian syntactic analysis directly affects the performance of higher-level tasks such as question answering, machine translation, and information extraction, and it is both a focus and a difficulty of Cambodian NLP research. From the perspective of teaching practice, syntactic analysis is a necessary skill for truly understanding Cambodian and an important basis for judging teaching effectiveness. At present, however, academic research on Cambodian syntax remains relatively weak: linguistics lacks a systematic description and interpretation, NLP work is still mainly at the stage of lexical analysis, and syntactic analysis in teaching practice is also superficial.

Taking Cambodian syntax as its research object, this paper uses dependency grammar theory to discuss Cambodian syntax at two levels, linguistics and NLP, in order to promote the development of Cambodian syntax research and, in particular, to move Cambodian NLP research from lexical analysis to syntactic analysis. Following the characteristics of the task and the research order of "upstream before downstream, theory before practice", it focuses on four problems: (1) solve the lack of a large-scale, high-quality annotated corpus for Cambodian by constructing a large-scale annotated corpus that conforms to specifications, creating the basic conditions for subsequent lexical and syntactic research; (2) solve the problem of
Cambodian lexical analysis, chiefly the two tasks of word segmentation and part-of-speech tagging, so as to lay the foundation for syntactic analysis; (3) solve the lack of theoretical analysis by using dependency grammar theory to interpret and explain Cambodian sentences at the linguistic level, forming a theoretical basis for Cambodian syntax research; (4) solve the practical application problem by taking dependency grammar theory as a guide, proposing a strategy for automatic syntactic analysis of Cambodian, and developing a small syntactic analyzer for verification.

The main solutions and research contents are as follows.

(1) Taking the zero-width space (ZWSP) found in Cambodian text as the point of entry, a crawler is used to collect corpus data containing ZWSP on a large scale, and the data are normalized with regular expressions. Converting ZWSP into ordinary half-width spaces makes it possible to skip manual word segmentation, and a large-scale word-segmented corpus is built as the data support for this study. This part addresses the lack of a large-scale, high-quality annotated corpus for Cambodian and creates the basic conditions for subsequent lexical and syntactic research.

(2) Statistical analysis of the corpus yields data such as the distribution of commonly used words and the frequency of monosyllabic words. Based on these corpus statistics and the official Cambodian dictionary, a high-quality vocabulary list is constructed. Combining three methods, the "bidirectional maximum matching algorithm", "regular expressions", and "Khmer character clusters", under a "rules + statistics" approach, a high-quality Cambodian word segmentation model is developed. On top of the word segmentation model, a joint model of word segmentation and part-of-speech tagging is developed using the official Cambodian dictionary together with trigram collocation information based on an N-gram model and the frequency
of use. This part addresses Cambodian lexical analysis, especially the insufficient accuracy of word segmentation and part-of-speech tagging. By analyzing the causes of the problems, proposing solutions, and developing the joint model, the accuracy of Cambodian word segmentation and part-of-speech tagging reaches a level high enough to support downstream tasks.

(3) The high-quality joint model of word segmentation and part-of-speech tagging lays a solid foundation for the study of Cambodian syntax. In this study, Cambodian syntactic structure is described and explained systematically using dependency grammar, with detailed examples covering three aspects: "multi-word structures", "basic syntactic structures", and "special syntactic structures". In addition, based on the linguistic characteristics of Cambodian and the dependency specification of Universal Dependencies, the dependency relations between Cambodian sentence constituents are defined, and it is proposed that Cambodian has 27 dependency relations. This part addresses the lack of theoretical analysis in Cambodian syntax research: it applies the syntactic analysis theory of dependency grammar to describe and explain Cambodian syntax comprehensively and, for the first time, discusses the application of dependency grammar to Cambodian syntax research at the theoretical level.

(4) On the basis of the above and following a "divide and conquer" strategy, it is proposed that the traditional linguistic elements of "part of speech", "position", "collocation", and "syntactic function" can be used to move the analysis from the lexical level to the syntactic level through "layer-by-layer merging" and "transfer of dominance". Finally, the main process of Cambodian dependency syntactic analysis is designed in a rule-driven
manner, which provides a strategy for developing a Cambodian dependency syntactic analyzer. This part addresses the practical application of Cambodian syntactic analysis: it presents the design of a Cambodian dependency syntactic analyzer and sets out a practical, rule-based route to Cambodian dependency syntactic analysis. Testing on examples verifies the feasibility of the idea, and a simple dependency syntactic analyzer is developed for corpus testing and training.

The study shows the following. First, ZWSP plays an effective role in constructing a large-scale, high-quality annotated corpus of Cambodian: it greatly reduces the time and effort of manual annotation and, in particular, keeps annotation standards consistent and accuracy high. Second, dependency grammar theory applies well to Cambodian syntactic analysis at both the level of linguistic theory and the level of NLP practice, and deserves further attention and research. Third, supported by a large-scale annotated corpus and dependency grammar theory, Cambodian syntactic analysis research has combined theory with practice, and Cambodian NLP research has substantively moved from lexical analysis to syntactic analysis.
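
As an illustration of the ZWSP-based corpus construction described in solution (1), the following is a minimal sketch, not the procedure actually used in this study. It assumes that word boundaries in the crawled text are marked with U+200B and that collapsing runs of ZWSP and ordinary spaces into one space is acceptable normalization; the function name `zwsp_to_segmented` and the sample sentence are hypothetical.

```python
import re

# Runs of zero-width spaces (U+200B), possibly mixed with ordinary spaces.
# Assumption: ZWSP marks word boundaries in the crawled Cambodian text.
BOUNDARY = re.compile("[\u200b ]+")

def zwsp_to_segmented(text: str) -> str:
    """Replace ZWSP word boundaries with single half-width spaces."""
    return BOUNDARY.sub(" ", text).strip()

# Hypothetical sample: three words separated by ZWSP.
raw = "ខ្ញុំ\u200bទៅ\u200bសាលារៀន"
tokens = zwsp_to_segmented(raw).split(" ")
print(len(tokens), tokens)
```

Real crawled data would likely need further cleaning (markup removal, punctuation handling, inconsistent ZWSP usage), which this sketch leaves out.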
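
The "bidirectional maximum matching algorithm" named in solution (2) can be sketched as follows. This is a generic textbook version under stated assumptions, not the segmentation model developed in this study: it matches over individual characters rather than Khmer character clusters, uses a toy Latin-letter vocabulary, and resolves disagreements with a common heuristic (fewer tokens, then fewer single-character tokens, then the backward result). All function names are hypothetical.

```python
def forward_mm(text, vocab, max_len):
    """Greedy longest match scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_mm(text, vocab, max_len):
    """Greedy longest match scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens

def count_singles(tokens):
    """Number of single-character tokens in a segmentation."""
    return sum(len(t) == 1 for t in tokens)

def bidirectional_mm(text, vocab):
    """Run both directions and pick one result by a simple heuristic."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same token count: prefer fewer single-character tokens, else backward.
    return fwd if count_singles(fwd) < count_singles(bwd) else bwd

toy_vocab = {"ab", "abc", "cd"}  # toy vocabulary, for illustration only
print(bidirectional_mm("abcd", toy_vocab))
```

On the toy input, the forward pass yields `abc` + `d` and the backward pass yields `ab` + `cd`; both have two tokens, so the heuristic picks the backward result because it contains no single-character token.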