| Semantic role tagging is one of the most critical tasks in natural language processing research. Few scholars currently study semantics in Tibetan natural language processing; most focus on morphology and syntax, which makes the study and annotation of Tibetan semantic roles especially challenging. By contrast, English and Chinese information processing already has relatively mature theories and technologies for semantic research, which have greatly accelerated progress in natural language processing and show that Tibetan natural language processing still has considerable room to develop in semantic roles and their annotation. It is therefore necessary to build a corpus that carries both traditional Tibetan grammatical attributes and information-processing attributes, and to study in depth the syntactic and semantic relationships between the surface and deep layers of Tibetan sentences. Tibetan semantic role analysis and tagging is an important route by which Tibetan natural language processing can move toward semantic research, and its results are of great value for the automatic extraction and analysis of Tibetan text, machine translation, and automatic question answering systems. In view of the development and application prospects of natural language processing at home and abroad, this article therefore focuses on the measurement of Tibetan semantic roles. Building on previous research and guided by syntactic metrology, corpus linguistics, semantic case theory, and frame semantics, this paper collected and collated Tibetan texts in fields such as history, novels, walks, reviews, and folklore, drawing on Tibetan language textbooks for primary, junior, and senior high schools in five provinces and regions related to Tibet. From these texts, 50000 sentences were extracted as corpus samples and tagged with parts of speech. On this basis, 8600 sentences were
annotated with semantic roles, and the annotations were converted into several formats to facilitate subsequent proofreading, statistics, and machine learning. Visual data analysis and statistics were carried out on the total number of words in the 8600 sentences with completed semantic role tagging, the size of the part-of-speech tag set, the semantic roles included, and the distribution of parts of speech in the corpus. Based on the semantic characteristics, syntactic structure, and semantic information contained in the annotated corpus of Tibetan logical cases, 46 Tibetan semantic roles with the semantic orientations of Tibetan logical cases were extracted; detailed examples and explanations of their concepts, meta structures, and features are provided in the relevant chapters on the Tibetan semantic role framework system. The chapter on the tagging tree base and role distribution builds on the content of Chapter 3 and applies frame semantics and jurisdiction theory, with verbs used as constraints. According to the distance and valence characteristics between verbs and syntactic components, the roles are divided into two types, roles and quasi roles: 19 types are roles in the real sense, dominated by verbs; the remaining 29 categories are modifier elements that have semantic orientation but no direct connection with verbs. The latter are referred to as quasi roles in this article and are merged into the corresponding attribute categories of the 19 core roles. For database building, this article adopts a sequence tagging format: the first column holds the words of the sentence and the following columns hold the attributes of each word, so that the columns run word, word order number, part of speech, core word, role relationship, and description. Finally, the parts of speech of the nominal units that assume the 19 argument roles in the tagging tree database were classified, and detailed data comparisons and analyses were made of the number of nouns that assume semantic roles in the total word
frequency, their proportion of the total number of words, the total number of words in the 8600 sentences, the average number of words per sentence, and the average number of semantic roles per sentence. The results show that the nominal units filling the 19 argument roles include common nouns, numerals, pronouns, gerunds, time nouns, people’s names, rhetoric words, place names, professional terms, and organization names, with a total word frequency of 59626, of which 17380 nouns assume semantic roles, accounting for 29.15% of the total number of nouns. The 8600 sentences contain 136605 words in total, with an average of 13 words per sentence and an average of 2 semantic roles per sentence. |
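
The column-based sequence tagging format and the per-sentence statistics described in the abstract can be illustrated with a small sketch. It is not the authors' tooling. It assumes tab-separated columns in the order word, word order number, part of speech, core word, role relationship, and description, with a blank line between sentences and `_` marking a token that carries no role. The field names, the sample tokens (`w1` to `w5`), and the role labels are hypothetical placeholders, not data from the corpus.

```python
from collections import Counter

# Assumed column order, following the abstract's description of the format.
FIELDS = ("word", "index", "pos", "head", "role", "description")

# Hypothetical toy sample: tab-separated columns, blank line between sentences.
SAMPLE = (
    "w1\t1\tn\t3\tagent\tdoer of the action\n"
    "w2\t2\tn\t3\tpatient\tthing acted upon\n"
    "w3\t3\tv\t0\t_\tcore verb\n"
    "\n"
    "w4\t1\tn\t2\tagent\tdoer of the action\n"
    "w5\t2\tv\t0\t_\tcore verb\n"
)


def parse_sentences(text):
    """Split column-formatted text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def corpus_stats(sentences):
    """Count words and role-bearing tokens, and average both per sentence."""
    n_words = sum(len(sentence) for sentence in sentences)
    role_counts = Counter(
        token["role"]
        for sentence in sentences
        for token in sentence
        if token["role"] != "_"
    )
    n_roles = sum(role_counts.values())
    return {
        "sentences": len(sentences),
        "words": n_words,
        "avg_words_per_sentence": n_words / len(sentences),
        "avg_roles_per_sentence": n_roles / len(sentences),
        "role_counts": dict(role_counts),
    }


stats = corpus_stats(parse_sentences(SAMPLE))
print(stats)
```

Keeping one token per row with fixed columns makes the same file usable for proofreading, counting, and as input to sequence-labeling models, which is the motivation the abstract gives for converting the annotations into this kind of format.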