| The importance of out-of-set characters (characters not included in the standard character set) in ancient books lies not only in the academic research value of the characters themselves but also in their vital role in the information processing of ancient books. Collections of out-of-set characters have considerable research value for the study of Chinese character standardization and for historical research on specific periods. Compared with Modern Chinese and Handed Down Literature Databases, the proportion of out-of-set characters in the Ancient Turpan Document Database is relatively large. Therefore, when Digital Humanities methods are used to process the texts of Ancient Turpan Documents, the form in which out-of-set characters exist in the database and the way they participate in it directly affect information processing. The statistics in this paper cover all Ancient Turpan Documents that have been unearthed, interpreted, and published, 52 kinds in total: 9 kinds date from the pre-Qin period, with a proportion of out-of-set characters of about 3.5%-8.9%; in another group the proportion is about 0.3%-0.5%; 34 kinds date from the Han Dynasty, with a proportion of about 0.1%-0.8%; and 2 kinds date from the Three Kingdoms to the Wei and Jin Dynasties, with a proportion of about 0.07%-0.2%. Taking the existing collections of out-of-set characters as the research object, this paper first refers to Inscriptions-Bone inscriptions, Index and other digital text processing methods, creates glyphs for the out-of-set characters of the Qin Bamboo Slips Collection as an example, and establishes a collection of out-of-set characters from unearthed literature. On this basis, and drawing on earlier research into input methods for generating out-of-set characters in Ancient Turpan Documents, it proposes a universal programming method suitable for text information processing. This method not only improves the data integrity of the corpus of Ancient Turpan Documents, but
also enables out-of-set characters that previously could not be used in text information processing to participate in natural language processing. Next, this paper uses Chinese word segmentation to verify that out-of-set characters handled by the universal programming method can be processed as text. Finally, the out-of-set characters of the Qin Bamboo Slips are added to complete the corresponding corpus, and the corpus is used for word segmentation experiments under three different methods: a rule-based word segmentation method, a statistics-based word segmentation method, and the mainstream word segmentation tools jieba and HanLP. The experimental results show that out-of-set characters handled by the universal programming method can be applied directly in natural language processing. The method is therefore effective and feasible for corpus construction, with the Qin Bamboo Slips as an example, and this paper attempts to generalize it to the text information processing of all out-of-set characters in Ancient Turpan Documents. |
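
The abstract does not specify how the universal programming method encodes out-of-set characters, so the following is only a minimal sketch under stated assumptions. It assumes each out-of-set character is identified by a hypothetical glyph ID (`G001`, `G002`) and assigned a code point in the Unicode Private Use Area, so that it becomes an ordinary one-character string. A simple forward maximum matching segmenter (one common rule-based method, not necessarily the one used in the paper) then treats it like any other character. The function names and the toy lexicon are illustrative, not taken from the paper.

```python
# Sketch only: glyph IDs, function names, and the toy lexicon are assumptions.
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


def build_pua_map(glyph_ids):
    """Assign each (hypothetical) out-of-set glyph ID a unique PUA character."""
    mapping = {}
    for offset, glyph_id in enumerate(sorted(set(glyph_ids))):
        code_point = PUA_START + offset
        if code_point > PUA_END:
            raise ValueError("too many glyphs for the BMP Private Use Area")
        mapping[glyph_id] = chr(code_point)
    return mapping


def encode(tokens, pua_map):
    """Join a transcription into one string, replacing glyph IDs with PUA characters."""
    return "".join(pua_map.get(token, token) for token in tokens)


def forward_max_match(text, lexicon, max_len=4):
    """Rule-based segmentation: at each position take the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy data: "G001" stands in for an out-of-set character in a transcription.
pua_map = build_pua_map(["G002", "G001"])
text = encode(["G001", "人", "之", "田"], pua_map)
lexicon = {encode(["G001", "人"], pua_map)}  # a word containing the out-of-set character
words = forward_max_match(text, lexicon)
```

Because the out-of-set character is now a single code point, a lexicon entry that contains it is matched like any other word. The same encoded words could in principle be registered with a segmentation tool's user dictionary, for example through `jieba.add_word`, although whether the paper does this is not stated.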