Research On The Digitalization Of Out-Of-Set Characters Based On An Ancient Document Database

Posted on:2023-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:J Tang

Full Text:PDF

GTID:2555306845954239

Subject:History of science and technology

Abstract/Summary:

Out-of-set characters ("Character out the Set") in ancient books matter both for the academic research value of the characters themselves and for the information processing of ancient books. Collections of these characters are valuable for studying the process of Chinese character standardization and for historical research on specific periods. Compared with Modern Chinese and Handed Down Literature databases, the Ancient Turpan Document Database contains a relatively large proportion of out-of-set characters. Therefore, when Digital Humanities methods are used to process the texts of Ancient Turpan Documents, the form in which out-of-set characters exist in the database, and the way they participate in it, directly affects information processing.

The statistics in this paper cover all Ancient Turpan Documents that have been unearthed, interpreted, and published, 52 kinds in total. Of these, 9 kinds date to the pre-Qin period, with out-of-set characters accounting for about 3.5%-8.9%; [...] the proportion is about 0.3%-0.5%; 34 kinds date to the Han Dynasty, with a proportion of about 0.1%-0.8%; and 2 kinds date from the Three Kingdoms to the Wei and Jin Dynasties, with a proportion of about 0.07%-0.2%.

Taking the existing collections of out-of-set characters as the research object, the paper first refers to Inscriptions-Bone inscriptions, Index, and other digital text processing methods, and uses the out-of-set characters of the Qin Bamboo Slips Collection as an example to create characters and establish a collection of out-of-set characters from unearthed literature. Second, on this basis and building on earlier studies of input methods for generating out-of-set characters in Ancient Turpan Documents, it proposes a universal programming method suited to text information processing. This method improves data integrity in the corpus of Ancient Turpan Documents and also enables out-of-set characters that previously could not be used in text information processing to take part in natural language processing.

The paper then uses Chinese word segmentation to verify that out-of-set characters processed with the universal programming method can be used in text information processing. The out-of-set characters of the Qin Bamboo Slips are added to the corpus, and word segmentation experiments are run on that corpus with three different approaches: a rule-based method, a statistics-based method, and the mainstream word segmentation tools jieba and HanLP. The results show that out-of-set characters processed with the universal programming method can be applied directly in natural language processing, so the method is effective and feasible for corpus construction, with the Qin Bamboo Slips as an example. The paper also attempts to generalize the method to the text information processing of all out-of-set characters in Ancient Turpan Documents.
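
The abstract does not spell out how the universal programming method works, so the following is only a minimal sketch of the general idea, under stated assumptions: each out-of-set character is transcribed as a hypothetical glyph ID in braces (for example `{qin-0001}`), each ID is mapped to one code point in the Unicode Private Use Area so that it behaves like an ordinary single character, and a simple rule-based segmenter (forward maximum matching) is then run on the encoded text. The class `OutOfSetCodec`, the function `forward_max_match`, the brace notation, the sample text, and the lexicon are all illustrative and are not taken from the thesis.

```python
import re

# Assumption: out-of-set characters appear in the transcription as "{glyph-id}".
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


class OutOfSetCodec:
    """Maps hypothetical glyph IDs to single Private Use Area characters."""

    def __init__(self):
        self.to_pua = {}    # glyph ID -> PUA character
        self.from_pua = {}  # PUA character -> glyph ID

    def encode(self, text):
        """Replace every "{glyph-id}" with one PUA character."""
        def repl(match):
            glyph_id = match.group(1)
            if glyph_id not in self.to_pua:
                code_point = PUA_START + len(self.to_pua)
                if code_point > PUA_END:
                    raise ValueError("Private Use Area exhausted")
                ch = chr(code_point)
                self.to_pua[glyph_id] = ch
                self.from_pua[ch] = glyph_id
            return self.to_pua[glyph_id]
        return PLACEHOLDER.sub(repl, text)

    def decode(self, text):
        """Restore the "{glyph-id}" notation for display."""
        return "".join(
            "{" + self.from_pua[c] + "}" if c in self.from_pua else c
            for c in text
        )


def forward_max_match(text, lexicon, max_len=4):
    """Rule-based segmentation: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


codec = OutOfSetCodec()
encoded = codec.encode("甲{qin-0001}乙丙")             # made-up sample text
lexicon = {codec.encode("甲{qin-0001}"), "乙丙"}       # made-up lexicon
tokens = forward_max_match(encoded, lexicon)
print([codec.decode(t) for t in tokens])  # ['甲{qin-0001}', '乙丙']
```

Because every out-of-set character becomes exactly one character after encoding, a lexicon entry can contain it like any other character, and the same encoded text could in principle be passed to a statistics-based segmenter or to a tool such as jieba or HanLP; how the thesis does this in practice is not stated here.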

Keywords/Search Tags:

Ancient Turpan Document Database, Character out the Set, Text Information Processing, Word Segmentation

Related items

1	Chinese Semantic Structure Of Noun Phrases Words Containing The Event Information Database Developed
2	The Rearrangement And Research Of The Documents Of The Tianshan Prefecture Of The WuZhou Dynasty Unearthed In Turfan
3	The Study Of Automatic Chinese Phoneticize Label Based On Automatic Word Segmentation
4	Design And Implementation Of Tibetan Ancient Document Recognition System
5	The Role Of Sublexical Information In Chinese Character Recognition
6	The Study On Chinese Text Segmentation
7	Research On The Integrated Processing Technology Of Sentence Segmentation And Lexical Analysis Of Ancient Texts Based On Deep Learning
8	Information Processing-oriented Analysis On Preposition "Dui" And Its Structure
9	The Neural Mechanism Of The Interaction Of Phonetic And Semantic Information In Chinese Character Reading
10	Research On The Construction Of Tibetan Verb Information Database