Research on Copy Detection Technology for Tibetan Text

Posted on: 2016-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: B T Kan
Full Text: PDF
GTID: 2278330461466097
Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
Text is a major form of information resource on the Internet. With the continuous development of the Internet and the growth of digital resources, the network has become a platform for sharing resources and exchanging information. It is now an important source of information and gives researchers, teachers and students a convenient channel for academic exchange. A new text can be formed from an existing one by adding or deleting passages, substituting words, or restating sentences; this behavior is called text duplication or text copying. Text copy detection technology is an important means of preventing such behavior, protecting the intellectual property of texts, improving the academic atmosphere, and improving the efficiency of information retrieval.
Currently, copy detection technology for English text is relatively mature. However, because Tibetan and English differ as natural languages, many English copy detection and natural language processing techniques are not fully applicable to Tibetan and cannot be used to measure the copy rate of Tibetan texts. This gap has contributed to low-quality papers, a poor academic atmosphere and slow progress in academic innovation at many national universities and in Tibetology. How to design and implement a system that detects the copy rate of Tibetan text is therefore the focus of this research.
An analysis of English copy detection results shows that the sentence is the smallest unit of typical plagiarism: copied material is generally not smaller than a sentence, because the sentence is the basic unit of text content that carries complete semantics. This thesis therefore adopts a sentence-level copy detection method for Tibetan, which computes the similarity between Tibetan sentences using the vector space model and cosine similarity. The key steps of the algorithm are to select features, to build the vector space model from the feature vectors, and finally to compute the cosine similarity. On this basis, copy detection technology for Tibetan text is studied.
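
The sentence-level similarity described above can be illustrated with a minimal sketch. It assumes each sentence has already been segmented into words and uses raw term counts as the feature vector, which is a simplification; the function name `cosine_similarity` is illustrative and does not come from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts.

    Assumption: the tokens are the output of some word segmenter;
    the thesis does not specify this exact representation.
    """
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms the two sentences share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)
```

Under this sketch, two sentences with identical word counts score 1.0 (up to floating-point rounding) and two sentences with no words in common score 0.0, so a threshold between those extremes can flag likely copies.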
Tibetan text copy detection follows these basic steps: text preprocessing, text blocking, feature extraction and sentence similarity computation. The sentence similarities are then used to measure the copy rate of the whole Tibetan text.
In preprocessing, so that text is encoded uniformly and can be stored, the encodings used for Tibetan characters and for Tibetan texts were studied, and the text is converted to a unified Unicode encoding.
In blocking, a Tibetan sentence boundary identification method splits the text into sentence-sized blocks. At the same time, an inverted index table recording sentence and document location information is built, which reduces repeated pairwise sentence comparisons and makes it possible to locate matching sentences.
In feature extraction, the text is segmented with a Tibetan automatic word segmentation method, TF-IDF is used to weight each word by its frequency, and the set of word-frequency vectors is built.
Next, the similarity between each block of the text under test and the blocks in the text library is computed, and these similarities measure the copy rate of the whole text.
Finally, test texts are run through the detector, and the results are compared and analyzed using precision and recall to evaluate the performance of the Tibetan text copy detection technology.
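
The steps above can be sketched end to end as follows. This is only an illustration under several assumptions that are not from the thesis: sentences are split on the shad mark (U+0F0D), syllables split on the tsheg mark (U+0F0B) stand in for real word segmentation, the IDF is a smoothed variant, and the threshold of 0.8 and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

SHAD = "\u0f0d"   # Tibetan mark shad, used here as a sentence delimiter
TSHEG = "\u0f0b"  # Tibetan mark tsheg, the syllable delimiter


def split_sentences(text):
    """Blocking: split on runs of shad (a simplification of boundary identification)."""
    parts = re.split(f"(?:{SHAD}\\s*)+", text)
    return [p.strip() for p in parts if p.strip()]


def syllables(sentence):
    """Stand-in for word segmentation: split a sentence into syllables."""
    return [s for s in sentence.split(TSHEG) if s]


def build_index(library_sentences):
    """Inverted index: token -> ids of the library sentences containing it."""
    index = defaultdict(set)
    for sid, sent in enumerate(library_sentences):
        for tok in set(syllables(sent)):
            index[tok].add(sid)
    return index


def tfidf(tokens, index, n_docs):
    """TF-IDF weights with a smoothed IDF (an assumed variant)."""
    counts = Counter(tokens)
    return {
        t: c * (math.log((1 + n_docs) / (1 + len(index.get(t, ())))) + 1.0)
        for t, c in counts.items()
    }


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def copy_rate(text, library_sentences, threshold=0.8):
    """Fraction of sentences in `text` whose best library match reaches `threshold`."""
    index = build_index(library_sentences)
    n = len(library_sentences)
    lib_vecs = [tfidf(syllables(s), index, n) for s in library_sentences]
    sentences = split_sentences(text)
    copied = 0
    for sent in sentences:
        toks = syllables(sent)
        # Only compare against library sentences sharing at least one token.
        candidates = set().union(*(index.get(t, set()) for t in toks))
        qvec = tfidf(toks, index, n)
        best = max((cosine(qvec, lib_vecs[c]) for c in candidates), default=0.0)
        if best >= threshold:
            copied += 1
    return copied / len(sentences) if sentences else 0.0


def precision_recall(predicted, actual):
    """Precision and recall of predicted copied-sentence ids against true ids."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```

The inverted index is what keeps the comparison cheap in this sketch: a sentence is only scored against library sentences that share a token with it, instead of against the whole library. A real system would replace `syllables` with a proper Tibetan word segmenter and would normalize the encoding to Unicode before this stage.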
Keywords/Search Tags: text copy detection, sentence boundary identification, automatic segmentation, word component decomposition, similarity calculation, inverted index