Textual Similarity Detection

Posted on: 2022-06-19

Degree: Master

Type: Thesis

Country: China

Candidate: Y Wang

GTID: 2505306530465494

Subject: Foreign Linguistics and Applied Linguistics

Abstract/Summary:

At present, there are many legal disputes over the similarity of literary works both at home and abroad. Most research on the topic, however, is conducted in the field of computer science, and fewer scholars pay attention to linguistic methods, which are basic and essential for such language-centered cases. The case in which Guo Jingming’s “Meng” violated Zhuang Yu’s copyright has sparked public controversy. Based on both the principle of the simhash algorithm and Discourse Information Theory, this paper therefore analyzes the text similarity of the two novels in the case of Guo Jingming being sued for infringing Zhuang Yu’s copyright, aiming to provide a set of feasible methods for identifying text similarity effectively. The paper combines qualitative and quantitative analysis to compare the two novels at the lexical level and the discourse level, specifically analyzing the frequencies and the importance of the information knots WO, WA and WF, so as to select a more suitable method for analyzing the similarity of Chinese texts.

The findings are as follows. First, the simhash results show that the two novels are not similar at the lexical level, either across the whole text or within the similar plots. The linguistic analysis, however, finds that in both the whole text and the similar plots each novel has a very low percentage of hapax legomena (words that occur only once in a text), that the two novels share a comparatively high percentage of hapax legomena, and that their lexical richness is similar. Second, at the discourse level, in both the whole text and the pinpointed similar plots, the distribution of the protagonists in the two novels is similar, and WO, WA and WF are the most frequent information knots in the novel text. Based on the frequencies of WO, WA and WF, the author therefore conducts an independent-samples t-test, and the statistical results also indicate that the two novels are similar.

The innovation of this paper lies in combining computational linguistics and general linguistics to determine the similarity of two texts. The paper also provides a new research perspective for text similarity analysis and expands the application of Discourse Information Theory. Moreover, it demonstrates practical ways to test similarity, such as comparing the frequencies of the main characters and of WO, WA and WF, and it provides a set of linguistic features that are effective in similarity analysis. Owing to the length of the novels and time limitations, however, not all 15 Ws are annotated. Further study will therefore use all the information knots to test similarity.
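The two lexical-level measures named in the abstract, a simhash comparison and the share of hapax legomena common to both texts, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the 64-bit fingerprint, the use of MD5 to hash tokens, frequency weighting, and the Jaccard-style definition of the shared-hapax ratio are all assumptions made for illustration, and the input is assumed to be already segmented into word tokens (Chinese text would need a word segmenter first).

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a simhash fingerprint for a token list, weighting tokens by frequency."""
    totals = [0] * bits
    for token, weight in Counter(tokens).items():
        # Hash each distinct token to a stable integer (MD5 is an illustrative choice).
        raw = hashlib.md5(token.encode("utf-8")).digest()[: bits // 8]
        digest = int.from_bytes(raw, "big")
        for i in range(bits):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            totals[i] += weight if (digest >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits; a smaller distance suggests more similar texts."""
    return bin(a ^ b).count("1")


def hapaxes(tokens):
    """Set of words that occur exactly once in the token list."""
    return {word for word, count in Counter(tokens).items() if count == 1}


def shared_hapax_ratio(tokens_a, tokens_b):
    """Hapaxes common to both texts divided by all hapaxes in either text.

    This Jaccard-style definition is an assumption, not the thesis's formula.
    """
    hapax_a, hapax_b = hapaxes(tokens_a), hapaxes(tokens_b)
    union = hapax_a | hapax_b
    return len(hapax_a & hapax_b) / len(union) if union else 0.0


# Toy, pre-tokenized inputs (hypothetical; not taken from either novel).
text_a = "the moon rose over the quiet river".split()
text_b = "the moon sank behind the quiet hills".split()

print(hamming_distance(simhash(text_a), simhash(text_b)))
print(shared_hapax_ratio(text_a, text_b))  # 0.25: "moon" and "quiet" out of 8 hapaxes
```

The sketch also shows how the two measures can disagree, as the abstract reports: simhash compares frequency-weighted fingerprints of the whole vocabulary, while the hapax ratio looks only at words that are rare in both texts.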

Related items

1	Plagiarism Detection: Similarity Analysis Of Academic Discourse From Discourse Information Analysis And Appraisal Perspectives
2	Research And Design Of The English Essay Similarity Detection System For Chinese College Students
3	Research On Thangka Cultural Element Detection Based On Improved FCOS Algorithm
4	Understanding visual memory: Higher dimensional application of a summed similarity model, theoretical approach using signal detection theory, and neurophysiological measures of its intentional control
5	Research On Few-shot Object Detection Algorithm For Thangka Image
6	Research And Application Of Cultural Recognition Based On Edge Detection Algorithm
7	The Mechanism Of Key Heuristic Information Detection In Prototype Elicitation Paradigm
8	Similarity Representation And Difference In Age Basing On Similarity Information Activity
9	Research And Implementation Of Thangka Image Object Detection Algorithm
10	Research On Thangka Image Object Detection Algorithm Based On Deep Learning