With the spread of the Internet and the development of multimedia technology, digital comics have gradually become the mainstream of the comic industry, and pirated comics have emerged alongside them. Infringement of digital comic works appears mainly in two forms, text and image; however, no technical means currently exist to identify and certify textual infringement. Therefore, building on existing research on text copyright identification, this thesis proposes internal and external copyright identification methods based on the linguistic characteristics of comic text. First, to address the shortage of comic corpora, an original comic corpus and a pirated comic corpus are constructed and named the MH100 comic text dataset. The corpus is obtained by applying text recognition to collected comic works from four different categories; it is then split into chapters and preprocessed to complete the original comic corpus. The MH100 dataset provides data support for subsequent copyright identification research and addresses the current lack of available data. Second, an internal copyright identification method for comic text based on style features is proposed. When no reference corpus is available, the text of a work is analyzed at the lexical, structural, and syntactic levels, and text features characteristic of comics are extracted. The offset between the features of a single chapter and the average of the features over all chapters of the same work is taken as the comic text style feature, and an SVM classifier then identifies abnormal chapters in order to determine the copyright owner of the work. Finally, an external copyright identification method for comic text based on SBERT and Doc2Vec is proposed. The method relies on a comic corpus and performs text matching between suspicious documents and the documents in that corpus. Combining the semantic
features extracted by the SBERT model and the document features extracted by the Doc2Vec model, the method constructs a new comic text representation vector; the feature vectors of the suspicious document and the target document are concatenated and fed into a two-layer fully connected network, which performs copyright identification of comic text based on the similarity between documents. In the internal copyright identification experiments, style features that fuse the three feature types identify abnormal chapters more effectively, and the accuracy of the SVM classifier is 0.7%, 0.6%, and 3.9% higher than that of logistic regression, random forest, and a multilayer perceptron, respectively. In the external copyright identification experiments, compared with the existing text matching models ABCNN, RE2, ESIM, and BiMPM, the accuracy of the proposed method is higher by 11.7%, 11.1%, 1.3%, and 2.7%, respectively, on the Com50 dataset, and by 6.9%, 3.6%, 0.8%, and 3.7%, respectively. The experimental results show that the proposed copyright identification methods can identify abnormal chapters and determine the similarity between texts of comic works with or without a corpus, and that the constructed comic text features can represent comic texts of different categories and plagiarism ratios.
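As a rough illustration of the two feature constructions described above, the following Python sketch computes the chapter-level style offset and assembles the stitched input for a document pair. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names `style_offsets` and `pair_input` are hypothetical, the feature values are assumed to be precomputed numeric lists, and the SVM classifier, the SBERT and Doc2Vec models, and the fully connected network are all left out.

```python
from statistics import fmean


def style_offsets(chapter_features):
    """Return each chapter's feature vector minus the mean over all chapters.

    chapter_features: list of equal-length numeric lists, one per chapter
    of a single work (assumed to hold precomputed lexical, structural,
    and syntactic feature values).
    """
    if not chapter_features:
        return []
    dims = len(chapter_features[0])
    # Work-level average of every feature dimension.
    mean = [fmean(ch[d] for ch in chapter_features) for d in range(dims)]
    # Offset of each chapter from that average.
    return [[ch[d] - mean[d] for d in range(dims)] for ch in chapter_features]


def pair_input(sbert_a, doc2vec_a, sbert_b, doc2vec_b):
    """Build the stitched input vector for a suspicious/target document pair.

    Each document is represented by its semantic vector followed by its
    document vector; the two representations are then concatenated.
    The vectors here are placeholders for real model outputs.
    """
    rep_a = list(sbert_a) + list(doc2vec_a)
    rep_b = list(sbert_b) + list(doc2vec_b)
    return rep_a + rep_b


if __name__ == "__main__":
    offsets = style_offsets([[1.0, 4.0], [3.0, 8.0]])
    print(offsets)
    print(pair_input([1, 2], [3], [4, 5], [6]))
```

In this sketch, the offset vectors would serve as the input to a classifier that flags abnormal chapters, and the stitched pair vector would serve as the input to the similarity network; both downstream models are assumptions about how the pieces connect rather than details given in the abstract.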