As the construction of smart courts is being vigorously promoted, case retrieval has become a core component supporting judicial trials, and it is of great significance for promoting judicial fairness and avoiding inconsistent judgments in similar cases. Because the judgment documents in a case file best reflect the core content of the case, case retrieval can be framed as the study of similarity between judgment documents. Compared with traditional text semantic similarity tasks, judgment documents are long, highly specialized, and difficult to label. At present, the most advanced large language model (GPT-4) achieves excellent performance on general tasks but is less capable in specialized professional domains. Based on the semantics of judgment documents and the characteristics of evidence images, this dissertation relies on a national intelligent judicial research project to research and develop court-oriented case retrieval, and realizes case retrieval based on the semantic similarity of documents and on visual information. The research results have been deployed in a court as a sub-module of the project system, which demonstrates their practical value.

The main research contents and innovations of this dissertation are as follows:

(1) The high threshold for labeling judgment documents leads to a scarcity of high-quality training data, and the wide variety of case types makes training difficult. To address these problems, this dissertation designs a semantic similarity model based on data augmentation for low-resource scenarios. It constructs prompt-learning templates and data augmentation methods for judgment documents and combines them with contrastive learning to improve training with a small number of labeled documents; at the model level, it combines BERT with TextGCN to further improve accuracy. On the
CAIL2019-SCM dataset, this dissertation simulates the situation of having only a small number of labeled judgment documents for other case types by splitting the data into sub-datasets that are trained separately, and demonstrates the effectiveness of the algorithm through experiments. The experiments also show that a small amount of augmented data can achieve results comparable to training on a large amount of data, which provides a feasible approach to model training for unlabeled case types.

(2) Because judgment documents are highly specialized, textually similar documents may describe dissimilar cases, so the model fails to fit the important features. To solve this problem, this dissertation designs a semantic similarity model that introduces human-designed prior knowledge. Building on existing techniques, it first constructs keyword combinations and a case-related domain lexicon for the similar-case dataset. From the resulting domain knowledge base, the TF-IDF values of the domain terms are computed and used to weight the corresponding document vector representations. A PMI similarity matrix between words is used to optimize the attention layer and guide the semantic encoding of BERT, further improving on the data-augmentation-based model described above. Finally, the effectiveness of the method is verified on the LeCaRD dataset.

(3) Regarding the materials in the electronic case file, this dissertation finds that although the text materials other than the judgment document contain detailed information, they are too long and contain too much interfering information to be suitable for training deep learning models, whereas the evidence images are few in number and representative. Therefore, on top of the semantic similarity of judgment documents, this dissertation introduces an image classification model to classify and count the items in the chain of evidence, and
analyzes the similarity of the key evidence images in order to further select the most similar cases from the candidate pool of similar cases. For this purpose, 620 electronic case files from a provincial court were collected, and legal professionals labeled the similarity of the data in these files to construct a similar-case dataset based on electronic case files; the effectiveness of the method was verified on this dataset. In addition, according to practical application requirements, this dissertation uses existing OCR text recognition and named entity recognition technology to reduce manual operations and improve the efficiency of case retrieval applications.
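
To make the two statistics named in contribution (2) concrete, the following is a minimal sketch of how TF-IDF term weights and a word-level PMI table might be computed over tokenized documents. It is an illustration only, not the dissertation's implementation: the function names `tfidf_weights` and `pmi_matrix`, the toy English tokens, the unsmoothed TF-IDF formula, and the use of whole-document co-occurrence (rather than a sliding window) are all assumptions, and the step that injects these values into the BERT attention layer is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenized, non-empty document.

    Assumed formula: tf = count / doc length, idf = log(N / df), no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def pmi_matrix(docs):
    """Return {(word_a, word_b): PMI} for word pairs that co-occur.

    Assumption: two words co-occur when they appear in the same document.
    Keys are alphabetically ordered pairs; pairs that never co-occur are absent.
    """
    n_docs = len(docs)
    word_count = Counter()
    pair_count = Counter()
    for doc in docs:
        vocab = sorted(set(doc))
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    pmi = {}
    for (a, b), n_ab in pair_count.items():
        p_ab = n_ab / n_docs
        p_a = word_count[a] / n_docs
        p_b = word_count[b] / n_docs
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy corpus standing in for tokenized judgment documents.
docs = [
    ["loan", "contract", "interest"],
    ["loan", "interest", "default"],
    ["theft", "vehicle", "night"],
    ["contract", "breach", "damages"],
]

weights = tfidf_weights(docs)
pmi = pmi_matrix(docs)
```

In this toy corpus, a term that appears in only one document (such as "theft") receives a higher weight than one shared by several documents (such as "loan"), and word pairs that co-occur more often than chance receive a positive PMI. How such values are used to weight document vectors or to adjust attention scores depends on the model design described in the dissertation itself.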