| As China comprehensively promotes the rule of law and vigorously pursues its strategy of building a strong network nation, more and more laws, regulations, legal documents, and related information are being published openly on the Internet. As information retrieval systems, search engines make it far more convenient for Internet users to find legal data, and they are a valuable subject of research. With the advent of informatization and big data, the number of web pages containing legal texts has grown exponentially, which raises the problem of how a search engine can obtain valuable legal information quickly and accurately. Unlike professional legal practitioners, ordinary users may be unable to pick out the most useful information from the documents a search engine returns for a legal query. In response to these problems, this thesis studies the core technology of search engines for the legal field, focusing on inverted index construction and its application to legal case retrieval. The main research work is as follows. First, to improve inverted index construction performance, this thesis addresses the problem that the classical fast inversion algorithm (FAST-INV) cannot build inverted indexes quickly on large-scale legal data, and proposes two new inverted index construction algorithms: FASTER-INV and AC-INV. FAST-INV relies on four information documents, some of which are redundant; FASTER-INV removes two of them when building the inverted index, cutting redundant information while reducing memory cost. AC-INV goes further by merging the construction of <Doc_ID,Term_ID> pairs with the inverted indexing step. AC-INV saves a significant amount of memory while preserving the integrity of the information, eliminates the time spent constructing information documents, and improves the algorithm's scalability. Extensive experiments were conducted on the Chinese AI and Law challenge dataset (CAIL2018). The results show that both algorithms yield significant improvements: FASTER-INV and AC-INV achieve speedups of 1.11 to 1.14 times and 1.33 to 1.42 times, and save 10% and 35% of memory, respectively. Second, to improve the performance of legal case retrieval in search engines, this thesis proposes BM25-RoBERTa, a legal case retrieval method based on BM25 and RoBERTa. The method first uses an inverted index and the BM25 ranking algorithm to quickly recall and rank legal cases for a query. The long legal texts recalled in this first stage are then encoded with a RoBERTa-based Paragraph Aggregation Architecture, which learns the semantic relationship between legal texts and computes a similarity score between the query case and each case in the candidate set. To improve the precision of the retrieval model, the crime in each legal case is also fed into the RoBERTa model to obtain a crime prediction score. The crime score and the case content score are combined in a weighted sum to give the final score, and the candidate case set is re-ranked by this score. Extensive comparative experiments were conducted on the Chinese Legal Case Retrieval Dataset (LeCaRD). The results show that the proposed legal case retrieval model performs well: its mean average precision reaches 57.8%, which is 19.66%, 3.95%, and 1.58% better than BM25, BERT, and RoBERTa, respectively. |
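
The two-stage retrieval flow summarized in the abstract (inverted index, BM25 recall, then a weighted sum of two scores) can be illustrated with a minimal sketch. It makes several assumptions and is not the thesis's implementation: the index is a plain in-memory dictionary rather than FASTER-INV or AC-INV, the BM25 formula is the commonly used variant with illustrative parameters `k1` and `b`, the RoBERTa content and crime scores are replaced by placeholder numbers, and the function names and the weight `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build term -> {doc_id: term_frequency} in one pass over tokenized docs.

    Generic in-memory index for illustration only (not FASTER-INV or AC-INV).
    """
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, tokens in docs.items():
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len


def bm25_scores(query, index, doc_len, k1=1.5, b=0.75):
    """Score every document that shares at least one term with the query."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in set(query):
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Smoothed inverse document frequency (always positive).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length-normalized term-frequency saturation.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return dict(scores)


def fuse(content_score, crime_score, alpha=0.7):
    """Weighted sum of the two scores; alpha is a hypothetical weight."""
    return alpha * content_score + (1 - alpha) * crime_score


# Toy corpus of pre-tokenized cases (made-up data).
docs = {
    1: ["theft", "property", "night"],
    2: ["fraud", "contract"],
    3: ["theft", "theft", "vehicle"],
}
index, doc_len = build_inverted_index(docs)
recalled = bm25_scores(["theft"], index, doc_len)

# Placeholder values standing in for model-produced content and crime scores.
model_scores = {1: (0.62, 0.90), 3: (0.71, 0.40)}
final = {d: fuse(*model_scores[d]) for d in recalled}
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)
```

In this sketch the first stage only narrows the candidate set; the final order depends entirely on the fused score, so the choice of `alpha` decides how much the crime score can override content similarity.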