
Research On Chinese Word Segmentation On Legal Documents

Posted on: 2019-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yan
Full Text: PDF
GTID: 2428330545951219
Subject: Software engineering
Abstract/Summary:
Recently, as legal documents have been made publicly available, attention has turned to building big-data retrieval platforms that make complex cases easily accessible and provide convenient, accurate, and intelligent services. Chinese word segmentation is one of the basic tasks in building an intelligent retrieval system. The task is to segment the sentences in legal documents into a sequence of meaningful words, which makes it much easier for computers to understand the semantics of the documents. The performance of word segmentation therefore directly affects the efficiency and accuracy of the retrieval system. The purpose of the paper is to study Chinese word segmentation on legal documents, and the main aspects of the study are as follows.

First, the paper proposes an active learning approach to Chinese word segmentation on legal documents that focuses on reducing annotation effort. The main idea is to select informative characters by considering both the uncertainty and the redundancy of the samples. In addition, a local annotation strategy is proposed that selects substrings around the informative characters, rather than whole sentences, to further reduce the amount of annotation. Experimental results show that the proposed approach effectively reduces annotation cost. Specifically, at the same annotation scale, it achieves better Chinese word segmentation performance than the random selection strategy and the uncertainty strategy.

Second, the paper proposes a Chinese word segmentation approach based on ILP (Integer Linear Programming), which aims to improve segmentation accuracy by treating Chinese word segmentation on legal documents as a document-level optimization problem. The main idea is first to apply an LSTM (Long Short-Term Memory) model to perform character classification and obtain the initial segmentation results. Then, several kinds of global constraints that exploit the specific text structure of legal documents are introduced into the ILP formulation, such as the label transition constraint, the consistency constraint, and the text-specific constraint. Finally, these constraints are used to obtain the document-level optimized results. Empirical studies demonstrate that the proposed approach improves the performance of Chinese word segmentation on legal documents and obtains better results than sentence-level approaches.

Finally, the paper proposes a cross-domain approach to Chinese word segmentation on legal documents with joint learning, which leverages data from one domain (the source domain) to help another domain (the target domain), so as to improve segmentation accuracy in the target domain of legal documents. In the learning process, segmentation prediction on the target domain is treated as the main task, while segmentation prediction on the source domain is treated as the auxiliary task. A joint learning model is then proposed in which the main task and the auxiliary task share decision-making, and which employs an LSTM model to learn the auxiliary representation between the tasks. By incorporating the auxiliary representation, the performance of the main task is improved. Empirical studies show that the joint learning approach performs significantly better than the single-task learning approach. In particular, it effectively improves the performance of Chinese word segmentation on legal documents when only a few annotated samples are available.
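
The active learning idea described in the abstract (pick informative characters using both uncertainty and redundancy, then annotate only a substring around each one) can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, so everything here is an assumption made for illustration: uncertainty is measured as the entropy of a per-character label distribution, redundancy as how often the character's bigram context has already been selected, and the names `select_substrings`, `seen_contexts`, `window`, and `penalty` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy of a label distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_substrings(sentence, label_probs, seen_contexts, k=1, window=2, penalty=0.5):
    """Pick the k most informative characters and return windows around them.

    sentence      -- the raw character string
    label_probs   -- one label distribution per character (assumed model output)
    seen_contexts -- Counter of bigram contexts already sent for annotation
    Returns a list of (start, end, substring) tuples; end is exclusive.
    """
    scored = []
    for i, probs in enumerate(label_probs):
        context = sentence[max(0, i - 1):i + 1]
        # Assumed redundancy discount: contexts selected before score lower.
        score = entropy(probs) / (1.0 + penalty * seen_contexts[context])
        scored.append((score, i))
    # Highest score first; ties are broken by position in the sentence.
    scored.sort(key=lambda item: (-item[0], item[1]))

    picks = []
    for _, i in scored[:k]:
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        picks.append((start, end, sentence[start:end]))
        # Record the context so later selections treat it as redundant.
        seen_contexts[sentence[max(0, i - 1):i + 1]] += 1
    return picks


# Illustrative input: the model is unsure only about the character at index 4.
sentence = "被告人张某某犯盗窃罪"
label_probs = [[0.95, 0.05] for _ in sentence]
label_probs[4] = [0.5, 0.5]
seen = Counter()
picks = select_substrings(sentence, label_probs, seen, k=1, window=2)
```

Only the returned substrings would be handed to an annotator, which is where the saving over whole-sentence annotation comes from. The division-based discount is just one simple way to combine the two signals; the thesis may combine them differently.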
Keywords/Search Tags:Legal Documents, Chinese Word Segmentation, Active Learning, Integer Linear Programming, Joint Learning