
Criminal Case Text Information Extraction Research

Posted on: 2012-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: H W Chen
GTID: 2218330338474885
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Information extraction is a research field that aims to acquire useful information efficiently from the immense amount of information available in this era of information explosion. So far it has been applied successfully in many fields, such as medicine, economics, and library science, but it has hardly been studied in the field of public security.

The volume of case information is currently increasing dramatically. Despite the progress of office informatization in public security departments, a fairly large amount of case information still exists as free text. Information extraction technology is needed to extract structured information from this text and store it in a database for subsequent data mining research.

Taking criminal case text as its object, this thesis studies information extraction on the basis of an analysis of the features of case text. The work covers three aspects: named entity recognition, construction of a criminal case frame system, and information extraction for atomic events in cases. Because of the specificity of the field, we mainly adopt knowledge-table-assisted machine learning and choose Conditional Random Fields (CRF) as the statistical model.

First, named entity recognition is regarded as the basis of information extraction. According to the actual needs of the public security field, 13 named entities are defined: name, sex, age, native place, address, case title, currency amount, time, location, organization, way, frequency, and number of people. In line with the textual features of this field, the "Criminal Case Text Word List" is created to help recognize entities or locate entity boundaries rapidly. Then, based on the text features, the entity recognition task is divided into two levels: the 12 basic entities are recognized at the first level, and the case title entity is identified at the second.

Second, guided by Frame Theory, the thesis sets up a frame system for criminal case text. It divides the text into a basic information module and an event information module, and divides events into a variety of atomic events, providing data structure support for the structured representation of case text information.

Third, the information extraction of atomic events is accomplished in two phases: event type recognition and event element recognition. Three types of atomic events are defined as research objects: the solved-case event, the capture event, and the case-reporting event. The "Trigger Words-Event Type Reference Table" is obtained by manual extraction and by expansion with the "HIT IR-Lab Tongyici Cilin (Extended)". The case text is then filtered with this table to obtain a candidate event set, which assists the CRF model in event type recognition. We set up a template for each type of atomic event and train classifiers to carry out event element recognition.

In addition, based on these three aspects, the thesis develops a prototype system for criminal case text information extraction. The system takes free-form criminal case text as input, produces structured information as output, and retains intermediate results for future research and improvement.
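The table-based filtering step that produces the candidate event set can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the English trigger words, the event type labels, and the names `TRIGGER_TABLE`, `CandidateEvent`, and `candidate_events` are hypothetical, and the abstract does not say how the thesis matches triggers against Chinese case text.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the "Trigger Words-Event Type Reference Table".
# The entries are invented English examples, not taken from the thesis.
TRIGGER_TABLE = {
    "solved": "solved_case",
    "cracked": "solved_case",
    "captured": "capture",
    "arrested": "capture",
    "reported": "report_case",
}


@dataclass(frozen=True)
class CandidateEvent:
    sentence: str    # sentence in which the trigger word was found
    trigger: str     # matched trigger word
    event_type: str  # event type the table assigns to that trigger


def candidate_events(sentences, table=TRIGGER_TABLE):
    """Return one candidate event per trigger word found in each sentence."""
    candidates = []
    for sentence in sentences:
        # Naive whitespace tokenization; real Chinese text would need
        # word segmentation first (assumption, not shown here).
        tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
        for token in tokens:
            if token in table:
                candidates.append(CandidateEvent(sentence, token, table[token]))
    return candidates


if __name__ == "__main__":
    text = [
        "The victim reported the theft on Monday.",
        "Police arrested the suspect, and the case was solved.",
        "The weather was cold.",
    ]
    for event in candidate_events(text):
        print(event.event_type, "<-", event.trigger)
```

In a pipeline like the one the abstract describes, such candidates would only narrow the input; deciding the final event type is left to the CRF model, which this sketch does not cover.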
Keywords/Search Tags: Case Text, Information Extraction, Entity Recognition, Event Type Recognition, Event Argument Recognition