Research On Text Clustering Algorithm And Its Application In Law Enforcement Scene

Posted on: 2023-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: M N Liang
GTID: 2556306911485804
Subject: Applied Statistics
Abstract/Summary:
As law-based governance advances, the concept of the rule of law has taken root among the public, and the number of cases accepted by law enforcement agencies has increased year by year. A record is important text data in which public security officers, in the course of law enforcement, document the personal information and case descriptions given by suspects, victims, and other people involved in a case. Each accepted case is associated with a large number of records. The case descriptions given by different people in the same case overlap, and the same suspect may commit the same illegal acts in different places. These phenomena cause heavy data redundancy in the law enforcement database and make similar cases difficult to mine, which hinders the police in merging cases and reduces the efficiency of law enforcement.

To reduce redundancy in record data and to study similar cases, this thesis uses text clustering as its technical method. First, a personal information index and a case information index are established according to the contents of the records. The personal information index includes name, age, sex, ID number, bank card number, and mobile phone number. The case information index includes location, institution name, time, means of committing the crime, and items involved in the case. Then, Named Entity Recognition is used to extract the person and case features from the records. Finally, text similarity measures are constructed and improved clustering algorithms are proposed. The main improvements are summarized as follows.

(1) At the stage of extracting personal features, the IDCNN-CRF model recognizes short entities such as gender and age poorly. This thesis addresses the problem with an IDCNN-CRF model parallel to an Attention Mechanism. Features are extracted simultaneously by the IDCNN and the Attention Mechanism, and their outputs are concatenated column-wise to form the final text features, which are fed to the CRF module for entity label prediction. Without compromising the parallel computing power of the IDCNN network, the Attention Mechanism supplements the local features and the interaction information between inputs. A comparative experiment on the transcript data set shows that the improved model recognizes short entities better than the benchmark model.

(2) At the stage of extracting case features, the IDCNN-CRF model does not identify the boundaries of long entities such as time and institution name well. This thesis proposes an IDCNN-CRF model parallel to a sequence network. While the IDCNN network extracts local features, a sequence network consisting of BiLSTM-Mul Attention supplements the effective information on long-term dependencies. The two outputs are concatenated column-wise and fed to the CRF as the final text features for entity label prediction. A comparative experiment shows that the improved model recognizes long entities better than the baseline model.

(3) A method for measuring similarity between texts is constructed, and on this basis an improved density clustering algorithm, called HC-DBSCAN, is proposed that uses only the sample similarity matrix. To improve clustering accuracy, the clustering evidence from similarity-based hierarchical clustering is introduced into the density clustering, where it provides prior knowledge of the density clustering input parameters. An evaluation based on the Silhouette Coefficient verifies the effectiveness of HC-DBSCAN.
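The idea in (3) can be illustrated with a minimal sketch. The abstract does not say how the hierarchical clustering evidence is turned into density clustering parameters, so the rule used here is an assumption made for illustration only: the neighborhood radius `eps` is set to the single-linkage merge height just below the largest jump in merge heights. The function names (`single_linkage_heights`, `eps_from_hierarchy`, `dbscan_precomputed`), the `1 - similarity` distance, and the toy similarity matrix are all hypothetical and are not taken from the thesis.

```python
def single_linkage_heights(dist):
    """Sorted single-linkage merge heights, i.e. the edge weights of a
    minimum spanning tree over the distance matrix (Prim's algorithm)."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n
    best[0] = 0.0
    heights = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            heights.append(best[u])
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return sorted(heights)


def eps_from_hierarchy(dist):
    """ASSUMED heuristic: pick eps as the merge height just below the
    largest gap between consecutive merge heights."""
    heights = single_linkage_heights(dist)
    if len(heights) < 2:
        return heights[0] if heights else 0.0
    _, i = max((heights[k + 1] - heights[k], k) for k in range(len(heights) - 1))
    return heights[i]


def dbscan_precomputed(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; label -1 means noise."""
    n = len(dist)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Toy similarity matrix for six records (made-up values).
sim = [
    [1.00, 0.90, 0.80, 0.20, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.20, 0.20, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.20, 0.10],
    [0.20, 0.20, 0.20, 1.00, 0.85, 0.10],
    [0.20, 0.20, 0.20, 0.85, 1.00, 0.10],
    [0.10, 0.10, 0.10, 0.10, 0.10, 1.00],
]
dist = [[1.0 - s for s in row] for row in sim]
eps = eps_from_hierarchy(dist)
print(dbscan_precomputed(dist, eps, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

In this toy case the hierarchy supplies `eps` without manual tuning: records 0 to 2 and records 3 to 4 form two clusters, and record 5 is left as noise. Other ways of reading parameters off the dendrogram are equally plausible.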
Keywords/Search Tags: Text Clustering, Record, Named Entity Recognition, Density Clustering