Research And System Implementation Of Multiple Key Text Mining Technologies

Posted on: 2018-11-29 | Degree: Master | Type: Thesis
Country: China | Candidate: T Y Qin | Full Text: PDF
GTID: 2348330536481910 | Subject: Computer Science and Technology
Abstract/Summary:
Text information mining covers the information extraction, semantic analysis, classification and tagging, and relation analysis that computers perform on digital text. It can extract useful information, and even knowledge, from that text. The growth of the Internet and the informatization of many industries provide rich corpus resources for text information mining, while also demanding marked improvements in the accuracy, effectiveness, computational efficiency and personalization of text mining systems. Text information mining must extract valuable information from plain text; important tasks include extracting new words of a specific semantic type, event type classification, event element identification, and summarization of single or multiple documents.

This thesis studies and implements solutions to several core problems of text information mining: new word extraction for a specific semantic type, event type classification and event element identification on the ACE 2005 event corpus, and automatic summarization of single or multiple documents. All three tasks achieve good results in our experiments.

For new word extraction of a specific semantic type, manually tagging new words in a plain corpus is too costly. On the premise that new words of the same type share similar context information, we designed a bootstrapping approach to new word extraction based on soft patterns. New words are divided into several parts according to their semantic features, and sentences are separated into context slots by the parts of each candidate new word. Using statistics on how tagged new words and candidate new words occur in these slots, we score every candidate and add the highest-scoring candidates to the set of tagged new words. In experiments on electronic medical records, where symptom new words are divided into <part,description>, the method reached an F-value of 81.40% after iterations.

For event type classification and event element identification on the ACE 2005 corpus, this thesis improves on the SVM-based methods of other researchers. For event type classification, we add features related to candidate trigger words, along with information about the trigger events in the same sentence, so that the SVM model captures contextual trigger information. We also improve text pre-processing. Using the Chinese and English ACE 2005 corpora together with their entity, time and value tags, we evaluated both tasks: the F-value reached 64.2% for event type classification and 63.7% for event element identification.

For automatic summarization, this thesis combines the TextRank algorithm with clustering methods. We set the edge weights of the TextRank graph model using the BM25 algorithm and various sentence similarity measures, and try to reduce redundancy in the generated summaries. Experiments on the DUC2001 and DUC2002 corpora gave good results.
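The summarization step described above, TextRank with BM25 edge weights, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`bm25_score`, `textrank_bm25`, `summarize`), the BM25 parameters `k1` and `b`, the damping factor and the fixed iteration count are all assumptions chosen for illustration, and the clustering and redundancy-reduction steps mentioned in the abstract are omitted. Sentences are assumed to be already tokenized.

```python
import math
from collections import Counter


def bm25_score(query, doc, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """BM25 score of tokenized sentence `doc` for tokenized sentence `query`.

    k1 and b are common default values, assumed here (not from the thesis).
    """
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if term not in tf:
            continue
        df = doc_freq[term]
        # Smoothed IDF that stays positive even for very frequent terms.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score


def textrank_bm25(sentences, damping=0.85, iterations=50):
    """Return one TextRank score per sentence, using BM25 as edge weight."""
    n = len(sentences)
    if n == 0:
        return []
    # Each sentence is treated as a "document" for the BM25 statistics.
    doc_freq = Counter(term for s in sentences for term in set(s))
    avg_len = sum(len(s) for s in sentences) / n
    # weights[i][j] is the weight of the edge from sentence i to sentence j.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = bm25_score(
                    sentences[i], sentences[j], doc_freq, n, avg_len
                )
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Sum the rank flowing into j, normalized by each source's out-weight.
            rank = sum(
                weights[i][j] / out_sum[i] * scores[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, k=2):
    """Return the indices of the k top-ranked sentences, in document order."""
    scores = textrank_bm25(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Calling `summarize(tokenized_sentences, k=3)` returns the positions of the three highest-ranked sentences. In this sketch, a sentence that shares no terms with any other sentence has no edges and receives only the baseline score, so it is unlikely to be selected. Swapping `bm25_score` for another sentence similarity function would correspond to the "various sentence similarity measures" the abstract mentions.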
Keywords/Search Tags:Text Mining, New Words Extraction, Event Classification, Event Element Identification, Document Summarization