Chinese Information Extraction And The Method Of Summarization Generating Based On HowNet Semantic

Posted on: 2016-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
Full Text: PDF
GTID: 2348330542976170
Subject: Engineering
Abstract/Summary:
We live in an age of information explosion, and the sheer volume of redundant and varied documents is exhausting. To understand an article, a reader has to read it in full; faced with a large collection of documents, reading everything to find the relevant message is difficult and time-consuming. Information extraction and automatic summarization address this problem. As part of text-mining technology, they analyze the latent semantics of a text in order to extract the core information of the document. Information extraction techniques can also be carried over to information retrieval: when users enter keywords into a search engine, they receive short summaries of many results and can quickly find the news that interests them.

Summarization technology has been under development for nearly 50 years. In China, many universities and institutes are engaged in this research and have produced a substantial body of results. However, owing to the inherent characteristics of the Chinese language, existing techniques do not combine well with the semantic, pragmatic and grammatical features of Chinese when processing Chinese documents. This thesis therefore studies these semantic characteristics and generates abstracts of Chinese documents at the concept level, using the HowNet semantic dictionary.

Because a document is a linear sequence of sentences and a sentence is a linear sequence of words, document processing can be reduced to word processing. Document preprocessing is divided into three stages. First, the document is split into sentences (clauses) according to the punctuation of Chinese documents. Second, in line with the characteristics of the Chinese language, the text is segmented with the forward maximum matching (character-reduction) method using a segmentation word list. Third, because many words left after segmentation do not express the core theme of the document, stop words are removed using a stop-word list. After preprocessing, the Chinese document is reduced to a collection of meaningful words, on which traditional word-frequency statistics based on the surface form of the words are computed. Because some words differ in form but express the same meaning, the semantic similarity between words is then calculated through semantic analysis of the meaningful words, word frequencies are counted at the concept level, and the weight of each concept is computed at the same time. Sentence weights are calculated from the concept weights, and candidate summary sentences are extracted according to sentence weight. Finally, the candidates are integrated to obtain a readable, understandable and logically coherent summary.
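
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation; it only shows the sequence of steps under stated assumptions. The word list `vocab`, the stop-word set `stop_words`, and all function names are hypothetical. In particular, the hand-written `concept_of` mapping stands in for the HowNet-based similarity calculation, which the abstract does not specify in detail, and scoring a sentence as the plain sum of its concept weights is an assumed formula.

```python
import re
from collections import Counter


def split_sentences(text):
    """Stage 1: split a document into sentences on Chinese end punctuation."""
    parts = re.split(r"[。！？；]", text)
    return [p.strip() for p in parts if p.strip()]


def forward_max_match(sentence, vocab, max_len=4):
    """Stage 2: forward maximum matching segmentation.

    At each position, try the longest candidate first and drop one
    character at a time until the candidate is in the word list.
    A single character is always accepted as a fallback.
    """
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


def concept_weights(tokenized, concept_of):
    """Count frequencies per concept rather than per surface form.

    ASSUMPTION: `concept_of` is a hand-written word-to-concept map that
    stands in for a HowNet similarity computation.
    """
    counts = Counter()
    for words in tokenized:
        for word in words:
            counts[concept_of.get(word, word)] += 1
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def summarize(text, vocab, stop_words, concept_of, k=1):
    """Return the k highest-weighted sentences in document order."""
    sentences = split_sentences(text)
    # Stage 3: stop-word removal after segmentation.
    tokenized = [
        [w for w in forward_max_match(s, vocab) if w not in stop_words]
        for s in sentences
    ]
    weights = concept_weights(tokenized, concept_of)
    # ASSUMPTION: sentence weight = sum of the weights of its concepts.
    scores = [
        sum(weights[concept_of.get(w, w)] for w in words)
        for words in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy data, invented for illustration only.
vocab = {"计算机", "电脑", "处理", "文本", "今天", "天气", "很好"}
stop_words = {"的", "今天"}
concept_of = {"计算机": "computer", "电脑": "computer"}
text = "计算机处理文本。今天的天气很好。电脑处理文本。"

print(summarize(text, vocab, stop_words, concept_of, k=2))
# ['计算机处理文本', '电脑处理文本']
```

Because 计算机 and 电脑 are mapped to the same concept, both occurrences count toward one concept frequency, which is the effect the abstract attributes to concept-based statistics. With purely form-based counting, each word would be counted only once.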
Keywords/Search Tags: latent semantics, information extraction, semantic similarity, concept, HowNet dictionary