Study On Methods And Their Applications Of Text Automatic Summarization And Information Extraction

Posted on: 2013-01-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Liu
Full Text: PDF
GTID: 1118330371972793
Subject: Computer application technology
Abstract/Summary:
With the continuous growth of text data, especially web information, quickly and automatically extracting the main or important information contained in large volumes of text has become a major research topic, which in turn has driven the rapid development of text information extraction technology. Text summarization can extract the discourse structure and main information of a text and automatically generate single-document or multi-document summaries, so it is regarded as a kind of information extraction technology. In the usual sense, information extraction technologies extract specific or important information that a text contains.

Working with Evidence-Based Medicine web pages and other types of training text, this dissertation focuses on methods for automatic text summarization and information extraction. To address unsatisfactory information extraction results, unclear topic segmentation, paragraph clustering algorithms that are sensitive to initialization, and the need to set the number of clusters manually, it proposes a series of novel methods and models.

(1) This dissertation puts forward an information extraction method that combines paragraph features with a hidden Markov model (HMM). The main difference from other information extraction methods is that it takes the paragraph sequence, rather than the word sequence, as the object of study. A paragraph is a unit of the text sequence saved from a web page after preprocessing. Every paragraph is converted into a special token, and these tokens are the observation symbols of the HMM. Experiments show that, in both precision and recall, extraction with paragraphs as the observed sequence outperforms extraction with words as the observed sequence.

(2) Paragraphs are represented in the Vector Space Model, and the text is segmented into semantic units by calculating the Mutual Independence between consecutive paragraphs. Because the segmentation depends on thresholds, a genetic algorithm is then used to optimize the parameters. Experimental results show that the method improves precision to some degree.

(3) This dissertation analyzes the main steps of spectral co-clustering of documents and words, identifies the cause of its sensitivity to initialization, and presents a modified spectral co-clustering method based on fuzzy K-harmonic means. The method consists of two steps. The first step constructs a matrix that is insensitive to initialization. The second step uses the fuzzy K-harmonic means algorithm instead of the K-means algorithm to obtain the clustering results. Fuzzy K-harmonic means uses a fuzzy weighted distance when calculating the distance between each data point and the cluster centers. Experiments show that the proposed method is insensitive to initialization and also improves the clustering results.

(4) This dissertation explores a morphology-based method for determining the number of clusters present in a given dataset and uses it to modify spectral co-clustering of documents and words. The method has three main steps. First, the input matrix generated by spectral co-clustering of documents and words is converted into a VAT gray image. Then, a sequence of image processing operations (gray morphology, image binarization, and distance transform) is applied to filter the VAT image. Finally, a signal is built from the filtered VAT image; after the signal is smoothed, the number of clusters is extracted from its major peaks and troughs. Experiments show that this method improves the clustering results of spectral co-clustering of documents and words.

(5) Based on the LDA topic model, this dissertation proposes the Titled-LDA algorithm for multi-document summarization, which fuses topic models. Given the strong indicative effect of titles in summarization, Titled-LDA builds separate topic models for the title and for the content of each document. In the fusion stage, the algorithm weights the different topic distributions in an adaptive, asymmetric learning manner based on two kinds of information entropy. In this way, the final model incorporates title information and content information appropriately, which improves summarization performance. Experiments show that the proposed algorithm achieves better performance than other state-of-the-art algorithms on the DUC datasets.
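
As an illustration of contribution (1), the idea of decoding a sequence of paragraph tokens with a hidden Markov model can be sketched as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the two states ("background", "target"), the four paragraph tokens (NAV, SHORT, LONG, NUMERIC), and every probability value are hypothetical and chosen only for illustration. The decoding step is the standard Viterbi algorithm, computed in log space.

```python
import math

# Hypothetical hidden states: the role a paragraph plays on the page.
STATES = ["background", "target"]

# Hypothetical model parameters. In practice they would be estimated from
# labelled training pages; the values here are made up for illustration.
START = {"background": 0.8, "target": 0.2}
TRANS = {
    "background": {"background": 0.7, "target": 0.3},
    "target": {"background": 0.4, "target": 0.6},
}
# Emission probabilities over hypothetical paragraph tokens
# (the observation symbols that paragraphs are converted into).
EMIT = {
    "background": {"NAV": 0.5, "SHORT": 0.3, "LONG": 0.15, "NUMERIC": 0.05},
    "target": {"NAV": 0.05, "SHORT": 0.15, "LONG": 0.4, "NUMERIC": 0.4},
}


def viterbi(tokens):
    """Return the most likely state sequence for a list of paragraph tokens."""
    if not tokens:
        return []
    # score[s] is the best log-probability of any path that ends in state s.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][tokens[0]]) for s in STATES
    }
    backpointers = []
    for tok in tokens[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            best_prev = max(
                STATES, key=lambda p: score[p] + math.log(TRANS[p][s])
            )
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][tok])
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these illustrative parameters, viterbi(["NAV", "LONG", "NUMERIC", "NAV"]) labels the two middle paragraphs as "target" and the first and last as "background". Because each observation is a whole paragraph rather than a word, the observed sequence for a page is short, and each state label marks an entire paragraph for extraction.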
Keywords/Search Tags: Text Mining, Automatic Summarization, Information Extraction, Spectral Clustering, Topic Model