Research On Key Technology Of Text OLAP

Posted on: 2013-01-28  Degree: Master  Type: Thesis
Country: China  Candidate: C Zhang  Full Text: PDF
GTID: 2248330374483119  Subject: E-commerce and information technology
Abstract/Summary:
With the widespread use of business intelligence systems, the data warehouse, the core of business intelligence, has been widely adopted to supply data to decision support systems. Online Analytical Processing (OLAP) is an effective data analysis tool that supports analysis, decision making, and forecasting. An OLAP system is built on a data warehouse and presents multidimensional data flexibly at different granularities.

In general, data exists in two forms: structured and unstructured. Structured data is normally stored as relations in a relational database, while unstructured data resides mainly in text documents. According to statistics, only 20% of data is structured and can be used directly for OLAP; the other 80% is unstructured, mostly text in files, and cannot be analyzed directly. The growth of unstructured data in commercial systems and on the Internet makes it increasingly desirable to extend traditional OLAP to analyze both structured and unstructured data.

Text OLAP has become a hot topic in database research. Much work has been done and several text OLAP prototypes have been developed, such as MCX and Topic Cube. These methods have their respective advantages and disadvantages, but in general they are based on text mining, information retrieval, and information extraction.

After summarizing the meaning of text OLAP and the existing research methods, we propose a new framework that combines OLAP and text mining over a B2C site database. Compared with earlier text OLAP methods, the framework uses information extraction and text mining to perform multidimensional analysis of unstructured data. We build a text dimension from text fields and mine topics with a topic model. We take topics and summaries as measures instead of traditional numeric measures, which makes them easy to comprehend. The main contributions of this thesis are as follows:

1. We propose a semi-supervised dimension extraction algorithm. Taking a previously defined dimension hierarchy and a small number of dimension members as seeds, it discovers and extracts new dimension members from product descriptions and customer reviews in order to extend the text dimension. The algorithm treats extraction as a classification problem: it classifies the words in the text into four classes (dimension, member, unsigned, none) and then finds the relevance between dimensions and members. The resulting text dimension can be analyzed like static dimensions.
2. We propose a novel way of integrating measures using customer reviews. We use the LDA (Latent Dirichlet Allocation) model to integrate objective, well-structured expert reviews with subjective customer reviews. The expert reviews are obtained from Wikipedia. The measures are presented as topics and summaries, which replaces numeric measures with a form that analysts can easily understand.
3. We validate the efficiency of the dimension extraction algorithm and the review integration algorithm through experiments.
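
The idea of a textual measure in place of a numeric one can be illustrated with a small sketch. It is not the thesis's method: plain term counts stand in for the LDA topic model, and the review records, the dimension names (`brand`, `category`), the stopword list, and the function `text_cube` are all hypothetical, chosen only to show how a roll-up over dimensions might return words rather than a number for each cell.

```python
from collections import Counter, defaultdict

# Hypothetical review records: (brand, category, review text).
REVIEWS = [
    ("Acme", "phone", "battery life is great and the screen is bright"),
    ("Acme", "phone", "battery drains fast but the screen is sharp"),
    ("Acme", "laptop", "keyboard feels solid and the battery lasts"),
    ("Bolt", "phone", "camera is blurry and the camera app crashes"),
]

# Assumed dimension order of the first two fields of each record.
DIMENSIONS = ("brand", "category")
STOPWORDS = {"is", "and", "the", "but", "a", "an", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def text_cube(records, group_by, top_n=2):
    """Group records by the chosen dimensions and return, per cell,
    the top_n most frequent terms as a textual measure.

    Term frequency is a simplified stand-in for a topic model here.
    """
    idx = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        cells[key].update(tokenize(rec[2]))
    result = {}
    for key, counts in cells.items():
        # Sort by count (descending), then alphabetically, so ties are stable.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[key] = [term for term, _ in ranked[:top_n]]
    return result


if __name__ == "__main__":
    # Roll up to the brand level, then drill down to brand x category.
    print(text_cube(REVIEWS, ("brand",)))
    print(text_cube(REVIEWS, ("brand", "category")))
```

Changing `group_by` plays the role of roll-up and drill-down: the same reviews are regrouped at a coarser or finer granularity, and the measure of each cell is recomputed from the text that falls into it.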
Keywords/Search Tags: OLAP, Text Mining, Comments Integration