
Mining Patterns And Trends In Data Stream

Posted on: 2020-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Tubagus Mohammad Akhriza
GTID: 1368330623463953
Subject: Communication and Information System
Abstract/Summary:
Living in the information age, people today spend much of their time with gadgets to do almost everything: dealing with online transactions, updating social media status, commenting on someone's status or a company's products, sharing pictures and videos, applying for jobs via web-based job search engines, and so on. As a result, data, and hence information, are continuously produced, streamed, and received in large volumes and at high velocity by almost every digital device and sensor across the globe. Knowledge discovery from data, including mining patterns and their trends, therefore becomes an important task for data stakeholders. The challenge grows when the task is to effectively and immediately find the patterns' changing trends over dynamic, incrementally streaming data.

One kind of data that changes and streams every day is job advertisement data, especially adverts for information and communication technology (ICT) jobs. The sets of skills (skillsets) required by ICT industries change rapidly and massively along with the fast growth of research and development in this sector, and this has attracted many researchers to process job advert data with data mining methods in order to discover information and deliver it to ICT skillset stakeholders, such as the management of higher education institutions (HEIs), students, and professionals. For an HEI, finding the skillsets required by industry is an important task that should be performed periodically, since the skillsets are used to improve curricula. Literature studies conclude that it is difficult for HEI curricula management to keep pace with the changing skillset requirements of ICT industries. This situation creates a gap between the skillsets taught to students and the skillsets required by industry. Several methods, both straightforward and non-straightforward, have been proposed to solve this problem. The straightforward, manual approach, e.g., inviting industry management to the HEI and asking them directly which skillsets they require, is not an effective solution, since it is costly and based on only a small sample. Among the non-straightforward methods, data mining approaches can be applied, with job adverts as the main source of information about the skillsets required by industry.

We see open opportunities to solve several problems that remain in state-of-the-art methods working on job adverts. In particular, we use the concepts of frequent pattern (FP) and emerging pattern (EP) mining as the solution for all identified problems, since such patterns represent combinations of skills that dominate the job adverts and whose requirement is emerging in industry today.

The first problem concerns the clustering method used to cluster job titles and the required skillsets. Agglomerative and k-means methods are used, with the support vector model (SVM) as their basis. Because they are based on vectors of skill terms whose elements are numbers, the clustering results cannot directly provide a description, which is important for cluster analysis. Alternatively, FP-based clustering can be used: frequent termsets (FTs) serve both as clustering candidates and as cluster descriptions. FTs are mined from the dataset using a user-given minimum support threshold (minsupp); a termset is frequent if its support exceeds minsupp. However, traditional FP clustering uses only the most frequent but short termsets as descriptions, which makes the descriptions themselves nearly meaningless; on the other hand, mining long termsets is hard because the number of FTs in the collection explodes. Our solution to this problem is an alternative concept for mining FTs, called frequent contextual termsets (FCTs), with which some long FTs can be obtained while keeping the number of FTs acceptable. Two algorithms for clustering job skillset data are proposed.

The second problem also stems from the method used by current research to cluster job titles. The skillsets required for a job are derived only from vector distances, and the skills included in the calculation are those with a frequency of at least 10% in the whole dataset; there is no further inspection of whether a combination of those skills also reaches 10% frequency. This contradicts the concept of FP: although two skills may each be frequent at 10%, their combination is not always frequent. Additionally, the skillsets resulting from the clustering cannot answer which combination of skills dominates the job adverts. The method is also performed statically: job adverts are collected for one to two years (or more) before being processed. Consequently, the resulting statistics of job titles and skillsets may be out of date by the time they are delivered to academia or professionals, given the rapid change of skillset requirements in industry. A further problem is that although the skillset required for a job is obtained, the magnitude of the gap between a student's skillset and the industry-required skillset is not measured quantitatively, so academia cannot immediately know which skills should be taught to students. Our solutions for these problems are as follows. An FP mining algorithm is used to generate frequent skillsets (FSs), combinations of skills that dominate the job skillset data. Domination is determined by support, an interestingness measure in FP mining; using support, skills are associated because they co-occur frequently in the dataset. FSs are mined periodically, as new job adverts are downloaded and added to the job skillset dataset, which solves the static-processing problem of the traditional method. Consequently, the FSs are always up to date and can inform academia about the skillsets required in industry today. To measure the magnitude of the gap, we propose a new measure, the student's skillset coverage, which intersects the student's skillset with the FSs; the coverage quantifies the gap between these skillsets. The coverage is visualized in a newly proposed tool, called the skillset-student matrix. Since the FS collection is updated periodically, and students' skillsets also change every year as they take more courses, the map of the gap in the visualization tool changes as well. Knowing the gap, HEI management can take immediate action to prevent the gap from widening in the future.

While FSs represent the skillsets popular in job adverts today, the skillsets that will be popular in the near future are not yet known; this question is the third problem solved in this dissertation. The concept of emerging patterns (EPs) is the basis of the solution: a skillset is called an emerging skillset (ES) if it is frequent and its support growth between the previous and current time exceeds a minimum growth threshold. However, since we do not know when a skillset will emerge, the skillsets and their supports found in all time windows must be maintained for the long term; a time window is a block of data (job skillset records) processed at a time stamp. Our experiments focused on developing a new time-window model, the Fibonacci windows model (FWin for short), to store the skillsets and supports efficiently over a long period. Although FWin outperforms the traditional model, the logarithmic tilted-time windows model (LWin for short), in time and memory efficiency, it does not outperform it in the number of EPs found. This finding motivated us to improve the tilted-time windows model (TTWM), which underlies both FWin and LWin. TTWM saves the memory used to store support data by tilting, or folding, old windows so that the supports found in the most recent windows are the most accurate, while those in old windows may be less accurate. Technically, the supports found in n windows are condensed into m (m ≤ n) array elements using a particular element-updating mechanism: old supports are merged, while recent supports keep their original values. However, TTWM's updating mechanism creates many null elements at the front of the array, so some ESs cannot be found at several time stamps; this is the main problem we attempt to solve. As the solution, we propose a novel push-front mechanism for TTWM that not only avoids creating null elements but also keeps the supports with the most accurate values at the front. The push-front approach is applied to the Fibonacci windows model, yielding the new Push-front Fibonacci windows model. Experimental work shows that the new model outperforms both LWin and FWin in the number of ESs found in streaming data.
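The notion of a frequent skillset, a combination of skills whose support meets a user-given minsupp, can be sketched as follows. This is an illustrative brute-force enumeration, not the dissertation's mining algorithm; the `adverts` data, the function name, and the thresholds are invented for the example. It does, however, exploit the anti-monotonicity of support (no superset of an infrequent set can be frequent) to stop early, the same property that makes Apriori-style pruning possible.

```python
from itertools import combinations

def frequent_skillsets(records, minsupp):
    """Return every skill combination whose support reaches minsupp.

    records : list of sets of skill names (one set per job advert)
    minsupp : minimum support as a fraction, e.g. 0.10
    """
    n = len(records)
    universe = sorted(set().union(*records))
    result = {}
    for size in range(1, len(universe) + 1):
        found_any = False
        for combo in combinations(universe, size):
            # support = fraction of adverts containing every skill in combo
            supp = sum(1 for r in records if set(combo) <= r) / n
            if supp >= minsupp:
                result[combo] = supp
                found_any = True
        if not found_any:
            # anti-monotonicity: no larger combination can be frequent
            break
    return result

# Hypothetical toy dataset of five job adverts.
adverts = [
    {"java", "sql"}, {"java", "sql", "html"},
    {"python", "sql"}, {"java", "html"}, {"python", "html"},
]
fs = frequent_skillsets(adverts, minsupp=0.4)
```

Note that `{"java"}`, `{"sql"}`, and `{"java", "sql"}` are all frequent here, but `{"sql", "html"}` is not even though both members are, which is exactly the pitfall of judging skills only by individual frequency described above.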
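The emerging-skillset test, frequent now and growing fast enough since the previous window, can be expressed as a small check. This is a minimal sketch under assumed conventions: the function name, the example supports, and the treatment of a skillset absent from the previous window (unbounded growth) are illustrative choices, not the dissertation's definitions.

```python
def emerging_skillsets(prev_supp, curr_supp, minsupp, min_growth):
    """Flag skillsets that are frequent now and whose support growth
    from the previous time window exceeds min_growth.

    prev_supp, curr_supp : dict mapping skillset tuple -> support
    """
    emerging = {}
    for skillset, supp in curr_supp.items():
        if supp < minsupp:
            continue                       # must be frequent in the current window
        old = prev_supp.get(skillset, 0.0)
        # a skillset unseen before is treated as having unbounded growth
        growth = supp / old if old > 0 else float("inf")
        if growth > min_growth:
            emerging[skillset] = growth
    return emerging

# Hypothetical supports from two consecutive time windows.
prev = {("python",): 0.10, ("python", "spark"): 0.05, ("java",): 0.40}
curr = {("python",): 0.30, ("python", "spark"): 0.20, ("java",): 0.42}
es = emerging_skillsets(prev, curr, minsupp=0.15, min_growth=2.0)
```

Here `("java",)` stays frequent but barely grows, so it is popular rather than emerging; the two Python-related skillsets triple and quadruple their support and are flagged.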
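The tilted-time idea, keeping recent supports exact while progressively coarsening old ones, with new values pushed at the front so no null elements appear, can be sketched with a logarithmic variant. This is an illustrative reconstruction, not the dissertation's FWin or push-front algorithm: the class name, the two-summaries-per-level rule, and merging by averaging are assumptions made for the sake of a runnable example.

```python
class PushFrontTiltedWindows:
    """Sketch of logarithmic tilted-time windows with push-front updates.

    Level i keeps at most two summaries, each covering 2**i raw windows.
    A third summary at a level triggers a merge of the two oldest
    (support averaged, spans added), which is promoted to level i+1.
    Recent windows thus stay exact at the front, older ones are
    progressively coarsened, and no null placeholders are created.
    """

    def __init__(self):
        self.levels = []  # levels[i] = list of (avg_support, span) pairs

    def push(self, support):
        carry = (support, 1)          # a fresh window covers span 1
        i = 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append([])
            self.levels[i].insert(0, carry)        # push-front, never a null slot
            carry = None
            if len(self.levels[i]) > 2:            # overflow: merge the two oldest
                s1, n1 = self.levels[i].pop()
                s2, n2 = self.levels[i].pop()
                carry = ((s1 * n1 + s2 * n2) / (n1 + n2), n1 + n2)
            i += 1

    def snapshots(self):
        """Flatten all levels, most recent summary first."""
        return [snap for level in self.levels for snap in level]

# Push four windows' supports for one hypothetical skillset.
w = PushFrontTiltedWindows()
for s in [0.1, 0.2, 0.3, 0.4]:
    w.push(s)
snaps = w.snapshots()
```

After four pushes the two most recent supports (0.4 and 0.3) are stored exactly, while the two oldest have been folded into a single averaged summary spanning two windows, mirroring the trade-off of accurate recent supports versus condensed old ones described above.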
Keywords/Search Tags:data mining, data stream, emerging patterns, frequent patterns, pattern trends