Data Warehouse: State of the Art and Future Challenges

Nawfal El Moukhi* / Ikram El Azami / Aziz Mouloudi
Dept. of Computer Science, FSK
Kenitra, Morocco
elmoukhi.nawfal@gmail.com, akram_elazami@yahoo.fr, mouloudi_aziz@hotmail.com

Abstract — The evolution of the computer world and the proliferation of storage devices have led to an explosion of available information, making its exploitation difficult. Moreover, a large part of this data is never stored or used, even though it constitutes a precious raw material for understanding an activity and anticipating its evolution. To cope with these problems, professionals implemented decision support systems, which allow a synthetic and multidimensional treatment of all the available and stored information; hence the birth of the data warehouse and of data mining techniques. This paper introduces data warehousing, gives researchers the terminology necessary for understanding the process, and summarizes its evolution. It also presents an analysis of the latest research in the field and explores future trends. This study therefore provides a reference for researchers wishing to contribute to this area.

Keywords — Business intelligence; Data Warehouse; Data Mining; Multidimensional databases; OLAP.

I. INTRODUCTION

As the two main concepts of business intelligence, data warehousing and data mining have become essential elements of any IT strategy, yet very few companies have succeeded in setting up such a system, which aims to centralize and organize all the company's data with a view to discovering unexpected information that could help in decision making. In 2003, Gartner, Inc. reported that more than 50 percent of data warehouse projects fail, and the other 50 percent are either delivered late or over budget [1]. This low success rate is due to concepts that still pose great problems, hence the launch of a series of research efforts.
The diagram below [2] shows the growing interest that researchers have shown in this area since the 1990s, by representing the evolution of the number of articles submitted to the conferences "Knowledge Discovery and Data Mining" (KDD) and "International Conference on Data Mining" (ICDM), organized respectively by the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE), which are among the best international conferences in the field.

Figure 1. Number of articles submitted to KDD and ICDM

II. BIRTH AND EVOLUTION OF THE TWO CONCEPTS

The data mining concept appeared for the first time at the IJCAI conferences (International Joint Conferences on Artificial Intelligence) in 1989 [3]. This appearance triggered a series of experiments that resulted in exploration algorithms based only on statistical calculations and operating on numerical data collected from a single database. After the emergence of new computing techniques such as artificial intelligence and machine learning, these algorithms evolved greatly to cover non-numerical data and relational databases.

The data warehouse concept, for its part, was used for the first time in a paper entitled "An architecture for a business and information system", published in 1988 by Barry Devlin and Paul Murphy [4]. In 1992, Bill Inmon, who is considered a pioneer of the field, published his book "Building the Data Warehouse" [7], while in 1996 his rival Ralph Kimball published "The Data Warehouse Toolkit" [5]. The research of these two innovators was boosted by the advent of high-capacity storage devices and powerful backup and processing tools.

978-1-4673-8149-9/15/$31.00 ©2015 IEEE

III. ROOTS OF THE FIELD

A. Statistics

Statistics is a discipline that embraces several techniques and concepts used to study data and the relationships within it.
These techniques include standard distributions, variance, standard deviation, regression analysis, discriminant analysis, cluster analysis, confidence intervals and others, which are the very building blocks on which more advanced statistical analyses rest. The richness of this discipline was a main foundation for the birth of data mining; indeed, all current data mining tools are based on conventional statistical analysis.

B. Artificial intelligence and machine learning

These two areas are the second-largest contributors to the birth and evolution of data mining. Artificial intelligence is built upon heuristics rather than statistics, and attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. Artificial intelligence has thus complemented the role of statistics in the data mining process. Machine learning, in turn, can be considered an evolution of artificial intelligence, because it blends AI heuristics with advanced statistical methods. It lets computer programs learn about the data they study and then apply the learned knowledge to new data.

C. Databases

Huge amounts of data need to be stored in a repository, and that repository needs to be managed; hence the advent of databases. Data was first managed as records and fields, then through various models (hierarchical, network, etc.). The relational model served data storage needs for a long while, and object-relational databases emerged as a further advance. In data mining, however, the volume of data is so high that specialized servers are required; this is what we call data warehousing. A data warehouse also supports OLAP operations applied to it, to support decision making [6].

D. Other technologies

Besides the fields mentioned above, data warehousing and data mining draw on various other areas such as pattern discovery, visualization and business intelligence.

IV. DEFINITION OF CONCEPTS

A. Decision support systems

Decision support systems (DSS) are systems that help managers optimize the decision-making process by making it possible to quickly retrieve useful information from multiple blocks of data (raw data, transactions, etc.). Every DSS has three fundamental components:

1) ETL tools: Extract the needed data from different sources; Transform and adjust the data; Load the selected data into the system (database, data warehouse, etc.).
2) Storage tools: mainly databases and data warehouses.
3) OLAP (On-Line Analytical Processing): which offers interactive and complex multidimensional data queries with rapid execution time.

Figure 2. Components of a decision support system

B. Data warehouse

Many authors have defined the concept of a data warehouse, but the most commonly used definition comes from Bill Inmon, who states that [7]: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".

- Subject-oriented: a data warehouse is used to analyze specific subjects. For example, "sales" can be a particular subject.
- Integrated: a data warehouse integrates data coming from many data sources. Thus, there is only one way to identify a specific piece of data.
- Time-variant: historical data is maintained in a data warehouse, since the history of data can help decision making.
- Non-volatile: once integrated in the data warehouse, data is never modified or changed.
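As a toy illustration of Inmon's four properties, the following Python sketch loads "sales" facts into an append-only sqlite3 table. All table and column names, as well as the sample rows, are hypothetical and not taken from the paper: the table is subject-oriented (one subject, sales), integrated (rows from different sources mapped to one conventional format), time-variant (each row carries its load date) and non-volatile (rows are only inserted and read, never updated).

```python
import sqlite3

# In-memory warehouse; all names below are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (      -- subject-oriented: one subject, "sales"
        product   TEXT,
        region    TEXT,
        load_date TEXT,            -- time-variant: every row keeps its history
        amount    REAL
    )
""")

# Integrated: rows arriving from different operational sources are first
# mapped to this single conventional format before loading.
rows = [
    ("widget", "EU", "2015-01-10", 120.0),
    ("widget", "EU", "2015-02-10", 135.0),  # new snapshot appended, not overwritten
    ("gadget", "US", "2015-01-10", 80.0),
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

# Non-volatile: the warehouse is only ever appended to and queried.
total_eu = conn.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE region = 'EU'"
).fetchone()[0]
print(total_eu)  # 255.0
```

Note that an operational database would instead UPDATE the widget row in place; keeping every dated snapshot is precisely what makes historical, decision-oriented analysis possible.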
Another interesting definition is provided by Ralph Kimball, who states that [8]: "A data warehouse is a copy of transaction data specifically structured for query and analysis".

The definitions of these two leaders of the data warehousing field are quite different, and each reveals a particular aspect. This difference stems from the concept still causing confusion, and from the two schools adopting two different visions. The Inmon philosophy follows a global approach and gives priority to designing the entire data warehouse, considered as a global and centralized reference for the company, while the Kimball philosophy allows gradual construction of the data warehouse from data marts, each data mart relating to a particular activity. The table below summarizes the differences between these two schools.

TABLE I. THE MAIN DIFFERENCES BETWEEN THE INMON AND KIMBALL PHILOSOPHIES OF DATA WAREHOUSING

                  Ralph Kimball    Bill Inmon
  Process         Bottom-up        Top-down
  Organization    Data marts       Data warehouse
  Schema          Star             Snowflake

In both schemas there is a central fact table containing the facts to be analyzed, and dimension tables containing data about the ways in which the facts can be analyzed. The difference is that in the star schema each dimension is described by a single dimension table whose attributes represent the different possible granularities, whereas in the snowflake schema the dimensions are described by a succession of tables representing the granularity of the information; in this way, the snowflake schema avoids information redundancy. There is another type, called the constellation schema, which contains several fact tables sharing the same dimension tables; this makes the schema more complicated and the modeling more difficult. A generic data warehouse architecture is shown below in Figure 3 [9].

Figure 3. Generic Data Warehouse architecture

C. Data mining

Also known as knowledge discovery, data mining is the process of examining large amounts of data (big data) collected in a systematic way [10]. It analyzes the data from many different dimensions or angles in order to extract useful knowledge for decision making. The data mining process consists of six steps, namely:

1) Problem analysis: in this step, we must select and analyze a complicated problem whose resolution will provide competitive advantages to the company. It is an essential step, since it helps determine exactly what knowledge we want to extract from the data stored in the data warehouse.
2) Data analysis: this phase is used to evaluate data quality, detect inadequacies, and analyze distributions and combinations. In practice, this step is executed in parallel with the first one, in order to determine the right solution.
3) Data preparation: once the available data sources are identified, they need to be cleaned in order to eliminate any noise (aberrant or missing values) and transformed using techniques such as constructive induction.
4) Modeling: during this step, modeling techniques are selected to create one or more models, which are then tested to check their validity.
5) Evaluation: the created models are interpreted and evaluated to verify whether they meet business needs.
6) Deployment.

The figure below illustrates and summarizes these steps.

Figure 4. The steps of the data mining process

The main techniques used in data mining are [11]:

- Classification: a process of generalizing the data according to different instances.
Several major kinds of classification algorithms in data mining are decision trees, the k-nearest-neighbor classifier, Naive Bayes and AdaBoost. Classification consists of examining the features of a newly presented object and assigning it to a predefined class. The classification task is characterized by well-defined classes and a training set of preclassified examples.

- Estimation: deals with continuously valued outcomes. Given some input data, estimation is used to come up with a value for an unknown continuous variable such as income, height or credit card balance.
- Prediction: a statement about the way things will happen in the future, often but not always based on experience or knowledge. A prediction may be a statement in which some outcome is expected.
- Association rules: a rule implying certain association relationships among a set of objects in a database (such as "occur together" or "one implies the other").
- Clustering: can be considered the most important unsupervised learning problem; like every problem of this kind, it deals with finding structure in a collection of unlabeled data.

V. RESEARCH NEWS

From our reading of recent articles in the area, we found that the majority of research focuses on two issues:

- How to effectively integrate information from multiple heterogeneous data sources into multidimensional databases?
- How to improve the quality of data warehousing and data mining systems?

There is no doubt that the integration of data sources is the first and a crucial phase of the data warehousing process. Therefore, any data integration system must be well founded and based on a detailed analysis of operational data and user requirements. Besides its importance in the process, its technical difficulty makes this phase all the more critical. Indeed, the types of data sources evolve rapidly and constantly.
Therefore, the company faces a huge mass of heterogeneous data that should be systematically integrated into a single data warehouse for further analysis. These data sources can be divided into three categories according to their structure:

- Structured data sources: relational databases, object databases, etc.
- Semi-structured data sources: XML documents, HTML data, graphs, etc.
- Unstructured data sources: images, text, audio, etc.

Thus, professionals have to extract this data using the appropriate access mode and transform it into a conventional format that must be determined. To solve this problem, professionals use ETL tools, which play the role of an intermediary between the different data sources and the data warehouse. However, the extraction and transformation of data affect its content, and this generates more complicated problems. Indeed, each data source is set up to meet the needs of a specific activity of the company and to perform specific functions, so the set of data sources will inevitably contain similar data, general data subsuming specific data, complementary data, etc. It is therefore important to proceed with further data cleaning, using machine learning techniques, clustering techniques, similarity functions, etc., in order to get good-quality data with minimal noise.

To address the first issue, some researchers proposed, in a paper entitled "Designing Data Warehouses with OO Conceptual Models" [12], a set of minimal constraints and extensions to the Unified Modeling Language (UML) for representing multidimensional modeling, and the use of object-relational and object-oriented databases in data warehouse, multidimensional database and online analytical processing (OLAP) applications. Other authors presented, in their book "Advanced Data Warehouse Design" [13], three different approaches to specifying requirements, each of which leads to the creation of a conceptual multidimensional model.
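As a minimal sketch of the similarity-function cleaning discussed above, the following Python snippet uses the standard library's difflib to drop near-duplicate records when merging two sources. The record values and the 0.7 similarity threshold are assumptions chosen purely for illustration, not drawn from any of the cited systems.

```python
import difflib

# Toy customer records from two hypothetical sources; layout is assumed.
source_a = ["ACME Corp, Kenitra", "Globex Inc, Rabat"]
source_b = ["Acme Corporation, Kenitra", "Initech, Casablanca"]

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    # Character-level similarity ratio in [0, 1]; 0.7 is an arbitrary cut-off.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Keep every record from source A, then add only the B records that do not
# closely match something already kept -- a crude duplicate-elimination pass.
cleaned = list(source_a)
for record in source_b:
    if not any(similar(record, kept) for kept in cleaned):
        cleaned.append(record)

print(cleaned)  # "Acme Corporation, Kenitra" is dropped as a near-duplicate
```

A production cleaning step would of course use richer similarity functions (per-field comparison, phonetic codes, clustering of match candidates), but the shape of the computation is the same: score pairs, then merge or discard records above a threshold.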
Concerning the second issue, Veronika Stefanov [14] tried to bridge the gap between the organization's strategy and the data warehouse, by proposing an approach based on a conceptual modeling language that models the way the organization interacts with the data warehouse, and by creating business metadata to describe the business context of the data. Researchers have also proposed, in a paper titled "A Strategy for Managing Data Quality in Data Warehouse Systems" [15], a model based on quality indicators and user needs in terms of quality, while others completed this model by suggesting, in the article "Tools for Data Warehouse Quality" [16], three tools to improve the quality of a data warehousing system: ConceptBase, CoDecide and MIDAS. Moreover, another article published by IEEE and titled "Using Information Filtering in Web Data Mining Process" [17] deals with the quality of data mining rather than data warehousing: its authors presented a system consisting of an information filter and a pattern taxonomy model, which helps improve the results of web data extraction.

In conclusion, the two concepts form an excellent research field, especially regarding the quality and standardization of data collection, which still represent a great obstacle for any company wishing to venture into a data warehouse project.

VI. FUTURE TRENDS

Data warehousing and data mining are among the most promising research areas in computer science. The field can be considered the convergence of several disciplines such as statistics, machine learning, pattern recognition, databases, information extraction, the World Wide Web, visualization and many others. Therefore, the evolution of these disciplines and technologies will generate new challenges.
According to our analysis of the existing literature and the study conducted by Xindong Wu [1], these challenges are:

- Developing a unifying theory of data mining: current data mining systems are designed to execute one task at a time, such as classification or clustering. These systems therefore try to solve a specific problem rather than the whole problem, and companies are forced to combine several tools to get complete results. To solve this problem, researchers proposed a multi-agent system [18] that speeds up the performance and operation of the system by providing a method for parallel computation. In this way, many tasks can be executed at the same time, and other agents can be integrated easily.
- Mining complex knowledge from complex data: this complexity may be due to the kind of data (multimedia data, medical imaging, geomatics data, etc.) or to the data content (missing data, uncertain data, etc.). An Indian researcher proposed a combined mining approach [19] as a solution for such extraction.
- Distributed data mining and mining multi-agent data: distribution is particularly useful when data marts must be added following a merger with, or acquisition of, another company. Distributed data mining still presents great challenges to professionals, both at the architectural level (which distribution model best fits the requirements? which communication protocol to use?) and at the physical level (how to fragment the data warehouse? how to allocate the fragments?). Some researchers have proposed a distributed data mining system combining multiple credit card fraud detectors [20]. This system demonstrated that it is possible to reduce losses due to fraud through distributed mining of fraud models.
- Lack of standardization: we believe that a unified conceptual model for data warehouses, implemented in sophisticated CASE tools, would be a great support for research.
This model should be well founded, and easily usable and understandable by designers. It should support the deployment, data sources, ETL, facts, dimensions, workload, etc. It must be expressive and flexible enough to allow not only the representation of the classical requirements of a company, but also the support of particular problems that arise in unusual and emerging applications.

- Spatial data warehousing: while all existing conceptual models support basic modeling of a spatial dimension (e.g., most business data warehouses include a geographic hierarchy built on customers), location data is usually represented in an alphanumeric format. Conversely, picking a more expressive and intuitive representation for this data would reveal patterns that are difficult to discover otherwise. As concerns design methods, adequate solutions for properly moving from conceptual to logical schemata in the presence of spatial information must be devised [21].
- Real-time data warehousing: as data warehouse systems provide an integrated view of an enterprise, they represent an ideal starting point for building a platform for business process monitoring (BPM). However, performing BPM on top of a data warehouse has a deep impact on design and modeling, since BPM requires extended architectures that may include components not present in standard data warehouse architectures, and may be fed by non-standard types of data such as data streams [21].

These challenges lead us to the following big question: will the challenging problems become even more challenging?

VII. COMPARISON OF TRENDS FROM PAST TO FUTURE

The following table [22] depicts the evolution of techniques, tools and application fields:

TABLE II.
EVOLUTION TRENDS OF DATA MINING AND DATA WAREHOUSING

Past
- Algorithms/techniques employed: statistical and machine learning techniques
- Data formats: numerical and structured data stored in traditional databases
- Computing resources: evolution of 4GLs and various related techniques
- Prime areas of application: business

Current
- Algorithms/techniques employed: statistical, machine learning, artificial intelligence and pattern recognition techniques
- Data formats: heterogeneous data formats, including structured, semi-structured and unstructured data
- Computing resources: high-speed networks, high-end storage devices, and parallel and distributed computing
- Prime areas of application: business, the Web, medical diagnosis, etc.

Future
- Algorithms/techniques employed: soft computing techniques such as fuzzy logic, neural networks and genetic programming
- Data formats: complex data objects, including high-dimensional data, high-speed data streams, sequences, noisy time series, graphs, multi-instance objects, multi-represented objects and temporal data
- Computing resources: multi-agent technologies and cloud computing
- Prime areas of application: business, the Web, medical diagnosis, scientific and research analysis fields (biology, remote sensing, etc.), social networking, etc.

VIII. CONCLUSION

In this paper we have tried to give an idea of the evolution of the field of data warehousing and data mining, which has evolved from the simple execution of classical statistical calculations to complex algorithms that process huge amounts of heterogeneous data (Big Data), while focusing on the most promising aspects of the field. This study can therefore serve as a reference for identifying the different points on which researchers can work. There is no doubt that the evolution of data warehousing has generated more complex problems and more challenging requirements; on the other hand, it has attracted researchers' interest and expanded the scope of application to other areas. In this sense, we will try to make some improvements to the methods and techniques used, and to apply these improvements to a new domain, namely library science.

REFERENCES

[1] R. Herschel, "Principles and Applications of Business Intelligence Research," IGI Global, 2012.
[2] X. Wu, "10 Years of Data Mining Research: Retrospect and Prospect," University of Vermont, 2010.
[3] X. Wu, "Data Mining: An AI Perspective," IEEE, 2004.
[4] K. D. Gupta, J. Gupta, J. Gomez, and P. Prasoon, "Novel Architecture with Dimensional Approach of Data Warehouse," International Journal of Advanced Research in Computer Science and Software Engineering, 2013.
[5] M. Breslin, "Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models," Business Intelligence Journal, 2004.
[6] A. Naidu Paidi, "Data Mining: Future Trends and Applications," International Journal of Modern Engineering Research, 2012.
[7] B. Inmon, "Building the Data Warehouse," 1st edition, Wiley and Sons, 1992.
[8] R. Kimball, "The Data Warehouse Toolkit," 1st edition, Wiley, 1996.
[9] Shaweta, "Critical Need of the Data Warehouse for an Educational Institution and Its Challenges," International Journal of Computer Science and Information Technologies, 2014.
[10] S. Goele and N. Chanana, "Data Mining Trend in Past, Current and Future," International Journal of Computing & Business Research, 2012.
[11] B. Thakur and M. Mann, "Data Mining for Big Data: A Review," International Journal of Advanced Research in Computer Science and Software Engineering, 2014.
[12] J. Trujillo, M. Palomar, J. Gomez, and I.-Y. Song, "Designing Data Warehouses with OO Conceptual Models," IEEE, 2001.
[13] E. Malinowski and E. Zimányi, "Advanced Data Warehouse Design," Springer, 2008.
[14] V. Stefanov, "Bridging the Gap between Data Warehouses and Organizations," Vienna University of Technology.
[15] M. Helfert and E. Von Maur, "A Strategy for Managing Data Quality in Data Warehouse Systems," University of St. Gallen.
[16] M. Gebhardt, M. Jarke, M. A. Jeusfeld, C. Quix, and S. Sklorz, "Tools for Data Warehouse Quality," IEEE, 1998.
[17] X. Zhou, Y. Li, P. Bruza, S.-T. Wu, and Y. Xu, "Using Information Filtering in Web Data Mining Process," IEEE, 2007.
[18] D. M. Khan, N. Mohamudally, and D. K. R. Babajee, "Towards the Formulation of a Unified Data Mining Theory, Implemented by Means of Multiagent Systems (MASs)," INTECH, 2012.
[19] A. Kapoor, "Combined Mining Approach and Pattern Discovery in Online Shopping Application," IJETTCS, 2014.
[20] P. K. Chan, W. Fan, A. L. Prodromidis, and S. J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," IEEE, 1999.
[21] S. Rizzi, J. Lechtenbörger, A. Abelló, and J. Trujillo, "Research in Data Warehouse Modeling and Design: Dead or Alive?," ACM, 2006.
[22] A. Kapoor, "Data Mining: Past, Present and Future Scenario," IJSR, 2012.