Today, organizations face many data quality management challenges, particularly the phenomenon of “Laptop Data Quality,” where individuals create ad-hoc solutions to improve data quality based on their specific needs, often leading to inconsistencies and inefficiencies. The emergence of a hybrid data warehouse architecture that integrates the strengths of both Inmon and Kimball approaches is helping mitigate this by consolidating various data sources into a single data warehouse system to support business intelligence (BI) and analytics.
Receive your daily serving of all the leading trends and knowledge in enterprise data management!
What is Data Warehousing? Data warehousing is the process of designing, building, and maintaining a data warehouse, which serves as a centralized repository for storing data from various sources. The primary goal of data warehousing is to provide a single source of truth for an organization’s data, enabling business users to make informed decisions based on accurate and up-to-date information. This comprehensive process involves several key activities, including:
Data Integration: Combining data from different sources to provide a unified view.Data Transformation: Converting data into a suitable format for analysis, which includes cleansing, aggregating, and the summarization of data.Data Quality: Ensuring the accuracy, consistency, and reliability of data.Data Governance: Managing the availability, usability, integrity, and security of data.By focusing on these activities, data warehousing helps organizations enhance their data management practices. It also supports robust BI efforts.
Cloud Data Warehousing in Modern Enterprise Solutions As organizations increasingly adopt cloud technology, cloud data warehousing has emerged as a flexible and scalable solution that surpasses the capabilities of traditional on-premises data warehouses. Cloud data warehousing offers several significant benefits, including:
Scalability and Flexibility: Cloud data warehouses enable businesses to scale their data storage and processing capabilities effortlessly, accommodating both structured and unstructured data. This flexibility allows for seamless data integration across multiple sources. Cost-Effectiveness: By eliminating the need for expensive hardware and reducing infrastructure maintenance, cloud data warehousing provides substantial cost savings and operational efficiency. Advanced Analytics Support: Designed to handle complex queries and support advanced analytics, cloud data warehouses are ideal for machine learning and other data-driven applications. Real-Time Data Processing: With real-time data processing and analytics capabilities, organizations can retrieve data quickly and make faster, more informed decisions. In this context, cloud data warehousing is not only enhancing accessibility and scalability but also transforming how organizations approach data warehousing fundamentals and business intelligence processes.
Understanding Data Warehouse Architecture and Data Structures A well-designed data warehouse architecture integrates various technologies and BI tools to support comprehensive analysis and decision-making. Typically, a data warehouse architecture consists of several layers, such as:
Source Layer: This layer includes all systems of record that feed data into the warehouse.Staging Layer: A temporary landing area for data from source systems, where initial data processing occurs.Warehouse Layer: The core storage area where integrated and cleansed data is stored.Consumption Layer: The layer where data is modeled and made available for analysis using BI tools.Data structures play a crucial role in organizing and optimizing data within a data warehouse. Common data structures include:
Star Schema: A simple structure with a central fact table surrounded by dimension tables, facilitating straightforward queries and analysis.Snowflake Schema: A more complex structure where dimension tables are further divided into sub-dimension tables, providing a normalized approach.Fact Tables: Tables that store quantitative data for analysis, often linked to dimension tables that provide context.These components and structures ensure that a data warehouse can efficiently support BI, AI/ML, and analytical efforts.
Data Warehouse vs. Other Data Storage Solutions A data warehouse differs significantly from other data storage solutions like databases and data lakes. While a database is designed to support specific applications or business processes, a data lake serves as a centralized repository for raw, unprocessed data in its native format. In contrast, a data warehouse is a centralized repository that stores integrated data from various sources. They are specifically designed to support BI, AI/ML, and analytical efforts.
The key distinctions are:
Databases: Optimized for transactional processing and specific applications.Data Lakes: Stores raw data in its native format, suitable for big data and exploratory analysis.Data Warehouses: Provides a structured and integrated environment for comprehensive analysis and decision-making.Understanding these differences helps organizations choose the right data storage solution based on their specific needs and objectives.
Cloud Data Warehousing in Modern Enterprise Solutions As organizations adopt cloud technology, cloud data warehousing has emerged as a flexible, scalable solution that extends beyond the capabilities of traditional on-premise data warehouses. Cloud data warehouses support data-driven decision-making by delivering rapid, cost-effective access to massive datasets. This setup allows querying of billions of rows in seconds, providing businesses with near real-time insights. Sales data is categorized around subjects, allowing for a unified perspective on sales instead of scattershot insights coming in from various source systems.
Key advantages include:
Hyperscale and Flexibility: Cloud data warehouses enable businesses to reach hyperscale while accommodating a variety of data structures, including both structured and unstructured data. This allows for seamless data integration across multiple data sources.Reduced Infrastructure Management: By reducing the complexity of hardware and infrastructure maintenance, cloud data warehouses offer cost savings, operational efficiency, and easier data-sharing options.Advanced Analytics Support: Designed to support complex queries and advanced analytics, these solutions are suitable for machine learning and other data-driven applications.Data Accessibility: Cloud-based systems enable easier and faster data access, lowering latency for high-speed data retrieval and analysis, making them ideal for BI tools and complex trend analysis.In this context, cloud data warehouses are not only enhancing accessibility and scalability but also transforming how organizations approach data warehousing fundamentals and BI processes, particularly by enabling data analysts to work more effectively with vast and varied data types.
Exploring Data Warehouse Architecture and Data Structures A data warehouse is an environment that combines an integrated decision support database with software to collect, cleanse, transform, and store data from a variety of operational and external sources. Data models describe how data is structured and provide a framework for the relationships between data elements within a database, which is crucial for creating efficient data storage and management systems. These technologies are combined to support historical, analytical, and business intelligence (BI) requirements. A data warehouse may include dependent data marts, which are subject-area databases that are aligned with the data warehouse database. In most references, the term “data warehouse ” includes all the components of the environment.
Components of a Data Warehouse Figure 2: Components of a Data Warehouse To understand the challenges of building and managing an enterprise data warehouse system, it is important to provide some definitions of commonly used terms:
Data Warehouse – An environment (includes data, processes, hardware, software) that provides data for effective decision-making. Typically, a data warehouse is an enterprise level initiative.Atomic Data Warehouse (ADW): A data architecture construct that is the central hub of the Data Warehouse. It integrates, cleanses, and stores data from operational systems to be fed into dependant data marts. It can also support ad-hoc data mining, and other applications.Business Intelligence (BI): Application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include query and reporting, online analytical processing (OLAP), balanced scorecards with KPI’s, statistical analysis, forecasting, and data mining. Some consider business intelligence to be the delivery aspects of a data warehouse environment.ETL Transformation (Extract, Transform, and Load): This is the process of extracting data from operational data sources or external data sources, transforming the data which includes cleansing, matching, standardization, aggregation, summaries, integration, as well as basic transformation, and loading the data into some data store of the data warehouse (ODS, atomic data warehouse, data mart) environment.Data Mart : A data structure that is optimized for analysis and reporting purposes for a business process or group of users. Typically, a data mart utilizes a dimensional schema using RDBMS and/or BI cubeStrategic Reporting: Summarized reporting needed for analysis and prediction to help guide the business from a strategic perspectiveTactical Reporting: Reporting and analysis to help the company run day to day businessOperational Reporting or Operational BI : Helps the company to apply intelligence to business transactions to improve performance in “real-time”.Operational Data Store (ODS): An integrated, current-valued data structure that is optimized for operational reporting or data service purposes.Real-Time Data Warehousing : Integrating and making data available through the data warehouse in “real” time. Also called Active Data Warehousing. Traditional use of phrase “real time” means instantaneous or very nearly instantaneous. For Data Warehousing, some latency is almost always implied.Data ingestion and transformation are critical components of a data warehouse architecture, enabling organizations to integrate data from various sources and make it available for analysis and reporting.
Data Ingestion: The process of moving data from different sources into the data warehouse. This can involve batch processing, real-time streaming, or a combination of both.Data Transformation: Converting data from its original format into a format suitable for analysis. This includes cleansing, aggregating, and summarizing the data to ensure consistency and accuracy.Common techniques for data ingestion and transformation include:
ETL (Extract, Transform, Load): Extracting data from source systems, transforming it to meet analytical requirements, and loading it into the data warehouse.ELT (Extract, Load, Transform): Extracting data, loading it into the data warehouse, and then transforming it as needed.Data Streaming: Continuously ingesting and processing data in real-time, enabling immediate analysis and decision-making.By effectively managing data ingestion and transformation, organizations can ensure that their data warehouse provides a reliable foundation for BI and analytical activities.
Different Views of Data Warehousing The term “Data Warehouse” was coined by William H. Inmon in his book, “Building the Data Warehouse “. In this book, Inmon defined the data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. This view was the accepted architecture for enterprise data warehousing, where the source applications provided data to be transformed through cleansing and re-formatting based on analytical requirements, then sent to a large central database used by a variety of applications and users. The large database is normalized, where tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). When applied in large enterprises the result is dozens of tables that are linked together by a web of relationships, providing a large-scale view of historical data for analysis. This design can be understood by business stakeholders during requirements development, but it can be cumbersome for developers to implement without sophisticated programming and tool manipulation. In addition, it can be difficult for users to join data from different sources into meaningful information and to access the information without a precise understanding of the sources of data, its metadata , and the structure of the data warehouse.
Figure 3: Traditional Inmon Enterprise Data Warehouse In response to some of the challenges faced by organizations attempting to implement enterprise data warehouses, an alternative approach was developed by Ralph Kimball, author of The Data Warehouse Toolkit and the creator of dimensional data modeling . In the dimensional approach, data warehouse facts (generally numeric transaction data) are associated with “dimensions” (generally reference information that give meaning / context to the facts). Data models play a foundational role in this approach by describing how data is structured and providing a framework for the relationships between data elements within the database.
Figure 4: Dimensional Model Example Dimensional models are used to create “data marts,” not enterprise data warehouses. The Kimball approach to data warehousing recommends creating a collection of data marts instead of a single enterprise data warehouse, and relating those data marts through a set of common dimension tables (“conformed dimensions”). Advantages to this approach include ease of use for business stakeholders as well as faster technical construction. The main disadvantages to the dimensional approach to data warehousing are the challenges of loading data to the various data marts from the disparate sources while maintaining data integrity within the dimensions, and the difficulty in modifying the various data mart structures to accommodate changes to business processes or business analytical requirements after implementation.
Figure 5: Traditional Kimball Data Mart Architecture To address the challenges of both approaches, Inmon and Kimball, and to exploit the advantages of the two architectures, a third data warehouse architecture arose and gained acceptance among many practitioners and experts. The hybrid architecture combines an enterprise data warehouse designed using a normalized atomic level database along with data marts targeted to a subject area or more detailed analytical data presentation.
Figure 6: Hybrid Data Warehouse General Architecture Objectives of a Data Warehouse Most organizations have the following goals when they embark upon a data warehouse initiative:
Integrate operational systems data from a variety of sources Without integration and standardization, analysts must repeatedly integrate data from the source systems. Research shows that up to 80% of an analyst’s time can be spent in finding the data and only 20% in analysis. Providing integrated, cleansed, and standardized data to the analysts in an easy to use structure allows them to spend more time focused on analysis and less on the mechanisms of getting the data into the proper format.
Improve strategic decision-making through analysis of data from a variety of sources using historical data Sales data is categorized around subjects, allowing for a unified perspective on sales instead of scattered insights from various source systems. Systems should not simply feed information, they should help lead a company to make decisions that are aligned with the organization’s strategy. An enterprise data warehouse can provide: access to the appropriate, decision-oriented information when needed; a business view of the data, from business vantage points for strategy development. Entities such as customer and product are often disparate due to the nature of a business (e.g., multiple lines of business) and must be reconciled to an enterprise view. Not having a standardized, enterprise-wide view of a company’s master data leads to information silos. An enterprise data warehouse can provide that enterprise view and reduce or eliminate information silos.
Improve operational decision-making through analysis of historical data of specific subject areas, or across subject areas Strategic corporate decisions focus on reporting, analysis and prediction – tactical corporate decisions focus upon attempting to drive behaviors in real-time.
For example, a casino knows what drink a customer likes, and after a certain amount of time, software prompts a manager to send a waitress to deliver the drink to the customer, thus reducing the likelihood that the customer will visit other casinos. That was a tactical corporate decision. Making tactical corporate decisions enabled by the Data Warehouse is supported by relatively new “real-time” data warehousing methodologies.
Characteristics of a Data Warehouse All data warehouses are defined by the following characteristics: Subject-oriented, integrated, non-volatile, time variant.
Characteristic Definition Subject-oriented Operational data is organized by specific processes or tasks and is maintained by separate systemsData warehouses are organized by subject area and are populated from various operational systems Integrated Data is gathered from the operational systems and is combined according to business requirements / rules and cleansed Non-Volatile Data warehouse data is not directly updated by the end-userEven for “real time” DW’s – updates occur through an ETL process – not directly by the user Time Variant Time variant doesn’t mean that the data is never changedBusiness changes sometimes dictate changes to the data in the data warehouse (e.g. region reorganization, change in accounting procedure, etc)Data in the data warehouse includes a time stamp to indicate its operational date and its inclusion date is recorded, along with any change date
Goals of Data Warehouse Although each organization will have its own, specific, goals for a data warehouse, there are certain common goals that can be used as the foundation for creating the list of expectations for any data warehouse initiative. Data models play a crucial role in structuring data and providing a framework for relationships between data elements, which is essential for creating efficient data storage and management systems. Some of these goals could include:
enabling users with appropriate access to a homogenized and comprehensive view of the organization, supporting forecasting and decision-making processes at the enterprise achieving information consistency . By bringing data from disparate data sources into a centralized database, the data warehouse provides a homogenized view of the organization’s data. Users from across the organization making use of the data warehouse all view a single and consistent version of the enterprise’s information that is available for analysis and reporting, at tactical and strategic decision-making levels . All data warehouse goals should be aligned with the organization’s business goals, so that business value of the data warehouse’s development and maintenance can be justified and supported.
Challenges to Data Warehouse Development Each data warehouse will experience specific challenges; however, most challenges will fall into one or more of the following categories. Sales data is categorized around subjects, allowing for a unified perspective on sales instead of scattered insights from various source systems. According to The Data Warehousing Institute , these data warehousing challenges are the ones most commonly experienced by organizations of every size and industry:
Data quality management in data warehouse database and data marts: Most data warehouses bring data from multiple source systems, with varying levels of quality found in the data. When a data warehouse project tries to integrate the data from the source systems, it encounters issues such as inconsistent data, repetitions, omissions, and semantic conflicts. All these issues lead to data quality challenges. Resolving these issues and conflicts become difficult due to limited knowledge of business users outside the scope of their own systems.Data warehouse performance expectations and results, metrics and measurements: Many organizations neglect performance requirements when designing data warehouse projects initially. Such oversight can hamper data warehouse success significantly, and little can be done to improve data warehouse performance after implementation, since performance objectives are easier to be designed into the environment than to be tuned. Data warehouses should be built for performance rather than tuned for performance. Database tuning for the data warehouse must include the atomic data warehouse and all data marts, and performance tuning requirements will vary based on architecture, platform, and user populations. Data warehouse testing, from unit to user acceptance: Data warehouse testing is a major project itself, and is often neglected by organizations. Testing for each ETL process from source to target (including testing from atomic data warehouse to all data marts), testing for all reports and analytics applications, and regression testing for all additions to the data warehouse environment must be included in the data warehouse testing plans. Data reconciliation, data transformation (ETL), data lineage mapping, source system analysis: Reconciliation is a process of ensuring correctness and consistency of data in an application or environment. Unlike testing, which is a part of the software development life cycle, reconciliation is a continuous process that must be performed during and after the development cycle has been completed. Reconciliation is challenging due to the complexity of the data warehouse development process, which requires that all transformations and their ETL processes be examined for logic and performance consistency. Additionally, the reconciliation process must comply with performance requirements. Other challenges in this area include the need to document all data lineage and source system analysis that is performed to support data selection from each source for inclusion into the atomic data warehouse and / or a data mart. Cultural change management, user process change acceptance: Generally, data warehouse implementation means changes will occur to existing business processes for reporting, analysis (ad hoc and standard), and all forms of decision-making as a result of the availability of integrated, historical, analytically formatted data. In many organizations, the largest challenge to the success of a data warehouse initiative is the cultural change management. It is important to include re-designed processes and procedures to take advantage of this new source of data and information, and to incorporate this enterprise approach to data management instead of relying on a variety of data / information silos. Resistance to change can be extremely strong in many organizations, and this issue should not be overlooked when planning for a data warehouse. Data warehouse metadata management : Metadata has been identified as a key success factor in data warehouse projects. Metadata captures a variety of information necessary to extract, transform and load data from source systems into the data warehouse and data marts, and to use and interpret the data warehouse’s (and data marts’) contents. The lack of a managed metadata environment , or a central metadata storage facility for the data warehouse, can be a main reason for a data warehouse project’s failure. Data warehouse data governance and data stewardship : Data governance refers to the overall management of the availability, usability, integrity, and security of the data found in an enterprise. A data warehouse serves as the focus for analytical and decision making querying and reporting, and, therefore, needs the attention to data requirements across the enterprise that a robust data governance program would provide. Data warehouse initiatives require organizations to make many decisions that involve data from several sources, to enable the cross-application analysis. In addition to these foundational data challenges, a for a data warehouse can provide analysis for external data that is brought into the warehouse, and can offer the oversight to enforce standards and rules after the decision support system becomes operational. For any organization that wants to take advantage of the wealth of information that exists in its variety of source applications, make strategic and tactical decisions based on facts that have been formatted for analytical purposes, and improve the quality of the data in their systems and the processes that govern those systems, embarking on a well-designed data warehouse initiative is the answer.