Progressive organizations recognize data as a strategic asset and rely on it for critical decision making. Business intelligence spending has been rising steadily and is forecast to be upwards of $16 billion worldwide in the next year.
Major investment and effort go into extracting, transforming and loading (ETL) data from source systems into data warehouses and data marts. Incorrect decisions based on poor data can be disastrous, so how can we ensure that we are utilizing the proper data to begin with? To do so, we must be able to answer the following data quality questions:
- Is the data accurate?
- Is the data timely?
- Is the data complete?
- Is the data consistent?
- Is the data relevant to the decision?
- Is the data fit for use?
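Several of these questions lend themselves to automated checks. The sketch below is a minimal, hypothetical illustration (the record layout, field names, and thresholds are assumptions, not from any particular system) of how completeness, timeliness, and consistency might be tested against staged records:

```python
from datetime import date, timedelta

# Hypothetical customer records as they might land in a staging area.
records = [
    {"id": 1, "email": "ann@example.com", "country": "US", "updated": date(2015, 6, 1)},
    {"id": 2, "email": "", "country": "US", "updated": date(2013, 1, 15)},
    {"id": 3, "email": "bob@example.com", "country": "USA", "updated": date(2015, 6, 2)},
]

def check_completeness(rows, field):
    """Flag rows where a required field is missing or empty."""
    return [r["id"] for r in rows if not r.get(field)]

def check_timeliness(rows, field, max_age_days, today):
    """Flag rows not updated within the allowed window."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["id"] for r in rows if r[field] < cutoff]

def check_consistency(rows, field, allowed):
    """Flag rows whose value falls outside the agreed reference set."""
    return [r["id"] for r in rows if r[field] not in allowed]

today = date(2015, 6, 30)
print(check_completeness(records, "email"))                 # [2] -- no email
print(check_timeliness(records, "updated", 365, today))     # [2] -- stale record
print(check_consistency(records, "country", {"US", "CA"}))  # [3] -- "USA" vs "US"
```

Accuracy, relevance, and fitness for use are harder to automate; they require business context, which is exactly the point of the sections that follow.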
This challenge has been complicated further by exponential data growth. Some studies suggest that up to 90 percent of the world's data has been created in the past two years alone, and the trend is accelerating, making data quality assurance even more challenging. Increasing complexity in the data landscape compounds the problem. Most corporations have a variety of software applications and databases scattered across multiple heterogeneous platforms, connected by a spider web of point-to-point interfaces that move data back and forth. This includes ERP solutions and externally hosted SaaS solutions. The result is usually a high level of data redundancy and inconsistency.
In response to this, many of the ETL routines mentioned above usually perform some type of data cleansing to make the data usable at the point of consumption. However, this can be quite risky if we do not truly understand the data and the changes that have occurred on its journey through the organization's systems.
This is analogous to the problems that plagued manufacturing production lines prior to the early 1980s: complex products were built from thousands of parts and sub-assemblies, then inspected for quality conformance after they rolled off the assembly line. Inspection does not improve a product; it merely identifies the defects that need to be addressed. Defective items were scrapped or reworked at significant cost, but the origin of the defects often went undetected, so the problems persisted. The quality movement of the 1980s addressed this on many fronts; a few are highlighted here because they are directly relevant to our discussion of data:
- Validation of the inputs to every discrete process, preventing usage of defective components
- Traceability of components and subassemblies within finished goods to point of origin
- Empowerment of front-line workers to address problems, even if it meant halting the entire production line
- Continuous improvement of all processes
Unlike defects in physical products, defects in data can be extremely difficult to detect and identify. However, we can apply the approach and lessons learned from the quality discipline. To succeed, a collaborative culture must be established with a commitment to data quality, from senior executives through to the front-line workers who create and modify data on a daily basis. Procedures must be put in place to ensure that data is accurately captured and recorded as it is created and modified through each business process. Workers must be empowered to correct any data that is wrong as part of their daily job function (with proper audit trails). If data originates outside the organization, it must be validated prior to use. Data governance and stewardship must be established so that responsibilities are understood and agreed to by all parties.
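The "correct data, with an audit trail" requirement can be made concrete with a small sketch. Everything here is illustrative, not a prescribed design: the record shape, the `correct_field` helper, and the log fields are assumptions about what such a mechanism might capture (who changed what, the before and after values, and why):

```python
import datetime

# Illustrative in-memory audit trail; a real system would persist this.
audit_log = []

def correct_field(record, field, new_value, user, reason):
    """Apply a correction and record who changed what, from what, and why."""
    audit_log.append({
        "record_id": record["id"],
        "field": field,
        "old": record[field],
        "new": new_value,
        "user": user,
        "reason": reason,
        # Fixed timestamp for illustration; use datetime.datetime.now() in practice.
        "at": datetime.datetime(2015, 6, 30, 12, 0),
    })
    record[field] = new_value
    return record

customer = {"id": 3, "country": "USA"}
correct_field(customer, "country", "US", user="jdoe",
              reason="Standardize to two-letter country code")
print(customer["country"])   # US
print(audit_log[0]["old"])   # USA
```

The point of the audit entry is traceability: the correction itself is empowered at the front line, while the trail preserves the origin of the change for later review.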
The primary challenge is first to understand and map the current data landscape, while retaining the flexibility to adapt and update that map as the business and the underlying landscape continue to change over time. The most effective means of doing so is through data models that describe the data (and metadata), together with process models that describe the business processes that create, consume and change the data. This allows data to be understood in context, and is the basis for identifying redundancy and inconsistency. All manifestations of each critical business data object must be identified and cataloged. Typically the most critical business data objects are also master data, as they are utilized in most transactions (for example: customer, product, location, employee). Without context, it is extremely difficult to ensure that the proper data is being utilized for reporting and analytical purposes, and hence, decision making. To complete the understanding, the models must be supported by integrated business glossaries and terms owned by the business stakeholders responsible for each area. It is imperative that the business team can use tools that allow them to collaborate amongst themselves, as well as with the technical staff assisting them.
Business analysts, data analysts, modelers and architects build the required conceptual and logical models in continual consultation with business stakeholders. Physical data models describe the underlying system implementations, including data lineage. When combined with data flows, true enterprise data lineage can be understood and documented. At this point we have established true traceability, which is vital for comprehension.
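Once lineage is documented, it can be treated as a graph and walked programmatically. The sketch below is a simplified illustration under assumed names (the `mart`, `warehouse`, and `erp` elements are hypothetical, as is the `trace_to_origin` helper); it shows how a reporting field might be traced back to its point of origin:

```python
# Hypothetical lineage map: each target element lists the upstream
# elements it is derived from, across systems.
lineage = {
    "mart.sales_report.revenue": ["warehouse.fact_sales.amount"],
    "warehouse.fact_sales.amount": ["erp.orders.line_total"],
    "erp.orders.line_total": [],  # no upstream source: point of origin
}

def trace_to_origin(element, lineage):
    """Walk the lineage map upstream to the elements with no further source."""
    sources = lineage.get(element, [])
    if not sources:
        return [element]
    origins = []
    for source in sources:
        origins.extend(trace_to_origin(source, lineage))
    return origins

print(trace_to_origin("mart.sales_report.revenue", lineage))
# ['erp.orders.line_total']
```

This is the "traceability to point of origin" idea from the manufacturing analogy applied to data: when a report figure looks wrong, the lineage graph tells us exactly which upstream elements to inspect.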
To achieve true collaboration, all of the models, metadata and glossaries must be integrated through a common repository. Approved artifacts need to be published in a medium that is easily consumed, typically through a web-based user interface. In addition, the models themselves become the means to analyze, design, evaluate and implement changes going forward.
Due to the size and complexity of most environments, this must be done on a prioritized basis, starting with the most critical business data objects. Metrics are established to quantify relative importance as well as to evaluate progress. As with any continuous improvement initiative, breadth and depth are increased incrementally. Establishing a data culture and improving data quality is not a one-time project. It is an ongoing discipline that, when executed correctly, delivers breakthrough results and competitive advantage.
Image credit: Sergey Nivens/Shutterstock
Ron Huizenga is Product Manager -- ER/Studio, Embarcadero Technologies Inc. He has over 30 years of experience as an IT executive and consultant in Enterprise Data Architecture, Governance, Business Process Reengineering and Improvement, Program/Project Management, Software Development and Business Management. His experience spans multiple industries including manufacturing, supply chain, pipelines, natural resources, retail, health care, insurance, and transportation. He has a business degree from the University of Calgary and has also attained Six Sigma Black Belt certification.