The Importance of Data Quality in Healthcare

The Greek philosopher Plato once said, “Doctors cut, burn, and torture the sick—and demand payment as if they’d done a great service.” To be fair, medicine in Plato’s time was, unquestionably, nasty and brutish, and it probably shortened more than a few lives, but things have changed a great deal since those ancient times. Calling some of the results of today’s medical procedures “miraculous” is hardly hyperbole; they literally save lives. That replacing an organ in a human being is almost a standard operating procedure in most midsize U.S. hospitals today is nothing short of miraculous. In many ways, data is the backbone of these “miracles” occurring on operating tables in hospitals around the world every day.

High-quality data is crucial for accurate diagnoses, effective treatments, and improved patient outcomes. The consequences of poor data quality can be catastrophic; a misdiagnosis can cost a life. Data quality is also essential for healthcare organizations to remain profitable and to achieve their biggest goal: improving the lives of their patients. So, what exactly is data quality, and how can healthcare companies improve it?

What is Data Quality?

Data quality refers to the overall usefulness of a dataset, as determined by the following dimensions (a short measurement sketch follows the list):

  1. Accuracy: Data accurately represents real-world values (e.g., correct customer names, addresses, etc.).
  2. Completeness: No missing or null values in critical fields (e.g., all required form fields filled in).
  3. Consistency: Data is uniform across different systems (e.g., the same customer ID matches in all databases).
  4. Timeliness: Data is up-to-date and available when needed (e.g., real-time stock prices).
  5. Validity: Data conforms to defined rules (e.g., email formats, age ranges).
  6. Uniqueness: No unnecessary duplicates (e.g., no repeated customer records).
  7. Integrity: Data maintains correct relationships (e.g., foreign keys in databases link properly).
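To make a few of these dimensions concrete, here is a minimal, hypothetical sketch of how completeness, validity, and uniqueness could be measured on a small tabular dataset with pandas. The column names, sample data, and format rule are illustrative assumptions, not a standard.

```python
import pandas as pd

# Illustrative (hypothetical) patient roster; column names are assumptions.
df = pd.DataFrame({
    "patient_id": ["P001", "P002", "P002", "P004"],
    "email":      ["a@example.org", None, "b@example.org", "not-an-email"],
    "birth_date": ["1980-01-02", "1975-06-30", "1975-06-30", None],
})

# Completeness: share of non-null values in critical fields.
completeness = df[["email", "birth_date"]].notna().mean()

# Validity: share of non-null emails matching a simple format rule.
email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
validity = df["email"].dropna().str.match(email_pattern).mean()

# Uniqueness: share of rows whose patient_id is not a duplicate.
uniqueness = 1 - df["patient_id"].duplicated().mean()

print(completeness)           # per-column completeness ratios
print(round(validity, 2))     # 0.67 -> one of three emails is malformed
print(round(uniqueness, 2))   # 0.75 -> one duplicated patient_id
```

In practice these scores would be tracked over time per dataset, so a drop in any dimension can be caught before it reaches a dashboard or a clinician.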

For IBM, data quality “is critical to all data governance initiatives within an organization.” High-quality data is essential for businesses to make informed decisions, ensure operational efficiency, and remain compliant with regulations.

IBM believes, “Data quality solutions exist to help companies maximize the use of their data.” The three key benefits of strong data quality are better decision making, improved business processes, and increased customer satisfaction. “High quality data allows organizations to identify key performance indicators (KPIs) to measure the performance of various programs, which allows teams to improve or grow them more effectively,” says IBM.

The First Rule of Healthcare Data: Do No Harm

Although attributed to the ancient Greek physician Hippocrates, “First, do no harm” is not part of the Hippocratic Oath. Rather, it is taken from another of Hippocrates’s works, Of the Epidemics. As Dr. Robert H. Shmerling explains in his article First, do no harm, “Yes, the pledger commits to avoiding harm, but there’s nothing about making it a top priority.” Shmerling argues that if physicians took the phrase literally, no one would have surgery, even if it was lifesaving. Even something as simple as a blood test would be avoided because of the pain, bruising, and/or bleeding that drawing blood entails, says Shmerling. Taken literally, these are all avoidable harms, yet avoiding them would clearly do more damage than good.

While honoring the “First, do no harm” dictum in medicine requires weighing risk against reward and cost against benefit, no such trade-off applies to data. With healthcare data, the dictum can be taken at face value: ensuring a healthcare company’s data is of the highest quality, and therefore will do no harm to a patient, is paramount. In an industry dealing with literal life-and-death issues every day, trustworthy data leads to stronger business intelligence, which feeds more accurate data visualization dashboards, which improves analytics and analysis, which makes diagnoses more accurate and, ultimately, saves lives.

Three Key Benefits of Strong Data Quality

Strong data quality provides a competitive advantage to any healthcare provider. As Taylor Larsen explains in his article, How to Run Analytics for More Actionable, Timely Insights: A Healthcare Data Quality Framework, “Healthcare organizations increasingly understand the value of data quality, but many lack a systematic process for establishing and maintaining that quality.”

Larsen recommends healthcare companies measure and monitor their data quality by using the following four-level framework:

  1. Think of data as a product.  
  2. Address structural data quality first.
  3. Define content level data quality with subject matter experts (SMEs).
  4. Create a coalition for multidisciplinary support.

This framework ensures data is fit to drive sound, data-informed decisions. Larsen believes that creating a fit-for-purpose healthcare data product requires analytics professionals to prioritize data quality at the beginning of the data pipeline and then to shepherd that quality as the data is used. In healthcare, more so than in almost any other industry, it is important to confirm that all data accurately represents its original source, contends Larsen.


Think of Data as a Product

Healthcare companies should think of data as a product, which means managing and delivering data with the same discipline, quality control, and user-centric approach applied to a software product or a physical good. Data is no longer a byproduct of a system; it is recognized as a strategic asset driving business value. “Data results from a process or system that assesses and treats its quality throughout,” argues Larsen. He compares it to how a car evolves from raw materials into parts, which are assembled on a factory floor. The finished car ends up on a dealership floor and maybe even in the pages of a car magazine. “To progress successfully through the automobile manufacturing, sales, and evaluation process, car makers need quality raw materials (e.g., body and engine parts) to take their vehicles from concept to the consumer,” contends Larsen.

“To create a fit-for-purpose healthcare data product, analytics professionals need to prioritize quality at the beginning of the data pipeline and shepherd that quality as data traverses the system. In healthcare, it is important to confirm that data is an accurate representation of its source (e.g., EMR, payer/claims, costing, human resources, etc.),” says Larsen.

Create Structure Around Your Data

Data should be intentionally designed for specific consumers (analysts, ML models, business teams, etc.). Data ownership and accountability should be instilled throughout the organization. The data product owner should ensure the quality of his or her data as well as its timeliness and usability. Data should always be standardized and reusable: all data should follow consistent schemas, metadata, and access patterns, and should carry quality metrics (i.e., completeness, freshness, accuracy, etc.).
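One lightweight way to treat a dataset as a product is to publish a small “data contract” alongside it, naming an accountable owner, the expected schema, and the quality thresholds it promises to meet. The sketch below is a hypothetical illustration of that idea; the dataset, field names, and thresholds are assumptions, not an established standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    """Hypothetical data-product contract for a published dataset."""
    name: str
    owner: str                       # accountable data product owner
    schema: dict                     # column name -> expected type
    min_completeness: float = 0.98   # share of non-null values required
    max_age_hours: int = 24          # freshness promise
    tags: list = field(default_factory=list)

# Example contract for an illustrative encounters dataset.
encounters_contract = DataContract(
    name="patient_encounters",
    owner="clinical-analytics-team",
    schema={"encounter_id": "string", "patient_id": "string",
            "admit_ts": "timestamp", "icd10_code": "string"},
    tags=["EHR", "self-service"],
)

def meets_contract(completeness: float, age_hours: int,
                   contract: DataContract) -> bool:
    """Check observed quality metrics against the published promises."""
    return (completeness >= contract.min_completeness
            and age_hours <= contract.max_age_hours)

print(meets_contract(completeness=0.995, age_hours=6,
                     contract=encounters_contract))  # True
```

Publishing the contract with the data makes ownership, schema, and quality expectations visible to every consumer, which is the essence of the data-as-a-product mindset.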

As much as possible, users should be able to access data in a self-service way; this makes data more discoverable, usable, and, ultimately, more valuable. A customer behavior dataset, for example, should be clean, well documented, and formatted for an analytics team. “Health systems struggle to move to higher data quality levels if the data is not first structurally sound,” warns Larsen. Without a strong foundation, data loses its value, and sometimes, as in the case of IBM’s AI-powered Watson for Oncology, it craters the value of an asset.

Define Content Level Data Quality with Subject Matter Experts

Healthcare organizations must identify data subject matter experts (SMEs) to define content-level data quality. These experts understand the content and can tailor data definitions accordingly. The table below shows how a structured approach to defining and enhancing data quality with SMEs might look:

| Data Quality Dimension | SME-Driven Definition | Example Rule |
| --- | --- | --- |
| Accuracy | Data reflects real-world values. | “Patient weight must be within a clinically plausible range (e.g., 2-500 lbs).” |
| Completeness | No missing mandatory fields. | “Insurance claim must include ICD-10 code and provider NPI.” |
| Consistency | Data aligns across systems. | “Patient’s birth date in EHR must match billing system.” |
| Timeliness | Data is up-to-date. | “Lab results must be entered within 24 hours of test completion.” |
| Validity | Data follows defined formats. | “Phone numbers must match (XXX) XXX-XXXX pattern.” |
| Uniqueness | No unintended duplicates. | “No two patients can have the same MRN (Medical Record Number).” |

Data integrity acts as the foundation tying all of these data quality dimensions together. SMEs contribute to integrity by defining immutable data, fields that should never change after creation (e.g., patient ID, transaction timestamps). They establish audit trails, specifying who is allowed to modify data and how it can be changed, which ensures full traceability. They ensure relationships between datasets remain intact; for example, a prescription must link to a valid patient ID. SMEs also help define role-based data access controls. While SMEs define what makes data valid, integrity ensures how it stays valid over time. Together, they prevent errors, detect anomalies, and correct issues when things go awry, as the sketch below illustrates.
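The sketch below turns a few of the example rules from the table into executable checks. It is a minimal illustration that assumes plain Python dictionaries for records and hypothetical field names; real systems would typically enforce such rules in the database, the ETL pipeline, or a dedicated data quality tool.

```python
import re

def check_record(patient, claim, ehr_birth_date, billing_birth_date,
                 known_mrns, valid_patient_ids, prescription):
    """Apply a handful of SME-defined content and integrity rules."""
    issues = []

    # Accuracy: clinically plausible weight range.
    if not (2 <= patient["weight_lbs"] <= 500):
        issues.append("weight outside plausible range")

    # Completeness: mandatory claim fields.
    if not claim.get("icd10_code") or not claim.get("provider_npi"):
        issues.append("claim missing ICD-10 code or provider NPI")

    # Consistency: birth date must match across systems.
    if ehr_birth_date != billing_birth_date:
        issues.append("EHR and billing birth dates disagree")

    # Validity: phone number format (XXX) XXX-XXXX.
    if not re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", patient["phone"]):
        issues.append("phone number has invalid format")

    # Uniqueness: MRN must not already exist.
    if patient["mrn"] in known_mrns:
        issues.append("duplicate MRN")

    # Integrity: prescription must reference a valid patient ID.
    if prescription["patient_id"] not in valid_patient_ids:
        issues.append("prescription references unknown patient")

    return issues

# Hypothetical example: a record that passes every rule returns no issues.
print(check_record(
    patient={"weight_lbs": 180, "phone": "(410) 555-0100", "mrn": "MRN-9"},
    claim={"icd10_code": "E11.9", "provider_npi": "1234567890"},
    ehr_birth_date="1975-06-30", billing_birth_date="1975-06-30",
    known_mrns={"MRN-1", "MRN-2"}, valid_patient_ids={"P-1"},
    prescription={"patient_id": "P-1"},
))  # -> []
```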

Create a Coalition

“Typically, organizations take a grassroots approach to data quality by addressing it within individual projects or department silos. However, creating a data quality coalition brings together organizational leaders, managers, subject matter experts, and analytics professionals—all with a vested and shared interest in ensuring data quality because it facilitates better decisions,” claims Larsen. Standard approaches are agreed upon to advance proven processes and avoid spending resources reinventing the wheel, adds Larsen. The coalition must have buy-in and support from the highest levels of the organization, and a data governance council should be created within it.

The Data Governance Council

This is a cross-functional leadership body responsible for overseeing an organization’s data governance strategy, policies, and standards. It ensures that data is managed as a strategic asset, aligned with business goals, compliance requirements, and risk management.

Key Responsibilities:

  1. Establish Rules for Data Ownership: Ensure quality, security, and lifecycle management of corporatewide data systems.
  2. Align Data Strategy with Business Goals: Ensure data initiatives support organizational objectives.
  3. Resolve Data Issues & Conflicts: Act as a mediator for data disputes (e.g., conflicting definitions of “customer lifetime value” between departments).
  4. Oversee Compliance & Risk Management: Ensure adherence to regulations (GDPR, HIPAA, CCPA) and mitigate data-related risks (e.g., data breaches, poor-quality analytics).
  5. Assign Data Stewards & Owners: Appoint SMEs to manage data domains as needed.
  6. Monitor Data Quality & Integrity: Review metrics and enforce corrective actions.

Waste Not, Want Not…

In its Waste in the US Health Care System: Estimated Costs and Potential for Savings, the National Library of Medicine (NLM) found, “In this review based on 6 previously identified domains of health care waste, the estimated cost of waste in the US health care system ranged from $760 billion to $935 billion, accounting for approximately 25% of total health care spending, and the projected potential savings from interventions that reduce waste, excluding savings from administrative complexity, ranged from $191 billion to $286 billion, representing a potential 25% reduction in the total cost of waste. Implementation of effective measures to eliminate waste represents an opportunity to reduce the continued increases in US health care expenditures.”

Much of this waste comes from inefficient or misguided rules: excessive paperwork, redundant processes, and bureaucratic hurdles that increase costs, delay care, and frustrate both providers and patients. Something as simple as payers failing to standardize forms can consume a physician’s limited time by making billing procedures needlessly complicated.

Cost of wasted spending: $760B-$935B, the total annual cost of waste in US healthcare. The review behind this estimate yielded 71 estimates from 54 unique peer-reviewed publications, government-based reports, and reports from the gray literature.

Good Data Saves Money

In the same report, the NLM found, “The projected potential savings from interventions that reduce waste, excluding savings from administrative complexity, ranged from $191 billion to $286 billion, representing a potential 25% reduction in the total cost of waste.” The administrative complexity category, which the NLM considers the greatest contributor to waste, is a direct result of the fragmentation of the healthcare system. Recent proposals, like Medicare’s Blue Button 2.0 initiative, which allows Medicare beneficiaries to securely access and share their health claims data with third-party applications, researchers, or caregivers, should alleviate some of this burden as information flows more freely and billing and authorization processes become more automated.

Understanding Data Quality Issues

Common data quality concerns include inaccurate data entry, inconsistent data formats, missing data, and duplicate records. Data quality is shaped by technical, organizational, behavioral, and environmental factors. Data quality issues can lead to incorrect diagnoses, inappropriate treatments, and patient harm. Poor data quality can result in inefficient care delivery, compromised patient safety, and heavy financial losses.


Data quality can be poor for a number of reasons. Inaccurate data entry arises from human error, miscommunication, or outdated data entry methods. Inconsistent data formats hinder interoperability and make it difficult to share and analyze information accurately. Missing data can lead to incomplete patient histories, impacting clinical decision-making and patient care. Duplicate records can lead to redundant tests or conflicting treatment plans. Organizational factors, such as a lack of data governance and standardization, can also cause problems, as can behavioral factors such as human error and insufficient training. Environmental factors, such as regulatory requirements and a changing healthcare landscape, add further, difficult-to-foresee complications.

Case Study: The Steep Price of Bad Data

According to Henrico Dolfing’s Case Study 20: The $4 Billion AI Failure of IBM Watson for Oncology, “IBM marketed Watson for Oncology as a revolutionary tool that could bridge the gap between cutting-edge research and clinical practice. Its promise was to assist oncologists in identifying personalized treatment options for patients, thereby improving outcomes and reducing variability in care.” IBM set five goals for the system:

  1. Streamline Clinical Decision-Making
  2. Bridge Knowledge Gaps
  3. Improve Patient Outcomes
  4. Expand Access to Expertise
  5. Establish Market Leadership

However, IBM fell far short of these lofty goals. Human error was a major factor in the failure of the IBM Watson for Oncology system. In their article, IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments, internal documents show, Casey Ross and Ike Swetlitz explain how the system produced flawed cancer treatment advice and how company medical specialists and customers identified “multiple examples of unsafe and incorrect treatment recommendations” even as IBM was promoting the product to hospitals and physicians around the world.

Ross and Swetlitz claim the problem occurred because of the training data input into the system by IBM engineers and doctors at the renowned Memorial Sloan Kettering Cancer Center. “The software was drilled with a small number of ‘synthetic’ cancer cases, or hypothetical patients, rather than real patient data. Recommendations were based on the expertise of a few specialists for each cancer type, the documents say, instead of ‘guidelines or evidence,’” say Ross and Swetlitz.

This was a profound human error, one that ultimately led IBM to sell off its AI-powered Watson Health operation to the private equity firm Francisco Partners for a reported price of more than $1 billion. The error was man-made, and it clearly demonstrated the importance of data quality in healthcare.

Realistic Expectations, Rigorous Validation, and End-user Involvement

As Henrico Dolfing explains in his IBM Watson for Oncology case study, “While IBM’s vision was ambitious, its execution fell short, underscoring the challenges of applying AI in complex, high-stakes domains. Moving forward, the healthcare industry must balance optimism about AI’s potential with a commitment to patient safety and ethical responsibility.” IBM Watson for Oncology’s failure “offers valuable lessons for AI projects in healthcare and beyond. It highlights the importance of realistic expectations, rigorous validation, and end-user involvement in developing and deploying AI solutions.”

Data Quality Improvement Strategies

Healthcare companies must implement robust data governance frameworks and data standardization processes. They should leverage data quality tools, such as automated data cleansing and data validation, and enhance data collection and entry processes to minimize errors and inconsistencies. Poor data quality can lead to flawed insights, financial losses, and reputational damage, making it crucial for healthcare organizations to prioritize data quality management. They can do so in the following ways:

  • Data Cleaning: Fix errors, remove duplicates, and fill missing values.
  • Validation Rules: Enforce input checks (e.g., mandatory fields, format checks).
  • Automated Monitoring: Use tools to detect anomalies in real time.
  • Standardization: Apply consistent formats (e.g., dates as YYYY-MM-DD).
  • Governance Policies: Define ownership, roles, and responsibilities for data management.

Healthcare companies should implement data quality management processes to ensure reliable and trustworthy information for decision-making. Data errors can be minimized through automated data cleansing and validation, as in the sketch below. Healthcare companies should also provide training and guidelines to staff involved in data entry so they understand the importance of data quality.
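As a rough illustration of what automated cleansing and standardization can look like in practice, the sketch below deduplicates records, normalizes dates to YYYY-MM-DD, and flags rows missing mandatory fields. It assumes a pandas DataFrame with hypothetical column names; production pipelines would typically run dedicated data quality tooling against the warehouse or EHR feed.

```python
import pandas as pd

def cleanse(records: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleansing pass: dedupe, standardize dates, flag gaps."""
    cleaned = records.copy()

    # Uniqueness: drop records sharing an MRN, keeping the first occurrence.
    cleaned = cleaned.drop_duplicates(subset=["mrn"], keep="first")

    # Standardization: coerce mixed date formats to YYYY-MM-DD.
    cleaned["birth_date"] = (
        pd.to_datetime(cleaned["birth_date"], errors="coerce")
          .dt.strftime("%Y-%m-%d")
    )

    # Completeness: flag rows missing mandatory fields for manual review.
    mandatory = ["mrn", "birth_date", "primary_provider"]
    cleaned["needs_review"] = cleaned[mandatory].isna().any(axis=1)

    return cleaned

# Hypothetical input with a duplicate MRN and a US-style date format.
raw = pd.DataFrame({
    "mrn": ["A1", "A1", "B2"],
    "birth_date": ["06/30/1975", "06/30/1975", None],
    "primary_provider": ["Dr. Lee", "Dr. Lee", "Dr. Patel"],
})
print(cleanse(raw))
```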

Good Data Saves Lives

While bad data can result in billion-dollar failures like IBM’s Watson for Oncology, AI-driven predictive analytics models built on good data can help predict diseases before any symptoms appear. This is exactly what happened for patients with sepsis at Johns Hopkins University in 2022. In her article, Sepsis-detection AI has the potential to prevent thousands of deaths, Laura Cech states, “Patients are 20% less likely to die of sepsis because of a new AI system developed at Johns Hopkins University that catches symptoms hours earlier than traditional methods.” The system identifies patients at risk of the life-threatening complications of an illness that is notoriously difficult to detect, significantly cutting patient mortality from one of the top causes of hospital deaths worldwide, says Cech.

Sepsis is a life-threatening medical emergency that occurs when the body’s response to an infection spirals out of control, leading to tissue damage, organ failure, and sometimes death if not treated quickly enough. According to Cech, “About 1.7 million adults develop sepsis every year in the United States and more than 250,000 of them die.” Although all sepsis cases are eventually diagnosed, the risk of dying from the disease increases by as much as 8% for every hour of delayed treatment, and approximately 30% of patients diagnosed with severe sepsis succumb to it.

Early Detection

Early detection can improve outcomes, but it has been challenging because systems that can accurately diagnose the disease in its early stages have been lacking, says Cech. To address this problem, Suchi Saria, founding research director of the Malone Center for Engineering in Healthcare at Johns Hopkins, developed a targeted real-time early warning system for sepsis. “Combining a patient’s medical history with current symptoms and lab results, the machine-learning system shows clinicians when someone is at risk for sepsis and suggests treatment protocols, such as starting antibiotics,” says Cech. The AI tracks patients from arrival to discharge, ensuring all critical information is captured, explains Cech, adding that data was collected from more than 4,000 clinicians treating 590,000 patients.

“It is the first instance where AI is implemented at the bedside, used by thousands of providers, and where we’re seeing lives saved,” said Saria. The AI caught 82% of sepsis cases and was accurate nearly 40% of the time; previous attempts to use electronic tools to detect sepsis caught less than half as many cases and were accurate only 2% to 5% of the time. “This is an extraordinary leap that will save thousands of sepsis patients annually. And the approach is now being applied to improve outcomes in other important problem areas beyond sepsis,” concluded Saria.

Accenture sees Artificial Intelligence (AI) as healthcare’s new nervous system and a self-running engine for growth. According to Accenture analysis, “when combined, key clinical health AI applications can potentially create $150 billion in annual savings for the US healthcare economy by 2026.”

AI and machine learning can improve data quality by detecting inconsistencies and ensuring interoperability across systems. Patient-generated data and real-time data can provide valuable insights into patient outcomes and healthcare. Electronic health records and electronic medical records can improve data quality by automating data capture and reducing manual errors.

Top 10 AI Technologies

[Figure: Top ten AI technologies]

Data: The Building Blocks of Medicine

Medicine has, of course, changed radically since Plato’s day. One can only ponder the brutality of what medicine was like back then. Hippocrates, the man considered the father of medicine, once said, “The greatest medicine of all is teaching people how not to need it.” It’s great advice, but the revolution in medicine has made it seem rather quaint, almost naive. Some of the things being done in medicine today would have been considered miracles in Hippocrates’s time.

“Data, do no harm” is the first rule of data quality. When the stakes are, literally, life and death, this rule is a must. Accurate and timely data ensure that healthcare professionals have the right information at the right time to produce their “miracles.” High-quality data reduces the risk of errors and improves overall patient outcomes. Inaccurate or incomplete patient data can compromise patient safety, while reliable data helps doctors reach an accurate diagnosis of a patient’s ailment and then provide evidence-based treatment. “Good and transparent data quality instills confidence in the insight provided, which accelerates sound decision making. Conversely, poor data quality degrades confidence, ultimately delaying or leading to wrong decisions,” claims Larsen.

In the end, Hippocrates’ timeless principle—First, do no harm—applies just as much to data as it does to medicine. In an era where algorithms diagnose diseases and robots assist in surgery, the quality of our data determines the quality of our care. The future of healthcare isn’t just in new technologies, but in the disciplined, ethical stewardship of the data that powers them.

So, while Plato might still scoff at modern medicine’s bills, even he’d have to admit, when data is trusted, curated, and used wisely, it doesn’t just heal—it transforms.