Data quality: origin, processing and storage

By Patrick Centeno On Jan 25, 2022

The complexity of how data is captured, mapped and transmitted means that its accuracy tends to decrease over time. Moreover, computer systems designed to ensure data integrity have little control over the quality of data that has been compromised at source or corrupted during transmission.

The bullwhip effect

Systems scientist Jay Wright Forrester theorized in 1961 that interpretations of consumer demand can be distorted as information travels upstream between distributors, wholesalers, and producers. Known as the Forrester effect or the bullwhip effect, the concept first gained prominence for tracing the movement of products. Over time, these magnifying effects have proven true for IT and healthcare as well. Simply put, the principle states that an unexpected fluctuation in user activity can cause service providers to overreact by exaggerating user needs, resulting in an overproduction of systems and data that are ultimately wasted through lack of use.
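The amplification described above can be illustrated with a toy simulation. This is a minimal sketch, not Forrester's original model: it assumes each tier orders what it just observed plus a correction proportional to the latest change in demand. The function name `upstream_orders` and the `overreaction` factor are hypothetical, chosen only for illustration.

```python
def upstream_orders(demand, overreaction=1.0):
    """Orders placed by one tier: observed demand plus a reaction to its latest change.

    ASSUMPTION: the linear overreaction rule is illustrative, not taken from the article.
    """
    orders = [demand[0]]
    for prev, curr in zip(demand, demand[1:]):
        # Never order a negative quantity.
        orders.append(max(0.0, curr + overreaction * (curr - prev)))
    return orders


# A single, modest step in consumer demand (100 -> 110 units).
consumer = [100, 100, 110, 110, 110, 110]

tiers = {"consumer": consumer}
signal = consumer
for name in ("distributor", "wholesaler", "producer"):
    signal = upstream_orders(signal)  # each tier reacts to the tier below it
    tiers[name] = signal

# Swing = gap between the largest and smallest order seen at each tier.
swings = {name: max(series) - min(series) for name, series in tiers.items()}
for name, swing in swings.items():
    print(f"{name:12s} swing: {swing}")
```

Under these assumptions, a 10-unit change at the consumer becomes a 120-unit swing by the time it reaches the producer, which is the kind of overreaction the principle describes.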

In the expanding digital world, data quality is largely determined and managed by system design. Users of clinical systems tend to overstate their computing requirements at the outset and then ask to scale them back during the testing phase, once they realize that what was initially requested would be of little practical benefit.

It is also difficult to assess the relevance or usefulness of data generated from patient activity demand projections, the scope of which also brings data storage and system performance issues to the fore.

Big data, shape and size

Aspects of the world that cannot be personally experienced can be understood through data. More and more data is generated from digitization, devices, artificial intelligence and machine learning. Data has been described as the ‘new oil’, ‘the currency of our times’ and ‘reducing uncertainty’. Yet big data is not only about the size of the data, but also about the many correlations and relational links that add to its complexity.

As data grows exponentially, discussions of data will move from gigabytes (10⁹ bytes) and terabytes (10¹² bytes) to petabytes (1 PB = 2¹⁰ TB, or about 10¹⁵ bytes), exabytes (1 EB = 2¹⁰ PB, or about 10¹⁸ bytes) and zettabytes (1 ZB = 2¹⁰ EB, or about 10²¹ bytes), and will eventually take into account capacity in yottabytes (1 YB = 2¹⁰ ZB, or about 10²⁴ bytes).
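The two ways of reading these units, powers of ten and steps of 2¹⁰, are close but not identical, and the gap widens with each step up the scale. The short sketch below compares them; the helper names `decimal_bytes` and `binary_bytes` are illustrative only.

```python
# Units in ascending order, starting at the gigabyte.
UNITS = ["GB", "TB", "PB", "EB", "ZB", "YB"]


def decimal_bytes(unit):
    """Bytes in one unit using powers of ten (GB = 10**9)."""
    return 10 ** (9 + 3 * UNITS.index(unit))


def binary_bytes(unit):
    """Bytes in one unit when each step up is a factor of 2**10 (GB read as 2**30)."""
    return 2 ** (30 + 10 * UNITS.index(unit))


for unit in UNITS:
    ratio = binary_bytes(unit) / decimal_bytes(unit)
    print(f"1 {unit}: decimal 10^{9 + 3 * UNITS.index(unit)}, binary reading is {ratio:.3f}x larger")
```

The difference is roughly 7% at the gigabyte level and grows to about 21% for the yottabyte, so it is worth stating which convention is meant when quoting capacities at this scale.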

While big data is exciting, it also presents a myriad of management challenges. The unprecedented growth of data draws attention to structural attributes such as velocity (capture), volume (increment), valence (complexity), veracity (accuracy), variety (variability) and value (importance).

Considerations for data processing decisions include retrievability, reliability, performance, and cost, while privacy and security continue to act as inhibiting factors for data expansion strategies.

The volume of personal data stored in the cloud has increased dramatically, and organizations are following suit by adopting hybrid approaches that combine on-premises and cloud solutions, dictated by the practicalities of retrievability, security, and performance.

Emerging Storage Technologies

After the industrial revolution, the volume of data doubled every ten years. After 1970 it doubled every three years and today it doubles every two years. Global data that was created and copied reached 1.8 ZB in 2011 and was estimated to reach 40 ZB by 2020.
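As a quick consistency check, the two volume figures quoted above can be used to work out the doubling period they imply. This is only back-of-the-envelope arithmetic on the article's own numbers, assuming smooth exponential growth between 2011 and 2020.

```python
import math

# Figures quoted in the text.
start_year, start_zb = 2011, 1.8
end_year, end_zb = 2020, 40.0

# Number of times the volume must double to grow from 1.8 ZB to 40 ZB.
doublings = math.log2(end_zb / start_zb)

# ASSUMPTION: growth is smoothly exponential over the whole period.
doubling_time = (end_year - start_year) / doublings

print(f"{doublings:.2f} doublings, one every {doubling_time:.2f} years")
```

The result comes out at about two years per doubling, which agrees with the rate stated for the present day.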

Big data storage does not necessarily mean thinking bigger; it could mean the opposite: “thinking smaller”. Next-generation storage research is looking to the structure of DNA itself, which promises far more capacity to hold the world’s data.

Given that a dalton weighs 1.67 × 10⁻²⁴ grams and the human genome weighs 3.59 × 10⁻¹² grams (3.59 picograms), the culmination of this work could mean that all the data estimated to exist in the world in 2020 could fit in about 90 g of DNA-based storage.
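The 90 g figure can be roughly reproduced from the dalton mass quoted above. The sketch below is an estimate under stated assumptions, not the calculation behind the original claim: it assumes an average nucleotide mass of about 330 daltons and a theoretical maximum of two bits per nucleotide, with no allowance for redundancy or error correction. Neither assumption comes from the article.

```python
DALTON_G = 1.67e-24          # grams per dalton, as quoted in the text
NUCLEOTIDE_DA = 330.0        # ASSUMPTION: average mass of one nucleotide, in daltons
BITS_PER_NUCLEOTIDE = 2.0    # ASSUMPTION: four bases encode two bits, no overhead
WORLD_DATA_BYTES = 40e21     # 40 ZB, the 2020 estimate quoted in the text


def grams_of_dna(n_bytes):
    """Mass of DNA needed to hold n_bytes under the assumptions above."""
    nucleotides = n_bytes * 8 / BITS_PER_NUCLEOTIDE
    return nucleotides * NUCLEOTIDE_DA * DALTON_G


grams = grams_of_dna(WORLD_DATA_BYTES)
print(f"Approximate DNA mass for 40 ZB: {grams:.0f} g")
```

Under these idealized assumptions the estimate lands just under 90 g. A practical encoding scheme would need more material.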

Conversely, where magnetic storage on hard disks has been relied upon, advances in multi-dimensional nanophotonics are pushing the limits of optical storage by altering the frequency and polarization of the write beam, the factors that have determined the capacity limits of the conventional compact disc.

Until emerging technologies overcome the capacity, performance and cost hurdles needed to replace it, magnetic tape will remain in use because it is affordable.

Our emotional connection to data

Data is a stagnant commodity that becomes dynamic through a person’s association with it. In other words, it derives its strength and validity from its emotional affinity with that person. Data also carries its own nature according to its construction and its affiliations.

No one really cares about analytics – whether advanced, impressive, or state-of-the-art – until they are affected by the implications of it. Emotions play an active role in increasing the strategic importance of specific data segments within information processing units.

Users are drawn to specific data segments associated with their work. In healthcare, real-time data is invaluable for providing clinical diagnosis, while trends from historical data can be essential for the treatment of chronic diseases. Likewise, studies of human DNA can open up predictions of potential illnesses or health care needs required in the future.

Emotions can induce feelings of personal responsibility for the loss or corruption of data that plays a critical role in health predictions, just as concerns about reputational damage can limit the sharing of details about data irregularities with wider stakeholders.

Data quality – a matter of opinion

Understanding data quality draws attention to attributes such as accuracy, consistency, integrity, relevance, timeliness, and security. It should also be understood that while organizations provide procedural boundaries, perceptions of what is considered acceptable data are subjectively formed.

Therefore, the understanding of quality may differ between staff members on the same team, following the same processes and undertaking similar tasks. The impracticality of documenting every individual action and keystroke compounds this variation. Moreover, individual experiences, emotions and tolerances act as mediating factors during data quality assessments.

Although data of the highest quality is desired, for practical reasons the standard tends to lean towards what is acceptable rather than what is perfect. The balance needed to ensure data quality can be described as a seesaw between risk and preventive action. In other words, even when all routine checks have been performed on a dataset, closer examination will inevitably reveal additional anomalies that may need correction.

It is important to recognize that all data contains errors. The time and resources required to ensure ideal quality conflict with the urgency of presenting data in mission-critical systems, where it can be interpreted and exploited in real time. For this reason, deliberating over or analyzing isolated data without a business use case is of little value.

Provenance adds value to data by explaining how it was obtained. However, systems designed and deployed prematurely to meet an urgent project need can pose a host of data quality issues. In addition, data validation techniques can only go so far as to verify the processing of the data but offer no guarantee of quality as to its integrity at the source.
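The limit of validation can be shown with a small sketch. The record layout, field names and plausibility range below are hypothetical and used only for illustration. The checks confirm that a record is well formed, yet a value that was mistyped at the point of capture still passes as long as it looks plausible.

```python
from datetime import date


def validate(record):
    """Return a list of problems found in a (hypothetical) vital-signs record."""
    problems = []
    if not record.get("patient_id"):
        problems.append("missing patient_id")

    heart_rate = record.get("heart_rate")
    # ASSUMPTION: 20-250 bpm is an illustrative plausibility range.
    if not isinstance(heart_rate, int) or not 20 <= heart_rate <= 250:
        problems.append("heart_rate out of range")

    recorded_on = record.get("recorded_on")
    if not isinstance(recorded_on, date) or recorded_on > date.today():
        problems.append("recorded_on missing or in the future")
    return problems


# The true reading was 72 bpm, but it was keyed in as 27 at the source.
mistyped = {"patient_id": "P-001", "heart_rate": 27, "recorded_on": date(2022, 1, 25)}
# An extra digit, by contrast, is caught by the range check.
implausible = {"patient_id": "P-001", "heart_rate": 720, "recorded_on": date(2022, 1, 25)}

print(validate(mistyped))      # no problems reported, although the value is wrong
print(validate(implausible))   # the range check flags this one
```

Only knowledge of how the value was obtained, that is its provenance, would reveal that the first record is wrong.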

When data goes wrong

Datasets lack the ability to determine what is worthwhile and what is arguably junk, which means good data often arrives clumped together with inconsequential information. Most healthcare facilities have systems and processes in place to maintain data standards. But while these play an active role in quality results, there’s no denying that human cognition, coordination, attentiveness, and personal integrity are key to ensuring data quality.

When data goes wrong, no one wants to be associated with the stigma of being identified as the source of the deviation. Unlike data, which is unbiased, people arrive with an ingrained self-serving bias that causes them to automatically attribute success to themselves and failure to others. Although aware of their own flaws, people believe that their intentions are always inherently good and that it is simply not possible for them to be wrong, which instinctively leads them to castigate others for unintentional mistakes.

Units that provide critical services are subject to more rigorous scrutiny in the event of an outage. Teams working together to jointly deliver solutions can suddenly retreat into their own silos when a data error has been identified. The question of who owns a problem becomes a logistical hot potato that ultimately comes to rest with the processing unit where the anomaly occurred.