Featured in PARCEL: Ensure the Quality of Your Big Data
Travis Rhoades, our Director of Data Science, is authoring a series of guest columns on Big Data for PARCEL magazine. This column is the third and final installment and explains the importance of data quality. You can read the full column below or on the PARCEL web site.
Big data gives parcel shippers access to information they’ve likely never had before. When properly leveraged, data analytics lead to decisions that mitigate risk and seize opportunities.
It’s important to keep in mind, though, that big data sets are complex, dynamic and resource-intensive. It’s not as simple as flipping a switch or doing a software upgrade. The quality of your data is key, and it is vital that you do everything you can to maximize it.
Here is a best-practice approach to data quality:
Data is a raw material. From it, information and knowledge (i.e., intelligence) are refined and then used to make decisions. As with other raw materials, quality is critical. An ore with a high level of impurities is more difficult and expensive to refine. A raw material that is not properly refined produces poor products. Crude oil straight from the ground will burn, but it will also quickly destroy the engine that uses it.
Dirty, unrefined data will likely destroy the decision-making engine that relies upon it. If the intelligence derived from the data is too expensive to produce, or is of poor quality and therefore unreliable, data-driven decision-making will falter, denying the enterprise the benefits it bestows. If faulty intelligence drives your logistics strategy, it’s tough to achieve cost savings, identify new revenue streams or eliminate inefficiencies.
It is clear that quality data is a building block of business intelligence. Your company’s competitive edge and the success of your logistics strategy hinge on data that’s in good condition. You can avoid bad data — and flawed decision-making — by keeping your data on TRAC:
To support responsive decision-making, data should be captured as quickly as possible after the event that generated it and must be available for reporting and analysis within a reasonable period of time.
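One way to make timeliness measurable is to track the lag between when an event happened and when its record was captured. The sketch below is only an illustration: the record layout, the package IDs and the 24-hour threshold are all assumptions, not values from this column.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(hours=24)  # hypothetical threshold for "timely" capture

# Hypothetical records: (package ID, time of event, time the record was captured)
events = [
    ("P1", datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 9, 30)),
    ("P2", datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 3, 8, 0)),
    ("P3", datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 20, 0)),
]

def late_records(records, max_lag=MAX_LAG):
    """Return (package ID, lag) for records captured later than max_lag."""
    return [
        (package_id, captured - occurred)
        for package_id, occurred, captured in records
        if captured - occurred > max_lag
    ]

late = late_records(events)
for package_id, lag in late:
    print(f"{package_id} was captured {lag} after the event")
```

What counts as a reasonable lag depends on how quickly the decision it supports has to be made, so the threshold should be set per report rather than once for everything.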
Your data must be stable over time. If your collection process is inconsistent or the definitions of terms vary, the usefulness of your reporting and analyses will be degraded, perhaps severely. Your reporting and analysis should reveal real changes in your operational performance, not variations in your collection methods or the definitions of terms.
Data sources come with varying degrees of noisiness. Sometimes they’re like an AM radio station when you’re driving next to a power line, and other times they’re like a high-definition picture sent via a fiber-optic cable. Noise is the accumulation of random errors in your data. Knowing the noisiness of your data sources is very important. Large data sets are generally more resilient to noise because statistical methods for removing noise work well with them. The same is not true for smaller data sets.
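To illustrate why size helps against noise, the simulation below repeatedly averages noisy measurements of the same quantity and compares how much the averages wander for a small sample versus a large one. It is a minimal sketch: the true weight, the noise level and the sample sizes are all made-up values, and simple averaging stands in for whatever noise-reduction method you actually use.

```python
import random
import statistics

TRUE_WEIGHT_LB = 10.0  # hypothetical true value being measured
NOISE_SD = 0.5         # hypothetical spread of the random measurement error

def spread_of_averages(sample_size, trials=200, seed=42):
    """Standard deviation of the sample mean across repeated noisy samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [TRUE_WEIGHT_LB + rng.gauss(0, NOISE_SD) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

small_spread = spread_of_averages(sample_size=10)
large_spread = spread_of_averages(sample_size=1000)
print(f"Spread of averages, 10 records each:   {small_spread:.3f} lb")
print(f"Spread of averages, 1000 records each: {large_spread:.3f} lb")
```

The averages built from 1,000 records should cluster far more tightly around the true value than those built from 10, which is the sense in which large data sets absorb random error.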
The quality of all data sets, large or small, is vulnerable to systematic errors. For instance, if your system assumes all package weights are measured in pounds when in fact many are in kilograms, you may have a significant error, impervious to statistical noise reduction, that could go unnoticed for some time.
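The pounds-versus-kilograms example can be simulated to show why more data does not help here. In this sketch, every number is an assumption chosen for illustration: a true weight of 10 lb, random scale noise, and 30 percent of records silently stored in kilograms but read as pounds.

```python
import random
import statistics

LB_PER_KG = 2.20462
TRUE_WEIGHT_LB = 10.0  # hypothetical true package weight

def simulate(n, kg_fraction, noise_sd=0.5, seed=0):
    """Return n recorded weights; a fraction is silently logged in kilograms."""
    rng = random.Random(seed)
    recorded = []
    for _ in range(n):
        weight = TRUE_WEIGHT_LB + rng.gauss(0, noise_sd)  # random scale noise
        if rng.random() < kg_fraction:
            weight /= LB_PER_KG  # stored in kg, but later read as pounds
        recorded.append(weight)
    return recorded

noise_only = simulate(10_000, kg_fraction=0.0)
with_unit_bug = simulate(10_000, kg_fraction=0.3)

noise_only_error = statistics.mean(noise_only) - TRUE_WEIGHT_LB
unit_bug_error = statistics.mean(with_unit_bug) - TRUE_WEIGHT_LB
print(f"Average error, random noise only: {noise_only_error:+.3f} lb")
print(f"Average error, with unit mix-up:  {unit_bug_error:+.3f} lb")
```

Averaging 10,000 records shrinks the random error to nearly nothing, but the unit mix-up leaves the average well over a pound too low, and adding more records would not change that.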
The data that you use to support decision-making must be as complete as possible. There are several ways to think about completeness. Data can be complete in the sense that all fields within your records are populated. Here you have all the information regarding, say, a particular charge on a particular package. Data can also be complete in the sense that you have records for all the charges on all your packages. Of course, being complete in both these senses is ideal. As with accuracy above, some incompleteness can be tolerated if the data set is large enough for statistical methods to help and if the data is missing at random. Otherwise, incomplete data means incomplete intelligence and a corrupted decision-making process.
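Both senses of completeness described above can be checked mechanically. The sketch below assumes a hypothetical schema (package_id, charge_type, amount) and a hypothetical list of packages known to have shipped; neither comes from this column.

```python
REQUIRED_FIELDS = ("package_id", "charge_type", "amount")  # hypothetical schema

def field_completeness(records):
    """Fraction of records whose required fields are all populated."""
    if not records:
        return 0.0
    complete = sum(
        all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for record in records
    )
    return complete / len(records)

def missing_packages(records, expected_ids):
    """Package IDs that shipped but have no charge record at all."""
    seen = {record.get("package_id") for record in records}
    return sorted(set(expected_ids) - seen)

# Hypothetical charge records and shipment list
records = [
    {"package_id": "P1", "charge_type": "freight", "amount": 12.50},
    {"package_id": "P2", "charge_type": "", "amount": 3.10},       # blank field
    {"package_id": "P4", "charge_type": "fuel", "amount": None},   # missing amount
]
shipped = ["P1", "P2", "P3", "P4"]

print(f"Records with every field populated: {field_completeness(records):.0%}")
print(f"Packages with no record: {missing_packages(records, shipped)}")
```

Neither check says whether the gaps are random. If the missing records cluster around one carrier, lane or charge type, statistical methods are unlikely to compensate for them.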
Businesses across all industries, including your competitors, are taking on big data initiatives to capture more market share, save on costs and eliminate inefficiencies. Harnessing the power of big data hinges on accurate information that supports confident and informed decision-making. Can your business support the complexity and resources big data requires? What steps is your business taking to ensure data quality?