With ever-increasing requirements for a higher quantity and quality of analytical output, the need to boost productivity in risk management has become more acute. In pursuing these productivity gains, we have observed that investments in data quality can offer dramatic improvements and typically pay for themselves.

The post-2008 regulatory environment has dramatically increased the consequences to financial institutions of unsound risk analytics. The analytical complexity and massive downside risks of unacceptable regulatory submissions lead firms to maintain large headcounts of high-cost analytical employees who transform data into projections of their institutions’ financial performance. In our experience, this “get it right at any cost” situation results in material inefficiency, waste, and delay.

While there are many sources of waste, we frequently observe data quality to be a root cause. Dealing with some “garbage in” when there is no room for even a little “garbage out” is expensive. Bank risk managers and stress testing teams felt this acutely in the 2017 Comprehensive Capital Analysis and Review (CCAR) stress tests, when the US Federal Reserve released erroneous scenario data and corrected it a week later. The incident prompted rework, compressed schedules, and created uncertainty.

Obviously, increasing data quality improves the productivity of these analytical risk management processes. What may not be obvious is how best to invest in data quality and what return on investment may be possible. Even defining data quality in the context of risk analytics is not straightforward.

What’s the Bottom Line?

Investing in data quality can provide a range of substantial cost savings. In research1 and in our own experience, data quality investments consistently lead to a 30% to 50% reduction in direct costs – expenses such as payroll that are necessary for the actual production of analytical results and supporting documentation.

Improved data quality can also provide substantial indirect gains, cutting the costs that arise from quality problems and uncertainty. Although quantification of indirect costs remains elusive, we find two broad sources of such costs: model development time and confidence levels.

Model development time: Accurate data is a necessary – though not sufficient – condition for constructing a predictive model. If historical performance data for a loan portfolio are incorrect, a model developed on such a dataset will fail to capture the true underlying relationships between performance and economic factors. Noisy data will either provide weak signals at best or spurious correlations at worst.

From our experience developing a wide variety of econometric forecasting models, we find that poor data quality is the main reason for increasing the cycle time for model development. Having analytical modelers spend time addressing fundamental data issues during the model development process is wasteful for two reasons. First, being downstream consumers of data, modelers will waste time locating and communicating with the appropriate data experts within the organization. Second, the data corrections that modelers ultimately develop for their specific projects will not be captured at the source. The latter issue is particularly costly, as it implies that an institution may end up paying to address a single data error multiple times.

Confidence: High data quality creates confidence. It reduces noise, which in turn reduces model uncertainty. More broadly, model users who have low confidence in reported data are inclined to add an “uncertainty premium” to model results.

In the specific case of loss forecasting and allowance calculation, bad data may push managers toward conservative estimates, ultimately leading to higher-than-necessary capital allocation. In this case, the cost of poor data quality directly translates into higher-than-required capital buffers and loss allowance provisions. While this may be prudent, overly conservative projections can price lenders out of the market, disappointing shareholders and ceding opportunities to more nimble competitors.

The indirect benefits of confidence may go beyond the business users of models. High-quality data are necessary to gain the confidence of model validators, senior managers, regulators, auditors, and other interested parties. Even if a model is well-constructed and estimated using state-of-the-art techniques, data anomalies can distract and call into question the integrity of model results – adding to compliance and other costs.

Quality is Productivity is Quality

The inseparable relationship between quality and productivity has been known in the manufacturing world for years,2,3 and research on the topic of data quality has made effective use of the analogy.4 Experience with serving our risk management clients and our own internal data processes has shown that the analogy also applies to quantitative analytical work. We have found, time and again, that there is a virtuous cycle between increasing quality and increasing productivity. Better-quality data boost analytical productivity by reducing wasted effort, idle resources, process bloat, and the number of required judgment calls. Likewise, higher productivity increases quality by automating error-prone tasks, reducing haste, and leaving time to evaluate results.

We have identified four major buckets of waste: rework, questioning, process deviations, and peak load.

Rework: The most obvious source of waste is having to discard a result and do the same task again. The serial nature of analytical processes makes rework issues especially costly when upstream steps are error-prone and weak error detection results in late identification of problems.

Questioning: Questioning an analytical result usually entails several higher-end tasks including troubleshooting, trade-off analysis, and ultimately making judgment calls, often requiring escalation. Improved data quality should reduce questioning time by reducing the frequency of anomalous results, caused by actual errors or poor models, and facilitating the troubleshooting process.

Process deviations: Unacceptable input data, such as model drivers with values outside required bounds, may not be addressable by rework and could require a change to the process itself. Changes must be developed, documented, and often validated. Furthermore, process deviations increase the probability of repeated errors, should the amended analytical step be repeated.

Peak load: All of the above reasons may delay the completion of upstream tasks, leaving less time for downstream tasks and forcing organizations to apply more resources to meet deadlines. This creates a spike in the demand for analytical capacity, which may require carrying excess staff or contracting for temporary help. When upstream tasks are completed efficiently, there is a decreased probability of downstream tasks being compressed.

The experience of the 2017 CCAR stress tests has elements of all four of these types of waste. The data quality concern was that the US Federal Reserve issued incorrect data for BBB bond yields in its supervisory scenarios. A week later, it issued corrected scenario data. The rework this caused is obvious: Analysts set to work upon the initial release of data and were required to repeat some or all of this work in response to the revised data. Additional questioning occurred, with institutions seeking to determine what this change meant for them and how best to proceed. Analytical results received further questioning after recalculations to evaluate the impact of the revised guidance. The unanticipated correction in scenario guidance certainly created process deviations, since doubtless few, if any, institutions anticipated this occurrence. Finally, the rework and additional week of delay in receiving definitive figures from the Federal Reserve compressed schedules and created higher peak loads.

Defining Data Quality

Identifying the channels where data problems can impair productivity enables us to propose functional requirements for data quality in risk analytics. High-quality data should be (1) verifiably correct, (2) fit for use, and (3) documented. These are thematically similar to data requirements in regulatory guidance,5 but in this section we tailor their meaning and add specifics in terms of how they relate to the productivity of risk analytics processes.

Verifiably correct: Numerical correctness is clearly the minimum threshold for data quality and incorrect data will likely incur rework or process deviation waste. However, the ability to independently verify numerical correctness further increases data quality. This is especially important when using third-party data such as economic or market risk variables. The ability to independently verify data accelerates troubleshooting and avoids communication iterations with vendors or internal parties that would add delay and downstream peak load issues. Verifiability can come in several forms, such as backlinks to primary sources, quality declarations, unique identifiers, and accessible quality logs.

Fit for use: For data to produce sound analytical results, they must accurately quantify the concept they intend to measure. Modelers should consider both the definition of the data and their objective properties, such as time series length, frequency, timeliness, and consistency. Data that are too aggregated or nonspecific may provide weak or wrong fits, such as if national data on house prices were used when one could use state-, metro-, or ZIP code-level data.

Data that are fit for use should produce better-functioning models with more trusted results. This not only speeds up model development, but also reduces the expected questioning time and probability of process deviations.

Documented: Documentation is essential to the interpretation of data. Without documentation, questions need to be asked, such as: Exactly which loans are included in this sample? Is the volatility in this time series due to volatility in real life or uncertainty in estimation? These investigations are time consuming. Having the specifics and sources of the data increases confidence and reduces questioning time for analysts and validation teams, thereby allowing them to do their jobs more effectively.

Quality in Practice

Several guiding principles underlie the data quality initiatives we have implemented in our analytical processes:

Prioritize upstream data inputs, especially in model development.

Implement high-throughput quality checks to verify the thousands of variables being forecast.

Maximize use of objective pass/fail tests with low rates of false positives.

Log judiciously and store intermediate results.

Press data vendors to improve their delivered data quality.

Catalog data centrally and ensure all users have access to the latest catalog.

To illustrate the application of these principles in real situations, we can look at the following case study:


Macroeconomic Scenarios

Each month, Moody’s Analytics produces macroeconomic forecasts and alternative economic scenarios for 58 countries. The process involves about 45 economists, many of whom are senior, and usually takes three weeks per month. Data for the process come from more than 100 sources around the world.

The high complexity and large resource requirement of this forecast process create monthly peak loads which impose bounds on analytical throughput and scheduling. Consequently, we strategically invested in reducing both the size and duration of this peak load while improving quality, with the goal of materially increasing analytical output and shortening process time. To that end, we redesigned the forecasting process with an emphasis on implementing the concepts previously discussed. Several key investments were made, notably in a new infrastructure for updating historical data, a new initial forecast system, standardized forecast quality-check programs, and mistake-proofing the forecast editing process.

The central features of the new historical data infrastructure are improved archiving, logging, transparency, and high-throughput quality checks. The system takes as inputs a mapping spreadsheet with definitions, equations, and metadata and produces the complete dataset needed for the forecast. The system also runs a barrage of quality tests on all data and results. Along the way, it archives the mapping information, raw data, processed data, and the output of all quality checks run. Archiving intermediate results improves our ability to diagnose problems quickly and explain unexpected results. Furthermore, we log each calculation to accelerate troubleshooting. This not only makes the actual update process go faster, but also facilitates answering questions that may come from analysts downstream.
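The flow described above (a mapping drives the calculations, each calculation is logged, and raw and processed data are archived) might be sketched as follows. This is only an illustration under stated assumptions: the function `run_update`, the mapping layout, and the archive file names are hypothetical and are not taken from the actual system.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("history_update")


def run_update(mapping, raw, archive_dir):
    """Build the processed dataset and archive every intermediate result.

    mapping: {output_name: (definition_text, function_of_raw)}  (hypothetical layout)
    raw:     {source_name: value}
    """
    processed = {}
    for name, (definition, formula) in mapping.items():
        processed[name] = formula(raw)
        # Log each calculation so unexpected results can be traced later.
        log.info("computed %s (%s) = %r", name, definition, processed[name])

    # Archive mapping definitions, raw inputs, and processed outputs together.
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    definitions = {name: definition for name, (definition, _) in mapping.items()}
    (archive / "mapping.json").write_text(json.dumps(definitions, sort_keys=True))
    (archive / "raw.json").write_text(json.dumps(raw, sort_keys=True))
    (archive / "processed.json").write_text(json.dumps(processed, sort_keys=True))
    return processed
```

Keeping the raw inputs next to the processed outputs means a downstream question can be answered from the archive alone, without rerunning the update.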

The quality checks run by the historical data infrastructure efficiently surface potential problems and focus attention on what is important. These checks include:

Data with large or deep revisions relative to the previous month’s forecast run

Metadata changes

Discontinued data

Highly lagged data

Sparse data
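As a rough illustration, the checks listed above can be expressed as objective pass/fail tests on each series. The sketch below is not the production implementation: the `Series` container, the function `check_series`, and all thresholds (`revision_tol`, `max_lag_days`, `min_obs`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Series:
    """One historical series: observation date -> value, plus metadata."""
    name: str
    values: dict                      # {date: float}
    metadata: dict = field(default_factory=dict)


def check_series(current, previous, as_of,
                 revision_tol=0.05, max_lag_days=180, min_obs=8):
    """Return finding labels for one series; an empty list means it passed.

    Thresholds are illustrative assumptions, not values from the article.
    """
    # Discontinued: delivered last month, nothing delivered this month.
    if not current.values:
        return ["discontinued"] if previous and previous.values else ["sparse"]

    findings = []
    if previous is not None:
        # Revisions: overlapping dates whose value moved by more than the tolerance.
        revised = [
            d for d in set(previous.values) & set(current.values)
            if previous.values[d] != 0
            and abs(current.values[d] / previous.values[d] - 1) > revision_tol
        ]
        if revised:
            # Count shows how large the revision is; earliest date shows how deep.
            findings.append(f"revised: {len(revised)} obs back to {min(revised)}")
        if previous.metadata != current.metadata:
            findings.append("metadata changed")

    # Highly lagged: latest observation is too old relative to the run date.
    if (as_of - max(current.values)).days > max_lag_days:
        findings.append("highly lagged")

    # Sparse: too few observations to support estimation.
    if len(current.values) < min_obs:
        findings.append("sparse")
    return findings
```

Because each test returns a short label rather than a score, thousands of series can be screened in one pass and only the flagged ones need an analyst's attention.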

Next, the initial forecasting system merges the new historical data, model equations, and carried-over forecast assumptions to produce an initial forecast. If it fails to produce the intended results, forecasters need to perform a substantial amount of cleanup work downstream. We substantially improved the initial forecasting system by the addition of integrity checks, which verify that intended steps did, in fact, occur as intended. These checks verify the following:

Every variable in the model has updated history.

All intended exogenous overrides are applied to the forecast.

Exogenous add-factor variables created by re-endogenizing variables produce exactly the desired result.
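The three integrity checks above could be sketched along the following lines. The function name `integrity_findings`, its arguments, and the use of (variable, period) tuples as keys are all hypothetical. The third check assumes, for illustration, that a re-endogenized variable solves to its equation value plus an additive add-factor.

```python
import math


def integrity_findings(model_vars, updated_vars, overrides, forecast,
                       equation_values, add_factors, targets, tol=1e-9):
    """Return a list of integrity failures; an empty list means all checks passed.

    overrides, forecast, equation_values, add_factors and targets are dicts
    keyed by (variable, period) tuples. All names here are hypothetical.
    """
    findings = []

    # 1. Every variable in the model has updated history.
    for var in sorted(set(model_vars) - set(updated_vars)):
        findings.append(f"{var}: history not updated")

    # 2. Every intended exogenous override appears in the forecast.
    for key, value in overrides.items():
        got = forecast.get(key)
        if got is None or not math.isclose(got, value, rel_tol=0.0, abs_tol=tol):
            findings.append(f"{key[0]} {key[1]}: override not applied")

    # 3. Equation value plus add-factor reproduces the desired target.
    for key, target in targets.items():
        solved = equation_values[key] + add_factors.get(key, 0.0)
        if not math.isclose(solved, target, rel_tol=0.0, abs_tol=tol):
            findings.append(f"{key[0]} {key[1]}: add-factor misses target")

    return findings
```

Running such a function immediately after the initial forecast is built would surface a missed override before any economist starts editing, which is where the cleanup cost is lowest.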

Additionally, we archive a copy of the initial forecasting package – data, equations, add-factors, and quality findings – to facilitate troubleshooting and trace the causes of specific results downstream.

After that, we enhanced the process by which economists impart judgment into the forecast by adding universal quality checks and redesigning support programs to prevent forecasters from taking actions that would impair quality.

The universal quality checks inspect for objectively implausible results such as values going negative that should not, long-run seasonality in a forecast of a deseasonalized variable, and unprecedented growth being forecast in the first forecast period. The support programs display these quality findings prominently each time an economist runs the model.
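Two of these plausibility checks are simple enough to sketch directly. The example below is an assumption-laden illustration: the function `plausibility_findings` is hypothetical, "unprecedented" is interpreted as first-period growth falling outside the range of historical period-over-period growth, and the seasonality check is omitted.

```python
def plausibility_findings(history, forecast, non_negative=True):
    """Flag objectively implausible forecast values for one variable.

    history and forecast are lists of floats in time order.
    """
    findings = []

    # Values going negative that should not.
    if non_negative and any(v < 0 for v in forecast):
        findings.append("negative value in forecast")

    # Growth in the first forecast period outside anything seen in history.
    growth = [b / a - 1 for a, b in zip(history, history[1:]) if a != 0]
    if growth and forecast and history[-1] != 0:
        first = forecast[0] / history[-1] - 1
        if first > max(growth) or first < min(growth):
            findings.append("unprecedented growth in first forecast period")

    return findings
```

For instance, with a history of 100, 102, 101, 103, a first forecast value of 120 implies growth far above any historical period and would be flagged, while a first forecast value of 104 would pass.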

The support program redesign implemented poka-yoke, or “inadvertent error prevention,”6 in several ways. If an economist attempts to edit the forecast of a variable that he or she should not edit, the solve program halts and displays an appropriate message. The program also firewalls actual historical figures from editing and overrules any attempted edits where there is reported history. The same protection of numerical values also extends to shared assumption data that flow into all models in use.
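A minimal sketch of this kind of mistake-proofing, assuming a simple in-memory store, is shown below. The class `ForecastStore` and the exception `EditNotAllowed` are hypothetical names for illustration and do not describe the actual support programs.

```python
class EditNotAllowed(Exception):
    """Raised when a forecaster tries to edit a protected variable."""


class ForecastStore:
    """Holds history and forecast values keyed by (variable, period)."""

    def __init__(self, history, forecast, editable):
        self._history = dict(history)
        self._forecast = dict(forecast)
        self._editable = frozenset(editable)

    def edit(self, var, period, value):
        """Apply an edit if allowed and return the value actually in effect."""
        # Halt with a clear message for variables that must not be edited.
        if var not in self._editable:
            raise EditNotAllowed(f"{var} may not be edited in this model")
        # Reported history always wins: the attempted edit is overruled.
        if (var, period) in self._history:
            return self._history[(var, period)]
        self._forecast[(var, period)] = value
        return value

    def get(self, var, period):
        """Return reported history if it exists, otherwise the forecast."""
        key = (var, period)
        return self._history.get(key, self._forecast.get(key))
```

The design choice is that the error cannot be made at all, rather than being caught later: an edit to a protected variable stops the run, and an edit over reported history simply has no effect.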

Overhauling the process with a focus on data quality achieved telling results. Economists reported their time before and after the operational redesign and showed a 30% reduction in direct time spent forecasting. Quality improved measurably: Delivery schedules moved forward by several days, and forecast restatements went from common to rare. More than one economist remarked that the improvement in upstream data quality reduced cleanup time and let them spend more time adding value to the forecast.

Data Quality Underpins Confidence in the Decision-Making Process

Data quality is fundamental to sound risk management and analytical productivity. As the examples above have illustrated, problems with data quality can trickle through and affect every subsequent analysis, model, and decision made by an institution. These problems incur substantial direct and indirect costs. Conversely, high data quality creates a virtuous cycle of productivity and quality, reducing labor costs, accelerating schedules, and bolstering confidence in analytical output.

Despite the common refrain among decision-makers that information is central to their ability to make proper decisions, data quality often doesn’t get the attention – or the funding – that revenue-generating initiatives do. However, our experience and research show that investments in data quality deliver widespread benefits and pay for themselves quickly.

The indirect costs of not investing in data quality are significantly higher than the direct costs of implementing quality programs. The costs of data assurance are low relative to the benefits. Before dismissing data quality initiatives as expensive, businesses need to consider their return on investment through the lens of cost avoidance.


1 Hansen, Mark David. “Zero Defect Data: Tackling the Corporate Data Quality Problem.” Massachusetts Institute of Technology. January 1991.

2 Shewhart, Walter A. Economic Control of Quality of Manufactured Product. New York: McGraw-Hill, 1931.

3 Deming, W. Edwards. Out of the Crisis. Cambridge, Mass.: MIT Press, 1986.

4 See Ref 1.

5 See Basel Committee on Banking Supervision, “Principles for effective risk data aggregation and risk reporting,” BCBS 239, January 2013; and Office of the Comptroller of the Currency, “Supervisory Guidance on Model Risk Management,” OCC 2011-12, April 4, 2011.

6 Shingo, Shigeo. Zero Quality Control: Source Inspection and the Poka-Yoke System. Portland, Ore.: Productivity Press, 1986.
