Organisational aspects of implementing ML based data editing in statistical production
Appendix 3: Use cases
Use case 1
ABS: Unsupervised ML for anomaly detection in large and frequent admin data
Project overview
Anomaly detection in frequent big data – an unsupervised approach to identify anomalies in wage payment administrative data as reported by businesses.
The ABS has been investigating unsupervised anomaly detection methods for large and frequent business administrative datasets - wages and jobs as reported by businesses to the tax office, with frequent extracts provided to ABS.
Unsupervised methods produce anomaly scores that can be used in combination with significance scores to better target validation and editing efforts - providing human decision-makers with a short-list of anomalous and significant units, along with contextual information.
This forms part of a broader validation and editing approach and is a low-risk way to introduce the benefits of machine learning.
The methods were selected based on performance, efficiency and explainability.
Unsupervised methods are also useful for identifying unexpected anomalies in new and evolving datasets, where labelled data is limited.
This method is being assessed for inclusion in a production system and, if useful, may also be considered for other statistical programmes. A possible future direction is the automated treatment of less-significant units.
Organisational readiness
The ABS has a long history of innovation. This includes allocating targeted resourcing into key areas of research, methodological developments, and data science / compute capabilities. Ideas are assessed for their potential to concretely improve the delivery of statistical information, such as improving efficiency, quality or capabilities, or delivering new statistical insights. This preparedness better enables the organisation to harness opportunities.
This project arose from the use of a big new dataset within a new compute environment - the need to understand and identify anomalies in large, frequent, evolving data. The new environment provided functionality and tools that were not usually available - such as broad access to python/R packages and compute capabilities - which made this work possible for big data. However, the opportunity to undertake this project also came with some limitations. The compute environment was new and still being built, with limited tools, functionality, access, and support for early users such as this research team. It was a steep learning curve for all teams involved in this new environment.
Before this project started, the methodology area had:
- undertaken investigation into the potential use of machine learning for data editing more broadly; and
- engaged with other NSOs doing work in this space, including through UNECE HLG-MOS and its Statistical Data Editing group.
It was able to leverage these learnings for this project.
A key deliverable for this project is to better understand and create documentation for ongoing maintenance of these algorithms, with the aim of gradually building confidence in ML and expertise within statistical production areas.
Understand business needs (Who needs what)
As mentioned, the need for this new statistical product in a short timeframe created the opportunity for this project.
- This use case involved very large data that needed frequent processing. This provided an opportunity to investigate new compute solutions and machine learning to identify anomalies in big data. Because automated editing rules were already built into the pipeline, with manual validation and editing also undertaken, this project introduced a low-risk complementary approach, aiming to better target and better inform the work of the validation / editing team and so provide efficiencies and improved quality.
- The team also focussed on building a solution that was useful for the broader organisation (not just a point solution) while also delivering a useful concrete deliverable for the particular use case.
- The data itself was large, new, and evolving (with staged onboarding of data providers), so the team was still developing its understanding of what 'wrong' looked like. However, this was an opportunity to demonstrate unsupervised methods to identify anomalies, and to help build up that understanding. This learning could be used to develop rules or train models to recognise these patterns; however, it is anticipated that unsupervised methods would continue to be useful for identifying unexpected anomalies on an ongoing basis.
This stage of the work was funded through the allocation of a methodology team, with access to the environment and data provided by the statistical production area and IT team.
The research team built a good working relationship with the business area that enabled us to understand their needs and put us in a position to be 'on the scene' to provide solutions.
A key aim of this project was to build understanding and confidence in the performance of machine learning for this purpose.
- To build confidence in machine learning (ML), this work aimed to bring our stakeholders on a journey, starting with a low-risk approach to demonstrate that ML can add value / complement the more-traditional and familiar approaches. (As the stakeholders become comfortable that the approach is working appropriately, then later stages may investigate the use of ML to propose edit values.)
- A key aspect of this is explainability: it is important to be able to explain how the approach works and why particular units are identified - for transparency, to build stakeholder confidence, and to determine whether the approach is working and how to improve its performance. As mentioned further below, this can be challenging without labelled data.
The IT team had been investigating emerging compute environments, so was able to harness the opportunity when it came along.
Taking things to the next stage in the productionisation process will depend on future funding decisions.
Assess Preliminary Feasibility
Methods:
Building on our existing knowledge of the benefits and issues of ML for anomaly detection, we undertook a small/fast assessment of a number of potential methods and selected a method that suited the nature of the data and production needs - initially applying Local Outlier Factor (LOF). LOF is a density-based method that is relatively efficient; relatively simple to understand and maintain; and likely to provide good and robust results with minimal hyperparameter tuning and pre-processing. LOF provides a score reflecting how anomalous the unit is, and a subset of anomalous units is sent to the validation team, along with contextual information and visualisations. This shortlist can be further targeted on groups, such as significant contributing units. The anomaly scores can be normalised to assist with interpretation (e.g., to between 0 and 1).
The initial LOF approach looked only at the current period - that is, in the current period of data, the time-series variables were created for each unit - and units were given a higher anomaly score if their variable combinations differed from those of other units in the same period. This approach is easy to explain, and easy to maintain as it does not need training data / models to be created.
The statistical production area became more familiar and comfortable with the performance and explainability of this approach.
Two additional approaches were assessed, both using training data to build models: (i) LOF; and (ii) Isolation Forest (IsoF).
Both are used because they tend to identify different kinds of anomalies.
Isolation Forest randomly splits the data, over and over, until each point is isolated. Each point is given a score that is inversely related to the number of splits needed to isolate it. This process is repeated a number of times and an average score is created. Anomalous points tend to need fewer splits and therefore tend to get a higher score.
Local Outlier Factor identifies “local” outliers relative to their neighbourhood. Data points are compared to other data points in this neighbourhood and given a score based on the density of their neighbourhood relative to that of their neighbours. A point whose local density is lower than that of its neighbours receives a higher score.
These models capture 'normal' relationships between the variables over the previous 12 months, and units from a selected period are compared to this information.
The results were found to be fairly robust to the inclusion of some anomalies in the training data.
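As an illustration of these two model-based approaches, the sketch below uses scikit-learn with synthetic stand-in feature matrices; the actual ABS variables, hyperparameter values and score combination are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 6))   # stand-in for derived unit variables over the previous 12 months
X_new = rng.normal(size=(800, 6))      # stand-in for units in the selected period

# Isolation Forest: fewer random splits needed to isolate a point => more anomalous
iso = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X_train)

# LOF in novelty mode: score new units against the density structure of the training data
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

def to_anomaly_score(raw_scores):
    """Flip sign (score_samples: higher = more normal) and rescale to [0, 1]."""
    s = -raw_scores
    return (s - s.min()) / (s.max() - s.min())

iso_score = to_anomaly_score(iso.score_samples(X_new))
lof_score = to_anomaly_score(lof.score_samples(X_new))

# One simple (illustrative) way to combine the two complementary views per unit
combined_score = np.maximum(iso_score, lof_score)
```

Taking the maximum is only one possible combination rule; the point is that each detector contributes anomalies the other may miss.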
Other methods, such as classification methods, would be good to assess; however, more labelled data would be needed. This can be difficult to create where anomalies are few.
Variable creation and selection:
Regarding the variables used in the LOF model, a number of time-series variables were created, in particular to incorporate information about the expected movement for that unit (with respect to itself, or 'like units'). The variable definitions and selection were initially simple - to see whether models that were simple and fast to create, understand and maintain were able to provide good and robust results.
It was found that a relatively small number of appropriately defined variables captured much of the important parameter space needed for fast, low dimension, efficient and effective outcomes. The LOF identifies units with anomalous combinations of these variables.
The variables were normalised with a shift and spread that fix the 5th and 95th percentiles (so the variables were roughly on the same scale, but the tails were allowed to remain long).
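As a sketch of this normalisation (the exact production implementation is not shown here):

```python
import numpy as np

def percentile_scale(x, lo=5.0, hi=95.0):
    """Shift and rescale so the lo-th and hi-th percentiles map to 0 and 1.
    Values in the tails may fall outside [0, 1], so long tails are preserved."""
    p_lo, p_hi = np.percentile(x, [lo, hi])
    return (np.asarray(x, dtype=float) - p_lo) / (p_hi - p_lo)
```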
Hyperparameter selection:
The hyperparameters were chosen to be as small as possible (for faster compute) while still providing good, stable performance, particularly for anomalies.
The key hyperparameters were:
- LOF: number of nearest neighbours; and
- IsoF: number of trees and number of samples to build the tree.
Performance:
The team explored a number of approaches to help determine whether anomalies identified were of interest, and also whether the method was missing key anomalies:
- Explored use of visualisations (e.g., time series, scatterplots).
- Compared with some key outliers (what would a human consider 'wrong' vs 'unusual').
- Feedback from the business area / data experts. The business areas were very busy, thus it was harder to get their time / input. We also needed to spend time bringing them on a journey.
- As the research team and business area learned more about the data, they were able to start building a set of known anomalies, which was also useful for assessing the performance of the models.
The feasibility assessment was undertaken using samples of data, due to memory / processing limits in the environment and for the visualisations. Random samples were used initially for feasibility assessment. Later work instead used group-specific data/models as specified by the statistical team. This also enabled us to parallelise the preprocessing / training code. We are still learning about the most efficient way to code for dashboards.
Learnings:
- Categorical variables are problematic for LOF/IsoF, so continuous variables were created to capture the concept, for example by comparing a unit with units in the same category.
- Variable normalisation was required, however only basic normalisation was necessary to have good and robust results.
- There are some situations where data may have unusually high densities, which can impact the LOF score of nearby units. The current arrangement no longer has this issue, but at the time it was dealt with by dropping these units: they were deemed 'boring' (i.e., the same value every period), so removing them from the analysis prevented them from affecting other units.
- Some additional pre-processing was needed. For example, capping very large/tiny values of some ratio variables.
- Testing was also undertaken on the various variable definitions and the number of variables, and to ensure appropriate targeting (e.g., not identifying large units just because they are large, or small units just because they tend to be more volatile).
Engagement with statistical production area and IT area
The team worked with the business area and IT area to build a prototype to demonstrate how the method worked.
Multi-level engagement has been important throughout the process, including with the business owners, the IT team, the corporate infrastructure funding/management team for this build work, other corporate areas (including data custodians, methods owner).
A number of models (sets of variables) were assessed and the key hyperparameters were selected (e.g., the number of nearest neighbours). The team selected a small set of useful models, and these were provided to the business area for assessment and feedback. As expected, the proof of concept showed that LOF/IsoF performed fairly well in identifying anomalies.
We were able to get some feedback from the business area throughout the process, which was crucial to ensure the model was useful for their needs.
- They are a busy team, thus we needed to be mindful of their availability (e.g., production cycle).
- It was important to spend some time over multiple sessions helping the business area become comfortable with the concepts and ideas. We ran some presentations and demos, and also provided them with information and visualisations and allowed them the space to absorb and reflect on it.
- All teams have some turnover so from time-to-time we needed to introduce new staff to the concepts and ideas.
- Most of the business area did not have much experience with the environment and were also learning about the tools.
- Some of the feedback from the business area related to functionality that needed IT resources to build.
Develop proof of concept
A proof of concept was developed for evaluation; however, it needed additional IT components / processes to be built and enabled.
Some initial IT support - e.g., to build some of the key data analysis / visualisation tools to enable app hosting - was funded and managed by methodology (with goodwill from busy IT area). This enabled the team to build and host an Anomaly detection dashboard for evaluation by the statistical area.
This anomaly detection dashboard app / system is currently being built / evaluated.
Approach / method used
A combination of Local Outlier Factor and Isolation Forest was used to identify anomalies. The idea is to send a targeted list of the most significant and most anomalous units to the human decision-makers, along with some contextual information. The data is large and is provided regularly for publication, so fast pre-processing is important. Pre-processing was kept to a minimum and parallelised over subgroups (that are relevant to the outputs and how the decision-makers operate). For every period, the pre-processing extracts the data, creates the variables and the training data/models. The anomaly scores were scaled for interpretability.
A prototype dashboard was built to enable the human decision-maker to view the short-list of significant and anomalous units (the user can vary the anomaly-score and significance-score cut-offs; the dashboard compares units against the pre-processed models) and to view contextual information, such as time series plots, to help with decision-making.
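The filtering step behind such a dashboard can be sketched as follows (column names and cut-off values are illustrative, not the production ones):

```python
import pandas as pd

def shortlist(units: pd.DataFrame, anomaly_cutoff: float = 0.8,
              significance_cutoff: float = 0.8) -> pd.DataFrame:
    """Return the units shown to the validation team, ordered by anomaly score.
    Assumes one row per unit with normalised 'anomaly_score' and
    'significance_score' columns in [0, 1]."""
    selected = units[(units["anomaly_score"] >= anomaly_cutoff)
                     & (units["significance_score"] >= significance_cutoff)]
    return selected.sort_values("anomaly_score", ascending=False)
```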
For more detail, please refer to the ‘Assess Preliminary Feasibility’ section above.
Prepare a Comprehensive Business Case
Future directions will depend on funding.
Deploy the model
Not yet at this stage.
Results
Not yet deployed in production.
Latest status and next steps
Future stages of productionisation depend on funding - so an interim dashboard tool has been built for evaluation by the validation team (currently being assessed - initial feedback is positive), with some IT components/systems currently being built to enable business areas to host their own dashboards.
Upcoming work to:
- Incorporate feedback from evaluation.
- Work with statistical production team on maintenance of system (when and how), including work on explainability.
- Investigate application to other statistical products.
- Assess feasibility of automated treatment of less-significant anomalies.
- Any future productionisation stages will need to consider testing, tech debt, etc.
Lessons learned & recommendation
Learnings included:
- Machine learning can provide benefits for anomaly detection, including finding unexpected anomalies, better targeting lists of anomalies sent to validation teams, providing more contextual information for the validation team, managing large datasets, leveraging multi-variate analysis.
- It was found that a relatively small number of appropriately defined variables captured much of the important parameter space needed for fast, low dimension, efficient and effective outcomes. A set of group-specific models was developed that suited the statistical production team and for efficient processing and use within the dashboard tool.
- The team initially aimed for a low-risk, easy-to-explain machine learning solution; however, once production areas were comfortable with it, they very quickly wanted more advanced approaches.
- New compute environments offer opportunities for big data and new approaches; however, they require a lot of effort, for example:
- Some additional functionality / tools become available, but other usual functionality is not yet available. This is particularly the case for research environments.
- Business areas working in these new compute environments need a very large amount of additional knowledge.
- There is a very large amount of IT effort to build environments suitable for business areas to have greater control over their own statistical products.
e.g., building environments / systems for business areas to create their own apps takes work, and business areas need to build / maintain a different set of skills.
- It was very helpful to have some staff who could 'bridge the language gap' between IT / not-IT areas.
- Close connections with the business areas were crucial. The team found it was important to provide useful concrete outcomes along the way. For example, the team identified some key anomalies and proposed some interim rules to help the validation team identify anomalies in the short term. Ongoing engagement with IT teams, methodology support areas and other corporate areas was also very important.
Reference
No publications available yet.
Use case 2
StatCan - Unit Value (UV) Error Detection and Correction: A Machine Learning Approach
Project overview
In short, the goal of the work is to improve the quality of import data, specifically the Quantity and Unit Value (UV) fields, received from administrative sources. These data are used to produce indicators on international trade (import statistics) as part of the system of macroeconomic accounts. The issue is that the UV (or Quantity) is often misreported on Customs Declaration forms. The UV is a derived variable from the reported Quantity and Value. The Value is carefully checked by Customs agents but the Quantity field (and thus the Unit Value field) less so. This results in:
A great deal of User Inquiries
Significant time investment in the review process by data processing/production analysts
An “edit and imputation” approach to detect and correct/impute erroneous Quantity/UV fields exists but was determined to be underperforming and inadequate. The existing approach largely focuses on “clipping” extreme values and imputes them with random donors. Given the large size of the micro dataset (import declarations), there is little room for manual validation and editing. The business need of the project was to develop a new error detection and imputation approach (as not all extreme values are errors and not all errors are extreme values). A machine learning model approach was chosen for exploration, as such methods had not been tested in the past and were known to show promise for processing large data sets.
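For context, the derived variable and the legacy benchmark can be summarised roughly as in the sketch below; the column names, percentile bounds, grouping and donor rules are illustrative assumptions, not the production rules.

```python
import numpy as np
import pandas as pd

def clip_and_impute_uv(declarations: pd.DataFrame, lo=1.0, hi=99.0, seed=0) -> pd.DataFrame:
    """Legacy-style benchmark (illustrative): derive UV from Value and Quantity,
    flag UVs outside percentile bounds and replace them with random donor values."""
    rng = np.random.default_rng(seed)
    out = declarations.copy()
    out["unit_value"] = out["value"] / out["quantity"]        # UV is a derived variable
    p_lo, p_hi = np.percentile(out["unit_value"], [lo, hi])
    flagged = (out["unit_value"] < p_lo) | (out["unit_value"] > p_hi)
    donors = out.loc[~flagged, "unit_value"].to_numpy()
    out.loc[flagged, "unit_value"] = rng.choice(donors, size=int(flagged.sum()))
    return out
```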
Work mostly started in 2019 (exploration started in 2017). Work now (fall 2022) is at the final steps of implementation.
Organisational readiness
Low to moderate:
Initially, the tools (e.g., ready access to Python/R packages) and compute infrastructure were lacking.
IT Service Providers within the organisation were very hesitant to provide broad access to open-source tools out of a fear that more cyber security breaches could occur (in addition to not yet having a clear process to maintain and support such tools).
Over the 2-3 years of the work, this changed as higher performance machines were purchased, cloud access increased, and clear product owners for the needed open-source tools were identified.
Expertise in ML was still at early stages when the project started in 2017 (e.g., a handful of employees had any experience, and they were spread across 2-3 teams).
By 2022, there were potentially 50+ employees with experience in the ML methods being applied, with a large concentration in the Methodology team but also a few small but important pockets within the Subject Matter teams.
Organisationally, there was support to try new methods.
Understand business needs (Who needs what)
The business need was clear: the current E&I approach was underperforming, and desire for a new approach was high. Although an ML-based method for E&I was being explored, using ML was not the goal in itself. The goal was always to develop a better E&I method.
Assess Preliminary Feasibility
The choice to try an ML approach can in part be attributed to the large size of the micro dataset, the large number of different products and the large variability. A rules-based approach was unlikely to succeed. Given a “clipping” approach was already in use, trying something new was warranted, and ML approaches were showing promise on large datasets.
Initial feasibility of the ML work was low, as the quality of the existing labelled/training data available to test a machine learning (ML) approach was low. Initial ML model performance (with an XGBoost-based model tested on non-representative samples) was promising but not particularly good.
Much work went into improving the quality of the existing labelled data. On an ongoing basis, production staff labelled random samples of the new months for use as validation / testing data. “Business rules” were used to correct errors in historical training data. Once done, model performance was tested and found to be good, significantly outperforming the existing E&I approach.
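The model itself is XGBoost-based; the exact features, target formulation and decision rules are not public, so the sketch below is only a rough illustration in which a regressor predicts the expected unit value and large relative deviations are flagged and replaced (the synthetic data and the ratio threshold are assumptions).

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Stand-ins for historical declarations with corrected ("labelled") unit values
X_train = rng.normal(size=(20000, 8))
uv_train = np.exp(rng.normal(loc=2.0, scale=0.5, size=20000))

# Stand-ins for a new month of declarations with reported unit values
X_new = rng.normal(size=(3000, 8))
uv_reported = np.exp(rng.normal(loc=2.0, scale=0.8, size=3000))

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, np.log(uv_train))          # log scale tames the long-tailed UV distribution

uv_pred = np.exp(model.predict(X_new))
ratio = np.maximum(uv_reported / uv_pred, uv_pred / uv_reported)

suspicious = ratio > 10.0                      # illustrative decision rule, not the production one
uv_edited = np.where(suspicious, uv_pred, uv_reported)
```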
Develop proof of concept
The goal was to develop a new E&I method. Initial exploration showed an XGBoost-based model approach seemed promising. The key thing to mention is that the development of the Proof of Concept (test an ML approach) came out naturally from the business need for a better performing E&I method. It was not the reverse (e.g., find a use case to test an ML approach).
An important obstacle at early stages was the low quality of the initial training data. If the signal is bad, no amount of modelling (no matter how advanced) is likely to make any sense of it. Improving the quality of the labelled data was key. This has proven true in subsequent applications of machine learning as well.
Prepare a Comprehensive Business Case
The business needed an improved system for E&I of unit value and wanted this project to go ahead. A key point is that the need was not for a modern method. The need was better results. Such business cases are standard and don’t really require anything special or new.
Deploy the model
The model deployment was done through a production-ready Windows R server accessible by the rest of the International Trade programmes and system. The deployment of the model went fairly smoothly. Much discussion occurred with IT partners to come up with the best way to deploy.
Obtaining the needed R servers did take some time, but this occurred in parallel over the course of the project as part of a broader initiative to have R/Python servers that could be used for official production.
Results
The model has not yet been deployed in production. The model results have been tested with various metrics on held-out test data from new months not seen by the model. The results were checked in a quality assurance (QA) dashboard developed by an independent team. Metrics checked include:
Mean absolute error (MAE).
Mean squared error (MSE).
Various classification-based metrics (such as the fraction of points incorrectly edited when they should have been kept as-is, the fraction of points correctly kept as-is, etc.).
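A sketch of how such metrics can be computed on held-out test data (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def qa_metrics(uv_true, uv_edited, edited_flag, error_flag):
    """uv_true: validated unit values; uv_edited: values after E&I;
    edited_flag: records the system changed; error_flag: records that truly needed a change."""
    edited_flag = np.asarray(edited_flag, dtype=bool)
    error_flag = np.asarray(error_flag, dtype=bool)
    return {
        "mae": mean_absolute_error(uv_true, uv_edited),
        "mse": mean_squared_error(uv_true, uv_edited),
        # fraction of correct records that were wrongly edited
        "false_edit_rate": float(np.mean(edited_flag[~error_flag])),
        # fraction of correct records correctly kept as-is
        "kept_as_is_rate": float(np.mean(~edited_flag[~error_flag])),
    }
```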
The model results show significant improvements over the existing system (which was chosen as the benchmark) in a wide variety of product categories. The ML approach performed as well or better than the current E&I approach (UV-Clipping) in all product categories.
Latest status and next steps
The model and project are in their final stages of approval.
The methodology has been approved and reviewed for appropriateness and potential ethical consideration.
A code review was completed to help ensure the robustness of the model code (e.g., it is easily maintainable and well documented).
A formal Service Level Agreement that defines production roles and responsibilities across Production Staff, IT Staff, Data Scientists, and Statisticians has been developed and signed.
The work to deploy the model and integrate it into the current statistical processing systems/flow is complete.
The key remaining task is to brief senior management and train production staff on how to use the QA tools that have been developed to monitor the performance of the models.
After this, the assumption is that the full transition to production (switching over from the existing E&I system to the new system) will be done.
Lessons learned & recommendation
Execution: It is important to first have results from the proof of concept before doing any transition to production work. Some work on making the code “production-ready” was done before we had quality results and it turned out to be a waste of time given changes made later (which were verified on new random samples from newer months). First get the proof-of-concept results working well before working on any transition to production.
Execution: In addition, do not make the code too general and “heavy” before getting quality results. Work was done on having the code work for additional variables and support other algorithms, which turned out to be unused and was deleted later. The code has a specific business objective and should focus on that; it is not a generalised system for doing lots of different things.
IT Infrastructure: Unless a deployment plan is known from the start, begin working early on with IT infrastructure providers to understand what a likely deployment architecture might look like. If it isn’t already available, acquiring the needed infrastructure for production deployment can take some time.
Organisation knowledge of the proposed method: Given the methods are unlikely to be known by many, don’t get caught up in traditional responsibilities of services provision and maintenance. Pool the expertise that does exist and work as a multi-disciplinary matrix team.
Reference
No public resources are available at this time.
Use Case 3
INE (PT) – Anomaly detection and imputation on administrative data
Project overview
Statistics Portugal (INE) started to analyse several approaches for anomaly detection and imputation of data from enterprise invoices (aggregated by enterprise and buyer) provided monthly by the Portuguese Tax Authority.
INE receives this administrative data on a month m with data referenced to m-1, on average around 85 million records per month, covering around 1 million different sellers. The data structure is as follows:
Year | Month | Seller  | Buyer  | Value (€)
2022 | 8     | seller1 | buyer1 | 204,35
2022 | 8     | seller1 | buyer3 | 1154,12
2022 | 8     | seller1 | buyer4 | 115,33
There are some issues with the data, as it may have insufficient coverage depending on the day of the month the data is extracted by the tax authority.
Organisational readiness
The analysis and treatment of administrative data follows a new INE strategy of centralised data processing and treatment: one dataset serves different users.
To achieve this purpose, since 2019, some adjustments have been made in the internal organisation for strengthening the capacity for data management and analysis in two departments: Methodology and Information System Department and Management and Data Collection Department.
Meanwhile, INE management has prepared and encouraged training courses in data science tools, both to empower employees with new knowledge and to bring machine learning methods into their daily work activities.
In the middle of 2020, a new unit was created (Administrative Data Unit, under the umbrella of Data Collection Department), responsible for:
Evaluation and testing the use of new data sources, with a view to improving the quality and consistency of statistical production;
Evaluation of the possibility of replacement of the information collected by surveys or censuses;
Definition of new validation models, consistency, and coherence analysis
Integration of data from various sources
Understand business needs (Who needs what)
For these administrative data to become statistical data, they must be treated and validated, to ensure quality, reliability, consistency, and completeness of the data. In this data cleansing process, we also perform a more in-depth and specific analysis of content, handling anomalous or missing information.
Although this dataset serves many different users, it has a user group that is very interested in the success of this process: the short-term statistics team.
In this case, we have brought the colleagues of this team to frequent meetings where we inform them of the progress and setbacks in the identification of anomalies and their treatment, presenting them with possible solutions and results and showing openness to their contributions and possible proposals.
The users need to have the data available to work within 2 working days. The continuous improvements in the treatment process have allowed the data to be delivered around 30 hours after its transmission.
Assess Preliminary Feasibility
No preliminary feasibility study has been developed, but evolutionary and phased work has been done, involving users, to make the results more robust and accepted by all.
During this process some analysis tools and approaches were used such as:
Data exploration using time series visualisation;
Comparison of the results obtained with survey and extrapolated data;
Comparison of the historical data with annual reported data;
Knowledge and feedback from key users about the potential anomalies identified (some of which could have an explanation).
Develop proof of concept
Identification of missing values and their imputation, applied to the monthly taxable amount of a small but sufficiently relevant set of units, capable of ensuring a remarkable quality improvement in the processed data.
Prepare a Comprehensive Business Case
A solution is needed to solve the problems encountered when the e-invoice data received does not have sufficient coverage (i.e., has many missing values). The model must discriminate between total missing values and partial missing values (abnormally low values and records).
Deploy the model
The process, deployed in R language, is based on the following R-packages:
{tidyverse} for data manipulation,
{targets} for defining a workflow for functional programming,
{isotree} - Fast and multithreaded implementation of Isolation Forest (a.k.a. iForest) for anomaly detection
{imputeTS} for imputing missings in univariate time series,
{ROracle} to create an interface between R and Oracle database,
{tsibble} for time data manipulation,
{fable} and {fabletools} which provide forecasting models for time series,
{RJDemetra} interface for seasonal adjustment software officially recommended for members of the ESS,
{Metrics} for the implementation of validation metrics used in supervised machine learning methods.
In order to evaluate the best nowcasting method, the following models were applied to each of the seller series:
ETS - Exponential smoothing state space model, the best model is chosen automatically;
ARIMA - a variation of the Hyndman-Khandakar algorithm is applied to obtain the best ARIMA model;
NNETAR - Neural network autoregression, which fits an NNAR(p,P,k)m model with a hidden layer;
Prophet - fully automated Facebook forecasting procedure;
X13 - X13-ARIMA method for estimating the seasonal adjustment of time series;
TRAMO-SEATS - method for estimating the seasonal adjustment of time series.
For validation and selection of the models, data from January 2016 to December 2021 was used for training and data from January to May 2022 for testing. The results obtained from each of the models for the test data were compared with the “real” values through validation metrics like RMSE, MAPE and MASE. Historical time series for each of the relevant sellers were corrected for isolated missing values with the Kalman smoothing method or by applying the chain variations of the respective NACE activities.
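The selection step is implemented in R with the packages listed above; purely as an illustration of the comparison logic, here is a small sketch written in Python, with a non-seasonal MASE scaling as a simplifying assumption.

```python
import numpy as np

def rmse(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mape(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

def mase(actual, pred, train):
    """Scaled by the in-sample one-step naive forecast error (non-seasonal simplification)."""
    scale = np.mean(np.abs(np.diff(np.asarray(train, float))))
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs(actual - pred)) / scale)

def score_models(test_actual, forecasts, train):
    """forecasts: dict mapping a model name (e.g. 'ETS', 'ARIMA', 'NNETAR') to its
    forecast array for the test window; returns one row of metrics per model."""
    return {name: {"RMSE": rmse(test_actual, f),
                   "MAPE": mape(test_actual, f),
                   "MASE": mase(test_actual, f, train)}
            for name, f in forecasts.items()}
```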
The procedure now runs monthly for the identification of missing values. For each time series with missing values, the best model is selected for nowcasting. The imputed anomalous values are integrated into the database to be made available to users, with proper identification of the imputation made.
Results
The feedback from the users (in particular the short-term statistics team), on this dataset treatment has been very positive.
The values obtained after the treatment of anomalies have been compared with the values obtained through the survey and are much closer to them than the original values received from the tax authority.
Latest status and next steps
We consider our approach to be conservative but robust as it is based on analysis and imputation of large enterprises which, while they may have diversified behavioural patterns, offer some guarantee of stability.
So, because our focus was on large companies, there are still some issues to resolve among all the other vendors.
Due to the high number of companies involved, we think we will be obliged to use different approaches, according to the different characteristics of enterprises. We are also awaiting final versions of the data from the tax authority which will allow a more accurate assessment of the results obtained.
We are also looking for and testing new methods for nowcasting, for instance an ensemble method.
Lessons learned & recommendation
The involvement of colleagues from both the methodology department and the accounts department (responsible for the STS) has been crucial for a sustained and credible advancement of any process related to data quality improvement.
Reference
No public resources available currently.
Use case 4
SFSO – Imputation using missForest
Project overview
In 2018 an external mandate showed unsatisfactory results for the imputation of fortune variables in the Survey on Income and Living Conditions (SILC) using the IVEware software for the fortune module. The main problem was that distributional accuracy could not be achieved. However, the distribution of the variables is of high interest in this context because the results are used in poverty indexes. Slightly better results could be achieved with knn.
These findings encouraged SFSO’s Statistical methods unit to investigate the quality of the missForest algorithm in a simulation framework and extend it to material and social deprivation variables. A further extension of the simulation tests to income variables has also been decided.
Organisational readiness
The organisation was ready for the change with respect to
the infrastructure,
openness and
the needed skills.
Understand business needs (Who needs what)
The overall business need was clear: The current E&I approach was underperforming. Desire for a new approach was high. The goal was always to develop a better E&I strategy.
Therefore, the aim was to quickly gain insight into the feasibility and the quality of using missForest.
Hence, it was decided to test missForest for the smallest set of variables (fortune) first and extend the tests to material and social deprivation afterwards, because of the relatively high amount of item non-response and the few relevant auxiliary variables at hand for these variables. Only after that were the income variables, which concern by far the biggest number of variables and have the highest item non-response rate, considered in the testing.
However, these last imputations could have an effect on the first two modules, and if the imputation of the income variables is successful, it is advised to re-run the imputations for the material and social deprivation variables and for the fortune variables. Based on our understanding, the filtering questions unfortunately prevent the imputation of all variables at the same time.
The choice of this strategy was also influenced by the available resources.
Approach/method used
The approach to evaluating the performance of the algorithms consisted of a simulation framework in which missing values were generated based on the missingness mechanism observed in the survey data.
knn was used at a preliminary stage. Finally, missForest was used due to its better performance than knn.
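The evaluation itself was carried out in R; the sketch below only illustrates the simulation logic in Python, using a knn imputer and a random-forest-based iterative imputer (a common stand-in for missForest). Masking values completely at random is a simplification of the observed-mechanism approach described above.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(2000, 8))        # stand-in for fully observed households

# Simulate item non-response (MCAR here; the real study mimics the observed mechanism)
mask = rng.random(X_complete.shape) < 0.12
X_missing = X_complete.copy()
X_missing[mask] = np.nan

imputers = {
    "knn": KNNImputer(n_neighbors=5),
    "forest": IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                               max_iter=5, random_state=0),
}

for name, imputer in imputers.items():
    X_imputed = imputer.fit_transform(X_missing)
    mae = np.mean(np.abs(X_imputed[mask] - X_complete[mask]))
    print(f"{name}: MAE on simulated missing cells = {mae:.3f}")
```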
Assess Preliminary Feasibility
Given that the data set is not very large (about 7’300 households) and that, due to filtering, there were only between 2’200 and 5’600 households concerned by the fortune variables, with an item non-response rate between 10% and 15%, it was not certain that an ML algorithm would be appropriate.
The same problem occurred for the 13’900 persons in the net sample for the material and social deprivation variables, with an item non-response rate of about 18%.
However, the simulations showed encouraging results in both cases.
We also had to take into account a questionnaire redesign for the fortune variables (splitting of variables and added range responses) in the simulation of the fortune variables, as those real data were not available at that time.
Furthermore, due to a lot of true zeros for some variables in the fortune module, it was necessary to add an imputation step based on a logistic regression to deal with these zeros. Otherwise, an important part of the imputed values fell outside the range of observed values and the distributions of those variables were distorted.
Develop proof of concept
The proof of concept consisted in the simulation framework.
Prepare a Comprehensive Business Case
The setup of the simulation tests accounting for a questionnaire redesign (splitting of variables and added range responses) proved to be a realistic and comprehensive business case.
The random generation of missingness patterns based on the observed ones needed a lot of resources.
Deploy the model
The model deployment consists of integrating the R code into a SAS production pipeline.
Results
The models have not yet been deployed in production. Validation on the generated missing-values sub-sample has been done by observing:
• Mean absolute error (MAE, called total error in the documentation above).
• Main error: same as MAE but limiting the error to cases involving a change in material deprivation status.
• Confusion matrix.
• Decile boxplots of the error distribution.
• Imputation impact (based on imputing the missing values on the real net sample).
The results show significant improvements over the existing system (which was chosen as the benchmark). The impact on the distribution of the variables of interest and derived indexes showed encouraging results for the fortune module and the material and social deprivation variables.
Latest status and next steps
• The simulation study for the income variables is still going on and has to be finished.
• The validation of the results of the imputation of the income variables by domain experts needs to be done. This step also includes the assessment of the impact on already published results.
• Based on the results of the imputation of the income variables, the fortune variables and the variables on material and social deprivation should be re-imputed.
• Based on the assessment of the impact on the results of the income variables it has to be decided how to handle time series and how to organise the communication with the general public and with stakeholders.
• A formal decision by the general management based on the above-mentioned items might be necessary to implement the missForest imputation algorithm into production.
• It is planned that these imputation tests will be documented in a methodological report.
Lessons learned & recommendation
• Execution: A thorough validation based on a simulation framework is very time consuming. This has to be clear from the beginning.
• Execution: The transition from simulation tests from one variable set to another is not straightforward and is also time consuming.
• IT Infrastructure: no issue so far.
• Organisation knowledge of the proposed method: no issue so far.
Reference
For the simulation tests of the material and social deprivation variables, see https://unece.org/sites/default/files/2022-10/SDE2022_S4_Switzerland_Bianchi_AD.pdf.
Otherwise, there is no public documentation available at the moment.
Use case 5
Statistics Sweden - Imputation of Occupation in the Occupational Register
Project overview
The Swedish statistics on Occupation come from the Occupational Register, which contains information on the occupation of individuals. The occupational information is intermittently collected from businesses, and is therefore subject to missing values, especially for younger and older individuals. Imputation of occupational information can reduce the proportion of missing values.
The current model for imputation of Occupation is becoming obsolete and a new model needs to be developed. In addition, the population for occupational statistics is to be expanded, which may increase the number of missing values. To address this, Statistics Sweden has developed a machine learning model for imputation of Occupation. The model uses register variables on the individual level and the employer level to predict Occupation.
The development of a machine learning model for imputation follows the strategic and operative goals of Statistics Sweden, which emphasise the use of machine learning for automated methods such as imputation.
Organisational readiness
The organisational readiness of Statistics Sweden varies. The expertise in statistical methodology and data science needed to develop machine learning models is good. The machine learning IT infrastructure is less developed.
Statistics Sweden has developed a process on development and implementation of ML methods. The process is accessible in the statistical production system of Statistics Sweden. Further development of the process includes additional process steps on assessing business needs, quality requirements, and prerequisites, and on the monitoring of ML models. The process may be used to support the development of machine learning models.
Understand business needs (Who needs what)
The project was initiated by subject matter experts for the Occupational Register. The aim of the project is to replace the outdated imputation model with a new model. Imputation is needed to address the issue with missing values in the Occupational Register. If the imputed values have the same quality as the other observations in the register, the quality of the statistics will increase.
Assess Preliminary Feasibility
The model utilises several register variables to predict Occupation. It is likely that traditional imputation methods would be less successful in realising the potential of the auxiliary information to predict Occupation; hence, it was decided at the initiation of the project to use a machine learning approach. This is also in line with the strategy of Statistics Sweden.
We considered only tree-based methods, i.e., random forest and gradient boosting, because such methods have shown good performance on similar problems previously.
Develop proof of concept
The development of a proof of concept was integrated into the overall development work and took place during its early stages. Because the predictive performance of the model was lower than stakeholders expected, it was decided that we should aim to impute Occupation to facilitate the production of statistics instead of aiming for individual-level accuracy.
Approach/method used
The model was trained on data from the 2019 Occupational Register on the gainfully employed population 16-74 years old. Features were extracted from the variables in the register. The random forest model was used because it showed similar predictive performance to the gradient boosting model and needed fewer resources for training.
Evaluation of the model was done with respect to individual predictive performance, class level predictive performance, and the effects on the statistics. The individual predictive performance was evaluated using accuracy, precision, recall, and F1. The class level predictive performance was evaluated by simulating the missing data mechanism in validation data and replacing simulated missing values with imputed values, which facilitated the joint evaluation of the missing data mechanism and the quality of the imputed values. The effects on the statistics were also evaluated using simulated missing values and by imputing values on previously missing data and considering the effect on the distribution of Occupation.
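A minimal sketch of the modelling and individual-level evaluation step, assuming scikit-learn and synthetic stand-ins for the register features and occupation codes (the real feature set and class structure are far richer):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(50000, 12))          # stand-in for individual- and employer-level features
y = rng.integers(0, 40, size=50000)       # stand-in for occupation codes

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_val)

accuracy = accuracy_score(y_val, pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_val, pred, average="macro")
```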
Prepare a Comprehensive Business Case
The business case was successful because the task was clearly formulated from the outset. However, modifications had to be made with respect to the expected outcome and performance of the model.
Deploy the model
The model is yet to be deployed in production.
Results
The model is yet to be deployed in production.
Latest status and next steps
The project is currently in the deployment phase.
Lessons learned & recommendation
The project has highlighted the need for a process to facilitate the assessment of business needs, quality requirements, and prerequisites. If such a process had been in place, it would have been clear from the outset how to proceed with respect to the measured performance of the model. In addition, the project would have benefitted from further clarification of the expected use of the imputed values.
Reference
Use case 6
Bank for International Settlements - Time Series Outlier Detection using Metadata and Data Machine Learning in Statistical Production
Project overview
The BIS Data Bank is a data warehouse hosting more than sixty thousand macroeconomic and financial time series.
Data quality checks currently in place in the BIS Data Bank identify outliers relying on traditional statistical methods (e.g., standard deviation bands). These methods are typically based on predefined thresholds, which may not be suited for time series with linear breaks, such as financial time series. Furthermore, they do not allow for contextual outlier detection (e.g., using cross-country data for the same indicator, which is largely available in the BIS Data Bank).
We propose a new method relying on machine learning that performs outlier detection taking into account also related time series. Our method has two main steps. First, time series are clustered based on their metadata and data. Second, contextual outlier detection is performed for each cluster. Our proposal aims to improve the current statistical production pipeline for the BIS Data Bank.
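A conceptual sketch of the two-step idea is given below; the feature construction, number of clusters and per-cluster detector are all illustrative assumptions rather than the method actually being developed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in: one row per time series, combining encoded metadata (country, indicator,
# frequency, ...) with simple data features (e.g., growth-rate summaries)
series_features = rng.normal(size=(600, 10))

# Step 1: cluster series based on their metadata and data
scaled = StandardScaler().fit_transform(series_features)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(scaled)

# Step 2: contextual outlier detection within each cluster
scores = np.empty(len(scaled))
for c in np.unique(clusters):
    idx = clusters == c
    detector = IsolationForest(n_estimators=100, random_state=0).fit(scaled[idx])
    scores[idx] = -detector.score_samples(scaled[idx])  # higher = more anomalous within the cluster
```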
Organisational readiness
As the new method is not yet deployed in a production pipeline, it did not require specific organisational arrangements. However, synergies between IT and business teams are key to facilitating the deployment of innovative solutions, most of which are already available at the BIS (e.g., Python workbench, connectors to access the data, Azure DevOps).
Understand business needs (Who needs what)
The BIS Data Bank is undergoing a migration process. A reshuffle of the current in-house FAME-based software is ongoing, towards a Python-based solution to perform most of the tasks covered by the Generic Statistical Business Process Model (GSBPM). The goal is to improve the overall efficiency of the existing statistical pipelines (e.g., less manual intervention, better DQM). The new ML-based outlier check could be leveraged in this context.
Assess Preliminary Feasibility
The early stages of the project include an in-depth comparison of the new method against the current one, with a focus on accuracy/data quality. Other key aspects are optimisation of manual intervention and domain-specific knowledge (e.g., for the choice of ML algorithms), generalisation of the model (e.g., to micro/unstructured data), code transparency, and black-box and lock-in issues. For the full development of the PoC, other considerations will be required: performance and the ML pipeline setup.
Develop proof of concept
After the initial feasibility assessment, we aim to deliver the Proof of Concept on a limited but composite sample of the BIS Data Bank and to benchmark it against the current checks. This stage will require more rigorous tuning of the algorithm and checking of its performance.
Approach/method used
To prototype our method, we plan to test the accuracy of the model against multiple data types (indexes/prices, stock/positions, flows/transactions), parameters and pre-processing techniques (e.g., scaling cannot be applied across all data types). We will also tune the frequency of the checks against the update frequency of the underlying data and test the performance.
Prepare a Comprehensive Business Case
At this stage, the main driver of the project is to provide a better solution to increase productivity and reduce manual intervention on DQ checks.
Deploy the model
Not applicable
Results
Not applicable
Latest status and next steps
The method is not in production yet. The next stage is to further enhance the outlier detection algorithm and develop a Proof of Concept.
Lessons learned & recommendation
Not applicable
Reference
Use Case 7
Statistics Spain (INE):
Early Estimates of the Industrial Turnover Index using Statistical Learning Algorithms
Project overview
The final aim of this project is to obtain early estimates of the Industrial Turnover Index (ITI) even before finishing the data collection and data editing processes, thus improving the timeliness but keeping the accuracy of the early estimation under control. Currently, the dissemination of the index is carried out around 51 days after finishing the monthly reference period. However, the response rate is around 75% 21 days after finishing the monthly reference period. So, it was considered to explore new methods to provide more timely information. These new methods amount to performing fine-tuned mass imputation in the microdata set for those sampling units not yet collected. This way, the index is obtained combining the units already collected and edited together with the imputed values. The estimation error is also computed.
Collaboration with the subject matter experts is essential to include highly relevant information in the estimation process and to deal with some issues raised during the project.
The pilot prototype was developed in 12 months, and it is already finished.
Organisational readiness
Statistics Spain is open to ideas regarding modernisation and innovation. The organisation provides the possibility to set up collaborations among different units (IT, methodology, and domain experts). There also exist several internal working groups on specific issues such as seasonal adjustment, National Accounts and short-term business statistics, temporal disaggregation, etc. There is also a substantial number of specialised personnel with good expertise in their specific areas.
However, regular production of official statistics according to the National Statistical Plan and the European Statistics Programme constitutes the top priority, and activities are strongly oriented towards this goal, so it is challenging to modify or introduce novelties in the statistical production processes. This also entails challenges and non-negligible efforts to implement and maintain new statistical products. The main challenges to deploying new proposals can be briefly summarised as (i) the lack of some professional roles or skills regarding Machine Learning techniques at an institutional scale and (ii) the lack of computational resources and structures appropriate for the execution of new computationally demanding methods at an institutional scale. These challenges are increasingly tackled with measures such as the organisation of internal courses on programming languages for modern data analysis techniques and the deployment of centralised computational facilities with these languages.
Understand business needs (Who needs what)
There is a huge need for improving timeliness in the production of official statistics. Short-term economic statistics are especially relevant to obtain fast economic indicators. Then, having early estimates of the industrial turnover index and similar short-term business statistics is relevant both for internal users such as National Accounts Departments and for external users and stakeholders as well. Furthermore, the need for timely information has become extremely obvious in recent times of uncertainty under a global pandemic.
Assess Preliminary Feasibility
Some assessments were made at the beginning of the project to evaluate the viability of this product in terms of quality, especially timeliness. The idea of performing imputation using machine learning techniques was clear from the first moment due to the versatility and predictive power of these methods. However, some preliminary analyses were made to choose the best model for the specific problem at hand, namely, both the target variable and most of the regressors are continuous. After trying different models (with some preliminary testing), a gradient boosting algorithm was chosen.
In order to develop the proof of concept, the available resources (both IT and human resources) were tight. Sound methodology and good-enough accuracy were prioritised over fine-tuned models, to save time and computational demands. There was no detailed evaluation in advance of all the required resources and their availability to deploy the pilot study in production, because the priority was to assess the viability of the underlying ideas and the general approach.
The developers of the prototype worked with PCs implementing the source code in R language. Expertise in ML techniques has been gradually improved thanks to the participation in international projects.
Develop proof of concept
The development of the proof of concept was carried out with real survey data of the Spanish Turnover Index from October 2017 to December 2021. For each successive month, the statistical model was trained with data from the past time series and applied in turn to the reference time period, emulating real-life production conditions. Accuracy was assessed against real validated data from the survey. Notice that predicted values can always be compared to real validated values after the whole survey compilation and execution is over. The model and estimates are continuously updated as data is made available to domain experts from the data collection and data editing stages. This process was executed in a batch for 60 consecutive months.
At this point, the subject matter expert knowledge was recognised as fundamental for the information representation step. Feature engineering incorporated most of this knowledge. After encoding, 287 regressors were built based on 10 variables, using both the reference period values and historical values.
Interestingly enough, to compute the estimation error and to cope with the different statistical behaviour of sampling units (business populations are highly skewed), the exchangeability hypothesis was dropped, introducing some changes in the standard computation of prediction errors with these techniques.
The pilot implementation was refactored to allow for an iterative incremental computation and updating of the time series, thus bringing the pilot closer to production. Iterations can run parallel to data collection conditions (daily, weekly…). The increase in complexity is justified because of the versatility and adaptability to real-life production conditions.
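A highly simplified sketch of one monthly iteration of this idea, assuming scikit-learn and synthetic stand-ins; the real prototype uses 287 engineered regressors, the survey design and a tailored error computation that are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stand-ins: historical months with fully validated turnover, and a reference month
# in which only around 75% of units have responded so far
X_hist = rng.normal(size=(30000, 15))
y_hist = np.exp(rng.normal(loc=5.0, scale=1.0, size=30000))

X_ref = rng.normal(size=(1500, 15))
responded = rng.random(1500) < 0.75
y_ref_collected = np.exp(rng.normal(loc=5.0, scale=1.0, size=1500))

# Train on past validated data, then impute turnover for units not yet collected
model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
model.fit(X_hist, np.log(y_hist))
y_imputed = np.exp(model.predict(X_ref[~responded]))

# Early estimate of total turnover combining collected and imputed units
# (the published index would express this total relative to a base period)
early_total = y_ref_collected[responded].sum() + y_imputed.sum()
```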
Prepare a Comprehensive Business Case
In this project the proof of concept was done with a comprehensive business case. The domain expert team was involved and collaborative all along. They provided all the needed data, knowledge, and subject-matter support. They were involved in the project, and they were aware of the implications concerning the great improvement in quality.
Deploy the model
The project is potentially ready for deployment in production using the development code, which is not optimal and provides room for noticeable refactoring (memory usage, I/O optimisation, etc.). Thus, it is preferable and advisable to revise the implementation to be adapted to a MLOps platform connected to the data collection process. Model optimisation through hyperparameters fine-tuning, regularisation and other model selection techniques is also advised.
Results
The model has not yet been deployed in production. The results of the proof of concept are shown in a Shiny dashboard: https://sandra-ba.shinyapps.io/Advanced_ITI_indices_v1/
The development code is available and shared on GitHub: https://github.com/david-salgado/AdvITI
Latest status and next steps
The project is currently on pause, waiting for the resources to take the leap to production. There is a full development prototype implemented, which could be used to publish the pilot as experimental statistics. Nonetheless, results so far have triggered complementary methodological considerations regarding response burden reduction, non-response treatment, and imputation beyond the sample (in the population frame).
In collaboration with two Spanish Universities, the next steps to be carried out in the following years will be to revise the machine learning methods as well as the hyperparameters to try to improve the current results. New regressors will be defined and new data sources will be included in the project. Finally new uses of the mass imputation of the microdata set will be analysed.
Lessons learned & recommendation
Methodology of statistical production: The use of statistical learning methods can clearly streamline business functions to improve quality dimensions by reorganising the production process.
IT infrastructure and capacity: The availability of a computational platform and the human resources with the required computing skills are needed both for development and for production.
Organisational knowledge of the proposed method: It is important that the product is spread not only to the external users but also internally. In this case, we have done a working paper to share the details.
Maintenance of the method once in production: We highly recommend that the unit in charge of support in production is planned for ahead of time and involved from the first steps of the project.
Acceptance of the method by business areas: Since collaboration with subject matter experts was established from the beginning of the project, acceptance and support have been achieved, and they are aware of the importance and the need for the new product. Their contribution has been essential in the development to overcome difficulties.
Reference
A working paper with a full description and some results is published in the INE webpage: https://www.ine.es/ss/Satellite?c=INEDocTrabajo_C&p=1254735116586&pagename=ProductosYServicios%2FPYSLayout&cid=1259953795823&L=1
Second Position of the 2022 IAOS Prize for Young Statisticians: