Chapter 1. Introduction
Data cleaning and editing are essential components of ensuring the quality of official statistics. However, finding and correcting errors in datasets can be time-consuming. The growing size of modern datasets makes manual intervention increasingly infeasible, and new types of data may not be well served by traditional methods. Machine learning methods have strong potential to provide solutions to these challenges.
Editing and imputation were identified as some of the most obvious use cases for machine learning by the High-Level Group for the Modernisation of Official Statistics (HLG-MOS). Its first project on machine learning was conducted in 2019 and 2020. The report from this project concluded that editing and imputation were valid use cases for machine learning in the production of official statistics [1]. However, since then, agencies have been slow to adopt machine learning methods for editing. Rather than discussing or proposing technical approaches to editing using machine learning, the Applying Data Science and Modern Methods (ADSaMM) Data Editing task team decided that it would be more valuable to examine some of the blockers preventing the adoption of these methods and suggest some guidelines for overcoming them. We decided to pursue this by gathering use cases from official statistics agencies to understand what the biggest difficulties were and how they had been overcome in each agency.
The process followed by the task team was as follows:
- We developed a template for gathering use cases. This was initially framed around the steps in the journey from experimentation to development for machine learning methods described in Chapter 5 of Machine Learning for Official Statistics [2]. Some adjustments were made to this initial version after gathering a couple of examples and determining which elements the task team found most useful. The team also eventually decided to incorporate a short technical description of the methods used in each use case, as this was deemed of considerable interest. The template is included in Appendix 2 for reference.
- We identified potential use cases and asked the agencies involved to fill in the template. Early members of the team provided a small number of use cases that served as examples to assist subsequent use case development. The most fruitful source for identifying potential use cases was the agenda of the UNECE Machine Learning for Official Statistics 2023 Workshop [3]. In most cases, agencies that provided a use case also gave a short presentation to the team about their use case and provided a staff member to join the team. In total, seven use cases were gathered.
- We assessed the use cases to extract key themes and identify areas where there were blockers to the implementation of machine learning for editing. We used the use cases to write brief descriptions of these key issues and provide guidelines for overcoming them. We then edited these guidelines, along with a selected set of use cases (those that agencies agreed could be included), to create a coherent document.
This document is the outcome of that work. Sections 2.1 to 2.6 contain reflections on each of the key issues we identified. Appendix 1 contains information on the implementation of ML Operations (MLOps). Appendix 2, as noted above, contains the template used for constructing the use cases, while Appendix 3 contains the complete set of use cases gathered as part of the task team's work. We hope you find this information useful.
The chair would like to thank all members of the Data Editing task team for their contributions to this work, and to extend special thanks to all the agencies that supplied use cases.