Using AI And ML For Synthetic Test Data Generation


MidJourney Generated Graphic. Prompt: double exposure of futuristic city and database diagram infographic drawing style


Feb 21, 2024

(Part 3 in a 3-part series on Test Data in Government Agencies)

In this third and final installment of the series, I shift focus towards a groundbreaking solution in test data generation for complex and government legacy systems: the application of Artificial Intelligence (AI) and Machine Learning (ML).

My previous discussions, “Why Test Data is So Hard in Government Organizations” and “The Case for End-to-End Synthetic Data Generation in a Government Organization,” have underscored the crucial need for high-quality test data that is both realistic and secure, especially in the intricate governmental framework. Traditional strategies like using live data, synthetic data, or data masking each have significant drawbacks, ranging from privacy risks to inadequacies in reflecting real-world complexities. AI and ML emerge as powerful tools, offering the ability to intelligently analyze patterns, predict trends, and simulate detailed scenarios. This capability is pivotal in creating synthetic data that accurately represents operational conditions while maintaining security and adaptability.

This approach promises a significant transformation in the landscape of test data generation, particularly for government legacy systems. I will examine how AI and ML can be integrated into these systems to enhance testing processes and outcomes, ensuring that the data generated is not only representative of real-world scenarios but also aligns with the evolving needs of complex government organizations.

How AI and ML Changed the Test Data Game

Before the emergence of AI and ML, managing data involving millions of entities with the requisite aging, coherence, and structural integrity was a daunting task, limited by human capacity for processing and analysis. AI and ML have not only made this feasible but have also significantly enhanced the efficiency, accuracy, and scalability of data management processes.

What AI/ML Techniques Allow Us to Do

  • Handling Vast Data Sets: Traditional data processing methods are limited in their ability to analyze the sheer volume of data that involves millions of entities. AI and ML, however, thrive on large data sets, with their performance often improving as the amount of data increases.
  • Complex Data Aging: Aging data, or updating it to reflect changes over time, requires understanding and predicting complex patterns. This is particularly challenging in dynamic fields like governmental systems, where citizen data evolves due to factors like demographic changes, aging, and various other tracked datapoints. AI and ML excel at detecting and learning these intricate patterns, enabling them to accurately age data based on realistic scenarios.
  • Ensuring Data Coherence: Making sense of massive data sets and maintaining logical consistency was a significant challenge before AI/ML. AI algorithms can analyze vast amounts of data to find correlations and causal relationships, ensuring the data makes sense and reflects real-world scenarios.
  • Structural Soundness: Maintaining structural integrity in large data sets involves ensuring that all relationships and hierarchies within the data are correctly represented and remain consistent as the data is updated. ML algorithms are particularly adept at identifying anomalies or errors that could compromise structural soundness, something that would be incredibly labor-intensive and error-prone if done manually.
  • Scalability and Adaptability: AI and ML systems can scale up to handle increasing volumes of data, a task that would be prohibitively time-consuming and complex for human analysts. They also adapt to new data types and structures, providing flexibility that traditional data processing methods lack.
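
To make the structural-soundness point concrete, here is a minimal sketch of a referential-integrity check on a toy data set. The record layout (`members`, `claims`, `member_id`) and the `find_orphan_claims` helper are hypothetical names chosen for this illustration. It is a simple rule-based check, not a machine-learning model, and only shows the kind of broken relationship such tooling would need to catch.

```python
def find_orphan_claims(members, claims):
    """Return claims whose member_id does not match any known member."""
    known_ids = {m["member_id"] for m in members}
    return [c for c in claims if c["member_id"] not in known_ids]


# Hypothetical toy data: two members, three claims, one of them orphaned.
members = [
    {"member_id": "M001", "name": "Test Person A"},
    {"member_id": "M002", "name": "Test Person B"},
]
claims = [
    {"claim_id": "C100", "member_id": "M001", "amount": 120.00},
    {"claim_id": "C101", "member_id": "M002", "amount": 75.50},
    {"claim_id": "C102", "member_id": "M999", "amount": 310.25},  # no such member
]

orphans = find_orphan_claims(members, claims)
print([c["claim_id"] for c in orphans])
```

At the scale of millions of entities the same idea would more likely run as a database query or a learned anomaly detector, but the invariant being protected is the same.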

A Pivotal Point in Time

Building on the capabilities of AI and Machine Learning (ML) in handling vast and complex data sets, we find ourselves at a pivotal point where the development of sophisticated simulations for data extraction and testing becomes not just feasible, but highly efficient. The transition from merely managing massive data repositories to creating dynamic simulations is a natural progression in this technology-driven landscape. With AI and ML’s proven proficiency in aging data, ensuring coherence, and maintaining structural integrity, these technologies are perfectly poised to underpin simulations that can mimic real-world scenarios with remarkable accuracy. Such simulations, fueled by the power of AI and ML, enable us to generate, manipulate, and extract data in ways that were previously unimaginable. This leap forward opens up new horizons for testing and analysis, allowing for the creation of virtual environments where hypotheses can be tested, patterns can be observed, and predictions can be made, all within a controlled yet realistic setting. This marks a significant advancement in our ability to utilize data for comprehensive testing, research, and development, pushing the boundaries of what’s possible in data-driven fields.

A SimCity-like Simulation for Generating Test Data

The prospect of using a SimCity-like simulation for ETL (Extract, Transform, Load) processes is genuinely exciting, as it brings a dynamic and interactive element to data management and analysis.

In such a simulation, each individual, business, and other entity is not just a static data point but part of a vibrant, evolving virtual city. Imagine a simulated environment where individuals have jobs, spend money, use services, and interact with businesses and other entities in a multitude of ways. Each interaction generates data, mirroring the complexity and richness of real-world data generation. This setup offers an unparalleled opportunity to not only extract and process data but also to observe and analyze how data flows and changes over time in a lifelike ecosystem.

One of the most intriguing aspects of this approach is the ability to age data and evolve the simulation over time. As the simulated time progresses, individuals age, their life circumstances change, businesses grow or fail, and new entities emerge. This dynamic aging process introduces a temporal dimension to the data, allowing for longitudinal studies and the analysis of trends over time. For instance, one could observe how the purchasing behavior of an individual changes as they move from being a student to a working professional and then to retirement. Similarly, the evolution of businesses and their relationships with customers and other businesses can be tracked, providing insights into economic and market dynamics. This kind of simulation provides a rich, multifaceted environment for testing and refining ETL processes, as it generates diverse, changing data that poses various challenges typical of real-world data management scenarios.

Incorporating relational dynamics between entities adds another layer of complexity and realism. People form relationships with businesses (as customers, employees, or suppliers), with public services (as beneficiaries or contributors), and with each other (forming social networks). These relationships can influence the data generated – for example, a person’s purchasing habits might change based on their social connections or employment status. By simulating these intricate relationships and observing how they affect data generation and flow, one can gain deeper insights into the interconnectedness of various data points. This, in turn, can lead to more sophisticated and realistic ETL processes, capable of handling the nuanced data scenarios found in the real world.
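
The aging idea can be sketched in a few lines. The example below is a deliberately tiny simulation step, assuming a hypothetical `Person` class, made-up life-stage thresholds, and invented spending figures. It only illustrates how advancing simulated time changes the data each entity emits. It does not model relationships between entities.

```python
from dataclasses import dataclass


@dataclass
class Person:
    person_id: str
    age: int

    @property
    def life_stage(self) -> str:
        # Illustrative thresholds only; a real model would derive these from data.
        if self.age < 22:
            return "student"
        if self.age < 65:
            return "working"
        return "retired"


# Assumed annual spending per life stage (made-up numbers).
ANNUAL_SPEND = {"student": 8_000, "working": 30_000, "retired": 20_000}


def advance_one_year(people, year):
    """Age everyone by one year and emit one spending record per person."""
    records = []
    for p in people:
        p.age += 1
        records.append({
            "year": year,
            "person_id": p.person_id,
            "life_stage": p.life_stage,
            "spend": ANNUAL_SPEND[p.life_stage],
        })
    return records


people = [Person("P1", 20), Person("P2", 63)]
history = []
for year in range(2024, 2027):
    history.extend(advance_one_year(people, year))

for row in history:
    print(row)
```

Over the three simulated years, `P1` moves from student to working and `P2` from working to retired, so the emitted records shift with each person's life stage.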

HealthcareRUs Insurance Company: A Case Study

The system is tailored for a fictional insurance company, HealthcareRUs, operating in a simulated state, “East Dakota.” This setting provides a controlled environment to test various healthcare scenarios.

Healthcare serves as an ideal use case due to:

  • Universal Relevance: Most people have some interaction with healthcare systems.
  • Complexity: The healthcare industry involves intricate relationships between diverse data types.
  • Privacy Concerns: Handling Personal Health Information requires adherence to strict privacy regulations.

The proposed IT system aims to address these challenges by simulating a population for healthcare data testing, particularly for an insurance company. This system is designed to be robust, adaptable, and realistic, ensuring that it can handle the intricacies of healthcare data management and compliance with privacy regulations.

System Components

  1. Data Store: A central repository for storing all simulated data.
  2. AI-based Engine: Responsible for maintaining data integrity, ensuring realistic relationships between data types, and aging data as per realistic demographics and health trends.
  3. ETL (Extract, Transform, Load) Facility: This component extracts data from the simulation, transforms it into usable formats (such as claim forms or enrollment forms), and loads it into the test system for further analysis.
  4. Legacy Test System: The eventual home for the generated synthetic test data.

Step 1: Establishing the Simulation: Simulating the Population

The population in East Dakota is modelled to reflect realistic demographics and healthcare interactions. Key entity types include:

  • Members: Split into primary subscribers and their dependents, using healthcare services.
  • Employers: Providing insurance to their employees.
  • Providers: Doctors and hospitals submitting claims to the insurance company.

The AI system plays a pivotal role in aging and generating data based on machine learning-derived statistics. This allows for adaptive modeling to reflect changes in population trends, such as increased life expectancy. The system should produce data in a quantity that would be useful for systems testing and simulate various scenarios, including:

  • Claims Processing: Generating realistic insurance claims.
  • New Sign-ups: Creating data for new insurance subscribers.
  • Life Events Simulation: Including aging off insurance, deaths, births, and marriages.
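
A rough sketch of the life-events scenario follows. Everything numeric here is an assumption made for illustration: the flat death and birth rates, the dependent cutoff age of 26, and the record fields. A real engine would replace these with learned, age-dependent statistics. Marriages are left out to keep the example short.

```python
import itertools
import random

DEPENDENT_MAX_AGE = 26    # assumed cutoff at which dependents age off coverage
ANNUAL_DEATH_RATE = 0.01  # made-up flat rate, for illustration only
ANNUAL_BIRTH_RATE = 0.02  # made-up rate per primary subscriber

_newborn_seq = itertools.count(1)


def simulate_year(members, rng):
    """Apply one year of life events; return (remaining members, event log)."""
    survivors, events = [], []
    for m in members:
        aged = {**m, "age": m["age"] + 1}
        if rng.random() < ANNUAL_DEATH_RATE:
            events.append(("death", aged["id"]))
        elif aged["role"] == "dependent" and aged["age"] >= DEPENDENT_MAX_AGE:
            events.append(("aged_off", aged["id"]))
        else:
            survivors.append(aged)
            if aged["role"] == "subscriber" and rng.random() < ANNUAL_BIRTH_RATE:
                child = {"id": f"NB{next(_newborn_seq)}", "role": "dependent", "age": 0}
                survivors.append(child)
                events.append(("birth", child["id"]))
    return survivors, events


rng = random.Random(42)  # fixed seed so a test run can be reproduced
population = [{"id": f"S{i}", "role": "subscriber", "age": 30 + i % 30} for i in range(100)]
population += [{"id": f"K{i}", "role": "dependent", "age": 20 + i % 6} for i in range(50)]
start_size = len(population)

log = []
for _ in range(5):
    population, events = simulate_year(population, rng)
    log.extend(events)

counts = {kind: sum(1 for k, _ in log if k == kind) for kind in ("birth", "death", "aged_off")}
print(counts, len(population))
```

Seeding the random generator is a deliberate choice: a test data set that cannot be regenerated exactly is much harder to debug against.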

In simulations where data population is key, AI and ML play a crucial role in using statistical methods to validate the accuracy and relevance of the populated data. The system can utilize AI and ML algorithms to conduct thorough statistical analyses, ensuring the data reflects realistic patterns and behaviors. These algorithms can apply a range of statistical techniques, from basic descriptive statistics to more complex predictive models, to assess the data’s validity. For instance, they can compare the distribution of simulated data against known real-world distributions, identifying any discrepancies or anomalies. AI and ML can also employ correlation analysis to ensure that the relationships between different data points in the simulation accurately mirror those found in real-world scenarios. This statistical validation is essential, particularly in dynamic fields like healthcare or government operations, where the accuracy of the simulation directly impacts the effectiveness of subsequent analyses or decisions. By leveraging AI and ML in this capacity, the system enhances the trustworthiness of the simulated environment, ensuring it serves as a reliable foundation for testing, analysis, and decision-making processes.
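
The distribution comparison described above can be illustrated with a basic goodness-of-fit check. This sketch computes a Pearson chi-square statistic for simulated ages grouped into bands. The reference proportions are placeholders, not real census figures, and the band boundaries and function names are assumptions made for the example.

```python
from collections import Counter

# Assumed reference proportions per age band (illustrative, not real census data).
REFERENCE = {"0-17": 0.22, "18-64": 0.61, "65+": 0.17}

# Chi-square critical value for 2 degrees of freedom at the 5% significance level.
CHI2_CRIT_DF2_5PCT = 5.991


def age_band(age):
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


def chi_square_statistic(ages, reference):
    """Pearson chi-square statistic of observed band counts vs. expected counts."""
    observed = Counter(age_band(a) for a in ages)
    n = len(ages)
    return sum(
        (observed.get(band, 0) - n * p) ** 2 / (n * p)
        for band, p in reference.items()
    )


good = [10] * 22 + [40] * 61 + [70] * 17    # matches the reference proportions
skewed = [10] * 60 + [40] * 30 + [70] * 10  # far too many children

for name, sample in (("good", good), ("skewed", skewed)):
    stat = chi_square_statistic(sample, REFERENCE)
    verdict = "ok" if stat < CHI2_CRIT_DF2_5PCT else "distribution mismatch"
    print(f"{name}: chi2={stat:.2f} -> {verdict}")
```

A check like this covers one variable at a time. Validating relationships between variables, as the correlation analysis above suggests, would need additional tests.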

Step 2: Extracting Data from the Simulation Engine for Input into the Test System

The proposed IT system’s simulation engine, especially its ETL (Extract, Transform, Load) capabilities, is a cornerstone feature that enables it to handle complex government and healthcare data scenarios efficiently. Here’s an expanded view of how this ETL process works and the opportunities it presents:


Extract

  • Dynamic Data Gathering: The extraction process begins with dynamically gathering data from the simulated environment. This includes demographic information, healthcare usage statistics, insurance claim data, and more.
  • Real-time Data Capture: Implementing real-time data capture mechanisms can provide up-to-date information, reflecting the most current state of the simulated population.

Transform

  • Data Normalization: Transforming the data involves standardizing it to ensure consistency across different data types and sources. This is crucial in government and healthcare, where data comes in various formats.
  • Data Enrichment: The engine can enrich the data by integrating external data sources, such as epidemiological studies or healthcare trends, to add depth and realism to the simulation.

Load

  • Flexible Data Integration: The loaded data should be compatible with various test environments and formats required by healthcare systems, such as electronic health records (EHR) systems, billing software, or analytical tools.
  • Scalable Architecture: The system must be scalable to accommodate large volumes of data, ensuring performance and reliability are maintained.
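
The three ETL stages can be sketched end to end with the Python standard library. In this minimal example, an in-memory list stands in for the simulation's data store and an in-memory SQLite database stands in for the test system. The row layout, the `ED` state code, and all function names are hypothetical.

```python
import sqlite3
from datetime import datetime

# Hypothetical raw rows as they might come out of the simulation data store.
SIMULATED_ROWS = [
    {"member_id": "M001", "name": "  ada example ", "dob": "03/15/1980", "state": "ed"},
    {"member_id": "M002", "name": "Bo Sample", "dob": "1992-07-04", "state": "ED"},
]


def extract(rows):
    """Extract: yield rows from the source (here, an in-memory list)."""
    yield from rows


def normalize_date(value):
    """Accept MM/DD/YYYY or YYYY-MM-DD and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def transform(row):
    """Transform: trim and title-case names, normalize dates and state codes."""
    return (
        row["member_id"],
        row["name"].strip().title(),
        normalize_date(row["dob"]),
        row["state"].upper(),
    )


def load(conn, records):
    """Load: write the normalized records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members "
        "(member_id TEXT PRIMARY KEY, name TEXT, dob TEXT, state TEXT)"
    )
    conn.executemany("INSERT INTO members VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the test system's database
load(conn, (transform(r) for r in extract(SIMULATED_ROWS)))
loaded = conn.execute("SELECT * FROM members ORDER BY member_id").fetchall()
print(loaded)
```

Keeping each stage as a separate function makes it possible to swap the source or the target without touching the normalization rules.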

ETL Creates Common Input Forms

Simulated claims play a vital role in this process, providing a way to test various scenarios in insurance and healthcare service provision. Simulated claims can be effectively inputted into test systems using existing mechanisms, thereby enhancing the testing process without disrupting current workflows.

  • Data Generation: The first step involves generating simulated claims using the Data Store. These claims replicate real-world scenarios, including patient demographics, treatment details, billing codes, and insurance information.
  • Realism and Complexity: The simulated claims must encompass a variety of healthcare scenarios, from routine visits to complex procedures, mirroring the complexity and diversity of actual healthcare claims.
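
As an illustration of generating claims with a realistic mix of routine and complex cases, the sketch below draws claims from a weighted catalog of services. The service codes, prices, and weights are invented placeholders, not real billing codes or rates, and the claim fields are an assumed layout.

```python
import random
from datetime import date, timedelta

# Placeholder service catalog; codes and prices are invented, not real billing codes.
SERVICES = {
    "SVC-ROUTINE": ("Routine visit", 150.00),
    "SVC-IMAGING": ("Imaging study", 900.00),
    "SVC-SURGERY": ("Inpatient procedure", 12_500.00),
}
# Assumed mix: routine visits are far more common than complex procedures.
SERVICE_WEIGHTS = [0.80, 0.15, 0.05]


def generate_claims(member_ids, provider_ids, n, rng, start=date(2024, 1, 1)):
    """Generate n simulated claims with service dates within 365 days of start."""
    codes = list(SERVICES)
    claims = []
    for i in range(n):
        code = rng.choices(codes, weights=SERVICE_WEIGHTS, k=1)[0]
        description, base_price = SERVICES[code]
        claims.append({
            "claim_id": f"CLM{i:06d}",
            "member_id": rng.choice(member_ids),
            "provider_id": rng.choice(provider_ids),
            "service_code": code,
            "description": description,
            "service_date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            # Vary the billed amount by up to +/-10% around the base price.
            "billed_amount": round(base_price * rng.uniform(0.9, 1.1), 2),
        })
    return claims


rng = random.Random(7)  # fixed seed for reproducible test data
claims = generate_claims(["M001", "M002", "M003"], ["PRV01", "PRV02"], 1000, rng)
print(claims[0])
```

In a fuller system the member and provider identifiers would come from the simulated population rather than hard-coded lists, so every claim stays consistent with the rest of the data store.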

Integration into Test Systems

  • Utilizing Standard Formats: The simulated claims are formatted according to industry standards, such as HL7 or EDI formats, ensuring compatibility with existing test systems and software.
  • API Integration: APIs (Application Programming Interfaces) play a crucial role in the seamless transfer of simulated claims data into the test systems. These APIs can be configured to match the data structure and requirements of the test environments.
  • Automated Data Feeds: Setting up automated feeds that regularly input simulated claims into test systems can mimic the flow of real-time data, providing continuous testing and validation opportunities.
  • Batch Processing: For systems that rely on batch processing, simulated claims can be bundled and inputted in batches, allowing for testing under conditions similar to real operational environments.
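
The batch-processing option might look like the following sketch, which bundles simulated claims into fixed-size batches and serializes each batch. JSON Lines is used here purely as a stand-in. Producing actual HL7 or EDI messages would require following those specifications, which this example does not attempt. The helper names and claim fields are hypothetical.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def to_batch_file(claims):
    """Serialize a batch as JSON Lines: one claim per line."""
    return "\n".join(json.dumps(c, sort_keys=True) for c in claims) + "\n"


# Seven toy claims split into batches of three.
claims = [{"claim_id": f"CLM{i:03d}", "amount": 100 + i} for i in range(7)]
batches = [to_batch_file(b) for b in batched(claims, 3)]

print(len(batches), [b.count("\n") for b in batches])
```

Each resulting string could be written to a file and dropped wherever the test system already picks up its batch input, which keeps the existing intake mechanism unchanged.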

Step 3: Running the Test System End-To-End to Generate the Complete Synthetic Dataset

The goal of the system is to process data all the way through so we can create synthetic data throughout the entire system. In this case study, synthetic data is the end result of processing the simulated claims and other types of input by the system.

Processing is where the system synthesizes, manages, and uses data to create a comprehensive and realistic testing environment. This step is crucial for ensuring that the synthetic data is not only representative of real-world scenarios but also adaptable to various testing needs in healthcare.

As an aside, I would be remiss to not mention that when processing data within legacy systems, the integration of AI and ML, coupled with Robotic Process Automation (RPA), opens up a myriad of opportunities for enhancing efficiency and automation. AI and ML algorithms are adept at identifying patterns and inefficiencies in the data flow, which often go unnoticed. These insights are invaluable in pinpointing areas where manual processes can be streamlined or replaced with automation. RPA, in particular, stands out as a powerful tool for automating routine, rule-based tasks that are prevalent in legacy systems. By implementing RPA in conjunction with AI and ML, tasks such as data entry, report generation, and routine queries, which traditionally required human intervention, can now be automated, leading to increased efficiency and accuracy. Furthermore, the adaptive nature of ML models means that they continuously refine these automation processes, uncovering new areas for RPA application as they evolve. This combination of AI, ML, and RPA not only transforms the current functionality of legacy systems but also sets a foundation for ongoing improvement and innovation in data management, turning these dated structures into more agile and efficient operations.

Step 4: Using the Synthetic Data to Test

Synthetic Data for Extensive Testing

  • Broad Spectrum Testing: The processed synthetic data is tailored for a wide range of tests – from claims processing to health policy simulations and healthcare IT systems testing. Its versatility stems from its comprehensive nature, mirroring real-world healthcare scenarios without involving actual patient data.
  • High-Fidelity Simulations: These data sets are meticulously crafted to reflect real-life scenarios, including patient behavior, disease progression, and healthcare system interactions. This high fidelity enables reliable testing and simulation of healthcare dynamics.
  • Innovation and Development: For developers and researchers, synthetic data provides a rich, nuanced, and risk-free environment for exploring system changes, technologies, and claim processing models. It accelerates innovation while ensuring that the development phase is grounded in realistic scenarios.
  • Software and System Testing: Healthcare software, such as EHR systems, benefits greatly from synthetic data. It allows developers to rigorously test software performance, user interface, and data processing capabilities in a controlled yet realistic environment.

Preserving Privacy and Security

  • Non-Existent Data Subjects: The key advantage of synthetic data is that it is entirely fabricated. There are no real individuals behind the data points, ensuring that personal privacy is inherently protected.
  • Regulatory Compliance: By using synthetic data, healthcare entities can comply with stringent data protection laws like HIPAA in the United States or GDPR in the European Union. This data allows for extensive testing without the risk of breaching privacy regulations.
  • Reduced Risk of Data Breaches: Since the data does not correspond to real individuals, the risk associated with data breaches is significantly mitigated. Even in the event of a security lapse, the synthetic nature of the data ensures that no real personal information is compromised.
  • Ethical Research and Development: Using synthetic data aligns with ethical standards in research and development. It eliminates concerns around consent and confidentiality that are inherent in using real patient data.

The processed synthetic data stands as a cornerstone at the intersection of AI/ML innovation, testing, and privacy. It offers a unique solution where exhaustive testing across various facets of a legacy system can be conducted without the ethical and privacy concerns associated with using real patient data. This advancement not only propels technology forward but does so in a manner that upholds the highest standards of data privacy and security.

Concluding Thoughts

The integration of AI and ML in test data management illustrates a broader impact across various domains, including governmental legacy systems. The case of HealthcareRUs Insurance Company serves as a prime example, showcasing how advanced simulations akin to a SimCity-like environment can revolutionize data handling. This approach extends beyond healthcare, offering transformative potential for outdated governmental systems burdened by legacy processes. By adopting similar AI and ML-driven simulations, these systems can generate, extract, and process vast and complex datasets, achieving a level of realism and applicability previously unattainable.

Furthermore, the ETL processes in this simulation model redefine the management of data from static storage to dynamic interaction. The end-to-end testing to generate complete synthetic datasets mirrors real-world scenarios without compromising individual privacy, an aspect critically important in many government sectors dealing with sensitive citizen data.

This advancement in AI and ML technology transcends many industries, suggesting a path forward for modernizing governmental legacy systems. It assures that rigorous testing can proceed without privacy or ethical concerns, as the data involves no real individuals. Consequently, this revolution in data management and testing holds the promise of not only enhancing legacy systems but also revitalizing government operations, ensuring efficiency, privacy, and security. This represents a significant stride in many sectors, pushing the boundaries of what’s possible in managing and utilizing data for progress and innovation.


(This article first appeared on LinkedIn. Re-published with permission from Aaron Francesconi.)

