Synthetic data generation

Generative modelling of medical data for anonymisation purposes

Photo from the SIIM-ISIC Melanoma Classification Kaggle Challenge

Synthetic data generation with GANs

Using healthcare data to develop artificial intelligence (AI) models raises issues of patient privacy and regulatory compliance. Patient data usually cannot be shared freely, which limits its usefulness for building AI solutions.

What and why?

This project explores generative modelling techniques, specifically generative adversarial networks (GANs), for producing synthetic data and examines how synthetic data affects model performance. It also compares machine learning models trained on real data with models trained on synthetic data, both in terms of performance and in terms of data leakage.


Main tasks:

  • test GANs to generate artificial data (images and text),
  • use synthetic data (conditional and unconditional GANs) for balancing classes and examine biases,
  • use augmentation for balancing classes,
  • test different ratios of real to synthetic data (using provided models, with the help of master's students),
  • explain classification results using XAI methods,
  • examine controllability in the latent space (master's students),
  • combine text and image data in a multimodal classification task.
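
Two of the tasks above, balancing classes with synthetic samples and testing different real/synthetic ratios, can be illustrated with a minimal sketch. It is not the project's code: the helper names `synthetic_needed_per_class` and `mix_real_and_synthetic`, the label strings, and the idea of expressing the ratio as a fraction of the training set are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter


def synthetic_needed_per_class(real_labels):
    """Return, per class, how many synthetic samples would be needed
    to bring that class up to the size of the largest class."""
    counts = Counter(real_labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def mix_real_and_synthetic(real, synthetic, synthetic_fraction, size, seed=0):
    """Draw a training set of `size` items in which roughly
    `synthetic_fraction` of the items come from the synthetic pool."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_synth = round(size * synthetic_fraction)
    n_real = size - n_synth
    if n_real > len(real) or n_synth > len(synthetic):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)  # fixed seed so experiments are repeatable
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed


# Hypothetical, heavily imbalanced label set (illustrative numbers only).
labels = ["benign"] * 90 + ["melanoma"] * 10
deficits = synthetic_needed_per_class(labels)

# Placeholder items standing in for real and GAN-generated images.
real_pool = [("real", i) for i in range(20)]
synthetic_pool = [("synthetic", i) for i in range(20)]
train_set = mix_real_and_synthetic(real_pool, synthetic_pool, 0.3, 10)
```

In an actual experiment the per-class deficits would tell a conditional GAN how many images to generate for each class, and the mixing step would be repeated for several values of `synthetic_fraction` so that classifiers trained on each mix can be compared.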

Technologies used: Python, PyTorch

Methods used: Deep Neural Networks, Skin Disease Detection and Recognition, Explainable Artificial Intelligence (XAI), Multimodal Learning


GitHub code:

Medium posts: