Synthetic Data Competition
***Please Note: Registration has CLOSED for the 2018 Hack-A-Thon***
A sizable proportion of today’s data contains sensitive information that renders it difficult to make available to the public. Privacy considerations, especially important among government and intelligence agencies, limit the number of specialists who can process and utilize such data. As a result, vast amounts of sensitive data remain isolated from scientists and machine learning practitioners outside the immediate scope of these agencies.
It would be valuable to the data science community to develop methods for synthesizing new artificial data from an existing dataset while preserving the underlying distributions and relationships of the original data. The value is even greater considering that such synthetic data could be used to robustly train a machine learning model, which could then perform analytics on real-world data.
This year, DATACON 2018 is hosting a data competition sponsored by LMI that focuses on the generation of synthetic datasets to aid model creation and prototyping.
For this challenge, we want you to:
- Create a data synthesizer that generates synthetic data points using only the provided dataset. Generated data should not be a manipulation of the original data (e.g., data scaling or shifting), but rather entirely new, artificial data points. A couple of factors you may want to consider when building your synthesizer:
- Does my synthesized data preserve dependencies between different features? Are the relationships and covariances between different features of the data close to those in the original data?
- Do any of my synthesized data points correspond too closely to original data points, thus revealing sensitive information?
- Could there be any way for us to distinguish between the ‘real’ data and your ‘synthesized’ data?
- Build a predictive model trained on the data you synthesized. Ideally, the model you build will attain high accuracy on your synthesized data as well as on our own original test data (which we will use to judge your model and synthesizer). In practice, a well-built model trained on a representative synthetic dataset should also perform well on our test data. Feel free to use any machine learning method you see fit to build your model. (A predictive question will be provided once the competition opens.)
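The questions above can be turned into simple sanity checks. The sketch below, using random placeholder arrays in place of the actual competition data, compares covariance structure between real and synthetic data and measures how close each synthetic point comes to a real record (very small distances may leak sensitive information):

```python
import numpy as np

def covariance_gap(real, synth):
    """Largest absolute entry-wise difference between the two covariance matrices."""
    return np.max(np.abs(np.cov(real, rowvar=False) - np.cov(synth, rowvar=False)))

def nearest_real_distances(real, synth):
    """For each synthetic point, the Euclidean distance to its closest real point.
    Near-zero distances suggest a synthetic point may reveal a real record."""
    # Pairwise distances via broadcasting: shape (n_synth, n_real)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)

# Toy example with random data standing in for the competition dataset
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
synth = rng.normal(size=(200, 3))
print("max covariance gap:", covariance_gap(real, synth))
print("smallest nearest-real distance:", nearest_real_distances(real, synth).min())
```

A distinguishability check (the third question) can be approximated by training a classifier to separate real from synthetic rows: accuracy near 50% suggests the two are hard to tell apart.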
Though you are free to use any method, there are a couple of techniques you may want to consider, such as Generative Adversarial Networks, Variational Autoencoders, or Gaussian Copulas.
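Of the suggested techniques, the Gaussian copula is the simplest to sketch. The minimal version below (using NumPy/SciPy, on placeholder data) maps each column to normal scores through its empirical CDF, estimates the correlation of those scores, draws correlated normals, and inverts back through each column's empirical quantiles:

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(data, n_samples, rng=None):
    """Sample synthetic rows from a Gaussian copula fit to `data` (2-D array)."""
    rng = np.random.default_rng(rng)
    n, d = data.shape
    # Rank-based empirical CDF values in (0, 1), then normal scores
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # Draw correlated standard normals and map back to each column's quantiles
    draws = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(draws)
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    )

# Demo on random placeholder data (not the competition dataset)
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
synth = gaussian_copula_sample(data, 500, rng=1)
```

Note that this captures only the rank-correlation structure; GANs or VAEs can model more complex, nonlinear dependencies at the cost of harder training.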
Your submission will be evaluated on:
- How well your synthesized data upholds privacy while still being representative of the original data and its structural characteristics, and…
- How well your model performs against the original data test set.
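The second criterion amounts to a train-on-synthetic, test-on-real setup. The sketch below uses scikit-learn with random placeholder data and a made-up label rule standing in for the (not yet released) predictive question:

```python
# Fit a classifier on synthetic data only, then score it on held-out real data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
# Placeholder "real" data: 2 features, label depends on their sum
X_real = rng.normal(size=(300, 2))
y_real = (X_real.sum(axis=1) > 0).astype(int)
# Placeholder "synthetic" data, assumed drawn from the recovered distribution
X_synth = rng.normal(size=(300, 2))
y_synth = (X_synth.sum(axis=1) > 0).astype(int)

model = LogisticRegression().fit(X_synth, y_synth)  # train on synthetic only
acc = accuracy_score(y_real, model.predict(X_real))  # judge against real data
print(f"accuracy on real data: {acc:.2f}")
```

If your synthesizer has faithfully captured the original distribution, the gap between accuracy on synthetic data and accuracy on the real test set should be small.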
Details on the competition, including how to sign up, prizes, timeline, and rules, are included below.
Teams will register in one of two categories:
- Open Category – no restrictions or limitations on who can participate
- Student Category – only current undergraduate, graduate, or PhD students
You will need to indicate the category you are entering at time of sign-up.