HACK-A-THON

Synthetic Data Competition

Sponsored By: LMI

***Please Note: Registration has CLOSED for the 2018 Hack-A-Thon***

A sizable proportion of today’s data contains sensitive information that makes it difficult to release to the public. Privacy considerations, especially important among government and intelligence agencies, limit the number of specialists who can process and use such data. As a result, vast amounts of sensitive data remain isolated from scientists and machine learning practitioners outside the immediate scope of these agencies.
It would be of value to the data science community to develop methods for effectively synthesizing new artificial data from an existing dataset while preserving the underlying distributions and relationships in the original data. The value is even greater when considering that this synthetic data could be used to robustly train a machine learning model, which could then be used to perform analytics on real-world data.

This year, DATACON 2018 is hosting a data competition sponsored by LMI that focuses on the generation of synthetic datasets to aid model creation and prototyping.

For this challenge, we want you to:

  1. Create a data synthesizer that generates synthetic data points using only the provided dataset. Generated data should not be a manipulation of the original data (e.g., data scaling or shifting), but rather generated data that is artificial (entirely new data points). A couple of factors you may want to consider when building your synthesizer:
    1. Does my synthesized data preserve dependencies between different features? Are the relationships and covariances between features close to those in the original data?
    2. Do any of my synthesized data points correspond too closely to original data points, thus revealing sensitive information?
    3. Could there be any way for us to distinguish between the ‘real’ data and your ‘synthesized’ data?
  2. Build a predictive model trained on the data you synthesized. Ideally, the model you build will attain high accuracy on your synthesized data as well as on our own original test data (which we will then use to judge your model and synthesizer). In practice, a well-built model trained on a representative synthesized dataset should also perform well on our test data. Feel free to use any machine learning method you see fit to build your model. (A predictive question will be provided once the competition opens.)
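
The first two synthesizer questions above (are feature relationships preserved, and do any synthetic points sit too close to real ones) can be checked numerically. The sketch below is one possible way to do that with NumPy, not an official scoring method. The function names, the standardization step, and the random stand-in arrays are all assumptions made for illustration, and it assumes purely numeric features.

```python
import numpy as np


def correlation_gap(real, synthetic):
    """Largest absolute difference between the two feature correlation matrices."""
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synthetic, rowvar=False)
    return float(np.max(np.abs(real_corr - synth_corr)))


def nearest_real_distance(real, synthetic):
    """Euclidean distance from each synthetic row to its closest real row."""
    # Standardize with the real data's statistics so no single feature dominates.
    mean, std = real.mean(axis=0), real.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    real_z = (real - mean) / std
    synth_z = (synthetic - mean) / std
    # Pairwise differences have shape (n_synthetic, n_real, n_features);
    # this brute-force approach is only suitable for small datasets.
    diffs = synth_z[:, None, :] - real_z[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Hypothetical stand-in data; replace with the provided dataset and your own output.
rng = np.random.default_rng(0)
real = rng.normal(size=(300, 3))
synthetic = rng.normal(size=(300, 3))

print("largest correlation gap:", correlation_gap(real, synthetic))
print("smallest distance to a real row:", nearest_real_distance(real, synthetic).min())
```

A small correlation gap suggests the dependencies between features carried over, while a nearest-row distance at or near zero suggests a synthetic point is effectively a copy of a real one. What counts as "too close" is a judgment call that depends on the data.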

Though you are free to use any method, there are a couple of techniques you may want to consider, such as Generative Adversarial Networks, Variational Autoencoders, or Gaussian Copulas.
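
As a rough illustration of the last technique mentioned, here is a minimal Gaussian copula sketch. It assumes every column is continuous and numeric, uses only NumPy and Python's standard library, and is not a reference implementation. The function names and the toy two-column dataset are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_norm_ppf = np.vectorize(_STD_NORMAL.inv_cdf)  # inverse standard-normal CDF
_norm_cdf = np.vectorize(_STD_NORMAL.cdf)      # standard-normal CDF


def fit_gaussian_copula(real):
    """Estimate the copula correlation matrix from an (n_rows, n_cols) array."""
    n_rows = real.shape[0]
    # Rank each column (1..n_rows) and map the ranks into the open interval (0, 1).
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    # Transform to normal scores and measure how the columns move together.
    normal_scores = _norm_ppf(uniforms)
    return np.corrcoef(normal_scores, rowvar=False)


def sample_gaussian_copula(real, corr, n_samples, rng):
    """Draw synthetic rows whose marginals follow the empirical quantiles of `real`."""
    n_cols = real.shape[1]
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = _norm_cdf(z)
    # Invert each empirical marginal; interpolation yields values between observed points.
    columns = [np.quantile(real[:, j], u[:, j]) for j in range(n_cols)]
    return np.column_stack(columns)


# Toy example: two correlated columns standing in for the provided dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_samples=1000, rng=rng)
print("real correlation:     ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic correlation:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```

The copula separates each column's own distribution from the way columns depend on each other, so the sampled rows are new points rather than shifted or scaled originals. Categorical columns, ties, and privacy guarantees are not handled here and would need extra work.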

Your submission will be evaluated on:

  1. How well your synthetic data upholds data privacy while remaining representative of the original data and the characteristics of its structure, and
  2. How well your model performs against the original data test set.
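
The second criterion amounts to a "train on synthetic, test on real" check. The sketch below shows the general shape of such a check, assuming a classification question. The actual predictive question and test data are not known here, so the toy data generator, the nearest-centroid classifier, and all function names are placeholders rather than anything the competition prescribes.

```python
import numpy as np


def make_toy_data(n_rows, rng):
    """Two well-separated classes in two features (purely illustrative)."""
    labels = rng.integers(0, 2, size=n_rows)
    features = rng.normal(size=(n_rows, 2)) + 3.0 * labels[:, None]
    return features, labels


def fit_centroids(features, labels):
    """Return the class labels and the mean feature vector of each class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(classes, centroids, features):
    """Assign each row to the class with the nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the true label."""
    return float(np.mean(y_true == y_pred))


rng = np.random.default_rng(0)
synth_X, synth_y = make_toy_data(1000, rng)  # stand-in for your synthesized data
real_X, real_y = make_toy_data(300, rng)     # stand-in for the held-back real test set

# Train only on synthetic rows, then score on both sets.
classes, centroids = fit_centroids(synth_X, synth_y)
synth_acc = accuracy(synth_y, predict(classes, centroids, synth_X))
real_acc = accuracy(real_y, predict(classes, centroids, real_X))
print(f"accuracy on synthetic data: {synth_acc:.3f}")
print(f"accuracy on real test data: {real_acc:.3f}")
```

A large drop from the synthetic score to the real score would hint that the synthesized data is not representative of the original.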

Details on the competition, including how to sign up, prizes, timeline, and rules, are included below.

Teams will register in one of two categories:

  • Open Category – no restrictions or limitations on who can participate
  • Student Category – only current undergrad, grad or PhD students

You will need to indicate the category you are entering at time of sign-up.

 

Timeline:

  • Start Date: Friday, September 28th, 12pm – Participants will be sent information on how to access the data, the expectations for generating synthetic data, and the question to answer using the team-generated synthetic data.
    • Teams do not need to be identified by the 28th and can join at any time prior to the deadline
  • Interim conference calls to address questions: Noon on October 1st, 15th and 29th – conference call information will be emailed to team representatives.
  • Submission Date: Wednesday, October 31st, 5pm
    • Submissions will not be accepted after this date
  • Winners announced: Friday, November 2nd
  • Winners’ presentations: Wednesday, November 7th – During the conference

Awards

  • Overall Winner: $2,000
  • Best Student Team: $1,000
  • Most Creative Video Submission: $500

Rules:

  • The data provided to the teams will represent 10% of a larger dataset. Teams will be asked to generate synthetic data to grow the dataset to the original size. A question will then be posed, and teams should use their newly generated datasets to model the answer.
  • Winners will be assessed on how well teams’ models perform in addressing the question on the original dataset and on the quality and creativity of the presentation of their results in the submission video.
  • Participants may not be on more than one team.
  • Teams may not collaborate with other teams.
  • Code files and a video submission of findings are required and must be submitted by October 31st.
    • Teams can use any public resources; however, code will be validated for originality.
    • Teams should create a video describing their findings. The video should be no longer than 10 minutes and should summarize the process for generating the synthetic data and the model used to address the question.
    • There is no prescribed format for submission of the videos, presentations or code – feel free to be creative and have fun!
  • Student teams can win all three categories; however, all other teams will be limited to the Overall and Best Video categories.
  • Student teams are to be composed of only current undergrad, grad or PhD students; however, seeking outside advice is encouraged.
  • Teams should be limited to no more than five members.
  • Conference participation is not a requirement to compete or win.