
CanPath Synthetic Dataset

The CanPath Synthetic Dataset is a resource for research, education, and practical applications. CanPath provides support and guidance to help educators integrate it into their curriculum.

What is synthetic data?

Synthetic data is designed to replicate the statistical properties and structure of real-world data without compromising privacy. Created through advanced computer simulations and algorithms, synthetic data offers a secure and versatile alternative for researchers and data scientists.

What is the CanPath Synthetic Dataset?

The CanPath Synthetic Dataset was generated to mimic CanPath’s nationally harmonized data. It does not include or reveal the actual data of any CanPath participant.

How was it developed?

The synthetic dataset was created using “synthpop,” an open-source R software package designed to generate synthetic versions of longitudinal survey data. The package randomly sampled the CanPath data, replacing and rearranging the participant information. As a result, the synthetic dataset preserves the statistical patterns of the CanPath data (i.e., the relationships between variables) but contains none of the real-world data.
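To make the idea of "preserving relationships between variables without keeping real records" more concrete, here is a minimal Python sketch. It is a toy illustration only: it is not the synthpop procedure and not CanPath's actual pipeline. The two variables, their category labels, and the counts are all invented for the example. The sketch learns how often each value of the first variable occurs and how the second variable is distributed given the first, then draws new records from those fitted distributions instead of copying rows.

```python
import random
from collections import Counter, defaultdict


def fit(records):
    """Count the first variable and the second variable conditional on the first."""
    marginal = Counter(a for a, _ in records)
    conditional = defaultdict(Counter)
    for a, b in records:
        conditional[a][b] += 1
    return marginal, conditional


def draw(counter, rng):
    """Draw one category with probability proportional to its observed count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def synthesize(records, n, seed=0):
    """Generate n synthetic (a, b) pairs from the fitted distributions."""
    rng = random.Random(seed)
    marginal, conditional = fit(records)
    synthetic = []
    for _ in range(n):
        a = draw(marginal, rng)        # first variable from its marginal
        b = draw(conditional[a], rng)  # second variable given the first
        synthetic.append((a, b))
    return synthetic


def high_bp_rate(records, group):
    """Share of records in `group` whose second variable is 'high_bp'."""
    in_group = [b for a, b in records if a == group]
    return sum(b == "high_bp" for b in in_group) / len(in_group)


# Invented toy data (hypothetical labels and counts, not CanPath values).
toy = (
    [("smoker", "high_bp")] * 30
    + [("smoker", "normal_bp")] * 20
    + [("non_smoker", "high_bp")] * 10
    + [("non_smoker", "normal_bp")] * 40
)

synthetic = synthesize(toy, 1000)
print("toy:      ", high_bp_rate(toy, "smoker"), high_bp_rate(toy, "non_smoker"))
print("synthetic:", high_bp_rate(synthetic, "smoker"), high_bp_rate(synthetic, "non_smoker"))
```

In the toy data, high blood pressure is more common among smokers than non-smokers, and the synthetic records should show roughly the same gap even though each one is a fresh random draw. A real synthesizer handles hundreds of variables and uses richer models, so this sketch should be read as intuition rather than as a description of how the CanPath dataset was built.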

What are the advantages of the CanPath Synthetic Dataset?

What’s available?

Canadian Data

The synthetic dataset is similar to a random sample of CanPath data, which includes participants from the BC Generations Project, Alberta’s Tomorrow Project, the Ontario Health Study, CARTaGENE, and Atlantic PATH.

It includes over 40,000 observations with 403 categorical variables from the CanPath Baseline and Additional Diseases Questionnaires.

Areas of Information

Variables include socio-demographic and economic information, lifestyle and behaviour (e.g. tobacco use, alcohol use, nutrition), perception of health, and select self-reported diseases such as high blood pressure, arthritis, and first cancer.

CANUE Environmental Exposure Variables

It also includes environmental variables originating from the Canadian Urban Environmental Health Research Consortium (CANUE) dataset, such as material deprivation index and annual average exposure to ambient air pollution.

Examples of Use

Canadian university and college instructors can use the CanPath Synthetic Dataset for free for their academic courses. CanPath will provide the Synthetic Dataset and a supporting data dictionary.

Access Process

Completed applications and supporting documents can be submitted by email to apply@canpath.ca. Applications will be reviewed within two weeks.

Eligibility Criteria

  • Applicant must be an instructor at a Canadian university or college;
  • The dataset is being requested for use in an academic course;
  • The course objectives are relevant to CanPath’s purpose, vision and mission;
  • The CanPath dataset aligns with course objectives and methods.

Required Documents

  1. Completed Application Form
  2. Copy of REB application*
    • REB decision letter or proof of exemption  
  3. Brief CV of Applicant (2 pages) 
  4. Course syllabus** 

*An REB application and a decision letter or proof of exemption are required only if another dataset is being used alongside the CanPath Synthetic Dataset in the course.

**The course syllabus must cite the use of the CanPath Synthetic Dataset.

After each iteration of the course, users are required to give CanPath feedback on their use of the dataset by completing the Synthetic Dataset Utilization Form.

Next Steps

Following application approval, users must review and sign the Synthetic Dataset User Acknowledgement.

For all other inquiries, please connect with the CanPath Access Office.