Meet the researcher: Sabine Halabi on validating cancer risk models with CanPath data
University of British Columbia (UBC) researcher Sabine Halabi did not expect to fall in love with data science or cancer epidemiology when she began her master’s program. In fact, she came from a background in psychopharmacology and had never written a line of code. But a passion for women’s health and the chance to work with population-level data led her to an ambitious project: validating five existing endometrial cancer risk models using data from the BC Generations Project, part of CanPath.
Her thesis became the first Canadian study to externally validate these models, revealing unexpected findings about model performance and the importance of strong methodology in cancer prevention research. We sat down with Sabine to talk about her research journey, what she learned from working with CanPath data, and how this work is shaping her next chapter as she begins a PhD focused on nutrition epidemiology and machine learning.
Q: What first drew you to studying endometrial cancer risk, and what motivated you to pursue this as your Master’s research project?
My research journey hasn’t been linear at all. When I started, I had no background in data science, administrative data, or cancer research. My earlier work was in psychopharmacology research, where I gained some quantitative experience and had the opportunity to contribute to seven publications. At that time, I was doing my BSc at the University of Toronto (U of T), studying human biology, immunology, and mathematics, so my background was interdisciplinary. But the real turning point for my research interests came during a global health course in my senior year at U of T, where I was inspired by my Teaching Assistant and their work. In that class, I wrote a report on the HIV epidemic among female sex workers in Uganda; this introduced me to the world of population and public health, and I was hooked!
I knew I wanted to focus on women’s health in my MSc while gaining more experience with population data. I wasn’t specifically looking to research endometrial cancer, but I was interested in Dr. Aline Talhouk’s work and expertise in statistics and machine learning. The project was outside of my comfort zone, and I had to learn to code from scratch—but Aline believed in me. It was a steep learning curve, but also one of the most rewarding experiences I’ve had.
Q: How did working with CanPath data during your Master’s shape your interest in population health research?
It shaped everything! That experience sparked my love for epidemiology and population health, and it was also my first real introduction to data science. I realized how powerful large-scale, data-driven research can be for understanding real-world health trends and inequities. It’s a constantly evolving field, and I love that challenge.
Population-level data gives you a unique window into people’s lives and experiences. Combining survey, administrative, and national data lets you uncover patterns that don’t just reflect biology, but also other factors you might not have expected to influence health outcomes. That’s what drew me in: the idea that data can tell stories about health at the population level and ultimately inform better policy and prevention strategies.
Working with CanPath not only deepened my interest in population health but also taught me what rigorous data collection looks like and where we still need to improve. It also shaped the way I think about research. I’ve learned that methods matter just as much as the topic. The research question might change, but strong methods and data science skills translate across everything.
I’m now pursuing my PhD in the School of Population and Public Health at UBC with Dr. Rachel Murphy, where I will continue working with CanPath through the HEAL and CHARM initiatives.
Q: For those less familiar, what is a risk prediction model, and why does it matter for endometrial cancer prevention?
Basically, it’s a statistical tool that uses known risk factors to estimate a person’s chance of developing a disease. For endometrial cancer, these models can help us stratify people into low, medium, or high-risk groups so we can better target prevention interventions. This is especially important because there are no routine screening guidelines for endometrial cancer.
These models can be used at a population level at a low cost. In some cases, they can be implemented as simple online calculators that are easily usable by the public. An example of this is the Gail model, a breast cancer risk assessment tool. You can receive a personalized risk estimate by entering a few pieces of information. Tools like this have the potential to reduce burden on the health-care system by helping clinicians prioritize who might benefit most from preventive care, while empowering individuals to understand and manage their own risk.
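For readers curious about the mechanics, here is a minimal sketch of how a risk model of this kind might turn a few risk factors into a probability and then a risk group. It assumes a simple logistic model; the coefficients, intercept, cut-offs, and function names are hypothetical values chosen for illustration and are not taken from any of the five models Sabine validated.

```python
import math

# Hypothetical coefficients for illustration only; not from any published model.
INTERCEPT = -9.0
COEFFS = {
    "age_per_decade": 0.40,            # risk rises with age
    "bmi_per_5_units": 0.50,           # risk rises with body mass index
    "oral_contraceptive_use": -0.30,   # assumed protective factor
}


def predicted_risk(age, bmi, oral_contraceptive_use):
    """Return a probability between 0 and 1 from a logistic model."""
    z = (
        INTERCEPT
        + COEFFS["age_per_decade"] * (age / 10)
        + COEFFS["bmi_per_5_units"] * (bmi / 5)
        + COEFFS["oral_contraceptive_use"] * (1 if oral_contraceptive_use else 0)
    )
    return 1 / (1 + math.exp(-z))


def risk_group(risk, low_cut=0.01, high_cut=0.03):
    """Stratify a predicted risk using hypothetical cut-offs."""
    if risk < low_cut:
        return "low"
    if risk < high_cut:
        return "medium"
    return "high"


if __name__ == "__main__":
    for person in [(40, 22, True), (60, 25, False), (60, 40, False)]:
        risk = predicted_risk(*person)
        print(person, round(risk, 4), risk_group(risk))
```

An online calculator would wrap the same two steps behind a form: compute the probability, then report the group it falls into.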
Q: You tested five models in a large B.C. population. What stood out to you most in your findings?
One big finding was that the simplest models performed the best. Every time a model is updated, more and more variables are added — things newly reported in the literature or tested by other groups. You’d think newer, more complex models would perform better, but in my work, the basic statistical models outperformed them in terms of discrimination and fit to the population.
This reinforced the idea of parsimony: sometimes fewer, more meaningful variables make a stronger model. I was also surprised that the machine learning model we validated didn’t outperform the statistical one. There’s a temptation to chase the “shiny new thing,” but we can’t ignore the strength of traditional statistical approaches.
Overall, all the models showed moderate performance. Given that they were all developed in the U.S. or Europe, the fact that they performed moderately in B.C. suggests they have potential, but we still need to retrain or adapt them to better reflect the characteristics and risk factors of the Canadian population. Doing so could make them more accurate and ultimately more useful for guiding prevention and screening strategies here at home.
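To make "discrimination" and "fit" concrete, the sketch below computes two measures that are commonly used in external validation: the concordance statistic (equivalent to the area under the ROC curve for a binary outcome) and the observed-to-expected ratio as a simple calibration check. This is an illustration under assumptions; the interview does not say which metrics the thesis used, and the function names and toy data are hypothetical.

```python
def c_statistic(risks, outcomes):
    """Share of case/non-case pairs in which the case has the higher
    predicted risk (ties count as half). 0.5 is chance, 1.0 is perfect."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(risks, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


def observed_expected_ratio(risks, outcomes):
    """Observed cases divided by the sum of predicted risks.
    Values near 1 suggest the model is well calibrated overall."""
    return sum(outcomes) / sum(risks)


if __name__ == "__main__":
    # Toy data: predicted risks and whether cancer was later diagnosed.
    risks = [0.1, 0.4, 0.2, 0.3]
    outcomes = [0, 1, 1, 0]
    print("c-statistic:", c_statistic(risks, outcomes))
    print("O/E ratio:", observed_expected_ratio(risks, outcomes))
```

A model can rank people well (high c-statistic) and still over- or under-estimate absolute risk in a new population, which is one reason retraining or recalibrating on local data can help.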
Q: Were there any findings that surprised you or challenged assumptions you went in with?
Yes. One of the first things that stood out was that the number of actual uterine cancer cases was lower than expected based on general prevalence data. It wasn’t much lower, but it did make me think about what may happen over 30 years of follow-up or how people move through the healthcare system, which may affect how we capture population data.
Another interesting point was that smoking, which some literature suggests might be protective in endometrial cancer, did not show a statistically significant difference between those who did and didn’t have uterine cancer in our data. This could reflect the size of the subgroups or the strength of the association itself. Regardless, it is a reminder that patterns we assume to be true don’t always hold up in different populations.
This also really highlighted the importance of collaborating with clinicians. When findings challenge long-held beliefs, involving physicians helps ensure the results are interpreted responsibly. For me, it reinforced the value of staying curious and open to what the data says, even when it goes against expectations.
Q: If you were to improve these models, what additional factors would you include and why?
Actually, to explore whether socioeconomic status is a relevant risk factor for uterine cancer, I computed a socioeconomic status index score using income, occupation, and other related variables in the CanPath dataset. I didn’t get the chance to integrate the index as a model predictor, but I’d be interested to see whether it would improve model performance and generalizability in the population.
I’m also curious about the potential role of polygenic risk scores, which combine information from multiple genetic variants to estimate inherited risk. I conducted a systematic review — recently accepted by BMC Cancer — that synthesized all uterine cancer models, including those using polygenic risk scores. Overall, they didn’t perform much better than traditional models, but the field is still evolving.
So, is more always better? I don’t think so. To truly improve the models, we need to validate and retrain them on more diverse data. If a model is developed on mostly white populations in the U.S. or Europe and then applied elsewhere, it won’t perform well in systematically excluded communities. Canada’s population includes people from many backgrounds, and our models need to reflect ALL of those populations, not just one segment. Only then can risk prediction be accurate and equitable.
Q: How could your findings help researchers, clinicians, or even patients in the future?
This is the first project to validate endometrial cancer risk models in Canada, which gives us a crucial baseline framework. It shows researchers and clinicians that you can’t just take a model from one setting and apply it elsewhere. You need to test it internally and externally in more diverse populations, and eventually prospectively in clinical settings to ensure it works for the population it’s meant to serve.
It also gives the public something valuable: even if people don’t use the model itself, they learn about known risk factors for uterine cancer and how those factors are used to estimate risk. If a model ever gets implemented clinically, it could reduce burden on the system by helping guide prevention strategies, especially since we don’t have screening guidelines for uterine cancer right now.
Ultimately, this work helps move us toward more precise, personalized, and equitable prevention where risk assessment reflects the realities of all populations, not just a subset.
Q: What does your research mean for the CanPath participants who share their data and make this work possible?
Participants are the focal point of this work. As quantitative researchers, we must remember that they are not just rows in a dataset. They are people who consented to share information about their lives. We have a responsibility to treat that with intention and to share results back, so participants feel part of the research process.
Because this is the first validation study of these models, participants can quite literally say, “I am the framework!” If they know this work is happening and see the impact, it builds trust, reassures them their contribution matters, and encourages future participation.
I also try to contribute to knowledge translation outside of academic papers. I help produce the GOSH Podcast with the Gynecologic Cancer Initiative. It’s a patient-partnered podcast where people can hear gynecologic cancer research explained in plain language and listen to patient voices and researcher conversations. It’s one of the ways I try to give research back to the people it affects.
Q: You’re now starting your PhD at UBC. What will you be working on next?
My PhD is shifting more toward nutrition and machine learning, working with Dr. Rachel Murphy. The project is still developing, but I’m working on improving the classification of ultra-processed foods using machine learning methods to address dietary inequities in a Canadian context. HEAL and CHARM data will inform this work.
I now see myself as a methodologist-in-training. I want to train deeply in machine learning, administrative data, and epidemiologic methods. It’s a lot of work, but if we can improve how we classify ultra-processed foods, it could eventually inform food policy, dietary guidelines, and diet-related chronic disease prevention strategies.
Q: What excites you most about contributing to population health research at this stage?
Population health is the soil. It’s the foundation you need before building any prevention program or public health intervention. It challenges what people think they “know.” Analyzing large-scale data allows us to test whether long-held assumptions hold up in real populations and uncover patterns that might otherwise go unnoticed. That kind of evidence is hard to dismiss.
What excites me most is the opportunity to bring a magnifying glass to systematically excluded populations: the groups that are most often missed or underrepresented in our research, yet experience the greatest health inequities. Data-driven population health research allows us to see those gaps, quantify them, and ultimately design better, fairer interventions.
Even if my work plays a small role, I hope it inspires others to question assumptions and make research more rigorous, inclusive, and equitable.
Q: Do you have any advice for early-career researchers using CanPath data?
Always remember the “other” — the participants behind the data. Treat the work with intention. And if you’re thinking of using CanPath or any large dataset, it can feel daunting, but take the challenge. Immerse yourself as a learner. You are a trainee for a reason. Don’t be afraid to try, ask questions, and build the skills as you go.
Sabine’s Master’s thesis began as a leap into the unknown: a women’s health topic outside her comfort zone, a dataset she had never worked with, and a crash-course in methods she had to teach herself along the way. That leap led to the first Canadian validation of these endometrial cancer risk models and a new conviction that strong, transparent methods are the backbone of population health research.
As she moves into her PhD and continues working with CanPath data on nutrition and machine learning, her approach remains the same: treat participants not as “data points” but as partners, question assumptions even when they’re long-accepted, and build research that others can trust and build from. For Sabine, this is only the beginning, but it already shows how a single training project can shape not only a career, but the way future cancer prevention research is done. We look forward to sharing more updates on Sabine’s research as she progresses in her PhD!
For more information, please contact:
Megan Fleming
Communications & Knowledge Translation Officer
Canadian Partnership for Tomorrow’s Health (CanPath)
info@canpath.ca