Utilizing Machine Learning to Discover New Predictors of Breast Cancer in Post-Menopausal Women

Breast cancer is a prevalent form of cancer affecting women worldwide, with numerous factors contributing to its development, including genetic inheritance, reproductive factors, and lifestyle choices.

In the past, research has highlighted the differing causes of breast cancer between pre- and post-menopausal women. Recently, scientists have employed a combination of approaches to accurately predict breast cancer in women.

Background

Machine learning (ML) techniques are capable of analyzing vast datasets of predictors and deciphering complex non-linear relationships. While previous studies have utilized ML for breast cancer risk prediction, their focus was not on identifying the specific predictors.

The UK Biobank (UKB), a comprehensive and detailed cohort, provides an opportunity to apply hypothesis-free approaches to discover novel predictors of breast cancer. The development of polygenic risk scores (PRS) has made it possible to summarize the combined effect of hundreds or thousands of genetic variants associated with a specific disease or trait, using effect estimates from genome-wide association studies (GWAS).
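To make the idea concrete, a PRS is commonly computed as a weighted sum: each variant's risk-allele count is multiplied by its GWAS effect size and the products are added up. The sketch below illustrates only that arithmetic. The variant names, weights, and genotype are hypothetical, and this is not the construction of PRS313 or PRS120k.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    # Only variants present in both the genotype and the weight table contribute.
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical per-variant effect sizes (log odds ratios), as a GWAS might report them.
weights = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
# Hypothetical genotype: number of risk alleles carried at each variant.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_risk_score(person, weights)
print(round(score, 2))
```

In practice the raw score is usually standardized against a reference population before it is used as a predictor.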

PRS can be used to identify individuals at high risk for diseases, facilitating targeted interventions such as early statin prescription for coronary artery disease. Notably, PRS has enhanced the accuracy of existing risk predictors for coronary artery disease, such as the Framingham risk score.

Breast cancer PRS have previously been incorporated into risk prediction models such as the Tyrer-Cuzick model and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA). Interactions between PRS and phenotypic features, a form of gene-environment interaction, have also been examined for breast cancer, but the reported findings conflict.

Study Overview

A recent study published in Scientific Reports utilized machine learning techniques for feature selection, followed by Cox models for risk prediction. The primary objective of the study was to demonstrate the effective application of ML techniques for feature selection to aid classical statistical methods.
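For readers unfamiliar with the second stage, a Cox model estimates coefficients by maximizing a partial log-likelihood built from the set of participants still at risk at each event time. The following is a minimal sketch of that quantity, assuming no tied event times. The function name and toy data are illustrative and not taken from the study, and a real analysis would rely on an established survival library.

```python
from math import exp, log


def cox_partial_log_likelihood(beta, times, events, covariates):
    """Cox partial log-likelihood, assuming no tied event times."""
    # Linear predictor (beta . x) for each subject.
    lin = [sum(b * v for b, v in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects only contribute through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(exp(lin[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += lin[i] - log(risk)
    return total


# Toy data: follow-up time, event indicator (1 = diagnosed, 0 = censored),
# and a single covariate per subject.
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
covariates = [[1.0], [0.0], [1.0]]

ll_null = cox_partial_log_likelihood([0.0], times, events, covariates)
print(ll_null)
```

The exponential of a fitted coefficient is the hazard ratio for a one-unit increase in that feature.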

SHapley Additive exPlanation (SHAP) feature dependence plots were employed to explore potential interactions between phenotypic features and PRS. The study utilized data from the UKB, which encompasses over half a million participants from England, Wales, and Scotland. Baseline data was collected through verbal interviews conducted by trained nurses, questionnaires, biological samples, and physical examinations.
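SHAP assigns each feature a share of a model's prediction based on Shapley values, and a dependence plot charts one feature's share against its value, typically colored by a second feature to reveal interactions. The sketch below computes exact Shapley values by brute force for a toy risk function. This is an illustration of the underlying idea, not the tree-based algorithm the SHAP library applies to XGBoost models, and the model, feature names, and values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for `model` at point `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical toy risk score with a PRS-by-phenotype interaction term.
def toy_model(p):
    return 2.0 * p["prs"] + 1.0 * p["bmr"] + 0.5 * p["prs"] * p["bmr"]


x = {"prs": 1.0, "bmr": 2.0}
baseline = {"prs": 0.0, "bmr": 0.0}
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions always sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split evenly between the two features involved.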

The analysis was restricted to post-menopausal women aged 40 to 69 at baseline because of the etiological heterogeneity associated with menopausal status noted above. Incident breast cancer was identified using International Classification of Diseases codes, and two PRS, PRS313 and PRS120k, were considered as potential genetic features.

Key Findings

The study included 104,313 participants, of whom 4,010 developed breast cancer during the 11.9-year follow-up period. By combining ML with traditional statistical approaches in cancer epidemiology, several known and previously unidentified risk factors for post-menopausal breast cancer incidence were identified. Known risk factors included age at menopause, testosterone levels, and age itself. Additionally, five novel predictors were discovered, including blood biochemistry, blood counts, and urine biomarkers.

The newly identified predictors showed a strong association with post-menopausal breast cancer incidence, and further research is needed to determine if they are potentially modifiable risk factors.

Interestingly, the XGBoost model selected a comprehensive body composition measure over body mass index (BMI), suggesting that precise body composition is a crucial predictor of breast cancer. Basal metabolic rate also emerged as a significant predictor, contradicting a previous study that found no association between basal metabolic rate and breast cancer.

Furthermore, plasma urea, a blood biomarker associated with kidney function, demonstrated an association with breast cancer. This study also reported, for the first time, an association between breast cancer and plasma phosphate, sodium, or creatinine levels in urine.

The two polygenic risk scores ranked as the strongest risk factors according to agnostic ML models. Cox regressions confirmed that PRS is a significant predictor for post-menopausal breast cancer.


This study revealed five statistically significant novel associations with post-menopausal breast cancer, spanning urine biomarkers, blood counts, and blood biochemistry. Adding these five novel features to the baseline Cox model maintained its discrimination performance. Moreover, SHAP values indicated that the two pre-specified PRS were the most important features.
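Discrimination in survival models is often summarized with a concordance index: the share of comparable participant pairs in which the person diagnosed earlier was given the higher predicted risk. The article does not state which metric the study used, so the sketch below is only one common choice (Harrell's C) applied to made-up data.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up follow-up times, event indicators (1 = diagnosed, 0 = censored),
# and predicted risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.3, 0.5, 0.1]

c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a model that "maintains" discrimination after adding features shows a broadly unchanged index.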

These findings highlight the need for further research on the utilization of more precise anthropometry measures to enhance breast cancer prediction. External validation of the results is the next crucial step before implementation in clinical practice.

Source: Nature

