Low Back Pain Clinical Trial
Official title:
Development and External Validation of an International, Multicenter Machine Learning Algorithm for Prediction of Outcome After Lumbar Spinal Fusion for Degenerative Disease: The FUSE-ML Study
The aim of the FUSE-ML study is to develop and externally validate a robust ML-based prediction tool, built on multicenter data collected by a range of international centers, that will provide individualized risk-benefit profiles tailored to each patient undergoing lumbar spinal fusion for degenerative disease.
Introduction
Low back pain is one of the top three causes of disability in Western societies and imposes significant direct and indirect socio-economic costs. The etiology of low back pain with or without radiating leg pain is multifactorial, but it is often related to degenerative disc disease (DDD) or to spondylolisthesis. The standard treatment for symptomatic spondylolisthesis or progressive DDD in patients who are unresponsive to long-term conservative treatment is interbody fusion, but this is controversial. With some reports showing no benefit compared to conservative treatment, patient selection is vitally important. Various prognostic tests attempt to identify subsets of patients that might benefit most from surgery, but the validity of these tests is unclear. Ultimately, success in this category of patients should be defined by improved physical symptoms (patient-reported outcome measures [PROMs]) rather than technical success of the procedure.

A relevant proportion of patients with intractable lumbar degenerative disease that is resistant to conservative therapy do ultimately benefit from lumbar fusion surgery. The difficult question is how to identify them reliably and avoid unnecessary, unsuccessful surgery. In the literature, several subsets of patients with lumbar degenerative disease who may benefit more than others from lumbar spinal fusion have been identified. Accurate preoperative identification of patients at high risk for unsatisfactory outcome, and vice versa, would be clinically advantageous, as it would allow enhanced resource preparation, better surgical decision-making, enhanced patient education and informed consent, and potentially even modification of certain risk factors for unsatisfactory outcome. However, it is often impossible for clinicians to balance the many individually described risk factors for each adverse event to arrive at a personalized risk-benefit profile for an individual patient.
Machine learning (ML) methods have proven highly effective at integrating many clinical patient variables into one holistic risk prediction tailored to each patient. One multicenter model based on classic statistics has already been described by Khor et al. However, upon external validation, it proved to be unreliable and rather poorly calibrated. Also, this model was based on a relatively small number of patients for ML. The aim of the FUSE-ML study is to develop and externally validate a robust ML-based prediction tool based on multicenter data from a range of international centers that will provide individualized risk-benefit profiles tailored to each patient undergoing lumbar spinal fusion for degenerative disease.

Methods Overview
Data will be collected by a range of international centers. Overall, the models will be built and the publication will be compiled according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines. One model will be created for each of the relevant outcomes detailed below. University of Zurich (V.E. Staartjes, C. Serra) is the sponsor of this study.

Ethical Considerations
Each center will be responsible for its own ethics board / institutional review board (IRB) approval and for establishing a data transfer agreement (DTA). The sponsor (University of Zurich) will present a standard DTA upon request. Each center must obtain approval for retrospective or prospective data collection and for sharing the completely deidentified data with the sponsor. The sponsor can aid by providing this detailed study protocol. All study procedures will be carried out according to the Declaration of Helsinki and its amendments.
Inclusion and Exclusion Criteria
Patients with the following indications for thoracolumbar pedicle screw placement are considered for inclusion: degenerative pathologies (one or more of the following: spinal stenosis, spondylolisthesis, degenerative disc disease, recurrent disc herniation, failed back surgery syndrome (FBSS), radiculopathy, pseudarthrosis). Patients whose primary indication for surgery is infection, vertebral tumor, traumatic or osteoporotic fracture, or deformity surgery for scoliosis or kyphosis are not eligible. Patients with moderate or severe scoliosis (coronal Cobb angle >30 degrees / Schwab classification sagittal modifier + or ++) are not eligible. Patients undergoing surgery at more than 6 vertebral levels are also not eligible. Patients with missing endpoint data at 12 months will be excluded. Patients are required to give informed consent. Only patients aged 18 or older are considered for inclusion.

Data Collection
Each center will collect its data either retrospectively, from a prospective registry, or from a prospective registry supplemented by retrospectively collected variables. Each center has to contribute a minimum of 100 patients with complete 12-month follow-up data to be included in the study. A standardized Excel database will be provided by the sponsor for anonymous data entry. The data will be entered in standardized and deidentified form. This Excel database will only contain a study-specific patient number. Each center will keep an internal spreadsheet in which the study-specific patient numbers can be traced back to center-specific patient numbers, should this be necessary. The deadline for submission of the complete data to the sponsor institution is 13 August 2021.

Authorship
Centers will have to contribute at least 100 cases with complete outcome data in total to be included in the study.
Each participating center will be able to designate a maximum of four authors to be included in the author list. Any other center-specific contributors will be listed as full members of the FUSE-ML study group and will be granted full PubMed / Medline contributor status. The sponsor institution will have six primary author positions available.

Primary Endpoint Definitions
Several endpoints will be assessed:
1. Oswestry Disability Index (ODI) at 12 months.
2. Visual Analogue Scale (VAS-BP, 0 to 100) for back pain at 12 months. This can also be a converted numeric rating scale (NRS) from 0 to 10, or a VAS from 0 to 10 converted to 0 to 100.
3. Visual Analogue Scale (VAS-LP, 0 to 100) for leg pain at 12 months. This can also be a converted numeric rating scale (NRS) from 0 to 10, or a VAS from 0 to 10 converted to 0 to 100.

These outcomes will be dichotomized using the minimum clinically important difference (MCID) according to Ostelo et al. Thus, a 30% or greater improvement in a specific score compared to baseline will be considered as achievement of MCID (clinical success) in that specific score. If patients presented with zero symptoms initially (in either ODI, NRS-BP or NRS-LP), and remained at zero for that score, this will also be defined as MCID for that score.

Features and Their Definitions
All features are measured or estimated preoperatively.
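The MCID dichotomization described under Primary Endpoint Definitions can be sketched as follows. This is a minimal illustration in Python, not the study's code (the protocol specifies R for the analyses); the function names are hypothetical, and treating a patient who starts at zero and worsens as not achieving MCID is an assumption, since the protocol only defines the case where the score remains at zero.

```python
def nrs_to_vas100(score_0_to_10: float) -> float:
    """Convert a 0-10 NRS (or 0-10 VAS) value to the 0-100 scale."""
    return score_0_to_10 * 10.0


def achieved_mcid(baseline: float, follow_up: float,
                  threshold: float = 0.30) -> bool:
    """Return True if the 12-month score meets the MCID rule.

    Lower scores mean fewer symptoms, so improvement is a decrease.
    """
    if baseline == 0:
        # Zero symptoms at baseline: success if the score remained at zero.
        # Assumption: any worsening from zero counts as no MCID.
        return follow_up == 0
    relative_improvement = (baseline - follow_up) / baseline
    return relative_improvement >= threshold


# Hypothetical patients: (baseline, 12-month) back pain on the 0-100 scale.
print(achieved_mcid(50, 35))                               # exactly 30% better
print(achieved_mcid(nrs_to_vas100(8), nrs_to_vas100(7)))   # 12.5% better
```

The same function applies unchanged to ODI, VAS-BP, and VAS-LP, giving one binary label per endpoint.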
In addition to the endpoints, the following input features will be collected:
- Age (years)
- Gender (m/f)
- Presence of the following indications for surgery (choose all that apply):
  - Spondylolisthesis
  - (Recurrent) disc herniation
  - Radiculopathy
  - Chronic low back pain (CLBP) / Degenerative disc disease (DDD)
  - Failed back surgery syndrome (FBSS)
  - Lumbar spinal stenosis
  - Pseudarthrosis
- Index Level(s) (choose all that apply, T12 - S1)
- Height (cm)
- Weight (kg)
- BMI (kg/m²)
- Smoking status (active / ceased / never)
- Preoperative (baseline) ODI
- Preoperative (baseline) VAS-BP
- Preoperative (baseline) VAS-LP
- American Society of Anesthesiologists (ASA) Score (1-2 / 3 or higher)
- Preoperative use of opioid pain medication (yes / no)
- Asthma pulmonale as a comorbidity (yes / no)
- Prior thoracolumbar spine surgery (yes / no)
- Race/Ethnicity (Caucasian / Black / Asian / Other)
- Surgical approach (choose all that apply: TLIF / PLIF / ALIF / Lateral)
- Pedicle screw insertion (yes / no)
- Minimally invasive technique (yes / no)

Sample Size
Even the largest cohort with millions of patients is not guaranteed to result in a robust clinical prediction model if no relevant input variables are included ("garbage in, garbage out": do not expect to predict the future from age, gender, and body mass index). Nevertheless, predictive performance generally improves with sample size, especially for some data-hungry ML algorithms. To ensure generalizability of the clinical prediction model, the sample should be sufficiently representative of the patient population, and its size should take the complexity of the algorithm into account. For instance, a deep neural network, as an example of a highly complex model, will often require thousands of patients to converge, while a logistic regression model may achieve stable results with only a few hundred patients. In addition, the number of input variables plays a role.
Roughly, a bare minimum of 10 positive cases is required per included input variable to model the relationships. When sample sizes are smaller than this rule of thumb suggests, erratic model behavior and high variance in performance among splits are often observed. Of central importance is also the proportion of patients who experience the outcome. For very rare events, a much larger total sample size is consequently needed. For instance, a prediction based on 10 input features for an outcome occurring in only 10% of cases would require at least 1000 patients including at least 100 who experienced the outcome, according to the above rule of thumb. In general and from personal experience, the authors do not recommend developing ML models on cohorts with fewer than 100 positive cases and reasonably more cases in total, regardless of the rarity of the outcome. Also, one might consider the available literature on risk factors for the outcome of interest: if epidemiological studies find only weak associations with the outcome, it is likely that one will require more patients to arrive at a model with good predictive performance, as opposed to an outcome which has several highly associated risk factors, which may be easier to predict. Larger sample sizes also allow for more generous evaluation through a larger amount of patient data dedicated to training or validation, and usually result in better calibration measures.

Between 20% and 40% of patients report no clinically relevant improvement after spinal fusion (minority class). For sample size calculation, the authors take 20% as a conservative estimate. Consequently, for this study, based on the authors' expertise and on the rules of thumb mentioned above, the authors estimate that a minimum of 200 patients with a negative outcome (minority class) are required to extract generalizable feature relationships.
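The rule-of-thumb arithmetic above can be written out as a short calculation. This is only an illustrative Python sketch with hypothetical function names; it uses integer percentages to avoid floating-point rounding and reproduces the two figures given in the text (10 features at a 10% event rate, and 200 minority-class patients at a 20% rate).

```python
def min_events_for_features(n_features: int, events_per_feature: int = 10) -> int:
    """Minimum number of positive cases under the 10-per-variable rule of thumb."""
    return n_features * events_per_feature


def total_patients_needed(min_events: int, event_rate_percent: int) -> int:
    """Smallest cohort expected to contain `min_events` cases (ceiling division)."""
    return -(-(min_events * 100) // event_rate_percent)


# Worked example from the text: 10 features, outcome in 10% of cases.
events = min_events_for_features(10)
print(events, total_patients_needed(events, 10))

# This study: 200 minority-class patients at an assumed 20% rate.
print(total_patients_needed(200, 20))
```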
With an estimated incidence of approximately 20% as explained above, that means that a minimum of around 1000 patients are required for training. For adequate evaluation of calibration at external validation, the authors estimate that another 300 patients will be required (thus, approximately 60 patients with a negative outcome, the minority class). Thus, in total, the authors estimate that a minimum of 1300 patients are necessary to arrive at a robust model. More data will likely lead to greater performance and better calibration.

Predictive Modeling
A k-nearest neighbor (KNN) imputer will be co-trained to impute any missing data that may occur in future application of the model. If there is missing data in the training set, it will be imputed using said KNN imputer. Features or patients with a missingness greater than 25% will be excluded. Data will be standardized and one-hot-encoded. In case of major class imbalance, which is expected for the abovementioned endpoint, random upsampling or synthetic minority oversampling (SMOTE) will be applied to the training set. All features will initially be provided to the model for training. If necessary, the authors will apply recursive feature elimination (RFE) to select input features on the training data. The authors will trial the following algorithms for binary classification: generalized linear model (GLM), generalized additive model (GAM), stochastic gradient boosting machine (GBM), naïve Bayes classifier, simple artificial neural network, support vector machine (SVM), and random forest. Each model will be fully trained and hyperparameter-tuned where applicable. The final model will be selected based upon AUC, sensitivity, and specificity, as well as calibration metrics on the resampled training performance. Training will occur in repeated 5-fold cross-validation with 10 repeats. The one final model will then be assessed on the external validation data only once.
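To make the resampling scheme concrete, repeated 5-fold cross-validation with 10 repeats yields 50 train/test splits of the training cohort. The sketch below is a minimal, dependency-free Python illustration rather than the study's R implementation; the function name, the fixed seed, and the choice of plain (unstratified) folds are assumptions, as the protocol does not specify them.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_samples: int, n_splits: int = 5, n_repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


splits = list(repeated_kfold(n_samples=1000))
print(len(splits))  # number of resamples: 5 folds x 10 repeats
```

In such a scheme, imputation, standardization, and any oversampling would typically be fitted on the training indices of each split only, so that the held-out fold does not leak into model fitting.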
95% confidence intervals for external validation metrics will be derived using the bootstrap. The threshold for binary classification will either be identified on the training data alone using the AUC-based "closest-to-(0,1)-criterion" or Youden's index to optimize both sensitivity and specificity, or will be optimized on the training set based on clinical significance (rule-out model). All analyses will be carried out in R Version 4.0.2 or more recent.

Evaluation
The performance of classification models can roughly be judged along two dimensions: model discrimination and calibration. The term discrimination denotes the ability of a prediction model to correctly classify whether a certain patient is going to or is not going to experience a certain outcome. Thus, discrimination describes the accuracy of a binary prediction: yes or no. Calibration, however, describes the degree to which a model's predicted probabilities (ranging from 0% to 100%) correspond to the actually observed incidence of the binary endpoint (true posterior). Many publications do not report calibration metrics, although these are of central importance, as a well-calibrated predicted probability (e.g. your predicted probability of experiencing a complication is 18%) is often much more valuable to clinicians, and to patients, than a binary prediction (e.g. you are likely not going to experience a complication). Resampled training performance as well as performance on the external validation set will be assessed for discrimination and calibration. In terms of discrimination, the authors will evaluate AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 Score.
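The threshold selection by Youden's index and the discrimination metrics listed above can be illustrated with a small, self-contained Python sketch. It is not the study's R code: the function names and the toy labels and probabilities are hypothetical, and in the study the threshold would be chosen on the training data only.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_prob: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn); a case is called positive if prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(num: int, den: int) -> float:
    """Divide safely; NaN when the denominator is zero."""
    return num / den if den else float("nan")


def discrimination_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                           threshold: float) -> Dict[str, float]:
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def youden_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Return the observed probability maximizing sensitivity + specificity - 1."""
    best_threshold, best_j = 0.0, float("-inf")
    for t in sorted(set(y_prob)):
        m = discrimination_metrics(y_true, y_prob, t)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:  # on ties, the lowest threshold is kept
            best_threshold, best_j = t, j
    return best_threshold


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = outcome of interest, with predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
best = youden_threshold(y_true, y_prob)
metrics = discrimination_metrics(y_true, y_prob, best)
print(best, metrics, auc(y_true, y_prob))
```

The "closest-to-(0,1)" criterion mentioned in the text would differ only in the quantity optimized: it minimizes (1 - sensitivity)² + (1 - specificity)² instead of maximizing Youden's J.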
In terms of calibration, the authors will assess the Brier score, expected-observed (E/O) ratio, calibration slope and intercept, the Hosmer-Lemeshow goodness-of-fit test, as well as visual inspection of calibration plots for both datasets, which will also be included in the publication.

Interpretability
The degree and choice of methods for interpretability will depend on the finally chosen algorithm. Some algorithms can natively provide explanations as to which factors influence the outcome in what way. Thus, in case e.g. a GLM, GAM, or naïve Bayes classifier is chosen, the parameters / partial dependence values will be provided. For simple decision trees, diagrams of the decision-making process can be provided. Other models with higher degrees of complexity, such as neural networks or stochastic gradient boosting machines, cannot natively provide such explanations. In that case, the authors will provide both AUC-based variable importance as well as model-agnostic local interpretations of variable importance using the LIME principle.

Expected Results
The authors expect to arrive at a generalizable model based on multicenter international data that is likely to predict consistently with an AUC of at least 0.70 and that is well-calibrated. A web-based prediction tool will also be created for each of the models using the shiny environment, much akin to e.g. https://neurosurgery.shinyapps.io/impairment (also see for example: Staartjes et al., Journal of Neurosurgery, 2020). This web-based app will be available for free on any internet-capable device (mobile or desktop), and should be stable on most devices due to the server-based computing. The costs for maintaining the server will be carried by the sponsor. The collected data will be stored by the sponsor for 10 years. The large dataset will be open to further analysis and will be provided to any of the contributing centers at reasonable request and after approval by all other centers.
The goal is to enable other analyses using the collected dataset. If any additional analyses lead to publication, all contributors will be included as co-authors and all co-authors will have the opportunity to review said manuscript beforehand. Any contributing study center has the right to veto publication of any subsequent analyses containing their own data.
| Status | Clinical Trial | Phase |
|---|---|---|
| Completed | NCT03916705 - Thoraco-Lumbar Fascia Mobility | N/A |
| Completed | NCT04007302 - Modification of the Activity of the Prefrontal Cortex by Virtual Distraction in the Lumbago | N/A |
| Completed | NCT03273114 - Cognitive Functional Therapy (CFT) Compared With Core Training Exercise and Manual Therapy (CORE-MT) in Patients With Chronic Low Back Pain | N/A |
| Recruiting | NCT03600207 - The Effect of Diaphragm Muscle Training on Chronic Low Back Pain | N/A |
| Completed | NCT04284982 - Periodized Resistance Training for Persistent Non-specific Low Back Pain | N/A |
| Recruiting | NCT05600543 - Evaluation of the Effect of Lumbar Belt on Spinal Mobility in Subjects With and Without Low Back Pain | N/A |
| Withdrawn | NCT05410366 - Safe Harbors in Emergency Medicine, Specific Aim 3 | |
| Completed | NCT03673436 - Effect of Lumbar Spinal Fusion Predicted by Physiotherapists | |
| Completed | NCT02546466 - Effects of Functional Taping on Static Postural Control in Patients With Non-specific Chronic Low Back Pain | N/A |
| Completed | NCT00983385 - Evaluation of Effectiveness and Tolerability of Tapentadol Hydrochloride in Subjects With Severe Chronic Low Back Pain Taking Either WHO Step I or Step II Analgesics or no Regular Analgesics | Phase 3 |
| Recruiting | NCT05156242 - Corticospinal and Motor Behavior Responses After Physical Therapy Intervention in Patients With Chronic Low Back Pain. | N/A |
| Recruiting | NCT04673773 - MY RELIEF- Evidence Based Information to Support People Aged 55+ Years Living and Working With Persistent Low-back Pain. | N/A |
| Completed | NCT06049251 - ELDOA Technique Versus Lumbar SNAGS With Motor Control Exercises | N/A |
| Completed | NCT06049277 - Mulligan Technique Versus McKenzie Extension Exercise Chronic Unilateral Radicular Low Back Pain | N/A |
| Completed | NCT04980469 - A Study to Explore the Effect of Vitex Negundo and Zingiber Officinale on Non-specific Chronic Low Back Pain Due to Sedentary Lifestyle | N/A |
| Completed | NCT04055545 - High Intensity Interval Training VS Moderate Intensity Continuous Training in Chronic Low Back Pain Subjects | N/A |
| Recruiting | NCT05552248 - Assessment of the Safety and Performance of a Lumbar Belt | |
| Recruiting | NCT05944354 - Wearable Spine Health System for Military Readiness | |
| Completed | NCT05801588 - Participating in T'ai Chi to Reduce Back Pain and Improve Quality of Life | N/A |
| Completed | NCT05811143 - Examining the Effects of Dorsal Column Stimulation on Pain From Lumbar Spinal Stenosis Related to Epidural Lipomatosis. | |