Clinical Trial Details
— Status: Completed
Administrative data
NCT number | NCT05161130
Other study ID # | FUSE-ML
Secondary ID |
Status | Completed
Phase |
First received |
Last updated |
Start date | January 1, 2021
Est. completion date | November 1, 2021
Study information
Verified date | December 2021
Source | Bergman Clinics
Contact | n/a
Is FDA regulated | No
Health authority |
Study type | Observational
Clinical Trial Summary
The aim of the FUSE-ML study is to develop and externally validate a robust ML-based
prediction tool that will provide individualized risk-benefit profiles tailored to each
patient undergoing lumbar spinal fusion for degenerative disease. Data will be collected
by a range of international centers.
Description:
Introduction
Low back pain is one of the top three causes of disability in Western societies
and imposes significant direct and indirect socio-economic costs. The etiology of low back
pain, with or without radiating leg pain, is multifactorial, but it is often related to
degenerative disc disease (DDD) or to spondylolisthesis. The standard treatment for
symptomatic spondylolisthesis or progressive DDD in patients who are unresponsive to
long-term conservative treatment is interbody fusion, but this remains controversial. With some
reports showing no benefit compared to conservative treatment, patient selection is vitally
important. Various prognostic tests attempt to identify subsets of patients that might
benefit most from surgery, but the validity of these tests is unclear. Ultimately, success in
this category of patients should be defined by improved physical symptoms (patient-reported
outcome measures [PROMs]) rather than technical success of the procedure. A relevant
proportion of patients with intractable, conservative therapy-resistant lumbar degenerative
disease do ultimately benefit from lumbar fusion surgery - the difficult question is how to
identify them reliably and avoid unnecessary, unsuccessful surgery.
In the literature, several subsets of patients with lumbar degenerative disease who may
benefit more than others from lumbar spinal fusion have been identified. Accurate preoperative
identification of patients at high risk for an unsatisfactory outcome (and, conversely, of
those likely to benefit) would be clinically advantageous, as it would allow enhanced resource
preparation, better surgical decision-making, enhanced patient education and informed consent,
and potentially even modification of certain risk factors for an unsatisfactory outcome.
However, it is often impossible for clinicians to balance the many described single risk
factors against each other to arrive at a personalized risk-benefit profile for an individual
patient.
Machine learning (ML) methods have been extraordinarily effective at integrating many
clinical patient variables into one holistic risk prediction tailored to each patient. One
multicenter model based on classical statistics has already been described by Khor et al.
However, upon external validation, it proved unreliable and rather poorly calibrated.
Moreover, this model was based on a relatively small number of patients for ML. The aim of the
FUSE-ML study is to develop and externally validate a robust ML-based prediction tool based
on multicenter data from a range of international centers that will provide individualized
risk-benefit profiles tailored to each patient undergoing lumbar spinal fusion for
degenerative disease.
Methods Overview
Data will be collected by a range of international centers. Overall, the
models will be built and the publication compiled according to the Transparent Reporting of
a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD)
guidelines. One model will be created for each of the relevant outcomes detailed below.
The University of Zurich (V.E. Staartjes, C. Serra) is the sponsor of this study.
Ethical Considerations
Each center will be responsible for its own ethics board / institutional review board (IRB)
approval and for establishing a data transfer agreement (DTA). The sponsor (University of
Zurich) will provide a standard DTA upon request. Each center must gain approval for
retrospective or prospective data collection and for sharing of the completely deidentified
data with the sponsor; the sponsor can aid by providing this detailed study protocol. All
study procedures will be carried out according to the Declaration of Helsinki and its
amendments.
Inclusion and Exclusion Criteria
Patients with the following indications for thoracolumbar pedicle screw placement are
considered for inclusion: degenerative pathologies (one or more of the following: spinal
stenosis, spondylolisthesis, degenerative disc disease, recurrent disc herniation, failed
back surgery syndrome (FBSS), radiculopathy, pseudarthrosis). Patients undergoing surgery
primarily for infections, vertebral tumors, traumatic or osteoporotic fractures, or deformity
surgery for scoliosis or kyphosis are not eligible. Patients with moderate or severe scoliosis
(coronal Cobb angle >30 degrees / Schwab classification sagittal modifier + or ++) are not
eligible, nor are patients undergoing surgery at more than 6 vertebral levels. Patients
with missing endpoint data at 12 months will be excluded. Patients are required to give
informed consent, and only patients aged 18 or older are considered for inclusion.
Data Collection
Each center will collect its data either retrospectively, from a prospective registry, or
from a prospective registry supplemented by retrospectively collected variables. Each center
has to contribute a minimum of 100 patients with complete 12-month follow-up data to be
included in the study. A standardized Excel database will be provided by the sponsor for
anonymous data entry. The data will be entered in standardized and deidentified form, and
this Excel database will only contain a study-specific patient number. Each center will keep
an internal spreadsheet in which the study-specific patient numbers can be traced back to
center-specific patient numbers, should this be necessary. The deadline for submission of
the complete data to the sponsor institution is August 13, 2021.
Authorship
Centers will have to contribute at least 100 cases with complete outcome data in
total to be included in the study. Each participating center will be able to designate a
maximum of four authors to be included in the author list. Any other center-specific
contributors will be listed as full members of the FUSE-ML study group and will be granted
full PubMed / Medline contributor status. The sponsor institution will have six primary
author positions available.
Primary Endpoint Definitions
Several endpoints will be assessed:
- 1. Oswestry Disability Index (ODI) at 12 months.
- 2. Visual Analogue Scale (VAS-BP, 0 to 100) for back pain at 12 months. This can also be
a converted numeric rating scale (NRS) from 0-10, or a VAS from 0 to 10 converted to 0
to 100.
- 3. Visual Analogue Scale (VAS-LP, 0 to 100) for leg pain at 12 months. This can also be
a converted numeric rating scale (NRS) from 0-10, or a VAS from 0 to 10 converted to 0
to 100.
These outcomes will be dichotomized using the minimum clinically important difference (MCID)
according to Ostelo et al. Thus, a 30% or greater improvement in a specific score compared to
baseline will be considered as achievement of MCID (clinical success) for that score.
If a patient presented with zero symptoms initially (in ODI, VAS-BP, or VAS-LP) and
remained at zero for that score, this will also be counted as achievement of MCID for that score.
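This dichotomization rule is simple enough to express directly in R; the following is a minimal sketch, with illustrative argument names that are not prescribed by the protocol:

```r
# Minimal sketch of the MCID dichotomization rule described above.
# `baseline` and `followup` are illustrative names, not protocol-defined.
mcid_achieved <- function(baseline, followup, threshold = 0.30) {
  # Patients starting at zero and remaining at zero also count as MCID
  ifelse(baseline == 0,
         followup == 0,
         (baseline - followup) / baseline >= threshold)
}

mcid_achieved(baseline = 40, followup = 26)  # 35% improvement -> TRUE
mcid_achieved(baseline = 0,  followup = 0)   # zero-to-zero    -> TRUE
```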
Features and Their Definitions
All features are measured or estimated preoperatively. In addition to the endpoints, the
following input features will be collected:
- Age (years)
- Gender (m/f)
- Presence of the following indications for surgery (choose all that apply):
  - Spondylolisthesis
  - (Recurrent) disc herniation
  - Radiculopathy
  - Chronic low back pain (CLBP) / Degenerative disc disease (DDD)
  - Failed back surgery syndrome (FBSS)
  - Lumbar spinal stenosis
  - Pseudarthrosis
- Index Level(s) (choose all that apply, T12 - S1)
- Height (cm)
- Weight (kg)
- BMI (kg/m2)
- Smoking status (active / ceased / never)
- Preoperative (baseline) ODI
- Preoperative (baseline) VAS-BP
- Preoperative (baseline) VAS-LP
- American Society of Anesthesiologists (ASA) Score (1-2 / 3 or higher)
- Preoperative use of opioid pain medication (yes / no)
- Asthma as a comorbidity (yes / no)
- Prior thoracolumbar spine surgery (yes / no)
- Race/Ethnicity (Caucasian / Black / Asian / Other)
- Surgical approach (choose all that apply: TLIF / PLIF / ALIF / Lateral)
- Pedicle screw insertion (yes / no)
- Minimally invasive technique (yes / no)
Sample Size
While even the largest cohort with millions of patients is not guaranteed to result in a
robust clinical prediction model if no relevant input variables are included ("garbage in,
garbage out" - do not expect to predict the future from age, gender, and body mass index),
predictive performance generally increases with sample size, especially for data-hungry ML
algorithms. To ensure generalizability of the clinical prediction model, the sample size
should be representative of the patient population and should take the complexity of the
algorithm into account. For instance, a deep neural network - as an example of a highly
complex model - will often require thousands of patients to converge, while a logistic
regression model may achieve stable results with only a few hundred patients.

In addition, the number of input variables plays a role. Roughly, a bare minimum of 10
positive cases is required per included input variable to model the relationships. Erratic
model behavior and high variance in performance among splits are often observed when sample
sizes are smaller than suggested by this rule of thumb. Of central importance is also the
proportion of patients who experience the outcome: for very rare events, a much larger total
sample size is consequently needed. For instance, a prediction based on 10 input features for
an outcome occurring in only 10% of cases would require at least 1000 patients, including at
least 100 who experienced the outcome, according to the above rule of thumb.

In general and from personal experience, the authors do not recommend developing ML models on
cohorts with fewer than 100 positive cases and reasonably more cases in total, regardless of
the rarity of the outcome. One might also consider the available literature on risk factors
for the outcome of interest: if epidemiological studies find only weak associations with the
outcome, more patients will likely be required to arrive at a model with good predictive
performance than for an outcome with several highly associated risk factors, which may be
easier to predict. Larger sample sizes also allow for more thorough evaluation, as more
patient data can be dedicated to training and validation, and usually result in better
calibration.
Between 20% and 40% of patients report no clinically relevant improvement after spinal fusion
(minority class). For the sample size calculation, the authors take 20% as a conservative
estimate. Consequently, based on the authors' expertise and on the rules of thumb mentioned
above, the authors estimate that a minimum of 200 patients with a negative outcome (minority
class) are required to extract generalizable feature relationships. With an estimated
incidence of approximately 20% as explained above, this means that a minimum of around 1000
patients is required for training. For adequate evaluation of calibration at external
validation, the authors estimate that another 300 patients will be required (thus,
approximately 60 patients in the minority class). In total, the authors therefore estimate
that a minimum of 1300 patients is necessary to arrive at a robust model; more data will
likely lead to better performance and calibration. The arithmetic is sketched below.
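As a plain restatement of this arithmetic in R (all values taken from the protocol text above):

```r
# Events-per-variable rule of thumb, using the 10-feature / 10% example
events_per_variable <- 10
n_features          <- 10
event_rate          <- 0.10
(min_events <- events_per_variable * n_features)  # 100 positive cases
(min_total  <- min_events / event_rate)           # 1000 patients overall

# FUSE-ML estimate: 200 minority-class patients at a 20% rate
200 / 0.20   # 1000 patients for training
1000 + 300   # 1300 patients in total, incl. external validation
```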
Predictive Modeling
A KNN imputer will be co-trained to impute any missing data that may occur in future
application of the model. If there are missing data in the training set, they will be imputed
using said KNN imputer. Features or patients with more than 25% missingness will be excluded.
Data will be standardized and one-hot encoded. In case of major class imbalance - which is
expected for the abovementioned endpoints - random upsampling or synthetic minority
oversampling (SMOTE) will be applied to the training set. All features will initially be
provided to the model for training. If necessary, the authors will apply recursive feature
elimination (RFE) to select input features on the training data.
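The protocol does not prescribe specific R packages for these steps; one way to realize them is with the recipes and themis packages, sketched here under stated assumptions (the data frame `train` and the factor outcome `mcid` are placeholders):

```r
library(recipes)  # preprocessing pipeline
library(themis)   # SMOTE step

preproc <- recipe(mcid ~ ., data = train) %>%
  step_impute_knn(all_predictors()) %>%                    # co-trained KNN imputer
  step_normalize(all_numeric_predictors()) %>%             # standardization
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # one-hot encoding
  step_smote(mcid)                                         # oversample minority class

prepped  <- prep(preproc, training = train)
train_pp <- bake(prepped, new_data = NULL)  # SMOTE affects the training set only
```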
The authors will trial the following algorithms for binary classification: generalized linear
model (GLM), generalized additive model (GAM), stochastic gradient boosting machine (GBM),
naïve Bayes classifier, simple artificial neural network, support vector machine (SVM), and
random forest. Each model will be fully trained and tuned for hyperparameters where
applicable. The final model will be selected based on AUC, sensitivity, and specificity, as
well as calibration metrics, on the resampled training performance. Training will occur in
repeated 5-fold cross-validation with 10 repeats, as sketched below.
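One possible realization of this resampling setup uses the caret package (a sketch, not the authors' actual code; `train_pp` carries over from the sketch above, and the outcome levels are assumed to be valid R names such as "yes"/"no"):

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_glm <- train(mcid ~ ., data = train_pp, method = "glm",
                 family = "binomial", metric = "ROC", trControl = ctrl)
fit_rf  <- train(mcid ~ ., data = train_pp, method = "rf",
                 metric = "ROC", trControl = ctrl)
# ...analogously for "gam", "gbm", "naive_bayes", "nnet", and "svmRadial"

summary(resamples(list(glm = fit_glm, rf = fit_rf)))  # compare resampled AUCs
```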
The final model will then be assessed on the external validation data only once. 95%
confidence intervals for the external validation metrics will be derived using the bootstrap.
The threshold for binary classification will either be identified on the training data alone
using the AUC-based "closest-to-(0,1)" criterion or Youden's index to optimize both
sensitivity and specificity, or will be optimized on the training set based on clinical
significance (rule-out model). All analyses will be carried out in R version 4.0.2 or more
recent.
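Both threshold criteria are available in the pROC package; a sketch under the same placeholder names as above:

```r
library(pROC)

probs   <- predict(fit_glm, newdata = train_pp, type = "prob")[, "yes"]
roc_obj <- roc(response = train_pp$mcid, predictor = probs,
               levels = c("no", "yes"))

# Closest-to-(0,1) criterion
coords(roc_obj, x = "best", best.method = "closest.topleft", transpose = FALSE)
# Youden's index
coords(roc_obj, x = "best", best.method = "youden", transpose = FALSE)
```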
Evaluation
The performance of classification models can roughly be judged along two dimensions:
discrimination and calibration. Discrimination denotes the ability of a prediction model to
correctly classify whether a certain patient is or is not going to experience a certain
outcome; thus, discrimination describes the accuracy of a binary prediction - yes or no.
Calibration, however, describes the degree to which a model's predicted probabilities
(ranging from 0% to 100%) correspond to the actually observed incidence of the binary
endpoint (true posterior). Many publications do not report calibration metrics, although
these are of central importance, as a well-calibrated predicted probability (e.g. "your
predicted probability of experiencing a complication is 18%") is often much more valuable to
clinicians - and patients! - than a binary prediction (e.g. "you are likely not going to
experience a complication").
Resampled training performance as well as performance on the external validation set will be
assessed for discrimination and calibration. In terms of discrimination, the authors will
evaluate AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), negative
predictive value (NPV), and F1 score. In terms of calibration, the authors will assess the
Brier score, the expected/observed (E/O) ratio, calibration slope and intercept, and the
Hosmer-Lemeshow goodness-of-fit test, as well as visually inspect calibration plots for
both datasets, which will also be included in the publication.
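The calibration metrics named above can be computed directly from predicted probabilities and observed outcomes; a hedged sketch (`p` and `y` are placeholder vectors of predicted probabilities and 0/1 outcomes, not protocol-defined objects):

```r
brier    <- mean((p - y)^2)    # Brier score
eo_ratio <- mean(p) / mean(y)  # expected/observed (E/O) ratio

# Calibration slope and intercept via logistic recalibration
logit_p   <- qlogis(p)
slope     <- coef(glm(y ~ logit_p, family = binomial))["logit_p"]
intercept <- coef(glm(y ~ offset(logit_p), family = binomial))["(Intercept)"]

# Hosmer-Lemeshow goodness-of-fit test (ResourceSelection package)
ResourceSelection::hoslem.test(y, p, g = 10)
```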
Interpretability
The degree and choice of methods for interpretability will depend on the finally chosen
algorithm. Some algorithms can natively provide explanations as to which factors influence
the outcome in what way. Thus, if e.g. a GLM, GAM, or naïve Bayes classifier is chosen, the
parameters / partial dependence values will be provided. For simple decision trees, diagrams
of the decision-making process can be provided. Other models with higher degrees of
complexity, such as neural networks or stochastic gradient boosting machines, cannot natively
provide such explanations. In that case, the authors will provide both AUC-based variable
importance and model-agnostic local interpretations of variable importance using the LIME
principle.
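For illustration, both outputs could be produced as follows (a sketch; `train_pp`, `fit_rf`, and the external validation set `valid_pp` are placeholders carried over from the earlier sketches):

```r
library(caret)
library(lime)

# AUC-based (filter) variable importance
predictors <- setdiff(names(train_pp), "mcid")
auc_imp    <- filterVarImp(x = train_pp[, predictors], y = train_pp$mcid)

# Model-agnostic local explanations via LIME
explainer   <- lime(train_pp[, predictors], fit_rf)
explanation <- explain(valid_pp[1:5, predictors], explainer,
                       n_labels = 1, n_features = 5)
plot_features(explanation)  # per-patient feature contributions
```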
Expected Results
The authors expect to arrive at a generalizable model based on multicenter international data
that predicts consistently with an AUC of at least 0.70 and that is well-calibrated. A
web-based prediction tool will also be created for each of the models using the shiny
environment, much akin to e.g. https://neurosurgery.shinyapps.io/impairment (also see, for
example: Staartjes et al., Journal of Neurosurgery, 2020). This web-based app will be
available for free on any internet-capable device (mobile or desktop) and should be stable on
most devices due to the server-based computing (a minimal skeleton is sketched at the end of
this section). The costs for maintaining the server will be carried by the sponsor. The
collected data will be stored by the sponsor for 10 years. The large dataset will be open to
further analysis and will be provided to any of the contributing centers upon reasonable
request and after approval by all other centers. The goal is to enable other analyses using
the collected dataset. If any additional analyses lead to publication, all contributors will
be included as co-authors, and all co-authors will have the opportunity to review said
manuscript beforehand. Any contributing study center has the right to veto publication of any
subsequent analyses containing its own data.
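For illustration only, a minimal skeleton of such a shiny app might look as follows (`final_model` and the two inputs shown are placeholders; the actual tool would expose all model features):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("FUSE-ML risk calculator (sketch)"),
  numericInput("age", "Age (years)", value = 60, min = 18, max = 100),
  numericInput("odi", "Baseline ODI", value = 40, min = 0, max = 100),
  textOutput("risk")
)

server <- function(input, output) {
  output$risk <- renderText({
    # Assemble one-row input data and query the trained model (placeholder)
    newdata <- data.frame(age = input$age, odi_baseline = input$odi)
    p <- predict(final_model, newdata = newdata, type = "prob")[, "yes"]
    sprintf("Predicted probability of achieving MCID: %.0f%%", 100 * p)
  })
}

shinyApp(ui, server)
```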