Clinical Trial Details
— Status: Not yet recruiting
Administrative data
NCT number | NCT05774457
Other study ID # | 1879014-3
Secondary ID |
Status | Not yet recruiting
Phase | N/A
First received |
Last updated |
Start date | April 1, 2024
Est. completion date | March 31, 2025

Study information
Verified date | March 2023
Source | University of Delaware
Contact | Katherine Verdolini Abbott, PhD
Phone | 302-831-0956
Email | kittie[@]udel.edu
Is FDA regulated | No
Health authority |
Study type | Interventional
Clinical Trial Summary
The specific aim of the clinical trial portion of the larger research project is to obtain
preliminary data on the utility of voice training (resonant voice) in the VR environment
compared to a traditional clinical environment, using a mixed model within- and
between-subjects randomized experimental design.
Independent Variables are (1) training and test condition (clinic room vs. VR classroom for
training); (2) visual speaker-to-listener distance (2 m, 4 m, and 6 m for training); and (3)
time point (baseline at 2 m, retention test at 4 m, and transfer test at 9 m). Dependent
Variables are (a) vocal sound pressure level (SPL) and (b) spectral moments (spectral mean
and standard deviation, in Hz and cents; skewness; and kurtosis).
The hypothesis is that a two-way interaction between training condition and time point will
emerge, with greater acquisition and transfer of voice skills following training in the VR
environment than in the typical clinical environment.
This series will use innovative, sophisticated VR technology to identify parameters
important for subsequent VR development in voice therapy, and to lay the empirical
foundation for subsequent studies that build on the present work, expanding both its basic
science and translational value.
Description:
This project addresses three Specific Aims. Specific Aims 1 and 2 set up many of the
parameters for the clinical trial, which is addressed in Specific Aim 3. Details for the
project as a whole, including the clinical trial, are as follows, copied and pasted from the
grant proposal.
3.0 RESEARCH APPROACH: The overall purpose is to investigate the effects of auditory, visual,
and audiovisual information on the perception and production of one's own voice, using VR as
an investigation tool, and to provide preliminary data on the potential utility of VR in the
voice training environment. Details regarding each Specific Aim are provided in the relevant
sections below.
3.1 Participants: For SA1 and SA2, 60 vocally healthy classroom teachers will be recruited
between the ages of 24 and 50 years (see 3.2). At the lower end, this age range represents
the earliest age at which teachers might initiate their professional teaching careers; at the
upper end, it represents the average age of onset of menopause for women, as we wish to limit
hormonal and other age-related influences in the data. All participants will participate in
SA1 and SA2, which use the same simultaneous data collection procedures with different
analyses for perceptual measures (SA1) and production measures (SA2). For SA3, which is
exploratory, 10 additional healthy teachers will be recruited with the same characteristics.
For all SAs, inclusion and exclusion criteria are as follows.
Inclusion, by self-report: (1) K-12 classroom teacher with at least two years' teaching
experience (SA1 and SA2) or elementary school classroom teacher (SA3), between 24 and 50
years of age; (2) no history of voice disorder lasting more than two weeks, and Voice
Handicap Index-10 (VHI-10) [63] score < 10; (3) lifetime non-smoker; (4) no hearing
impairment or uncorrected visual impairment. By written documentation: (5) proof of full
COVID-19 vaccination. By clinical evaluation: (6) normal voice on days of participation, as
assessed by a voice-specialized licensed SLP based on an overall severity score < 10 on the
Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [65].
Exclusion, by self-report: (7) history of vocal fold pathology or other pathology affecting
voice; (8) any acute condition that may affect voice production, such as coughing, nasal
congestion, or temperature greater than 98.6 °F (37.0 °C).
Note that only vocally healthy teachers are assessed at this stage, before introducing the
complexities associated with voice disorders. Those complexities will be addressed in later
translational work that builds on the present series. The research program is ultimately
relevant not only for teachers with voice problems but also for the working environment of
currently healthy teachers.
3.2 Power analyses: Power analyses assumed a medium effect size of d = 0.4, two-sided, for
tests of all dependent variables across SA1 and SA2. Results suggested that N = 51 will be
sufficient to detect findings for all variables at a significance level of α = 0.05 with
power of 0.8. Accounting for possible attrition, 60 participants will be recruited for Aims 1
and 2. This number has been shown by our Co-Investigator Bottalico, and in our own more
recent preliminary data, to be ample to detect significant effects similar to those
investigated in the present series (e.g., differences in perceptual ratings of vocal effort
and comfort, as well as SPL and mean f0; Bottalico, 2017; Bottalico et al., 2016; Daşdöğen et
al., unpublished data). For SA3, which is exploratory, a total of 10 participants will be
recruited to obtain preliminary data for a later clinical series.
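For reference, the stated sample size can be reproduced approximately with a standard power
calculation. Below is a minimal sketch in Python using statsmodels, under the assumption of a
paired (within-subject) t-test; the test family is our reading, not a detail stated in the
protocol.

    # Minimal power-analysis sketch; assumes a paired/within-subject t-test,
    # which is an assumption, not a detail stated in the protocol.
    from statsmodels.stats.power import TTestPower

    n = TTestPower().solve_power(
        effect_size=0.4,            # medium effect size d = 0.4 (per protocol)
        alpha=0.05,                 # two-sided significance level
        power=0.8,                  # target power
        alternative="two-sided",
    )
    print(f"required N ≈ {n:.0f}")  # ≈ 51, matching the protocol's figure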
3.3 Procedures
3.3.1 SA1 and SA2: Sixty K-12 classroom teachers will be recruited through flyers posted in
the community and on social media, and through direct contact with Delaware, New Jersey,
Pennsylvania, and Maryland public schools, all of which may be in close proximity to the
study site at the University of Delaware. Individuals who contact the PI with an interest in
participating will receive an overview of the study by telephone or secure remote connection
and, if they agree, will provide informed consent. Following consent, participants will be
guided to an online REDCap screening questionnaire on a HIPAA-compliant server to address all
inclusion and exclusion criteria except for the clinical auditory-perceptual evaluation of
voice (CAPE-V). Qualified participants will be scheduled for an in-lab appointment at the
University of Delaware STAR campus voice lab. At the beginning of the appointment, the
clinician will assess the participant's voice to confirm normal voice quality using the
CAPE-V. Participants who pass this final screening step (overall CAPE-V severity score < 10)
will proceed to experimental procedures. Others will be excused.
For experimental procedures, first, participants will be trained in the speech tasks that
will be used during the study: introducing themselves to a classroom for 15 seconds,
delivering a two-minute tutorial related to their teaching expertise, sustaining the vowel
/a/ for 3 seconds repeated three times, and producing the CAPE-V phrase, "We were away a year
ago" repeated three times. Participants will then receive instructions for the self-report
questionnaire reflecting self-perceived vocal loudness, vocal effort, and vocal comfort for
the speech stimuli as a set (see 6.0). Following the delivery of these instructions,
instrumentation will be positioned including headset microphones, headphones, and VR glasses
(see 4.0). Participants will then perform the experimental speaking tasks in each of 15
randomly ordered conditions, with and without background noise, that virtually mimic the
auditory and visual properties of real-world rooms ranging from a small room to a large
lecture hall, with dry to highly reverberant acoustics and varying speaker-to-listener
distances.
Room acoustics will be prescribed to characterize and control acoustically varying conditions
across VR environments (ISO 3382; see 5.0). Oral-binaural room impulse response (OBRIR)
measurements will be obtained in a classroom, a lecture hall, and a school auditorium on the
University of Illinois Urbana-Champaign campus, where the dimensions are similar to those of
the VR classrooms (see 5.0). Ovation software (Ovation VRSpeaking, LLC, NJ, US) will be used
to deliver realistic VR rooms and 3D listeners (see 4.0). Experimental tasks will include a
string of prompted speech utterances as for the training condition: (i) with no external
audio or visual feedback or background noise (experimental baseline); (ii) in each auditory
condition alone; (iii) in each visual condition alone; and (iv) in each combination of
auditory and visual conditions. All conditions (ii)-(iv) will be produced with and without
background noise, described shortly (a randomization sketch follows at the end of this
subsection).
For the audio-only conditions, participants will wear an eye mask to block visual
information. In those conditions, to aid with communicative intent, participants will hear
applause that is convolved with matched room acoustic responses before they initiate each
speech string, which will provide an audio-spatial clue about audience presence,
approximately how crowded the environment is, and how far away listeners are in the
environment. To enhance participants' engagement with the environment, before they speak, the
examiner will ask them to estimate audience size and speaker-to-listener distance. For all
visual conditions, participants will see the respective visual rooms and seated listeners who
will react to the speaker in realistic ways (e.g., moving while sitting, scratching head,
etc.). To further promote participant engagement, the examiner will ask each participant
approximately how many people are in the environment and how far away they are. In all
conditions, participants will be prompted to "speak so that everyone can understand you." As
noted, all conditions will be carried out with and without background noise delivered during
participants' speech. The noise level will correspond to a level representative of a typical
classroom environment (average of 54 dB) [66]. After participants have produced speech
utterances in each condition, they will remove the VR goggles or eye mask and will be asked
to complete the questionnaire about self-reported loudness, vocal effort, and vocal comfort
for the preceding utterances. The questionnaire will be displayed on a computer screen that
allows for digital responses on visual analog scales (see 6.0). Then, participants will
proceed to the next VR condition, and so forth, until data collection for all conditions has
been completed, thereby concluding the session. The total duration of the session is expected
to be about 120 minutes.
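To make the per-participant randomization concrete, a minimal sketch of shuffling a factorial
condition grid follows. The factor labels and level counts are placeholders for illustration
only; the protocol states that the actual design comprises 15 conditions.

    # Placeholder condition grid; labels and counts are illustrative, not the
    # protocol's actual 15-condition design.
    import random
    from itertools import product

    rooms = ["room_A", "room_B", "room_C"]            # placeholder room labels
    modalities = ["audio", "visual", "audiovisual"]   # conditions (ii)-(iv)
    noise = [False, True]                             # 54 dB classroom noise on/off

    conditions = [("baseline", None, False)]          # condition (i), no noise
    conditions += list(product(modalities, rooms, noise))

    rng = random.Random()                             # fresh random order per participant
    rng.shuffle(conditions)
    for idx, cond in enumerate(conditions, start=1):
        print(idx, cond)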
3.3.2 SA3: Participants will be 10 classroom teachers. Following satisfaction of inclusion
and exclusion criteria with the exception of clinical auditory evaluation of voice,
qualifying participants will present to the STAR voice lab for voice evaluation as for SA1
and SA2. Participants who pass the voice screening will then be fitted with relevant
instrumentation (3.3.1; 4.0) and will produce the same utterance strings as for SA1 and SA2,
with a speaker-to-target distance of 4 m, specified by a physical mannequin. Then,
participants will be randomly allocated to one of two training conditions: traditional
clinical room or VR environment. In their respective conditions, participants will receive
training in a therapeutic voicing pattern that has value for healthy speakers as well,
"resonant voice" [68-71]. Training will be provided by a speech-language pathologist with at
least two years' experience in voice disorders who has completed standardized training in
Lessac-Madsen Resonant Voice Therapy (LMRVT) [26]. Throughout training in both environments,
background noise will be presented as for SA1 and SA2, in free field for the traditional
environment and over headphones in the VR environment. In both training conditions, materials
from Session Two of LMRVT, the first session in which actual voice training begins in that
program, will be used. For the traditional clinic room condition, after 30 minutes of LMRVT
training using Session Two materials, participants will be guided to repeat the same
exercises, in order, from Session Two, with instructions to produce voice as if speaking to a
person positioned at 2 meters from the speaker, represented by a physical mannequin, for 5
minutes. Then, participants will repeat the same exercises again, as if speaking to the
mannequin positioned at 4 meters for 5 minutes, and finally, at 6 meters for 5 minutes.
Following training at each of these distances, participants will repeat baseline utterances
which will be recorded for audio data collection. For the VR environment condition,
participants will receive the same LMRVT training as for the traditional room, only in a VR
Classroom (Room 1; Table 1) as representative of real-life classroom conditions, created for
SA1 and SA2 but not dependent on results for those Aims. Participants will receive training
in LMRVT Session 2 exercises for 30 minutes, followed by repetition of the same exercises
from Session Two with instructions to produce relevant utterances speaking to the listeners
in the VR environment positioned at 2 meters, 4 meters, and 6 meters from the speaker for 5
minutes at each distance. As for the traditional environment, audio recordings will be made
using baseline utterances following training at each distance. After all training has been
completed, participants in both conditions will be guided to a standard classroom in the STAR
setting (Room 513; volume ~2440 m³, floor area ~69 m²). In that setting, participants will be
asked to repeat baseline speech tasks, speaking to live seated listeners positioned at 4
meters for a retention test and at a novel distance of 9 meters for a transfer test, and
recordings will be made as previously. Participants will then be excused. Total duration is
about 90 minutes. Brief training sessions on the order of about 30 minutes have been shown to
produce shifts in voice production [94], cohering with the PI's extensive clinical
experience. We thus expect to find
such shifts in the present series, which prepares the foundations for more extended
longitudinal studies appropriate for a planned R01 growing from the present work.
4.0 Equipment: A digital audio workstation (Reaper Version 6.36, Rosendale, NY, US) and a
head-mounted microphone (AKG C 520, Harman) will be used to capture voice signals for all
SAs. Audio recordings will be sampled at 44.1 kHz. The mic-to-mouth distance will be 5 cm,
with the microphone positioned at a 45° angle from the participant's mouth [72]. The
microphone will be connected to an audio interface (Babyface Pro FS 24-channel USB 2.0,
Haimhausen, Germany), and the combined input/output latency will be less than 5 ms, below the
16-26 ms range at which an echo becomes noticeable [73]. The interface will be connected to a
computer running the Reaper audio workstation for audio rendering. Virtual reality glasses
(Oculus Rift S)
will be used to produce visual information (rooms and 3D avatar listeners). Room volume
images and listeners will be provided using Ovation software (https://www.ovationvr.com/).
The software allows for the selection of multiple classroom environments that replicate
real-world examples, and for speaking to hundreds of realistic, digitally generated 3D
audience members who respond to the speaker by smiling, clapping, or moving. A recent study
has reported on the effectiveness of this software in creating realistic visual scenarios
[92]. The microphone will be calibrated following published procedures [93]. VR glasses will
be optimally positioned for each participant individually. All utterances will be saved as
WAV files in an encrypted folder.
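As a rough sanity check on the latency budget, the arithmetic can be sketched as follows; the
buffer sizes are assumptions chosen for illustration, not values from the protocol.

    # Rough I/O latency arithmetic; buffer sizes are illustrative assumptions.
    fs = 44_100        # sample rate (Hz), per the protocol
    buf_in = 64        # assumed input buffer (samples)
    buf_out = 64       # assumed output buffer (samples)

    latency_ms = (buf_in + buf_out) / fs * 1_000
    print(f"combined I/O latency ≈ {latency_ms:.1f} ms")  # ≈ 2.9 ms, under the 5 ms target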
5.0 Room acoustics measurements and audio rendering: Rooms similar in size to the ones
selected in Ovation software (a classroom, a lecture hall, and a school auditorium) will be
selected on the University of Illinois Urbana-Champaign campus (by the Co-I Bottalico). The
rooms will be acoustically characterized following ISO 3382.
In the position where a speaker is typically located in each room, oral-binaural room impulse
responses (OBRIRs) will be measured with a Head and Torso Simulator (HATS, GRAS 45BB KEMAR),
using the convolution method following published methods [91]. Specifically, an exponential
sweep signal emitted via the mouth of the HATS will be recorded at the HATS' ears. The
convolution of the recorded sweeps (at the HATS' ears) with the inverse of the emitted sweep
(from the HATS' mouth) will generate the OBRIRs. The OBRIRs will be used to acoustically
recreate the rooms, accounting for the mouth-to-ears path of the speaker.
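For concreteness, a minimal single-channel sketch of exponential-sweep (Farina-style)
deconvolution follows; the sample rate, sweep parameters, and simulated "recording" are
illustrative choices, not the study's measurement settings.

    # Minimal sketch of exponential-sweep deconvolution for impulse response
    # measurement; parameters are illustrative, not the study's settings.
    import numpy as np
    from scipy.signal import fftconvolve

    fs = 48_000                      # sample rate (Hz), illustrative
    T = 10.0                         # sweep duration (s)
    f1, f2 = 20.0, 20_000.0          # sweep band (Hz)

    t = np.arange(int(T * fs)) / fs
    L = T / np.log(f2 / f1)
    sweep = np.sin(2 * np.pi * f1 * L * (np.exp(t / L) - 1))

    # Inverse filter: time-reversed sweep with a -6 dB/octave amplitude envelope
    inv = sweep[::-1] * np.exp(-t / L)

    # In the study, 'recording' would be captured at the HATS ears; here a toy
    # two-path room is simulated for demonstration.
    toy_room = np.zeros(2_000); toy_room[0] = 1.0; toy_room[1_500] = 0.4
    recording = fftconvolve(sweep, toy_room)

    # Convolving the recording with the inverse filter yields the impulse response
    ir = fftconvolve(recording, inv)
    ir /= np.abs(ir).max()           # normalize; main peak sits near sample len(sweep)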
Real-time audio rendering of the subject's voice in the virtual room acoustics will be
accomplished through real-time convolution plug-ins such as Anaglyph and RoomZ, developed by
our consultant Katz at Sorbonne University [74]. The convolution engine will employ measured
room impulse responses. The virtual acoustic rendering will be played back to the participant
over open-back headphones (HD 660S, Sennheiser, Wedemark, Germany), limiting coloration of
the participant's own voice.
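A simplified single-channel illustration of what such a convolution engine does internally
(block-wise FIR filtering with overlap-add) follows; production plug-ins such as Anaglyph use
partitioned, low-latency schemes, so this is a conceptual sketch only.

    # Conceptual single-channel sketch of a real-time convolution engine:
    # block-wise FIR filtering with overlap-add.
    import numpy as np

    def stream_convolve(blocks, ir):
        """Yield convolved output block-by-block, carrying the filter tail."""
        tail = np.zeros(len(ir) - 1)
        for block in blocks:
            out = np.convolve(block, ir)      # length: len(block) + len(ir) - 1
            out[: len(tail)] += tail          # add tail from the previous block
            tail = out[len(block):].copy()    # save the new tail for the next block
            yield out[: len(block)]           # emit exactly one block of audio

    # Example: filter white noise through a toy 3-tap 'room' in 256-sample blocks
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(1_024)
    ir = np.array([1.0, 0.5, 0.25])
    blocks = np.split(audio, len(audio) // 256)
    rendered = np.concatenate(list(stream_convolve(blocks, ir)))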
6.0 Measures: For SA1, self-report perception measures will be derived from three separate
questions about vocal loudness, vocal effort, and vocal comfort, using a Visual Analog Scale
(VAS; Table 2). After each study condition, VR glasses will be removed and participants will
complete the VAS using HIPAA-compliant REDCap. The REDCap perception questionnaire will be
displayed on a computer screen. Participants will respond to each perception question
sequentially by moving a slider on a scale from 0 (not at all) to 100 (extreme[ly]). Each
response will be captured numerically in the REDCap database. Questionnaire completion will
take approximately two minutes, which will provide a rest period after the preceding study
condition and help minimize potential vocal fatigue. For SA2 and SA3, instrumented measures
of voice will include vocal sound pressure level (SPL) and spectral moments (see 7.0).
7.0 Data extraction and analysis: Extraction of speech parameters will be performed with
Matlab R2021b (MathWorks, Natick, MA, United States) and Praat (version 6.2.14). For each of
the recordings, a time history of SPL [90] and of fundamental frequency (f0) will be obtained
with a time step of 0.05 s. The f0 will be estimated with an acoustic periodicity detection
algorithm based on an accurate autocorrelation method, which is more accurate,
noise-resistant, and robust than methods based on the cepstrum, comb filters, or the original
autocorrelation approach. For the two time histories, statistical moments will be calculated
(mean, standard deviation, skewness, and kurtosis). The ability of spectral moments to
distinguish between different degrees of vocal effort has been reported previously [76].
These measures will quantitatively assess key spectral contributions that may be associated
with potential changes in vocal quality, in relation to vocal effort and comfort. However,
perceptual measures of voice quality will not be made in this series.
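Since extraction is specified in Matlab and Praat, the following Python sketch is only an
illustrative equivalent of the described steps; it assumes a calibrated pressure signal and
an externally supplied f0 track (e.g., exported from Praat).

    # Illustrative equivalents of the described extraction steps; the protocol
    # itself specifies Matlab and Praat.
    import numpy as np
    from scipy.stats import kurtosis, skew

    def spl_time_history(x, fs, step=0.05, p_ref=20e-6):
        """RMS-based SPL (dB re 20 µPa) in consecutive 0.05 s frames.
        Assumes x is a calibrated acoustic pressure signal in pascals."""
        n = int(step * fs)
        frames = x[: len(x) // n * n].reshape(-1, n)
        rms = np.sqrt(np.mean(frames ** 2, axis=1))
        return 20 * np.log10(rms / p_ref)

    def to_cents(f0_hz, f_ref=100.0):
        """Convert an f0 track from Hz to cents re an arbitrary reference."""
        return 1200 * np.log2(f0_hz / f_ref)

    def moments(series):
        """Mean, SD, skewness, and (excess) kurtosis of a time history."""
        return {
            "mean": float(np.mean(series)),
            "sd": float(np.std(series, ddof=1)),
            "skewness": float(skew(series)),
            "kurtosis": float(kurtosis(series)),
        }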
8.0 Statistical analyses: For all SAs, Linear Mixed-Effects (LME) models (Matlab R2021b) will
be fitted by restricted maximum likelihood (REML). The dependent and independent variables of
these models are listed in the SA section. Participant ID will be used as a random-effects
term. Here, the random effect of participant ID refers to the partial pooling of observations
by participant, with each participant's slope and intercept treated as random. Models will be
selected based on the Akaike information criterion and the results of likelihood ratio tests.
Tukey's post-hoc pairwise comparisons will be performed to examine the differences between
all levels of a fixed factor of interest when it has more than two levels (in this case, both
the audio and the visual environments). These are pairwise z tests, where the z statistic
represents the difference between an observed statistic and its hypothesized population
parameter in units of the standard deviation. The p-values for these tests will be adjusted
using the default single-step method. The LME output will include the estimates of the
fixed-effects coefficients, the standard error associated with each estimate, the degrees of
freedom (df), the test statistic (t), and the p-value. The Satterthwaite method will be used
to approximate degrees of freedom and calculate p-values.
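Model fitting is specified in Matlab; a sketch of an equivalent LME specification in Python
with statsmodels, using hypothetical file and column names, would be:

    # Equivalent LME sketch in Python; file and column names are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("voice_measures.csv")

    # Fixed effects: training condition x time point (as in SA3); random
    # intercept and slope per participant, i.e., partial pooling by ID.
    model = smf.mixedlm(
        "spl ~ condition * timepoint",
        data=df,
        groups=df["participant_id"],
        re_formula="~timepoint",
    )
    result = model.fit(reml=True)  # REML estimation, as in the protocol
    print(result.summary())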