1. Introduction
In recent years, several machine learning and deep learning (DL) technologies have passed the approval process for a medical device to support radiologists in the diagnosis of medical images.1 Still, the reliability of these new medical software devices and the underlying DL networks strongly depends on the training data and how well they represent the variety of real clinical image data (test images). Castro et al.2 described different sources of “shifts” between training and test domains and among those, the “acquisition shift, resulting from the use of different scanners or imaging protocols, which is one of the most notorious and well-studied sources of dataset shift in medical imaging.” This has been a well-known challenge ever since researchers began trying to derive reproducible measurements of physiologic information from heterogeneous medical image data, e.g., by image harmonization in radiomics research.3 In recent years, various studies have been dedicated to minimizing these shifts by domain adaptation methods. For instance, differences between a target and a source domain can be reduced by image preprocessing (e.g., normalizing intensities, or aligning images), by fine-tuning models on target domain data, or by translation of source into target domain images using generative adversarial networks (GANs) or transformers.4 These methods have been shown to improve the robustness of artificial intelligence (AI) models but do not provide means for systematic testing and quantification of potential (residual) risks during application.
Accordingly, several institutions underline the need for test procedures and published concepts for the evaluation of the robustness and transferability of a model to other data domains.5–8 The ECLAIR guidelines,8 for example, request “to check robustness to variability of acquisition parameters.” This is especially important for magnetic resonance imaging (MRI), because MR acquisition protocols typically have a large number of sequence parameters, which affect the contrast, resolution, and SNR of the acquired images. On the one hand, this allows a wide range of clinical information to be presented by MR images, but on the other hand, it leads to a high heterogeneity between different radiology centers. MR acquisition protocols are often optimized individually at each site and sometimes even for different patients to take patient-specific features (e.g. weight and size) into account.9 Thus, acquisition parameters may vary even for the same type of scan, hence resulting in different image contrasts. There are guidelines providing recommendations on appropriate MR protocols. Among those, e.g., the recently published MAGNIMS–CMSC–NAIMS consensus guidelines10 prescribe the contrast weighting [i.e., T2w, T2w fluid-attenuated inversion recovery (FLAIR), and contrast-enhanced T1w] of the scans that need to be included in the “recommended core” of protocols for the examination of patients with multiple sclerosis (MS). Nevertheless, they lack specific information on contrast-affecting parameters, such as echo, repetition, and inversion time (TE, TR, and TI). 
A multitude of visualization methods have been developed to identify the features within images that a neural network is most sensitive to.11 Other methods quantify the uncertainty of a network during image processing.12 However, there is no test procedure that predicts whether an AI product can be applied to the images of a particular radiology practice without loss of performance, e.g., given their customized imaging protocols. Further, it is currently not possible to determine which acquisition parameters can be changed without compromising the performance of an AI product. The identification of the influencing factors that a system is most prone to is a well-known problem in the field of process improvement and quality management. It is generally solved by systematic testing based on the “design of experiment (DoE)” concept. DoE is a standardized statistical tool for quality control in Six Sigma processes to systematically evaluate the robustness of a process to its influencing factors (see Ref. 13, Chapter 5.4). It predicts the minimum number of experiments needed to quantify and compare the impact of all influencing factors and their interactions on a system’s outcome or performance metric. Combined with dedicated analysis of the results, the dominating factors can be easily identified. However, to optimize the experimental design to the given problem, regression analysis needs to be performed to estimate the underlying model function that quantifies the dependence of the response variable (here: AI network performance) on the process’ input (here: acquisition parameters), see Ref. 13, Chapter 5.3.3.6. Therefore, the foremost objective of this work is to study the dependency of a network to the most relevant contrast-affecting acquisition parameters. In the above-mentioned neuroimaging T2w FLAIR scans for example, the TE and the TI have the strongest influence on the imaging contrast. 
But how can models be validated against the typical MR protocol variability of routine scans or even stress tested against rare but realistic maximum domain shifts if the related data are not available? The benchmark dataset CLEVR-XAI aims to create a selective, controlled, and realistic test environment for the evaluation of explainable neural networks in non-medical applications.14 Similar projects for medical applications have just started.15 Using machine learning and neural networks for the simulation and synthesis of medical images is a field of intense research. Attempts have already been made to recreate MRI images through simulation and synthesis, e.g., using GANs or variational autoencoders (VAEs), phantoms, and dedicated multi-parametric MR sequences.16 Other simulators use virtual phantoms, for example from Brainweb and Shepp–Logan, which represent the human brain17,18 to generate images that represent a particular protocol. The limiting factors in all the above-mentioned approaches, however, are the limited number of anatomies (Brainweb), the lack of anatomical realism (Shepp–Logan), the dependency on specific software (sequences) or hardware (phantoms), or the limited ability to synthesize the result of arbitrary MRI sequence settings with only one model (GANs, etc.). The secondary objective of this study is thus the combination of simulation and synthesis to generate artificial MRI data of arbitrary sequence character (i.e., “shift derivatives”) from a set of real MR images. These data are finally used to stress test a model against variations of acquisition parameters. For the sake of simplicity, the experiments in this study focus on the simulation of domain shift derivatives of T2w FLAIR scans for different TE and TI values to describe the performance of MS lesion segmentation networks as a function of these scan parameters.
2. Methods
This work comprises two levels of methodology and experiments (see Table 1).
The first level is the simulation of domain shift derivatives from a real baseline image dataset; the second is the use of these data to stress test state-of-the-art (SOTA) MS lesion segmentation networks against these shifts. Those networks are trained on data (Table 2) of heterogeneous contrast (e.g., from different field strengths and using different acquisition protocols). The stress tests are intended to evaluate to what extent the networks are robust to changes of image contrast. The simulated data are validated against real MRI scans. The dependency of the models’ performance (F1-score) on changes of the MRI protocol parameters (TI, TE) is modeled by second-order polynomial functions, as recommended by the above-mentioned DoE guidelines, so that the robustness of the networks against acquisition shifts can be compared quantitatively via the functions’ coefficients.
Table 1. Research questions, methodology, and experiments.
Table 2. Datasets used in this work. The first dataset (OpenMS* longitudinal) is used as baseline data in the simulation, since it is the only dataset for which all contrast-affecting parameters (TE, TI, and TR) are provided.
The MS data used in this study consist of several open MRI benchmark datasets (see Table 2).
2.1. Concept of Image Generation to Mimic Acquisition Shifts
Data simulation uses an in-vivo MRI scan (baseline data) and mimics changes in that baseline scan in response to changing sequence parameters. The concept of image generation is based on the following equation: with , being the simulated signal at pixel position . The contribution of each tissue t to the signal of a pixel or voxel is weighted with its local volume fraction . is the (typically unknown) digital imaging and communications in medicine (DICOM) scaling factor. The texture map is introduced to approximate all texture influences other than tissue, e.g., based on artifacts, field inhomogeneities, noise, etc. The entire image generation process therefore consists of two different steps (Fig. 1). The first step comprises the preliminary estimation of these tissue properties, followed by the second step, the final image simulation according to Eq. (1). is the signal as determined by the sequence and the tissue properties, i.e., the parameters of the underlying tissue t [like the spin density and relaxation parameters T1 and T2 of gray matter (GM), white matter (WM), cerebrospinal fluid (CSF), and lesion]. is given by the T2w FLAIR signal equation in Eq. (2) as published in Ref. 25 with , i.e., the sequence parameters.
2.1.1. Simulation and synthesis methods
Equations (1) and (2) contain a number of tissue parameters that must be represented as realistically as possible for the data generation process but cannot be easily simulated (e.g., anatomical structures, lesion sizes, and locations). The idea behind the proposed generative approach is thus to combine image synthesis and simulation as follows.
2.1.2. Partial volume estimation
For estimation of the partial volume fractions of each tissue, we apply the method described in Ref. 26. This approach requires that a signal rise or decline from one region to the other is unique for one kind of tissue-tissue interface. However, if the brain contains lesions, a rise of signal when leaving the WM region may be attributed to either a WM-lesion or a WM-GM interface. The partial volume maps are thus generated in several steps, assuming that lesions are solely located in and surrounded by WM.27 First, as required by the approach, segmentation masks are created. We used Synthseg28 for segmentation of normal tissues, and expert lesion masks were provided through the datasets.29 Second, the T1w scans are used to estimate the PV-maps , , and of normal tissue. Lesion pixels might be falsely assigned to the PV-map of GM, which can be easily corrected by setting the GM maps to 0 at all lesion pixels as given by segmentation. Third, WM and lesion ROIs are extracted from the FLAIR images and are fed through the PV-algorithm to obtain another and map. The final is initialized with . Finally, in pixels, where , the partial volume fraction in WM is then set to . All steps are summarized in Fig. 2.
2.1.3. Estimation of the DICOM scaling factor and the texture map
A simplified version of Eq. (1) describes the signal of those pixels of the real baseline image that contain only one tissue fraction () Since both and are unknown, the problem of computing is underdetermined. We solve this by introducing the assumption that signal variations are primarily caused by noise and thus the average texture in this region is 0. Eq. (3) can then be written as This allows for a preliminary estimation of the apparent tissue parameters from the ratio of average real and simulated signals for different tissues [the ratio eliminates the unknown in Eq. 
(4)], or more precisely by comparing the real and simulated contrast metrics given in the following equations: The parameters of are optimized to minimize the cost function Then, can be estimated using Eq. (4). Now that all unknowns are determined, Eq. (1) is solved to determine the texture map (see Fig. 3).
2.1.4. Experiments - comparison of simulation and measurement
MR images of 10 healthy volunteers were acquired to compare the simulations with real measurements. The examinations were approved by the ethics committee of the Physikalisch-Technische Bundesanstalt and are in accordance with the relevant guidelines and regulations. Written informed consent was obtained from all volunteers prior to the measurements. Data were acquired at 3T (Siemens Verio) using the following sequences: a magnetization prepared rapid gradient echo for the estimation of the PV-maps (3D, TR = 2300 ms, TI = 900 ms, TE = 3.2 ms, voxel size: ) and five T2w FLAIR scans as a reference measurement for the simulated images (Multislice 2D, TR = 9000 ms, voxel size: ) with TE and TI values as given in Table 3 to represent the extreme shift derivatives of the possible scan domain and its center (see Fig. 5). The “center” protocol serves as the baseline scan for the simulations of the “corner” protocols.
Table 3. TE and TI of the five T2w FLAIR acquisition protocols.
Reference T1 values were obtained from saturation-recovery measurements. Eleven T1-weighted images for different saturation delay times (TD = 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0, and 8.0 s) were acquired using a fully sampled single-shot centric-reordered GRE readout (TE/TR = 3.0/6.5 ms, flip angle = 6 deg, voxel size: ) implemented in pulseq.30 Final quantitative T1 values were generated using a non-linear least squares curve fitting algorithm31 assuming a simple mono-exponential magnetization recovery. T2 reference values were derived from the two different TEs ( and ) of the FLAIR scans using the following equation: The T2 estimates obtained with TI = 2900 ms and 2200 ms are averaged to deliver the final reference T2 values. The relaxometry estimates described in Sec. 2.1.3 are compared to these reference values and to values given by literature.32–35 Finally, the five real and simulated scans are compared by the theoretical percentage signal deviation per ms relaxometry errors and approximated by error propagation as and in dependence of T1 and T2 to confirm that signal differences are related to relaxometry imperfections. The stress test pipeline is summarized in Fig. 4 and comprises two steps as described in the following sections.
2.2. Model Stress Tests to Determine the Influence of Acquisition Shifts
2.2.1. Generation of test data
With the methods described in Sec. 2.1, derivatives of the baseline data can be generated that represent arbitrary acquisition shifts of a baseline scan (“shift derivatives”). Typical variations of scan protocols (minimum and maximum TE and TI values) were estimated using literature and real scans. The outcome of that investigation is published in Ref. 36 and is depicted in Fig. 5. test datasets were generated that represent seven different TE values and seven different TI values, since these are the most contrast-affecting parameters in T2w FLAIR sequences.
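To make the idea of a TE/TI test grid more concrete, the following sketch evaluates a lesion-to-WM contrast on a 7 x 7 grid of sequence settings. It is only an illustration under several assumptions that are not taken from this work: the signal model is the common simplified inversion-recovery spin-echo expression, not necessarily Eq. (2) of Ref. 25; the tissue parameters in `TISSUES` and the TE range are hypothetical placeholder values; and the function names are invented for this example. Only TR = 9000 ms and the TI values 2200 ms and 2900 ms are borrowed from the volunteer protocol described above.

```python
import numpy as np


def flair_signal(te, ti, tr, rho, t1, t2):
    """Simplified inversion-recovery spin-echo magnitude signal (times in ms).

    Assumed textbook approximation, not necessarily the equation of Ref. 25.
    """
    mz = 1.0 - 2.0 * np.exp(-ti / t1) + np.exp(-tr / t1)
    return np.abs(rho * mz * np.exp(-te / t2))


# Hypothetical tissue parameters (relative spin density, T1 and T2 in ms).
TISSUES = {
    "wm": dict(rho=0.70, t1=850.0, t2=75.0),
    "lesion": dict(rho=0.80, t1=1300.0, t2=110.0),
}


def shift_grid(te_range, ti_range, n=7):
    """Return n x n grids of TE and TI values spanning the given ranges."""
    te = np.linspace(te_range[0], te_range[1], n)
    ti = np.linspace(ti_range[0], ti_range[1], n)
    return np.meshgrid(te, ti, indexing="ij")


def lesion_wm_contrast(te, ti, tr=9000.0):
    """Michelson-type contrast between lesion and white matter."""
    s_wm = flair_signal(te, ti, tr, **TISSUES["wm"])
    s_les = flair_signal(te, ti, tr, **TISSUES["lesion"])
    return (s_les - s_wm) / (s_les + s_wm)


# The TE range is a placeholder; the real ranges are those shown in Fig. 5.
te_grid, ti_grid = shift_grid((80.0, 200.0), (2200.0, 2900.0))
contrast = lesion_wm_contrast(te_grid, ti_grid)
print(contrast.shape)  # (7, 7)
```

With these placeholder values the contrast rises with TE, which is in line with the qualitative behavior described for the simulated images, but the absolute numbers carry no meaning beyond illustration.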
2.2.2. Modeling the network performance in dependence of sequence parameters
The lesion F1 score of a lesion segmentation network can be determined for all of these data by comparing the network prediction with the lesion ground truth segmentation masks. Averaging all lesion F1 scores finally delivers F1 as a function of TE and TI. We use a response surface method (quadratic model, cubic terms neglected) to describe the dependence of F1 on arbitrary values of the influencing factors TE and TI and their interactions as recommended by Ref. 13. Accordingly, the quadratic model in Eq. (10) is fitted to these F1 measurements The coefficients to can each be understood as a measure of the relevance of the influencing factors TE and TI (main factors) and their interactions .
2.2.3. Experiments - stress testing SOTA models against acquisition shifts
To validate the model function described in Eq. (10), two SOTA models are trained on data with heterogeneous contrast as described in Table 2. First, the nnU-Net framework is used, which utilizes a U-Net architecture and automatically configures its hyperparameters and configuration.37 The first model is a 3D full-resolution nnU-Net, which is chosen by nnU-Net’s auto-configured framework as the best-performing model among 2D and low-resolution 3D counterparts. Training is done by nnU-Net’s self-configured automatic framework, where fivefold cross-validation is employed with 80% for training and 20% for validation, and the best-performing fold is chosen as the final model. The second model is a SegResNet model, which uses ResNet-like blocks and skip connections without the variational autoencoder part.38 The network is trained with cropped blocks for 1000 epochs with an Adam optimizer and a learning rate of 0.001 using PyTorch and MONAI tools. The training data are randomly split into fractions of 80% for training and 20% for validation.
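The response surface fit of Sec. 2.2.2 can be illustrated with an ordinary least-squares sketch. Several points are assumptions of this example rather than details confirmed by the text: it uses the standard full second-order model in two factors (intercept, two linear terms, two quadratic terms, one interaction), whereas Table 6 lists coefficients c1 to c7, so the exact term set of Eq. (10) may differ; the factors are rescaled to coded units in [-1, 1] purely for numerical conditioning, whereas the paper reports coefficients in ms−1 and ms−2; and the function names, ranges, and synthetic F1 values are hypothetical.

```python
import numpy as np


def code_factor(values, low, high):
    """Map a factor from [low, high] to coded units in [-1, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - 0.5 * (high + low)) / (0.5 * (high - low))


def design_matrix(x1, x2):
    """Columns: 1, x1, x2, x1^2, x2^2, x1*x2 (assumed full quadratic model)."""
    x1 = np.asarray(x1, dtype=float).ravel()
    x2 = np.asarray(x2, dtype=float).ravel()
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])


def fit_response_surface(x1, x2, f1):
    """Least-squares fit; returns the coefficients and R^2 of the fit."""
    X = design_matrix(x1, x2)
    y = np.asarray(f1, dtype=float).ravel()
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs
    r2 = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
    return coeffs, r2


# Synthetic stand-in for measured F1 values on a 7 x 7 TE/TI grid.
te, ti = np.meshgrid(np.linspace(80.0, 200.0, 7),
                     np.linspace(2200.0, 2900.0, 7), indexing="ij")
x_te = code_factor(te, 80.0, 200.0)
x_ti = code_factor(ti, 2200.0, 2900.0)
true_coeffs = np.array([0.70, 0.10, 0.03, -0.05, -0.02, 0.01])
f1_values = design_matrix(x_te, x_ti) @ true_coeffs

coeffs, r2 = fit_response_surface(x_te, x_ti, f1_values)
```

In coded units the magnitudes of the fitted coefficients can be compared directly between factors, which is one reason coded variables are common in experimental design; converting back to physical units only requires the scaling used in `code_factor`.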
The “longitudinal” OpenMS dataset is the only open benchmark dataset for which all contrast-affecting parameters (TE, TI, TR) are provided (Table 2). All data are skull stripped using the FSL brain extraction tool (FSL BET)39 prior to all processing steps. The average F1 is determined and modeled as a function of TE and TI as described in Sec. 2.2.2. The coefficient of determination is used to evaluate the appropriateness of the model function in Eq. (10).
3. Results
3.1. Comparison of Simulation and Measurement
Figure 6 shows the variation of the estimated and reference relaxation measurements in comparison to the literature ranges. The estimated and measured relaxation times mostly lie within the literature range. As further underlined by the mean relaxometry values in Table 4, the high T1 value and the low FLAIR signal hamper relaxometry in CSF. The literature does not report on CSF T2 measurements at 3T. T2 is independent of the field strength but even at 1.5 T, to our knowledge, the Brainweb catalogue is the only literature source reporting a T2 value for CSF (329 ms), although the values presented in that catalogue (in WM and GM) tend to be lower than most other values at 1.5 T.40
Table 4. Mean values for T1 and T2 in normal tissue. All values are given in ms.
Visually, the images obtained by the simulations and measurements agree well (Fig. 7). Small scaling errors of the nulled CSF signal result in high relative signal deviations. In addition, Table 5 lists the relative error between real and simulated images in different manually drawn ROIs.
Table 5. Comparison of the mean signals of WM, GM, CSF, and skull of simulation and reference MRI with relative percentage error.
The deviation between the simulated and the measured MR signals in WM is higher than in GM. The theoretical error propagation of the relaxometry estimates on the simulated signal is depicted in Fig. 8.
3.2. Results of Stress Testing SOTA Models Against Acquisition Shift
Testing the models with the real baseline data and their simulated counterpart (TE = 140 ms and TI = 2800 ms) yields F1 scores that differ only in the fourth decimal place (OpenMS data: SegResNet: ; nnU–Net: , see Fig. 9). The coefficient of determination of the model fit (second-order polynomial) is 0.991 for the SegResNet results and 0.982 for the nnU-Net results. The coefficients for Eq. (10) in Table 6 show that TE has the highest influence on both segmentation networks.
Table 6. Coefficients c1 to c7 as given by the model fit [see Eq. (10)]. Units are given in ms−1 and ms−2 for linear, quadratic, and combined terms, respectively. The highest coefficients are those scaling the influencing factor TE.
In the simulated images of Fig. 10, the lesion-to-WM contrast decreases for lower TE and TI values. This is accompanied by a decline of the F1-score, i.e., the models’ ability to differentiate between the lesion and white matter decreases with lower contrast.
4. Discussion and Conclusion
The image generation method simulates acquisition shift derivatives of a real baseline scan for arbitrary sequence parameters. It was designed to be applicable to common clinical neuroimaging studies that normally contain T2w FLAIR and T1w images. It does not require extra sequences but only knowledge of the scan parameters of the baseline T2w FLAIR data.
4.1. Comparison of Simulation and Measurements
At the extreme points of the experimental design, the simulation shows a 19% deviation from the measured values in white matter and a lower deviation in gray matter. This can most likely be explained by the inaccuracies of the relaxometry method used in this work. Using the error propagation as a rough guess, the misestimation of 19% could be explained by a 19 ms deviation of T2, which is likely to be realistic considering the reference measurements and the range of literature reference values. Even those reference relaxometry methods suffer from inaccuracies caused by inflow or sequence imperfections, in particular when estimating the T1 and T2 of flowing tissue like blood or CSF.41 One could improve the validation by including T1 and T2 mapping sequences in the same resolution and spatial coverage. Common relaxometry sequences in neuroimaging rely on multiple 3D spoiled gradient recalled echo or inversion recovery sequences for T1 mapping and multi-echo or balanced steady-state free precession sequences at variable flip angles for T2 mapping.41,42 The imaging study in this work was already time-consuming due to the fivefold repetition of the lengthy T2w FLAIR protocol and the T1-weighted scan.
Therefore, there was only limited time for a rough dual-echo T2 estimation and for the addition of a time-efficient single-slice T1-mapping protocol (acquisition time ) to examine the T1 estimates in one slice, and thus values were compared ROI-wise. Still, the T1 and T2 values estimated here mostly lie in the range of literature values, and differences in the reference measurements are also comparable to the range of literature values. A one-to-one comparison of real and simulated images is challenging as it requires the exact knowledge of the relaxation times of that particular patient. Precise relaxometry is neither the aim of this work nor is it necessary for the simulation of test data. The relaxometry parameters in Eqs. (1) and (2) are set to arbitrary values to deliver a representative cohort of anatomies. Relaxometry imperfections hamper accurate validation of the simulated values, yet they manifest only in a misestimation of the DICOM scaling factor and thus in under- or overestimation of the texture amplitude. Unfortunately, for MRI sequences this scaling factor is not part of the DICOM header as it is for the Hounsfield units in CT imaging. Irregularities of the texture amplitude, on the other hand, might be balanced by normalizing the texture amplitude over the entire dataset. Furthermore, the texture amplitude could also be included as another influencing factor in the stress test analysis in addition to the sequence parameters, e.g., as a measure of noise or artifact level. In contrast to using other AI-based generative approaches like GANs, VAEs, or diffusion models,16,43–46 the underlying signal equation allows for the generation of arbitrary but distinct shift derivatives from just one dataset.
4.2. Stress Test Results
The stress test results between the two networks differ, either due to their architectures or different data splits used for training and validation.
However, in both cases, the F1(TE, TI) measurements seem to be well described by the quadratic function. The metric varies only smoothly so that cubic terms can be neglected. TE seems to be the most influencing factor for all models, which is in line with the nature of the contrast weighting of the sequence (T2w FLAIR). Furthermore, the lesion F1 values are comparable to that of real data (72%47) at least in or close to the baseline representation. The performance decreases towards the extreme points of the experimental grid (particularly for low TE values), where the lesion-WM contrast decreases. As one can see in Fig. 10 (example training images), the lesion-WM contrast of the training images was generally higher than in the low-TE simulations, which might explain the performance drop towards low TE values. In previous work, using fully simulated data, we showed that the maximum of the response surface plot and its shape are dependent on the contrast distribution of training and test data.36 The stress test result can thus be a measure of model analysis and optimization. One has to bear in mind that these extreme points are mathematical constraints, given by the minimum and maximum combinations of TE and TI of real sequences. The boundary of the experimental grid does not represent the boundary of the typical scan domain. The latter does not necessarily contain the combination of extreme values of both TE and TI at the same time. Those extreme data simulations are thus not part of the training data therefore causing severe drops in the F1 value. The high F1 scores for the two “high-TE corners” (Fig. 9) can also be explained by the high lesion contrast for these protocols. In contrast, the low lesion contrast yielded by low TE and TI values comes with low F1 scores, respectively. Another contribution of this work is thus a proof-of-concept for the description of the performance metric of an AI model in dependence of its influencing factors. 
The modeling yields a quantitative comparison of the relevance of all influencing factors. This concept of surface response modeling is based on well-established experimental designs and could be easily transferred to other common metrics48 (e.g., confusion matrix and derivatives or even uncertainty estimates49) or other models (e.g., classification models). Now that the model function has been confirmed, the number of experiments could be reduced significantly in future studies to reduce the computational effort. For the optimal “positioning” of these sample points on the “domain grid” for meaningful sampling of the surface response curve, state-of-the-art guidelines in the field of experimental design offer several recommendations depending on the number of influencing factors.13
4.3. Limitations
One important limitation is the small number of test datasets used in this study. Thus, the absolute results of the stress tests might not be representative for a larger cohort of patients and lesions. They can only serve as a sample domain grid to confirm an appropriate model function and to demonstrate the proof of concept. Unfortunately, all open MS data are provided in NIfTI format and the OpenMS data are the only data that come at least with the information on TE, TI, and TR and thus all sequence parameters needed in the simulation. In real-world applications, one can assume that manufacturers of models have access to the entire DICOM header that also includes tags for TE, TI, TR, and many more. Thus, in theory, more acquisition shifts caused by other sequence parameters could be incorporated as influencing factors in the stress tests. However, since the number of sampling points on the domain grid quickly rises with every additional influencing factor, a prior prioritization is crucial. An intrinsic limitation of the T2w FLAIR and T1w sequences is that the CSF signal is very low or even nulled, hampering partial volume estimation and relaxometry in this tissue.
Accordingly, the differences between the simulations and measurements become most apparent in CSF compared to the other tissues, limiting the validation of the approach in CSF. Future work should investigate if tissue and relaxometry estimation can be improved by additionally incorporating the contrast of conventional T2w sequences in the first step of the image generation pipeline, as in these images CSF shows up brightly. All three scans (T2w, T2w FLAIR, and the post Gd T1w scan) constitute the “recommended core” in current MS scanning guidelines.10 Another limitation is the assumption that the average texture contribution to the signal is zero. This is not true in the case of artifacts resulting from inhomogeneities of B0, B1, or the receive coil sensitivity profile.50,51 The method is further only applicable to baseline images, of which the contrast can be fully described by the parameters accessible in the DICOM header; e.g., the parameter in Eq. (2) is approximated by , since it is not part of the DICOM header. In the real volunteer scans, the true value for was 30% higher. In these experiments, changing the parameter to the correct value did not have any influence on the outcome of the comparison (due to the long TR value). Still, there might be other measures of contrast manipulation in T2w FLAIR studies that are not accessible by the DICOM tags and that prevent an accurate estimation of the DICOM scaling factor and thus the texture amplitude (e.g., modulated RF pulses to prevent the signal from decaying in long echo trains, acceleration techniques and dedicated -space ordering, particularly common in 3D sequences,25,52–55 blood inflow,56 etc). Future work should elaborate to what extent these influences and their impact can be modeled and incorporated either in the simulation, e.g., by random guesses or in the stress tests represented by additional influencing factors. 
Despite these limitations, the image simulation and stress test methodology presented in this work allows for investigation of the robustness of AI models in response to arbitrary data shifts. Due to the lack of a gold standard, the metrological proof of the F1 response to parameter changes is not possible and absolute predictions about these values remain uncertain. However, influencing parameters in the MR sequence can be compared with each other by the surface model coefficients and, given a tolerated performance drop, “safe” parameter settings can be at least roughly assessed (Fig. 4). Using the simulation algorithm as an alternative augmentation method also allows for introducing a priori knowledge on MR signal variations into the AI-model development process.
Disclosures
The co-authors Mehmet Yigit Avci and Mehmet Yigitsoy are employed by the deepc GmbH, Munich.
Code and Data Availability
The MS data utilized in this study are listed in Table 2. The data policy of the clinical study does not allow free access to the volunteer MRI data. Due to the collaboration agreement with the industrial partner, the code cannot be made available.
Acknowledgments
This project was funded by the Zentrales Innovationsprogramm Mittelstand (ZIM) of the German Federal Ministry for Economic Affairs and Climate Action (BMWK) (Grant No. KK5050201 LB0). This work has further been partly supported by Collaborative Research Centre “Matrix in Vision” funded by German Research Foundation (DFG) (Grant No. CRC-1340).
References
“ACR List of FDA cleared AI medical products,”
https://aicentral.acrdsi.org/ (2022).
D. C. Castro, I. Walker and B. Glocker, “Causality matters in medical imaging,” Nat. Commun. 11(1), 3673 (2020). https://doi.org/10.1038/s41467-020-17478-w
E. Stamoulou et al., “Harmonization strategies in multicenter MRI-based radiomics,” J. Imaging 8(11), 303 (2022). https://doi.org/10.3390/jimaging8110303
H. Guan and M. Liu, “Domain adaptation for medical image analysis: a survey,” IEEE Trans. Biomed. Eng. 69(3), 1173–1185 (2022). https://doi.org/10.1109/TBME.2021.3117407
“Whitepaper for the ITU/WHO Focus Group on Artificial Intelligence for Health,” https://www.itu.int/en/ITU-T/focusgroups/ai4h/Documents/FG-AI4H_Whitepaper.pdf (4 April 2024).
S. Reddy et al., “Evaluation framework to guide implementation of AI systems into healthcare settings,” BMJ Health Care Inf. 28(1), e100444 (2021). https://doi.org/10.1136/bmjhci-2021-100444
L. Oala et al., “Machine learning for health: algorithm auditing & quality control,” J. Med. Syst. 45(12), 105 (2021). https://doi.org/10.1007/s10916-021-01783-y
P. Omoumi et al.,
“To buy or not to buy—evaluating commercial AI solutions in radiology (the ECLAIR guidelines),”
Eur. Radiol., 31
(6), 3786
–3796 https://doi.org/10.1007/s00330-020-07684-x
(2021).
Google Scholar
J. Denck et al.,
“Automated protocoling for MRI exams—challenges and solutions,”
J. Digital Imaging, 35
(5), 1293
–1302 https://doi.org/10.1007/s10278-022-00610-1 JDIMEW
(2022).
Google Scholar
M. P. Wattjes et al.,
“2021 MAGNIMS–CMSC–NAIMS consensus recommendations on the use of MRI in patients with multiple sclerosis,”
Lancet Neurol., 20
(8), 653
–670 https://doi.org/10.1016/S1474-4422(21)00095-8
(2021).
Google Scholar
P. Linardatos, V. Papastefanopoulos and S. Kotsiantis,
“Explainable AI: a review of machine learning interpretability methods,”
Entropy, 23
(1), 18 https://doi.org/10.3390/e23010018 ENTRFG 1099-4300
(2020).
Google Scholar
B. McCrindle et al., “A radiology-focused review of predictive uncertainty for AI interpretability in computer-assisted segmentation,” Radiol. Artif. Intell. 3(6), e210031 (2021). https://doi.org/10.1148/ryai.2021210031
W. F. Guthrie, NIST/SEMATECH e-Handbook of Statistical Methods (NIST Handbook 151), National Institute of Standards and Technology (2020).
L. Arras, A. Osman and W. Samek, “CLEVR-XAI: a benchmark dataset for the ground truth evaluation of neural network explanations,” Inf. Fusion 81, 14–40 (2022). https://doi.org/10.1016/j.inffus.2021.11.008
“Syreal-Synthesizing realistic variations in data for reliable medical machine learning at scale,” https://www.hhi.fraunhofer.de/en/departments/ai/projects/syreal.html
A. F. Frangi, S. A. Tsaftaris and J. L. Prince, “Simulation and synthesis in medical imaging,” IEEE Trans. Med. Imaging 37(3), 673–679 (2018). https://doi.org/10.1109/TMI.2018.2800298
R. K.-S. Kwan, A. C. Evans and G. B. Pike, “MRISIM: Tissue MR parameters,” https://brainweb.bic.mni.mcgill.ca/brainweb/tissue_mr_parameters.txt (9 April 2024).
L. A. Shepp and B. F. Logan, “The Fourier reconstruction of a head section,” IEEE Trans. Nucl. Sci. 21(3), 21–43 (1974). https://doi.org/10.1109/TNS.1974.6499235
Ž. Lesjak et al., “Validation of white-matter lesion change detection methods on a novel publicly available MRI image database,” Neuroinformatics 14(4), 403–420 (2016). https://doi.org/10.1007/s12021-016-9301-1
Ž. Lesjak et al., “A novel public MR image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus,” Neuroinformatics 16(1), 51–63 (2018). https://doi.org/10.1007/s12021-017-9348-7
A. Carass et al., “Longitudinal multiple sclerosis lesion segmentation: resource and challenge,” NeuroImage 148, 77–102 (2017). https://doi.org/10.1016/j.neuroimage.2016.12.064
M. Styner et al., “3D segmentation in the clinic: a grand challenge II: MS lesion segmentation,” MIDAS J. (2008).
O. Commowick et al., “MSSEG-2 challenge proceedings: multiple sclerosis new lesions segmentation challenge using a data management and processing infrastructure,” in MICCAI 2021 - 24th Int. Conf. Med. Image Comput. and Computer Assist. Interv., 126 (2021).
E. Roura et al., “Automated detection of lupus white matter lesions in MRI,” Front. Neuroinf. 10, 33 (2016). https://doi.org/10.3389/fninf.2016.00033
J. N. Rydberg et al., “Contrast optimization of fluid-attenuated inversion recovery (FLAIR) imaging,” Magn. Reson. Med. 34(6), 868–877 (1995). https://doi.org/10.1002/mrm.1910340612
J. Tohka, A. Zijdenbos and A. Evans, “Fast and robust parameter estimation for statistical partial volume models in brain MRI,” NeuroImage 23(1), 84–97 (2004). https://doi.org/10.1016/j.neuroimage.2004.05.007
J. Linn, M. Wiesmann and H. Brückmann, “Infektiöse und entzündlich-demyelinisierende Erkrankungen,” Atlas Klinische Neuroradiologie des Gehirns, 311–384, Springer, Berlin, Heidelberg (2011).
B. Billot et al., “SynthSeg: segmentation of brain MRI scans of any contrast and resolution without retraining,” Med. Image Anal. 86, 102789 (2023). https://doi.org/10.1016/j.media.2023.102789
B. Billot et al., “Partial volume segmentation of brain MRI scans of any resolution and contrast,” Lect. Notes Comput. Sci. 12267, 177–187 (2020). https://doi.org/10.1007/978-3-030-59728-3_18
K. J. Layton et al., “Pulseq: a rapid and hardware-independent pulse sequence prototyping framework,” Magn. Reson. Med. 77(4), 1544–1552 (2017). https://doi.org/10.1002/mrm.26235
M. Newville et al., “lmfit/lmfit-py: 1.2.1,” Zenodo (2023).
R. E. Gabr et al., “Patient-specific 3D FLAIR for enhanced visualization of brain white matter lesions in multiple sclerosis,” J. Magn. Reson. Imaging 46(2), 557–564 (2017). https://doi.org/10.1002/jmri.25557
A. Parry et al., “White matter and lesion T1 relaxation times increase in parallel and correlate with disability in multiple sclerosis,” J. Neurol. 249(9), 1279–1286 (2002). https://doi.org/10.1007/s00415-002-0837-7
G. J. Stanisz et al., “T1, T2 relaxation and magnetization transfer in tissue at 3T,” Magn. Reson. Med. 54(3), 507–512 (2005). https://doi.org/10.1002/mrm.20605
A. Parry et al., “MRI Brain T1 relaxation time changes in MS patients increase over time in both the white matter and the cortex,” J. Neuroimaging 13(3), 234–239 (2003). https://doi.org/10.1111/j.1552-6569.2003.tb00184.x
C. Posselt et al., “Novel concept for systematic testing of AI models for MRI acquisition shifts with simulated data,” Proc. SPIE 12467, 124671B (2023). https://doi.org/10.1117/12.2653883
F. Isensee et al., “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nat. Methods 18(2), 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z
A. Myronenko, “3D MRI brain tumor segmentation using autoencoder regularization,” Lect. Notes Comput. Sci. 11384, 311–320 (2018). https://doi.org/10.1007/978-3-030-11726-9_28
S. M. Smith, “Fast robust automated brain extraction,” Hum. Brain Mapp. 17(3), 143–155 (2002). https://doi.org/10.1002/hbm.10062
“MRISIM: tissue MR parameters,” https://brainweb.bic.mni.mcgill.ca/brainweb/tissue_mr_parameters.txt (1996).
S. C. L. Deoni, “Quantitative relaxometry of the brain,” Top. Magn. Reson. Imaging 21(2), 101–113 (2010). https://doi.org/10.1097/RMR.0b013e31821e56d8
M. Tranfa et al., “Quantitative MRI in multiple sclerosis: from theory to application,” Am. J. Neuroradiol. 43(12), 1688–1695 (2022). https://doi.org/10.3174/ajnr.A7536
S. Kazeminia et al., “GANs for medical image analysis,” Artif. Intell. Med. 109, 101938 (2020). https://doi.org/10.1016/j.artmed.2020.101938
T. Wang et al., “A review on medical imaging synthesis using deep learning and its clinical applications,” J. Appl. Clin. Med. Phys. 22(1), 11–36 (2021). https://doi.org/10.1002/acm2.13121
G. Müller-Franzes et al., “A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis,” Sci. Rep. 13, 12098 (2023). https://doi.org/10.1038/s41598-023-39278-0
L. X. Nguyen et al., “A new chapter for medical image generation: the stable diffusion method,” in Int. Conf. Inf. Networking (ICOIN), 483–486 (2023).
P. Schmidt et al., “Automated segmentation of changes in FLAIR-hyperintense white matter lesions in multiple sclerosis on serial magnetic resonance imaging,” NeuroImage Clin. 23, 101849 (2019). https://doi.org/10.1016/j.nicl.2019.101849
L. Maier-Hein et al., “Metrics reloaded: recommendations for image analysis validation,” Nat. Methods 21, 195–212 (2024). https://doi.org/10.1038/s41592-023-02151-z
T. Nair et al., “Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation,” Med. Image Anal. 59, 101557 (2020). https://doi.org/10.1016/j.media.2019.101557
U. Vovk, F. Pernus and B. Likar, “A review of methods for correction of intensity inhomogeneity in MRI,” IEEE Trans. Med. Imaging 26(3), 405–421 (2007). https://doi.org/10.1109/TMI.2006.891486
O. Dietrich, M. F. Reiser and S. O. Schoenberg, “Artifacts in 3-T MRI: physical background and reduction strategies,” Eur. J. Radiol. 65(1), 29–35 (2008). https://doi.org/10.1016/j.ejrad.2007.11.005
S. J. P. Meara and G. J. Barker, “Evolution of the longitudinal magnetization for pulse sequences using a fast spin-echo readout: application to fluid-attenuated inversion-recovery and double inversion-recovery sequences,” Magn. Reson. Med. 54(1), 241–245 (2005). https://doi.org/10.1002/mrm.20541
R. F. Busse et al., “Fast spin echo sequences with very long echo trains: design of variable refocusing flip angle schedules and generation of clinical T2 contrast,” Magn. Reson. Med. 55(5), 1030–1037 (2006). https://doi.org/10.1002/mrm.20863
R. F. Busse et al., “Effects of refocusing flip angle modulation and view ordering in 3D fast spin echo,” Magn. Reson. Med. 60(3), 640–649 (2008). https://doi.org/10.1002/mrm.21680
J. P. Mugler III, “Optimized three-dimensional fast-spin-echo MRI,” J. Magn. Reson. Imaging 39(4), 745–767 (2014). https://doi.org/10.1002/jmri.24542
S. Naganawa et al., “Comparison of flow artifacts between 2D-FLAIR and 3D-FLAIR sequences at 3 T,” Eur. Radiol. 14(10), 1901–1908 (2004). https://doi.org/10.1007/s00330-004-2372-7
Biography

Christiane Posselt worked as a research assistant on the NeuroTEST project at the University of Applied Sciences in Landshut. Her main focus was the exploration of the simulation and stress test methods used in this work. She holds a master’s degree in electrical engineering from the University of Applied Sciences in Landshut, Germany.

Mehmet Yigit Avci received his bachelor’s degree in electrical and electronics engineering from Bogazici University, Istanbul. He is currently a master’s student at the Technical University of Munich with a specialization in biomedical computing. His research interests are medical imaging and machine learning.

Patrick Schuenke studied physics at the University of Heidelberg in Germany and completed his PhD in physics at the German Cancer Research Center (DKFZ) in 2017. Afterwards, he worked as a postdoctoral researcher at the Leibniz-Forschungsinstitut für Molekulare Pharmakologie in Berlin. In 2020, he moved to the Physikalisch-Technische Bundesanstalt, where he currently focuses on advancing quantitative MRI techniques and developing open-source MRI software.

Christoph Kolbitsch is head of the research group “Quantitative MRI” at the Physikalisch-Technische Bundesanstalt in Berlin, Germany. He received his PhD from King’s College London, working on motion compensation for high-resolution cardiac MRI. His group mainly works on advanced image reconstruction techniques that combine detailed physical models of the imaging process with the flexibility of deep learning. He is also an advocate for open-source image reconstruction software: https://github.com/PTB-MR/mrpro.

Tobias Schaeffter is the head of the division of Medical Physics and Metrological IT at the Physikalisch-Technische Bundesanstalt in Berlin, Germany. He is a professor in biomedical imaging at TU-Berlin and the Einstein Centre Digital Future. He studied electrical engineering at TU-Berlin and completed his PhD at the University of Bremen. From 1996 to 2006, he worked as a principal scientist at the Philips Research Laboratories in Hamburg, Germany. From 2006 to 2015, he was professor of imaging sciences at King’s College London.

Stefanie Remmele has been a professor of medical technologies at the University of Applied Sciences in Landshut, Germany, since 2012. Prior to that, she conducted research on quantitative MR methods at Philips Research in Hamburg, Germany. Her current research interests include image simulation and synthesis in radiology and image-guided therapy, and she is the head of the Research Group of Medical Technologies at the University in Landshut.