Comparison of Phase Estimation Methods for Quantitative Susceptibility Mapping Using a Rotating-Tube Phantom

Quantitative Susceptibility Mapping (QSM) is an MRI tool with the potential to reveal pathological changes from magnetic susceptibility measurements. Before phase data can be used to recover susceptibility (Δχ), the QSM process begins with two steps: data acquisition and phase estimation. We assess the performance of these steps, when applied without user intervention, on several variations of a phantom imaging task. We used a rotating-tube phantom with five tubes ranging from Δχ=0.05 ppm to Δχ=0.336 ppm. MRI data was acquired at nine angles of rotation for four different pulse sequences. The images were processed by 10 phase estimation algorithms including Laplacian, region-growing, branch-cut, temporal unwrapping, and maximum-likelihood methods, resulting in approximately 90 different combinations of data acquisition and phase estimation methods. We analyzed errors between measured and expected phases using the probability mass function and Cumulative Distribution Function. Repeatable acquisition and estimation methods were identified based on the probability of relative phase errors. For single-echo GRE and segmented EPI sequences, a region-growing method was most reliable with Pr (relative error <0.1) = 0.95 and 0.90, respectively. For multiecho sequences, a maximum-likelihood method was most reliable with Pr (relative error <0.1) = 0.97. The most repeatable multiecho methods outperformed the most repeatable single-echo methods. We found a wide range of repeatability and reproducibility for off-the-shelf MRI acquisition and phase estimation approaches, and this variability may prevent the techniques from being widely integrated in clinical workflows. The error was dominated in many cases by spatially discontinuous phase unwrapping errors. Any postprocessing applied on erroneous phase estimates, such as QSM's background field removal and dipole inversion, would suffer from error propagation. Our paradigm identifies methods that yield consistent and accurate phase estimates that would ultimately yield consistent and accurate Δχ estimates.

Repeatability and reproducibility of QSM have been assessed in phantoms and human subjects using different scanners, magnetic field strengths, and data processing methods. While some studies report high repeatability [11][12][13][14][15][16][17][18], both in vivo and in phantoms, recent in vivo studies report lower reproducibility across MRI scanners with the same data processing method [19] and across QSM algorithms using the same input data [20]. ese conflicting results limit the clinical adoption of QSM.
In its standardization efforts, the QSM community has actively evaluated competing methods [17,22,23], in particular for methods in Steps 3 and 4 of the process [20]. However, the selection of the "best" QSM method is difficult for various reasons: (a) the appropriate definition of a quality metric, for example, accuracy versus repeatability; (b) competing quality metrics that favor different algorithms [20]; (c) the lack of a gold standard in vivo; (d) algorithm performance depending on imaging application (in vivo versus phantoms); and (e) the large number of methods for each QSM step, which would render any exhaustive validation effort to be combinatorial and quickly untenable.
We present an experimental setup that allows for an exhaustive quantitative analysis of all four QSM steps. is framework uses a rotating-tube phantom design introduced in the work of Erdevig et al. [24], which uses tubes that rotate, within a background solution, relative to the main magnetic field, B 0. e design enables the analysis of MRI data obtained in objects at any orientation, using common QSM techniques. e closed-form theoretical relationship between the magnetic field and magnetic susceptibility in the sample allows for mapping the magnetic field to magnetic susceptibility without having to solve the dipole inversion problem [25].
Our contributions include (a) a framework for evaluation of repeatability and reproducibility of QSM algorithms and (b) rigorous analysis of common methods in Steps 1 and 2 of the QSM process. We focus on the performance of QSM Steps 1 and 2 for the following reasons: (a) ere exist a large number of methods in each of the four QSM steps. We replicate here approximately 90 different combinations of Steps 1 and 2 methods. If we analyze all four steps simultaneously, the resulting data set would grow combinatorially and become difficult to interpret. (b) Errors introduced early in the QSM process propagate downstream and have been overlooked in validation studies. For example, Olsson et al. used only one acquisition sequence and one phase estimation algorithm [22]. (c) It is well understood that phase unwrapping, a common algorithm used in Step 2, is a nondeterministic polynomial-time-hard problem (in two dimensions and higher) that often relies on user intervention and careful parameter tuning. erefore, it is important to isolate the impact of such a problem in the QSM processing methods. (d) Susceptibility weighted imaging [26,27], electrical properties tomography [28], thermometry [29], flow [30], and elastography [31] use MR phase information, and this work can inform those applications.
We analyze Steps 1 and 2 of the QSM process to understand which are sufficiently robust to be executable without user intervention and independent of scanner, sequence, and other parameter variations.

Methods
We used a rotating-tube phantom (Figure 1(a)) to explore the reproducibility of phase estimates obtained after Steps 1 and 2. MRI data was acquired with different pulse sequences (Step 1), at nine angles, and the performance of a wide variety of phase estimation methods was compared to theory. e rotating-tube phantom was designed to take advantage of the analytical model for a long cylinder at an angle, θ, with respect to B 0 . e internal z-axis field offset can be derived from Maxwell's equations and is shown to be [25] where Δχ is the susceptibility difference between the inside and outside of the cylinder and χ 2 terms are ignored.

Experimental Setup.
e rotating-tube phantom consists of five cylindrical, polypropylene tubes (80 mm length and 10 mm outer diameter; Nalgene cryogenic storage vials) alternated orthogonally along the central axis of a larger cylinder (610 mm length and 140 mm outer diameter) containing water (Figure 1). Each tube contains one of the following: 0.5 mM GdCl 3 , 1.0 mM GdCl 3 , or 3.2 mM CuSO 4 ( Figure 1(b)). is phantom is modifiable, and the range of susceptibility values can be focused on particular areas of interest. Here, the range of susceptibility values was selected to span those observed in vivo from venous blood (1.0 mM GdCl 3 ) to deep grey matter structures (0.5 mM GdCl 3 ) to the lower limits of MRI detection (3.2 mM CuSO 4 ) [32]. A rod extends from the internal rotation gears through the phantom and outside the MRI scanner, allowing the tubes to be manually rotated. Example MR images are shown for the three primary planes (Figure 1(c)) and in a 3D rendering of the tubes (Figure 1(d)). Temperature of the water was continuously monitored via a fiber optic probe (OpSens Medical, Québec, QC, Canada). e paramagnetic salt solution Δχ values were estimated from susceptibility theory and corrected using Curie's law with the experimentally measured temperatures (21.0°C-22.0°C). Δχ values were 0.336 ppm and 0.168 ppm for the 1 mM GdCl 3 and 0.5 mM GdCl 3 , respectively, and 0.0804 ppm for the 3.2 mM CuSO 4 . Additional details on the calculation of Δχ are in the Supplementary Material.

Data
Acquisition. MR data was collected on a 3T Siemens Biograph mMR (MR-PET scanner, Syngo MR E11 software) with a 6-element torso array and 9-element spine array coil for a total of 15 elements. To assess the reproducibility of phase estimation across pulse sequences, we acquired data with four gradient echo (GRE) pulse sequences (details in Table 1): is is a commonly chosen protocol with QSM and other susceptibility-based techniques [33], wherein a singleecho time TE is measured as close to T * 2 of tissue of interest (here, the target is 60 ms for 1 mM GdCl 3 ). is maximizes the phase SNR at this T * 2 . To maximize the magnitude SNR at the chosen TE and TR, we set the readout bandwidth at its lowest possible value.
(ii) Segmented Echo Planar Imaging (sEPI): A recently proposed sEPI sequence was shown to possess similar quality phase images as SEGE [34], while acquiring full brain coverage much faster than SEGE. As with SEGE, phase images were generated at a single TE at the center of the echo train.
(iii) Multiecho GRE (MEGE): is protocol acquires multiple TEs in a single TR. e challenge with this technique is the choice of the echo spacing ΔTE and readout bandwidth BW. A short ΔTE reduces the likelihood of aliasing in the phase domain but introduces noise. A long ΔTE yields phase images with better SNR but suffers from potentially unrecoverable phase-aliasing errors. For example, to unwrap frequency offsets of ±150 Hz, ΔTE must be less than 3.33 ms. A common approach [33,35] is to acquire data with a short ΔTE and, in order to recover SNR efficiency similar to SEGE sequences [36], acquire as many echoes as possible in a TR. However, due to hardware limitations, the readout bandwidth, BW, places a lower limit on ΔTE. We aimed to select the shortest ΔTE possible at the highest BW attainable with the MR system. is choice minimizes the likelihood of phase wrap errors, which may not be recoverable by all phase unwrapping algorithms. We elaborate on this choice further in the Discussion section. (iv) MAGPI: is is an MEGE sequence that uses preoptimized echo times and bandwidths selection that, when paired with a corresponding phase estimation algorithm, yields maximum-likelihood optimal phase estimates in the presence of wrapping, noise, and phase-offset errors [37,38].
All sequences were 3D excitations of a 128 mm × 128 mm × 128 mm slab (64 slices) with sagittal slab-selection and phase encode along the B 0 direction. We used anisotropic voxels to boost SNR, a common practice for QSM and Susceptibility-Weighted Imaging [26,27,39]. MAGPI and the sEPI sequences used alternating gradient polarity, while the other sequences did not. Autoshimming was completed prior to the first data acquisition, and then the same shim parameters were used over time, over all sequences and methods.
We assessed the effect of in-plane resolution and TR on phase estimate reproducibility with each protocol (Table 1). We also examined the reproducibility of phase estimates across nine different angles by advancing the apparatus approximately 18 degrees per turn.

Phase Estimation.
Images generated in Steps 2-3 of QSM are commonly referred to as frequency, phase, or field maps, depending on the units of the data. We interchangeably use these names in this work depending on context, and, in our analysis, we convert all phase images to frequency via a simple scalar multiplication. Ten phase algorithms were selected to estimate the frequency offset image [37,[40][41][42][43][44][45][46][47]. Table 2 lists the methods used for each pulse sequence; multiple codes were downloaded from freely available resources (e.g., MEDI) and integrated with the pipeline.
All phase estimation methods were applied with default parameters in 3D over the entire acquisition volume and were implemented and run blinded to the theoretical solution. Apart from the MAGPI algorithm, which operates on raw k-space channel-uncombined data, all methods (including MAGPI-unopt) operated on unprocessed phase data obtained using the vendor-provided adaptive-coil-combine method. While this is "unprocessed phase data," different vendors may apply different filters, phase corrections, or other adjustments that could influence the phase quality.
Here, adaptive-coil-combine describes the algorithm used by this vendor to combine multichannel coil data [48]. Each unwrapping method used the SNR in the magnitude image to guide unwrapping orientation in the phase domain: for example, the Laplacian-based methods used this SNR to mask the entire image, while others (region-growing, GBC) masked the phase values in regions with poor SNR. SNR was measured as approximately 30 dB in water, similar to in vivo values ranging from 25 to 30 dB.
For MEGE sequences, the multiecho data is processed using the five following categories of algorithms: (1) Spatial phase unwrapping at each echo, followed by temporal combination of the resulting images using a weighted averaging method, that is, phase SNRoptimal [49].
(2) Spatial phase unwrapping at each echo with weighted averaging (as in 1), but a 1D phase unwrapping step is used just before weighted averaging. is is meant to correct any remaining aliasing that spatial unwrapping failed to correct. (3) Direct temporal phase estimation (Slope, Division) applied in complex domain. ese methods correctly unwrap the phase over time, provided the inherent frequency is less than the Nyquist frequency associated with the echo spacing. (4) Temporal phase combination (as in 3), followed by 3D spatial phase unwrapping to correct errors encountered with temporal phase estimation. (5) Maximum-likelihood-based combination of multiecho and multichannel data (MAGPI) [37]. is method solves the phase estimation problem on a voxel-by-voxel basis, without resorting to spatial averaging techniques.
All phase images are eventually converted to frequency offset (Hz) by dividing by 2π x phase evolution time.

Adjusting for Field due to Rotation of Apparatus.
To accurately estimate Δχ from the phase images, we need to remove the global field effects resulting from the tubes rotating in the magnetic field, as well as field effects from the apparatus itself. We call this process "frequency referencing" (FR). e scalar magnitude of the complete field inside a voxel can be written as follows: where r is the spatial coordinate of the voxel and θ is the rotation angle of the tube relative to B, δB Δχ (r, θ) is the field caused by magnetic susceptibility variations (such as the one due to a homogeneous cylindrical object immersed in a homogeneous sphere), and δB 0 (r, θ) is an unknown field offset component. We separate δB 0 (r, θ) into a component that varies only spatially and a component that varies only with the rotation of the apparatus: At a given angle of rotation, sources of spatially varying global offsets δB bkg 0 (r) are field inhomogeneity (imperfections of magnetic field/coils), bulk magnetic susceptibility of the apparatus [50,51], and coil phase offset [52]. As the apparatus is rotated, in the absence of "shimming" at the console, the center frequency will be shifted due to the bulk magnetic susceptibility of the entire apparatus [53,54]. Our goal is to extract Δχ by fitting δB(r, θ) to the angle of rotation θ, inside the tube. δB bkg 0 (r) is a nuisance parameter that can be easily accounted for during the fitting process by allowing for a constant shift to the cosine.
First, we compute an estimate of δB rot 0 (θ) using the average frequency in a static region outside the "Tube + -Sphere" system (Figure2(a)). e average (indicated by < >) is taken over pixels in a region r out distant from local susceptibility effects: where we define C r out to be a variable that is only a function of the referencing region. en, the referencing step consists of thus removing δB rot 0 (θ). Our goal is to use the referenced field, δB ref (r, θ), at every pixel in the tube center, r � r in , to fit the field variation to the angle of rotation θ. e only component that varies with θ is the first term on right side of equation (5). e estimation step is a simple fit with respect to θ, along with an arbitrary shift for the constant: c � δB bkg 0 (r) − C r out . For the case of a cylinder (equation (1)), the estimate can be obtained by solving Fitting c effectively amounts to shifting the midline of the data to match the model (across all angles). We used a bisquare-weighting method to fit this midline. We include a few examples of Δχ estimation using equation (6) in Table S.1. We apply this frequency referencing method after each phase estimation algorithm. To investigate repeatability with respect to the location of frequency reference estimate, we apply this process in 13 different regions selected across static areas of the phantom (Figure 2(a)).

Error Analysis.
We use the theoretically determined values of Δχ to predict the field values at each angle that would have been measured with ideal methods in Steps 1 and 2. We then compute the error (Hz) between measured frequency δB ref (r in , θ) and expected frequency offset: Radiology Research and Practice where c * is obtained by solving equation (6) with Δχ set to Δχ true .
To account for any dependence on tube content, we compute the absolute relative error: where the absolute value is used instead of the signed error due to the irrelevance of sign in this context. Error statistics were computed for each voxel in the tube ROI (mean tube ROI sizes were 17 pixels at 1 mm resolution and 57 pixels at 0.5 mm resolution), for each slice (2 slices per tube), each tube (5 total), each angle (9 total rotations), using each applicable phase estimation method (from a possible 10), each background phase removal ROI (13 total), and each sequence with its respective resolution and TR variations (11 total). Cumulatively, 2.36 million frequency values were analyzed in this experiment. e large number of data points allows us to extract statistics about ε and ε r including their probability mass function, Pr(ε) and Pr(ε r ). e probability of ε can exhibit a multimodal distribution and therefore is not a Gaussian. While we can report the absolute bias and standard deviations from such a distribution, it would not be descriptive of Pr(ε). A more practical measure is the likelihood of observing absolute relative errors less than or equal to a threshold, τ. is is obtained by integrating Pr(ε r ) between 0 and τ a measure known as the Cumulative Distribution Function (CDF),  Radiology Research and Practice e CDF can be used to capture phase errors that are dominated by outliers, as well as phase errors that result from generally poor/unreliable model fitting. e ideal CDF is a step function, and any presence of outliers/large errors yields a CDF with slow convergence to 1. e frequency of large errors is seen from the magnitude of the deviation of the CDF from 1.0 at any given threshold.  Figure 2(b) shows frequency offset images corresponding to different rotations. Figure 2(c) shows a typical field difference between two angles (δB(r, θ 2 ) − δB(r, θ 1 )). According to equation (3), this difference is equal to δB Δχ (r, θ 2 ) − δB Δχ (r, θ 1 ) + δB rot 0 (θ 2 ) − δB rot 0 (θ 1 ), predicting spatial variations only in locations close to areas with susceptibility changes and a constant field in homogeneous locations.

Results
is is precisely what we observe in Figure 2(c). Figure 2(d) shows a plot of the frequency values for a voxel inside the tube prior to frequency referencing (symbol "x"). e resulting plot does not follow the expected sinusoid (solid line). After applying the frequency referencing step, we observe the expected sinusoidal shape (symbol "o"). Figure 3 shows the resulting histogram of frequency errors and resulting CDF. From Figures 3(a) and 3(b), we note the probability of ε for this example exhibits a multimodal distribution. We can see from the CDF in Figure 3(c) that the probability of observing errors less than 10% (F ε r (0.1)) is about 66%. Table 3 summarizes the error statistics for all combination of sequences and algorithms. e first column lists the sequence type, and the second and third columns indicate the name and category of each postprocessing algorithm, respectively. We report the mean and standard deviations of both ε and ε r . We also report F ϵ r (0.1) pooled over all background ROIs, as well as the range (minimum, F ε rW (0.1), and maximum, F ε rB (0.1)) of F ε r (0.1) encountered in those ROIs. ese quantities are computed from data that includes all tubes and angles. A representative subset of these results is selected for more detailed analysis and illustration in Figures 4-6. Figure 4 shows a subset of frequency-offset images for different sequence + algorithm pairs, at 3 of the 9 angles of rotation. is figure illustrates typical challenges with phase estimation methods. For example, in Figure 4(a), we observe phase unwrapping errors in SEGE + GBC, with abrupt jumps across contiguous regions. e corresponding frequency versus angle plot (last column in Figure 4(a)) shows that incorrect frequency referencing in these areas (square in figure) yields occasional mismatch between measurement and predictions at certain angles. SEGE + Laplace demonstrates a smoothly varying frequency map across the FOV; however, the resulting data deviates from the expected theoretical values at almost every angle (Figure 4(b)). MEGE + Slope, a direct temporal phase estimation method (MEGE category 3), exhibits phase wrapping errors when the underlying frequency value is larger than the bandwidth allowable by ΔTE (Figure 4(c)). Placing a frequency referencing ROI in these areas yields incorrect values at the respective angles. Note that the example frequency reference ROI (#9 in Figure 2) is meant to highlight the phase errors or artifacts observed. Figures 4(d) and 4(e) show that the results from MAGPI-unopt and MAGPI are consistent with those predicted from theory. e CDF of ε r collects the errors, such as those observed in Figure 4, over a variety of acquisition and processing parameters. Figure 5 shows F ε r (τ) for all algorithms when ε r is pooled over all voxels, background ROIs, slices, tubes, and sequence variations. is represents an overall summary of algorithm behavior, irrespective of which parameter was used in acquisition and postprocessing. We see that MAGPI attains a nearly ideal CDF, with 0.91 probability of relative errors, ε r , less than 0.1 (F ε r (0.1) � 0.91) and rapidly converges to 1 ( Figure 5). MAGPI and MAGPI-unopt achieve similar CDFs, with MAGPI performing slightly better, as expected. MEDI.RG and GBC phase unwrapping methods, both based on region growing, have similar CDFs, with F ε r (0.1) � 0.69 and F ε r (0.1) � 0.70, respectively. e unprocessed phase images have the most artifacts and, thus, the lowest CDF across all ε r . Figure 5 focuses on the CDF for ε r in [0, 1.0] to highlight the different convergence pattern (distribution/frequency of errors) in that domain. e CDF extends beyond ε r � 1.0 for any occurrence of relative errors greater than 100%.
Next, we explore the behavior of F ε r (τ) as a function of data acquisition strategy. In Figures 6(a)-6(c), we group results by three sequence types: SEGE, sEPI, and MEGE. Since MAGPI is a multiecho sequence, we include MAGPI in the MEGE category. For each CDF curve F ϵ r (τ), ϵ r is pooled across all pixels, background ROIs, slices, tubes, and variations of TR and resolution within that sequence type. We also explore variability of F ϵ r (τ) with the frequency referencing method. For each sequence, we show CDFs in the ROI with the maximum F ϵ r (τ � 0.1) (F ϵ rB , Figures 6(d)-6(f )) and the ROI with the minimum F ϵ r (τ � 0.1) (F ϵ rW , Figures 6(g)-6(i)) for a given data set. e separation between F ϵ rB and F ϵ rW demonstrates the robustness (or lack thereof ) of a method to frequency reference ROI selection. Table 4 shows the dependence of the CDF on scan variability and tube contents. Because the performance of some methods is dominated by frequency reference ROI (seen in Table 3 and Figure 6), Table 4 shows results for the largest F ϵ r (10%) for a given sequence + method pair observed over all frequency reference ROIs, for every method, sequence, scan variation, tube, and slice.

Discussion
A reliable pulse sequence protocol, with repeatable and reproducible phase estimation, is a necessary step to develop robust QSM methods for clinical use. Previous work examined the reproducibility of certain QSM methods using phantoms [22,25,55], simulation [22,25,56], and human subjects [19,56]. We used a rotating-tube phantom to Radiology Research and Practice Table 3: Error statistics for each of the estimation methods and acquisition protocols. e mean and standard deviation of error measurements ε (Hz), as well as the absolute relative errors ε r , are provided in columns 4-7. Columns 8-9 show F ϵ r (0.1) of the overall data, as well as the range (min and max) of F ϵ r (0.1) across reference backgrounds.

Protocol
Postprocessing   Radiology Research and Practice quantitatively evaluate methods used for data acquisition and phase estimation. is is not the first study using long tubes nor is it the first to position tubes relative to B 0 ; however, compared to previous work [22,24,25,55,[57][58][59], more acquisition and phase estimation methods were considered. Specifically, we used ∼90 different combinations of pulse sequences and phase estimation methods to analyze millions of measurements from different ROIs, tube contents, rotations, and sequence parameters. is vast amount of data ultimately allowed us to estimate the probability distribution of phase error with every QSM method, along with other important statistics. Additional phase estimation methods could be retrospectively used on the data set, which we aim to make publicly available.
Our results showed varying degrees of accuracy and precision over all tested methods. For example, while the majority of methods resulted in μ ϵ less than 1 Hz (note the particularly small μ ϵ with MEDI.LP, Laplace, MEGE + MAGPI-unopt, and MAGPI), the only methods with μ ϵ r <10% are SEGE + MEDI.RG (7.6%), MEGE + MAGPI-unopt (5.3%), and MAGPI (5.0%). We observed a similar trend with precision, whereby methods  Figure 5: CDF F ε r of the absolute relative error obtained for each of the phase estimation methods studied in this work, when the error is pooled over all sequences, slices, tubes, angles, and frequency referencing ROIs. Note that some methods do not converge to probability of 1, due to the presence of errors greater than 100%.     Table 4: Probability of relative errors <0.1 in the background ROI that maximizes F ϵ r (0.1) across sequences, scan variability (ID is as listed in Table 1), and phantom tube components. e tube contents are 1.0 mM GdCl 3 in tubes 1 and 2, 0.5 mM GdCl 3 in tubes 3 and 4, and 3.2 mM CuSO 4 in tube 5. e last column is the standard deviation of all F ϵ rB (0.1) for the given method.

Method
Sequence  Figure 6: e CDFs for each acquisition and phase estimation method pair. Each column corresponds to a different type of acquisition: single-echo acquisitions (SEGE) are shown in the first column, accelerated acquisitions with sEPI in the second, and multiecho acquisitions (MEGE and MAGPI) in the third. e first row (a-c) shows the CDF when all the error data is pooled, over all voxels, slices, tubes, and background ROIs. Given the large variability in performance for different background ROIs, we show in the second (d-f ) and third (g-i) rows the CDFs obtained in the ROIs with the maximum and minimum F ϵ r (0.1), respectively. A highly reproducible method will have a similar curve shape across all plots.
with the lowest relative standard deviation σ ϵ r were MEGE + MEDI.LP (8.9%), MEGE + MAGPI-unopt (4.8%), and MAGPI (4.1%). e detailed behavior of the error is captured by the CDF (or PDF) of the data ( Figure 5). A summary of the CDF is in the second-to-last column of Table 3 where we show the probability of observing relative errors <10%, F ϵ r (10%), which captures the frequency by which relatively acceptable errors occur. An advantage of the Laplace-based methods is that they had smooth phase maps with qualitatively no apparent phase jumps. However, analysis showed that Laplace phase images result in quantitatively larger errors (a low F ϵ r (10%)) than other methods, suggesting incorrect phase unwrapping results, similar to the work of Chen et al. [60]. Other methods with low F ϵ r (10%) were the unprocessed phase data, MEGE + Slope/Div, and MEGE + MEDI.RG, which had large phase unwrapping errors in a significant proportion of the data. F ϵ r (10%) is an arbitrary point at which we highlight the behavior of F ϵ r and does not represent the entirety of the distribution of error (or CDF). For example, F ϵ r (10%) of SEGE + MEDI.RG was comparable to MEGE + MAGPI-unopt and MAGPI, despite the comparatively poorer (larger) μ ϵ r and σ ϵ r of SEGE + MEDI.RG.
is is due to relative errors falling mostly within the chosen 10% threshold for these methods.
We explored the dependence of errors on frequency reference ROI location ( Figure 6). Since phase estimation errors (particularly large errors) are undesirable anywhere in the FOV, any spatial variation of the CDF highlights the potential dependence of the method on user intervention and/or its automated processing. We show the range of F ϵ r (10%) observed across the 13 different frequency reference ROIs in the last column of Table 3. e results suggest that the most repeatable methods across background ROIs are SEGE + MEDI.RG, sEPI + MEDI.RG, MEGE + MAGPIunopt, and MAGPI. e MEGE data was processed using four broad classes of postprocessing algorithms. We note the following about these algorithms: (i) e results obtained with methods in Categories 1 and 2 were fundamentally similar. at is, additional 1D-temporal processing does not alter the performance of 3D spatial unwrapping methods (with the exception of Laplacian-based methods, which we discuss below). is redundancy is due to the inherent Nyquist limitation associated with the echo spacing. As a result, we focus on the distinctive results of Category 1: spatial phase unwrapping + weighted averaging of echoes (Table 3). (ii) Some postprocessing methods used in MEGE Categories 1 and 2 performed more poorly with MEGE than with SEGE (e.g., MEDI.RG). We believe this is due to the relatively larger bandwidth used with MEGE acquisitions, resulting in noisier images at each echo. Higher BW acquisitions were needed with MEGE to accommodate temporal methods (MEGE Categories 3 and 4), which require short echo spacing. It is possible that MEGE + MEDI.RG would perform better with lower BW (wider echo spacing). Due to time/complexity constraints, we were unable to explore every possible MEGE variation that favors specific algorithms. is is a limitation of this study. (iii) MEGE Category 3 methods (Division/Slope) were straightforward to apply but resulted in a wide range of errors. is is due to large errors observed in frequency reference ROIs where the underlying frequency-offset value is larger than what is allowable by the smallest echo spacing. While such errors are avoidable with shorter echo spacing, this is not always possible (as was the case here) due to hardware constraints on readout bandwidth, resolution, FOV, and so forth. While Slope and Division are straightforward to apply, they result in a suboptimal combination of echoes, with noisy phase estimates. (iv) MEGE Category 4 methods have a similar performance to that of Category 3 methods. at is, spatial phase unwrapping did not seem to markedly improve the performance of temporal phase unwrapping. is is potentially due to the hard-tounwrap noisy boundary lines observed with Category 3 methods, as shown in MEGE + Slope example in Figure 4, which are still present with Category 4 methods. Finally, we explored the dependence of the CDF on scan variability and tube contents (Table 4), and we note the following: (1) Not all sequence + method pairs were invariant to scan and/or tube content. sEPI + Phun and sEPI + Unprocessed had greatest variability, followed by all of the Laplace-based methods (LP, MEDI.LP), indicating that some methods produced inconsistent phase even in their best-case scenario. Laplace-based techniques generated smoothly varying, though numerically inaccurate phase maps, irrespective of sequence, and sEPI + Phun suffered from sequencedependent errors. (2) Considering the best-case ROI scenario, we did not observe consistent performance differences between pairing methods with either SEGE or sEPI. e sEPI sequence provides a significant acceleration over SEGE via its segmented GRE approach and performed better with some methods (GBC) and worse with others (Phun).  Overall, the consistently good performance across tube content, scan, and rotation angle in the absence of phase estimation errors validates the Δχ model (equation (1)) and demonstrates that the rotating-tube phantom itself did not introduce unexpected detrimental effects to the phase measurements. In previous work, QSM performance improved with higher isotropic spatial resolution and higher coverage [56,61]. Here, slice thickness was 2.0 mm for all scans, which may explain why performance did not drastically change with resolution. Additionally, Zhou et al. [61] and Karsa et al. [56] examined the entire QSM process, including inversion, which was not addressed here. Olsson et al. [22] used a phantom with tubes of Gd in comparable concentrations, though only one tube was used when varying angle with respect to B 0 (five angles). at study used one method to estimate QSM [62][63][64] over multiple spatial resolutions, volumes, and inversion parameters. Similar to the work of Karsa et al. [56], Olsson et al.'s results improved with increasing resolution and volume coverage, and, similar to our results, Olsson et al. observed errors in phase estimation using [47], compared to the theoretical result.
A limitation of this work is that the geometric structure of the phantom was not identical to that encountered in vivo. In vivo imaging may present different sources of phase errors not included here (e.g., eddy currents, susceptibility-induced signal drops), and the phantom could present some challenges that are not encountered or are less prominent in the brain. is is a common problem with nearly all phantom studies, and it is counterbalanced by the advantage of having a known truth, which is not possible in vivo. Errors observed in phantoms are frequently observed in vivo, even when the geometry of the phantom is a gross simplification of human anatomy. For example, in T 1 estimation, while phantoms can be used to refine a method, errors are amplified when methods are applied in vivo [65]. e many parameters, sequences, and processing steps considered here are useful to evaluate the robustness of phase estimation techniques, and this work can be viewed as a complement to other efforts seeking to evaluate the accuracy of QSM techniques, such as the use of simulated data [66,67].
We introduced a wide range of variability to test repeatability and reproducibility of many data acquisition scenarios. While the performance of each method could be improved with additional "intervention" and potentially adapting the acquisition parameters to the intended postprocessing methods to be used later, our intent was to assess the ability of existing techniques in diverse imaging scenarios encountered in reality. e degree to which Steps 3 and 4 of QSM are sensitive to the errors introduced in Steps 1 and 2 requires further investigation. is phantom validation study allowed us to set a quantitative limit on the performance of various Steps 1 and 2 methods. Ongoing work focuses on evaluating the performance of a subset of these methods, paired with Steps 3 and 4 methods.

Conclusion
In this work, we used a rotating-tube phantom to explore sources of error in QSM data acquisition and phase estimation. To assess the robustness and repeatability of methods, we did not manually intervene. e two most impactful parameters on reproducibility of measurements were (a) acquisition protocol (e.g., single echo or multiple echoes) and (b) phase errors.
e most repeatable and reproducible approaches were MAGPI and MAGPI-unopt, both methods based on the maximumlikelihood approach in phase estimation. For the remaining methods, performance varied greatly, even when systematically applied to the same underlying data from the same sequence or with the same method across different sequences.
is assessment of which methods are repeatable and reproducible without manual intervention is an important step towards using QSM pipelines in clinical settings without experienced users.