A Semi-Automated Usability Evaluation Framework for Interactive Image Segmentation Systems

For complex segmentation tasks, the achievable accuracy of fully automated systems is inherently limited. Specifically, when a precise segmentation result is desired for a small number of given data sets, semi-automatic methods exhibit a clear benefit for the user. The optimization of human-computer interaction (HCI) is an essential part of interactive image segmentation. Nevertheless, publications introducing novel interactive segmentation systems (ISS) often lack an objective comparison of HCI aspects. It is demonstrated that even when the underlying segmentation algorithm is the same throughout interactive prototypes, their user experience may vary substantially. The results indicate that users prefer simple interfaces as well as a considerable degree of freedom to control each iterative step of the segmentation. In this article, an objective method for the comparison of ISS is proposed, based on extensive user studies. A summative qualitative content analysis is conducted via abstraction of visual and verbal feedback given by the participants. A direct assessment of the segmentation system is executed by the users via the system usability scale (SUS) and AttrakDiff-2 questionnaires. Furthermore, an approximation of the findings regarding usability aspects in those studies is introduced, derived solely from the system-measurable user actions during their usage of interactive segmentation prototypes. The prediction of all questionnaire results has an average relative error of 8.9%, which is close to the expected precision of the questionnaire results themselves. This automated evaluation scheme may significantly reduce the resources necessary to investigate each variation of a prototype's user interface (UI) features and segmentation methodologies.


Introduction
To the best of our knowledge, there is no publication in which user-based scribbles are combined with standardized questionnaires in order to assess an interactive image segmentation system's quality. This type of synergetic usability measure is a contribution of this work. In order to provide a guideline for an objective comparison of interactive image segmentation approaches, a prototype providing a semi-manual pictorial user input, introduced in Section 2.2.1, is compared to a prototype with a guiding menu-driven UI, described in Section 2.2.2. Both evaluation results are analyzed with respect to a joint prototype, defined in Section 2.2.3, incorporating aspects of both interface techniques. All three prototypes are built utilizing modern web technologies. An evaluation of the interactive prototypes is performed utilizing pragmatic usability aspects described in Section 4.2, as well as hedonic usability aspects analyzed in Section 4.3. These aspects are evaluated via two standardized questionnaires (System Usability Scale and AttrakDiff-2) which form the ground truth for a subsequent prediction of the questionnaires' findings via a regression analysis outlined in Section 3.3. The outcome of the questionnaire result prediction from interaction log data only is detailed in Section 4.4. This novel automatic assessment of pragmatic as well as hedonic usability aspects is a contribution of this work. Our source code release for the automatic usability evaluation from user interaction log data can be found at https://github.com/mamrehn/interactive_image_segmentation_evaluation.
International Journal of Biomedical Imaging
1.1. Image Segmentation Systems. Image segmentation can be defined as the partitioning of an image into a finite number of semantically non-overlapping regions. A semantic label can be assigned to each region. In medical imaging, each individual region of a patient's abdominal tissue might be regarded as healthy or cancerous. Segmentation systems can be grouped into three principal categories, each differing in the degree of involvement of an operating person (user): manual, automatic, and interactive.
(1) During manual tumor segmentation, a user provides all elements $e$ in the image grid which have neighboring elements $N(e)$ of different labels than $e$. The system then utilizes this closed curve contour line information to infer the labels for the remaining image elements via simple region growing. This minimal assistance by the system causes the overall segmentation process of one lesion to take up to several minutes of user interaction time. However, reaching an appropriate or even perfect segmentation result (despite noteworthy interobserver difference [1]) is feasible [2,3]. In practice, a few time-consuming manual segmentations are performed by domain experts in order to utilize the results as a reference standard in radiotherapy planning [4].
(2) A fully automated approach does not involve a user's interference with the system. The resulting lack of domain knowledge for accurately labeling regions can only partially be compensated for by automated segmentation approaches. The maximum accuracy of the segmentation result is therefore highly dependent on the individual set of rules or amount of training data available. If the segmentation task is sufficiently complex, a perfect result may not be reachable.
(3) Interactive approaches aim at a fast and exact segmentation by combining substantial assistance by the system with knowledge about a very good estimate of the true tumor extent provided by trained physicians during the segmentation process [5]. In contrast to fully automated solutions, prior knowledge is (also) provided during the segmentation process. Although interactive approaches are also costly in terms of manual labor to some extent, they can supersede fully automated techniques in terms of accuracy. Due to their exact segmentation capabilities, interactive segmentation techniques are frequently chosen to outline pathologies during imaging-assisted medical procedures, like hepatocellular carcinomata during trans-catheter arterial chemoembolization (see Section 1.6).

Evaluation of Image Segmentation Systems.
Performance evaluation is one of the most important aspects during the continuous improvement of systems and methodologies. With non-interactive computer vision and machine learning systems for image segmentation, an objective comparison of systems can be achieved by evaluating pre-selected data sets for training and testing. Similarity measures between segmentation outcome and ground truth images are utilized to quantify the quality of the segmentation result.
With interactive segmentation systems (ISS), a complete ground truth data set would also consist of the adaptive user interactions which advance the segmentation process. Therefore, when comparing ISS, the user needs to be involved in the evaluation process. User interaction data, however, is highly dependent on (1) the users' domain knowledge and the unique learning effect of the human throughout a period of exposure to the problem domain, (2) the system's underlying segmentation method and the users' preferences towards this technique, and (3) the design and usability (the user experience [6,7]) of the interface which is presented to the user during the interactive segmentation procedure [3,8]. This includes users' differing preferences towards diverse interaction systems and tolerances for unexpected system behavior. Considering (1)-(3), an analytically expressed objective function for an interactive system is hard to define. Intuitively, the user wants to achieve a satisfying result in a short amount of time with ease [9]. A direct assessment of a system's usability is enabled via standardized questionnaires, as described in Section 2.3. Individual usage of ISS can be evaluated via the segmentation result's similarity to the ground truth labeling according to the Sørensen-Dice coefficient (Dice) [10] after each interaction. The interaction data utilized for these segmentations has to be representative in order to generalize the evaluation results.
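The per-interaction Dice evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the nested-list mask format, and the convention that two empty masks count as a perfect match are assumptions made for this example.

```python
def dice_coefficient(segmentation, ground_truth):
    """Sørensen-Dice coefficient of two binary masks (nested lists of 0/1)."""
    seg = [bool(v) for row in segmentation for v in row]
    ref = [bool(v) for row in ground_truth for v in row]
    if len(seg) != len(ref):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(a and b for a, b in zip(seg, ref))
    total = sum(seg) + sum(ref)
    # Assumed convention: two empty masks are treated as identical.
    return 1.0 if total == 0 else 2.0 * intersection / total


def dice_per_interaction(masks, ground_truth):
    """Dice score after each interaction, given one mask per interaction."""
    return [dice_coefficient(mask, ground_truth) for mask in masks]
```

Plotting the list returned by `dice_per_interaction` against the interaction index gives the accuracy-over-interactions curve of a single user session.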
1.3. Types of User Interaction. As described by Olabarriaga et al. [11] as well as Zhao and Xie [12], user interactions can be categorized with regard to the type of interface an ISS provides. The following categories are emphasized.
(1) A pictorial mask image is the most intuitive form of user input. Humans use this technique when transferring knowledge via a visual medium [13]. The mask overlaid on the visualization of the image $I \in \mathbb{R}^{w,h}$ to segment consists of structures called scribbles, where $w$ is the width and $h$ is the height of the 2-D image $I$ in pixels. Scribbles are seed points, lines, and complex shapes, each represented as a set of individual seed points. One seed point is a tuple $s_i = (p_i, \ell_i)$, where $p_i \in \mathbb{R}^2$ describes the position of the seed in image space. The class label of a scribble in a binary segmentation system is represented by $\ell_i \in \{\text{background}, \text{foreground}\}$. Scribbles need to be defined by the user in order to act as a representative subset $S$ of the ground truth segmentation $G = \{s_1, s_2, \dots\}$.
(2) A menu-driven user input scheme as in [14,15] limits the user's scope of action. Users trade distinct control over the segmentation outcome for more guidance provided by the system. The locations or the shapes of newly created scribbles are fixed before presentation to the user. It is challenging to achieve an exact segmentation result using a method from this category. Rupprecht et al. [14] describe significant deficits in finding small objects and outline a tendency of the system to automatically choose seed point locations near the object border, which cannot be labeled by most users' visual inspection and would therefore not have been selected by the users themselves. Advantages of menu-driven user input are the high level of abstraction of the process, enabling efficient guidance for inexperienced users in their decision which action to perform for an optimal segmentation outcome (regarding accuracy over time or number of interactions) [11,16].
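To make the seed point notation of the pictorial input scheme concrete, the following sketch represents a seed as a (position, label) pair and a scribble as a set of seeds. The class name `Seed`, the helper names, and the simple line rasterization are hypothetical choices for illustration; they are not part of the prototypes described in this article.

```python
from dataclasses import dataclass
from typing import Set, Tuple

LABELS = ("background", "foreground")


@dataclass(frozen=True)
class Seed:
    """One seed point: a position (x, y) in image space and a class label."""
    position: Tuple[int, int]
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError("label must be 'background' or 'foreground'")


def line_scribble(start, end, label) -> Set[Seed]:
    """Rasterize a straight brush stroke into a set of seed points."""
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return {Seed((x0, y0), label)}
    return {
        Seed((round(x0 + (x1 - x0) * k / steps),
              round(y0 + (y1 - y0) * k / steps)), label)
        for k in range(steps + 1)
    }


def agrees_with_ground_truth(seeds, mask) -> bool:
    """True if every seed label matches the binary ground truth mask,
    i.e. the scribbles form a subset of the ground truth labeling.
    mask[y][x] == 1 is assumed to denote foreground."""
    return all(
        (seed.label == "foreground") == bool(mask[seed.position[1]][seed.position[0]])
        for seed in seeds
    )
```

Because `Seed` is frozen and hashable, the union of scribbles collected over several interactions can be kept as a plain Python `set`.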

Generation of Representative User Input.
Nickisch et al. [17] describe crowd sourcing and user studies as two methods to generate plausible user input data. The cost-efficient crowd sourcing method often lacks control and knowledge of the users' motivation. Missing context information for crucial aspects of the data acquisition procedure makes it challenging to objectify the evaluation results. Specialized fraud detection methods are commonly used in an attempt to prefilter the recorded corpus and extract a usable subset of data. McGuinness and O'Connor [18] proposed an evaluation of ISS via extensive user experiments. In these experiments, users are shown images with descriptions of the objects they are required to extract. Then, users mark foreground and background pixels utilizing a platform designed for this purpose. These acquisitions are more time-consuming and cost-intensive than crowd sourcing, since they require a constant involvement of users. However, the study's creators are able to control many aspects of the data recording process, which enables detailed observations of user reactions. The data samples recorded are a representative subset of the focus group of the finalized system. A user study aims at maximizing the repeatability of its results. In order to increase the objectivity of the evaluation in this work, a user study is conducted. The study is described in Section 3.2.

State-of-the-Art Evaluation of Interactive Segmentation Systems
1.5.1. Segmentation Challenges. In segmentation challenges like SLIVER07 [19], (mainly) fully automated approaches are competing for the highest score regarding a predefined image quality metric. Semi-automatic methods are allowed for submission if the manual interaction with the test data is strictly limited to pre-processing and (single seed point) initialization of an otherwise fully automated process. ISS may be included into the contests' final ranking, but are regarded as non-competing, since the structure of the challenges is solely designed for automated approaches. The PROMISE12 challenge [20] had a separate category for proposed interactive approaches, where the user (in this case, the person also describing the algorithm) may add an unlimited number of hints during segmentation, without observing the experts' ground truth for the test set. No group of experts was provided to operate the interactive method for comparative results. The submitted interactive methods' scores in the challenge's ranking are therefore highly dependent on the domain knowledge of single operating users and cannot be regarded as an objective measure.

Comparisons for Novel Segmentation Approaches.
In principle, with every new proposal of an interactive segmentation algorithm or interface, the authors have to demonstrate the new method's capabilities in an objective comparison with already established techniques. The effort spent on these comparisons by the original authors varies substantially. According to [9], many evaluation methods only consider a fixed input. This approach is especially unsuited for evaluation if no appropriate interface is defined at the same time that validates that a real person utilizing this UI is capable of generating input patterns similar to the ones provided. Although there are some overview publications which compare several approaches [11,18,21-23], the number of publications outlining new methods is disproportionately greater, leaving comparisons insufficiently covered. The main contribution of Olabarriaga et al. [11] is the proposition of criteria to evaluate interactive segmentation methods: accuracy, repeatability, and efficiency. McGuinness et al. [18] utilized a unified user interface with multiple underlying segmentation methods for the survey they conducted. They recorded the current segmentation masks after each interaction to gauge segmentation accuracy over time. Instead of utilizing a standardized questionnaire, users were asked to rate the difficulty and perceived accuracy of the segmentation tasks on a scale of 1 to 5. Their main contribution is an empirical study in which 20 subjects segmented images with four different segmentation methods, concluding that one of the four methods is best, given their data and participants. Their ranking is primarily based on the mean accuracy over time achieved per segmentation method. McGuinness et al. [22] define a robot user in order to simulate user interactions during an automated interactive segmentation system evaluation. However, they do not investigate the similarity of their rule-based robot user to the seed input patterns of individual human subjects.
Zhao et al. [21] concluded in their overview of interactive medical image segmentation techniques that there is a clear need for well-defined performance evaluation protocols for interactive systems.
In Table 1, a clustering of popular publications describing novel interactive segmentation techniques is depicted. The evaluation methods can be compared by the type of data utilized as user input. Note that there is a trend towards more elaborate evaluations in more recent publications. The intent and perception of the interacting user are a valuable resource worth considering when comparing interactive segmentation systems [24]. However, only two of the 42 related publications listed in Table 1 make use of the insights about complex thought processes of a human utilizing an interactive segmentation system for the ranking of novel interactive segmentation methods. Ramkumar et al. [25,26] acquire these data by well-designed questionnaires, but do not automate their evaluation method. We propose an automated, i.e. scalable, system to approximate pragmatic as well as hedonic usability aspects of a given interactive segmentation system.

Clinical Application for Interactive Segmentation.
Hepatocellular carcinoma (HCC) is among the most prevalent malignant tumors worldwide [63,64]. Only 20-30% of cases are curable via surgery. Both a patient's HCC and hepatic cirrhosis in advanced stages may lead to the necessity of alternative treatment methods. For these inoperable cases, trans-catheter arterial chemoembolization (TACE) [65] is a promising and widely used minimally invasive intervention technique [66,67]. During TACE, extra-hepatic collateral vessels are occluded, which previously supplied the HCC with oxygenated blood. To locate these vessels, it is crucial to find the exact shape as well as the position of the tumor inside the liver. Interventional radiology is utilized to generate a volumetric cone-beam C-arm computed tomography (CBCT) [68] image of the patient's abdomen, which is processed to precisely outline and label the lesion. The toxicity of TACE decreases, the less healthy tissue is falsely labeled as cancerous. The efficacy of the therapy increases, the less cancerous tissue is falsely labeled as healthy [69]. However, precisely outlining the tumor is challenging, especially due to its variations in size and shape, as well as a high diversity in X-ray attenuation coefficient values representing the lesion as illustrated in Figure 1. While fully automated systems may yield insufficiently accurate segmentation results, ISS tend to be well suited for an application during TACE.

Table 1: Overview of seed point location selection methods for a set of influential publications in the field of interactive image segmentation. Additional unordered seed information can be retrieved in arbitrary order by (a) manually drawn seeds or (b) randomly generated seeds. Seeds can be inferred rule-based from the ground truth segmentation by (c) sampling the binary mask image, (d) from provided bounding box mask images, (e) random sampling from tri-maps generated by erosion and dilation, or (f) by a robot user, i.e. user simulation. A tri-map specifies background, foreground, and mixed areas. Seeds can also be provided by real users via the (g) final seed masks after all interactions on one input image or (h) the ordered iterative scribbles. (i) Questionnaire data from Goals, Operators, Methods, and Selection rules (GO) as well as National Aeronautics and Space Administration Task Load Index (TL) may be retrieved by interviewing users after the segmentation process. Check marks indicate the usage of seeds in the publications listed. Publications with check marks in brackets display these seeds but do not utilize them for evaluation.

Methods
In this section, the segmentation method underlying the user interface prototypes is described first (Section 2.1), so that the different characteristics of each novel interface prototype can subsequently be outlined (Section 2.2). The usability evaluation methods utilized are detailed regarding questionnaires in Section 2.3, semi-structured feedback in Section 2.4, and the test environment in Section 2.5.

Segmentation Method.
GrowCut [59] is a seeded image segmentation algorithm based on cellular automaton theory. The automaton is a tuple $(G_I, Q, \delta)$, where $G_I$ is the data the automaton operates on. In this case $G_I$ is the graph of image $I$, where the pixels/voxels act as nodes $k_e$. The nodes are connected by edges on a grid defined by the Moore neighborhood system. $Q$ defines the automaton's possible states and $\delta$ the state transition function utilized.
As detailed in Equation (1), $Q$ is the set of each node's state, where $p_e$ is the node's position in image space and $\ell_e^t$ is the class label of node $k_e$ at GrowCut iteration $t$. $0 \le \Theta_e^t \le 1$ is the strength of $k_e$ at iteration $t$. The feature vector $c_e$ describes the node's characteristics. The pixel value $I(p_e)$ at image location $p_e$ is typically utilized as feature vector $c_e$ [59]. Here, we additionally define $h_e^t \in \mathbb{N}_0$ as a counter for accumulated label changes of $k_e$ during the GrowCut iteration, as described in [31], with $h_e^{t=0} = 0$. Note that this extension of GrowCut is later utilized for seed location suggestion in two of the three prototypes tested. A node's strength $\Theta_e^{t=0}$ is initialized with 1 for scribbles, i.e. $(p_e, \ell_e^{t=0}) \in S^{t=0}$, and 0 otherwise.
Iterations are performed utilizing local state transition rule $\delta$: starting from initial seeds, labels are propagated based on local intensity features $c$. At each discrete time step $t$, each node attempts to conquer its direct neighbors. A node $k_e$ is conquered if the condition in Equation (2) is true.
If node $k_e$ is conquered, the automaton's state set is updated according to Equation (4). If $k_e$ is not conquered, the node's state remains unchanged, i.e. its state in $Q^{t+1}$ equals its state in $Q^t$. The process is guaranteed to converge with positive and bounded node strengths ($\forall e, t: \Theta_e^t \le 1$) monotonically decreasing (since $g(\cdot) \le 1$). The image's final segmentation mask after convergence is encoded as part of state $Q^{t=\infty}$, specifically in $(p_e, \ell_e^{t=\infty})$ for each node $k_e$.
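The GrowCut iteration with the additional label change counter can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in the prototypes: it uses the conquer rule of the original GrowCut publication [59] (an attack succeeds if the attenuated attacker strength exceeds the defender strength), a linear attenuation function scaled by the intensity range of the image, a synchronous update in which the strongest attacker wins, and a counter that only registers changes between two assigned labels. The function name `growcut` and the seed dictionary format are hypothetical.

```python
def growcut(image, seeds, max_iterations=1000):
    """Simplified 2-D GrowCut on a nested list of gray values.

    seeds maps (row, col) to a label (0 = background, 1 = foreground).
    Returns (labels, strengths, changes); changes counts label flips per node.
    """
    rows, cols = len(image), len(image[0])
    flat = [v for row in image for v in row]
    value_range = (max(flat) - min(flat)) or 1.0  # avoid division by zero
    labels = [[seeds.get((r, c)) for c in range(cols)] for r in range(rows)]
    strengths = [[1.0 if (r, c) in seeds else 0.0 for c in range(cols)]
                 for r in range(rows)]
    changes = [[0] * cols for _ in range(rows)]
    # Moore neighborhood: the eight direct neighbors of a node.
    moore = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

    for _ in range(max_iterations):
        new_labels = [row[:] for row in labels]
        new_strengths = [row[:] for row in strengths]
        converged = True
        for r in range(rows):
            for c in range(cols):
                for dr, dc in moore:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    # Attenuation in [0, 1]: similar intensities attack strongly.
                    g = 1.0 - abs(image[r][c] - image[nr][nc]) / value_range
                    attack = g * strengths[nr][nc]
                    if attack > new_strengths[r][c]:  # node (r, c) is conquered
                        new_strengths[r][c] = attack
                        new_labels[r][c] = labels[nr][nc]
                        converged = False
                # Assumption: the first assignment of a label is not a change.
                if labels[r][c] is not None and new_labels[r][c] != labels[r][c]:
                    changes[r][c] += 1
        labels, strengths = new_labels, new_strengths
        if converged:
            break
    return labels, strengths, changes
```

Seed nodes start with strength 1 and can never be conquered in this sketch, since no attack can exceed a strength of 1.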

Interactive Segmentation Prototypes.
Three interactive segmentation prototypes with different UIs were implemented for usability testing. The segmentation technique applied in all prototypes is based on the GrowCut approach as described in Section 2.1. GrowCut allows for efficient and parallelizable computation of image segmentations while providing an acceptable accuracy from only a few initial seed points. The method is also chosen due to its tendency to benefit from careful placement of large quantities of seed points. It is therefore well suited for an integration into a highly interactive system. A learning-based segmentation system was not utilized for usability testing due to its inherent dependence of segmentation quality on the characteristics of prior training data, which potentially adds a significant bias to the test results, given only a small data set as utilized in the scope of this work.
All three user interfaces provided include an undo button to reverse the effects of the user's latest action. A finish button is used to define the stopping criterion for the interactive image partitioning. The transparency of both the contour line and the seed mask displayed is adjustable to one of five fixed values via the opacity toggle button. The image contrast and brightness (windowing) can be adapted with standard control sliders for the window width and the window center operating on the image intensity value range [70]. All prototypes incorporate a help button used to provide additional guidance for the prototype's usage during the segmentation task. The segmentation process starts with a set of predefined background labels $S^0$ along the edges of the image, since an object is assumed to be located in its entirety inside the displayed region of the image.
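Two of the shared interface elements can be illustrated with a short sketch: intensity windowing via window center and width, and the initial background seeds along the image edges. The linear mapping to the display range [0, 1] is a common windowing convention assumed here, not a detail taken from the prototypes; both function names are hypothetical.

```python
def apply_window(image, center, width):
    """Map gray values linearly to [0, 1] for display.

    Values below center - width / 2 become 0, values above
    center + width / 2 become 1 (assumed linear windowing).
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    lower = center - width / 2.0
    return [[min(1.0, max(0.0, (v - lower) / width)) for v in row]
            for row in image]


def border_background_seeds(rows, cols):
    """Initial seed set: every element on the image edge is labeled
    background, assuming the object lies entirely inside the image."""
    return {
        (r, c): "background"
        for r in range(rows)
        for c in range(cols)
        if r in (0, rows - 1) or c in (0, cols - 1)
    }
```

Narrowing `width` increases the displayed contrast, while shifting `center` changes the brightness of the visualization without altering the data passed to the segmentation.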

Semi-Manual Segmentation Prototype.
The UI of the semi-manual prototype, depicted in Figure 2, provides several interaction elements. A user can add seed points as an overlay mask displayed on top of the image. These seed points have a pre-defined label of either foreground for the object or background used for all other image elements. The label of the next brush strokes (scribbles) can be altered via the buttons named object seed and background seed. After each interaction $n \in \mathbb{N}$, a new iteration of the seeded segmentation is started given the image $I$ as well as the updated set of seeds $S^n$.

Guided Segmentation Prototype.
The system selects two seed point locations $p_1^n$ and $p_2^n$, each with the lowest label certainty values assigned by the previous segmentation process. The seed point locations are shown to the user in each iteration $n$, as depicted in Figure 3. There are four possible labeling schemes for those points in the underlying two-class classification problem, since each seed point can be labeled as either background or foreground. The interface providing advanced user guidance displays the four alternative segmentation contour lines, which are a result of the four possible next steps during the iterative interactive segmentation with respect to the labeling of the new seed points $s_1^n$ and $s_2^n$. The user selects the only correct labeling, where all displayed object and background seeds are inside the object of interest and the image background, respectively. The alternative views on the right act as four buttons to define a selection. To further assist the user in their decision making, the region of interest, defined by $p_1^n$ and $p_2^n$, is zoomed in for the option view on the right and displayed as a cyan rectangle in the overview image on the left of the UI. The differences between the previous iteration's contour line and each of the four new options are highlighted by dotted areas in the four overlay mask images. After the user selects one of the labelings, the two new seed points are added to the current set of scribbles $S^n$. The scribbles $S^n := S^{n-1} \cup \{s_1^n, s_2^n\}$ are utilized as input for the next iteration, on which basis two new locations $p_1^{n+1}$ and $p_2^{n+1}$ are computed. The system-defined locations of the additional seed points can be determined by $\arg\max_e h_e^{t=\infty,n-1}$, the location(s) with maximum number of label changes during GrowCut segmentation. Frequent changes define specific image elements and areas in which the GrowCut algorithm indicates uncertainty in finding the correct labels.
Two locations in $h^{t=\infty,n-1}$ are then selected as $p_1^n$ and $p_2^n$, which exhibited the most changes in labeling during the previous segmentation with input image $I$ and seeds $S^{n-1}$.

Joint Segmentation Prototype.
The joint prototype depicted in Figure 4 is a combination of a pictorial interaction scheme and a menu-driven approach.
(1) A set of $J \in \mathbb{N}$ preselected new seeds is displayed in each iteration. The seeds' initial labels are set automatically, based on whether their position is inside (foreground) or outside (background) the current segmentation mask. The user may toggle the label of each of the new seeds, which also provides an intuitive undo functionality. When all seed points $\{s_1^n, s_2^n, \dots, s_J^n\}$ presented to the user are toggled to their correct label, the user may click on the new points button to initiate the next iteration with an updated set of seed points $S^n = S^{n-1} \cup \{s_1^n, s_2^n, \dots, s_J^n\}$. Another set of seed points $\{s_1^{n+1}, s_2^{n+1}, \dots, s_J^{n+1}\}$ is generated and displayed.
(2) In addition to preselected seeds, a single new seed point $s_0^n$ can be added manually via a user's long-press on any location in the image. This user action is interpreted as a desired change in the current labeling of this region. Therefore, the new seed point's initial label is set by inverting the current label of the given location. A new segmentation is initiated by this interaction based on the current seeds together with $s_0^n$. Note that the labels of $s_i^n$ are still subject to change via toggle interactions until the new points button is pressed.
The automated suggestion process for new seed point locations is depicted in Figure 5. The seed points are suggested deterministically based on the indices of the maximum values in an element-wise sum of three approximated influence maps. These maps are the gradient magnitude image of $I$, the previous label changes $h^{t=\infty}$ per element in $G_I$ weighted by an empirically determined factor of 17/12, and an influence map based on the distance of each element in $I$ to the current contour line. Note that for the guided prototype (see Section 2.2.2), only $h$ was used for the selection of suggested seed point locations. This scheme was extended for the joint prototype, since extracting ≈ 20 instead of only the top two points solely from $h$ potentially introduces suggested point locations forming impractical local clusters instead of spreading out with higher variance in the image domain. This process approximates the true influence or entropy (information gain) of each possible location for a new seed.
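The seed suggestion scheme of the joint prototype can be sketched as a top-$J$ selection on the element-wise sum of the three influence maps. The sketch assumes that all three maps are already computed and comparably scaled, and that ties are broken in row-major order; neither detail is specified in the text. The function names and the nested-list format are hypothetical.

```python
def suggest_seed_locations(gradient_magnitude, label_changes, contour_influence,
                           count, change_weight=17 / 12):
    """Return the `count` locations (row, col) with the highest summed influence.

    The label change map is weighted by 17/12 as stated in the text.
    Ties are broken in row-major order (an assumption) to stay deterministic.
    """
    scored = []
    for r, row in enumerate(gradient_magnitude):
        for c, gradient in enumerate(row):
            score = (gradient
                     + change_weight * label_changes[r][c]
                     + contour_influence[r][c])
            scored.append((-score, r, c))
    scored.sort()
    return [(r, c) for _, r, c in scored[:count]]


def initial_labels(locations, segmentation_mask):
    """Preset each suggested seed to the label of the current segmentation:
    foreground inside the mask, background outside."""
    return {
        (r, c): "foreground" if segmentation_mask[r][c] else "background"
        for r, c in locations
    }
```

Passing zero-filled maps for the gradient magnitude and the contour influence with `count=2` reduces this sketch to the selection used by the guided prototype, where only the label changes $h$ determine the two suggested locations.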

System Usability Scale (SUS).
The SUS [71,72] is a widely used, reliable, and low-cost survey to assess the overall usability of a prototype, product, or service [73]. Its focus is on pragmatic quality evaluation [74,75]. The survey is technology agnostic, which enables an evaluation of the usability of many types of user interfaces and ISS [76]. The questionnaire consists of ten statements and a unipolar five-point Likert scale [77]. This allows for an assessment in a time span of about three minutes per participant. The statements are as follows:
(1) I think that I would like to use this system frequently.
(2) I found the system unnecessarily complex.
(3) I thought the system was easy to use.
(4) I think that I would need the support of a technical person to be able to use this system.
(5) I found the various functions in this system were well integrated.
(6) I thought there was too much inconsistency in this system.
(7) I would imagine that most people would learn to use this system very quickly.
(8) I found the system very cumbersome to use.
(9) I felt very confident using the system.
(10) I needed to learn a lot of things before I could get going with this system.
SUS scores enable simple interpretation schemes, understandable also in multidisciplinary project teams. The result of the SUS survey is a single scalar value in the range of zero to 100 as a composite measure of the overall usability. The score is computed according to Equation (5), as outlined in [71], given $N$ participants, where $x^{\mathrm{SUS}}_{s,i}$ is the response to statement $i$ by subject $s$.
A neutral participant ($\forall i: x^{\mathrm{SUS}}_{s,i} = 2$) would produce a SUS score of 50. Although the SUS score allows for straightforward comparison of the usability throughout different systems, there is no simple intuition associated with the resulting scalar value. SUS scores do not provide a linear mapping of a system's quality in terms of overall usability. In practice, a SUS of less than 80 is often interpreted as an indicator of a substantial usability problem with the system. Bangor et al. [76,78] proposed an interpretation of the score in a seven-point scale. They added an eleventh question to 959 surveys they conducted. Here, participants were asked to describe the overall system as one of these seven items of an adjective rating scale: worst imaginable, awful, poor, OK, good, excellent, and best imaginable. The resulting SUS scores could then be correlated with the adjectives. The mapping from scores to adjectives resulting from their evaluation is depicted in Figure 6. This mapping also enables an absolute interpretation of a single SUS score.
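The SUS computation can be sketched with the standard scoring scheme: responses are coded from 0 to 4, odd-numbered (positively worded) statements contribute their response, even-numbered (negatively worded) statements contribute 4 minus their response, and the sum is scaled by 2.5 and averaged over participants. This follows the usual SUS scoring and is consistent with a neutral participant scoring 50; the function name and input format are assumptions for this example.

```python
def sus_score(responses):
    """Mean SUS score (0 to 100) over all participants.

    responses: one list per participant with ten integers in 0..4,
    ordered as statements (1) to (10).
    """
    if not responses:
        raise ValueError("at least one participant is required")
    total = 0.0
    for answers in responses:
        if len(answers) != 10 or any(a not in range(5) for a in answers):
            raise ValueError("each participant needs ten responses in 0..4")
        positive = sum(answers[0::2])               # statements 1, 3, 5, 7, 9
        negative = sum(4 - a for a in answers[1::2])  # statements 2, 4, 6, 8, 10
        total += 2.5 * (positive + negative)
    return total / len(responses)
```

For example, `sus_score([[2] * 10])` evaluates to 50.0, matching the neutral participant described above.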

Semantic Differential AttrakDiff-2.
A semantic differential is a technique for the measurement of meaning as defined by Osgood et al. [79,80]. Semantic differentials are based on the theory that the implicit anticipatory response of a person to a stimulus object is regarded as the object's meaning. Since these implicit responses themselves cannot be recorded directly, more apparent responses like verbal expressions have to be considered [81,82]. These verbal responses have to be sensitive to and maximally dependent on meaningful states while independent from each other [80]. Hassenzahl et al. [83,84] defined a set of 28 pairs of verbal expressions suitable to represent a subject's opinion on the hedonic as well as pragmatic quality (both aspects of perception) and attractiveness (an aspect of assessment) of a given interactive system separately [85]. During evaluation, the pairs of complementary adjectives are clustered into four groups, each associated with a different aspect of quality. Pragmatic quality (PQ) is defined as the perceived usability of the interactive system, which is the ability to assist users to reach their goals by providing utile and usable functions [86]. The attractiveness (ATT) quantifies the overall appeal of the system [87]. The hedonic quality (HQ) [88] is separable into hedonic identity (HQ-I) and hedonic stimulus (HQ-S). HQ-I focuses on a user's identification with the system and describes the ability of a product to communicate with other persons benefiting the user's self-esteem [89]. HQ-S describes the perceived novelty of the system. HQ-S is associated with the desire to advance one's knowledge and proficiencies. The clustering of the 28 word pairs into these four groups is defined as depicted in Table 2.
For each participant, the order of word pairs and the order of the two elements of each pair are randomized prior to the survey's execution. A bipolar [90] seven-point Likert scale is presented to the subjects to express their relative tendencies towards one of the two opposing statements (poles) of each expression pair, where index three denotes the neutral element. For the questionnaire's evaluation, for each subject $s \in \{0, 1, \dots, S-1\}$, each of the seven adjective pairs $i \in \{0, 1, \dots, 6\}$ per group $g \in \{\mathrm{PQ}, \mathrm{ATT}, \mathrm{HQ\text{-}I}, \mathrm{HQ\text{-}S}\}$ is assigned a score $x^{g}_{s,i} \in \{1, 2, \dots, 7\}$, reflecting the participant's tendency towards the positive of the two adjectives. The overall ratings per group are defined in [83] as the mean scores computed over all subjects $s$ and statements $i$, as depicted in Equation (6). Here, $S$ is the number of participants in the survey.

$$\mathrm{attrakdiff}(x^{g}) = \frac{1}{7 S} \sum_{s=0}^{S-1} \sum_{i=0}^{6} x^{g}_{s,i} \qquad (6)$$

Therefore, a neutral participant would produce an AttrakDiff-2 score of four. The final averaged score of each group ranges from one (worst) to seven (best rating). An overall evaluation of the AttrakDiff-2 results can be conducted in the form of a portfolio representation [86]. HQ is the mean of a system's HQ-I and HQ-S scores. PQ and HQ scores of a specific system and user are visualized as a point in a two-dimensional graph. The 95% confidence interval is an estimate of plausible values for rating scores from additional study participants, and determines the extension of the rectangle around the described data point in each dimension. A small rectangle area represents a more homogeneous rating among the participants than a larger area. If a rectangle completely lies inside one of the seven fields with associated adjectives defined in [86], this adjective is regarded as the dominant descriptor of the system. Otherwise, systems can be particularized by overlapping fields' adjectives.
If the confidence rectangles of two systems overlap in their one-dimensional projection on either HQ or PQ, their difference in AttrakDiff-2 scores with regard to this dimension is not significant.
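The group score and the overlap criterion described above can be illustrated with a minimal sketch. The function names and the example ratings are hypothetical, and the normal-approximation confidence interval is an assumption made for illustration; the estimator used for the portfolio representation in [86] may differ.

```python
import math
import statistics


def group_score(ratings):
    """Mean over all subjects and adjective pairs of one group (Equation (6))."""
    values = [value for subject in ratings for value in subject]
    return statistics.mean(values)


def confidence_interval(subject_means, z=1.96):
    """95% interval of the mean via a normal approximation (assumed estimator)."""
    mean = statistics.mean(subject_means)
    half_width = z * statistics.stdev(subject_means) / math.sqrt(len(subject_means))
    return mean - half_width, mean + half_width


def intervals_overlap(first, second):
    """True if two closed intervals (low, high) share at least one value."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical ratings: three subjects, seven adjective pairs, all neutral.
neutral = [[4] * 7 for _ in range(3)]
print(group_score(neutral))
```

If the intervals of two systems overlap in one dimension, their difference in that dimension would be treated as not significant, in line with the criterion stated above.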

Qualitative Measures.
In order to collect, normalize, and analyze visual and verbal feedback given by the participants, a summative qualitative content analysis is conducted via abstraction [91,92]. The abstraction method reduces the overall transcript material while preserving its substantial contents by summarization. The corpus retains a valid mapping of the recording. An essential part of abstraction is the formulation of macro operators like elimination, generalization, construction, integration, selection, and bundling. The abstraction of statements is increased iteratively by the use of macro operators, which map statements of the current level of abstraction to the next, while clustering items based on their similarity [93].

HCI Evaluation.
A user study is the most precise method for the evaluation of the quality of different interactive segmentation approaches [17]. Analytical measures as well as subjective measures can be derived from standardized user tests [94]. From interaction data recorded during the study, the reproducibility of segmentation results as well as the achievable accuracy with a given system per time can be estimated. The complexity and novelty of the system can be expressed via the observed convergence to the ground truth over time spent by the participants segmenting multiple images each. The user's satisfaction with the interactive approaches is expressed by the analysis of questionnaires, which the study participant fills out immediately after their tests are conducted and before any discussion or debriefing has started. The respondent is asked to fill in the questionnaire as spontaneously as possible. Intuitive answers are desired as user feedback instead of well-thought-out responses for each item in the questionnaire [71].
For the randomized A/B study, individuals are selected to approximate a representative sample of the intended users of the final system [95]. During the study, subjects are given multiple interactive segmentation tasks, each to be fulfilled in a limited time frame. The user segments all images provided with two different methods (A and B). All subjects are therefore given two tasks per image, in a randomized order to prevent a learning effect bias, which would allow for higher quality outcomes for the later tasks. Video and audio data of the subjects are recorded. Every user interaction recognized by the system and its time of occurrence are logged.

Data Set for the Segmentation Tasks.
In Figure 7, the data set used for the usability test is depicted. For this evaluation, the RGB colored images are converted to grayscale in order to increase similarity to the segmentation process of medical images acquired from CBCT. The conversion is performed in accordance with the ITU-R BT.709-6 recommendation [96] for the extraction of true luminance $I \in \mathbb{R}^{w,h}$ defined by the International Commission on Illumination (CIE) from contemporary cathode ray tube (CRT) phosphors via Equation (7), where $I'_R \in \mathbb{R}^{w,h}$, $I'_G \in \mathbb{R}^{w,h}$, and $I'_B \in \mathbb{R}^{w,h}$ are the linear red, green, and blue color channels of $I' \in \mathbb{R}^{w,h,3}$, respectively.

$$I = 0.2126\, I'_R + 0.7152\, I'_G + 0.0722\, I'_B \qquad (7)$$
The image in Figure 7(b) is initially presented to the study participants so that they can familiarize themselves with the upcoming segmentation process. The segmentation tasks associated with the images in Figures 7(a), 7(c), and 7(d) are then displayed sequentially to the subjects in randomized order. The images are chosen to fulfill two goals of the study.
(1) Ambiguity of the ground truth has to be minimized in order to suppress noise in the quantitative data. Each test person should have the same understanding of, and agree on, the correct outline of the object to segment. Therefore, clinical images can only be utilized with groups of specialized domain experts.
(2) The degree of complexity should vary between the images displayed to the users. Image (b), depicted in Figure 7, of moderate complexity with regard to its disagreement coefficient [97], is displayed first to learn the process of segmentation with the given prototype. Users are asked to initially test a prototype's features utilizing this image without any time pressure. The subsequent interactions during the segmentations of the remaining three images are recorded for each prototype and participant. The complexity increases from (a) to (d), according to the GTs' Minkowski-Bouligand dimensions [98]. The varying complexity enables a more objective and extended differentiation of subjects' performances with given prototypes.
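The grayscale conversion applied to the data set can be sketched as follows, assuming the image is given as nested lists of linear RGB triples. The function name and the data layout are illustrative assumptions; the weights are the luminance coefficients of ITU-R BT.709.

```python
# Luminance weights for linear red, green, and blue according to ITU-R BT.709.
BT709_WEIGHTS = (0.2126, 0.7152, 0.0722)


def to_luminance(rgb_image):
    """Convert rows of linear (r, g, b) triples to rows of luminance values."""
    weight_r, weight_g, weight_b = BT709_WEIGHTS
    return [
        [weight_r * r + weight_g * g + weight_b * b for (r, g, b) in row]
        for row in rgb_image
    ]


# Hypothetical 1 x 3 image: pure red, pure green, and white.
example = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]]
gray = to_luminance(example)
```

Since the three weights sum to one, a white pixel keeps its full intensity after the conversion.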

Usability Test Setup.
Two separate user studies are conducted to test all prototypes described in Section 2.2, in order to keep the time for each test short (less than 10 minutes per prototype), thus retaining the focus of the participants while minimizing the occurrence of learning effect artifacts in the acquired data. Note that the participants use this time not only to finish the segmentation tasks, but also to familiarize themselves with the novel interaction system, as well as to form opinions about the system while testing its provided interaction features.
(1) The first user test is a randomized A/B test of the semi-manual prototype (Section 2.2.1) and the guided prototype (Section 2.2.2). Ten individuals are selected as test subjects due to their advanced domain knowledge in the fields of medical image processing and mobile input devices. The subjects are given the task to segment three different images with varying complexity, which are described in Section 3.1, in random order. A fourth input image of medium complexity is provided for the users to familiarize themselves with the ISS before the tests. As an interaction device, a mobile tablet computer is utilized, since the final segmentation method is intended for usage via such a medium. The small 10.1 inch (13.60 cm × 21.75 cm) WUXGA display and the fingers utilized as a multitouch pointing device further exacerbate the challenge for the participants to produce an exact segmentation [99]. The user study environment is depicted in Figure 8. Audio and video recordings are evaluated via a qualitative content analysis, described in Section 2.4, in order to detect possible improvements for the tested prototypes and their interfaces. After segmentation, each participant fills out the SUS (Section 2.3.1) and AttrakDiff-2 (Section 2.3.2) questionnaires.
(2) The second user test is conducted for the joint segmentation prototype (Section 2.2.3). The data set and test setup are the same as in the first user study, and all test persons of study (1) also participated in study (2). One additional subject participated only in study (2). Two months passed between the two studies, during which the former participants were not exposed to any of the prototypes. Therefore, the learning effect bias for the second test is negligible.

Prediction of Questionnaire Results.
The questionnaires' PQ, HQ, HQ-I, HQ-S, ATT, and SUS results are predicted based on features extracted from the interaction log data. For the prediction, a regression analysis is performed. Stochastic Gradient Boosting Regression Forests (GBRF) are an additive model for regression analysis [100][101][102]. In several stages, shallow regression trees are generated. Each such tree is a weak base learner whose prediction error is the sum of a high bias and a low variance. Each tree is fit to the negative gradient of an arbitrary differentiable loss function, evaluated on the previous stage's outcome, thus reducing the overall bias via boosting [103]. The Huber loss function [104] is utilized for this evaluation due to its increased robustness to outliers in the data with respect to the squared error loss.
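To illustrate why the Huber loss is less sensitive to outliers, the following minimal sketch shows the loss and the negative gradient that a boosting stage would be fit to. The threshold parameter `delta` and the function names are assumptions for illustration and are not taken from the study's implementation.

```python
def huber_loss(residual, delta):
    """Quadratic for small residuals, linear for residuals beyond delta."""
    if abs(residual) <= delta:
        return 0.5 * residual ** 2
    return delta * (abs(residual) - 0.5 * delta)


def huber_negative_gradient(y_true, y_pred, delta):
    """Negative gradient of the Huber loss with respect to the prediction.

    Small residuals are passed through unchanged, as for the squared error
    loss, while large residuals are clipped to +/- delta.
    """
    residual = y_true - y_pred
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta


# Hypothetical targets and current predictions of one boosting stage.
targets = [1.0, 2.0, 30.0]
predictions = [1.2, 1.5, 2.0]
pseudo_residuals = [
    huber_negative_gradient(t, p, delta=1.0) for t, p in zip(targets, predictions)
]
```

The outlier in the third position contributes a pseudo-residual of bounded magnitude, whereas the squared error loss would pass its full residual on to the next tree.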
The collected data set of user logs is split randomly in a ratio of 4 : 1 for training and testing. An exhaustive grid search over 20,480 parameter combinations is performed for each of the six GBRF estimators (one for each questionnaire result) with scorings based on an eightfold cross-validation on the training set.

Figure 9: Results of the SUS questionnaires per prototype. Values are normalized in accordance with Equation (5), such that 4 is considered the best possible result for each question. The semi-manual prototype's SUS mean is 88, guided prototype's mean is 67, and joint prototype's mean SUS score is 82.
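The 4 : 1 data split, the eightfold cross-validation, and the enumeration of parameter combinations can be sketched as follows. The parameter names and values in `HYPOTHETICAL_GRID` are placeholders chosen for illustration; they are not the 20,480 combinations evaluated in the study, and the helper names are assumptions as well.

```python
import itertools
import random


def split_train_test(n_samples, test_fraction=0.2, seed=0):
    """Randomly split sample indices into training and test indices (4 : 1)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=8):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def parameter_grid(grid):
    """Yield every combination of the listed parameter values as a dict."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Placeholder grid, NOT the grid used in the study.
HYPOTHETICAL_GRID = {
    "n_estimators": [50, 100, 200],
    "max_depth": [1, 2, 3],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5, 0.8],
}

train, test = split_train_test(50)
combinations = list(parameter_grid(HYPOTHETICAL_GRID))
```

In an exhaustive search, each combination would be scored by its mean validation result over the eight folds, and the test indices would be left untouched until the final evaluation.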

Feature Selection for SUS Prediction.
For the approximation of SUS results, a feature selection step is added to decrease the prediction error by an additional three percentage points: here, after the described initial grid search, 1% (205) of the GBRF estimators, those with the lowest mean deviance from the ground truth, are selected to approximate the most important features. From those estimators, the most important features for the GBRFs are extracted via an inverse-weighted feature importance voting. This feature importance voting by 205 estimators ensures a more robust selection than deciding the feature ranking from only a single trained GBRF. After the voting, a second grid search over the same 20,480 parameter combinations is performed, but with a reduction from 238 to only the 25 most important features.
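A weighted importance voting of this kind could look like the following sketch. It assumes, purely for illustration, that each estimator's vote is weighted by the inverse of its mean deviance; the actual weighting term, the function name, and the feature names are hypothetical.

```python
def vote_feature_importances(estimators, top_k):
    """Rank features by a weighted sum of per-estimator importances.

    estimators: list of (mean_deviance, {feature_name: importance}) pairs.
    The weight 1 / mean_deviance is an assumption made for this sketch.
    """
    totals = {}
    for mean_deviance, importances in estimators:
        weight = 1.0 / mean_deviance
        for feature, importance in importances.items():
            totals[feature] = totals.get(feature, 0.0) + weight * importance
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:top_k]


# Hypothetical estimators with made-up deviances and importances.
example_estimators = [
    (0.5, {"feature_a": 0.6, "feature_b": 0.4}),
    (1.0, {"feature_a": 0.1, "feature_b": 0.2, "feature_c": 0.7}),
]
selected = vote_feature_importances(example_estimators, top_k=2)
```

Aggregating over many estimators in this way dampens the influence of any single model's idiosyncratic ranking.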

Overall Usability.
The result of the SUS score is depicted in Figure 9. According to the mapping (Figure 6) introduced in Section 2.3.1, the adjective rating of the semi-manual and joint prototypes is excellent (88 and 82, respectively), and the adjective associated with the guided prototype is good (67). A graph representation of the similarity of individual usability aspects, based on the acquired questionnaire data, is depicted in Figure 10. Based on the Pearson correlation coefficients utilized as a metric for similarity, the SUS score has the most similarity to the pragmatic (PQ) and attractiveness (ATT) usability aspects provided by the AttrakDiff-2 questionnaire.

Figure 11: Results of the AttrakDiff-2 questionnaires per prototype. A value of 7 is considered the best possible result. The semi-manual prototype's AttrakDiff-2 mean is 5.46, guided prototype's mean is 4.50, and joint prototype's mean AttrakDiff-2 score is 5.22.
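The Pearson correlation coefficient used as the similarity metric between two usability aspects can be computed as in the following minimal sketch. The function name and the example score vectors are hypothetical and do not reproduce the study's questionnaire data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical per-participant scores of two usability aspects.
aspect_one = [1.0, 2.0, 3.0, 4.0]
aspect_two = [2.0, 4.0, 6.0, 8.0]
similarity = pearson(aspect_one, aspect_two)
```

A coefficient close to one indicates that participants who rate one aspect highly also tend to rate the other aspect highly.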

Pragmatic Quality.
The PQ results of the AttrakDiff-2 questionnaire are illustrated in Figure 11. The PQ scores for semi-manual, guided, and joint prototypes are 88%, 50%, and 74% of the maximum score, respectively. Since the 95% confidence intervals do not overlap, the prototypes' ranking regarding PQ is significant. The quantitative evaluation of recorded interaction data is depicted in Figure 12. Dice scores before the first interaction are zero, except for the guided prototype (0.82 ± 0.02), where a few fixed seed points had to be provided to initialize the system. Utilizing the semi-manual prototype and starting from zero, a Dice measure similar to the guided prototype's initialization is reached after about seven interactions, which takes 13.06 ± 2.05 seconds on average. The median values of final Dice scores per prototype are 0.95 (semi-manual), 0.94 (guided), and 0.82 (joint). The mean overall elapsed wall times in seconds spent for interactive segmentations per prototype are 73 ± 11 (semi-manual), 279 ± 36 (guided), and 214 ± 24 (joint). Since segmenting with the guided version takes the longest time and does not yield the highest final Dice scores, the initial advantage from preexisting seed points does not bias the top ranking of a prototype in this evaluation.
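The Dice score referred to above measures the overlap between a segmentation and the ground truth. A minimal sketch for flattened binary masks is given below; the function name and the convention of returning one for two empty masks are assumptions made for illustration.

```python
def dice_score(segmentation, ground_truth):
    """Dice coefficient 2 |A n B| / (|A| + |B|) of two flat binary masks."""
    if len(segmentation) != len(ground_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for s, g in zip(segmentation, ground_truth) if s and g)
    size_sum = sum(1 for s in segmentation if s) + sum(1 for g in ground_truth if g)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Hypothetical four-pixel masks.
truth = [1, 1, 0, 0]
empty_start = dice_score([0, 0, 0, 0], truth)
partial = dice_score([1, 0, 0, 0], truth)
```

An empty segmentation against a non-empty ground truth yields zero, which matches the starting value reported for prototypes without preexisting seed points.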

Identity and Stimulus.
The AttrakDiff-2 questionnaire provides a measure for the HQ of identity and stimulus introduced in Section 2.3.2. The HQ scores for semi-manual, guided, and joint prototypes are 72%, 70%, and 77% of the maximum score, respectively. Since the 95% confidence intervals are overlapping for all three prototypes, no system ranks significantly higher than the others. An overall evaluation of the AttrakDiff-2 results is conducted in the form of a portfolio representation depicted in Figure 13.

Figure 13: AttrakDiff-2 portfolio representation, according to [86], depicting results from the evaluation of the semi-manual segmentation prototype (blue), guided prototype (green), and joint prototype (red). The rectangular areas illustrate the 95% confidence intervals for the mean value in each dimension. The mean intervals are 5.5% for PQ and 4.0% for HQ.

Qualitative Content Analysis.
A summative qualitative content analysis as described in Section 2.4 is conducted on the audio and video data recorded during the study. After generalization and reduction of given statements, the following user feedback is extracted with respect to three problem statements: positive usability aspects, negative usability aspects, and user suggestions concerning existing functions or new functions.

Feedback for Multiple Prototypes
(1) Responsiveness: the most common statement concerning the semi-manual and joint versions is that the user expected the zoom function to be more responsive and thus more time efficient.
(2) Visibility: 20% of the participants had difficulties distinguishing between the segmentation contour line and either the background image or the foreground scribbles in the overlay mask, due to the proximity of their assigned color values.
(3) Feature suggestion: deletion of individual seed points instead of all seeds from last interaction using undo.

Semi-manual Segmentation Prototype
(1) Mental model: 30% of test persons suggested a clearly visible indication of whether the label for the scribble drawn next will be foreground or background.
(2) Visibility: hide previously drawn seed points, in order to prevent confusion with the current contour line and occlusion of the underlying image.

Guided Segmentation Prototype
(1) Responsiveness: 50% of test persons suggested an indicator for ongoing computations during their time of waiting.
(2) Control: users would like to influence the location of new seed points, to zoom the image manually, and to have fine-grained control over the undo function.

Joint Prototype
(1) Visibility: 64% of users intuitively found the toggle functionality for seed labels without prior explanation.
(2) Visibility: 64% of participants suggested visible instructions for manual seed generation.

Prediction of Questionnaire Results from Log Data.
The questionnaires' results are predicted via a regression analysis, based on features extracted from the interaction log data. A visualization of the feature importances for the regression analysis with respect to the GBRF is depicted in Figure 14.
An evaluation with the test set is conducted as depicted in Table 3. The mean prediction errors for the questionnaires' results are 15.7% for PQ and 7.4% for HQ. In both cases, the error of these (first) estimates is larger than, but close to, the average 95% confidence intervals of 5.5% (PQ) and 4.0% (HQ) for the overall questionnaire results in the portfolio representation. The similarity graph for the acquired usability aspects introduced in Figure 10 can be extended to outline the direct relationship between questionnaire results and recorded features. Such a graph is depicted in Figure 15. Notably, there is no individual feature that strongly correlates with one of the questionnaire results. However, as the results of the regression analysis in Table 3 show, there is a noteworthy dependence between the usability aspects measured by the SUS and AttrakDiff-2 questionnaires and combinations of the recorded features. The most important features for the approximation of the questionnaire results are depicted in Table 4.

Usability Aspects.
Although the underlying segmentation algorithm is the interactive GrowCut method for all three prototypes tested, the measured user experiences varied significantly. In terms of user stimulus (HQ-S), a more innovative interaction system like the joint prototype is preferred to a traditional one. Pragmatic quality aspects, evaluated by SUS as well as AttrakDiff-2's PQ, clearly outline that the semi-manual approach has an advantage over the other two techniques. This conclusion also manifests in the Dice coefficient values' fast convergence rate towards its maximum for this prototype. The normalized median ΣWall times spent for the overall segmentation of each image are 100% (semi-manual), 550% (guided), and 380% (joint). As a result, users prefer the simple, pragmatic interface as well as a substantial degree of freedom to control each iterative step of the segmentation. The less cognitively challenging approach is preferred [26]. The other methods provide more guidance for aspects which the user aims to control themselves. In order to improve the productivity of an ISS, less guidance should be imposed in these cases, while providing more guidance on aspects of the process not apparent to the users' focus of attention [105].

Table 4 and Figure 14 (left). In comparison, PQ 41%, HQ 36%, HQ-I 18%, ATT 14%, and HQ-S 9%.

Conclusion
For sufficiently complex tasks like the accurate segmentation of lesions during TACE, fully automated systems are, by their lack of domain knowledge, inherently limited in the achievable quality of their segmentation results. ISS may supersede fully automated systems in certain niches by cooperating with the human user in order to reach the common goal of an exact segmentation result in a short amount of time. The evaluation of interactive approaches is more demanding and less automated than the evaluation of other approaches, due to complex human behavior.
However, there are methods like extensive user studies to assess the quality of a given system. It was shown that even a suitable approximation of a study's results regarding pragmatic as well as hedonic usability aspects is achievable from a sole analysis of the users' interaction recordings. Those records are straightforward to acquire during normal (digital) prototype usage and can lead to a good first estimate of the system's usability aspects, without the need to significantly increase the temporal demands on each participant by a mandatory completion of questionnaires after each system usage.
This mapping of quantitative low-level features, which are exclusively based on measurable interactions with the system (like the final Dice score, computation times, or relative seed positions), may allow for a fully automated assessment of an interactive system's quality.

Outlook
For the proposed automation, a rule-based user model (robot user) like [27,34] or a learning-based user model could interact with the prototype system instead of a human user. This evaluation scheme may significantly reduce the amount of resources necessary to investigate each variation of a prototype's UI features and segmentation methodologies. An estimate of a system's usability can therefore be acquired fully automatically with dependence only on the chosen user model. In addition, the suitable approximation of a usability study's result can be used as a descriptor, i.e., a feature vector, for a user. These features can be utilized for a clustering of users, which is a necessary step for the application of a personalized segmentation system. Such an interactive segmentation system might benefit from prior knowledge about a user's preferences and input patterns in order to achieve accurate segmentations from fewer interactions.

A. Example for SUS Evaluation Equation (5)
The result of the SUS survey is a single scalar value in the range of zero to 100, as a composite measure of the overall usability. The score is computed according to Equation (5), as outlined in [71], given $S$ participants, where $x^{\mathrm{SUS}}_{s,i}$ is the response to statement $i$ by subject $s$. In this case, $\mathrm{sus}(x) = 50$. Note that the factor 2.5 in (5) normalizes the SUS score to a value $0 \le \mathrm{sus}(\cdot) \le 100$.
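As a worked illustration, the following sketch computes a SUS score under the standard SUS scoring convention, which is assumed here: responses are coded from 0 to 4, odd-numbered statements are positively worded, and even-numbered statements are negatively worded and therefore inverted. The function name and this coding are assumptions and may deviate in detail from Equation (5).

```python
def sus_score(responses):
    """Mean SUS score over participants, each giving ten answers in {0, ..., 4}.

    Answers are listed in questionnaire order. Statements 1, 3, 5, ... (list
    positions 0, 2, 4, ...) are assumed to be positively worded, the others
    negatively worded, so the latter are inverted before summation.
    """
    total = 0.0
    for answers in responses:
        if len(answers) != 10:
            raise ValueError("each participant must answer ten statements")
        for position, value in enumerate(answers):
            positively_worded = position % 2 == 0
            total += value if positively_worded else 4 - value
    # The factor 2.5 maps the maximum sum of 40 per participant to 100.
    return 2.5 * total / len(responses)


# A neutral participant answers every statement with the middle value 2.
neutral_participant = [[2] * 10]
neutral_score = sus_score(neutral_participant)
```

Under these assumptions, a neutral participant yields a score of 50, and a participant who fully agrees with all positive statements and fully disagrees with all negative ones yields 100.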

Data Availability
The interaction log data used to support the findings of this study can be requested from the corresponding author.