For complex segmentation tasks, the achievable accuracy of fully automated systems is inherently limited. Specifically, when a precise segmentation result is desired for a small number of data sets, semi-automatic methods offer a clear benefit to the user. The optimization of human-computer interaction (HCI) is an essential part of interactive image segmentation. Nevertheless, publications introducing novel interactive segmentation systems (ISS) often lack an objective comparison of HCI aspects. It is demonstrated that even when the underlying segmentation algorithm is the same across interactive prototypes, their user experience may vary substantially. As a result, users prefer simple interfaces as well as a considerable degree of freedom to control each iterative step of the segmentation. In this article, an objective method for the comparison of ISS, based on extensive user studies, is proposed. A summative qualitative content analysis is conducted via abstraction of visual and verbal feedback given by the participants. A direct assessment of the segmentation system is performed by the users via the system usability scale (SUS) and AttrakDiff-2 questionnaires. Furthermore, an approximation of the usability findings of those studies is introduced, computed solely from the system-measurable user actions during usage of the interactive segmentation prototypes. The prediction of all questionnaire results has an average relative error of 8.9%, which is close to the expected precision of the questionnaire results themselves. This automated evaluation scheme may significantly reduce the resources necessary to investigate each variation of a prototype's user interface (UI) features and segmentation methodologies.
To the best of our knowledge, there is no publication in which user-based scribbles are combined with standardized questionnaires in order to assess an interactive image segmentation system's quality. This type of synergetic usability measure is a contribution of this work. In order to provide a guideline for an objective comparison of interactive image segmentation approaches, a prototype providing a semi-manual pictorial user input, introduced in Section
Image segmentation can be defined as the partitioning of an image into a finite number of semantically non-overlapping regions. A semantic label can be assigned to each region. In medical imaging, each individual region of a patient's abdominal tissue might be regarded as healthy or cancerous. Segmentation systems can be grouped into three principal categories, each differing in the degree of involvement of an operating person (user): manual, automatic, and interactive.
Performance evaluation is one of the most important aspects during the continuous improvement of systems and methodologies. With non-interactive computer vision and machine learning systems for image segmentation, an objective comparison of systems can be achieved by evaluating pre-selected data sets for training and testing. Similarity measures between segmentation outcome and ground truth images are utilized to quantify the quality of the segmentation result.
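For instance, the Sørensen-Dice coefficient, one of the similarity measures used for the evaluation later in this article, can be sketched as follows (the function name and mask layout are illustrative assumptions):

```python
import numpy as np

def dice_score(seg, gt):
    """Sorensen-Dice coefficient between two binary masks: 2|A & B| / (|A| + |B|)."""
    seg, gt = np.asarray(seg, bool), np.asarray(gt, bool)
    denom = seg.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * np.logical_and(seg, gt).sum() / denom

# a 4-pixel and a 2-pixel mask overlapping in 2 pixels -> 2*2 / (4+2) = 2/3
a = [[1, 1], [1, 1]]
b = [[1, 1], [0, 0]]
print(dice_score(a, b))  # 0.666...
```

A Dice value of 1 indicates a perfect match with the ground truth; 0 indicates no overlap at all.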
With interactive segmentation systems (ISS), a complete ground truth data set would also have to include the adaptive user interactions that advance the segmentation process. Therefore, when comparing ISS, the user needs to be involved in the evaluation process. User interaction data, however, is highly dependent on
As described by Olabarriaga et al. [
Nickisch et al. [
In segmentation challenges like SLIVER07 [
In principle, with every new proposal of an interactive segmentation algorithm or interface, the authors have to demonstrate the new method's capabilities in an objective comparison with already established techniques. The effort spent on these comparisons by the original authors varies substantially. According to [
In Table
Overview of seed point location selection methods for a set of influential publications in the field of interactive image segmentation. Additional unordered seed information can be retrieved in arbitrary order by (a) manually drawn seeds or (b) randomly generated seeds. Seeds can be inferred rule-based from the ground truth segmentation by (c) sampling the binary mask image, (d) from provided bounding box mask images, (e) random sampling from tri-maps generated by erosion and dilation, or (f) by a robot user, i.e., a user simulation. A tri-map specifies background, foreground, and mixed areas. Seeds can also be provided by real users via (g) the final seed masks after all interactions on one input image or (h) the ordered iterative scribbles. (i) Questionnaire data from
Year | Publication | (a) Manual | (b) Random | (c) Binary Mask | (d) Box | (e) Tri-maps | (f) Robot | (g) Final Seeds | (h) Scribbles | (i) Questionnaire
---|---|---|---|---|---|---|---|---|---|---

Columns (a)-(b) denote arbitrary seeds, (c)-(f) seeds derived from the ground truth, and (g)-(i) multiple-user-data-based seeds. The individual per-publication markers are not recoverable here; the rows cover, in descending order of year: Amrehn (2019); Chen and Amrehn (2018); Liew, Wang, Wang, Amrehn, and Amrehn (2017); Ramkumar, Ramkumar, Jiang, Xu, and Chen (2016); Andrade and Rupprecht (2015); Bai (2014); Jain and He (2013); Kohli (2012); Zhao, Top, and McGuinness (2011); Nickisch, Gulshan, Batra, Ning, Price, and Moschidis (2010); Moschidis and Singaraju (2009); Duchenne, Levin, and Vicente (2008); Protiere (2007); Boykov and Grady (2006); Vezhnevets and Cates (2005); Li, Rother, and Blake (2004); Martin (2001).
Hepatocellular carcinoma (HCC) is among the most prevalent malignant tumors worldwide [
Liver lesion segmentations. Depicted are central slices through the volumes of interest of reconstructed images acquired by a C-arm CBCT scanner. The manually annotated ground truth segmentation is displayed as an overlay contour line in green.
In the following, the segmentation method underlying the user interface prototypes is described in Section
GrowCut [
Iterations
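A minimal sketch of a GrowCut-style cellular automaton may help to make the iteration scheme concrete. An 8-neighborhood and a linear intensity-based attack force are assumed here; this is an illustration, not the authors' implementation:

```python
import numpy as np

def growcut(image, seeds, iterations=50):
    """Sketch of the GrowCut cellular automaton on a 2-D grayscale image.

    image : 2-D float array of intensities
    seeds : 2-D int array; 0 = unlabeled, 1 = foreground, 2 = background
    """
    labels = seeds.copy()
    strength = (seeds > 0).astype(float)   # seed cells start at full strength
    max_diff = float(np.ptp(image)) or 1.0
    h, w = image.shape

    for _ in range(iterations):
        new_labels = labels.copy()
        new_strength = strength.copy()
        changed = False
        for y in range(h):
            for x in range(w):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (dy or dx) and 0 <= ny < h and 0 <= nx < w and labels[ny, nx]:
                            # attack force decreases with intensity difference
                            g = 1.0 - abs(image[ny, nx] - image[y, x]) / max_diff
                            if g * strength[ny, nx] > new_strength[y, x]:
                                new_labels[y, x] = labels[ny, nx]
                                new_strength[y, x] = g * strength[ny, nx]
                                changed = True
        labels, strength = new_labels, new_strength
        if not changed:  # converged: no cell was conquered in this iteration
            break
    return labels
```

Each labeled cell tries to conquer its neighbors; a neighbor is captured when the attacker's strength, damped by the intensity difference, exceeds the defender's current strength. New user seeds simply reset the corresponding cells' labels and strengths before further iterations.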
Three interactive segmentation prototypes with different UIs were implemented for usability testing. The segmentation technique applied in all prototypes is based on the GrowCut approach as described in Section
All three user interfaces provided include an
The UI of the semi-manual prototype, depicted in Figure
Semi-manual segmentation prototype user interface. The current segmentation’s contour line (light blue) is adjusted towards the user’s estimate of the ground truth segmentation by manually adding foreground (blue) or background (red) seed points.
The system selects two seed point locations
Guided segmentation prototype user interface. The current segmentation displayed on the upper left can be improved by choosing one of the four segmentation alternatives displayed on the right. The user is expected to choose the upper-right option in this configuration, due to the two new seeds’ matching background and foreground labels.
The system-defined locations of the additional seed points can be determined by
The joint prototype depicted in Figure
Joint segmentation prototype user interface. The user toggles the labels of prepositioned seed points, whose positions are displayed to them as colored circles, to properly indicate their inclusion in the set of object or background representatives. New seeds can be added at the position of the current interaction via a long-press on the overlay image. The segmentation result and the displayed contour line adapt accordingly after each interaction.
The approximated influence map for new seed point locations for the joint segmentation prototype. The map is generated by a weighted sum of gradient magnitude image, number of cell changes
When all seed points
The SUS [ consists of the following ten statements:

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.
The Likert scale provides a fixed choice response format to these expressions. The
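The standard SUS scoring rule maps the ten Likert responses to a single value between 0 and 100: odd-numbered (positively worded) statements contribute their response minus one, even-numbered (negatively worded) statements contribute five minus their response, and the sum is multiplied by 2.5. A minimal sketch:

```python
def sus_score(responses):
    """Standard SUS scoring: ten Likert responses (1-5) -> a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    total = sum((r - 1) if i % 2 == 1 else (5 - r)       # odd vs. even statements
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# best possible answers -> 100; uniformly neutral answers -> 50
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
print(sus_score([3] * 10))                        # 50.0
```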
Mapping from a SUS score to an adjective rating scheme proposed by Bangor et al. [
A semantic differential is a technique for the measurement of meaning as defined by Osgood et al. [
AttrakDiff-2 statement pairs. The pairs of complementary adjectives are clustered into four groups, each associated with a different aspect of quality. All
Pragmatic quality (PQ) | Attractiveness (ATT) | Hedonic identity (HQ-I) | Hedonic stimulus (HQ-S) |
---|---|---|---|
complicated, simple | bad, good | alienating, integrating | cautious, bold |
confusing, clearly structured | disagreeable, likeable | cheap, premium | conservative, innovative |
cumbersome, straightforward | discouraging, motivating | isolating, connective | conventional, inventive |
impractical, practical | rejecting, inviting | separates me from, brings me closer to people | dull, captivating |
technical, human | repelling, appealing | tacky, stylish | ordinary, novel |
unpredictable, predictable | ugly, attractive | unpresentable, presentable | undemanding, challenging |
unruly, manageable | unpleasant, pleasant | unprofessional, professional | unimaginative, creative |
For each participant, the order of word pairs and order of the two elements of each pair are randomized prior to the survey’s execution. A bipolar [
An overall evaluation of the AttrakDiff-2 results can be conducted in the form of a portfolio representation [
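As a hedged sketch of such an aggregation, assuming a seven-point bipolar scale and mean aggregation over the seven word pairs of each group (all ratings below are hypothetical):

```python
from statistics import mean

# hypothetical ratings of one participant on a 7-point bipolar scale
# (1 = negative adjective, 7 = positive adjective), seven word pairs per group
ratings = {
    "PQ":   [6, 5, 6, 7, 4, 5, 6],
    "ATT":  [6, 6, 5, 6, 5, 6, 7],
    "HQ-I": [4, 5, 4, 5, 4, 4, 5],
    "HQ-S": [3, 4, 3, 4, 3, 3, 4],
}

# per-group score: mean over the group's seven item ratings
group_scores = {group: mean(values) for group, values in ratings.items()}

# overall hedonic quality HQ: mean of its identity and stimulus sub-groups
hq = mean(group_scores[g] for g in ("HQ-I", "HQ-S"))
```

The pair (group_scores["PQ"], hq) then locates the prototype in the two-dimensional portfolio plot.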
In order to collect, normalize, and analyze visual and verbal feedback given by the participants, a summative qualitative content analysis is conducted via abstraction [
A user study is the most precise method for the evaluation of the quality of different interactive segmentation approaches [
For the randomized A/B study, individuals are selected to approximate a representative sample of the intended users of the final system [
In Figure
In the top row, image data utilized in the usability tests are depicted. In the bottom row, the ground truth segmentations of the images are illustrated. The image of a contrast-enhanced aneurysm (a) and its ground truth annotation by a medical expert were composed for this study. Images (b)-(d) are selected from the GrabCut image database initially created for [
Two separate user studies are conducted to test all prototypes described in Section
User testing setup for the usability evaluation of the prototypes. In this environment, a user performs an interactive segmentation on a mobile tablet computer while sitting. RGB cameras record the hand motions on the input device and facial expressions of the participant. In addition, each recognized input is recorded on the tablet device (the interaction log).
The questionnaires’ PQ, HQ, HQ-I, HQ-S, ATT, and SUS results are predicted, based on features extracted from the interaction log data. For the prediction, a regression analysis is performed. Stochastic Gradient Boosting Regression Forests (GBRF) are an additive model for regression analysis [
The collected data set of user logs is split randomly in a ratio of
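Such a regression setup can be sketched with scikit-learn's gradient boosting implementation; the synthetic features, target, and hyper-parameter grid below are illustrative assumptions, not the study's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))                        # stand-in interaction-log features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)  # stand-in questionnaire score

# random split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# grid search over a few hyper-parameters, then refit on the training split
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
mae = np.mean(np.abs(pred - y_te))                        # mean absolute prediction error
importances = grid.best_estimator_.feature_importances_  # per-feature importance
```

One such regressor would be trained per figure of merit (ATT, HQ, HQ-I, HQ-S, PQ, SUS), and the `feature_importances_` vector indicates which interaction-log features drive each prediction.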
The collected data contains
For the approximation of SUS results, a feature selection step is added to decrease the prediction error by an additional three percentage points: here, after the described initial grid search,
The result of the SUS score is depicted in Figure
Results of the SUS questionnaires per prototype. Values are normalized in accordance with Equation (
A graph representation of the similarity of individual usability aspects, based on the acquired questionnaire data, is depicted in Figure
Pearson correlation coefficients for the AttrakDiff-2 (blue) and SUS (red) questionnaire results, based on the acquired questionnaire data. The line thickness is proportionate to correlation strength of the different aspects of quality measured.
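The pairwise coefficients behind such a similarity graph can be computed directly with NumPy; the per-participant aspect scores below are hypothetical stand-ins for the actual questionnaire results:

```python
import numpy as np

# hypothetical per-participant aspect scores (aspect names follow the questionnaires)
scores = {
    "PQ":   [5.1, 4.2, 6.0, 3.8, 5.5],
    "ATT":  [5.0, 4.0, 6.1, 3.9, 5.4],
    "HQ-S": [3.2, 4.8, 3.0, 4.9, 3.5],
}
names = list(scores)
corr = np.corrcoef([scores[n] for n in names])  # pairwise Pearson coefficients

# keep an "edge" only where |r| exceeds a threshold, as in the similarity graph
edges = [(names[i], names[j], round(float(corr[i, j]), 2))
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(corr[i, j]) > 0.7]
```

Edge thickness in the graph is then drawn proportional to the absolute correlation strength.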
The PQ results of the AttrakDiff-2 questionnaire are illustrated in Figure
Results of the AttrakDiff-2 questionnaires per prototype. A value of
The quantitative evaluation of recorded interaction data is depicted in Figure
Evaluation of the user interaction data. The segmentations’ similarity to the ground truth according to the Dice score is depicted per interaction. The median Dice rating and the
The AttrakDiff-2 questionnaire provides a measure for the HQ of identity and stimulus introduced in Section
AttrakDiff-2 portfolio representation, according to [
A summative qualitative content analysis as described in Section
Responsiveness: the most common statement concerning the semi-manual and joint versions is that users expected the zoom function to be more responsive and thus more time-efficient. Visibility: Feature suggestion: deletion of individual seed points instead of all seeds from the last interaction using
Mental model: Visibility: hide previously drawn seed points, in order to prevent confusion with the current contour line and occultation of the underlying image.
Responsiveness: Control: users would like to influence the location of new seed points, support for manual image zoom, and fine-grained control for the
Visibility: Visibility:
The questionnaires’ results are predicted via a regression analysis, based on features extracted from the interaction log data. A visualization of the feature importances for the regression analysis with respect to the GBRF is depicted in Figure
Relative absolute prediction errors for AttrakDiff-2 and SUS test set samples. Predictions are computed by six separately trained Stochastic Gradient Boosting Regression Forests (GBRFs), one for each figure of merit. Note that each training process only utilizes the interaction log data. Results displayed are the median values of
Relative Error | ATT | HQ | HQ-I | HQ-S | PQ | SUS |
---|---|---|---|---|---|---|
Mean | 11.5% | 7.4% | 10.5% | 8.0% | 15.7% | 10.4% |
Median | 8.9% | 6.3% | 9.4% | 6.2% | 13.7% | 8.8% |
Std | 8.0% | 5.5% | 6.7% | 6.9% | 12.0% | 7.1% |
Relative feature importance measures from
The similarity graph for the acquired usability aspects introduced in Figure
The five most important features per GBRF estimator/label. Italic indicates the most frequently used features in the trained decision trees of the GBRFs. Bold highlights semantically similar feature pairs. The abbreviations represent the receiver operating characteristic area under the curve (ROC_AUC), logistic loss (LOG), and relative absolute area/volume difference (RAVD).
 | 1. | 2. | 3. | 4. | 5.
---|---|---|---|---|---
ATT | … | … | … | Med(OBJ_TPR)/Med( | Med(
HQ-I | … | … | … | … | …
HQ | Med(Jaccard/ | … | … | Mean(OBJ_TPR/ | …
HQ-S | … | Med(Med_wtime/ | Med(LOG) | … | Med(MSE)
PQ | PCA_VAL_16 | Mean( | Mean(Dice)/Mean( | PCA_VAL_11 | …
SUS | PCA_VAL_2 | PCA_VAL_18 | … | Med(Med_wtime) | PCA_VAL_20
Features from user interaction logs (green) correlated with SUS (red) and AttrakDiff-2 (blue) questionnaire results. Bold feature names highlight the top five most important features with regard to the GBRFs. Only relations with a Pearson correlation coefficient
Although the underlying segmentation algorithm is the interactive GrowCut method for all three prototypes tested, the measured user experiences varied significantly. In terms of hedonic stimulus (HQ-S), a more innovative interaction system like the joint prototype is preferred over a traditional one. Pragmatic quality aspects, evaluated by the SUS as well as AttrakDiff-2's PQ, clearly show that the semi-manual approach has an advantage over the other two techniques. This conclusion also manifests in the Dice coefficient values' fast convergence toward their maximum for this prototype. The normalized median
For ATT and HQ-I, the most discriminative features selected by GBRFs are the receiver operating characteristic area under the curve (ROC_AUC) of the final interactive segmentations over the elapsed real time which passed during segmentation (
For sufficiently complex tasks like the accurate segmentation of lesions during TACE, fully automated systems are, due to their lack of domain knowledge, inherently limited in the achievable quality of their segmentation results. ISS may supersede fully automated systems in certain niches by cooperating with the human user in order to reach the common goal of an exact segmentation result in a short amount of time. The evaluation of interactive approaches is more demanding and less automated than that of non-interactive approaches, due to complex human behavior.
However, there are methods, such as extensive user studies, to assess the quality of a given system. It was shown that even a suitable approximation of a study's results regarding pragmatic as well as hedonic usability aspects is achievable from a sole analysis of the users' interaction recordings. Those records are straightforward to acquire during normal (digital) prototype usage and can lead to a good first estimate of the system's usability aspects, without significantly increasing the temporal demands on each participant through a mandatory completion of questionnaires after each system usage.
This mapping of quantitative low-level features, which are exclusively based on measurable interactions with the system (like the final Dice score, computation times, or relative seed positions), may allow for a fully automated assessment of an interactive system’s quality.
For the proposed automation, a rule-based user model (robot user) like [
The result of the SUS survey is a single scalar value, in the range of zero to
For the questionnaire’s evaluation for subject
Group PQ:
Group ATT:
Group HQ-I:
Group HQ-S:
After evaluation via (
The confidence intervals
The interaction log data used to support the findings of this study can be requested from the corresponding author.
The concept and software presented in this paper are based on research and are not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Thanks are due to Christian Kisker and Carina Lehle for their hard work with the data collection.