1Click1View: Interactive Visualization Methodology for RNAi Cell-Based Microscopic Screening

Technological advancements are constantly increasing the size and complexity of data resulting from large-scale RNA interference screens. This has led biologists to ask complex questions that existing, fully automated analyses are often inadequate to answer. We present the concept of 1Click1View (1C1V) as a methodology for interactive analytic software tools. 1C1V can be applied to two-dimensional visualization of image-based screening data sets from High Content Screening (HCS). Through an easy-to-use interface, a one-click, one-view concept, and a workflow-based architecture, the visualization method facilitates the linking of image data with numeric data. The method uses state-of-the-art interactive visualization tools optimized for fast visualization of large-scale image data sets. We demonstrate our method on an HCS dataset consisting of multiple cell features from two screening assays.


Introduction
High Content Screening (HCS) is of growing importance, allowing for efficient large-scale experiments in biological research, including drug discovery [1,2]. HCS uses microscopy images of cells and advanced image analysis methods to detect the effects of RNAis or small chemical compounds on a cellular process of interest; in a multivariate approach, multiple cell parameters are collected, allowing for a complex analysis. HCS data include information about the library of bioactive molecules, up to several million microscopy images, and image analysis data with more than 100,000 rows and more than 20 columns (matrix: RNAi sample × cell features). For quantified image processing results, there is a need to generate statistical, quality control, and bioinformatics information up to the final hit list. The use of any data analysis tool requires the researcher to tune its parameters appropriately for a specific dataset in order to avoid arbitrary results in terms of the number, size, or density of data clusters.
Today, HCS experiments are routinely performed under multiple experimental conditions on multiple test samples and with multiple staining channels. For example, in an experiment with two genotypes and two time points, a scientist might be interested in finding genes that are similarly expressed at the first time point in both genotypes but expressed differently at another time point. Exploring such data can benefit substantially from interactive visualization tools that bring data mining and analysis closer to the individual researcher in the field by allowing real-time visual data manipulation.
In the context of this paper, HCS data analyses belong to information visualization. Shneiderman [3] published the often-cited Visual Information Seeking Mantra: "Overview first, zoom and filter, then details on demand." Data visualization already played an important role in early reports on cancer microarray studies. For instance, Khan et al. [4] summarized their analysis results in a planar visualization that shows a clear separation of diagnostic cases. In contrast to the approach proposed in this paper, information in their visualization could not be traced back to the original genes, as the plot was obtained by multidimensional scaling and used features crafted in several data preprocessing steps (feature selection through neural network learning and feature construction by principal component analysis). Their visualization was therefore not the result of an explicit search. VizRank [5] can score visualizations according to the degree of class separation and can work through projection candidates to find those with the highest scores. McCarthy et al. [6] were the first to show how the RadViz algorithm can be applied to the analysis of class-labeled datasets from biomedicine. They focused on feature subset selection to reduce the number of genes in the visualization and on feature grouping (clustering anchors of correlated features), and showed that in such an arrangement the visualizations can provide a clear separation of instances of different classes. We present here a software methodology that not only provides interactive visualization of data but also links quantified image processing results with raw microscopic image data, based on the KNIME workflow environment [7].
Data sharing requires centralized image database repositories, which in turn require common image data formats. This is particularly challenging for multidimensional microscopy image data, which are acquired from a variety of microscopes. The 1Click1View (1C1V) method uses Bio-Formats, an open-source library for reading microscopy image formats, which supports different types of screening image data (e.g., TIFF 8-bit and 16-bit, JPEG, and JPEG2000). A user can easily access basic image processing functions from the image viewer.
In order to facilitate data preparation for visual analysis, preprocessing and postprocessing steps should be simple and should not require programming knowledge. We suggest implementing 1C1V in workflow systems, which is crucial for enabling users to deal with data preprocessing and postprocessing. The concept of workflow is not new; it has been used by many organizations over the years to improve productivity and increase the efficiency of data preparation. A workflow system is highly flexible and can accommodate changes or updates whenever new or modified data and corresponding analytical tools become available. 1C1V is a general-purpose interactive visualization methodology for multiparametric screening assays that is easy to integrate with the KNIME [7] workflow system. The basic idea of 1C1V is to provide the data in one visual frame that allows users to gain insight into the data and generate hypotheses by directly interacting with image data. The advantage of 1C1V is that users are directly involved with the image processing results, combining the flexibility, creativity, and general knowledge of the scientist with the enormous number of numerical rows connected to image files. 1C1V was developed to help scientists answer complex queries through interactive visual exploration of screening datasets. One key attribute that distinguishes 1C1V from past and current image data exploration methodologies is that the original image data, their image processing results, and metadata (additional information captured by acquisition software about an image, such as the instruments used, camera, acquisition settings, image size, and resolution) are linked together and are available for filtering and clustering in an interactive view.

Systematic Errors and Quality Control in HCS
HCS has already proven successful as a method to deliver more relevant information simultaneously in one experiment, rather than delivering a single readout in a series of sequential experiments [8][9][10][11]. A prototype scenario might be the series of simultaneously available readouts obtained from a cellular assay. One parameter identifies cells (i.e., membrane dye at a first wavelength), another determines the stage of mitotic change (e.g., fragmented and condensed nuclei at a second wavelength), and a third parameter classifies the apoptotic stage using morphological criteria at a third wavelength. Certainly, these analyses can already be performed almost autonomously with very high throughput. HCS operates with samples in microliter volumes that are arranged in two-dimensional plates. A typical HCS plate contains 96 (12 × 8) or 384 (24 × 16) samples. The quality control and normalization procedure in plate-based primary HCS screens can be performed mainly by using interactive visual analysis.
Quantifying measurement quality has a number of advantages, including objectivity, reproducibility, and ease of comparison across screens. Random and systematic errors can cause candidates to be misinterpreted as hits. They can induce either underestimation (false negatives) or overestimation (false positives) of measured parameters. Various methods dealing with quality control are available in the scientific literature. These methods are discussed in detail in the papers by Brideau et al. [12], Heyse [13], and Zhang et al. [14,15]. There are various sources of systematic errors: (i) systematic errors caused by ageing, reagent evaporation, or decay of cells can be recognized as smooth trends in the plate means/medians; (ii) errors in liquid handling and malfunction of pipettes can generate localized deviations from expected data values; (iii) variation in incubation time, time drift in measuring different wells or different plates, and reader effects may be recognized as smooth attenuations of measurement over an assay.
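The positional effects listed above can be looked for numerically before they are inspected visually. The following is a minimal sketch, not the authors' implementation: it assumes a plate is available as a list of rows of one numeric readout, and the function names, the toy plate, and the 20% tolerance are illustrative assumptions.

```python
from statistics import median


def positional_medians(plate):
    """Return (row_medians, column_medians) for a plate given as a list of rows."""
    row_medians = [median(row) for row in plate]
    col_medians = [median(col) for col in zip(*plate)]
    return row_medians, col_medians


def flag_deviations(medians, tolerance=0.2):
    """Indices whose median differs from the overall median by more than
    `tolerance` (relative). The threshold is an assumed, adjustable value."""
    centre = median(medians)
    return [i for i, m in enumerate(medians)
            if abs(m - centre) > tolerance * abs(centre)]


# Toy 4 x 6 plate in which column index 2 reads systematically low,
# as might happen with a partially blocked pipettor needle.
plate = [
    [100, 102, 60, 99, 101, 100],
    [98, 101, 58, 100, 102, 99],
    [101, 99, 61, 102, 100, 98],
    [99, 100, 59, 101, 99, 101],
]
rows, cols = positional_medians(plate)
suspicious_columns = flag_deviations(cols)
suspicious_rows = flag_deviations(rows)
```

Medians are used instead of means so that a few strong hits on a row or column do not mask or mimic a positional trend.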
Brideau et al. [12] demonstrated examples of systematic signal variations that are present in all plates of an assay. For instance, they illustrate a systematic error caused by the positional effect of the detector. To guarantee reliability and avoid systematic errors, data quality control at different levels is a must. This begins in the optimization phase of the assay: in test runs with a small number of compound plates, the assay has to possess a sufficient signal window (e.g., Z-factor [15]), stability, and sensitivity (e.g., measured by the effects of known control compounds) [16,17]. If problems occur, the parameters of the assay or even its format should be tuned to match the quality criteria of HCS. Data quality control at the level of an individual assay again seeks to guarantee assay stability and sensitivity, which must be monitored constantly using the appropriate controls. At the same time, it tries to pick up process artifacts caused by failures in the screening machinery or the test system (e.g., a blocked pipettor needle, air bubbles in the system, or a changing metabolic state of reporter cells) [13].
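As an illustration of the signal-window criterion mentioned above, the control-based form of the Z-factor (often written Z') can be computed from the positive and negative control wells as Z' = 1 - 3(σp + σn) / |μp - μn|. The sketch below assumes the controls are given as plain lists of one readout; the function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def z_prime_factor(positive, negative):
    """Control-based Z-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; the signal window is zero")
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


# Hypothetical control readouts from one plate.
positive_controls = [100, 102, 98, 101, 99]
negative_controls = [10, 12, 8, 11, 9]
z_prime = z_prime_factor(positive_controls, negative_controls)
```

Values close to 1 indicate a wide separation between the controls relative to their spread; values near or below 0 indicate overlapping controls.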
If unnoticed, these artifacts can result in a high number of false-positive but seemingly "highly specific" hits. Often, such process artifacts can be detected by changes in the overall signal or by specific "signal patterns" on plates (e.g., pipettor line patterns), provided the compound library is randomized across the screening plates. Visual analysis of QC with a link to the images is preferably done directly after the screening run to ensure that such patterns can be traced back to their origin (e.g., the pipettor may be inspected the next morning) and can be unambiguously classified as artifacts or nonartifacts. This distinguishes false positives from real actives, which should be more or less randomly dispersed when considering a whole series of plates from reasonably randomized compound collections.
The hypothesis underlying HCS data analysis is that the measured image descriptors for each single siRNA represent the relative number of objects observed in the fluorescence image. Within-plate reference controls are typically used for these purposes (Figure 1). Controls help to identify plate-to-plate variability and establish assay background levels.
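One common way to use within-plate controls against plate-to-plate variability is to express every well as a percentage of the median negative control on the same plate. This is a sketch of that general technique, not a procedure stated in this paper; the well identifiers and values are made up for illustration.

```python
from statistics import median


def percent_of_control(wells, negative_control_ids):
    """Rescale raw well values so that the median negative control equals 100.

    wells: dict mapping well id -> raw readout for one plate.
    negative_control_ids: ids of the negative-control wells on that plate.
    """
    baseline = median(wells[w] for w in negative_control_ids)
    if baseline == 0:
        raise ValueError("negative-control median is zero; cannot normalize")
    return {well: 100.0 * value / baseline for well, value in wells.items()}


# Two hypothetical plates with the same biology, the second read twice as bright.
plate_a = {"A1": 200, "A2": 210, "B3": 100, "B4": 300}
plate_b = {"A1": 400, "A2": 420, "B3": 200, "B4": 600}
controls = ["A1", "A2"]

norm_a = percent_of_control(plate_a, controls)
norm_b = percent_of_control(plate_b, controls)
```

After normalization the two plates give the same values for the sample wells, so a plot across plates shows sample effects and not the plate offset.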

Interactive Image Data Exploration in Workflow Environment
High dimensionality of datasets makes it difficult to find interesting patterns. To cope with high dimensionality, low-dimensional projections of the original dataset are generated; human perceptual skills are more effective in 2D or 3D displays. Mechanisms have been developed to enable researchers to select low-dimensional projections, but most are static or do not allow users to build one view frame and crosslink what is interesting to them. Software applications based on those mechanisms require many independent clicks from users, each of which changes the content of the main view panel (Figures 2(a) and 2(b)). There is, therefore, a need for a mechanism that allows users to choose the property of the projections they are interested in, rapidly examine projections, apply filters, and locate interesting view panels to detect patterns, clusters, or outliers. In 1C1V, data can be visualized at several stages of analysis and linked to raw image data following the star model concept (Figures 2(c) and 2(d)): one-click, one-view. Unmodified and transformed datasets can be plotted interactively as scatter plots, displayed in histograms, viewed in the image viewer/editor, or viewed as tables. Entire experiments can be displayed in various overview plots in the context of how they are annotated, and figures and tables can be exported for publication. Data can also be exported for custom analyses (e.g., for algorithms that are very expensive in computing power and time), for local development of new analysis methods, and in various defined formats for use in external postanalysis applications.
Data can be visualized at several stages of analysis. Unmodified images and extracted result datasets can be plotted interactively as scatter plots, displayed in histograms, or viewed as tables (Figure 3). An entire result table can be displayed in various overview plots, and tables can be exported for publication. From any data-analysis step, the experiment can be imported into a data-visualization interface in which the data can be browsed and viewed.
Tools based on the 1C1V method should provide a number of projections that interact with image data, filters, tables, and colors (Figure 3). The first step after loading the data (images) and extracting image features is configuring the GUI, that is, deciding which variable is associated with which axis. This configuration is done in real time and can be changed easily. Once the axes are set and the data are mapped to the unit cube, the user can start to explore and identify interesting objects by performing the following operations.
(1) Create 2D scatter plots of any two experiments by orienting the third axis perpendicularly to the display plane.
(3) Assign color and shape to selected objects or groups of genes.
Nauru offers four different ways to select objects. They are as follows.
(1) Mouse Click: a gene can be selected by a mouse click. Annotation for the selected gene and its parameter values for all experiments are shown.
(2) Selection Plane: the selection plane is a 2D rectangle. A user can move and resize the rectangle. All genes that lie within the rectangle after projection onto the 2D screen are selected. A user can browse through the list of annotation and expression value information of the selected genes.
(3) Range Selection: a range of differential cell parameter values can be specified for each of the axes and for color. All genes falling outside the specified range are not shown. This selection mechanism is useful for setting cut-off values at which the differential expression can be considered significant.
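The selection mechanisms listed above reduce to simple geometric tests on the projected points. The sketch below is only an illustration under assumed names: `Point`, the three functions, the click radius, and the gene labels are hypothetical and are not taken from the software described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    gene: str
    x: float
    y: float


def select_by_click(points, cx, cy, radius=0.05):
    """Mouse click: the nearest point within `radius` of the click, else None."""
    def dist2(p):
        return (p.x - cx) ** 2 + (p.y - cy) ** 2
    nearest = min(points, key=dist2, default=None)
    if nearest is None or dist2(nearest) > radius ** 2:
        return None
    return nearest


def select_by_rectangle(points, x_min, x_max, y_min, y_max):
    """Selection plane: all points inside the 2D rectangle."""
    return [p for p in points
            if x_min <= p.x <= x_max and y_min <= p.y <= y_max]


def select_by_range(points, ranges):
    """Range selection: keep points whose value lies in every given (low, high)
    range; `ranges` maps an attribute name such as "x" or "y" to its limits."""
    return [p for p in points
            if all(low <= getattr(p, axis) <= high
                   for axis, (low, high) in ranges.items())]


points = [Point("geneA", 0.1, 0.2), Point("geneB", 0.5, 0.5), Point("geneC", 0.9, 0.8)]
clicked = select_by_click(points, 0.52, 0.49)
boxed = select_by_rectangle(points, 0.0, 0.6, 0.0, 0.6)
in_range = select_by_range(points, {"y": (0.4, 1.0)})
```

Here the coordinates are assumed to be already scaled to the unit range, matching the mapping of the data to the unit cube described earlier.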
Visualizations are the key to analyzing data in 1C1V. A variety of visualization types can be used to provide the best view of the data.

Image Viewer (IV).
The 1C1V image viewer (Figure 4), a key element for HCS data, allows users to visualize and edit images and image processing results. IV is highly integrated with interactive tables (e.g., sample annotation), plots, data visualization features, and data controls. The image viewer displays images very quickly, and images may be viewed in full screen, as slideshows, or as thumbnails. It also offers image processing: the user can rotate images, adjust brightness and color, and apply filters or LUT tables for each independent staining channel. The editor of the image viewer offers a variety of tools, such as the Fire, 3-3-2 RGB, Grays, Ice, and Spectrum lookup tables, can export figures for publication, and provides effects that can be applied to a selected image or to the entire image collection. Moreover, thanks to the Bio-Formats plugin, it reads many formats, including TIFF (8, 16, and 32 bits), JPEG, and JPEG2000. A special feature of the image viewer is its raw data security system: all image processing steps are saved as independent feature masks in a settings file, and raw images stay unchanged. Batch processing (applying the same image processing steps to all images) can be carried out on more than one file at a time by clicking "Save". The viewer offers nearly instantaneous hotkey zooming and allows several instances of the image panel to run at once for browsing in different windows.

Figure 1: Location of controls on a 384-well plate. In a screening process, the designed biological assay is performed by using a robot to add the cell line and specific reagents (siRNA) to each well, which already contains a different oligonucleotide or control. After incubation or other required manipulations, fluorescence images are acquired for every well by an automated microscope. These raw data represent the images of each oligonucleotide or control against a specified target. (a) Generally, in an siRNA screen, 320 different oligonucleotides (blue) are stored in the middle of a 384-well plate, and wells in the first two and last two columns are left empty. (b) Ideally, controls should be located randomly among the 384 wells of each plate. Only the first two and the last two columns are typically available for controls. Despite this limitation, edge-related bias can be minimized by alternating the sixteen positive controls (red) and the sixteen negative controls (yellow) in the available wells, such that they appear equally on each of the sixteen rows and each of the 4 available columns.
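The control arrangement described in the Figure 1 caption can be generated programmatically. The following sketch shows one possible arrangement that satisfies the stated constraints (sixteen positive and sixteen negative controls, spread evenly over the sixteen rows and the four outer columns); it is an assumption for illustration and not necessarily the layout used by the authors.

```python
from collections import Counter

ROWS = "ABCDEFGHIJKLMNOP"          # 16 rows of a 384-well plate
CONTROL_COLUMNS = [1, 2, 23, 24]   # first two and last two columns


def control_layout():
    """Map well id -> "positive" or "negative" for the 32 control wells.

    Each row receives one positive and one negative control; shifting the
    chosen column with the row index gives every control column four
    positives and four negatives.
    """
    layout = {}
    for r, row in enumerate(ROWS):
        positive_column = CONTROL_COLUMNS[r % 4]
        negative_column = CONTROL_COLUMNS[(r + 2) % 4]
        layout[f"{row}{positive_column}"] = "positive"
        layout[f"{row}{negative_column}"] = "negative"
    return layout


layout = control_layout()
totals = Counter(layout.values())
# Count controls per (column, kind); well ids are a row letter plus a column number.
per_column = Counter((int(well[1:]), kind) for well, kind in layout.items())
```

The remaining 32 wells in the outer columns stay empty in this arrangement, and the 320 inner wells hold the siRNA samples.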
IV also recognizes image metadata and optionally displays it alongside the images. The metadata panel can display basic image metadata such as microscope features or exchangeable image file format (Exif) tags. Another advantage of IV is the transformation of image metadata into an interactive table (described below). The metadata panel can also decode CR2, CRW, and Exif files and extract the embedded headers. Exif data are embedded within the image file itself. Exif is a standard that specifies the formats for images, ancillary tags used by digital cameras, scanners, and other systems handling image and sound files recorded by digital cameras. The specification uses the following existing file formats with the addition of specific metadata tags: JPEG DCT for compressed image files, TIFF Rev. 6.0 (RGB or YCbCr) for uncompressed image files, and RIFF WAV for audio files (Linear PCM or ITU-T G.711 μ-Law PCM for uncompressed audio data, and IMA-ADPCM for compressed audio data). It is not supported in JPEG 2000, PNG, or GIF. The Exif format also has standard tags for location information. Currently, a few cameras and a growing number of mobile phones have a built-in GPS receiver that stores the location information in the Exif header when the picture is taken.

Scatter Plots.
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose: scatter plots show how much one variable is affected by another. Each record (or row) in the dataset is represented by a marker whose position depends on its values on the x and y axes.
The above picture demonstrates how scatter plots can be used. Say, for example, that a user wants to show whether the cell number is consistent across all the screening samples. A third variable can be set to correspond to the color or size of the markers, thus adding yet another dimension to the plot. Two-dimensional scatter plots are the default visualization for many datasets.

Bar Charts.
A bar chart is a way of summarizing a set of categorical data. It displays the data using a number of bars of the same width, each of which represents a particular category. The length of each bar is proportional to the count, sum, or average of the values in the category it represents, such as age group or geographical location. In Nauru, it is also possible to color or split each bar by another categorical column in the data, which shows the contribution of different categories to each bar in the bar chart.

Tables.
The table visualization presents the data as a table of rows and columns. The table can handle the same number of rows and columns as any other visualization in Nauru. In the table, a row represents a record. Clicking on a row makes that record active, and holding down the mouse button and dragging the pointer over several rows marks them. The rows can be sorted according to different columns by clicking on the column headers, and unwanted records can be filtered out by using the query devices. Sorting can be done in several steps, for example: first sort according to the values in column 1, then by the values in column 5, then by the values in column 3, and so forth. Different types of visualizations can be shown simultaneously. They are linked and are updated dynamically when the query devices are manipulated (see below). Visualizations can be made to reflect high-dimensional data by letting values control visual attributes such as size, color, shape, rotation, and text labels.
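The linking of visualizations described above, in which every view is refreshed when a query device changes, can be pictured as a shared filter model that notifies all attached views. The sketch below is a generic observer-pattern illustration; the class names, record fields, and file names are hypothetical and do not describe the actual implementation.

```python
class FilterModel:
    """Shared state: all records plus the currently active filters."""

    def __init__(self, records):
        self._records = list(records)
        self._filters = {}   # filter name -> predicate(record) -> bool
        self._views = []

    def attach(self, view):
        """Register a view and give it the current selection."""
        self._views.append(view)
        view.update(self.visible())

    def set_filter(self, name, predicate):
        """Called when a query device (slider, check box, ...) changes."""
        self._filters[name] = predicate
        self._notify()

    def clear_filter(self, name):
        self._filters.pop(name, None)
        self._notify()

    def visible(self):
        return [r for r in self._records
                if all(keep(r) for keep in self._filters.values())]

    def _notify(self):
        shown = self.visible()
        for view in self._views:
            view.update(shown)


class WellListView:
    """Stand-in for a table, scatter plot, or image panel."""

    def __init__(self, title):
        self.title = title
        self.wells = []

    def update(self, records):
        self.wells = [r["well"] for r in records]


records = [
    {"well": "A3", "cell_count": 180, "image": "plate1_A3.tif"},
    {"well": "A4", "cell_count": 35, "image": "plate1_A4.tif"},
    {"well": "B3", "cell_count": 220, "image": "plate1_B3.tif"},
]
model = FilterModel(records)
table_view = WellListView("table")
scatter_view = WellListView("scatter plot")
model.attach(table_view)
model.attach(scatter_view)
model.set_filter("min_cells", lambda r: r["cell_count"] >= 100)
```

Because the views never filter data themselves, a single change to the model keeps the table, the plots, and the image panel consistent.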

The Filter Panel.
Filter panels are used to filter data. Filter panel devices appear on demand in several forms, and the scientist can easily select the type of query device that best suits their needs (e.g., combo boxes, sliders, etc.). When a filter is manipulated by moving a slider or selecting a check box, all visualizations are immediately updated to reflect the new selection of image data. In such cases, researchers want to quickly find only the RNAi constructs, genes, or compounds similar to the expected pattern.

Workflow Platform for Preprocessing and Postprocessing
The concept of workflow is not new, having been used by many organizations over several years to improve productivity and increase efficiency. A workflow system is highly flexible and can accommodate changes or updates when new or modified data and corresponding analytical tools become available. Raw metadata are not always in a format ready for visualization. A workflow environment allows biologists themselves to prepare data for visualization without involving any programming. Workflow systems differ from programming scripts and macros in one important respect: programming systems and macros use text-based languages to create lines of code, while applications like Nauru use a graphical programming language. In general, workflow systems concentrate on the creation of abstract process workflows to which data can be applied when the design process is complete. In contrast, workflow systems in the life science domain are often based on a dataflow model, due to the data-centric and data-driven nature of many scientific analyses. A comprehensive understanding of biological phenomena can be achieved only through the integration of all available biological information and different data analysis tools and applications. A workflow environment allows HCS researchers themselves to perform the integration without involving any programming. As such, the workflow system allows the construction of complex in silico experiments in the form of workflows and data pipelines.
Data pipelining is a relatively simple concept. Any computational component or node has data inputs and data outputs. Data pipelining views these nodes as being connected together by "pipes" through which data flow (Figure 5). In Nauru, a flow is built by dragging and dropping nodes from the node repository into the main panel and connecting them. Nodes are the basic processing units of a workflow (Figure 5). Each node has input and/or output ports. Data are transported through connections from these node output ports into connected input ports. After the nodes are positioned, the inputs of each node are connected to the outputs of a predecessor node. This is achieved by clicking on an output port and dragging the connection to the input port that should receive data from this output. All data flowing between nodes are wrapped within a class called DataTable, which holds metainformation concerning the column headers and the actual data (e.g., numeric data, image name, image path, image processing parameters, and gene annotations). The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key), a specific number of DataCell objects, an image name, and an image path. A workflow system is highly flexible and is designed to accommodate any changes to the table before interactive visualization.
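The table structure described above can be pictured with a small data model. The sketch below only mirrors the description in the text (column headers, rows with a unique key, a fixed number of cells, an image name, and an image path); it is written in Python for brevity and is not the KNIME API, and the field names, row keys, and file paths are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union

CellValue = Union[int, float, str]   # stand-in for a DataCell


@dataclass(frozen=True)
class DataRow:
    key: str                          # unique identifier (primary key)
    cells: Tuple[CellValue, ...]      # one value per column header
    image_name: str
    image_path: str


@dataclass
class DataTable:
    columns: List[str]                # metainformation: the column headers
    rows: List[DataRow] = field(default_factory=list)

    def add_row(self, row: DataRow) -> None:
        if len(row.cells) != len(self.columns):
            raise ValueError("row length does not match the column headers")
        if any(existing.key == row.key for existing in self.rows):
            raise ValueError(f"duplicate row key: {row.key}")
        self.rows.append(row)

    def __iter__(self) -> Iterator[DataRow]:
        return iter(self.rows)

    def column(self, name: str) -> List[CellValue]:
        index = self.columns.index(name)
        return [row.cells[index] for row in self.rows]


table = DataTable(columns=["cell_count", "mean_intensity", "gene"])
table.add_row(DataRow("plate1_A3", (180, 0.42, "geneA"),
                      "plate1_A3.tif", "/data/plate1/plate1_A3.tif"))
table.add_row(DataRow("plate1_A4", (35, 0.77, "geneB"),
                      "plate1_A4.tif", "/data/plate1/plate1_A4.tif"))

# Iterating over rows links a numeric criterion back to the raw image files.
low_count_images = [row.image_path for row in table if row.cells[0] < 100]
```

Keeping the image path in every row is what lets a point selected in a plot be traced back to its raw image.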

Comparison of Existing Visualization Methods in HCS
We compared existing solutions for visualizing HCS data; a summary and comparison of several methodologies and solutions is given in Table 1.

Case Studies
Visualization of image processing features extracted from raw data helps to discover systematic plate-to-plate variation, making measurements comparable across plates. Systematic errors decrease the validity of results by either over- or underestimating true values. These biases can affect all measurements equally or can depend on factors such as well location, liquid dispensing, and signal intensity. Although recent improvements in automation can minimize bias and thereby provide more reproducible results, equipment malfunctions can nonetheless introduce systematic errors, which must be corrected at the data processing and analysis stages. For illustration, two screens, described in Table 2, were taken through visual quality control analysis using the Nauru software. For each cell of the samples, the image processing software (Advanced Cell Classifier [18]) quantified the images by calculating more than a hundred parameters, which mainly fall into the following categories: (i) geometric properties such as the area, perimeter, and shape of the cell nucleus, the location of a cell, and the average distance of a cell to its neighbours; (ii) intensity information such as the content of a protein, as reflected by the intensity of the corresponding fluorescent dye, and the variance, skewness, and kurtosis of the intensity distribution.
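To make the intensity features in category (ii) concrete, the sketch below computes the variance, skewness, and kurtosis of a list of pixel intensities from their central moments. It uses the population (biased) definitions as an assumption; the image processing software cited above may use different estimators, and the function name and sample values are illustrative.

```python
def intensity_moments(values):
    """Mean, variance, skewness, and kurtosis of an intensity distribution,
    computed from population central moments."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two intensity values are required")
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n   # variance
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("constant intensities: skewness and kurtosis are undefined")
    return {
        "mean": mu,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,   # non-excess kurtosis (3 for a normal distribution)
    }


# Hypothetical pixel intensities inside one segmented cell.
features = intensity_moments([1, 2, 3, 4, 5])
```

Each cell then contributes one row of such values to the result table, alongside its geometric features.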
As a result of the analysis of screens 1 and 2, we were able to detect significant errors: (i) plate-to-plate variation (Figure 6 [...] (vii) mistakes in dispensing during cell seeding (Figure 7(b)), (viii) cell seeder and seeding distribution over 384-well plates (Figure 7(c)).

Table 1.
Method and solution | Description | Reference
Visual data exploration (VDE) | The basic idea of VDE is to present the data in some visual form that allows users to gain insight into the data and generate hypotheses by directly interacting with the data. The advantage of VDE is that users are directly involved in the data mining process, combining the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and computational power of computers. This process is especially useful when little is known about the data and the exploration goals are vague, such as in analyzing a huge number of RNAi-HCS images. However, one-click access to plots, data tables, and an image viewer in a single view frame is not available. | [19]
Cellomics Discovery ToolBox and visualization method | These methods focus on visualizing simple quantitative readouts of markers instead of the images themselves and, especially, the relationships among images, which convey profound information closely related to the effects of chemical compounds, gene functions, and biological processes. | http://www.cellomics.com and [20]
PhotoFinder and Personal Photo Libraries | Those methods focus on image database visualization targeted at personal photo albums, which are much smaller than HCS image databases, and do not consider computational needs specific to HCS image analyses. | [21,22]
ImCellPhen: interactive mining of cellular phenotypes | This is a method and a tool for interactive mining of cellular phenotypes that provides intelligent interfaces for visualizing large-scale RNAi-HCS image databases. However, this method does not provide easy-to-use (one-click) filtering functionality for image properties and image processing results. | [19], http://combio.cs.brandeis.edu/imcellphen/
The Open Microscopy Environment (OME) | OME provides an open-source browser to navigate HCS image databases that are described as a quasi-hierarchical structure representing the relationship between projects and datasets. However, this navigation scheme was not designed to facilitate discovery of screening hits among all available parameters and categories of data. | [23]
Advanced Cell Classifier | Advanced Cell Classifier (ACC) is a data analysis program for evaluating cell-based high-content screens. Its basic aim is to provide a very accurate analysis with minimal user interaction, using advanced machine learning methods and visual learning of image data sets. However, ACC does not provide full interactivity together with filtering options for three data sources in the same view: library, images, and image processing results. | [24]

Conclusion
In this paper, we offer the interactive methodology 1C1V for the analysis of HCS datasets. High dimensionality of the datasets hinders users from recognizing important patterns in them. We add a scatter plot browsing mechanism that helps users select interesting 2D projections of the high-dimensional multiparametric dataset. An even more difficult problem is to understand the biological significance of the patterns found in the datasets. Considering the needs of professionals working in the field of HCS data analysis, 1C1V is an effective and efficient method for interactive image data exploration, detection of systematic errors, and quality control. We have demonstrated through simple examples how quickly a researcher can investigate problems in a dataset before making a final decision. This work is part of our continuing effort to give users more control over data mining processes and to enable more interaction with analysis results through interactive information visualization techniques. These efforts are designed to help users perform exploratory data analysis, establish meaningful hypotheses, and verify results. In this paper, we show how those visualization methods can help molecular biologists analyze and understand multidimensional RNAi cell-based screening data.

Table 2.
(1) This screen studied the biogenesis of the small ribosomal subunit (40S subunit) in human cells. To this end, an assay for visual detection of nuclear maturation defects of the 40S subunit was developed. In this assay, a HeLa cell line bearing a fluorescently tagged ribosomal protein of the 40S subunit (Rps2-YFP), which is expressed only upon induction, was used to selectively visualize newly synthesized 40S subunits. The assay was developed to detect nuclear 40S maturation defects upon depletion of a protein by RNAi. This made it possible to identify proteins functioning in 40S biogenesis in human cells and, according to the classification, to assign their requirement to nucleolar or nucleoplasmic maturation events. Screening results data for QC: 76747 data points corresponding to individual siRNA oligonucleotides (4 siRNAs per targeted gene), including 4 numeric parameters and gene/siRNA annotation.
(2) miRNA biogenesis-Antagomirs Screen. The goal of the screen was to monitor the levels of two individual miRNAs (mir16 with GFP and mir22 with mCherry) within the cells and to investigate modulators of miRNA expression/function. In addition, the experiment was designed to study an inhibitor of miRNAs (Antagomir); the point of interest was how it is taken up and how it acts (which genes are important). Screening results data for QC: 26880 data points corresponding to pooled siRNAs (4 oligonucleotides per targeted gene), including 5 numeric parameters and gene/siRNA annotation.

Conflict of Interests
There is no conflict of interests to be declared.