
Recent and forthcoming advances in instrumentation, and giant new surveys, are creating astronomical data sets that are not amenable to the methods of analysis familiar to astronomers. Traditional methods are often inadequate not merely because of the size in bytes of the data sets, but also because of the complexity of modern data sets. Mathematical limitations of familiar algorithms and techniques in dealing with such data sets create a critical need for new approaches to data representation and analysis.

Astronomy is undergoing a rapid, unprecedented and accelerating growth in both the amount and the intrinsic complexity of data. This results partly from past and future large sky surveys such as the Sloan Digital Sky Survey [ ]; data on 10^{3}–10^{4} objects appear annually. The increasing availability of multiple-object spectrographs deployed at ground-based observatories enables observers to obtain spectra of hundreds of objects in a single exposure [

In addition, mathematically new (for astronomy) forms of data are starting to appear, such as those of the ESA Planck mission in which the cosmic microwave background (CMB) is characterized by a

Other great challenges arise from the so-called three-dimensional (3D) reconstructions. For example, a very important and difficult problem of solar astrophysics is 3D reconstruction of coronal mass ejections [

The richness and complexity of new data sets will provide astronomy with a wealth of information and most of the research progress expected from such sets inherently rests in their enormity and complexity. In order to take full advantage of immense multispectral, multitemporal data sets, their analysis should be

The astronomical community is becoming increasingly aware of the fact that new and advanced methods of applied mathematics, statistics and information technology should be developed and utilized. Three of the State of the Profession Position Papers submitted to the Astronomy and Astrophysics Decadal Survey strongly emphasized astronomy’s need for new computational and mathematical tools in the coming decade [

However, there is a large communication gap between astronomy and other fields where adequate solutions exist or are being developed: applied mathematics, statistics and artificial intelligence.

The principal objectives of this paper are twofold. First, we wish to bring attention to some specific needs for new data analysis techniques. Second, we describe some innovative approaches and solutions to some of these problems and give examples of novel tools that permit effective analysis, representation, and visualization of the new multispectral/multitemporal data sets that offer enormous richness if they are mined with the appropriate tools. The extensive amount of relevant work already accomplished in disciplines outside of astronomy does not allow us to offer a complete review of all aspects of these complex topics and problems, but we have selected a number of important examples.

The structure of the paper is as follows. In Section

Vast data sets demand automated or semi-automated image processing and quality assessment of the processed images. Indeed, the sheer number of observed objects awaiting analysis makes obvious the need for sophisticated automation of object detection, characterization and classification. Adapting recent advances of computer vision and image processing for astronomy and designing and implementing an image processing framework (see below) that would utilize these continuing achievements, remains, however, a major challenge.

Developers and users alike have realized that there is more to creating an application than simple programming. The objective of creating a flexible (see below) application is customarily achieved by exploiting the object-oriented paradigm [

The keystone elements of a system that unifies a wide range of methods for astronomical image processing should be computational and visualization modules. Such a platform-independent framework with an integrated environment was described in Pesenson et al. [

ESO-La Silla; courtesy of A. Grado, INAF-Osservatorio Astronomico di Capodimonte. (a) Overlaid pre- and post-processed images; the red cross marks approximately the edge of the diffuse halo. (b) Flux cut through the overlaid pre- and post-processed images (red and green, respectively); the vertical grey line (close to the center of the plot) corresponds to the red cross in the plot on the left. After the preprocessing, the average level "outside" the halo is lower than the average level "inside", thus enabling a better automated separation of the diffuse halo from the background.

Morphology unveiling [

Overlaid pre- and postprocessed images [

Three different visualizations of the bow shock structure [

The framework deals primarily with image regularization and segmentation. These are fundamental first steps for the detection and characterization of image elements or objects and as such they play principal roles in the realization of automated computer vision applications. Image segmentation can roughly be described as the process that subdivides an image into its constituent parts (objects) and extracts those parts of interest. Since the inception of image segmentation in the 1960s, a large number of techniques and algorithms for image segmentation have been developed.

However, due to revolutionary advances in instrumentation, the complexity of images has changed significantly: grey-level images have been extended to multi- and hyperspectral images, 2D images to 3D, still images to image sequences, and scalar images to tensor-valued images (polarization data), and so forth. Modern, cutting-edge methods for image processing have been, and continue to be, developed by information scientists outside of astronomy. The substantial progress in this direction made by the image processing and computer vision communities [

Multiscale image representation and enhancement are such approaches. They have become important parts of computer vision systems and modern image processing. The multiscale approach has proven to be especially useful for image segmentation and for feature and artifact detection. It enables a reliable search for objects of widely different morphologies, such as faint point sources, diffuse supernova remnants, clusters of galaxies, undesired data artifacts, as well as unusual objects needing detailed inspection by a scientist. It is well known that in astronomical images one often sees both point sources and extended objects such as galaxies embedded in extended emission (see, e.g., Figures

One effective approach to denoising is based on partial differential equations and may be seen as local adaptive smoothing of an image along defined directions that depend on local intensities. The goal is to smooth an image while preserving its features, by smoothing mostly along the directions of edges and avoiding smoothing orthogonally to them. Many regularization schemes have been developed for the case of simple two-dimensional scalar images. Extending these algorithms to vector-valued images is not straightforward. For a gray-scale image, the gradient is always perpendicular to the level sets of the image; in the multichannel case, however, this property does not hold. Applying nonlinear diffusion to each channel or spectral band separately is one possible way of processing multi- and hyperspectral cubes; however, it does not take advantage of the richness of the multi/hyperspectral data. Moreover, if the edge detector acts only on one channel, it may lead to undesirable effects such as color blurring, where edges in color images are blurred due to the different local geometry in each channel. Hence, a coupling between image channels should appear in the equations through the local vector geometry. We achieve this by implementing a nonlinear diffusion on a weighted graph, thus generalizing the approach adopted by Pesenson et al. [
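As an illustration of this kind of edge-preserving regularization, here is a minimal sketch of a Perona–Malik-type nonlinear diffusion for a single-channel 2D image (the scalar case only, not the coupled multichannel scheme described above; the function name, the exponential conductance and all parameter values are our illustrative choices):

```python
import numpy as np

def perona_malik(img, n_iter=20, kappa=0.1, dt=0.2):
    """Edge-preserving smoothing: diffuse strongly in flat regions,
    weakly across strong gradients (edges)."""
    u = img.astype(float).copy()
    for _ in range(n_iter):
        # differences toward the four nearest neighbours
        dN = np.roll(u, -1, axis=0) - u
        dS = np.roll(u, 1, axis=0) - u
        dE = np.roll(u, -1, axis=1) - u
        dW = np.roll(u, 1, axis=1) - u
        # conductance: small where the local gradient is large (an edge)
        g = lambda d: np.exp(-(d / kappa) ** 2)
        # explicit update of the nonlinear diffusion equation
        u += dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return u
```

Because the conductance g is small where the local gradient is large, the scheme smooths noise in flat regions while leaving strong edges nearly untouched.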

Our framework handles two-dimensional scalar images and paves the way to semi-automated and fully automated image processing and image quality assessment. However, the ability to extract useful knowledge from high-dimensional data sets is becoming more and more important (Section

Astronomy has long found the use of multiple data dimensions to be a crucial aid to progress. For example, surveys of H-alpha sources, when cross-correlated with spectral types, guided astronomers to the discovery of interesting new types of objects such as Herbig Ae/Be stars [

Simple color-magnitude diagrams are another traditional tool taking advantage of multiple data dimensions (e.g., cataloging YSO candidates from Spitzer survey data, Whitney et al. [

As multiwavelength data have become available for huge numbers of objects in the past few decades, the number of data dimensions for a typical object has grown beyond what can be visualized and studied using classical color-color plots and correlations using only a few dimensions. For example, Egan et al. [

Color-color plots are a familiar form of scientific visualization. This example (from [

While Figure

The catalog information available for a great many objects already includes such information as magnitude, several optical and IR colors, metallicities, spectral types, and so on. A galaxy catalog would of course include other parameters such as morphology and redshift. When spectra are considered and compared in detail, the huge number of emission and absorption features obviously compounds the problem vastly. Still more complexity is added when one attempts to correlate a large grid of models with a large data set having many dimensions (e.g., the YSO analysis [

Testing the algorithm on synthetic data. A simulated three-dimensional set of a thousand uniformly distributed random points with a double-diamond pattern created by assigning large weights to the edges connecting the points in the pattern (while the rest of the weights are negligible). (a) and (b): two screenshots from a running animation; each point in the set oscillates (in this case in three dimensions) with its own random frequency. (c) Synchronization of the points connected by high-weight edges makes it possible to reveal the pattern visually (to avoid clutter, the edges are not displayed in the animation) or automatically, by selecting synchronized points and highlighting them.

Other examples of high-dimensional data include, but are not limited to, multiparametric data sets (e.g., the manifold of galaxies, [

All these examples clearly demonstrate that the automated and semiautomated processing required by the unprecedented and accelerating growth in both the amount and the complexity of astronomical data demands new ways of information representation. In the next section we discuss such approaches and describe some of the original algorithms we have developed in the course of this ongoing work.

Because approaches to complex data require advanced mathematics, astronomers who wish to take advantage of them will need at least some basic knowledge of new, unfamiliar mathematical concepts and terminology. The full practical adoption of such methods requires interdisciplinary scientists who understand the new approaches in depth and are interested in working with astronomers to adapt and apply the methods to astronomical data sets. This is already happening as part of some research. The first step for a “neophyte” astronomer, however, is to learn “what is out there” as a basis for further investigation and consideration of the utility of various methods. The goal of this section is to offer a very basic introduction and explanation of some of these unfamiliar concepts.

Machine learning [

Bellman [ ] coined the term "curse of dimensionality" for the difficulties that arise in high dimensions: sampling the unit interval at a grid spacing of 0.01 requires 10^{2} points, whereas sampling the 10-dimensional unit hypercube at the same spacing requires 10^{20} sample points. Thus, in some sense, the 10-dimensional hypercube can be said to be a factor of 10^{18} "larger" than the unit interval.
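The arithmetic behind this example is easy to verify directly (a sketch; the grid spacing of 0.01 is the illustrative choice implied by the numbers above):

```python
# Sampling a unit interval at grid spacing 0.01 takes 10^2 points;
# the same spacing on a d-dimensional unit hypercube takes (10^2)^d.
def grid_points(spacing, dim):
    per_axis = round(1 / spacing)
    return per_axis ** dim

assert grid_points(0.01, 1) == 10**2
assert grid_points(0.01, 10) == 10**20
# hence the 10-D hypercube is "larger" by a factor of 10^18
assert grid_points(0.01, 10) // grid_points(0.01, 1) == 10**18
```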

Euclidean spaces are usually used as models for traditional astronomical data types (scalars, arrays of scalars). Another

These sorts of problems demonstrate that in order to make practical the extraction of meaningful structures from multiparametric, high-dimensional data sets, a low-dimensional representation of data points is required. Dimension reduction (DR) is motivated by the fact that the more we are able to reduce the dimensionality of a data set, the more regularities (correlations) we have found in it and therefore, the more we have learned from the data. Data dimension reduction is an active branch of applied mathematics and statistics [

Classical approaches to dimension reduction not unfamiliar to astronomy are principal components analysis (PCA) [
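As a minimal sketch of PCA via the singular value decomposition (illustrative only; production codes add scaling, whitening and robust variants):

```python
import numpy as np

def pca(X, k):
    """Project n points in d dimensions onto the k principal axes."""
    Xc = X - X.mean(axis=0)            # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # coordinates in the top-k subspace
```

For data that truly lie near a k-dimensional linear subspace, the first k components capture essentially all of the variance; for data on a curved manifold they do not, which is the drawback discussed below.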

PCA has a serious drawback in that it does not explicitly consider the structure of the manifold on which the data may possibly reside. In differential geometry an

We have recently developed some advanced, original methods for performing nonlinear DR, which do not suffer from the limitations of PCA. In what follows, we briefly describe these methods. First, we introduce some more concepts and methods that have proved to be effective in the area of machine learning.

In the context of data retrieval and processing, dimensionality reduction methods based on

The graph representation of structured data provides a fruitful model for the relational data mining process. A graph is a collection of nodes and links between them; the nodes represent data points and the weights of the links, or edges, indicate the strength of relationships. A graph in which each edge is replaced by a directed edge is called a directed graph, or digraph [

The modern approach to multidimensional images or data sets is to approximate them by graphs or Riemannian manifolds. The first important, and very challenging, step is to convert such a data cloud to a weighted finite graph. The next important problem is the choice of the "right" weights to assign to the edges of the constructed graph. The weight function describes a notion of "similarity" between the data points and as such strongly affects the analysis of the data. The weights should be determined entirely by the application domain. The most obvious way to assign weights is to use a positive kernel, such as an exponential function whose exponent depends on the local Euclidean distance between data points and a subjectively chosen parameter called the "bandwidth". (There are also other ways of assigning weights, which depend on more complex mathematics than we discuss in this article.)
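The exponential-kernel weight assignment just described might be sketched as follows (the `bandwidth` argument is the subjectively chosen scale parameter; the function name is ours):

```python
import numpy as np

def weight_matrix(X, bandwidth):
    """Exponential (heat-kernel) weights from pairwise Euclidean
    distances between the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * bandwidth ** 2))
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W
```

Nearby points receive weights close to 1 and distant points weights close to 0, so the bandwidth implicitly decides which relationships the subsequent analysis treats as significant.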

Next, after constructing a weighted graph, one can introduce the corresponding combinatorial Laplace operator [
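A sketch of how the combinatorial Laplacian L = D − W of a weighted graph yields a low-dimensional embedding (the Laplacian-eigenmaps construction; normalization conventions vary between implementations):

```python
import numpy as np

def laplacian_eigenmaps(W, k):
    """Embed a weighted graph into R^k using the eigenvectors of the
    combinatorial Laplacian L = D - W that belong to the smallest
    nonzero eigenvalues."""
    D = np.diag(W.sum(axis=1))         # degree matrix
    L = D - W
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # skip the constant eigenvector (eigenvalue 0)
    return vecs[:, 1:k + 1]
```

The first nontrivial eigenvector (the Fiedler vector) already separates weakly connected clusters, which is why spectral methods are effective for segmentation and clustering.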

Most existing data mining and network analysis methods are limited to pairwise interactions. However, sets of astronomical objects usually exhibit multiple relationships, so restricting analysis to the dyadic (pairwise) relations leads to loss of important information and to missing discoveries. Triadic, tetradic or higher interactions offer great practical potential. This has led to the approach based on hypergraphs (e.g., [

Data sets having fractional dimensions ("fractals": Mandelbrot [

Obviously, an important first step in practical dimensionality reduction is a good estimate of the intrinsic dimension of the data. Otherwise, DR is no more than a risky guess since one does not know to what extent the dimensionality can be reduced. To enable analysis of astronomical data sets that exhibit a fractal nature, we are currently developing a practical concept of spectral dimensionality, as well as original algorithms for sampling, compression and embedding fractal data.
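One practical estimator of intrinsic (fractal) dimension is box counting: count the occupied boxes N(ε) at several scales ε and fit the slope of log N(ε) against log(1/ε). A sketch (the function name and the choice of scales are illustrative, not our production algorithm):

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the box-counting dimension as the slope of
    log N(eps) versus log(1/eps), where N(eps) is the number of
    occupied boxes of side eps."""
    counts = []
    for eps in scales:
        boxes = np.floor(points / eps)                 # box index per axis
        counts.append(len(np.unique(boxes, axis=0)))   # occupied boxes
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)),
                          np.log(counts), 1)
    return slope
```

For points filling a line the estimate is close to 1, for points filling a square close to 2; a fractal set yields a fractional value, signalling how far dimension reduction can safely go.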

The approaches described above dealt with compact manifolds and finite graphs. However, massive data sets are more adequately described by noncompact manifolds and infinite graphs. In order to deal with extremely large data sets we extended dimension reduction to

Despite the important and appealing properties of the above mentioned dimension reduction algorithms, both linear and nonlinear approaches are sensitive to noise, outliers, and missing measurements (bad pixels, missing data values). Because noise and data imperfections may change the local structure of a manifold, locality preservation means that existing algorithms are topologically unstable and not robust against noise and outliers [

This is obviously a serious drawback because astronomical data are always corrupted by noise. Budavári et al. [

Thus the practical usage of dimension reduction demands careful improvement of signal-to-noise ratio without smearing essential features. Implementing a nonlinear diffusion on weighted graphs (Section

In what follows we briefly describe a new, original unifying approach to segmentation of images in particular and pattern recognition and information visualization in general. Image segmentation (see Section

Solutions to the manifold learning problem can be based on intuition derived from physics. A good example of this is the approach of Horn and Gottlieb [

We developed [

The approaches presented in this section enable interpolation, smoothing and immersions of various complex (dozens or hundreds of useful parameters associated with each astronomical object) and large data sets into lower-dimensional Euclidean spaces. Classification in a lower-dimensional space can be done more reliably than in high dimensions. Thus, DR can be significantly beneficial as a pre-processing step for many existing astronomical packages, such as, for example, the popular source extractor SExtractor [

Data intensive astrophysics requires an interdisciplinary approach that will include elements of applied mathematics [

The problem of processing data that lie on a manifold is very important for cosmological data analysis. The standard, powerful data analysis package HEALPix [

The essential part of the general analysis based on the wavelet-like constructions is sampling of bandlimited functions. The mathematical foundations of sampling on arbitrary Riemannian manifolds of bounded geometry were laid down by Pesenson [

Multiscale data analysis has proved to be a very powerful tool in many fields. Applications of multiscale image analysis to astronomy were pioneered by Starck et al. [

Spectral methods and diffusion maps have recently emerged as effective approaches to nonlinear dimensionality reduction [

Manifold learning may be seen as a DR procedure aiming at capturing the degrees of freedom and structures (clusters, patterns) within high-dimensional data. Manifold learning nonlinear algorithms such as isometric mapping (ISOMAP) by Tenenbaum et al. [
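The core of the ISOMAP idea can be sketched in a few lines: build a k-nearest-neighbor graph, replace Euclidean by geodesic (shortest-path) distances, and embed them with classical multidimensional scaling. This is a bare-bones illustration under those assumptions, not the published algorithm in full:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, k):
    """Minimal ISOMAP: geodesic distances on a kNN graph followed by
    classical MDS into R^k."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    # keep only each point's n_neighbors nearest edges (inf = no edge)
    G = np.full((n, n), np.inf)
    for i in range(n):
        nn = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)                       # symmetrize
    geo = shortest_path(G, method="D", directed=False)
    # classical MDS on the geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
```

Applied to points along a curved one-dimensional filament, the single embedded coordinate recovers the ordering along the curve, which plain PCA cannot guarantee.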

The AstroNeural collaboration group [

Comparato et al. [

Draper et al. [

The Center for Astrostatistics (CASt) at Pennsylvania State University provides a wealth of resources (codes, data, tutorials, programs, etc.) related to challenges in statistical treatments of astrophysical data:

A large amount of practical and up-to-date information (texts, tutorials, preprints, software, etc.) related to Bayesian inference in astronomy and other fields is provided by T. Loredo on the website Bayesian Inference for the Physical Sciences (BIPS) at

The International Computational Astrostatistics (InCA) Group at Carnegie Mellon University develops and applies new statistical methods to inference problems in astronomy and cosmology, with an emphasis on computational nonparametric approaches (see for details

The AstroMed project at Harvard University’s IIC is dedicated to the application of medical image visualization to 3D astronomical data [

Compressed sensing and the use of sparse representations offer another promising new approach. Traditionally it has been considered unavoidable that any signal must be sampled at a rate of at least twice its highest frequency in order to be represented without errors. However, a technique called compressed sensing, which permits accurate recovery of sparse or compressible signals from far fewer samples, has been the subject of much recent research. It holds great promise for new ways to compress imaging without significant loss of information, thus easing analysis limitations imposed by finite computing resources. The power of compressed sensing was strikingly illustrated when an object was successfully imaged in some detail by a camera composed of a
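A toy sketch of sparse recovery in the compressed-sensing setting, using iterative soft-thresholding (ISTA) to minimize an L1-penalized least-squares objective; the function name and all parameter values are illustrative choices, not any particular published reconstruction code:

```python
import numpy as np

def ista(A, y, lam=0.05, n_iter=500):
    """Recover a sparse x from m << n measurements y = A x by
    iterative soft-thresholding (min ||Ax - y||^2 + lam * ||x||_1)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = x - step * A.T @ (A @ x - y)            # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrink
    return x
```

With a random Gaussian measurement matrix, a handful of nonzero coefficients can be located from far fewer measurements than unknowns, which is the essence of sampling below the traditional rate.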

Various data types together with methods used for their representation are briefly summarized in Table

Examples of complex data types and some of the methods for their representation and processing.

Data Types | Some Astronomical Applications | Traditional Approaches to Data Representation & Processing | Advanced Approaches to Data Representation & Processing
---|---|---|---
Vector data | (1) Multiwavelength observations. (2) Multitemporal observations. (3) VO. (4) Spectra. | (1) Linear dimension reduction: PCA and its modifications. | (1) Spectral methods, eigenmaps, diffusion maps, LLE, ISOMAP. (2) Sampling on graphs. (3) Methods based on nonlinear dynamics. (4) Neural networks, fuzzy-C sets. (5) Genetic algorithms. (6) Scientific visualization. (7) Compressed sensing.
Manifold-valued and/or manifold-defined | (1) Polarization measurements (CMB). (2) Gravitational lensing. (3) Solar astrophysics. (4) Scientific visualization. | (1) Various sampling distributions on the sphere. | (1) HEALPix (data on the 2D sphere). (2) Needlets. (3) Sampling on manifolds.

Extremely large data sets, as well as the analysis of hundreds of objects each having a large number of data dimensions, present astronomy with unprecedented challenges. The challenges are not only about database sizes in themselves, but about how intelligently one organizes, analyzes, and navigates through the databases, and about the limitations of existing data analysis approaches familiar to astronomy. The answers to these challenges are not trivial, and for the most part lie in complex fields of research well outside the training and expertise of almost all astronomers. Fortunately, other disciplines such as imaging science and earth sciences have for many years been grappling with the same sorts of problems. Fruitful interdisciplinary work has already become a regular feature of research in those other disciplines, and has resulted in applications of crucial value to other sciences seeking to take advantage of complex, giant data sets in their respective fields. This work has brought about many helpful applications and promising paths for further progress that potentially have significant value to astronomy.

Multidimensional image processing, image fusion (combining information from multiple sensors in order to create a composite enhanced image) and dimension reduction (finding lower-dimensional representation of high-dimensional data) are effective approaches to tasks that are crucial to multitemporal, multiwavelength astronomy: study of transients, large-scale digital sky surveys, archival research, and so forth. These methods greatly increase computational efficiency of machine learning algorithms and improve statistical inference, thus facilitating automated feature selection, data segmentation, classification and effective scientific visualization (as opposed to illustrative visualization). Dimensionally reduced images also offer an enormous savings in storage space and database-transmission bandwidth for the user, without significant loss of information, if appropriate methods are used.

To effectively use the large, complex data sets being created in 21st Century astronomy, significant interdisciplinary communication and collaboration between astronomers and experts in the disciplines of applied mathematics, statistics, computer science and artificial intelligence will be essential. The concepts and approaches described in this paper are among the first steps in such a broad, long-term interdisciplinary effort that will help bridge the communication gap between astronomy and other disciplines. These concepts, and the approaches derived from them, will help to provide practical ways of analysis and visualization of the increasingly large and complex data sets. Such sophisticated new methods will also help to pave the way for effective automated analysis and processing of giant, complex data sets.

The authors would like to thank Alanna Connors for stimulating discussions and useful suggestions. They would also like to thank the referee for constructive suggestions that led to substantial improvements of the article. The first author would like to thank Michael Werner for support. This work was carried out with funding from the National Geospatial-Intelligence Agency University Research Initiative (NURI), Grant HM1582-08-1-0019, and support from NASA to the California Institute of Technology and the Jet Propulsion Laboratory.