Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

1Department of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30080, USA 2Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14620, USA 3Department of Biomedical Informatics, University at Buffalo, Buffalo, NY 14620, USA 4School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA 5Department of Mechanical Engineering, Iowa State University, Ames, IA 50011, USA


Introduction
Computational analysis of high-dimensional data continues to be a challenging problem, spurring the development of numerous computational techniques.An important and emerging class of methods for dealing with such data is dimensionality reduction.In many applications, features of interest can be preserved while mapping the high dimensionality data to a small number of dimensions.These mappings include popular techniques such as principle component analysis (PCA) [1] and complex nonlinear maps such as Isomap [2] and kernel PCA [3].
Linear manifold learning techniques, for example, PCA or multidimensional scaling [4][5][6][7], existed as orthogonalization methods for several decades.Nonlinear methods like Isomap, LLE (locally linear embedding) [8], and Hessian LLE [9] were discovered recently.Another class of methods that emerged in the past few years are the unsupervised learning techniques, including artificial neural networks for Sammon's nonlinear mapping [10], Kohenen's or self organizing maps (SOM) [11], and curvilinear component analysis [12].Modifications to the existing algorithms of manifold learning, to improve either their efficiency or performance, were another area where efforts were focused [13][14][15][16].For example, Landmark Isomap [17] is a modification to the original Isomap method to extend its usage to larger datasets by picking a few representative points and applying Isomap technique to them.Along with the emergence of new manifold learning techniques, different sequential implementations of these 2 Scientific Programming techniques, targeting various hardware platforms and various programming languages, have been developed [18,19].
Dimensionality reduction techniques are often compute intensive and do not easily scale to large datasets.Recent advances in high-throughput measurements using physical entities, such as sensors, or results of complex numerical simulations are generating data of extremely high dimensionality.It is becoming increasingly difficult to process such data sequentially.
In this paper, we propose a parallel framework for dimensionality reduction.Rather than focusing on a particular method, we consider the class of spectral dimensionality reduction methods.Till date, few efforts have been made in developing parallel implementations of these methods, other than development of a parallel version of PCA [20,21], parallelization of multidimensional scaling (MDS) for genomic data [22], and algorithm development for GPU platforms [23,24].
We perform a systematic analysis of spectral dimensionality reduction techniques and provide their unified view that can be exploited by dimensionality reduction algorithm designers.We identify common computational building blocks required for implementing spectral dimensionality reduction methods and use these abstractions to derive a common parallel framework.We implement such a framework and show that it can handle large datasets and it scales to thousands of processors.We demonstrate advantages of our software by analyzing 75,000 images of morphology evolution during manufacturing of organic solar cells, which enables us to visually inspect and correlate fabrication parameters with morphology.
The remainder of this paper is organized as follows.In Section 2 we introduce the dimensionality reduction problem and describe basic spectral dimensionality reduction techniques, highlighting their computational kernels.In Section 3 we provide a detailed description of our parallel framework including algorithmic solutions.Finally, in Section 4 we present experimental results, and we conclude the paper in Section 5.

Definitions and Methods Overview
The problem of dimensionality reduction can be formulated as follows: Consider a set  = { 0 ,  1 , . . .,  −1 } of  points, where   ∈ R  , and  ≫ 1.We are interested in finding a set Here, |−| ℎ denotes a specific norm that captures properties we want to preserve during dimensionality reduction [25].For instance, by defining ℎ as Euclidean norm we preserve Euclidean distance, thus obtaining a reduction equivalent to the standard technique of principal component analysis (PCA) [1].Similarly, defining ℎ to be the angular distance (or conformal distance [26]) results in locally linear embedding (LLE) [8] that preserves local angles between points.In a typical application [27,28],   represents a state of the analyzed system, for example, temperature field and concentration distribution.Such state description can be derived from experimental sensor data or can be the result of a numerical simulation.However, irrespective of the source, it is characterized by high dimensionality, that is,  is typically of the order of 10 6 [29].While   represents just a single state of the system, common data acquisition setups deliver large collections of such observations, which correspond to the temporal or parametric evolution of the system [27].Thus, the cardinality  of the resulting set  is usually large ( ∼ 10 4 -10 5 ).Intuitively, information obfuscation increases with data dimensionality.Therefore, in the process of dimensionality reduction (DR) we seek as small a dimension  as possible, given constraints induced by the norm | − | ℎ [25].Routinely,  < 4 as it permits, for instance, visualization of the set .
DR techniques have been extensively researched over the last decade [25].In particular, methods based on spectral data decomposition have been very successful [1,2,9] and have been widely adopted.Early approaches in this category exploited simple linear structure of the data, for example, PCA or multidimensional scaling (MDS) [30].More recently, techniques that can unravel complex nonlinear structures in the data, for example, Isomap [2], LLE, and kernel PCA [3], have been developed.While all these methods have been proposed taking into account specific applications [19,25], their underlying formulations share similar algorithmic mechanisms.In what follows we provide a more detailed overview of spectral DR techniques and we identify their common computational kernels that form the basis for our parallel framework.

Spectral Dimensionality Reduction.
The goal of DR is to identify a low-dimensional representation  of the original dataset , that preserves certain predefined properties.The key idea underpinning spectral DR can be explained as follows.We encode desired information about , that is, topology or distance, in its entirety by considering all pairs of points in .This encoding is represented as a matrix  × .Next, we subject matrix  to unitary transformation , that is, transformation that preserves norm of , to obtain its sparsest form Λ, where  = Λ  .Here, Λ × is a diagonal matrix with rapidly diminishing entries.As a result, it is sufficient to consider only  entries of Λ to capture all the information encoded in .These  entries constitute the set .The above procedure hinges on the fact that unitary transformations preserve original properties of  [31].Note also that it requires a method to construct matrix  in the first place.Indeed, what differentiates various spectral methods is the way information is encoded in .
We summarize the general idea of spectral DR in Algorithm 1.In the first four steps we construct the matrix .As indicated, this matrix encodes information about the property that we wish to preserve in the process of DR.To obtain  we first identify the  nearest neighbors (NN) of each point   ∈ .Note that currently all studied methods use NN defined in Euclidean space.This enables us to define a weighted graph  that encapsulates, both distance and topological, properties of the set .Given graph , we can construct a function   :  ×  → R to isolate the desired property.For instance, consider the Isomap algorithm in which the geodesic distance is maintained.In this case,   returns the length of the shortest path between   and   in .Note that for some methods   is very simple; for example, for PCA it is equivalent to a distance measure ,   (  ,   ) = (  ,   ), while for other methods   can be more involved.Differences between various DR methods and their corresponding function   are outlined in Table 1.The property extracted by function   is stored in an auxiliary matrix , which is next normalized to obtain matrix .This process of normalization is a simple algebraic transformation, which ensures that  is centered and hence that the final low-dimensional set of points  contains the origin and is not an affine translation [31].Subsequently,  is spectrally decomposed into its eigenvalues that constitute the sparsest representation of .Resulting eigenvectors and eigenvalues are then postprocessed to extract the set  of lowdimensional points.
The abstract representation of spectral DR methods in Algorithm 1 is based on a thorough analysis of existing techniques [1,2,8].While this representation is compact, it offers sufficient flexibility to, for instance, design new dimensionality reduction procedures.At the same time it provides clear separation of individual computational steps and explicates data flow in any DR process.We exploit both these facts when designing our parallel framework.

Performance Analysis of Dimensionality Reduction
Methods.We used the above presentation of DR methods to identify their basic computational kernels.To better understand how these kernels contribute to the overall performance of different DR methods, we performed a set of experiments using domain specific implementation in Matlab.Experiments were carried out for varying  and a fixed  = 1000 on a workstation with 8 GB of RAM and an Intel 3.2 GHz processor.Obtained results are presented in Table 2.
As can be seen, the run time of analyzed methods is dominated by two steps, namely, NN and construction of the auxiliary matrix .Together they account for 99.8% of the total execution time for  = 4000.In our implementation the NN procedure depends on all-versus-all distance calculations.This is justified taking into account that  is very large and thus efficient algorithmic strategies for NN, for example, based on hierarchical space decomposition [32], are infeasible.Consequently, complexity of this step is ( 2 ).The cost of computing matrix  depends explicitly on the definition of function   .Among existing DR techniques this function is the most complex for the Isomap method.Recall that in the process of DR we are interested in preserving either distance or local topology characteristics.Local topology properties can be directly obtained from NN [8,9,33], inducing computationally efficient definition of   .Conversely, distance characteristics must conform to global constraints and therefore have higher computational complexity [34].In case of Isomap, pairwise geodesic distances can be efficiently derived from all-pairs shortest path distances using, for example, Floyd-Warshall algorithm, with ( 3 ) worst-case complexity.
Another significant DR component is normalization.Although implementation of this step varies between different methods it is invariably dominated by matrix-matrix multiplication.Therefore, we assume overall normalization complexity to be ( 3 ).The last important component is the eigenvalue solver.In general, complexity of this kernel varies depending on the particular solver used.Commonly employed algorithms include Lanczos method [35], Krylov subspace methods [36], or deflation-based power methods [37,38].The choice of method is driven by the structure of the matrix and the number of required eigenvalues.Standard distance preserving DR methods operate on dense symmetric matrices, while topology preserving methods involve sparse symmetric matrices.Accordingly, complexity of these techniques is usually ( 2 ), where  is the number of desired eigenvalues.
A final key factor we have to consider is memory complexity of the described kernels.Here, the main contributing structures are matrices  and .These matrices are most often dense and in the majority of cases require ( 2 ) storage.Because NN directly depends on distances between all pairs, it utilizes an  ×  matrix as well.Finally, input dataset  requires () memory.One important caveat that affects the above analysis is the relationship between  and .In many applications  is significantly greater than .This is not surprising taking into account that acquiring high resolution data (hence high-dimensional) is resource intensive.Therefore one may expect that with increasing  there is usually decrease of .In our applications [29,39] it is not uncommon that  = ( 2 ) or even  = ( 4 ).Consequently, the NN step in Algorithm 1 becomes the most compute intensive while memory requirement is dominated by the input data.Observe that this trend is reflected in our experimental data.

Parallel Framework for Dimensionality Reduction
Dimensionality reduction very quickly becomes both memory and compute intensive, irrespective of the particular method.Memory consumption arises from the size of input data and the auxiliary matrices created in the process.The computational cost is dominated by pairwise computations and weight matrix construction.The goal of our framework is to scale DR methods to very large datasets that could be analyzed on large parallel machines.We designed our parallel DR package following the general outline presented in Algorithm 1. Taking into account significant memory and computational complexity, we focused on distributed memory machines with MPI.To ensure modularity of the framework without sacrificing performance and scalability, we decided to employ a scheme in which processors are organized into a logical 2D mesh.In what follows, we assume a simple point-to-point communication model with latency   and bandwidth 1/  .

Constructing Graph 𝐺.
The graph construction procedure is based on identifying  nearest neighbors of each input point.Because of the high dimensionality of the input data it is advantageous to implement NN in two steps, where we first compute all pairwise distances and then we identify neighbors in a simple scan.Note that these pairwise distances actually represent the distance measure  (see Algorithm 1).Therefore we will consider  to be an  ×  distance matrix.Parallel pairwise computation is a well studied problem [40].
Here, we benefit from our earlier experience with accelerating pairwise computations on heterogeneous parallel processors [41].
Let  =  2 denote the number of processors conceptualized as organized into a  ×  virtual mesh.We decompose  into blocks of (/) × (/) elements.Processor with coordinates (, ) is responsible for computing elements of  within block (, ).This scheme requires that each processor stores two blocks of / points of the input dataset  that correspond to row-vectors and column-vectors used to compute respective part of the matrix .In our implementation, the distribution of the input dataset is performed by parallel I/O with initially preprocessed .Note that to obtain a single element of  we perform | − | 2 norm computations, which are particularly well suited for vectorization.Therefore, we hand-tuned our code to benefit from SIMD extensions of modern processors.
Given pairwise distances, the second step is to identify neighbors of individual points (i.e., vertices of ).This step is executed only for methods where  < , which virtually involves all methods other than PCA (see Table 1).As in the case of pairwise computations it can be efficiently parallelized using the following scheme.Initially, each processor creates a set of  candidate neighbors with respect to the block of matrix  it stores.Specifically, processor with coordinates (, ) searches for neighbors of the set of points [ / , . . .,  (+1)/ ) by analyzing rows of its local block of .Because  is very small this operation can be performed using a simple linear scan.Next, all processors within the same row perform all-to-all communication to aggregate candidate neighbors.Here, the candidate neighborhood of point   is assembled on the processor with coordinates (⌊/⌋, ⌊/⌋ mod ).This processor merges candidate neighborhood lists into the final set of  nearest neighbors.Observe that this completes the graph construction phase-graph  is now stored in the form of adjacency list distributed over  processors.
The computational complexity of the entire procedure can be decomposed into ( 2 /) cost of computing distance matrix, ((/√) log ) cost of performing local NN search, and (( Figure 1: Graph  before (a) and after (b) symmetrization for an example set of seven points and  = 2.

Building Auxiliary Matrix 𝑊.
Given graph  we proceed to the next step, which involves constructing the auxiliary matrix  from the information encapsulated in .As is the case of  we distribute  over  ×  mesh of processors.
Recall that the formulation of DR methods proposed in Algorithm 1 ensures that the only step that is method dependent is the construction of matrix .Consequently, any parallel implementation of this step will vary but it will reflect limitations inherent to the sequential counterpart.Specifically, topology preserving methods, such as, LLE, will involve only local data and hence will be amenable to embarrassing parallelism with limited or no communication.Conversely, distance preserving methods will inevitably require a global data view and thus potentially more sophisticated parallelization strategies.Following our previous claim regarding complexity of Isomap we focus our presentation on the parallel implementation of this particular method.
The function   used in Isomap is based on the geodesic distance, which has been mathematically shown to be asymptotically equivalent to a graph distance in  (i.e., shortest path distance) [42].However, the geodesic distance is a metric, while all-pairs shortest paths in directed graph  do not have to satisfy the symmetry condition.Therefore, to obtain , special attention must be paid to how shortest path distances are used.More precisely, graph  must be transformed to ensure that it is symmetric.Note that after such transformation the graph is no longer regular; that is, certain nodes may have more than  neighbors (see Figure 1).
Taking into account the above requirements we obtain the following procedure of constructing  in parallel.First, all processors within the same row perform all-to-all communication to replicate graph .As a result, each column of processors stores a copy of the entire graph , that is, row-wise distributed between  processors in that column.Thanks to this, each processor can initialize its local part of  without further communication.After initialization  represents the distributed adjacency matrix of , where   = (  ,   ) if   is a neighbor of   , and +∞ otherwise.In the next step symmetrization procedure is executed.Processors with coordinates (, ) and (, ), where  ̸ = , exchange respective blocks of  and select element-wise minimum value between blocks.Similar operation is performed locally by processors on the diagonal, that is, processors for which  = .At this stage  can be used to identify all-pairs shortest paths.Several parallel algorithms have been proposed to address this problem, targeting the PRAM model [43][44][45], shared memory architectures [46], and multi/many cores [47][48][49], as well as distributed memory machines [45,50].Among the existing parallel strategies we decided to adopt the checkerboard version of the parallel Floyd algorithm [46].Briefly, the method proceeds in  iterations, where in each iteration every processor performs ( 2 /) operations to update its local block of .All processors are synchronized in each iteration, owing to the fact that in iteration , th row and th column of  have to be broadcasted.The broadcast is performed between processors that share the same row/column.After  iterations, matrix  stores allpairs shortest path, which concludes the entire procedure.
Complexity of this phase is ( 3 / + (  +   (/√)) log(√)) and is dominated by the parallel Floyd's algorithm.While replication and symmetrization of  can be executed efficiently in ( 2 / +   +   ( 2 /)) time, all-pairs path searching involves extensive communication that slightly hinders scalability.Nevertheless, the algorithm remains scalable as long as  <  2 /log 2 (), which is true in the majority of real-life cases.

Matrix Normalization.
The goal of normalization is to transform matrix  such that the resulting matrix  is both row and column centered; that is, ∑    = 0 and ∑    = 0.The normalization stage in all cases consists of matrix-matrix multiplication (see Table 1).However, in certain situations, especially in distance-preserving methods, explicit matrix multiplication can be avoided by taking advantage of structural properties of one of the matrices (e.g., the matrix  in Table 1).This is the case for, for example, PCA and Isomap, where we exploit the fact that matrices  and   are given analytically and thus can be generated in-place on each processor that requires them to perform multiplication.Consequently, the communication pattern inherent to the standard parallel matrix-matrix multiplication algorithms is simplified to one parallel reduction in the final dot-product operation.The complexity of this approach is ( 3 / + (  +   ( 2 /))√).

Finding Eigenvalues.
Computing eigenvalues is the final step in the dimensionality reduction process.Although parallel eigensolvers are readily available, they are usually designed for shared memory and multi/many core architectures [51][52][53][54].This unfortunately makes them impractical for our purposes.At the same time, existing distributed memory solutions are not scalable and cannot handle large and dense data.For instance, one of the more popular packages, SLEPc [55], uses a simple 1D decomposition and in our tests did not scale to more than 4096 processors.A more recent library, elemental [56], which is still under development, offers 2D block-cyclic decomposition, but relies on a fixed block size (private communication).Finally, ELPA [57], which is the most promising library in this category, is also under development.
For these reasons we decided to implement a custom eigenvalue solver that exploits special properties of matrix  (symmetric, positive semidefinite) and computes only the first  eigenvalues.Our solver is based on the power method [31] and matrix deflation and is outlined in Algorithm 2. Note that power methods are considered easy but not efficient to parallelize.At the same time, however, they are at heart of several important real-life systems, for instance, Google's PageRank [58].
In general, our approach follows the standard scheme of the power method (lines 3-18), repeated  times to identify first  largest eigenvalues.After identification of an eigenvalue and its associated eigenvector, the matrix  is deflated-the contribution of the vector is removed from  (line 19).Observe that power method involves nested matrixvector product (lines 6-11) that under normal circumstances would require parallel vector transposition.However, our parallel implementation benefits from the fact that  is symmetric, hence eliminating need for vector transposition.Indeed, the entire procedure consists of local matrix-vector product followed by all-reduce operation.Here, the reduction operation alternates between columns and rows as required to ensure that vector x is stored properly.Note that the power method is bounded by convergence criteria (line 5).In our case we use one of several popular conditions, which involves checking relative error between the current and previous estimate of the eigenvalue that can be performed every several iterations.We also note that convergence is significantly improved by using a matrix shifting strategy in the form  =  − , where  is a positive number [59].
Extracting eigenvalue and eigenvector in iteration  (lines 11-18) depends on vectors  and , while deflation step involves   .Therefore, it is advantageous to replicate both  and  in their entirety on each processor.We achieve this with all-to-all communication executed by processors within the same row.This allows us to execute the deflation step in parallel, with each processor updating its local block of matrix .Thus, the complexity of a single iteration of the power method is ( 3 / + (  +   (/√)) log()), while the deflation step is ( 2 / +   √ +   ).
To conclude this section we would like to emphasize that our solver operates under the same assumptions as any power method.It requires that the first  eigenvectors of  are linearly independent, the initial vector x generated in th iteration is not orthogonal to the eigenvector V  , and finally, the first  eigenvalues are nondegenerate [31].Note that these conditions are not restrictive and are easily satisfied in the context of dimensionality reduction.

Experimental Results
To assess scalability of our framework and test its performance in real-life applications, we performed a set of experiments using the TACC Ranger cluster [60].A single node of this machine is based on AMD processors working at 2.3 GHz and provides 16 cores with 32 GB of DDR2 RAM and 512 KB of L2 cache per core.Nodes are connected by a multistage Infiniband network that offers 1 GB/s bandwidth.
To compile all test programs and the framework, we used the Intel C++ 10.1 compiler with standard optimization flags and MVAPICH 1.0.1 MPI implementation.In every test we ran one MPI process per CPU core, which we refer to as processor.

Scalability Tests.
In the first set of experiments we measured how problem size influences performance of our solution.We created a collection of synthetic datasets consisting of  = {4096, 8192, 16384, 32768} points with  = 10000.Next, we performed Isomap dimensionality reduction using different numbers of processors.Obtained results are summarized in Table 3 and Figure 2.
The results show that our framework provides very good scalability for large problem sizes in the entire range of tested processor configurations.The super-linear speedup observed for  = 16384 is naturally explained by cache performance.Observe that the dominating computational factors in our framework are operations like matrix-matrix and matrix-vector products, which are well suited to exploit memory hierarchy.In our current implementation, we use direct SIMD-based implementation of these routines.A slightly weaker performance for small problem sizes and large number of processors can be attributed to network latency that offsets computational gains.
To further understand how different components of the framework perform, we measured their run time obtained for changing problem sizes.Table 4 shows that all modules scale as we would expect based on their theoretical complexity.The most time consuming stages are construction of the auxiliary matrix  for Isomap and normalization.This is not surprising taking into account that both components scale as ( 3 ) and the parallel Floyd's algorithm involves  rounds of communication.The abrupt performance decrease in the normalization stage, which can be observed for  = 16384, can be attributed to cache performance.Recall that normalization depends on matrix-matrix multiplication and hence is inherently sensitive to data locality.The final remark Input: Matrix  × and the required numbed of eigenvalues .
In the final test we compared our parallel eigensolver with SLEPc [55], one of the most popular and widely used libraries providing eigensolvers.SLEPc is an efficient and portable framework that offers an intuitive user interface.In many cases it is the first choice for solving large-scale sparse eigenvalue problems.Table 6 shows that our implementation systematically outperforms SLEPc.This can be explained by two main factors: first, unlike SLEPc our implementation follows 2D data decomposition scheme, which offers better scalability, and second, we are seeking only the  largest eigenvalues.represent a promising low-cost, rapidly deployable strategy for harnessing solar energy.While highly cost-effective and flexible, their low power conversion efficiency makes them less competitive on a commercial scale in comparison with conventional inorganic solar cells.A key aspect determining the power conversion efficiency of organic solar cells is the morphological distribution of the two polymer regions in the device.Recent studies reveal that significant improvement in power conversion efficiency is possible through better morphology control of the organic thin film layer during the manufacturing process [29,[61][62][63][64][65][66].High-throughput exploration of the various manufacturing parameters, for example, evaporation rate, blend ratio, substrate patterning frequency, substrate patterning intensity, and solvent, can potentially unravel process-morphology relationships that can help tailor processing pathways to obtain enhanced morphologies.Note that such high-throughput analysis results in datasets that are too large to visually inspect for trends and relationships.A promising approach towards unraveling process-morphology relationships in this high-throughput data is via dimensionality reduction.Here, we showcase our parallel framework on this pressing scientific problem.

Using Dimensionality Reduction to
In particular, we focus on using dimensionality reduction to understand the effects of substrate patterning [67,68], described by patterning frequency and intensity, on morphology evolution.We note that nanotip photolithography patterning of the substrate has shown significant potential to guide morphology evolution [69].
The input dataset consists of  = 75150 morphologies.Each morphology is a 2-dimensional snapshot which is vectorized to have dimensionality  = 8326.Figure 3 shows several representative final morphologies obtained by varying the patterning frequency, , from 0.5 to 1.50, and the intensity of the attraction/repulsion, , from 1 + 1 − 6 to 1 + 8 − 4 (in the remaining we omit leading 1 for clarity).In all these cases, the lower surface of the domain is patterned to attract and/or repel specific classes of polymers, thus affecting the morphology.We performed dimensionality reduction on this dataset using  = 16384 processors on TACC Ranger.The total run time was 1058 seconds.Figure 4 plots the first 10 eigenvalues of the data.Note that the first three eigenvalues, and hence the first three principle components of the data, represent ∼70% of the information content of the entire data.We therefore characterize each morphology in terms of this three-dimensional representation.
Figure 5 represents all the morphologies in such reduced space.In the plot to the left, the points are color coded according to the patterning frequency used, while in the plot to the right, the points are color coded according to the patterning intensity.This plot provides significant visual insight into the effects of patterning frequency and intensity.There exists a central plane of patterning frequency, where the morphology evolution is highly regulated irrespective of the patterning intensity ( ≤ 1).This is particularly valuable information as the patterning frequency is much easier to control than patterning intensity from a manufacturing perspective.For patterning frequencies above  = 1, the morphologies are highly sensitive to slight variations in both frequency and intensity.This is also clearly seen in Figure 6, where slight variations in the intensity give dramatically different final morphologies.Notice also that higher intensity does not necessarily give different morphologies.This is a very important insight that allows us to preclude further, potentially expensive, exploration of the phase space of increasing patterning intensity.
Finally, the low-dimensional plots illustrate the ability to achieve the same morphology using different processing conditions.For instance, in Figure 7, we isolate the morphology evolution under two processing conditions that result in identical morphologies.Such correlations, most sensitive regions versus least sensitive regions (Figure 6) and configurations resulting in identical morphologies, are enormously useful as we tailor processing pathways to achieve designer morphologies.This analysis illustrates the power of parallel dimensionality reduction methods to achieve this goal.We defer a comprehensive physics-based analysis of this dataset to a subsequent publication.

Conclusions
In this work we illustrate a systematic analysis of dimensionality reduction techniques and recast them into a unified view that can be exploited by dimensionality reduction algorithm designers.We subsequently identify the common computational building blocks required to implement a spectral dimensionality reduction method.We use this insight to design and implement a parallel framework for dimensionality reduction that can handle large datasets and scales to thousands of processors.We demonstrate the capability and scalability of this framework on several test datasets.We finally showcase the applicability and potential of the framework towards unraveling complex process-morphology relationships in the manufacturing of plastic solar cells.

Figure 2 :
Figure 2: Relative speedup for different problem sizes .

Figure 5 :
Figure 5: Morphology evolution as captured by the first three principal components of the original data.Different colors represent different patterning frequency  (left) and different patterning intensity  (right).

Table 3 :
Run time in seconds for different  and varying problem sizes .Due to memory limitations problem instance with  = 32768 could not be solved on less than  = 256 processors.
(16))Replicate entire vector  and  on each processor.

Table 4 :
Component-wise run time in seconds for varying problem sizes, and  = 1024 and  = 10000.

Table 5 :
Run time in seconds of KNN component for  = 1024 and different number of processors and varying .

Table 6 :
Comparison of our eigensolver with SLEPc.For  = 1024SLEPc failed to execute.