Cloud Computing for Rigorous Coupled-Wave Analysis

Design and analysis of complex nanophotonic and nanoelectronic structures require significant computing resources. Cloud computing infrastructure allows distributed parallel applications to achieve greater scalability and fault tolerance. The problems of effective use of high-performance computing systems for modeling and simulation of subwavelength diffraction gratings are considered. Rigorous coupled-wave analysis (RCWA) is adapted to cloud computing environment. In order to accomplish this, data flow of the RCWA is analyzed and CPU-intensive operations are converted to data-intensive operations. The generated data sets are structured in accordance with the requirements of MapReduce technology.


Introduction
In recent years there has been a steady increase in the research and development activity in the nanoscale fabrication area.Optical methods are widely used to characterize nanoelectronic and optimize nanophotonic structures [1][2][3][4].When the characteristic size of the nanostructure is comparable to the wavelength of the used light, the conventional scalar diffraction methods become inadequate.Therefore, rigorous Maxwell's electromagnetic theory should be adopted to analyze such structures.With the advances of computer technology, many numerical techniques have been developed to rigorously solve the diffraction problem.
Rigorous-coupled wave analysis (RCWA) [5][6][7][8][9][10] has been frequently used to solve the problem of electromagnetic wave diffraction by periodic structures.Solving optimization and scatterometry problems may require millions of diffraction pattern evaluations for each possible combination of geometric structure and incident light parameters.Computational cost estimate shows that these problems would clearly call for supercomputing capacity.
Thus, current computational tasks of nanostructure analysis generate ever-higher demands on the methods used for parallel computing and data storage [11,12].The developed applications should work efficiently on multicore and multiprocessor systems.An implementation of MPI (message passing interface) is the traditional means of creating such applications [13].MPI provides a highly flexible ability to develop applications that address specific requirements of the algorithm structure and the used computing infrastructure.However, this ability makes it necessary for developers to implement some of the low-level services.
Compared with MPI, MapReduce programming paradigm for cloud computing [14] provides higher level of abstraction for the developer of parallel applications.On the one hand, MapReduce imposes certain restrictions on data format and the flexibility of the algorithm used.On the other hand, MapReduce offers a simple programming model, automatic parallelization and distribution, fault tolerance, I/O scheduling, and monitoring tools.
In the following we consider how RCWA can be implemented to utilize MapReduce technology.The RCWA relies heavily on the computation of eigenvalues of an intermediate matrix and the solution of a corresponding linear system.To reduce the compute time and enhance fault tolerance, we build a distributed, dynamically growing look-up table containing the precomputed eigenvalues and eigenvectors for the set of lamellar layers that approximate the analyzed nanostructure.The look-up table resides in a specialized distributed file system.
The paper is organized as follows.Section 2 describes the main features of MapReduce framework implementation for cloud computing.Section 3 outlines the algorithm and data flow of RCWA.Section 4 reports the adaptation of RCWA for MapReduce technology.Section 5 describes some performance tests for the suggested algorithm on the Hadoop cluster [14,15].Conclusions are given in the final section.

Main Features of MapReduce Technology
MapReduce is based on the old approach of splitting a task into subtasks, executing them in parallel, and merging the intermediate results to obtain the final one [16][17][18].However, the MapReduce implementations of this scheme provide effective means for controlling resource utilization in large-scale distributed systems.
The processing of large data sets (more than hundreds of gigabytes) is a challenge for the traditional high performance computing (HPC) attitude.In comparison to CPU-intensive tasks, the data-intensive tasks require demanding rate of access to the used storage system.The bottleneck of HPC for these problems is the shared file system network bandwidth.Hence, a large number of computing nodes would remain idle and wait for the data.Because of this, the load balancing and data locality become important to efficiently handle.
MapReduce framework provides load balancing of computing nodes by organizing the "master-workers" architecture.The master node is responsible for the assignment of computing tasks to worker nodes every time a worker node becomes idle.
For a large data set it is crucial to ensure the closeness of data and computation to reduce data transfer between nodes.Thus, the input data should already be stored locally on the corresponding computing nodes.MapReduce achieves this by using a specialized distributed file system.This file system divides each file into blocks with size of about several tenths of megabytes and stores several copies of each block on different nodes.This data replication in the most cases allows the master node to assign computing tasks to a worker that stores corresponding input data.
The next problem of traditional HPC is the difficulty in handling partial failure of computing nodes.In large clusters of thousands of computing nodes it is necessary to permanently deal with crashes.In MapReduce, the master node pings every worker periodically to recognize possible failures of individual nodes.Thus, master detects a failed computing task and reschedules it to another worker which owns a replica of the input data block.
The two-stage processing scheme of MapReduce is depicted in Figure 1.The first and second stages correspond, respectively, to Map() and Reduce() functions.Both functions need to be implemented by the developer of distributed application.Key/value pairs form the required processing data structure.Every input key/value pair is processed by the Map() to produce a set of intermediate key/value pairs.The Reduce() takes all values associated with the same intermediate key as input and generates a set of final key/value pairs.Thus, it is sufficient to implement the two functions to obtain a usable-distributed application.
To reduce the network traffic between nodes MapReduce supports the use of auxiliary Combine() function which is similar to Reduce().Combine() processes the output of the Map() on each node of the cluster separately to prepare the intermediate results to subsequent shuffling over a network to the Reduce() nodes.
The shuffling of the intermediate results is based on dividing up the set of intermediate keys and assigning intermediate key/value pairs to Reduce() nodes.The goal is to assign approximately the same number of keys to each Reduce() node.

Overview of RCWA
The general grating diffraction problem is illustrated in Figure 2. It is convenient to separate the space into three regions: two homogeneous semi-infinite regions above and below the grating, and an inhomogeneous region which includes the relative permittivity periodic modulation of the analyzed nanostructure.A linearly polarized plane electromagnetic wave of wavelength λ is obliquely incident at an arbitrary polar angle θ and at an azimuthal angle φ upon a two-dimensional, possibly multilevel or surface relief, dielectric or metal grating.The normalized electric-field vector that corresponds to this plane wave is a solution of Maxwell's equations in an infinite homogeneous region [19].The variable Ψ is the angle between the polarization vector E and the plane of incidence.RCWA was formulated for both principal plane and conical incidence [6].The problem of the slow convergence of the solution when a conducting grating is illuminated with a transverse magnetic (TM) polarized wave has been subsequently resolved [7].
RCWA involves several steps.As shown in Figure 3(a), the grating is sliced into a set of two-dimensional layers and each layer is approximated by a rectangular slab.The periodic permittivity distribution in each slab of the grating region is expanded into the Fourier series.
The field in the grating region can be also expressed as Fourier expansion through application of the Floquet condition.Substituting these expansions into the Maxwell equations, we can derive a set of first-order coupled wave equations.Its homogeneous solution can be obtained by solving the eigenvalue and eigenvector of the matrix that corresponds to this set of wave equations.The size of the matrix is equal to (2N + 1) 4 for conical incidence, where N is the number of positive-observed diffraction orders for each of the coordinates X and Y .The electromagnetic field above and below the grating region can be expressed using linear superposition of plane waves in the direction of various orders of grating diffraction.In the final step, this representation is used as the boundary conditions to determine the specific solution of the diffraction efficiency.
Accuracy of RCWA is related to the number of diffraction orders that are included in the superposition.In general, the greater number of observed diffraction orders the higher accuracy.However, it is necessary to control the systematic computational error when a large number of observed diffraction orders is selected.

Mapping RCWA to the MapReduce Scheme
We consider the technique of using the RCWA for optical simulation or optimization of the nanostructures.As an example of such structure, we take a two-dimensional diffraction grating.The grating period consists of a truncated cone, and its central cross section is shown in Figure 3(a).The gray rectangle in Figure 3(a) shows the set of the possible (evaluated) geometric parameters.The slope of the truncated cone lies inside this rectangle.The parameter W determines the number of vertically homogenous layers that approximate the slope in RCWA.We have the number of these layers fixed and equal to L.
As mentioned in the preceding section, eigenvalues and eigenvectors of the corresponding matrix are calculated for each vertically homogeneous layer.The calculated eigenvalues and eigenvectors do not depend on the thickness of a homogeneous layer.Without saving the intermediate results, this time-consuming task would be repeated many times.The sawtooth step function in Figure 3(b) shows the dependence of number N L of eigenvalues and eigenvectors calculations without storing the intermediate information on N W sets of radius variation.The parameter L in the Figure 3(b) is assumed to be equal to 7 and parameter W is equal to 8. Note that the number of eigenvalues and eigenvectors calculations is equal to W when the intermediate information is stored.
Such storage of the intermediate information converts the CPU-intensive problem into a data-intensive one.We estimate the amount of memory needed to store the calculated eigenproblem data.
The structure and number of input parameters for the task of optical modeling of nanostructures depends in particular on the simulation purposes and the geometry of the structure.Thus, we can distinguish two main groups of input parameters.The first group determines the illumination conditions of the simulation.The second group consists of the geometric parameters of the nanostructure and its variations.
We estimate the possible number of input parameters for the task of modeling the nanostructure as shown in Figure 3(a).We start with the group of geometrical parameters.Let the number of possible values of the truncated cone radius be 100 (W = 100).Suppose that a variation of the structure period is also allowed.The number of possible values of the structure period is equal to 10.Let us now consider the group of the illumination conditions parameters.Suppose that in our case this group contains two parameters.First, it is the meridional angle of incidence.Second, it is the azimuth angle of incidence.For each of these two parameters, we set the number of possible variations to 30.Thus, the number of eigenproblems to solve is equal to the product of the number of variations of the aforesaid four parameters, that is, ∼10 6 .
Having computed the parameters number, we now evaluate the size of the stored intermediate data.This value depends on the number of observed diffraction orders.Next we assume that the calculations are done with long-double precision (16 bytes).
Figure 4(a) shows the dependence of the size of the eigenvectors matrix on the number of observed diffraction orders.Figure 4(b) shows the total amount of the stored intermediate data for the different number of the observed diffraction orders and the parameters numbers.The figure shows that the amount of data may exceed 1 TB.Thus, our task can be classified as a data-intensive problem.To reduce the size of the stored intermediate data some compression procedure can be used.However, this in turn implies that there is a corresponding computational overhead.
We will now consider how RCWA can be mapped into MapReduce scheme.The structure of stored data for Map-Reduce technology is defined by a pair of "key/value".Here the "key" K 1 consists of two sets.First, it is the set of geometric parameters that define one of the layers which is homogeneous in the vertical direction.Second, it is the set of parameters that defines the illumination conditions of the simulation.The following relation is an example of such "key": where d x , d y are the periods of two-dimensional structure, m x , m y are the number of material sections on the period, b x , b y are the size of material sections on the period, and n x , n y are the index of refraction of these material sections.
The last three parameters in the "key"-λ, θ, φ-describe the lighting conditions.Notice that the thickness of the layer is absent in the above list.
The "value" V 1 contains the calculated eigenvalues and eigenvectors for one layer: ( The look-up table resides in the distributed file system mentioned in Section 2. By combining with Map-Reduce, the distributed file system allows providing efficient load balancing, data locality, and faulting tolerance to the whole computing system. Figure 5 shows the flowchart of RCWA method in terms of MapReduce technology.The Map() function receives as input the key K 2 that describes the analyzed nanostructure.This key consists of a set of keys K 1 that defines the layers of the structure, as well as the array t that describes the thicknesses of these layers: ( The Map() looks for the intermediate data in the look-up table with the given key K 2 .Thus, the outputs of the Map() function are pairs of K 2 /[V 2 ].Each structure is associated with the list of the eigenproblem solutions [V 2 ] for the array of ↔ K1 layers.This list of the eigenproblem solutions may contain null values, in the case where the necessary information is not yet in the look-up table.
The pairs of K 2 /[V 2 ] are the input parameters of the Reduce().This function calculates the missing intermediate data, stores them in the look-up table, then generates the system of linear equations and solves it.
In this paper, we assume that the output parameters of the Reduce() are the diffraction efficiency of observed diffraction orders.However, the same approach is valid for other output parameters, for example, electromagnetic field distribution in the socalled near field [10].Thus, in our case where R, T are the diffraction efficiency of observed diffraction orders, respectively, for reflection and transmission.

The Results of Computational Experiments on Hadoop Cluster
Computational experiments are performed on a relatively small Hadoop cluster consisting of four nodes.Each node in the cluster contains a CPU Intel Xeon 2.13 GHz and 4 GB RAM.
Namenode-server is run on one node, and jobtrackerserver is run on another node.The remaining two cluster nodes are used as working nodes.Two mappers and two reducers can be run simultaneously on each of the working nodes, which correspond to the parameters that are set by default.Other parameters of Hadoop cluster are also set by default.
The experiment measures the parallel read/write average throughput for the HDFS file system.The parallel read average throughput is equal to r = 460 MB/s and the parallel write average throughput is equal to w = 60 MB/s.It is not our goal in this work to enhance r and w values by optimizing the default parameters of Hadoop cluster.Therefore, the above values r and w are assumed to be fixed.
We will now assess the feasibility of the proposed scheme of RCWA implementation.The standard and the proposed algorithms are different only at the eigenvectors calculation step.We compare the time of eigenvectors calculation for the standard RCWA algorithm (without saving the intermediate results) with the time of eigenvectors calculation for the proposed algorithm (with saving the intermediate results).The main share of the RCWA computational time is contributed by the eigenvectors calculation.The eigenvectors are calculated for each layer of the modeled nanostructure.
Java programming language is the most appropriate for the implementation of a flexible application within the Hadoop package.Therefore, the algorithm for finding the eigenvectors of a complex general matrix was implemented with Java.  Figure 6 shows the dependence of the time (in seconds) of eigenvectors calculation on N value for Java implementation and Intel MKL implementation.N value denotes the number of the positive diffraction orders for one coordinate.

Advances in Optical Technologies
Here we assume that N value is the same for each coordinate.
The left Y -axis on the graph shown in the Figure 6, corresponds to the results of Java application.The right Y -axis of this graph shows the same results for Intel MKL implementation.
We define the following parameters: s-upload file size (MB), which is processed by one mapper; f -the "useful" share of the uploaded file, which is used by mapper to find the eigenvectors (normalized value); m-the average number of uses of one matrix from the uploaded file; a-the size of the matrix (MB), which contains the computed eigenvectors for one layer of the modeled nanostructure; e-the time for computing the eigenvectors of the matrix for one layer of the structure (s); L-the total number of layers in structures that are processed by one mapper.
Then the condition for the feasibility of the proposed RCWA algorithm can be written as follows: the total computational time that one mapper spends (Le) should be greater than the time spent on the following steps: (1) the reading from the HDFS file with size of s, (2) the calculation of the unresolved eigenvectors, and (3) the writing of the calculated eigenvectors in HDFS.
The above-described condition can be represented as a linear dependence of the L value on the overall usage ratio of the uploaded file f m: where ( In Figure 7 the cross-hatched area shows where the relation ( 5) is satisfied.The slope of the curve in the Figure 7 (k value) defines the rate at which the overall usage ratio f m should increase when L value increased.
Figure 8 shows the dependence of the curve slope value k on N value for (a) Java implementation and (b) Intel MKL implementation.
Note that the curve slope value k is very small on the Figure 8(a).Thus, to stay within the feasibility requirements, there is no much importance struggling for increasing the overall usage ratio f m when using the relatively slow Java implementation.Such need appears for the more effective Intel MKL implementation (Figure 8(b)).In this case, the different approaches of metadata usage can be considered.

Conclusions
The paper discusses the computational problems arising in modeling and optimization of complex nanophotonic structures by Fourier modes method (RCWA).The cloud computing infrastructure usage is suggested to solve these problems.This approach allows to effectively exploit the potential of modern computational tools by improving the scalability of the computational problem and by enhancing the fault tolerance of the computer system.This opens up new possibilities in solving problems of diffraction nanophotonic [20][21][22], magneto-optics [23,24], and plasmonics [25,26].

Figure 3 :
Figure 3: Example of two-dimensional diffraction grating analysis; (a) the grating period with the admissible value ranges of the upper and lower radius of the truncated cone; (b) the number of eigenvalues and eigenvectors calculations for each set of the truncated cone upper and lower radius without storing the intermediate information.

Figure 4 :
Figure 4: The size of the stored intermediate data; (a) the dependence of the size of the eigenvectors matrix on the number of observed diffraction orders; (b) the total amount of the stored intermediate data for the different number of the observed diffraction orders and the parameters numbers.

Figure 6 :Figure 7 :
Figure 6: The dependence of the time (in seconds) of eigenvectors calculation on N value for Java implementation (the left axis) and Intel MKL implementation (the right axis).

Figure 8 :
Figure 8: The dependence of k value on the number of diffraction orders value N for (a) Java implementation and (b) Intel MKL implementation.