Parallel Implementation and Scalability Results of a Local-Scale Air Quality Model: Application to Bamako Urban City

. We present a computational framework for the parallel implementation of a local-scale air quality model described by an advection-di ﬀ usion-reaction partial di ﬀ erential equation, the so-called equation of reactive dispersion. The temporal discretization of the model is carried out using the forward Euler scheme. The spatial discretization is achieved using the ﬁ nite element method. The strategy used for the parallel implementation is based on the distributed-memory approach using the message-passing library MPI. The simulations are focused on two road tra ﬃ c-related air pollutants, namely particulate matters PM2.5 and PM10. The e ﬃ ciency and the scalability of the parallel implementation are illustrated by numerical experiments performed using up to 128 processor cores of a cluster computing system.


Introduction
Air pollution is nowadays an increasingly serious global issue for human health and the environment with significant economic consequences. It is particularly exacerbated in the urban context due to the high population density and strong concentration of pollution sources [1]. Air pollutants refer to any substance in the air that could, at high enough concentrations, harm human health and animals and damage the environment. They may consist of solid particles, liquid droplets, gases, or combinations of these forms. Air pollutants are classified into two main types: primary pollutants and secondary pollutants. The primary pollutants are directly emitted into the atmosphere from identifiable sources, which can be either natural, such as wildfires or volcanic eruptions, or anthropogenic such as industrial activities or traffic emissions. They include carbon monoxide (CO), sulfur dioxide (SO2), nitrogen oxides (NOx), particulate matter (PM), ammonia (NH3), volatile organic compounds (VOCs), and toxic metals. As for secondary pollutants, they are not directly emitted but are formed as a result of chemical interactions between primary pollutants within the atmosphere. Secondary pollutants include ozone (O3), nitrogen dioxide (NO2), sulfur trioxide (SO3), sulfuric acid (H2SO4), nitric acid (HNO3), hydrogen peroxide (H2O2), and particulate matter (PM).
Air pollution is associated with a large spectrum of acute and chronic health effects [2] such as stroke, heart disease, chronic obstructive pulmonary disease, lung cancer, acute respiratory infections, skin irritations, eye and ear inflammations, and nose and throat (ENT) disorders. The World Health Organization (WHO) reports that 4.2 million premature deaths worldwide every year are due to exposure to ambient air pollution [3]. The International Agency for Research on Cancer (IARC), the specialized cancer agency of the WHO, has classified outdoor air pollution as carcinogenic to humans [4], Group 1. Faced with the health and environmental issues and the economic cost of air pollution, policies are gradually being implemented for controlling and regulating air quality at the local, regional, and global levels. In more advanced countries, considerable resources are devoted to the measurement and monitoring of ambient air pollutant concentrations at discrete stations of target areas using continually evolving technologies, including air quality sensors. Since these monitors are often expensive and sometimes difficult to access, especially for developing countries, it is therefore challenging to sufficiently network the target area with them in order to better evaluate the air quality index and make decisions to mitigate its effects.
The modeling and simulation of the motion and dispersion of air pollutants are alternatives that offer good results [5]. These are common tools on which the study and forecasting of air quality are traditionally based. Air pollution modeling is one of the most important and challenging scientific problems [6] often used to support decisions in air quality assessment and management. It covers the transport and diffusion of pollutants in the atmosphere, their dry and wet deposition, and chemical reactions. It also depends on pollutant properties, meteorological conditions, emission data, and terrain parameters.
This work focuses on the modeling and simulation of the road traffic-related air pollution in Bamako, the political and economic capital of the landlocked West African country of Mali (see Figure 1). This city represents 0.02% of the national territory and concentrates 12.46% of the Malian population, with a density of 9062 inhabitants per km 2 [7]. It is home to more than 70% of the economic activities of Mali, which makes it the main industrial and commercial crossroads of the country. This strong concentration of economic activities, associated with rapid population growth and increasing urbanization, is a major driving force of air pollution in Bamako. In addition, Bamako is geographically built in a basin surrounded by hills which favor the imprisonment of pollutants and make the city even more vulnerable. Despite significant improvements in engine technology, the traffic flow is the main pollution source in urban areas [1], particularly in Bamako where the road infrastructure is very poor and the road transport fleet is obsolete. A large number of diesel vehicles and the poor quality of fuels cause traffic congestion and the emission of numerous pollutants.
These factors mentioned above mean that Bamako, like most African capitals, is a very polluted city where the annual concentration of PM10, an inhalable particulate matter with a diameter of 10 μm or less, can reach 333 μg/m 3 [8] with daily peaks exceeding 600 μg/m 3 while the air quality guidelines [9] of the WHO recommend a maximum daily mean concentration of 45 μg/m 3 . Another harmful pollutant widely used as an indicator of air pollution levels in cities is PM2.5, a fine particulate matter with a diameter of 2.5 μm or less. The daily averaged concentrations of PM2.5 over October 2020, obtained from the hourly based PM2.5 data collected by a regulatory grade air monitor installed on the U. S. Embassy compound of Bamako, are plotted in Figure 2. It follows from this figure that these concentrations, with a peak reaching 165 μg/m 3 , far exceed 15 μg/m 3 [10], the maximum value recommended by the air quality guidelines of the WHO as well as the U.S. Environmental Protection Agency (EPA) daily standard level fixed to 35 μg/m 3 .
In this article, we present a parallel implementation framework and scalability results of a local-scale deterministic and Eulerian-based air quality model in Bamako. We will focus on the parallel implementation aspects and scalability behavior of the computational model used for producing the numerical results presented in [12]. In Section 2, we describe the mathematical model governing the spatiotem-poral evolution of the concentration of atmospheric pollutants. Section 3 details the temporal and spatial discretization of the model. Section 4 focuses on the parallel implementation and finite element analysis of the model. In Section 5, we present the numerical experiments and computational scalability results of the framework. The main conclusions are summarized in Section 6.

Model Description
Let Ω be a bounded computational domain of ℝ 3 and ∂Ω the boundary of Ω. We consider a decomposition of ∂Ω of the form ∂Ω = ∂Ω G ∪ ∂Ω H ∪ ∂Ω L , where ∂Ω G is the ground surface, ∂Ω H represents the upper limit boundary of the domain Ω and ∂Ω L denotes lateral boundaries. Let c be a vector of concentration fields, where each element c i corresponds to the scalar field of concentration of the chemical species (pollutant) labelled i in the air. The spatiotemporal evolution of the concentration c i in the domain Ω over the time interval ð0, TÞ is described by the following advection-diffusion-reaction [13] model: where (i) c i = c i ðx, tÞ is the concentration of pollutant labelled i, (ii) μðx, tÞ is the diffusion coefficient, (iii) u = uð x, tÞ is the wind velocity (or wind speed), (iv) χ i = χ i ðx, tÞ is the chemical source term that describes the chemical reac- tÞ is the dry deposition velocity, (ix) n = nðxÞ is the unit outward normal vector to the boundary ∂Ω, (x) x = ðx, y, zÞ is the spatial coordinates, where x and y are the variables of the plane surface and z is the altitude, and (xi) t is the physical time. The units in the international system (SI) of these physical fields and parameters are reported in Table 1. The diffusion coefficient parametrization, the dry deposition, scavenging processes, and the chemical kinetics are described in detail in [12]. Model (1) is based on the assumption that there is no feedback between the chemical species and the flow fields, including wind velocity and diffusion coefficient. We assume that the flow is incompressible. This means that the variations in air density are assumed to be constant. We also assume that the urban terrain is homogeneous. The meteorological parameters are strongly influenced by the atmosphere around them. The transport of pollutants takes place mainly in the atmospheric boundary layer (ABL) [14], the lowest part of the troposphere subject to surface effects, since most of the pollution sources are located at the ground level. The ABL is of interest in urban air    02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  3 Journal of Applied Mathematics pollution modeling as it is the atmospheric layer where humans dwell [15]. We will focus on an altitude, which we denote H, included in the ABL layer.
In the rest of this article, we will consider first-order chemical reaction kinetics with a constant coefficient. This means that the balance of the physicochemical transformations in model (1) can be written as χ i = −κ i c i , where κ i is the reaction rate. The characteristic time [16] of the chemical species labelled i, which we denote τ i , is defined according to the reaction rate by the following relation: This characteristic timescale, also called residence time, represents the average time spent by a representative molecule in the atmosphere before it is removed during loss processes such as dry deposition, wet scavenging, and chemical reactions. The characteristic distance (space scale) over which species can be transported is correlated with the residence time. If the emitted species have short residence time (minutes-hours) in the atmosphere, they cannot be transported long-range, and their effects will therefore be important on a local scale. However, other chemical species with longer residence time (hours-days) can have wider impact zones where regional or continental scales will be needed. The characteristic timescale τ i associated with the dry deposition in an atmospheric layer of height H for a species labelled i can be computed from its dry deposition velocity as follows:

Model Discretization
The time integration of model (1) is carried out using the forward Euler method. The choice of this classical approach is dictated by the temporal scale of interest, about 10 minutes, and the physical and chemical properties of air pollutants in which we are interested. Taking into account the assumptions listed in Section 2, model (1) is discretized in time as follows: where c n i = c i ðx, t n Þ and t n = nΔt, n ∈ ℕ * . Δt denotes the time-step of the discretization.
The spatial discretization of the model is achieved using the finite element method [17,18]. It is considered one of the best-suited methods for computing the approximate solution of complex problems in science and engineering expressed in terms of partial differential equations (PDEs) on regions with complex geometrical configurations. This method is based on the principle of energy minimization and consists in approximating, in a finite-dimensional subspace, a problem written in a variational form in an infinite-dimensional space. The approximate solution is finally computed as a collection of discrete values in the form of components of a solution vector of the linear system resulting from the discretization. Readers who are already familiar with the classical finite element method can directly jump to Section 4.
The main steps involved in the finite element analysis, already recapped in [19], are (i) the problem definition, (ii) the discretization (triangulation and definition of approximation space), (iii) the definition of the variational (or weak) formulation, (iv) the system assembling (local (on each mesh element) and global (on the entire mesh)), and (v) the solution of the resulting linear system.
Let T δ be a mesh of Ω, where δ refers to the mesh characteristic length. We denote V δ the piecewise linear finite element space defined on T δ with a vanishing value at the closed boundary ∂Ω L . It is defined as where H 1 0,∂Ω L is the Hilbert-Sobolev space of order 1 satisfying a homogeneous Dirichlet condition on the closed boundary ∂Ω L . C 0 ð ΩÞ is the set of continuous functions defined on Ω, the closure of Ω. The variational form of the problem (1) is written as follows: find c i ∈ V δ such that where c i refers to c n i . Integrating Equation (6) by parts and taking into account the boundary conditions defined in problem (1), the following weak form holds: find c i ∈ V δ such that The Lax-Milgram theorem [17] guarantees the existence and uniqueness of the solution of the weak problem (7), which can also be written as follows: Let B = fϕ j g N δ j=1 be a set of basis functions of V δ . The approximate solution c i ∈ V δ can be expressed as where c i,j , j = 1, ⋯, N δ are nodal values. By using Equations (9) and (10), the following discrete form of the problem holds: where A ∈ ℝ N δ ×N δ and f, c ∈ ℝ N δ . The entries of the matrix A are computed as a jk = Qðϕ k , ϕ j Þ. The components of the right-hand-side vector f are f j = Rðϕ j Þ, and the vector solution c is composed of the unknown coefficients c i,j . The finite element method presented here refers to the so-called Galerkin approximation [20] since the trial space, where the solution c i belongs, is the same as the space of test functions. The linear system (11) resulting from this discretization can be solved using either direct solvers like LU factorization [21] or iterative methods like generalized minimal residual [22]. It is often essential to use preconditioning techniques [22], including domain decomposition methods [23], for solving the linear system (11) since it is in general ill-conditioned.

Parallel Implementation
Libraries for finite element solutions of problems arising from partial differential equations are commonly used by mathematicians and engineers. These packages provide in general robust and powerful tools for the implementation of advanced numerical methods and support highperformance computing features. There are many opensource and popular libraries that offer these capabilities in the literature: Feel++ [24], FreeFem++ [25], FEniCS Project [26], GetFEM [27], and deal.II [28]. For this project, we develop a new robust framework instead of using existing packages because these are mostly very generic and therefore do not sufficiently support some specific features related to this work. It is therefore an opportunity to make this framework evolve in an independent project using flexible approaches that lead to the intended goals.
Tremendous progress in the design of parallel computers over the last decades and the recent advances in multicore processors and hardware accelerator technologies have opened a new era in high-performance computing for the solution of highly demanding complex problems in simulation-based applied science and engineering. We aim to reproduce the dynamical behaviors of some atmospheric pollutants by solving the air quality model (1) in a computational domain covering the entire region of Bamako (with an area of about 267 km 2 ) using a sufficiently fine spatial resolution. Performing such computations in practice within a reasonable time requires the recourse to parallel computing [29,30].
In this context, we develop a C++ framework, called AirQMB, using modern C++17 programming features. The structure of the main core of this model code and the parallelization strategy were inspired by the parallel computational framework introduced in [19]. We consider the CPU-based distributed-memory programming approach using the Message Passing Interface (MPI) [31]. The communications between processor cores are handled by using Boost.MPI, a useful Boost library [32] that abstracts layers over the standard MPI for simplifying the user interface.
For the parallel finite element analysis, the first essential step we are interested in is mesh processing. The computational mesh, based on a map extracted from OpenStreetMap [33], is generated in preprocessing over the region of interest by using an open-source mesh generator called Gmsh [34]. The software package METIS [35] is used for partitioning the generated serial mesh so that the number of mesh partitions corresponds to the number of processor cores that will be deployed for numerical simulations. This package is called from the Gmsh interface. The parallel (partitioned) finite element mesh, already built, is loaded independently and concurrently by all processor cores. Each processor core extracts locally only the partition that number corresponds to its local rank plus the interprocess (ghost) elements so that all elements needed for the finite element basis functions are locally available. Therefore, interprocess communications are not required for the parallel finite element assembly.
A crucial step that follows mesh processing refers to finite element discretization. It consists of constructing the set (or table) of degrees of freedom (DOFs) and is achieved independently on the local mesh within each processor core. A local numbering of DOFs is then introduced independently within each processor using a minimum-bandwidth numbering procedure. This local numbering will be connected with the global numbering associated with the entire problem through relationships called local-to-global maps (LtGMs). The construction of these maps requires global communication involving all the deployed processors. The parallel global assembly from the elementary matrices built locally on processor cores is achieved using LtGMs. The serial and parallel algebraic operations involving matrices and vectors, including direct and iterative solvers for the solution of linear systems, are handled by PETSc [36] wrappers.
The input data required for air quality simulations, especially the wind fields and traffic emissions, are uploaded 5 Journal of Applied Mathematics from files locally in each processor core into the computational framework using the NetCDF [37] library, an interface that supports the creation, access, and sharing of array-oriented scientific data. Once imported, these data are interpolated on the computational mesh using robust and efficient interpolation routines. The simulation environment is initialized by using a useful library called Boost Program Options. This tool makes it easy to parse command-line options for console applications.

Numerical Experiments
The numerical simulations presented here are focused on two primary air pollutants commonly used in the survey of air quality. The first one is the fine particulate matter (PM2.5) with an aerodynamic diameter less than or equal to 2.5 μm and the second refers to the inhalable particulate matter (PM10) with an aerodynamic diameter less than or equal to 10 μm. These pollutants, composed of a mixture of solid and liquid particles suspended in the air, are an essential air quality indicator since they are among the most widely distributed traffic-related air pollutants that affect short-term and long-term health.
We assume that the losses caused by deposit and scavenging are negligible, which means that the coefficient Λ = 0. The maximum altitude is fixed to H = 10m.  Figure 3. The number of mesh elements (tetra-hedra) is approximately equal to 1:7 × 10 6 . The number of mesh nodes is about 31 × 10 3 . The physical time-step Δt is fixed to 60 seconds, and the duration of the simulation is set to 24 hours. The first-order finite element approximation is used for the spatial discretization, and the wind data are obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF). The road traffic emission data files used as inputs of the model were generated in preprocessing from traffic simulations performed on the road network presented in Figure 4. In this network, we are interested in the main and most frequented roads that are subject to massive traffic congestions during daily peak hours. The traffic simulations were performed using an open-source road traffic simulation suite called SUMO [38]. The parameters needed as SUMO inputs, such as traffic flow, traffic density, vehicle types, vehicle engine, and mean velocity, are taken from the national databases provided by the Malian department in charge of transport and infrastructure.
The simulations have been performed on the "Centre de Calcul, Modélisation et Simulation" (CCMS) of the Faculty of Science and Technology (FST) of Bamako. CCMS is a cluster with ten compute nodes (servers) interconnected by an InfiniBand QDR network. These nodes are divided into two groups of two different machine types. In the first group, composed of five Dell PowerEdge servers, each node has two Intel Xeon Silver 4110 processors with 8 cores (2 threads per core) running at 2.10 GHz and 64 GB of RAM. In the second group, composed of five HPE ProLiant servers, each node has two Intel Xeon E5-2623 v4 processors with 4 cores (2 threads per core) cadenced at 2.60 GHz and 16 GB of RAM. The operating system is Ubuntu Server.
In the context of parallel computing, the computational scalability (or scaling) is a powerful tool widely used for expressing the ability of a given parallel algorithm to best  Journal of Applied Mathematics exploit a parallel computing platform. It is related to two common metrics, namely the strong scaling (or speedup) and weak scaling (or efficiency). The speedup is defined as the time spent to run a problem on one processor core divided by the time it takes to run the same problem (fixed size) on p (p > 1) processor cores. In other words, let T 1 be the execution time of a problem on a single processor core and T p be the parallel execution time of the same problem on p (p > 1) processor cores. The speedup with p processor cores is defined as S p = T 1 /T p . The efficiency with p processor cores is computed as E p = S p /p. The best possible efficiency is E p = 1. It is reached when the speedup is linear, i. e., S p = p. A generalization of these metrics for large-scale computing has led to the introduction of the relative speedup and relative efficiency. Let T r be the parallel  7 Journal of Applied Mathematics execution time of a problem on r processor cores and T p be the parallel execution time of the same problem on p (r < p) processor cores. The speedup relative to r processor cores when using p cores is S r,p = T r /T p . The best possible relative speedup is the linear speedup computed as S ℓ r,p = p/r. The efficiency relative to r processor cores when using p cores is defined as E r,p = rS r,p /p.
The strong and weak scaling results for this parallel computational framework when the number of processor cores varies from 8 to 128 are plotted in Figure 5.
Concerning the strong scaling analysis shown in Figure 5 (a), the speedup relative to 8 cores when using 128 processor cores is S 8,128 ≈ 15:02 while the linear relative speedup, i.e., the best possible value, is S ℓ 8,128 = 16. This result corresponds to a performance gain of about 93.88%. According to the weak scaling analysis presented in Figure 5(b), the efficiency relative to 8 processor cores when using 128 processor cores is E 8,128 ≈ 90:20%. As we can see, these weak and strong scaling results clearly show that this framework scales well up to 128 processor cores of a cluster computing architecture.

Conclusion
We have presented a computational framework for the parallel implementation of a local-scale air quality model and its application to the urban part of Bamako city. We briefly described the advection-diffusion-reaction partial differential equation that governs the spatiotemporal evolution of the concentrations of air pollutants. The numerical methods used for the time integration and the spatial discretization of the model have been presented. Particular emphasis has been placed on the parallel implementation based on the distributed-memory approach using Message Passing Interface (MPI) and modern C++17 programming features. The simulations were focused on two widely distributed road traffic-related air pollutants, namely particulate matters PM2.5 and PM10.
The scalability analyses presented here showed good speedup and efficiency up to 128 processor cores. These results are promising and allow us to reproduce state-ofthe-art dynamical behaviors of the atmospheric pollutants specified.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.