Model-Based Hardware-Software Codesign of ECT Digital Processing Unit

Image reconstruction algorithm and its controller constitute the main modules of the electrical capacitance tomography (ECT) system; in order to achieve the trade-off between the attainable performance and the flexibility of the image reconstruction and control design of the ECT system, hardware-software codesign of a digital processing unit (DPU) targeting FPGA system-on-chip (SoC) is presented. Design and implementation of software and hardware components of the ECT-DPU and their integration and verification based on the model-based design (MBD) paradigm are proposed. The inner-product of large vectors constitutes the core of the majority of these ECT image reconstruction algorithms. Full parallel implementation of large vector multiplication on FPGA consumes a huge number of resources and incurs long combinational path delay. The proposed MBD of the ECT-DPU tackles this problem by crafting a parametric segmented parallel inner-product architecture so as to work as the shared hardware core unit for the parallel matrix multiplication in the image reconstruction and control of the ECT system. This allowed the parameterized core unit to be configured at system-level to tackle large matrices with the segment length working as a design degree of freedom. It allows the trade-off between performance and resource usage and determines the level of computation parallelism. Using MBD with the proposed segmented architecture, the system design can be flexibly tailored to the designer specifications to fulfill the required performance while meeting the resources constraint. In the linear-back projection image reconstruction algorithm, the segmentation scheme has exhibited high resource saving of 43% and 71% for a small degradation in a frame rate of 3% and 14%, respectively.


Introduction
Electrical capacitance tomography (ECT) is an industrial process tomography technique for imaging the distribution of materials inside a region of interest [1,2]. Visualizing multiphase flows, such as gas/oil flow in oil pipes, is one of the most significant applications of ECT [3]. The ECT system consists of three main components: capacitance sensors, a data acquisition unit, and the ECT digital processing unit (ECT-DPU), as shown in Figure 1 [4]. Measured capacitance data are sent wirelessly to a base station attached to the ECT-DPU, where an image reconstruction algorithm is implemented to produce an image describing the material distribution inside the imaging area [5,6].
An ECT image reconstruction algorithm is typically realized as software on a general-purpose processor [7], but under stringent time constraints, dedicated hardware can be used [6] to achieve real-time operation. The most important driving factors in embedded system design are performance and flexibility. A software implementation offers high flexibility and low design effort, but its performance gain is low. In contrast, the intrinsic parallelism of hardware delivers high system performance, but at a high design complexity overhead. The hardware-software (HW/SW) codesign of the ECT digital processing unit proposed in this paper allows a trade-off between attainable performance and flexibility.
Recently, the FPGA SoC has become an appropriate platform for embedded hardware-software implementation, and it is a suitable candidate platform for ECT-DPU realization [8,9]. The traditional design of an embedded SoC develops the hardware and software components in two separate branches of the design flow, using different tool-sets [10,11]. The hardware part is modeled and simulated based on handwritten HDL code [12], and the software is modeled and cross-compiled using a different set of tools. Designing and implementing the ECT-DPU software and hardware components in this traditional way, and then integrating and verifying them, requires great effort and is error-prone. These issues can be managed using the model-based design approach.
Model-based design (MBD) is a model-centric approach widely used in embedded system design [13,14]. It enables the use of an executable system model throughout the whole design cycle, from the system level down to the implementation. MBD is a system-level approach that applies refinements and transformations to the abstract system model, taking it from algorithm design and system-level simulation through HW/SW partitioning, automatic code generation for both the SW processor and the HW implementation, and test and verification, all in a single integrated platform. These refinement and transformation processes are carried out by the designated tools in the MBD tool-chain [15].
The model-based design has been used extensively in the implementation of software defined radio (SDR) systems on FPGA [16][17][18], in embedded control hardware/software codesign and realization on FPGA [19], and in image processing algorithms design and implementation on FPGA [20].
The image reconstruction is the main constituent module of the ECT digital processing unit. The inner-product constitutes the kernel operation of the matrix multiplication of numerous image and signal processing algorithms [21,22] and cryptography [23]. It is the core operation of the matrix-vector multiplication (MVM) used in linear-back projection (LBP) and the Landweber image reconstruction algorithms of the electrical capacitance tomography system [24][25][26].
Matrix-vector multiplication is a core macro-operation in most of the ECT image reconstruction algorithms. In this research, iterative linear-back projection (iLBP) [27] is used as the image reconstruction algorithm. Mathematically, matrix-vector multiplication (MVM) constitutes the pivotal computation structure in the iLBP image reconstruction algorithm, while the inner-product constitutes the kernel operation of the MVM. Both the inner-product and the matrix-vector multiplication possess inherent parallelism, which allows them to be executed in parallel on general-purpose graphics processing units [28] and multicore processors [29]. On the other hand, the intrinsically parallel structure of the FPGA makes it a promising platform for hardware implementation of the inner-product and the matrix-vector multiplication.
The FPGA realization of the matrix-vector multiplication algorithm has been tackled by many research works at the algorithmic level as well as at the bit-manipulation level [30][31][32]. Most of the FPGA implementations of these proposed parallel matrix multiplication structures at the algorithmic level are for small matrix dimensions. A fully parallel FPGA implementation of large matrix-vector multiplication consumes a huge number of FPGA resources and incurs long combinational path latency. Careful setting of the degree of parallelism, as well as careful design of the parallel structure, is essential to meet the stringent embedded system performance requirements while complying with the available FPGA resources.
This paper proposes a model-based hardware-software codesign flow of the digital processing unit for realizing the image reconstruction and control module of the ECT system on the FPGA SoC platform. Model-based design is proposed to fully automate and tune the design and implementation of the software and hardware components of the ECT-DPU, as well as their integration and verification. Another contribution of this paper is a parametric segmented parallel inner-product architecture that works as a shared hardware core unit for parallel matrix multiplication in the image reconstruction and control of the ECT system and in similar matrix-vector multiplication-based embedded system algorithms. This segmentation approach allows the designer to use the MBD to tune the design process to fulfill the required performance while meeting the FPGA resource constraint. In each design cycle, the ECT-DPU is simulated, tested, and verified, and the code is generated for both the FPGA fabric and the attached ARM processor in the FPGA SoC platform. System design using MBD with the proposed segmented architecture allows the system to be flexibly tailored to fulfill the required performance while meeting the resource constraint. The modeling equations of the proposed segmented architecture can be used to rapidly estimate the execution time and the required resources at the system level. Using MBD can greatly reduce the development time, shorten the design cycle, and spare the designer from remodeling the system in each design cycle. Our proposed solution for the image reconstruction and control module of the ECT system differs from that introduced in the previous work in [25,33].
The image reconstruction FPGA module in [25] is entirely a hardware system built around matrix decomposition at the bit level, whereas the image reconstruction hardware module of our SW/HW system is built around the proposed shared parallel segmented inner-product architecture. In addition, our parameterized MVM core unit is adjustable at the system level to tackle large matrices, with the segment length set by the designer to fulfill the required performance while meeting the FPGA resource constraint.
The rest of this paper is organized as follows: Section 2 explains the details of the ECT-DPU model, while Section 3 introduces the formulation of the matrix-vector multiplication problem. The modeling and the implementation of the proposed system are presented in Section 4. Finally, experiments are carried out to validate the proposed method.

Digital Processing Unit (ECT-DPU)
2.1. Image Reconstruction Algorithm. The capacitance is measured in sequence by setting one electrode as the transmitter and the rest as receivers, and then repeating the procedure with each subsequent electrode in turn [26]. Therefore, the number of independent measurements for an 8-electrode ECT system is 28, computed from $M = n(n-1)/2$ (Equation (1)), where M is the total number of collected capacitances and n is the number of electrodes. The linear forward model of the ECT is expressed as $C = SG$ (Equation (2)), where C holds the M measurements, G is the image matrix, N is the number of image pixels (256 for a 16x16 image), and S is the sensitivity matrix, defined for each element k as in Equation (3), in which $C^{l}_{i,j}$ is the capacitance vector when the imaging region is filled with a low permittivity material and $C^{h}_{i,j}$ is the capacitance vector when it is filled with the high permittivity one. In Equation (2), the number of image pixels N is much larger than the number of measurements M; therefore, the problem is ill-posed, and any small change in the measurements can cause a big difference in the image. Moreover, the sensitivity matrix is not square, so the reconstructed image cannot be computed by using $S^{-1}$ [34]. Reconstruction algorithms are classified into two types: noniterative and iterative. Linear-back projection (LBP), Equation (4), is one of the noniterative algorithms; it usually produces blurred images but requires little computation.
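The measurement count quoted above can be checked by enumerating the electrode pairs directly. The following is a minimal sketch; the function name `measurement_pairs` and the 1-based electrode numbering are illustrative assumptions, not part of the original system.

```python
from itertools import combinations


def measurement_pairs(n_electrodes):
    """Return the independent electrode pairs of an n-electrode sensor.

    Each unordered pair (i, j) with i < j is measured once, so the
    number of pairs is n(n-1)/2.
    """
    return list(combinations(range(1, n_electrodes + 1), 2))


pairs = measurement_pairs(8)
print(len(pairs))  # number of independent measurements M for n = 8
```

For n = 8 the list has 28 entries, matching the value of M stated in the text.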
Iterative algorithms such as iterative linear-back projection (iLBP), shown in Equation (5), provide more accurate images, but their time complexity is high and grows linearly with the number of iterations k.
In Equation (5), λ is the relaxation parameter, $SG_k$ is the forward problem solution, and k is the iteration number [34]. Typically, these algorithms involve a large number of matrix operations; therefore, implementing them on a parallel processing platform, rather than executing them sequentially on a PC, is crucial. For example, the LBP algorithm implemented on a 2.53 GHz i5 PC with 4 GB RAM generates a 32x32 (G = 1024 elements) image in more than 1.5 s.
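Assuming the forms of LBP and iLBP commonly used in the ECT literature, which agree with the symbols defined above (S, C, G, λ, and the forward solution $SG_k$), Equations (4) and (5) can be sketched as follows. This is a hedged reconstruction, not a verbatim copy of the original equations; in particular, any normalization applied to the LBP image is not shown.

```latex
\begin{align}
  G       &= S^{T} C, \tag{4} \\
  G_{k+1} &= G_{k} - \lambda\, S^{T}\!\left( S G_{k} - C \right). \tag{5}
\end{align}
```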
The solution of Equation (5) can be summarized in the following steps and is described by the flowchart in Figure 2:
1. An initial image is obtained by the LBP algorithm, Equation (4), using the sensitivity in Equation (3).
2. The forward problem, Equation (2), is solved to calculate a vector of capacitance measurements.
3. The differences between the calculated and the actual measurements are multiplied by $S^T$ to calculate the pixel errors.
4. The difference between the previous image and the pixel errors represents the new image.
5. Termination is reached when the difference in step 3 reaches a certain acceptable value.
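The steps above can be sketched in plain Python as follows. This is a floating-point illustration only, assuming the update form $G_{k+1} = G_k - \lambda S^T (S G_k - C)$; the helper names, the default relaxation value, the tolerance, and the tiny 2x3 sensitivity matrix are all hypothetical and are not taken from the ECT-DPU implementation.

```python
def matvec(A, x):
    """Matrix-vector product; each output element is one inner product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def ilbp(S, C, relaxation=0.1, tolerance=1e-6, max_iterations=1000):
    """Iterative linear-back projection sketch (assumed update form)."""
    St = transpose(S)
    G = matvec(St, C)                                # step 1: initial LBP image
    for _ in range(max_iterations):
        C_calc = matvec(S, G)                        # step 2: forward problem
        diff = [cc - c for cc, c in zip(C_calc, C)]  # calculated minus actual
        if max(abs(d) for d in diff) < tolerance:    # step 5: termination
            break
        pixel_err = matvec(St, diff)                 # step 3: pixel errors
        G = [g - relaxation * e for g, e in zip(G, pixel_err)]  # step 4
    return G


# Hypothetical toy problem: 2 measurements, 3 pixels.
S = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
C = [1.0, 2.0]
G = ilbp(S, C, relaxation=0.5)
print(G)
```

Every stage is a matrix-vector product, which is why the inner product is treated as the shared kernel in the hardware design.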

System Architecture. The ECT-DPU is responsible for image reconstruction and control of the ECT system. It consists of the image reconstruction subsystem (IR-unit) and the main DPU controller (DPU-C) as shown in Figure 3. The image reconstruction subsystem consists of the image reconstruction algorithm, the image reconstruction controller, and the associated memory and buffering blocks.
The core of the image reconstruction algorithm module (IR-alg) is matrix processing that realizes the three iLBP algorithm steps in Equation (5). The memory and buffering blocks required to store the measured input electrode capacitances, the constant sensitivity matrix, and the computed image pixels constitute the memory subsystem in the IR-unit. They are designated as the C Buffer, S-ROM, and IM-RAM blocks in Figure 3, respectively. The image reconstruction controller (IR-C) controls the IR-alg processing and coordinates the flow of data to and from the memory subsystem. It works as an interface between the image reconstruction subsystem and the main DPU controller.
The main DPU controller (DPU-C) is the interface to the external LCD and the wireless base station peripherals connected to the ECT-DPU system. It wirelessly collects the measured electrode capacitance data and sends them to the image reconstruction subsystem. At the end of frame processing, it collects the image-pixel vectors, stores them in the attached SDRAM, and displays them on the LCD.

Partitioning.
The most important driving factors in embedded system design are performance and flexibility. A software implementation offers high flexibility and low design effort, but its performance gain is low. In contrast, the intrinsic parallelism of hardware delivers high system performance, but at a high design complexity overhead. Typically, an embedded system is designed following the system-on-chip approach, where the whole-system components such as the processor, memory, dedicated hardware coprocessor, and the input-output peripherals are integrated in a single chip. The hardware-software (HW/SW) codesign of the embedded SoC allows a trade-off between attainable performance and flexibility. The partitioning of an application into software and hardware components is a key step in the embedded SoC HW/SW codesign [35].
Quantitative design metrics of the system building blocks are required to drive the partitioning process. These quantitative values, such as latency (execution time), area, and power, can be acquired using profiling, simulation, and static analysis of the system. The executable model and automatic code generation in the MBD allow convenient verification and profiling data collection to assist in the HW/SW partitioning decision. Analysis of the image reconstruction system exposes the computational intensity of the iLBP algorithm. Its core is a computation-intensive process of repeated matrix multiplication and addition in large loop iterations, which makes it a viable candidate for hardware realization on the FPGA fabric, where the required performance gain can be achieved. The image reconstruction controller (IR-C) controls the flow of data to and from the Block-RAM housing the sensitivity, capacitance, and image matrices and the IR-alg module. Because of this communication, it is more cost-effective to map the IR-C to the FPGA fabric than to the HPS software side.
On the other hand, the functionality of the DPU-C block as a control-flow-intensive state machine makes it a perfect candidate for software mapping on the ARM processor inside the FPGA (SoC) platform. In addition, a software implementation of the DPU-C block makes it possible to exploit the legacy software drivers for these peripherals. The IR-alg has to be fed with input data from the sensitivity matrix, S, as well as the capacitance vector, C. Because the sensitivity matrix is very large in the ECT system, a careful system-level mapping decision has to be made for the FPGA implementation. Since the sensitivity matrix has fixed constant elements, it can be hard-coded as part of the IR-alg module and modeled as a single MATLAB function block in the MBD approach. In this approach, the sensitivity matrix is left to the synthesis tools, which map it to registers scattered inside the FPGA fabric. Connecting such a large sensitivity matrix to the synthesized computation elements of the IR-alg module in this way consumes a huge number of FPGA routing resources and suffers from long routing paths that might violate the timing constraints.

Modelling and Simulation in Engineering
Alternatively, a modular system approach separates the memory requirement and its internal organization from the processing structure of the IR-alg module itself: the sensitivity matrix, S, is mapped to the FPGA Block-RAM, while the IR-alg algorithm is treated as a separate module that can be modeled as a MATLAB function block in the MBD approach. In this case, an entire block of the S matrix has to be ready and fed to the IR-alg module in each computation cycle. This model offers deterministic timing requirements and uses a small number of FPGA routing resources. The sensitivity matrix is modeled using this approach in our ECT-DPU system.
Based on the above reasoning as well as the collected profiling data, the ECT-DPU system is partitioned as illustrated in Figure 3. It shows the mapping of the ECT digital processing system to the hardware and software sides of the Cyclone V SoC FPGA platform.

Fixed-Point Representation and Word Length.
A fixed-point version of the iLBP algorithm has to be generated for low-cost hardware implementation as well as for high performance gain and energy efficiency. Although the fixed-point word length can be set manually inside the MATLAB code of the IR-alg module, it is more appropriate to generate it automatically from the floating-point model with the aid of the fixed-point conversion tool as part of the MBD workflow. Since increasing the word length consumes more hardware resources, the designer can guide the fixed-point conversion tool to set the word length in the fixed-point version of the iLBP algorithm so that it preserves a precision similar to that of its floating-point counterpart.
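The word-length trade-off can be illustrated with a generic signed fixed-point quantizer. This sketch does not model the MATLAB fixed-point conversion tool; the function `to_fixed`, the chosen word and fraction lengths, and the sample value are assumptions made only for illustration.

```python
def to_fixed(value, word_length, fraction_length):
    """Quantize a real value to signed fixed point with saturation.

    The stored integer has word_length bits, of which fraction_length
    are fractional. The function returns the real number represented.
    """
    scale = 1 << fraction_length
    lowest = -(1 << (word_length - 1))
    highest = (1 << (word_length - 1)) - 1
    stored = max(lowest, min(highest, round(value * scale)))
    return stored / scale


# Longer words give a smaller quantization error but cost more hardware.
sample = 0.3141
for word_length, fraction_length in [(8, 6), (12, 10), (16, 14)]:
    quantized = to_fixed(sample, word_length, fraction_length)
    print(word_length, fraction_length, quantized, abs(quantized - sample))
```

The printed error shrinks as the word length grows, which is the precision side of the resource trade-off described above.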

Matrix-Vector Multiplication Segmentation Scheme
3.1. Matrix-Vector Multiplication. The matrix-vector multiplication (MVM) constitutes the pivotal computation structure in the LBP and iLBP image reconstruction algorithms in Equations (4) and (5), respectively, while the inner-product constitutes the kernel operation of the MVM. This section introduces the segmented inner-product architecture to serve as the core unit for resource-sharing-based approach of each matrix-vector multiplication stage of the image reconstruction algorithm. It is required to design and implement an efficient FPGA hardware architecture for the large matrix-vector multiplication to meet the real-time performance requirements without violating the hardware resource constraints. The proposed solution is to build each matrix-vector multiplication stage around a shared segmented parallel inner-product architecture to achieve the performance/resource-usage trade-off.
In this section, for the generic MVM problem, we will follow the generic XY notation. Then, the matrix and vector names, as well as their indices, are replaced with the corresponding notation of each stage in the iterative linear-back projection (iLBP) algorithm in Equation (5).
Let Y and X be one-dimensional fixed-point data vectors with sizes of M × 1 and N × 1, respectively. The MVM, Y = AX, is represented element-wise as $y_i = A_i X = \sum_{j=1}^{N} a_{ij} x_j$ for $i = 1, \dots, M$, where $A_i$ is the i-th 1 × N row vector of the M × N matrix A. The system design runs at an operating frequency with clock period $t_{ck}$, which equals $t_m$ in the case of the multicycle path realization and equals the combinational path delay $d_{cp}$ in the case of the single-cycle path realization. The multicycle combinational path realization can also be set by the FPGA SDC timing tools (cf. the Altera TimeQuest [10] analysis tool). The data elements of the inner-product vectors have to be made available to the computation hardware by reading them from their memory locations, such as the FPGA Block-RAM.

Parallel Inner-Product. The fully serial realization of MVM can be implemented in hardware using a single multiply-accumulate unit with a controller that generates the row and column indices in a nested loop, similar to a software implementation, as introduced in [12]. Although this serial implementation requires only a single multiplier and a single adder, it suffers from a long computation time of MN computation cycles. On the other hand, a fully parallel implementation of MVM can achieve high performance and be realized in a single computation cycle [36], but at the cost of a huge number of FPGA resources: it requires MN multipliers and M(N − 1) adders. A performance/resource-usage trade-off is a viable approach to meet the embedded system time constraint and/or achieve high performance while staying within the available FPGA resources. Building the matrix-vector multiplication around a shared parallel inner-product architecture can fulfill this trade-off.
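The two extremes described above can be compared with a small software model. The sketch below counts one computation cycle per multiply-accumulate for the serial case and reports the multiplier and adder counts stated in the text for the fully parallel case; the function names are illustrative and the model ignores memory access and control overhead.

```python
def serial_mvm(A, x):
    """Serial MVM with a single multiply-accumulate unit.

    Returns the result vector and the number of MAC cycles used,
    which is M*N for an M x N matrix.
    """
    y, cycles = [], 0
    for row in A:
        acc = 0
        for a, b in zip(row, x):
            acc += a * b      # one multiply-accumulate per cycle
            cycles += 1
        y.append(acc)
    return y, cycles


def fully_parallel_resources(M, N):
    """Resource counts of the fully parallel MVM, as stated in the text."""
    return {"multipliers": M * N, "adders": M * (N - 1)}


y, cycles = serial_mvm([[1, 2, 3], [4, 5, 6]], [1, 1, 1])
print(y, cycles)
print(fully_parallel_resources(28, 1024))
```

For a 28 x 1024 matrix the fully parallel variant would need tens of thousands of multipliers, which motivates the shared inner-product unit.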
Exploiting the intrinsic parallelism among the multiplication operations of the inner-product procedure to build a parallel inner-product architecture increases the performance at the cost of increasing the required resources. The architecture consists of N multipliers and (N − 1) adders in order to compute the inner-product of two vectors of length N in a single computation cycle. The combinational path delay is shortened by performing the multiplications in parallel for all pairs of elements of the inner-product input vectors; the multiplication results are then summed with a set of adders to produce the final inner-product result, as shown in Figure 4.
The parallel inner-product architecture with a multicycle path realization of its combinational path requires the input data to be read from their memory locations in the FPGA Block-RAM with a read-time, RT, as in Equation (9), while its single-cycle realization incurs the input-data read-time in Equation (10). Comparing the two shows that the multicycle path realization reduces the input-data read-time by a factor of $K_c = \lceil d_{cp}/t_m \rceil$ relative to the single-cycle path realization. On the other hand, the computation time is almost the same in both cases: $\lceil d_{cp}/t_m \rceil \cdot t_m$ for the multicycle path realization and $d_{cp}$ for the single-cycle path realization.
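A short numeric sketch may help with the two computation-time expressions above. The 37 ns path delay is a hypothetical value chosen for illustration; the 10 ns clock period corresponds to a 100 MHz clock.

```python
import math


def multicycle_count(d_cp_ns, t_m_ns):
    """K_c = ceil(d_cp / t_m): clock cycles spanned by the combinational path."""
    return math.ceil(d_cp_ns / t_m_ns)


def computation_time_ns(d_cp_ns, t_m_ns, multicycle=True):
    """Computation time of one inner product for either realization."""
    if multicycle:
        return multicycle_count(d_cp_ns, t_m_ns) * t_m_ns
    return d_cp_ns


d_cp, t_m = 37, 10  # hypothetical path delay and clock period, in ns
print(multicycle_count(d_cp, t_m))
print(computation_time_ns(d_cp, t_m, multicycle=True))
print(computation_time_ns(d_cp, t_m, multicycle=False))
```

The multicycle computation time exceeds the single-cycle one only by the rounding up to a whole number of clock periods, which is why the two are described as almost the same.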
3.4. Segmented Inner-Product Architecture. The combinational path delay tends to be long for large vectors, which leads to a long computation cycle and a large number of resources. Both the resource usage and the combinational path delay can be greatly reduced by dividing the inner-product input vectors into multiple segments of length SL, so that the complete-vector inner-product of length N vectors is finished in $N_s = \lceil N/SL \rceil$ computation cycles, where $N_s$ denotes the number of segments. In each computation cycle, the resultant segmented inner-product is added to the previously buffered partial inner-product, as expressed by Equation (13). Following the notation in Equation (6), where $A_i$ and X are 1 × N and N × 1 vectors, respectively, the inner-product can be written as $y_i = A_i X = \sum_{j=1}^{N} a_{ij} x_j$ (Equation (11)). Then, the partial inner-product of a single segment s of length SL is written as $y_{is} = \sum_{j=(s-1)SL+1}^{\min(s \cdot SL,\, N)} a_{ij} x_j$ (Equation (12)). Thus, the inner-product can be written in the segmented form as $y_i = \sum_{s=1}^{N_s} y_{is}$ (Equation (13)), where the partial inner-product of each segment is computed in a single computation cycle. In each computation cycle, the segmented inner-product unit has to be fed with only a segment of length SL from both input vectors, instead of the whole vectors, which greatly shortens the combinational path delay and reduces the required hardware resources. Figure 5 illustrates the segmented inner-product architecture.
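The segmentation scheme can be mirrored in software as a loop over segments with a buffered partial sum. This is a behavioral sketch only; the function name and the handling of a short final segment are assumptions, and each loop iteration stands for one hardware computation cycle.

```python
import math


def segmented_inner_product(a, x, segment_length):
    """Inner product computed segment by segment.

    Returns the result and the number of segments N_s = ceil(N / SL),
    which is also the number of computation cycles.
    """
    if len(a) != len(x):
        raise ValueError("vectors must have the same length")
    n = len(a)
    n_segments = math.ceil(n / segment_length)
    partial_sum = 0  # buffered partial inner-product
    for s in range(n_segments):
        start = s * segment_length
        stop = min(start + segment_length, n)  # last segment may be shorter
        # One computation cycle: SL parallel products reduced by the adders.
        segment_result = sum(a[j] * x[j] for j in range(start, stop))
        partial_sum += segment_result          # accumulate with the buffer
    return partial_sum, n_segments


a = list(range(1, 11))
x = [1] * 10
print(segmented_inner_product(a, x, 4))
print(segmented_inner_product(a, x, 10))
```

Both calls return the same inner product; only the number of computation cycles changes with the segment length.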
It requires SL parallel multipliers and SL adders, and it experiences a combinational path delay, $d_{cps}$, expressed by Equation (14) as $d_{cps} = d_m + d_{a\_sl}$, where $d_m$ and $d_{a\_sl}$ are the propagation delays through the multiplier and through the set of adders required for the inner-product of length SL vectors, respectively. This combinational path delay is realized as multiple clock-cycles via a delay-counter that enables writing the partial inner-product to a memory buffer at the end of the computation cycle, at clock-cycle $K_c = \lceil d_{cps}/t_m \rceil$. In order to reduce the propagation delay, the set of adders is organized as a tree-like structure.
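The benefit of the tree-like adder organization can be seen by counting adder levels in a pairwise reduction. The sketch below is a software analogy under the assumption of a balanced binary tree; the source does not give the exact tree structure or a closed form for the adder delay, so the level count is only indicative of how the delay might scale with SL.

```python
def adder_tree_sum(values):
    """Sum values by pairwise (tree) reduction.

    Returns the sum and the number of adder levels on the longest path,
    which is ceil(log2(len(values))) for a balanced binary tree.
    """
    level = list(values)
    if not level:
        raise ValueError("at least one value is required")
    depth = 0
    while len(level) > 1:
        reduced = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd element passes through unchanged
            reduced.append(level[-1])
        level, depth = reduced, depth + 1
    return level[0], depth


print(adder_tree_sum(range(1, 9)))   # 8 products
print(adder_tree_sum(range(1, 65)))  # 64 products
```

Going from 8 to 64 inputs adds only three adder levels in this model, whereas a linear chain of adders would grow with the number of inputs.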

For length N vectors, the segmented inner-product architecture requires the same input-data read-time as the nonsegmented architecture, expressed in Equation (9). On the other hand, it dictates the computation time given by Equation (15). Equation (15) reveals that the segmented inner-product architecture incurs some computation time overhead compared to the nonsegmented parallel architecture in Equation (8). As will be shown in the experimental results, this overhead is very small compared to the advantage of a massive decrease in the number of required resources.
The execution time, ET, for completing the inner-product calculation is the sum of the input-data read-time, RT, and the computation time, CT. Assuming that each data element requires a read time of a single clock-cycle, the execution time of the length N inner-product, represented as the number of clock-cycles, is given by Equation (16). The segment length is a design degree of freedom that allows the trade-off between the performance and the resource usage and determines the level of computation parallelism as well as the maximum number of input ports. The number of parallel input ports is another design degree of freedom for more performance gain. Increasing the number of parallel input ports, by using multiple Block-RAMs to feed the segmented inner-product unit with multiple elements of the two input vectors, $A_i$ and X, simultaneously, decreases the read-time and thereby increases the performance gain.
Let p denote the number of parallel input ports, with p ≪ N; the execution time of the length N inner-product, represented as the number of clock-cycles, then becomes that given by Equation (17). The proposed segmented architecture turns the inner-product unit into a parametric module, with the segment length set by the designer to fulfill the required performance while meeting the resource constraint. Table 1 compares the hardware requirements of the parallel inner-product architectures that serve as the kernel operations in matrix multiplication algorithms.

3.5. Segment-Based Matrix-Vector Multiplication Architecture. The segmented inner-product architecture serves as the core unit for resource-sharing-based approach of matrix multiplication; and each matrix-vector multiplication stage of the image reconstruction algorithm is built around this shared parallel segmented inner-product architecture.
Using Equation (13) and letting $Y_s = [y_{1s}\ y_{2s}\ \cdots\ y_{Ms}]^T$, the generic matrix-vector multiplication, Y = AX, is written in the segmented form $Y = \sum_{s=1}^{N_s} Y_s$, as in Equation (18), and represented in Equation (20).
Table 1: The number of FPGA resources used in different inner-product unit proposals. Columns: Qasim [12], Tiwari [33], Segmented (proposed).
In the model-based design at the system-level, the segment length works as an input to the design flow. First, the designer calculates the estimated execution time (using Equation (21)) for a small segment length, in order to preserve the HW resources, and checks whether the required performance is achieved. If it is not, the segment length is increased and the estimated execution time is recalculated until the performance requirement is met. By inserting the chosen segment length in the system-level model, the system design can be flexibly tailored to the designer specifications to fulfill the required performance while meeting the resource constraint. Coupling the allocated segment length with the MBD procedures greatly reduces the development time and effort and spares the designer from remodeling the system in each design cycle.
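The selection procedure described above, starting from a small segment length and increasing it until the estimate meets the requirement, can be sketched as a simple search. The estimator below is a hypothetical stand-in, since Equation (21) is not reproduced here: it assumes N single-cycle reads plus ⌈N/SL⌉ computation cycles of a fixed K_c clock cycles each, and it ignores the dependence of K_c on SL.

```python
import math


def estimate_cycles(n, segment_length, k_c=1):
    """Hypothetical cycle estimate for one length-n inner product.

    Assumed model (not the paper's Equation (21)):
    n read cycles + ceil(n / SL) computation cycles of k_c clocks each.
    """
    return n + math.ceil(n / segment_length) * k_c


def choose_segment_length(n, cycle_budget, candidates, estimator=estimate_cycles):
    """Return the smallest candidate SL whose estimate meets the budget.

    Trying small segment lengths first preserves hardware resources.
    Returns None when no candidate satisfies the budget.
    """
    for segment_length in sorted(candidates):
        if estimator(n, segment_length) <= cycle_budget:
            return segment_length
    return None


candidates = [4, 8, 16, 32, 64]
print(choose_segment_length(64, 72, candidates))
print(choose_segment_length(64, 60, candidates))
```

Passing the estimator as a parameter keeps the search independent of the timing model, so a more faithful model can be substituted without changing the loop.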
Based on this segment-based MVM architecture, the iLBP architecture is shown in Figure 6. The first LBP matrix-vector multiplication stage of the iLBP algorithm is organized so as to feed each row vector of the $S^T$ matrix, as well as the C vector, to the segmented inner-product unit in $\lceil M/SL \rceil$ computation cycles. In this architecture, the LBP MVM requires $N \cdot \lceil M/SL \rceil$ computation cycles. Similarly, the second and third iLBP MVM stages require $M \cdot \lceil N/SL \rceil$ and $N \cdot \lceil M/SL \rceil$ computation cycles, respectively.
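The per-stage computation cycle counts stated above can be evaluated for a concrete configuration. The sketch assumes M = 28 measurements (8 electrodes) and N = 1024 pixels (a 32x32 image), both taken from earlier in the text, and a hypothetical segment length of 8; the function name is illustrative.

```python
import math


def ilbp_stage_cycles(M, N, SL):
    """Computation cycles of the three iLBP MVM stages.

    Stage 1 (LBP, S^T times C):      N inner products of length M.
    Stage 2 (forward, S times G):    M inner products of length N.
    Stage 3 (S^T times differences): N inner products of length M.
    """
    lbp = N * math.ceil(M / SL)
    forward = M * math.ceil(N / SL)
    back_projection = N * math.ceil(M / SL)
    return lbp, forward, back_projection


stages = ilbp_stage_cycles(M=28, N=1024, SL=8)
print(stages, sum(stages))
```

These counts cover computation cycles only; input-data read cycles and controller overhead are not included.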

System Modeling and Implementation
4.1. Hardware Platform. The Altera Cyclone V SoC FPGA platform [8], an evaluation board, is used as the SoC platform for the ECT-DPU system implementation. It includes the Cyclone V 5CSXFC6D6F31C6 FPGA device, which integrates a multicore ARM processor subsystem with the FPGA fabric [37], in addition to DDR3 memory and common interface controllers. The dual-core ARM Cortex-A9 MPCore processor, operating at 925 MHz and connected to a rich set of peripherals, constitutes the hard processor system (HPS) side of the Cyclone V SoC device. The communication between the HPS and the FPGA fabric is achieved through the standard AXI4 bridge. The HPS-FPGA AXI bridges allow masters in the FPGA fabric to communicate with slaves in the HPS logic, and vice versa [37].

System-Level Modeling and Simulation.
MathWorks has introduced a complete model-based design platform based on its MATLAB environment [15]. It covers the whole design flow from modeling and simulation using the MATLAB/Simulink down to the deployment on the FPGA SoC platform. MATLAB's MBD workflow depends on two key technologies, the "HDL Coder" toolbox used to generate synthesizable HDL code and the Embedded Coder toolbox for the embedded C code generation, from both MATLAB code and Simulink model.
The ECT-DPU system design procedures follow the MBD approach. It is based on the MATLAB MBD flow using the MATLAB HDL Coder and Embedded Coder toolboxes [15]. For a complete process, these toolboxes are linked to the hardware synthesis tools: Altera suite of development tools, Quartus II, and the software compilation tools: the SoC Embedded Design Suite (SoC-EDS) [11].
The ECT-DPU based on the proposed segment-based MVM architecture is modeled using MATLAB Simulink and HDL Coder toolboxes. Its functional behavior is verified at the system-level. The equivalent VHDL code is generated, synthesis and timing analysis are performed via integration with Altera Quartus II, and the design is deployed to the Altera Cyclone V FPGA device. These FPGA design steps are automated using the HDL Workflow Advisor of the HDL Coder.
The image reconstruction subsystem of the ECT-DPU system is coded with HDL-synthesizable MATLAB code, while the main DPU controller is modeled with C-compatible MATLAB code. Both are modeled as MATLAB function blocks inside a Simulink model. In MBD, the system-level simulation of the cycle-accurate model allows functional verification as well as cycle-based performance measurement. The MATLAB Simulation Data Inspector tool is used for this purpose to inspect and verify the behavior of the segmented inner-product core used inside each of the three stages of the iLBP IR-alg, as well as the handshaking signals between the DPU-C and IR-C controller modules. Figure 7 illustrates the cycle-level timing behavior of the segmented inner-product core in a single MVM stage for an example with a 5x32-element matrix and a 32-element vector, with the SL parameter set to 8. In addition, the HPS-to-FPGA, h2f, and FPGA-to-HPS, f2h, handshaking signals are shown in Figures 8(a) and 8(b).

Code Generation.
Using the HDL Workflow Advisor tool of the HDL Coder toolbox, the HDL code for the image reconstruction subsystem (IR-alg and IR-C modules) is generated as an IP-core. In the same process, the IP interface logic as well as its abstract software interface model to the ARM processor is automatically generated according to the AXI4 interface standard.
Figure 7: Timing behavior of segmented inner-product.
The C code of the DPU-C for the ARM processor, on the other hand, is generated by the Embedded Coder toolbox so that it connects to the generated AXI4 interface model. The compiled executable of the generated C code and the bitstream configuration file of the generated HDL code can be deployed to the FPGA platform directly from within the HDL Workflow Advisor tool. The generated C code for the DPU-C is ready to be integrated with the rest of the software components of the complete ECT-DPU system. Using the Qsys tools of the Quartus II CAD system, the generated HDL IP core for the image reconstruction subsystem can be reused within other related ECT-DPU systems. In each design cycle, the segment length parameter of the segment-based MVM architecture, used in the three stages of the iLBP image reconstruction algorithm, is set to one of the test values as an input to the design flow at the system level. The ECT-DPU is then simulated, tested, and verified, and the code is generated for both the FPGA fabric and the HPS ARM processor.
Using MBD can greatly reduce the development time, shorten the design cycle, and avoid remodeling the system in each design cycle. Figure 9 shows, for 64-element vectors as an example, the effect of changing the segment length parameter of the segment-based MVM architecture on the required hardware resources and the execution time overhead. The segment length of the segmented inner-product shared core unit is decreased from the full-length vector of 64 elements down to 4 elements. The figure plots the required hardware resources, exemplified by the number of multipliers, alongside the execution time overhead as a percentage of the nonsegmented version. The analytical model in Equation (16) is used to generate the calculated data as the first step at the highest level of the MBD scheme. Further down the flow, after code generation and implementation, the synthesized FPGA hardware resources are recorded in Table 2 for these segment length test values. The Altera TimeQuest [10] analysis tool, targeting the CVSoC FPGA device at an operating frequency of 100 MHz, is used to obtain the propagation delay of the combinational path components. The propagation delay of the synthesized segmented inner-product is compared against the calculated analytical data in Figure 10 for 64-element vectors. The comparison shows that the analytical model in Equation (16) closely matches the synthesized results.
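The resource side of this trade-off can be illustrated with a simple count. The sketch below is not Equation (16); it only assumes that a segmented core of length SL needs SL parallel multipliers and that a full inner product then takes ceil(L/SL) passes through the core. The function name and the list of test values are hypothetical.

```python
import math


def segment_tradeoff(vector_length, segment_length):
    """Return (multipliers, passes) for one segmented inner product.

    Assumes one multiplier per element of a segment and one pass per segment.
    """
    multipliers = segment_length
    passes = math.ceil(vector_length / segment_length)
    return multipliers, passes


# Sweep the segment length from the full 64-element vector down to 4 elements.
sweep = {sl: segment_tradeoff(64, sl) for sl in (64, 32, 16, 8, 4)}
for sl, (multipliers, passes) in sweep.items():
    print(f"SL={sl:2d}: {multipliers:2d} multipliers, {passes:2d} passes")
```

Under these assumptions, halving the segment length halves the multiplier count and doubles the number of passes, which is the linear resource trend the section describes.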
Both the calculated data and the synthesized FPGA hardware resources show that reducing the segment length reduces the FPGA hardware resources linearly, and they illustrate its impact on the trade-off between performance and resource usage. This illustrates how the MBD approach can help the designer of the ECT-DPU to experiment with input parameters to the design flow, starting at the system-level model and going down to the FPGA implementation, without the need to remodel the system in each design cycle.
The segmentation effect on the LBP image reconstruction algorithm, Equation (4), is illustrated in Figure 11. The frame rates are given in kilo frames/sec for a 28x256-element sensitivity matrix in the segment-based MVM architecture. The figure shows that the segmentation scheme achieves high resource savings of 43% and 71% (corresponding to segment lengths of 57% and 29% of the full-length vector) for a small frame-rate degradation of 3% and 14%, respectively.
Figure 10: Propagation delay of the segmented inner-product.
Figure 11: LBP frame rate (frames/sec, in thousands).
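The relation between the quoted segment lengths and the resource savings can be checked with simple arithmetic. The sketch below assumes, as an inference not stated in the text, that the full-length vector is the 28-element capacitance vector, that the two test segment lengths are 16 and 8, and that resources scale with the number of multipliers.

```python
FULL_LENGTH = 28  # assumed full-length vector (28 capacitance measurements)


def multiplier_saving(segment_length, full_length=FULL_LENGTH):
    """Return (segment length, saving) as rounded percentages of the full length."""
    fraction = segment_length / full_length
    return round(100 * fraction), round(100 * (1 - fraction))


# Assumed test values; the source only quotes the percentages.
summary = {sl: multiplier_saving(sl) for sl in (16, 8)}
```

Under these assumptions, a segment length of 16 is 57% of the full vector and saves 43% of the multipliers, while a segment length of 8 is 29% and saves 71%, matching the reported figures.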
The pipelined architecture of the three-stage iLBP image reconstruction algorithm, Equation (5), achieves the same frame throughput.
The proposed algorithm is validated using synthetic data collected from the ECT system shown in Figure 1 with N = 8 electrodes. These electrodes are uniformly distributed around the 2-D plane of the vessel to be imaged. The voltage is applied to one electrode at a time, which acts as the transmitter, while the remaining electrodes act as receivers. By measuring the charges produced on these electrodes, M = N(N − 1)/2 = 28 independent mutual capacitance values are collected. Typically, the quality of the generated images can be enhanced by increasing the number of sensing electrodes around the imaging area. However, this radically increases the complexity of the measuring circuit as well as the cost of the hardware design of the system.
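The measurement count can be verified by enumerating the electrode pairs directly. This is a small illustrative sketch; the function name and the 1-based electrode numbering are assumptions.

```python
from itertools import combinations


def electrode_pairs(n):
    """List every unordered (transmitter, receiver) pair for n electrodes."""
    return list(combinations(range(1, n + 1), 2))


pairs = electrode_pairs(8)
measurement_count = len(pairs)  # equals N(N - 1)/2
```

Each unordered pair is counted once because the mutual capacitance between electrodes i and j is the same regardless of which one is excited.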
The results shown here represent two different permittivity variations. The mutual capacitances between the electrodes corresponding to each permittivity distribution were estimated using the FEM model [4]. The FEM mesh consists of 720 linear triangular elements. An image area of 16x16 pixels is selected in the center of the FEM model to reduce the complexity of the computation. Therefore, a sensitivity matrix S of size 28x256 and a measured capacitance vector C of 28 elements were used in the experiments. The permittivity value of the inhomogeneous material is 1.8, while the permittivity value of the empty area is 1.0. The sensitivity matrix is also calculated from the FEM model based on Equation (3). The relaxation parameter was tuned to λ = 10⁻³, which gave better reconstructed images.
The MBD approach for the ECT-DPU is tested and verified for both the LBP and iLBP image reconstruction algorithms implemented on the FPGA. The capacitance measurements and the sensitivity values are stored in the SDRAM. The LBP and iLBP image reconstruction algorithms are applied to detect the distribution of the materials inside the imaging area. Two objects separated by a distance are used to test both algorithms, as shown in Figure 12. The LBP reconstructed image on the FPGA is shown in Figure 12. Figure 12 demonstrates the ability of the FPGA implementation of the ECT-DPU to detect multiple objects from the capacitance measurements of the ECT system. Figures 12(c) and 12(d) verify that the iLBP algorithm detects the size and location of the objects more accurately than the LBP algorithm. In addition, this illustrates the trade-off between the number of iLBP iterations and the quality of the reconstructed image resulting from the ECT-DPU FPGA implementation.
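The two reconstruction algorithms can be sketched in software as follows. Equations (4) and (5) are not reproduced in this section, so the sketch assumes the commonly used forms: LBP as a single back projection G = SᵀC, and iLBP as a Landweber-type iteration G ← G + λSᵀ(C − SG). Normalization and any constraints applied in the actual ECT-DPU are omitted, and the small matrix, the function names, and the relaxation value of 0.1 are toy assumptions chosen so that the example converges quickly.

```python
def matvec(m, v):
    """Matrix-vector product built from row-wise inner products."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lbp(s, c):
    """Linear back projection (assumed form): G = S^T C."""
    return matvec(transpose(s), c)


def ilbp(s, c, relaxation, iterations):
    """Iterative LBP (assumed Landweber form), starting from a zero image."""
    st = transpose(s)
    g = [0.0] * len(s[0])
    for _ in range(iterations):
        projected = matvec(s, g)                               # forward projection S G
        residual = [ci - pi for ci, pi in zip(c, projected)]   # measurement error
        correction = matvec(st, residual)                      # back projection S^T e
        g = [gi + relaxation * di for gi, di in zip(g, correction)]
    return g


# Toy 3x4 sensitivity matrix and consistent measurements (not ECT data).
S = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 1, 0, 0]]
c = [1, 1, 1]
g_lbp = lbp(S, c)
g_ilbp = ilbp(S, c, relaxation=0.1, iterations=200)
residual_norm = sum((ci - pi) ** 2 for ci, pi in zip(c, matvec(S, g_ilbp))) ** 0.5
```

Each iteration needs one product with S and one with Sᵀ, which is why a shared inner-product core is a natural building block for both algorithms.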
The error between the real and the reconstructed images is calculated according to Equation (22) [26], where Ĝ is the reconstructed image and G is the real image distribution. The quality of the images increases as the relative error decreases. The error over 300 iterations for the objects shown in Figure 12 is illustrated in Figure 13. The error decreases as the number of iterations increases.
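A relative image error of this kind can be computed as sketched below. Equation (22) is not reproduced in this section, so the sketch assumes a common definition, the Euclidean norm of Ĝ − G divided by the Euclidean norm of G; the definition actually used in [26] may differ. The function name and the four-pixel test images are hypothetical.

```python
import math


def relative_image_error(reconstructed, reference):
    """Assumed relative error: ||G_hat - G|| / ||G|| with Euclidean norms."""
    if len(reconstructed) != len(reference):
        raise ValueError("images must have the same number of pixels")
    diff_norm = math.sqrt(sum((r - g) ** 2 for r, g in zip(reconstructed, reference)))
    ref_norm = math.sqrt(sum(g ** 2 for g in reference))
    if ref_norm == 0:
        raise ValueError("reference image must be nonzero")
    return diff_norm / ref_norm


# Toy four-pixel images using the permittivity values 1.0 (empty) and 1.8 (object).
reference = [1.0, 1.0, 1.8, 1.8]
perfect = relative_image_error(reference, reference)
one_pixel_missed = relative_image_error([1.0, 1.0, 1.8, 1.0], reference)
```

A value of zero means the reconstruction matches the reference exactly, and larger values indicate a poorer reconstruction.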

Conclusion
In this paper, a segment-based matrix-vector multiplication architecture is proposed to work as the core unit of the ECT digital processing unit. In addition, the hardware-software (HW/SW) codesign of the ECT digital processing unit is proposed. The design and implementation of the ECT-DPU follow the MBD approach. It is based on the MATLAB MBD flow using the MATLAB HDL Coder and Embedded Coder toolboxes, linked to the Altera suite of development tools, Quartus II and SoC-EDS.
In each design cycle, the segment length parameter of the segment-based MVM architecture, used in the three stages of the iLBP image reconstruction algorithm, is set to one of the test values as an input to the design flow at the system level. The ECT-DPU is then simulated, tested, and verified, and the code is generated for both the FPGA fabric and the HPS ARM processor. The architecture was evaluated and deployed to the Altera Cyclone V SoC FPGA platform. These FPGA design steps are automated using the HDL Workflow Advisor of the HDL Coder.
Design and implementation of the ECT-DPU via MBD have greatly reduced the development time, shortened the design cycle, and avoided remodeling the system in each design cycle. Both the calculated data and the synthesized FPGA hardware resources show that reducing the segment length reduces the FPGA hardware resources linearly, and they illustrate its impact on the trade-off between performance and resource usage. In the LBP algorithm, the segmentation scheme has achieved high resource savings of 43% and 71% for a small frame-rate degradation of 3% and 14%, respectively.

Data Availability
Data are available upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.

Modelling and Simulation in Engineering