Web-Based Data Integration and Interoperability for a Massive Spatial-Temporal Dataset of the Heihe River Basin EScience Framework



Introduction
Modern geoscience often requires massive datasets and a huge amount of computation for numerical simulation and data handling, and data needs increase daily [1][2][3][4]. In earth and environmental science, data management is becoming more challenging [5][6][7]. Volumes of geographic data have been collected with modern acquisition techniques such as global positioning systems (GPS), remote sensing, wireless sensors, and surveys [8][9][10][11][12]. The increase in data volume has led to more distributed archiving, and it is consequently more difficult to analyse data at a single location and store it locally [13,14]. "Big Data" has become a ubiquitous term among researchers. It concerns not only the amount of data (volume) but also timeliness (velocity), variety, and veracity. The eScience environment makes full use of people, data, and computing resources. Software enables convenient data applications and saves manpower and resources [15]. In addition to researchers, governments and private industries also have enormous interest in both collecting spatial data and using massive dataset resources for various application needs, especially Web-based needs [16][17][18][19]. The eScience environment supports the combined environment mentioned above and provides an online workflow including integration, access, analysis, visualization, and quality control for various data sources and formats [18,20,21]. Compared with traditional research methods, eScience applications enable researchers to make full use of scientific research and facilitate international collaboration [22,23]. Chen et al. 
proposed a geoprocessing eScience workflow model to integrate logical and physical processes into a composite process chain for sensor observations [24]. A case study on e-Cadastres also highlighted several remaining challenges. First, metadata must be effectively archived. Second, there is an urgent need for the design of a user-friendly interface to facilitate interoperability, geocomputation, and geovisualisation. Third, unification of the heterogeneity of existing protocols and standards in service-oriented architecture is also needed, such as those applied to establish GIS standards. Finally, experts from different disciplines are familiar with their specific data formats and processing software; they encounter problems when converting one format to another, or they lose information in converting the data formats.
To address these issues, the spatial and temporal data of the basin in an eScience context were classified into five categories according to the data source. The data model is a key factor in data sharing. Selecting the data model to integrate spatial-temporal data is very important for efficient and seamless management in basin eScience. The model determines whether we can efficiently describe the state or evolution of a geographical entity and effectively solve various problems related to time. Considering the compatibility of the data, it is hard to achieve a single, completely uniform data model, so the goal may be a small set of data models. The Heihe River Basin eScience platform selected the Network Common Data Form (NetCDF) (http://www.unidata.ucar.edu/software/netcdf/), Hierarchical Data Format (HDF) (http://www.hdfgroup.org/projects/), and GRIdded Binary (GRIB) (http://www.wmo.int/pages/prog/www/) to integrate and interoperate the data according to the requirements analysis. The platform integrates multisource, differently formatted data into a few data models. It can resolve the contradiction between high-performance computing and the low efficiency of processing software, creating management standards for massive spatial-temporal datasets and metadata. The new organization mode and collaborative environment can unite disciplines, regions, and time scales and achieve the complete value chain of data integration, acquisition, transport, storage, processing, application, and service on this eScience platform via Web services [33]. The eScience environment framework is convenient for cooperation between experts in various fields and simplifies data postprocessing, analysis, and retrieval. It can be accessed openly and freely by everyone. The primary focus is to improve data and application interoperability via data models, interfaces, standards, and Web services.
The emphasis of this paper is on the process and methods of unifying data from the Heihe River Basin in NetCDF and on introducing the modules of online data integration and interoperability.

Study Area
The Heihe River is the second largest inland river in China. It is 821 km long and originates from the Qilianshan Mountains. The Heihe River Basin is a typical large inland river basin that covers an area of approximately 130,000 km² in the arid zone of northwestern China. It is located in the middle section of the Hexi Corridor of Gansu Province and is composed of upper, middle, and lower reaches. Its upper reaches originate in the Qilian Mountains, where the headwater streams form strong drainages, and its lower reaches end in the deserts of Inner Mongolia. The middle reaches are primarily oases surrounded by the Gobi Desert, and the landscape includes heterogeneously distributed farmland, forest, and residential areas [34]. The study area is shown in Figure 1.
The Heihe River Basin, as a typical study area for earth science, has been the object of recent research on weather, climate, remote sensing, ecology, and hydrology, which frequently requires analysing hundreds of thousands of variables. The data processing produces large numbers of results, explanations, and other information in various formats and files. In the Heihe River Basin, long-term monitoring, testing, and research are the main sources of data and an important basis of earth system science research. Managing and processing the long-term monitoring data are among the important tasks of basin research. Therefore, in the eScience context of the basin, issues such as the distribution, heterogeneity, and volume of data need to be addressed during the design and implementation of new data-oriented infrastructures, services, standards, and systems.

The Design Flow and Function of Data Integration and Interoperability.
The spatial-temporal data integration and interoperability platform was constructed based on Web services in a B/S (browser/server) architecture to enhance sharing and interoperability. The design flows of conversion and interoperability for data in the framework are shown in Figure 3. The CDM interface is used to access different scientific data, including NetCDF, HDF, and GRIB. In addition to the CDM, we also used two other technologies (XML schemas and object-oriented components) to realize the data integration and interoperability framework. The THREDDS Data Server (TDS), NcML-GML, and OGC WCS/WMS were implemented mainly through XML schemas. The XML schemas resolve the problems of remote access to data and facilitate the interoperability of GIS and other data via Web services. Object-oriented component technology mainly addresses the issue that different domains develop data processing algorithms in various computer languages (e.g., C, MATLAB, and Fortran); it is used to construct a component library by collecting data processing programs. In addition, we can access these three data formats through the CDM interfaces via the OPeNDAP or HTTP protocols. NcML-GML and WCS/WMS provide Web services for GIS data encoded in NetCDF.
The framework is convenient for the standard management of large-scale spatial-temporal data and facilitates cooperative research across disciplines. Figure 4 shows the main functions of the spatial-temporal data in the eScience platform, including quality control, the data processor, OGC services, and the TDS. The platform provides services including NetCDF metadata extraction, NetCDF dataset operation, data format conversion, data visualization, and data access. If the CDM interface cannot achieve special data processing, appropriate components in the component libraries are selected according to the requirements of the researchers. If a component does not exist, a new component can be designed and added to the component libraries. The metadata extraction services for NetCDF extract dataset attribute information, including department, author, coordinate system, and attribute names; these can also be extracted via NcML. Dataset operation services include basic operations such as appending spatial-temporal data and renaming, modifying, and deleting attributes, variables, and dimensions. Data format conversion services convert the formats of point data, remote sensing data, radar data, and grid data to NetCDF and convert NetCDF to raster and vector data formats through third-party software or an online tools library, such as GIS software, to promote the sharing and interoperability of the data. The visualization services for NetCDF offer dynamic visualization of long-series spatial-temporal data to achieve convenient comparison and selection of the data in a study area via WCS/WMS or online tools. NetCDF access services acquire data online through the THREDDS Data Server and existing protocols (e.g., OPeNDAP and ADDE). When users are not interested in all the data, they can extract sections of data for certain variables at certain times or in certain regions from these datasets via the Web. Analysis of NetCDF realises arithmetic operations on the NetCDF datasets through the browser, such as computing averages.

NetCDF Data Interoperability with TDS, OGC WCS/WFS, and NcML-GML Technology in the EScience Framework.
Web technology provides support for eScience development through innovative technologies and protocols, message formats and algorithms, and creative services such as Wikis, the TDS, and WCS [35,36]. The eScience framework is a service-oriented interoperability platform for large spatial-temporal datasets. The key technologies, the THREDDS Data Server, OGC WCS/WFS, and NcML-GML, facilitate interoperability among scientists in different disciplinary areas, as shown in Figure 5. The THREDDS Data Server (TDS) is a Web server for scientific data and lists the datasets in a THREDDS catalogue, which is simply an XML file offering available datasets and services. Through the TDS, users can obtain the names and locations of datasets from different institutions and then access the datasets through the OPeNDAP, ADDE, or NetCDF/HTTP protocols [37]. The TDS can serve any dataset that the NetCDF-Java library can read (e.g., NetCDF-3, NetCDF-4, HDF-4, HDF-5, HDF-EOS, GRIB-1, and GRIB-2). It can also provide data access (subset) services (e.g., OGC WMS and WCS), data collection services (e.g., aggregation), and metadata services (e.g., NcML). Researchers can obtain selected parts of these datasets via a Web browser (e.g., certain variables at certain times or in certain regions).
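As an illustration of the catalogue format described above, a minimal THREDDS catalogue for a hypothetical Heihe dataset might look like the following sketch (the dataset name, ID, path, and service base are invented for this example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Heihe River Basin catalogue (example)">
  <!-- one service entry per access protocol offered by the TDS -->
  <service name="odap" serviceType="OPeNDAP" base="/thredds/dodsC/"/>
  <!-- a dataset entry: name shown to users, urlPath resolved by the server -->
  <dataset name="Monthly mean temperature (hypothetical)"
           ID="heihe/temp_monthly" urlPath="heihe/temp_monthly.nc">
    <serviceName>odap</serviceName>
  </dataset>
</catalog>
```

Clients read such a catalogue to discover the dataset and then fetch it through the listed OPeNDAP service.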
An NcML document is an XML document that describes the content and structure of the data stored in a NetCDF file and represents a generic NetCDF dataset (http://www.unidata.ucar.edu/software/netcdf/ncml/) [38]. In our eScience context, it can be used as a "public interface" for spatial-temporal online data conforming to the NetCDF data model. NcML describes the metadata of the NetCDF data and does not encode the data. The purpose of NcML is to define and redefine NetCDF files. NcML provides the following functions: (i) metadata can be added, deleted, and changed; (ii) variables can be renamed, added, deleted, and restructured; (iii) data from multiple CDM files can be aggregated (e.g., Union, JoinNew, and JoinExisting).
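As a short sketch of functions (i) and (ii), the NcML below wraps an existing file (the file name, attribute value, and variable names are invented for illustration) to add a global attribute and rename a variable, without modifying the underlying data:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="heihe_temperature_monthly.nc">
  <!-- (i) add or change metadata without touching the data file -->
  <attribute name="institution" value="Heihe River Basin eScience platform"/>
  <!-- (ii) rename the existing variable 'temp' to 'air_temperature' -->
  <variable name="air_temperature" orgName="temp"/>
</netcdf>
```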
We take the average monthly temperature NetCDF data of the Heihe River Basin as an example. The data are in a CF-compliant NetCDF format, and the visualisation via online tools is shown in Figure 6. The NcML of the data is given in the Appendix.
The aggregation function of NcML is useful for time series data combination. Multiple time series NetCDF data can be aggregated into a single, logical dataset with several types of aggregation, including Union, JoinExisting, and JoinNew. To facilitate interdisciplinary work between the earth science and GIS communities, NcML-GML was developed to use NetCDF datasets in GIS software, providing them with all the necessary metadata in the form of GML (Geography Markup Language) extensions to NcML. GML is an XML schema for the storage of geographic information with GIS community semantics. NcML-GML supports the referencing information of spatial-temporal data and realizes the platform function of describing coverage data derived from a NetCDF data file. NcML-GML and WCS/WMS can map the NetCDF model into the GIS model and facilitate interoperability between these two models and different scientists. Through the technologies above, users can obtain metadata and the slices of data they require from remote NetCDF files in a Web-server-accessible directory.
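The aggregation described above can also be sketched in NcML; the fragment below joins two hypothetical daily files along their shared time dimension (a JoinExisting aggregation; the file names are invented for illustration):

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- join files that each already contain part of the 'time' dimension -->
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="soil_humidity_2005-10-01.nc"/>
    <netcdf location="soil_humidity_2005-10-02.nc"/>
  </aggregation>
</netcdf>
```

The TDS can then serve such an aggregation as if it were a single NetCDF dataset.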

The Key Object-Oriented Component Technologies of Spatial-Temporal Online Data Integration and Interoperability.
To improve calculation speed and provide convenient visualization of the data, we selected a mixed MATLAB and Java solution to complete Web-based data integration, as one example of the object-oriented method of building components. Quality control components are also built with the object-oriented method. Figure 6 shows the technical framework.
In the MATLAB-Java mixed solution, the custom framework and interface are completed via Java and Web technology, while the special data processing and computation are performed in MATLAB, with its powerful matrix and numerical analysis capabilities. The mixed solution overcomes MATLAB's poor interactivity and the fact that MATLAB programs cannot run outside the MATLAB environment. In addition, the characteristics of the Java language, such as cross-platform operation, exception handling, multithreading, and stable and fast execution, can also be utilized.
Figure 6 shows the workflow of the mixed-solution technology. First, the MATLAB code implements the core algorithms of NetCDF integration and generates the .m files. Second, the .m files are transformed, via the MATLAB Compiler and MATLAB Builder JA, into a component that interacts with the server side through the Java language without requiring the MATLAB environment. Finally, an encapsulation function is called to run the core MATLAB calculation and perform online computing on the Web through the Java program with the MATLAB dynamic library. Figure 7 depicts an example of the processes within a scientific workflow. The NetCDF integration process contains an integrated chain invoking first the data processing component and then the integration process. After creating a new NetCDF file, the add-data component continues to add variables to the NetCDF file, extending the time dimension or adding other variables. The CDM is available as free software to process NetCDF and is actually called several times as part of different scientific workflows.

Online Spatial-Temporal Data Quality Control Methods on the EScience Platform.
Spatial data quality has been recognised as an important issue in GIS. However, online spatial-temporal data quality control has received little attention in data processing. Irregularities cause unreliable results because any initial spatial data error can be propagated through the spatial data processing. Monitoring systems established in the Heihe River Basin, covering glaciers, permafrost, deserts, and atmospheric, ecological, environmental, hydrological, and other elements, realize automatic data transmission and connect with the basin eScience context. Before data integration and analysis, we perform real-time data detection and calibration to ensure data quality control on the eScience platform.
In this study, we mainly focus on the online quality control of outliers in the spatial-temporal data before conversion to NetCDF. This is very important for data quality control, especially for data transmitted in wireless sensor networks before the data formats are converted to NetCDF files. We provide quality control components as Web services in the component library. In addition, we will continue to enrich our component library to facilitate data processing and data quality control.
According to the data request, the basin eScience platform provides online outlier detection methods, including the extreme-value test, the 3σ test, Dixon's test, and Grubbs' test. The platform provides convenient detection of abnormal data points, which helps users understand how the data change over time and the intrinsic relationships among the data. Based on the physical characteristics and statistical experience of the various elements, the extreme-value test specifies the admissible maximum and minimum values of the real-time data. For the 3σ test, according to the theory of errors, the random error ε obeys a normal distribution. In formula (1), μ is the true value, x̄ is the sample mean, and x_i is an observation datum:

v_i = x_i − μ ≈ x_i − x̄, i = 1, 2, . . . , n. (1)

As the standard deviation σ is generally unknown, the sample standard deviation s computed with Bessel's formula is typically used instead of σ. For an observation datum x_i, if its residual v_i satisfies |v_i| = |x_i − x̄| > 3s, i = 1, 2, 3, . . . , n, then x_i is marked as outlying data. For Dixon's test, suppose the overall observation data are normally distributed in the sample x_1, x_2, . . . , x_n; the test statistics and decision rule are described below with formula (3). For Grubbs' test, we assume normally and independently measured samples x_1, x_2, x_3, . . . , x_n, where n is the number of samples, v_i is the residual of each datum, and x̄ is the average of the samples. We then construct the statistic G_i = |x_i − x̄|/s, with s given by Bessel's formula (2):

s = √( (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² ). (2)

At the selected significance level α, we obtain the threshold G(α, n); α is usually 0.05 or 0.01. If G_i = |x_i − x̄|/s ≥ G(α, n), then x_i is an abnormal value; G(α, n) can be obtained from the lookup table.
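The extreme-value and 3σ checks described above can be sketched in Python with NumPy (the temperature series and the physical bounds below are invented for illustration; Dixon's and Grubbs' tests additionally require critical-value tables and are omitted here):

```python
import numpy as np

def extreme_test(x, vmin, vmax):
    """Flag values outside the physically admissible range [vmin, vmax]."""
    x = np.asarray(x, dtype=float)
    return (x < vmin) | (x > vmax)

def three_sigma_test(x):
    """Flag values whose residual |x_i - mean| exceeds 3 sample standard
    deviations, with s computed by Bessel's formula (ddof=1)."""
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)            # Bessel's formula
    v = np.abs(x - x.mean())     # residuals v_i
    return v > 3.0 * s

# 29 plausible half-hourly temperatures plus one injected spike at index 29
temps = np.append(np.linspace(11.5, 12.5, 29), 47.5)
print(np.flatnonzero(extreme_test(temps, -40.0, 45.0)))   # -> [29]
print(np.flatnonzero(three_sigma_test(temps)))            # -> [29]
```

Both checks flag only the injected spike; on real sensor streams the bounds of the extreme-value test would come from the physical characteristics of each element, as the text notes.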
On the eScience platform, we also collect a range of open and free data processing tools and provide them online, such as visualization tools that facilitate collaboration, like the NetCDF tools (http://www.unidata.ucar.edu/downloads).

The Case Study of Spatial-Temporal Data Integration and Interoperability on EScience Platform.
In this paper, observation data from the Mafengou subbasin wireless sensor transmission site in the Heihe River Basin are used as a case study of abnormal data quality control. The temperature data are transmitted every 30 minutes, with a total of 73 records. Figures 8(a), 8(b), 8(c), and 8(d) compare the dataset before and after outlier quality control with the four methods: the extreme-value test, the 3σ test, Dixon's test, and Grubbs' test, respectively. Figure 8(a) shows that three outliers were found by the extreme-value test. Figure 8(b) shows that the 3σ test found the obvious abnormal data. Figure 8(c) shows that five outliers were found by Dixon's test. Grubbs' test performed best, finding seven outliers, as shown in Figure 8(d).
Figure 9 shows the NetCDF tools' display of the grid map of temperature for the Heihe River Basin. The NetCDF tools can also browse remote data-model datasets (e.g., NetCDF, GRIB, and HDF) via the TDS. An online tools library facilitates data processing and the interoperation of the eScience context using tools with which researchers are familiar.

Raster Data Integration.
To demonstrate the data integration, we took the average monthly temperature raster data of the Heihe River Basin as an example, using the integration components of the component library via Web services, as shown in Figure 10. The tool mainly achieves integration and aggregation of the data. First, it converts the grid data to ASCII and then integrates the data online into NetCDF to complete the long-series data integration. In this example, the grid size is 500 meters, the number of rows is 899, the number of columns is 1041, and the x and y coordinates of the lower-left corner are 666083.7 meters and 4008999.5 meters, respectively. These parameters and the coordinate system are entered on the webpage. When generating the m-function files, we choose the grid size, the row and column numbers of the grid, and the lower-left corner coordinates as the function's parameters for NetCDF data integration, time as an unlimited-dimension variable parameter, and the coordinate system as metadata according to the CF conventions. The components can also add variables to the NetCDF via aggregation. Figure 10 shows a visualization map of one month of data from the NetCDF datasets of Heihe river upstream temperature in November 2005.
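As a rough sketch of how these grid parameters map to CF-style coordinate variables, the snippet below builds the 1-D x/y coordinates (assuming cell-centre coordinates, which the text does not specify; the attribute dictionaries are illustrative, not the platform's actual code):

```python
import numpy as np

# Grid parameters quoted in the example above
CELL_SIZE = 500.0                  # metres
N_ROWS, N_COLS = 899, 1041
X_LL, Y_LL = 666083.7, 4008999.5   # lower-left corner (projected metres)

# 1-D coordinate variables for cell centres (corner + half a cell)
x = X_LL + CELL_SIZE * (np.arange(N_COLS) + 0.5)
y = Y_LL + CELL_SIZE * (np.arange(N_ROWS) + 0.5)

# CF-style attributes that would accompany the coordinates in the NetCDF file
x_attrs = {"standard_name": "projection_x_coordinate", "units": "m"}
y_attrs = {"standard_name": "projection_y_coordinate", "units": "m"}

print(x.size, y.size)   # -> 1041 899
```

In the platform, time would additionally be declared as the unlimited dimension so that new monthly grids can be appended.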

The Integration of Wireless Sensor Network Station Data.
Station point data from a wireless sensor transmission site in the Heihe River Basin were used as a case study of integrating point data. The data were transmitted every 15 minutes and examined via the quality control components mentioned in Section 4.4 before conversion to NetCDF. The soil humidity data of the observations were defined in NetCDF. The integration of NetCDF is divided into two parts: the first describes the station information (station number, latitude, longitude, and altitude), and the other describes the measurements, such as meteorological and hydrological elements. The visualization map of the NetCDF dataset for soil humidity of the Mafengou subbasin wireless sensor station in October is shown in Figure 11. The lines named soil humidity1 and soil humidity2 show the NetCDF data from different time periods, and the line named soil humidity10 shows the aggregation of the soil humidity1 and soil humidity2 NetCDF datasets into one dataset.
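The soil humidity10 aggregation can be mimicked in plain NumPy (in the platform itself this join is performed with NcML JoinExisting; the sample values and 15-minute timestamps below are invented for illustration):

```python
import numpy as np

# Two hypothetical soil-humidity records from the same station covering
# consecutive periods, sampled every 15 minutes (minutes since start)
time1 = np.array([0, 15, 30, 45])
hum1 = np.array([0.21, 0.22, 0.22, 0.23])
time2 = np.array([60, 75, 90])
hum2 = np.array([0.24, 0.23, 0.22])

# JoinExisting-style aggregation: concatenate along the shared time axis
time_all = np.concatenate([time1, time2])
hum_all = np.concatenate([hum1, hum2])

assert (np.diff(time_all) > 0).all()   # time stays strictly increasing
print(time_all.size)                   # -> 7
```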
For the time series data of the observation station, aggregated variables can integrate different time series NetCDF data into one NetCDF database via NcML files in order to extend the time series. A program example aggregating different time series soil humidity data is shown in Figure 11. For satellite remote sensing data, the main dimensions of the NetCDF data model include the number of satellite scan lines, elem (the number of element points per scan line), and band (the number of observation bands). The geographical location is described by latitude and longitude, and the observation values of each band are defined as the main data variables of NetCDF.
In integrating radial data (e.g., radar data), we mainly consider that the radial data are located by azimuth, elevation angle, and orientation. A scan record is made up of a number of adjacent radial data records. The main variables of the NetCDF data model in the program include gate (the number of pulses in a radial data record), radial (the number of radial data records in a scan), scan (the scan number), distance (the distance of the pulse), time (the time of the data record), eleva (the elevation angle), and azim (the azimuth). In this paper, NetCDF is used as a case study of the technology implementing the framework.

Conclusion
The eScience platform provides effective interfacing and interaction strategies for data processing, sharing scientific research and decision support with the general public. It is an important way to solve the common problem of information islands by offering public Web data access. Online integration of heterogeneous data sources provides a uniform interface for users to access, analyse, and seamlessly manage the data and gives a standard format for data processing programs. The problem of messy data formats was resolved by the eScience platform. It improved the ability of users to investigate complex phenomena such as climate change, hydrological change, and soil dynamics. Finally, the eScience environment will gradually be used to support decision-making in the Heihe River Basin. In further research, we will examine HDF and GRIB data processing methods and gradually establish a unified online spatial data process in the eScience context for the Heihe River Basin, developing a suite of efficient parallel algorithms and constructing a geoscience data-supporting library suitable for high-performance parallel computation.
This study constructed the Heihe River Basin data integration and interoperability eScience context, which integrated the spatial-temporal data of different formats into NetCDF data models. The framework was constructed based on HDF, NetCDF, and GRIB for the uniform management of the spatial-temporal data and metadata, which are long-term, massive, and multidimensional. In addition, we can access and analyse these data formats (e.g., HDF, NetCDF, and GRIB) through the CDM interface, which provides a convenient method for data mining, integration, and the analysis of spatial-temporal data. The framework can establish an eScience cooperative work environment and support the efficient application of the data via Web services. It is especially beneficial to the GIS and earth science communities for cooperative communication via the eScience platform.
The data integration and interoperability eScience platform, as a combination of technological solutions, achieves the following goals: (i) the integration of real-time and historical data; (ii) solving data application problems across fields, areas, and disciplines; (iii) conveniently accessing and analysing data resources from different institutions; and (iv) addressing issues of heterogeneous existing standards and protocols for Web data access. The chosen combination of solutions can achieve these goals, whereas no single technology can. The platform generates the NetCDF format from heterogeneous multisource data (e.g., GeoTIFF, ASCII, free text, shapefile, and grid), which distinguishes it from other data-sharing platforms and is important for managing and sharing scientific data. The Heihe River Basin eScience platform is also superior to other data-sharing platforms in its sophisticated analysis workflows, access to powerful computational resources, analysis, and interactive visualization interface. Our continuing work will provide scientists with access to a wide range of datasets, algorithm applications, computational resources, services, and support.

Figure 3: The design flow of data format conversion and interoperability in the framework.

Figure 5: The key technology of the eScience interoperability platform of GIS community and other communities.

Figure 6: The key integrated NetCDF data technology based on Web via object-oriented method with MATLAB and Java mixed solution.

Figure 8: (a), (b), (c), and (d) compare a dataset before and after outlier quality control with the extreme-value, 3σ, Dixon's, and Grubbs' test methods, respectively.

Figure 9: Example using freely available software (NetCDF (4.2) Tools) from the online library, which can process and visualize NetCDF and NcML files and remotely access NetCDF, GRIB, and HDF files via the TDS.

Figure 10: The raster data were integrated into NetCDF through the component library based on Web services.
Common Data Model and Spatial-Temporal Data Model for EScience Context.
The chosen spatial-temporal data model has to ensure the platform's compliance with the data model. Traditional data files (e.g., free-text data) can be integrated and archived into a single NetCDF file via these data types, as shown in Table 1. The NetCDF structure provides a powerful mechanism for dealing with complicated scientific workflows and resolves "messy" issues, such as traditional multiple files and heterogeneous data.

Table 1 :
Classification and archive of various data types.
For Dixon's test, in the sample x_1, x_2, x_3, . . . , x_n, where n is the number of samples, the observation data are arranged in ascending order x_(1) ≤ x_(2) ≤ x_(3) ≤ . . . ≤ x_(n). Depending on the number of samples, a different statistic is selected, such as formula (3):

r_10 = (x_(n) − x_(n−1)) / (x_(n) − x_(1)), r'_10 = (x_(2) − x_(1)) / (x_(n) − x_(1)). (3)

We denote r_10, r'_10, r_11, r'_11, r_21, r'_21, r_22, and r'_22 collectively as r_ij and r'_ij. To determine the significance level α, look up the threshold D(α, n) in the threshold table of the Dixon test. If r_ij > r'_ij and r_ij > D(α, n), then x_(n) is judged as an abnormal value. If r_ij < r'_ij and r'_ij > D(α, n), then x_(1) is judged as an abnormal value. Otherwise, there are no abnormal values. Dixon's test method is suitable for real-time data quality control.

Figure 11: Aggregated different time series soil humidity data with one NetCDF.