The Relevance of Open Data Principles for the Web of Data

Open data has been improving both publishing platforms and the consumer-oriented process over the years, providing better openness policies and transparency. Although organizations have tried to open their data, the enrichment of their resources through the Web of Data has been decreasing. Linked data has been suffering from notable difficulties in different stages of its life cycle, becoming less attractive to users over the years. Accordingly, we decided to explore how the lack of some opening requirements affects the decline of the Web of Data. This paper presents a radiography of the Web of Data, analyzing the governmental domain as a case study. The results indicate that it is necessary to strengthen the data opening process to improve resource enrichment on the Web and obtain better datasets. These improvements require that open data be public, accessible (in machine-readable formats), described (using robust, granular metadata), reusable (made available under an open license), complete (published in primary forms), and timely (preserving the value of the data). The implementation of these characteristics would enhance the availability and reuse of datasets. Besides, organizations must understand that opening and enriching their data require a completely new approach, and they have to pay special attention to and exercise control over this project, generally by investing money, commitment from management at all levels, and a great deal of time. Otherwise, given the magnitude of the availability and reuse problems identified in the data opening and enrichment process, it is believed that the Web of Data model will inevitably lose the interest it aroused at the beginning if data quality, openness, and enrichment issues are not addressed immediately. Its use would then be restricted to a few particular niches or would even disappear altogether.


Introduction
Linked open data (LOD) is an initiative suggested by organizations to make their data available in a machine-readable format. This requirement allows users to use and combine available datasets to create knowledge and apps in their context [1]. In addition, LOD works with two major concepts: linked data and open data. On the one hand, linked data defines a set of design principles for adding value to data by linking to other data (data enrichment). On the other hand, data available under a given license for use, reuse, and redistribution by any person or organization [2] are called open data. To carry out this proposal, the authors in [3] proposed a linked data 5-star scheme. Although the 5-star scheme has many advantages, two relevant problems have been recognized by the authors in [4,5]. Firstly, most open data systems do not manage dataset reuse completely, even though datasets are available (linked data levels 1-3). This lack of reuse (replication and redundancy of existing data) does not allow for interlinking among existing data, decreasing the possibility of creating a richly interconnected data network on the Web of Data. Secondly, not all linked data is open data, and not all open data can be linked. In the Web of Data, the licensing terms of the published datasets determine whether the data are freely available and open for anyone to use, reuse, share, and distribute.
Researchers such as [6,7] have identified the rapid growth in the quantity of LOD repositories, which use platforms such as the Comprehensive Knowledge Archive Network (CKAN) [8-10] to manage their services. These platforms make it possible to publish and exploit datasets and metadata. However, despite data opening and linking guidelines [1,11-13], there are challenges in different stages of the linked data life cycle. These challenges can affect the opening and enrichment of data available on the Web of Data. Some issues described by the authors of [14-16] are as follows: (1) the lack of use of machine-readable data formats, appropriate data license terms, provenance and quality attributes, data vocabularies, and data access strategies [17], problems that hamper the free availability and openness of data; (2) the lack of apps to detect possible data quality issues [18-21], such as inconsistency, inaccuracy, outdatedness, and incompleteness; (3) the reliability of the search results is determined by the reliability of the datasets from which these results were obtained [22]; for example, linking datasets that have inconsistency problems adds no value to the data; and (4) data on the Web show significant variation in data quality; for example, data extracted from semistructured sources, such as DBpedia, often contain inconsistencies and false or incomplete information.
In this context, we decided to explore the current status of the Web of Data, focusing on the main opening requirements defined by the Linked Data guidelines. Studies such as [4,17,23] and [24] propose the criteria and methods used in this research. To this end, the analysis follows two main approaches: firstly, the requirements to achieve dataset availability, and secondly, the information necessary to achieve dataset reuse. These approaches allowed for the assessment of the behavior of the opening and linking processes provided by datasets published on the Web. The contribution of this paper is an analysis of the main challenges of open and linked data in the Web of Data. This analysis will allow us to identify how to improve the openness and linking of data published on the Web and, finally, add value to them. Accordingly, this research poses the following study questions: What are the most common issues that arise from exploiting datasets published under LOD principles? How do these findings affect the decline of the Web of Data? What are the challenges addressed in linking resources under LOD principles? To answer these questions, Section 2 examines the background of the problem. Section 3 presents the methodological design and its corresponding implementation. Sections 4 and 5 review, analyze, and discuss the evidence found on the status of the Web of Data. Finally, Section 6 presents conclusions and future work.

Background
Different problems have reduced the use of these design principles for sharing machine-readable interlinked data on the Web. LOD suffers from serious issues: published data are often unavailable on the Web; machine-readable and reusable formats are underused; datasets are not available free of charge or are not openly licensed; datasets are not up-to-date; information (metadata) about these datasets is not easy to find; and some datasets have inaccuracy, incompleteness, and inconsistency issues. The Open Data Barometer [25] describes some of these issues in depth. These problems do not allow us to add semantic value to our data or to link and reuse them in other contexts.
Data quality is usually understood as fitness for use. It may depend on several quality dimensions, such as accuracy, timeliness, completeness, relevancy, objectivity, believability, and understandability, among others cited by Zaveri et al. [16]. In addition, data quality problems can undermine the potential of data applications. The lack of data provenance information, for instance, is a problem in data quality evaluation [16,26]. In this sense, global data management [27] identified human errors, too many data sources, and an inadequate data strategy as the most significant causes of poor data quality. In short, data often come from multiple heterogeneous sources, and these sources sometimes have different quality levels [28]. Thus, data quality is the main challenge in linked data.
Given the circumstances described previously, we considered it relevant to study the first stages of the linked data life cycle, because the components of the opening and reuse abilities (the two approaches proposed in this research) originate in these first stages.
This research will allow us to identify other features that data must meet to reach all linked data levels and, hence, a better linked data quality level. For that reason, this study takes as a reference the analysis of datasets published in different instances of CKAN [8,10] to analyze the status of the Web of Data. For that purpose, we address two specific topics: (1) previous studies of the status of the Web of Data and (2) the challenges identified in it.

Previous Work Analyzing the Status of the Web of Data.
Regarding the state of the Web of Data, different studies [4,7,29-36] and [37] expose several issues regarding the requirements to reach all linked data levels. Their main findings can be briefly summarized as follows: the low use of machine-readable formats, the lack of adequate open licensing terms that do not impede free reuse, metadata with little human readability, out-of-date data, the extra effort required to reach the five stars of the linked data model, and the low reuse and enrichment of data.
The findings described above show that data quality problems are a persistent challenge in linking processes. These problems can be observed both at the level of the linked data principles and at the level of the attribute abstraction that describes an addressable resource.

Web of Data Challenges.
It is advisable to follow a set of best practices, such as those described in [38], to discover datasets and facilitate data integration from different data sources. Despite the existence of these best practices, linked data faces challenges based, for the most part, on data quality. The authors of [17,23] present a compilation of the challenges of data on the Web, summarized into categories. These challenges focus on metadata, data license, provenance, quality, data versioning, data identification, data format, data vocabularies, data access, data preservation, feedback, data enrichment, and data republication.
Based on this background, this research aims to carry out a radiography of the linked resources. For this purpose, the analysis works with two main criteria: availability and reuse of data published on the Web. This research does not seek to review the social, political, economic, or contractual issues that contribute to the low quality of open and linked data. Instead, we analyze six basic technical requirements for openness and data enrichment. As a result, we present a set of recommendations to counter the steady decline suffered by the Semantic Web. The approach of this study is further detailed in the next section.

Research Approach.
The LOD model establishes a five-level schema for linked data (5 stars). Each level adds features that data must meet to reach a level of linkage. The inventor of the World Wide Web and the creator and advocate of the Semantic Web and Linked Data, Sir Tim Berners-Lee, laid down the four design principles of linked data [39]:
(a) Use URIs as names for things
(b) Use HTTP URIs so that people can look up these names
(c) When someone looks up a URI, provide useful information using the standards (RDF, SPARQL)
(d) Include links to other URIs so that they can discover more things
These principles suggested a 5-star deployment scheme for open data [40]:
(1) Make your stuff available on the Web (whatever format) under an open license
(2) Make it available as structured data (e.g., Excel instead of an image scan of a table)
(3) Use nonproprietary formats (e.g., CSV instead of Excel)
(4) Use URIs to denote things so people can point at your stuff
(5) Link your data to other data to provide context
As stated by Abella et al. [4], this five-level schema can be classified into two relevant aspects: availability (levels 1, 2, and 3) and reuse (levels 4 and 5). This study proposes to examine these two aspects, considering the following delimitations.
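The cumulative nature of the five levels, and their split into availability and reuse, can be illustrated with a minimal Python sketch. The `Dataset` fields and the two format sets are assumptions introduced only for illustration; they are not part of the study or of any CKAN schema.

```python
from dataclasses import dataclass

# Illustrative format sets (assumptions, not taken from the study).
STRUCTURED = {"XLS", "XLSX", "CSV", "JSON", "XML", "ODS", "RDF"}
NON_PROPRIETARY = {"CSV", "JSON", "XML", "ODS", "RDF"}


@dataclass
class Dataset:
    """Hypothetical, simplified description of a published dataset."""
    open_license: bool         # on the Web under an open license
    file_format: str           # e.g. "PDF", "XLS", "CSV", "RDF"
    uses_uris: bool            # things are denoted with URIs
    links_to_other_data: bool  # links out to other datasets


def star_level(ds: Dataset) -> int:
    """Return the highest level (0-5) whose requirements are all met.

    Levels are cumulative: failing level n means no higher level counts.
    """
    fmt = ds.file_format.upper()
    checks = [
        ds.open_license,          # level 1
        fmt in STRUCTURED,        # level 2
        fmt in NON_PROPRIETARY,   # level 3
        ds.uses_uris,             # level 4
        ds.links_to_other_data,   # level 5
    ]
    level = 0
    for ok in checks:
        if not ok:
            break
        level += 1
    return level


def aspect(level: int) -> str:
    """Map a level to the two aspects described by Abella et al. [4]."""
    if level == 0:
        return "not open"
    return "availability" if level <= 3 else "reuse"
```

Under these assumptions, an openly licensed XLS file stops at level 2, whereas the same data published as CSV reaches level 3 and still falls under the availability aspect.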

Availability of Opening.
Bearing in mind that linked data defines four principles, described in a 5-star scheme for linked resources, our research explores a set of requirements that support the first three linked data levels. These requirements circumscribe the elements necessary for linked resource availability. Studies such as [2,12,24], and the analysis of descriptive and administrative metadata [41], allowed us to identify those elements. Accordingly, the components covered by this study are as follows:
(i) Publishing domain: how the datasets name their knowledge domain
(ii) Resource licensing: what kind of Terms of Service, attribution requirements, and restrictions on dissemination are defined
(iii) Publication format: what kind of file formats are used for the data
(iv) Publication updating: how often the data are updated
These variables make it possible to identify a core process (availability), on which the opening schema and, consequently, the linked resources are supported. It should be noted that variables such as access, performance, or cost are not analyzed [42], as they are oriented to the infrastructure service that supports linkage from a technological point of view.

Ability to Reuse.
As stated by Abella et al. [4], levels 4 and 5 in the LOD schema enable reuse. Accordingly, the published data must be perfectly identifiable, able to be linked, and make its information useful to other datasets. To carry out these tasks, firstly, URIs that identify the workspace entities must be provided. These URIs also produce links to useful data sources, both internal and external, that enrich the data. Subsequently, the data must be published in a structured way, using the data model provided by RDF (serialized, for instance, as Turtle or RDFa and queried with SPARQL). The RDF structure is based on triples (subject-predicate-object), and the object of a triple can be a URI reference, a literal, or a blank node. Making use of the RDF structure provided by datasets, where it is specified whether datasets act as linkage subjects or objects, our study analyzes the information provided by the queried datasets to identify how these datasets published on the Web are reused. Briefly, the criteria described above were selected as strategic because they contribute to establishing open data availability and provide elements for assessing the reusability of published datasets. They also allow the identification of shortcomings or barriers in the process of publishing linked data and, finally, serve to identify challenges to be taken up in the linked data process.
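The triple structure and the three possible kinds of object can be sketched in plain Python as follows. The notation loosely follows N-Triples (angle brackets for URIs, `_:` for blank nodes, quotes for literals); the `Triple` type, the helper names, and every URI under `example.org` and `example.net` are hypothetical and serve only as illustration.

```python
from typing import List, NamedTuple
from urllib.parse import urlparse


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def term_kind(term: str) -> str:
    """Classify a term written in an N-Triples-like notation."""
    if term.startswith("<") and term.endswith(">"):
        return "uri"
    if term.startswith("_:"):
        return "blank"
    return "literal"


def external_links(triples: List[Triple], home_host: str) -> List[Triple]:
    """Return the triples whose object is a URI hosted outside home_host.

    Such triples are the ones that enrich a dataset with external data.
    """
    result = []
    for t in triples:
        if term_kind(t.obj) == "uri":
            host = urlparse(t.obj[1:-1]).netloc
            if host and host != home_host:
                result.append(t)
    return result


# Made-up triples about a single resource.
triples = [
    Triple("<http://data.example.org/city/1>",
           "<http://www.w3.org/2000/01/rdf-schema#label>",
           '"Example City"'),
    Triple("<http://data.example.org/city/1>",
           "<http://www.w3.org/2002/07/owl#sameAs>",
           "<http://other.example.net/resource/1>"),
    Triple("<http://data.example.org/city/1>",
           "<http://data.example.org/vocab/contact>",
           "_:b0"),
]

outgoing = external_links(triples, "data.example.org")
```

In this sketch only the second triple counts as an outgoing link, since its object is a URI on a different host; the literal and the blank node do not connect the resource to another dataset.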

Methodology
Our methodology defines a set of stages. Firstly, in accordance with the Research Approach, a knowledge domain was selected (Section 3.1). After that, the data consumption process was performed using a query tool built for this purpose (Section 3.2). Then, the results on resource availability and reuse were analyzed (Section 4). Finally, the findings obtained were discussed (Section 5). The compilation and analysis of the entire study were carried out from November 2018 to July 2019, and the final results were acquired in January 2020.

Selection of the Knowledge Domain.
To define the knowledge domain, an analysis of repositories was performed. In this review, problems such as a proprietary approach, required logins, behavior as mere storage banks, the lack of data exploitation services, and proprietary data management platforms, among others, were identified. Based on these problems, repositories that use CKAN instances were selected [8,10]. CKAN helps users from different domains (governments, companies, and organizations) publish their data through a data management workflow. CKAN is the platform behind websites such as Datahub, the European Public Data Portal, and the U.S. Government's Open Data portal [43].
Although there are knowledge domains with well-defined taxonomies and data-publishing processes, they are not open access. In contrast, considering that openness and transparency are mandatory for the public sector, government data repositories were selected for this research. Then, after reviewing open data catalogs, such as Open Data Inception, DataPortals.org, and the Open Data Inventory (2018 and 2019), a random sample of 40 CKAN instances was selected, as shown in Table 1.
For the experimental design, we proposed a simple random sample. In this sampling technique, each item in the population, and every sample of a given size, has an equal probability of being chosen. It is difficult to define the size of the dataset population in this study owing to the abundance of datasets. Considering that, we applied this random sampling technique for an infinite population.
There are several ways to decide upon the sample size, but one of the simplest involves using a formula with the desired confidence interval and confidence level, the estimated size of the population, and the standard deviation of the quantity to be measured in the population [44]. The most commonly used confidence interval and confidence level are 0.05 and 0.95, respectively. Since the standard deviation of the population under study may not be known, a value high enough to account for a variety of possibilities (such as 0.5) should be chosen [45].
According to the explanation above, we selected a margin of error close to 0.25% with a confidence level of 95%. Considering that each dataset has the same probability of success or failure, the estimated sample size is 217,778 datasets. Open data portal directories, such as DataPortals.org, Open Data Inception, and Open Data Portals (TruLibraries), allow us to identify data repositories. The sample was determined using these directories by assigning sequential values to each data portal within the population and then randomly selecting those values. After that, we added all the datasets of each instance until reaching approximately the sample size. As a result, and according to the statistical method, a representative sample of 226,393 datasets was selected from 40 CKAN instances (Tables 1 and 2).
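The sample size estimate can be reproduced with a short Python sketch, assuming the usual formula for a proportion in an infinite population, n = z^2 p (1 - p) / e^2, with z = 1.96 for a 95% confidence level and p = 0.5. The source does not state the formula or the exact margin of error; under this assumption, a margin of e = 0.0021 (0.21%, that is, close to 0.25%) yields the reported 217,778 datasets. The function name and its defaults are illustrative.

```python
import math


def sample_size_infinite(margin_of_error: float,
                         z: float = 1.96,
                         p: float = 0.5) -> int:
    """Sample size for estimating a proportion in an infinite population.

    margin_of_error: half-width of the interval, as a fraction (0.0021 = 0.21%)
    z: z-score of the confidence level (1.96 corresponds to 95%)
    p: assumed proportion; 0.5 is the most conservative choice
    """
    n = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Round up, since a fraction of a dataset cannot be sampled.
    return math.ceil(n)


estimated = sample_size_infinite(0.0021)
```

With exactly 0.25% as the margin of error, the same formula would give a smaller figure (about 153,664), which suggests that the rounded margin quoted in the text is an approximation of the value actually used.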
Finally, we selected the CKAN platform for this study because CKAN is a powerful tool for data custodians, and all its services are available for free as part of the open-source movement. In addition, hundreds of CKAN portals are live, with hundreds of thousands of datasets in use. For example, there are over 800,000 datasets on the European Data Portal alone [46]. CKAN users include the Humanitarian Data Exchange (managed by the United Nations), data.gov.au, data.gov (US), data.gov.uk, Open Government of Canada, and the European Data Portal [47].

Data Exploitation Strategy.
Two main challenges were posed to carry out the experimental phase: how to query the selected data instances and how to tabulate and visualize the queried information, taking into account the defined variables. The Visual Analytics for CKAN Instances tool was built for this purpose [48]. The tool provides a series of visual analytics about the current state of the datasets queried from the different CKAN instances. Through its modules, this tool allowed us to select the instances to be queried, use the connection services granted by each data instance, query datasets according to the variables identified, and create analysis strategies for the queried data (Figure 1).
This tool shows the particular visualizations of the selected variables and allowed us to store the obtained information in JavaScript Object Notation (JSON) files [49]. Datasets were loaded and scanned to check for the existence of availability metadata, such as format, author, and publication date, among others, and to evaluate their behavior as linked objects or subjects. As a result, the obtained visualizations allowed us to carry out the respective analysis and reach our conclusions. Proposals such as [34,38,50] were considered to build the visualizations.
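As an illustration of this kind of scan, the following Python sketch builds a paged query URL for the CKAN Action API (`package_search`) and reduces one returned dataset record to the availability metadata of interest. It is not the authors' tool [48]; the helper names, the base URL, and the sample record are assumptions, while the field names (`license_id`, `resources`, `format`, `metadata_modified`, `author`, `organization`, `tags`) follow the usual CKAN dataset representation.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url: str, rows: int = 100, start: int = 0) -> str:
    """Build a CKAN Action API URL that pages through datasets."""
    query = urlencode({"rows": rows, "start": start})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


def summarize_package(pkg: dict) -> dict:
    """Keep only the availability metadata examined in the analysis."""
    org = pkg.get("organization") or {}
    formats = {(r.get("format") or "").upper()
               for r in pkg.get("resources", [])}
    return {
        "name": pkg.get("name"),
        "license": pkg.get("license_id") or None,  # empty string -> None
        "formats": sorted(formats - {""}),         # drop missing formats
        "modified": pkg.get("metadata_modified"),
        "author": pkg.get("author") or None,
        "organization": org.get("title") or None,
        "tags": [t.get("name") for t in pkg.get("tags", [])],
    }


# Made-up record shaped like one entry of a package_search result list.
sample = {
    "name": "example-dataset",
    "license_id": "",
    "metadata_modified": "2019-03-15T10:20:30.123456",
    "author": "Example Office",
    "organization": {"title": "Example City"},
    "tags": [{"name": "transport"}],
    "resources": [{"format": "csv"}, {"format": "PDF"}, {"format": ""}],
}

summary = summarize_package(sample)
as_json = json.dumps(summary, indent=2)  # text that could be written to a file
```

Normalizing formats to upper case and mapping empty strings to `None` keeps later tallies (for instance, the share of datasets without a license) from being distorted by inconsistent metadata entry.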

Analysis of Results
As explained in [48], the metadata describes the dataset and specifies its content. These tags allow us to relate datasets of different instances to a specific knowledge domain. For this purpose, we implemented an unsupervised machine learning module. This module determines the accuracy level of the metadata tags depending on their description. After the data load, the machine learning module consumes and
Journal of Electrical and Computer Engineering
For the data analysis, the results were segmented into two sections: the resource availability and the ability to reuse.

The Resource Availability.
The results related to the availability of the queried resources were analyzed. To this end, the variables associated with licensing, data format, updating date, and related domains were examined. The results are shown below. The different licensing types, tied to brands, products, or countries, among other aspects, can lead to difficulties in using a dataset, depending on the permissions or restrictions that each country, brand, or product lays down in the license terms.
From the licensing analysis, Figure 2 shows that 26.24% of the queried datasets do not have a specific license that determines the conditions of their use. Possible causes are that the licensing used is not specified or that this information was not recorded in the resource metadata. As proposed in [24], the lack of a "Terms of Service" description, attribution requirements, and restrictions on dissemination, among others, acts as a barrier to public use of the data. Maximal openness includes clearly labeling public information as available without restrictions on use, as part of the public domain. The generic licenses in use show a high degree of flexibility: they allow others to distribute, remix, and build upon the work, even for commercial purposes (except for governmental ones), provided that credit is given for the original creation. These types of licenses are recommended for maximum dissemination and use of licensed materials. Similarly, the use of attribution, noncommercial, and public domain licenses stands out; the latter waive all rights to the work worldwide under copyright law, including all related rights, to the extent allowed by law. Overall, the organizations that have entered LOD have tried to approach the 5-star scheme by publishing data on the Web (first level) but have failed to provide these resources under clear licensing that allows actions such as reuse or redistribution.

Data Formats.
According to the second level of the linked data scheme, CSV and XLS are the most used structured data formats. In contrast, PDF and JPG are the most used unstructured data formats. Figure 3 shows the most commonly used formats.
The results show that HTML, PDF, CSV, XLS, and JPG are the most utilized data formats. Also, although comma-separated values (CSV) formats [51] are widely used, the use of the RDF format is limited. Another finding is that proprietary formats such as DOC, XLS, XLSX, RAR, and AutoCAD, among others, are still used. Those data formats are not machine-readable [24]. Thus, those datasets cannot be used to enrich and add value to other data. On the other hand, some CKAN instances, such as Datahub.io (old version), provide additional nonproprietary formats for each of the published datasets in order to apply open data principles. Aiming for three stars in the LOD model is a minimum requirement for open data publishing. However, licenses can be applied to data in any format (DOC, XLS, XLSX, RAR, and AutoCAD, among others), including data embedded within PDF documents. Some people use this kind of format because they are unfamiliar with machine-readable, structured, nonproprietary formats. As stated in [2]: "A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished and can be changed by the company at any time. Proprietary software usually reads and saves data in its proprietary format. For example, different versions of Microsoft use the proprietary XLS and XLSX formats." Briefly, to reach the third level in the LOD model, the data must be available under an open license in a widely reusable format, which means users do not need specific, proprietary software to reuse it [52].
On the other hand, although with a low level of use, different instances use formats such as Atom, RDF, JSON, ODS, and SHP. Finally, in the sample of the 40 instances queried, we can identify other results: the Datahub instance holds the widest variety of formats, the Rio Grande State Open Data Portal only shows data published in CSV (15,574 datasets published), the Salzburgerland Open Data Portal handles 13 datasets in SPARQL format, and lastly, 90% of the instances manage CSV as one of their data formats.

Data Updating.
The timeliness principle states that published datasets must be made available to the public promptly. As proposed in [16], highly volatile data should be kept up-to-date, which is why priority should be given to time-sensitive data. Real-time information updates would maximize the usefulness that the public can get from this information. According to the findings, the highest proportion of datasets (60.5%) was updated less than six months ago. Figure 4 shows this monthly update distribution.
However, the results show that 13.9% of the datasets were last updated between 2 and 4 years ago. As described previously, the data quality dimensions include timeliness or currency [53]. It means that data have been updated to keep them current and are available when they are needed. According to the 10 principles of open data [54], released datasets should be made available to the public promptly (timeliness) whenever feasible, as quickly as they are gathered and collected, to preserve the value of the data. In short, we must be careful about the information we need. If time-sensitive datasets are not updated, the data are not reliable or trustworthy, and we cannot ensure their accuracy. On the other hand, some datasets contain non-time-sensitive data, for example, historical population data. This information does not change over time, so it remains reliable and trustworthy.
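A timeliness check of this kind can be sketched in Python by computing the age of each dataset from its last-modified timestamp and grouping the ages into bands. The band boundaries below mirror the intervals mentioned in the text (under six months, two to four years), but the exact cut-offs, the function names, and the reference date are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


def months_since(modified_iso: str, reference: datetime) -> int:
    """Whole months elapsed between an ISO timestamp and a reference date."""
    modified = datetime.fromisoformat(modified_iso)
    months = ((reference.year - modified.year) * 12
              + (reference.month - modified.month))
    if reference.day < modified.day:
        months -= 1  # the current month is not complete yet
    return max(months, 0)


def update_bucket(modified_iso: Optional[str], reference: datetime) -> str:
    """Assign a dataset to an update band; band limits are illustrative."""
    if not modified_iso:
        return "no update date"
    months = months_since(modified_iso, reference)
    if months < 6:
        return "under 6 months"
    if months < 24:
        return "6 months to 2 years"
    if months <= 48:
        return "2 to 4 years"
    return "over 4 years"


# Hypothetical reference date, chosen near the end of the study period.
REFERENCE = datetime(2020, 1, 31)
```

Datasets without a timestamp are kept in their own band rather than being discarded, since the absence of an update date is itself one of the problems reported.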
Another finding shows that the instances that handle the largest number of datasets are those with the most dispersed updating times (Table 3). Furthermore, some datasets do not present an updating date, have never been updated, or are considered updated only when one of their resources is updated. Therefore, the lack of regular updating of those datasets that may change over time influences both the dataset quality and the results of the searches performed by consumers.

Domains.
As described by [5,7,55] and [25], datasets published on the Web can be classified by different knowledge domains such as Media, Publications, Life ... 4.
Although the use of standard domain tags is identified (transport, health, policy, investment, statistics, geography, education, public sector, economy, and energy, for instance), we can identify a large number of domain tags that do not follow any domain taxonomy. Some causes of these multiple domains are the lack of guidelines about how to fill in the tag information, the lack of training of the staff who perform this task, and the lack of support offered by the applications used to process this information. This disparity of domains makes it difficult to classify and process this type of information.

Registration of Authors and Organizations (Providers).
Concerning the provenance information, the results show a wide dispersion in the provenance records (Table 5). More than half of the queried instances do not have provenance information fully registered, and eight of these instances do not have any provenance information registered in their datasets.
Finally, some of the results obtained are described in Table 6.
There are instances, such as the State of Rio Grande and Lexington, that do not report information about authorship or organizational provenance.
However, there are instances, such as Salzburgerland, DART, and Montreal, that publish complete authorship and provenance information about their datasets. On the whole, despite the existence of tags and best practices for provenance registration, the published datasets either lack this type of information or record it only partially, which affects how much confidence can be placed in the suppliers of the datasets.
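The distinction between complete, partial, and missing provenance can be expressed as a small Python sketch. It assumes that provenance is judged from two fields of a CKAN-style dataset record, `author` and `organization`; the three labels and the function names are illustrative choices, not the classification used to build Tables 5 and 6.

```python
from collections import Counter
from typing import Iterable


def provenance_status(pkg: dict) -> str:
    """Label a dataset record by how much provenance it registers."""
    has_author = bool((pkg.get("author") or "").strip())
    org = pkg.get("organization") or {}
    has_org = bool((org.get("title") or org.get("name") or "").strip())
    if has_author and has_org:
        return "complete"
    if has_author or has_org:
        return "partial"
    return "none"


def provenance_report(packages: Iterable[dict]) -> Counter:
    """Count the datasets of one instance per provenance label."""
    return Counter(provenance_status(p) for p in packages)


# Made-up records standing in for the datasets of a single instance.
records = [
    {"author": "Example Office", "organization": {"title": "Example City"}},
    {"author": "   ", "organization": {"name": "example-city"}},
    {"author": None, "organization": None},
]

report = provenance_report(records)
```

Whitespace-only values are treated as missing, because a field that is present but blank gives the consumer no more provenance than an absent one.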

The Ability to Reuse.
For this perspective, a query interface that visualizes the URLs used in each queried dataset was built. This visualization allowed us to analyze the behavior of each dataset as a linking subject or object. As a result, this study showed that the queried datasets contain different types of resources, and each resource can be accessed using its URL. Japan's open data instance, for instance, holds 20,195 datasets and manages 255,132 linked resources. The data instances with the highest number of linked resources are shown in Figure 5 and Table 2. These results let us identify problems such as restrictive license types, the lack of a licensing definition, and the reduced use of structured and nonproprietary formats. Despite these issues, organizations that publish their data in CKAN instances use active URLs. However, in some cases, these URLs link to proprietary or unstructured files. The dataset of the ongoing recruitments of the Municipality of Lorca (European Data Portal) is an example of this issue: it links to some Excel files that do not load or display information.
On the other hand, it is necessary to highlight the work and evolution of CKAN in improving the services for publishing and consuming data.
Concerning their internal structure, datasets have tags to describe different aspects. One of these aspects is the description of their behavior as a linkage subject or object with respect to other datasets (relationships_as_object and relationships_as_subject). Among the instances queried, only 2 of the 40 provide information about their behavior as a linkage subject or object: Datahub.io (170 subject-object tags) and the University of Bristol (12,437 subject-object tags). This shows that, even when resources provide active and reachable URLs, the tagging structure does not provide complete information. Only those organizations that both create publication services and decide to generate data consumption start to review this type of small detail, which provides primary information when analyzing the linking level of datasets published on the Web. In a nutshell, although efforts in openness and linked data are evident, the low metadata quality and the weak application of the best practices of availability and reuse have created barriers that discourage the growth and use of the Web of Data. Although nonproprietary formats and reachable URLs [56] are used, data-publishing problems were identified, such as datasets that are not updated, the diversity of published formats, tags that are not properly filled out, the low availability of end-user-friendly tools, the poor institutional policy oriented to linked data, and the lack of guidelines for data providers. These problems reduce the use of the Web of Data. Considering these factors, platforms like Datahub have opted for a new orientation called Frictionless Data [57]. This strategy provides a simple wrapper and basic structure for data transport, significantly reducing friction in data exchange and integration and also supporting automation without imposing significant changes on the underlying data being packaged.
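A count of the subject and object tags per instance can be sketched in Python as follows. The sketch assumes that each dataset record carries the two lists named in the text, `relationships_as_subject` and `relationships_as_object`; the entries are reduced to a single `type` key as a simplified placeholder, and the instance names and helper functions are hypothetical.

```python
from typing import Dict, List


def linkage_counts(packages: List[dict]) -> Dict[str, int]:
    """Count relationship entries where datasets act as subject or object."""
    as_subject = sum(len(p.get("relationships_as_subject") or [])
                     for p in packages)
    as_object = sum(len(p.get("relationships_as_object") or [])
                    for p in packages)
    return {"as_subject": as_subject, "as_object": as_object}


def instances_with_linkage(instances: Dict[str, List[dict]]) -> List[str]:
    """Names of the instances that register at least one relationship."""
    return sorted(name for name, pkgs in instances.items()
                  if sum(linkage_counts(pkgs).values()) > 0)


# Made-up data: two instances, only one of which registers relationships.
instances = {
    "portal-a": [
        {"relationships_as_subject": [{"type": "links_to"}],
         "relationships_as_object": []},
        {"relationships_as_subject": [],
         "relationships_as_object": [{"type": "derives_from"}]},
    ],
    "portal-b": [
        {"relationships_as_subject": [], "relationships_as_object": []},
        {},  # record without the two keys at all
    ],
}

linked = instances_with_linkage(instances)
```

Records that omit the two keys entirely are counted as having no relationships, which matches the situation described above, where reachable URLs exist but the linkage is never registered in the dataset structure.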

Discussion of Results
First of all, the research findings show that, in the government domain, the Web of Data suffers from a set of issues at different implementation stages. In the case of open resource availability (levels 1 to 3 of the LOD schema), the results show that, although efforts are made to publish data (level 1), these processes present different barriers, such as the use of proprietary formats, the multiplicity of data formats, the outdated nature of the published data, and the lack of appropriate licensing allowing the use, reuse, and distribution of published data. In the case of the ability to reuse, published datasets and query services are identified in the queried instances. Likewise, the queried datasets have URIs that link information to their triples. However, the behavior as a linking subject or object is not recorded in the datasets. This problem may be due to human errors or a lack of knowledge of the dataset structure.
Although datasets are available in the queried instances, the following findings were identified:
(i) A portion of the datasets have no license, or a license that is too specific. The lack of a copyright declaration does not allow the reuse of the data and prevents checking the attribution and share-alike requirements.
(ii) CSV is one of the most used formats in the queried instances. However, the lack of RDF formats reduces the scope of the linked data.
(iii) Some datasets had been updated six months earlier; others, however, had last been updated more than two years earlier. This variation affects the timeliness of the data for reuse and for decision-making based on its content.
(iv) As far as domain tags are concerned, a significant disparity of domain names and problems in their tagging are identified. Different abstractions and particular designs of real-world objects can generate these problems.
(v) A large proportion of datasets do not record provenance information. The lack of this information makes it difficult to determine whether the dataset fits the user's information needs, which affects the user's confidence.
(vi) Although resources are linked using their URLs, this information is not recorded in the dataset structure. This lack of registration may be due to human error or a lack of detailed knowledge of the data structure.
(vii) The government information changes from one dataset to another because of different abstractions of the knowledge domain, in addition to the limited expressivity of the vocabulary used.
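Several of these findings can be checked mechanically on each metadata record. The sketch below is an illustration under stated assumptions, not the procedure used in this study: it assumes CKAN-style fields (license_id, resources[].format, metadata_modified, author, maintainer), treats the two-year mark mentioned in finding (iii) as a hypothetical staleness threshold, and uses author or maintainer as a rough proxy for provenance.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Assumed set of format labels that count as RDF serializations.
RDF_FORMATS = {"rdf", "rdf/xml", "ttl", "turtle", "n3", "nt", "n-triples", "json-ld"}


def audit_dataset(ds: Dict[str, Any], now: datetime, max_age_days: int = 730) -> List[str]:
    """Return the list of openness issues found in one metadata record.

    `now` must be timezone-aware; naive timestamps in the record are read as UTC.
    """
    issues: List[str] = []
    # Finding (i): missing or unspecified license.
    if ds.get("license_id") in (None, "", "notspecified"):
        issues.append("no-license")
    # Finding (ii): no resource offered in an RDF serialization.
    formats = {(r.get("format") or "").strip().lower() for r in ds.get("resources") or []}
    if not formats & RDF_FORMATS:
        issues.append("no-rdf-format")
    # Finding (iii): outdated metadata (threshold is an assumption).
    modified = ds.get("metadata_modified")
    if not modified:
        issues.append("no-update-date")
    else:
        ts = datetime.fromisoformat(modified)
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        if (now - ts).days > max_age_days:
            issues.append("stale")
    # Finding (v): no provenance (author/maintainer used as a proxy).
    if not (ds.get("author") or ds.get("maintainer")):
        issues.append("no-provenance")
    return issues


# Hypothetical records, for illustration only.
NOW = datetime(2023, 1, 1, tzinfo=timezone.utc)
good = {"license_id": "cc-by", "resources": [{"format": "RDF"}],
        "metadata_modified": "2022-10-01T00:00:00", "author": "Open Data Unit"}
bad = {"license_id": "notspecified", "resources": [{"format": "CSV"}],
       "metadata_modified": "2019-01-01T00:00:00.000000"}
```

Findings (iv), (vi), and (vii) concern vocabulary and linkage and are not covered by this check.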
Considering that the published metadata reveal different data quality problems, these findings reinforce the hypothesis that the Web of Data is in decline. These problems, which have also been identified by the authors in [17][18][19][20][21][58] and [59], among others, are detrimental both to the dataset information and to the information obtained from queries. Data quality is crucial when it comes to making far-reaching decisions based on the results of querying multiple datasets [16].
Although there are many open data guidelines, putting them into practice is difficult. In terms of data availability (levels 1-3 of linked data), we can identify the following stumbling points.
Some people think that open data merely requires each opening proposal to include files published on the Web, with no specific practices prescribed, since best practices often differ between projects. Simply publishing data does not guarantee that the data are useful as open data. A substantial proportion of published data is not available under an open license, has insufficient metadata, uses an inappropriate file format, or is out of date. Briefly, openness is turning into what can be termed "open-washing" [60], which means that data are open but incomplete, or that they have data quality and discovery issues.
Although there are many open data standards and best practices, most people are not familiar with them and do not use them. Opening practices are decentralized, and the people in charge of opening data rarely receive formal training in opening and linking data. They are often left to abstract and represent their data models on their own. Our results show that it is necessary to improve the first stages of the linked data lifecycle (abstraction, modeling, and opening). We have to shift our paradigm from "open-washing" to "opening focused on data availability."
In terms of data reuse (levels 4 and 5 of linked data), we can identify the following stumbling points.
Some people think that their datasets merely require each metadata element to carry some kind of literal data, without using information from outside their immediate environment. In addition, an inappropriate open data license does not allow our data to be used by other people. Lastly, people are not concerned about the visibility of their data, as they think that they alone should have the right to use them.
We have to shift our paradigm towards the reuse of open data, understanding that semantic enrichment allows any metadata inside our dataset to be enriched with reused information from outside our immediate environment. For this purpose, it is essential both to share our dataset identification properly and to link to other datasets using their URIs. Additionally, it is necessary to record this information inside the RDF structure of our datasets.
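As a minimal sketch of what recording links inside the RDF structure can look like, the function below serializes outgoing links as N-Triples lines. The choice of rdfs:seeAlso as the linking predicate is an assumption made for illustration (a publisher may prefer another property), and the example URIs are hypothetical.

```python
from typing import Iterable, List

# Assumed default predicate; other linking properties could be passed instead.
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def _check_uri(uri: str) -> str:
    """Reject strings that cannot be written between angle brackets in N-Triples."""
    if not uri or any(ch in uri for ch in '<>"') or any(ch.isspace() for ch in uri):
        raise ValueError(f"not a usable URI: {uri!r}")
    return uri


def link_triples(dataset_uri: str, external_uris: Iterable[str],
                 predicate: str = RDFS_SEE_ALSO) -> List[str]:
    """Return one N-Triples line per outgoing link from dataset_uri."""
    subject = _check_uri(dataset_uri)
    pred = _check_uri(predicate)
    return [f"<{subject}> <{pred}> <{_check_uri(target)}> ." for target in external_uris]


# Hypothetical URIs, for illustration only.
lines = link_triples(
    "http://example.org/dataset/budget-2020",
    ["http://example.org/external/government-budget"],
)
```

Storing such lines alongside the dataset makes the linkage explicit and machine-readable, rather than leaving it implicit in the URLs used by the resources.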
Keeping in mind that data availability is the basis for data reuse, we have to apply best practices in opening and linking data in the first place. In addition, data owners must shift their paradigm: they need to understand that open data, as a social movement, will become part of their workflow and that this process requires policies, investment, and training. The need to open, share, and reuse data has never been more apparent than it is today, when we are all suffering to some degree from the COVID-19 crisis.
Lastly, although there are successful Web of Data examples, such as Place Name Databases, where open data about place names can be found, or Census 2006 as linked open data, a project developed by the Irish Central Statistics Office [61], various organizations do not understand the purpose of publishing data on the Web, let alone why data on the Web should be linked [61][62][63]. For that reason, we have to understand that data enrichment is not an exclusive task of the "public sector." There are many data enrichment experiences in sectors that do not make their data available to the public.
In general, our findings are consistent with those of other studies, as follows.
Data availability and reuse issues permeate different knowledge domains. Digital humanities research is one example. In the Arts and Humanities, the tension remains between publishing research results in silos, particularly where the underlying research data are never shared, and the desire for open, reusable data [64]. In addition, a main obstacle to the reuse of digitized cultural heritage is a lack of data quality. Reusing LOD datasets is a challenging task that requires knowledge of several technologies as well as of how the data are modeled [65]. Furthermore, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g., semantic heterogeneity, provenance, and data quality [66]. Another potential problem is the inability of machines to automatically find and read data, which makes it challenging for any stakeholder to reuse the data. Thus, if the data are not in some way open or accessible, it is impossible to reuse them for other purposes, such as AI [67].
In the knowledge domain of linguistics, there is a need to share linguistic resources, but reuse is impaired by several constraints, including a lack of common formats, differences in conceptual notions, and unsystematic metadata. The following five constraints are discerned in this knowledge domain [68]:
(1) Linguistic resources are often designed for particular tasks (e.g., part-of-speech tagging, named entity recognition).
(2) There is a plethora of different markup languages, which are often not fully compatible between systems, much less between domains.
(3) Each linguistic resource may use different conceptual models. For example, there are dozens of different part-of-speech tag sets.
(4) Existing linguistic resources often do not provide precise or machine-readable definitions of the terminology they use, thus making it difficult to reuse them without manual investigation.
(5) It is often difficult to obtain the full metadata around the creation of a resource.
Briefly, if you want to start with LOD, keep in mind the recommendation of [69]: "Many organizations are interested in publishing linked open data. However, this is a complex endeavor that requires a gradual approach, especially in situations where resources are scarce and technical know-how and infrastructure need to be developed first. In such contexts, it is recommended that organizations 'open data first, and then link' (Caracciolo & Keizer 2015), focusing on priority datasets that are highly visible or which have high reuse value."
The studies described above show that our results fit into the broader research context on resource availability and the ability to reuse. These studies also suggest that the sample used in our research, taken from different countries and topics, is representative, given that it lets us identify that availability and reusability issues persist and need to be addressed to improve data quality. The lack of appropriate metadata limits data openness and enrichment. Using metadata with the correct metadata architecture can yield considerable benefits for LOD publication and use, including improving the findability, accessibility, storage, preservation, analysis, comparison, and reproduction of data; finding inconsistencies; supporting correct interpretation, visualization, and linking of data; assessing and ranking data quality; and avoiding unnecessary duplication of data [70].

Conclusion and Future Work
The Semantic Web has faced challenges that have prevented it from evolving at the expected pace. These challenges involve significant features associated with open data that need to be evaluated: Are datasets displayed in machine-readable formats? Are they reusable? Are they free of charge? Do they have an open license? Are they up to date? Is it easy to get information about them? These features, among others, remain challenges that affect the availability and reusability of data within the Semantic Web. With the study completed, the results obtained from the dataset exploitation let us identify the following:
(i) Datasets neither represent abstractions that respond to the same context nor are described with known vocabularies. Additionally, factors such as the lack of updates, poor technological support, and the steep learning curve are barriers to the development of linked data projects. Finally, the lack of knowledge about the restrictions and permissions that apply to the data restricts access to them, thereby limiting consumers' ability to exploit the possibilities of open data.
(ii) Regarding the reusability of linked resources, most of the queried datasets use URIs to connect the information to their triples. However, only a small share of the queried instances record their behavior as a linkage subject or object. Although the absence of these records does not directly affect the operation of linked data, it may be due to a lack of awareness of the RDF structure or to human error.
Based on these research results, some may feel that open data portals are good enough for a political party, a public institution, or the Government but not good enough for linked data. This assertion may seem right, because the data are appropriate for their target audience, but it is not correct: government data portals that do not meet the minimum requirements for opening their data can be identified. Data portals must provide a good level of open data quality for the data openness and linking processes. Besides, the linked resources suffer from abstraction and data quality issues despite the existence of best practices for data publication. In addition, although the linked resources use reachable URIs, the registration of linkages is a major issue when trying to enrich data. For that reason, link analysis is proposed as a strategy to complement the reuse analysis of the linked resources.
Regarding the availability and reuse approaches, the research results show that they affect the data published on the Web. These approaches are indispensable for reaching the first levels of the linked data model, and they allow the evaluation of data access by users who seek to connect their data and improve them with timely and accurate data.
Although this research analyzed the governmental domain, these variables can be applied in other knowledge domains because they represent operational aspects that fit any knowledge domain. Besides, this method is applicable to scientific data portals as long as their metadata are published in an open format and the platform runs on a CKAN instance. However, describing the information in the RDF model must be done thoughtfully in order to improve interoperability within the knowledge domain in question.
Concerning the research findings, the following issues are identified: (a) the lack of updates, which restricts the timeliness of the data; (b) the lack of licensing, which hinders the use of the data; (c) the absence of information for determining whether the dataset suits the user's information needs; (d) access and availability difficulties, which prevent data exploitation; and (e) the discrepancy among real-world abstractions, which reduces interoperability between repositories. All of this has been weakening the linked resources perspective and shifting attention towards strategies that allow data transport without prior processing.
Last but not least, based on the results of this study, the main challenges to address in the information linking process, from the data availability and reuse approaches, are the following:
(i) Open and link data as a priority in organizations (at all levels). This organizational priority will offer many business opportunities to consumer users. Furthermore, enough trained human resources and the necessary budget must be provided to carry out the task of opening and linking data. Also, open data legislation should be provided that extends to different government and organization levels, whether public or private.
(ii) Empower users by increasing their perception of the possibilities of information reuse and by providing them with agile and straightforward tools and procedures to carry out the open and linked data processes.
(iii) The technological context may be improved through mechanisms that ease the tasks of publishing and linking data and of managing the metadata of published resources. Also, these tools must provide stable services that allow users to exploit data in a more user-friendly way. Finally, the learning curve for the standards, vocabularies, and languages used for knowledge representation must be eased.
(iv) Define and unify the data publication and linking workflows properly. The proper definition and organization of activities for opening, publishing, and enriching resources help streamline the development of tasks (curation, acquisition, discovery, and analytics) and facilitate coordination among people. Besides, planning activities in the resource openness and enrichment process allows for identifying needs in the open and linked data learning curve. Finally, this workflow must be easy for data editors and consumer users to follow.
(v) A well-defined process for abstracting real-world entities and attributes must be established in order to improve interoperability between repositories. Both the proper contextualization of the knowledge domain and the use of vocabularies with enough expressiveness are vital for the data modeling workflow.
(vi) To complement availability and reuse features, RDF data formats that meet all LOD model requirements should be provided.
(vii) Linked data should provide strategies to overcome not only the current difficulties of linking but also those of searching for resources that contribute to data enrichment.
(viii) Given that, in different studies, linked data quality is mainly evaluated on the generated instances, it is necessary to strengthen the abstraction and design of data models based on linked data quality requirements.
(ix) Also, given the particular designs of real-world objects, combined with the failure to open data models and to use their standards, it is necessary to move towards opening and reuse while safeguarding fundamental rights. Moreover, the aim should be to decrease the syntactic (languages) and semantic (meanings) transformations that must be carried out to use the data.
(x) Licensing and copyright: declaring publishing rights and the corresponding linked data authorization allows the data to be both used and reused legally, in compliance with the restrictions set for their manipulation.
(xi) Design and implementation of data update policies and strategies to leverage appropriate data.
Practical recommendations for organizations looking to implement open and linked data include supporting stakeholders' learning in applying LOD principles and improving the metadata quality of resource descriptions. Besides, organizations must understand that opening and enriching their data require a completely new approach, and they have to pay special attention to, and exercise control over, this project, generally by committing money, management support at all levels, and a great deal of time. Lastly, organizations must apply the open data and linked open data principles to their published datasets to add real value to their data.
As future work, we propose implementing the explicit abstraction of concepts and rules needed to build particular models in a specific domain. This proposal involves the design of a metamodel that allows the generation of data instances based on the quality dimensions of linked data and avoids differences in the context represented. Other topics, such as investigating the effectiveness of different strategies to improve data quality, evaluating the impact of open data legislation, or exploring the potential of new technologies such as blockchain or artificial intelligence for linked data, are also proposed as future work.
(a) Metadata download of CKAN instances: This module uses the API provided by CKAN to consume Linked Open Data and store the data for later use.
(b) Creation of REST service: This module creates a REST server that connects the front end of the tool with the data of the instances.
(c) Implementation of the machine learning module to evaluate the concordance level of the metadata labels: This module generated a consumption library for unsupervised machine learning. This technique allows us to determine the concordance level of the metadata tags corresponding to each dataset of an instance.
(d) Visual Analytics module: This module implements graphic libraries to represent the metadata analysis coming from the CKAN instances.
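Module (a) can be sketched as follows. This is an illustration rather than the implementation used in the study: it assumes that metadata are paged through the CKAN Action API endpoint /api/3/action/package_search with its start and rows parameters. The base URL in the example is hypothetical, and the fetch function is injectable so that the paging logic can be exercised without network access.

```python
import json
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url: str, start: int, rows: int) -> str:
    """Build the package_search URL for one page of results."""
    query = urlencode({"start": start, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


def fetch_json(url: str) -> Dict[str, Any]:
    """Default fetcher: HTTP GET and JSON decoding."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_datasets(base_url: str, rows: int = 100,
                  fetch: Callable[[str], Dict[str, Any]] = fetch_json
                  ) -> Iterator[Dict[str, Any]]:
    """Yield every dataset metadata record of a CKAN instance, page by page."""
    start = 0
    while True:
        payload = fetch(search_url(base_url, start, rows))
        if not payload.get("success"):
            raise RuntimeError(f"CKAN request failed at offset {start}")
        results = payload["result"]["results"]
        if not results:
            return
        yield from results
        start += len(results)
        # Stop once the reported total has been reached.
        if start >= payload["result"]["count"]:
            return
```

For example, `list(iter_datasets("https://ckan.example.org"))` would collect all records of that (hypothetical) instance, which could then be stored for later analysis.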

Figure 1: Technological environment of the proposed experiment.

Figure 5: Number of linked resources.
4.1.1. Licensing. The most common licenses used for publishing data are OGL (Open Government License) (16.8%), Creative Commons Attribution (19.1%), Non-Commercial Government License for Public Sector Information (8%), and Creative Commons Zero (5.95%). Some 11% of the queried datasets have licensing types that are too specific to their purposes or countries. Examples of this licensing type are those of countries such as the United Kingdom (Crown copyright) or Canada, or of organizations such as IBM or MIT. Certain countries or brands use this kind of specific license according to their purposes or needs, such as:
(i) The Open Government License is used where data collections are Crown Copyright, and the Creative Commons Attribution 4.0 International License is used (when available) where data collections are copyrighted by others, for instance.
(ii) The MIT license gives express permission for users to reuse code for any purpose, sometimes even if the code is part of proprietary software. As long as users include the original copy of the MIT license in their distribution, they can make changes or modifications to the code to suit their own needs. It is one of the simplest open-source license agreements. The intent was for the text to be understandable by average users and to avoid extensive litigation, which may arise from other similar Free and Open-Source Software (FOSS) licenses (https://snyk.io/learn/what-is-mitlicense/).

Table 2: Amount of resources per queried CKAN instance.

Table 3: Instances with the most datasets.

Table 4: Domain tags by instance.

Table 5: List of authors and organizations of the queried instances.

Table 6: Authorship and organization registry.