A Large Visual, Qualitative, and Quantitative Dataset for Web Intelligence Applications

The Web is the communication platform and source of information par excellence. The volume and complexity of its content have grown enormously, and organizing, retrieving, and cleaning Web information has become a challenge for traditional techniques. Web intelligence is a novel research area that aims to improve Web-based services and applications using artificial intelligence and automatic learning algorithms, for which a large amount of Web-related data is essential. Current datasets are, however, limited and do not combine the visual representation and the attributes of Web pages. Our work provides a large dataset of 49,438 Web pages, composed of webshots along with qualitative and quantitative attributes. This dataset covers all the countries in the world and a wide range of topics, such as art, entertainment, economics, business, education, government, news, media, science, and the environment, addressing different cultural characteristics and varied design preferences. We use this dataset to develop three Web Intelligence applications: knowledge extraction on Web design using statistical analysis, recognition of error Web pages using a customized convolutional neural network (CNN) to eliminate invalid pages, and Web categorization based solely on screenshots using a CNN with transfer learning to assist search engines, indexers, and Web directories.


Introduction
The modern world relies on the Internet, with many human activities (commerce, education, entertainment, social interaction, etc.) having digital applications supported by this platform. It has allowed such activities to continue despite the paralysis caused by the recent pandemic. The Internet and the Web are closely related concepts: the first term refers to the large network of networks (the infrastructure), while the second refers to the content, consisting of Web sites, each of which is a collection of interlinked Web pages on a specific topic [1]. Since its invention in the 1990s, the World Wide Web, or simply the Web, has revolutionized access to large amounts of data and information [2,3]. Factors such as ease of use, user-friendly interface, popularity, and increased connectivity have made the Web an everyday tool for individuals and organizations in all areas [4].

The Web is a constantly evolving global communication platform. The volume of information available is enormous and is growing rapidly, becoming more complex, and covering all topics. The effective management of such a quantity and variety of information is an increasingly difficult task for traditional techniques. For example, organizing valid content and filtering invalid content (purifying the Internet) are challenges facing the current Web [5]. Error Web pages (e.g., under construction, maintenance, domain offer, suspended account, page not found, browser incompatibility, virus, phishing, or service failure) are immensely present on the Internet and continue to be indexed and returned by search engines, affecting webmasters and users in general. Once the invalid pages are filtered, the valid content must be organized. To this end, classification is a basic technique, but doing this manually is impractical, so automatic classification of Web pages is the recommended method. This is usually carried out by analyzing both the textual content and the underlying HTML code. However, modern Web pages that include multimedia objects, video streaming, and picture sharing are making the extraction of information increasingly complicated [6]. Hence, there is a need to improve the mechanisms that address such problems, which has attracted scientific interest and created new areas of research and development.

Web intelligence (WI) is a relatively recent field that fundamentally seeks to improve Web-based services and extract knowledge by using artificial intelligence [6]. In this work, we present applications in the context of WI that efficiently deal with cleaning and organizing Web content, as well as extracting knowledge from Web data. It is worth mentioning that a large amount of Web-related data is an essential resource for WI and its applications. In addition to Web usage data, such as user profiles and preferences, interactions with Web sites, and browsing habits [7], data on the structure and content of Web pages are necessary for our purposes. Currently, Web-related datasets only include screenshots, URLs, or images extracted from Web pages. There is no large, high-quality dataset that integrates the attributes and the visual aspect of Web pages. Therefore, the first aim was to create such a dataset, which we collected automatically from Google and a Web directory by leveraging various computational tools, obtaining a total of 49,438 Web pages from all the countries in the world, classified into the following topics: arts and entertainment, business and economy, education, government, news and media, and science and environment.

The dataset combines the attributes (structure data) and the visual aspect (content data) of a Web page. The structure data allow us to discover patterns on rules of thumb, trends, and guidelines in current Web design. In this way, by statistically processing and analyzing the qualitative and quantitative attributes of our dataset, a Web of data becomes a Web of knowledge that supports beginners and experts. Furthermore, the content data were used to develop artificial intelligence applications, based on deep learning, to clean and organize Web content. We implemented automatic recognition of error Web pages and categorization of valid Web pages, both applications based exclusively on screenshots. To this end, we used convolutional neural networks (CNNs), which are state-of-the-art tools in the field of computer vision.

Consequently, our work makes the following contributions: (a) a freely available, extensive dataset of Web page structure and content; (b) a workflow supported by computer tools to automate the process of collecting, organizing, and debugging screenshots and attributes of Web pages, a methodology that can be adapted to other problems where the acquisition of large amounts of data is needed; and (c) WI applications to filter and categorize Web pages using only screenshots, avoiding the analysis of the HTML code and other new technologies. In this regard, these applications may help save time and cost in information retrieval systems such as search engines (e.g., Google and Bing), classifiers, recommendation systems, Web directories, and crawlers [8,9], and optimize the use of Internet resources.
The rest of the work is structured as follows. First, we review Web page datasets in Section 2. This is followed by a detailed description of our dataset and the process designed for its creation in Section 3. In Section 4, we then present three use cases of Web intelligence applications: a statistical analysis of the attributes of Web pages, the implementation of automatic recognition of error Web pages, and Web categorization, the latter two based on screenshots and CNNs. Section 5 details our conclusions and future work.

Related Work
The availability of a large amount of data underlies the current need for research and development in areas related to the Web. Our review of the literature and existing datasets is summarized in Table 1. For ease of comparison, we have divided them into two groups according to size: small (fewer than 1000 instances) and large.

First, Boer et al. [1] use a tiny, purely visual dataset collected for categorization within four classes (news, hotels, conferences, and celebrities). These categories are considerably different from each other, so the categorization problem is less complicated. There is no link to download the screenshots. López-Sánchez et al. [10,11] have datasets with more Web pages, including their respective images and links (URLs). However, these images are not screenshots but elements of the Web page. The URL is used to download the images from the HTML code and analyze them for categorization. Although there are more categories than in the previous work, they still cover very different topics. In neither case is a download link available. Of the smaller ones, Reinecke and Gajos [12] propose the most significant dataset, which might be a useful resource for small-scale research and development works. It covers several countries around the world and various topics and is available for download. However, it is strictly visual and insufficient for current needs, and its purpose is more oriented towards aesthetic analysis and classification.

The Computer Incident Response Center Luxembourg (CIRCL) is a government initiative created to respond to computer security threats and incidents. CIRCL [13] offers a dataset of more than 400 screenshots of verified or potential phishing Web sites. Furthermore, an extensive dataset with more than 37,000 images is available [14], corresponding to screenshots of Web sites belonging to the dark Web, the problematic facet of the Web associated with cybercrime, hate, and extremism [15]. Both datasets can be easily downloaded, although, because the images represent fake or hidden Web pages, these datasets would have limited application. ImageNet [16], the most popular of the image databases, includes millions of images organized according to the WordNet hierarchy (a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets); https://wordnet.princeton.edu). The Web sites section has 1840 screenshots from different countries and languages without categorization. Some screenshots appear cropped, and download requires registration and authorization. The dataset created by the University of Alicante [17] compiles 8950 screenshots of Web pages for analysis and evaluation of the quality of Web design. Half of the images come from the Awwwards site (https://www.awwwards.com) and are labeled with "good design," whereas the other half are extracted from yellow pages and labeled with "bad design." This dataset serves the [...] [18] stands out because it covers a larger number of Web pages. However, these come from only 44 countries, the parameters are purely aesthetic, and the image download is not direct.

In contrast, our dataset takes into account all the countries in the world, includes attributes related to Web page structure, is general-purpose, and is available for download. We created from scratch an extensive and available dataset, which incorporates the visual representation of the Web page, complemented with qualitative and quantitative attributes extracted from the underlying HTML source code, so that a Web page is better characterized.
Because manual data collection is a complex task that requires a great deal of time and human effort, we automated the process by writing several programs in the Python and R programming languages.

Materials and Methods
Current Web page datasets are limited to screenshots and sometimes URLs. Research on Web-related problems needs larger and more sophisticated datasets. In this section, we present a methodology for combining visual, textual, and numerical elements into a single dataset on Web pages. The workflow is represented in Figure 1, which considers the example of our large dataset that integrates quantitative and qualitative attributes along with the visual appearance of Web pages. The resulting dataset is the main resource for implementing three applications in the context of Web Intelligence, which we explain in the next section.
3.1. Web Dataset Design. The first step is to define the elements that compose the dataset according to the proposed aim. Our interest is focused on the structure and content of a Web page, so we selected a series of qualitative and quantitative attributes for the structure and a webshot for the visual aspect. In this way, we created a single mixed dataset, which is designed to combine different data types (visual, textual, and numerical) and is more extensive and descriptive than current Web page datasets. Table 2 shows the list of the elements of our dataset, which are detailed as follows.
A webshot is a digital image of the entire Web page, unlike a screenshot, which may appear cropped because the page dimensions exceed the viewing device, forcing the user to scroll. The name given to the webshot is a key element that follows a convention to identify the source, category, and country of the Web page. It is also the link between the webshot and the qualitative and quantitative attributes. A URL (uniform resource locator) is the address of the Web page together with the retrieval mechanism (http/https). It is placed in the address bar of browsers, which are the programs that display the content to the user.

We collected URLs from around the world to cover different cultural characteristics and preferences, so the dataset includes attributes related to geographic location, such as country and continent. The Web pages collected belong to the following categories: arts and entertainment, business and economy, education, government, news and media, and science and environment. We considered these categories since they are part of the Web directory used here and explained in the section on Browsing.

Computational Intelligence and Neuroscience

We added the following quantitative parameters, which provide an overview of the structure and quality of a Web page. Download time: users want to wait as little as possible to view a Web page [19], which means reducing the source code download time. Size: the larger the size in bytes, the slower the download and display of the Web page. Images: these increase the download time, and a Web page is not necessarily more attractive because it has more images; a balance between all types of information is recommended [20]. Scripts: these are external files that provide the Web page with more complex functionality, and it is advisable to reduce their number because they increase network traffic and download time. CSS files: these are style files that cause an extra load and delay the display of the Web page; ideally there should be one. Tables: these are often used to structure the content of a Web page, but this is discouraged because more appropriate elements, such as "div" tags, are available. iFrames: these insert a Web page inside another one, which is not currently good practice. Style tags: these are not recommended since CSS files exist for that purpose. Finally, we have the image size in bytes, as well as the dimensions (width and height) in pixels of each webshot. Figure 2 presents a small sample of the dataset showing a case of each category.
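As an illustration of how the structural counts described above could be obtained from a page's source code, the following is a minimal sketch using only the Python standard library. It is not the authors' extraction script: the class and function names are hypothetical, and the counting rules (for example, counting only scripts with a "src" attribute as external files) are assumptions.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts structural elements of a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {"images": 0, "scripts": 0, "css_files": 0,
                       "tables": 0, "iframes": 0, "style_tags": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.counts["images"] += 1
        elif tag == "script" and attrs.get("src"):
            # Assumption: only external script files are counted.
            self.counts["scripts"] += 1
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.counts["css_files"] += 1
        elif tag == "table":
            self.counts["tables"] += 1
        elif tag == "iframe":
            self.counts["iframes"] += 1
        elif tag == "style":
            self.counts["style_tags"] += 1


def count_structure(html: str) -> dict:
    """Return the element counts plus the source size in bytes."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    counts = dict(parser.counts)
    counts["size_bytes"] = len(html.encode("utf-8"))
    return counts
```

Calling count_structure on the downloaded HTML of a page returns one dictionary per page, which could then be written as a row keyed by the webshot name.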

URL Collection.
Once the dataset had been structured, the next step was to collect URLs worldwide related to the given categories. Each URL was then used to download the HTML code, extract quantitative and qualitative attributes by scraping, and take a screenshot of the entire Web page (webshot). To obtain a larger number of URLs, we used two ways of finding information on the Web: searching and browsing. Searching requires the user to translate a need for information into queries, whereas browsing is a basic, natural human activity, occurring in an information environment where information objects are visible and organized [21]. Next, we describe how both techniques allowed us to collect URLs and, based on these, extract attributes and capture webshots, all through Python and R scripts.

3.2.1. Searching. The Google search engine asks for words or phrases related to the topic of interest. To avoid the repetitive task of typing the query into the Google search page and manually retrieving the response URLs, we automated the process through a Python script (https://osf.io/k6yrx), where
(1) The country name and its Internet code are extracted iteratively from a plain text file (https://osf.io/yrmx8).
(2) The search query has the following structure:
"site:" + countrycode + " business OR economy OR marketing OR computers OR internet OR construction OR financial OR industry OR shopping OR restaurant" + " ext:html"
OR, site, and ext are operators or reserved words that can be used in query phrases within the Google search engine. The OR operator joins several search words related to the category. The "site" operator specifies the geographic top-level Internet domain assigned to each country, e.g., "es" for Spain. The "ext:html" operator produces results exclusively with that file extension.
(3) The request returns the Google results page with the first 100 links, which is used to achieve an approximately uniform distribution of Web pages according to country and category.
(4) Web page links are extracted by automatically scanning the source code of the results page (scraping), generating a text file that contains the URLs and their attributes: country, continent, and category.
(5) For the rest of the categories, the queries are as follows:
"site:" + countrycode + " arts OR entertainment OR dance OR museums OR theatre OR literature OR artists OR galleries" + " ext:html"
"site:" + countrycode + " education OR academy OR university OR college OR school" + " ext:html"
"site:" + countrycode + " government OR military OR presidency" + " ext:html"
"site:" + countrycode + " news OR media OR magazine OR radio OR television OR newspaper" + " ext:html"
"site:" + countrycode + " science OR environment OR archaeology" + " ext:html"
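The query structure in steps (2) and (5) can be sketched as a small helper. This is a hedged illustration, not the script published at the OSF link: the dictionary and function names are hypothetical, the keyword lists are copied from the queries above, and the operators are written without internal spaces, which is how Google's operator syntax is normally written.

```python
# Keyword lists per category, taken from the queries listed above.
CATEGORY_KEYWORDS = {
    "arts and entertainment": ["arts", "entertainment", "dance", "museums",
                               "theatre", "literature", "artists", "galleries"],
    "business and economy": ["business", "economy", "marketing", "computers",
                             "internet", "construction", "financial",
                             "industry", "shopping", "restaurant"],
    "education": ["education", "academy", "university", "college", "school"],
    "government": ["government", "military", "presidency"],
    "news and media": ["news", "media", "magazine", "radio", "television",
                       "newspaper"],
    "science and environment": ["science", "environment", "archaeology"],
}


def build_query(country_code: str, category: str) -> str:
    """Build one search phrase for a country code and a category."""
    keywords = " OR ".join(CATEGORY_KEYWORDS[category])
    return f"site:{country_code} {keywords} ext:html"


def all_queries(countries):
    """Yield (country, category, query) for every country/category pair.

    `countries` is an iterable of (name, code) pairs, e.g. read from a
    plain text file with one country per line (format assumed).
    """
    for name, code in countries:
        for category in CATEGORY_KEYWORDS:
            yield name, category, build_query(code, category)
```

Iterating over every country and category in this way yields one query per pair, which is what makes an approximately uniform distribution across countries and categories possible.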

Browsing. This technique uses a Web Directory, a specialized Web site consisting of a catalog of links to other Web sites. Building, maintaining, and organizing it by category and subcategory is done by human experts, unlike search engines, which do so automatically. To include a URL, the specialists perform a review, analysis, and evaluation process to verify the requirements determined by the Web Directory. A few Web directories have survived the popularity of search engines like Google. We can highlight Best of the Web (BOTW) (Figure 3), one of the most widely recognized due to its quality, global reach, wide range of categories and subcategories, level of traffic (visits per month), reliability, number of links, and demanding requirements. Instead of a query or search phrase, it is necessary to know the hierarchical structure of the directories and subdirectories leading to the URL of interest. We took advantage of the organization by country and category defined by BOTW, e.g., for Greece: https://botw.org/top/Regional/Europe/Greece/Scienceand-Environment/ We collected the URLs published within each category through a Python script (https://osf.io/73sc2) that
(1) Iteratively reads the name of each country from a flat text file (https://osf.io/ve986)
(2) Sets the path corresponding to the category, which will always have the same structure, where only the country's name changes: "https://botw.org/top/Regional/" + countryname + "/Science_and_Environment/"
(3) Connects to the resulting Web address and obtains the source code of the results page to extract the URLs of each category
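Steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published script: the function names are hypothetical, only the "Science_and_Environment" path segment comes from the text, and the rule for deciding which links are catalog entries (absolute links pointing outside botw.org) is an assumption. The network request itself is left out so the example stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

BASE = "https://botw.org/top/Regional/"


def build_directory_url(country_name: str, category_path: str) -> str:
    """Compose the directory path; only the country name changes."""
    return f"{BASE}{country_name}/{category_path}/"


class LinkCollector(HTMLParser):
    """Collects absolute http/https links found in anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).scheme in ("http", "https"):
            self.links.append(href)


def extract_external_links(html: str, own_host: str = "botw.org") -> list:
    """Return unique links that point outside the directory itself."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    external = []
    for href in collector.links:
        host = urlparse(href).netloc.lower()
        is_own = host == own_host or host.endswith("." + own_host)
        if not is_own and href not in external:
            external.append(href)
    return external
```

In a full pipeline, the page source passed to extract_external_links would come from an HTTP request to the address returned by build_directory_url.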

Web Intelligence Applications
4.1. Use Case 1: Knowledge from Web Data. The large amount of data collected can provide useful information, which can be converted into knowledge. Through statistical analysis of the qualitative and quantitative attributes of Web pages, we can identify patterns in how they are structured, which is fundamental in Web design. We distinguish between the two sources of Web data: browsing, which is stricter, and searching, which is virtually unrestricted. R was used to compare both sources, and outliers were excluded from the calculation of statistical indicators and the creation of graphs because of the great heterogeneity of the variables cited in Table 2. Outliers differ greatly from the values considered common and may distort mathematical and visual analysis. They can be identified and omitted by using the well-known rule of "1.5 times the Interquartile Range," under which a value is an outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. For this reason, the number of values for each of the variables may differ.
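The outlier rule can be illustrated with a short sketch. The analysis in this work was carried out in R; the Python version below is only an illustration, the function names are hypothetical, and the quartile convention used here (the "inclusive" method of the standard library) is an assumption, since different tools compute quartiles slightly differently.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]
```

Because the fences are computed separately for each variable, a different number of values may be discarded per variable, which is why the counts per variable can differ.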

Qualitative Attributes.
Although we attempted to obtain a uniformly distributed set of URLs with respect to the categories, the errors cited in the section on Automatic Recognition of Error Web Pages generated the results shown in Figure 4(a). In searching, there is less imbalance, in contrast to browsing, where Web pages related to business, economy, and government predominate, possibly due to a greater need for dissemination and the economic capacity to register their Web pages in a paid service. Most of the Web pages are geographically located in Europe and Asia, for both browsing and searching, as these continents account for a larger number of countries. Moreover, their economic potential could explain why they are at the top (Figure 4(b)).

Quantitative Attributes.
The variability of searching is more evident in the number of characters in a URL (Figure 5(a)), resulting in a wider range (the difference between the minimum and maximum) and a higher mean and standard deviation. In both graphs, the values accumulate at the low end of the variable and are less frequent at the high end, so there is a tail to the right. This behavior is desirable when reading or typing a URL in a browser and indicates that there are not too many levels to reach a particular Web page.

The download time of the source code of the Web pages, shown in Figure 5(b), behaves similarly for browsing and searching. In both cases, although there is considerable variability, as shown by the width of the range and the standard deviation, the values are around 20 milliseconds and mostly low, which benefits the user who wants to view the Web page in the shortest time possible. Because of the errors cited in the debugging section, 254 of 3182 browsing URLs (7.98%) and 4079 of 46256 searching URLs (8.82%) were not accessible for downloading the source code, and hence the extraction of the quantitative parameters was not possible for them.

The behavior of the size in bytes (Figure 5(c)) is almost identical in browsing and searching, where the average size of the Web pages is approximately 50 KB. Both the variability and the tails of the distributions are practically the same, favoring a quick view of the Web page, which is directly related to a small size. Nonetheless, there are still some Web pages with a considerable size, which may be due to graphic elements or external objects linked to the page.

In Figure 5(d), almost 60% of Web pages include no images within their source code. It seems that these pages present only textual information. However, they might include images through CSS style files, which is a good practice [19]. Although the range of the number of images is wide, the average is low, 2 or 3 images per Web page, so there is a tendency to use few images within a Web page in order to make it lighter.

Web pages without scripts exceed 50% in both cases. Scripts are add-on programs that provide additional functions to Web pages. However, their use may cause incompatibilities with browsers and make the page more complex and heavy. In Figure 5(e), the trend is to minimize the presence of scripts, with 3 scripts per Web page on average.

In Figure 5(f), approximately 50% of the Web pages do not use cascading style sheet (CSS) files, whose use is recommended as good practice in Web design. The ideal number would be one style file per Web page. The average for browsing and searching is 5 and 4, respectively, which is relatively close to the ideal number. The tendency is towards low values, although there are some cases with many style files, which hinders an agile display of the Web page.

The graphs in Figure 5(g) look very similar. The majority of Web pages (about 85%) no longer use tables within the source code. While, in the past, tables were generally used to structure the content, this practice has been replaced by the "div" tag, achieving a more elegant and professional design.

More than 85% of Web pages do not use iFrames. For browsing and searching, the bars in the graph are grouped on the left (Figure 5(h)). We can deduce that embedding another document in the current HTML document through the "iFrame" tag is no longer a common practice, as there are now better options.

More than half of the Web pages (close to 60%) no longer use "style" tags in their source code. Both graphs in Figure 5(i) have bars that decrease towards the right. The trend is to minimize the number of such tags, as it is more appropriate to use CSS files. Thus, the source code of a Web page does not become overly long.

The webshot size is decisive in determining how much space our dataset will consume on a storage device. In Figure 5(j), the size fluctuates over a wide range of values, with an average of about 300 KB. Most of the values are concentrated in low sizes, but with a considerable presence of images of medium and high size. This behavior would require not only a large amount of space but also a preprocessing of the images for machine learning and deep learning applications.

The searching and browsing graphs are quite similar for the webshot width (Figure 5(k)). In both cases, the first bar, which shows the minimum, predominates significantly, as the default screenshot setting uses a width of 992 pixels. The average value is very close to the minimum since almost all images (about 85%) were captured with this default value, although there are also wider images, especially in searching, with a maximum of almost 10000 pixels. In the case of the height variable (Figure 5(l)), the minimum value also prevails, albeit to a lesser degree (about 28%), and coincides with the default value set by the screenshot, i.e., 744 pixels. Unlike the previous case, the distribution of values is less unbalanced and more variable: the highest accumulation occurs up to 5000 pixels, there is a considerable accumulation between 5000 and 10000 pixels, and, finally, some images reach a height of almost 50000 pixels. Considering the width and height, most of the Web pages have a vertical layout. These parameters are closely related to the resolution or quality of the image. The more pixels there are, the greater the resolution and quality of the images, although more storage space is required.

Finally, Table 3 summarizes the main statistical indicators for the quantitative parameters of the Web pages, for both the browsing and searching sets.

Use Case 2: Automatic Recognition of Error Web Pages.
During our automatic data collection, some events blocked the download of the HTML code or the webshot. These were caused by the following: a request for manual acceptance of cookies and SSL certificates; error messages such as HTTP 403 Forbidden, HTTP 404 Not Found, HTTP 406 Not Acceptable, and HTTP 909 Denied permission; and exceeding the timeout. We used exception handling inside the scripts to avoid interruptions in the execution of the programs. When an error occurred, the fields associated with the parameters or webshot were assigned the value "−1." Thus, the programs could continue their execution, and the absence of webshots or attributes was handled. For the final dataset, we considered only URLs that had a respective webshot, as this is the most important element of our work.

However, after a brief visual review of the webshots in the dataset, several error Web pages were detected, e.g., Web sites under construction, maintenance, domain offer, suspended account, page not found, browser incompatibility, virus, or phishing risks. Some of these are shown in Figure 6. These webshots are not useful for the dataset, so we decided to remove them. Although the final dataset would be smaller, it would be cleaner. Text analysis was not possible because the URL connections corresponding to these webshots did not return HTTP 403 or HTTP 404 error messages, nor did the HTML code contain phrases such as "suspended account" or "page under construction." We therefore implemented an image analyzer to avoid the manual, visual verification of thousands of webshots, which requires excessive time and effort. We used a convolutional neural network (CNN), the state-of-the-art tool in computer vision, to detect error Web pages and then separate them into a dedicated folder, all automatically.

Web pages that do not contain useful information are called "ERROR" Web pages, whereas Web pages that contain valuable information are called "VALID" Web pages [5]. Here, we present an automatic detection of error Web pages based exclusively on their webshots. This consists of determining whether a Web page belongs to the "VALID" category or to the "ERROR" category, i.e., a binary classification problem. To address this, we followed the methodology shown in Figure 7, each phase of which is subsequently explained in detail.
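The exception handling used during data collection, in which failed downloads are recorded with the value −1 instead of stopping the program, can be sketched as follows. This is an illustration under assumptions, not the authors' script: the field names and function names are hypothetical, and the extraction step is passed in as a callable so that the example does not depend on network access.

```python
MISSING = -1  # sentinel written when a page could not be processed

# Hypothetical field names for the quantitative attributes.
ATTRIBUTE_FIELDS = ("download_time", "size", "images", "scripts",
                    "css_files", "tables", "iframes", "style_tags")


def collect_or_mark_missing(url, extract):
    """Run `extract(url)`; on any failure, fill every field with -1.

    Catching the broad Exception class is a deliberate choice here so
    that timeouts, HTTP errors, and parsing errors never stop the loop.
    """
    try:
        record = extract(url)
    except Exception:
        record = {field: MISSING for field in ATTRIBUTE_FIELDS}
    return {"url": url, **record}


def has_attributes(record):
    """True when no attribute carries the missing-value sentinel."""
    return all(record[field] != MISSING for field in ATTRIBUTE_FIELDS)
```

Records marked in this way can later be filtered out or counted, for instance to report how many URLs could not be downloaded.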

Data Selection.
The main resource for automatic learning is the data. In our case, the data are the images that will be the input for a training process that aims, iteratively, to obtain a known output ("valid" or "error") and, if an acceptable accuracy is reached, to make predictions. The training process requires images associated with their respective category: valid or error. Since our dataset consists of two groups of images (browsing and searching), we selected the webshots of the smaller subset (browsing) to perform an exhaustive visual inspection and classify the images manually, obtaining the results shown in Table 4. Once the neural network model is adjusted, it classifies each webshot in the larger subset (searching) as valid or error.
The dataset for training an error Web page detection model has 3609 images: 427 error webshots and 3182 valid webshots, which were uploaded to Google Drive in separate folders, "VALID" and "ERROR" (Figure 8).
We used Google Colaboratory, a free platform that offers powerful hardware, requires no installation or setup, supports Python through an online notebook, and includes packages and libraries that facilitate automatic learning, such as TensorFlow, Keras, Sklearn, and others. Next, we describe the code developed (https://drive.google.com/file/d/1uKRfFb_KtP2KABRCOef1X_Bb847BDi7d/view?usp=sharing). The initial step is the connection to the data source where the images of our dataset have been stored within folders and subfolders named after the categories (Figure 8). This naming facilitates the labeling of images with their corresponding category. Two instructions are needed to access Google Drive:
from google.colab import drive
drive.mount("/content/drive")

Data Splitting: Training and Validation.
One of the tasks that characterize automatic learning is the division of data. Because we have only 3609 images, we consider two subsets: training and validation (Table 5). The training subset contains the largest number of images (80%) and is used to learn and fit the model parameters, while the validation subset (20%) is used to evaluate the capacity of the model. Although a balanced dataset, that is, one with an equal number of error cases and valid cases, would be the most appropriate, we used all the images to obtain a better generalization. To divide the images automatically into training and validation folders, it is useful to install and import the split-folders (https://pypi.org/project/split-folders/) package. It is necessary to specify the images directory, the output directory, and the proportion to split (80% and 20%, respectively).
import splitfolders
splitfolders.ratio("/content/drive/My Drive/DATA-SET", output="/content/drive/My Drive/SPLIT", seed=1337, ratio=(0.8, 0.2), group_prefix=None)
The result is a new directory structure. Within the "SPLIT" folder, the "train" and "val" folders are created, and within each of these, the "ERROR" and "VALID" folders.
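To make the idea of a per-class 80/20 split concrete, here is a minimal standard-library sketch. It is not the implementation of the split-folders package, and its rounding and shuffling choices are assumptions; the function names are hypothetical. It only shows the principle of shuffling each class reproducibly with a seed and cutting it at the requested ratio.

```python
import random


def split_class(filenames, train_ratio=0.8, seed=1337):
    """Shuffle one class reproducibly and cut it into (train, val)."""
    shuffled = sorted(filenames)            # fixed starting order
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = int(len(shuffled) * train_ratio)  # size of the training part
    return shuffled[:cut], shuffled[cut:]


def split_dataset(files_by_class, train_ratio=0.8, seed=1337):
    """Apply the same ratio inside every class (e.g. ERROR, VALID)."""
    return {label: split_class(names, train_ratio, seed)
            for label, names in files_by_class.items()}
```

Splitting inside each class keeps the proportion of error and valid webshots roughly the same in both subsets, which matters when the classes are as unbalanced as they are here.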

Data Preprocessing.
The images must be prepared before modeling. First, we normalized the pixel values (integers between 0 and 255) to a scale between 0 and 1. The ImageDataGenerator class of the Keras framework divides all the pixel values by the maximum pixel value (255).
train_datagen = ImageDataGenerator(rescale=1./255)
Second, the images have different dimensions (width and height), so they were all resized to 256 × 256 pixels by setting the target_size parameter of the flow_from_directory method. This operation was performed in groups of 32 images (batch_size) that are labeled for binary classification (class_mode) according to the folder where they are stored (valid and error) within the training directory.
train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(256, 256), batch_size=32, shuffle=False, class_mode="binary")
In this way, the small values of both pixels and dimensions help speed up the training process. The above code also applies to the validation data, with only the directory changing.

CNN Model.
The architecture of the model is based on the convolutional neural network proposed by Liu et al. to detect malicious Web sites [22]. Since this is a similar problem, we applied only minor adaptations. Its structure (Figure 9) is composed of two parts: a feature extraction part made up of convolution and pooling layers, and a classification part formed by a fully connected layer, which applies dropout to reduce overfitting. The sigmoid function generates the prediction as a probability value between 0 and 1. If the value is greater than 0.5, the Web page is valid; otherwise, it is an error Web page.
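The feature-map side lengths reported with Figure 9 (256, 254, 127, 125, 62, 60, 30) can be reproduced with simple arithmetic. The sketch below assumes 3 × 3 convolutions without padding and 2 × 2 pooling, which is consistent with those sizes but is not stated in the text; the helper names are hypothetical. It also shows the sigmoid decision rule described above.

```python
import math


def conv_out(size: int, kernel: int = 3) -> int:
    """Side length after an unpadded, stride-1 convolution (assumed 3 x 3)."""
    return size - kernel + 1


def pool_out(size: int, window: int = 2) -> int:
    """Side length after non-overlapping pooling (assumed 2 x 2)."""
    return size // window


def predict_label(logit: float) -> str:
    """Apply the sigmoid and the 0.5 threshold used for the binary decision."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return "VALID" if prob > 0.5 else "ERROR"


# Three convolution + pooling stages starting from a 256 x 256 input.
sizes = [256]
for _ in range(3):
    sizes.append(conv_out(sizes[-1]))
    sizes.append(pool_out(sizes[-1]))
```

Each stage shrinks the image by two pixels per side in the convolution and then halves it in the pooling step, so three stages reduce 256 to 30.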
The training and validation phases reached a high level of accuracy, both progressing to the same level, which is desirable. The model fits the images provided very well, but how it behaves with new images (generalization) is uncertain. This concern is addressed by analyzing the difference between the training and validation losses. The validation loss, despite oscillating, stays close to the training loss until iteration 17, after which the two start to separate, suggesting possible overfitting. Therefore, the model is saved with the accuracy and parameters of iteration 16. We can say that the model distinguishes error Web pages from valid Web pages acceptably, and we therefore move on to the prediction phase.
4.2.6. Prediction.
The images from the largest set (Searching) of our dataset become the input of the trained and validated model. We used the google.colab library to select and upload the file (webshot) from the local drive by clicking the "Choose Files" button.
from google.colab import files
uploaded = files.upload()
Once the file is fully uploaded, it is preprocessed using the keras.preprocessing and NumPy libraries to transform the image into an array with a suitable shape and normalized pixel values for the model, which then makes the prediction.
img = image.load_img(path, target_size = (256, 256))
x = image.img_to_array(img)/255.
x = np.expand_dims(x, axis = 0)
[Figure 9 labels, feature-map sizes: 3@256×256, 3@254×254, 32@127×127, 32@125×125, 32@62×62, 64@60×60, 64@30×30]
The result, for images selected one at a time, is shown in Figure 11. The resized webshot is displayed together with the prediction, a probability value between 0 and 1. In Figure 11(a), the value is less than 0.5, so the category assigned is ERROR; Figure 11(b) shows a valid Web page, with a probability value very close to 1.
Beyond making predictions one by one, it is more important for our purpose to generate predictions for groups of images. To do so, we simply select a list of files using the "Choose Files" button. As our dataset is organized by topic category, we can select all the images in one category, e.g., "arts and entertainment." An extract of the results is shown in Figure 12.
This list of predictions is passed to a spreadsheet, where a filter selects the Web pages in the error category, which are saved as a text file (list.txt). This file is the input to a command that moves all the listed images from their original folder to the "ERROR" folder. The command runs in Windows (PowerShell interface), although it is easily adaptable to other operating systems such as Linux.
cat list.txt | ForEach {mv $_ ERROR}
As a result of the prediction, 822 error Web pages and 7747 valid Web pages were found. Once the images were classified and separated, the visual verification was much faster, and we were able to manually establish the successes and failures of the classifier. Thus, we identified 1214 real error Web pages and 7355 real valid Web pages. The same procedure was performed for the remaining images in the other categories. The results are summarized in Table 6.
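The filter-and-move step can also be written in Python, which avoids the spreadsheet and runs on any operating system. This is a sketch under stated assumptions rather than the procedure actually used: the function move_error_pages, the file names, and the probabilities in the demonstration are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def move_error_pages(predictions: dict, image_dir: Path, threshold: float = 0.5) -> list:
    """Move every image whose probability is not above `threshold` to image_dir/ERROR."""
    error_dir = image_dir / "ERROR"
    error_dir.mkdir(exist_ok=True)
    moved = sorted(name for name, prob in predictions.items() if prob <= threshold)
    # Keep the same intermediate file as in the text, one file name per line.
    (image_dir / "list.txt").write_text("\n".join(moved), encoding="utf-8")
    for name in moved:
        shutil.move(str(image_dir / name), str(error_dir / name))
    return moved


# Demonstration with empty placeholder files and made-up probabilities.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    demo_predictions = {"a.png": 0.03, "b.png": 0.98, "c.png": 0.41}
    for name in demo_predictions:
        (folder / name).touch()
    demo_moved = move_error_pages(demo_predictions, folder)
    demo_remaining = sorted(p.name for p in folder.glob("*.png"))
```

Writing list.txt before moving the files leaves a record of what was classified as an error page, which is useful for the manual verification that follows.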
We used the confusion matrix to evaluate and determine the accuracy of our model. This table compares reality and prediction and, based on successes and failures, calculates an accuracy value. For example, for the "arts and entertainment" category, the classifier predicted 822 error Web pages, but it failed in 32 instances. There were 7747 predictions of valid Web pages, but 424 were incorrect. These values are placed in Table 7 and substituted into the formula for accuracy.
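As a worked check, the counts quoted above can be substituted into the usual accuracy formula, correct predictions divided by all predictions. The sketch treats the error class as the positive class, which is a labeling assumption that does not affect the result; the variable names are illustrative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions in a binary confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts for the "arts and entertainment" category, taken from the text.
predicted_error, wrong_error = 822, 32
predicted_valid, wrong_valid = 7747, 424

tp = predicted_error - wrong_error  # error pages correctly detected: 790
tn = predicted_valid - wrong_valid  # valid pages correctly detected: 7323
fp = wrong_error                    # valid pages flagged as errors
fn = wrong_valid                    # error pages that passed as valid

real_error = tp + fn
real_valid = tn + fp
arts_accuracy = accuracy(tp, tn, fp, fn)
print(f"{arts_accuracy:.2%}")
```

The derived totals of real error and real valid pages (1214 and 7355) match the figures obtained by manual verification, so the four cells are mutually consistent.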
The classifier reached an accuracy of 94.68% for this category, which is good considering the small number of images involved in the training process. Table 8 shows the respective confusion matrix for each category and a total matrix indicating an accuracy of 92.47% for the entire searching dataset.
After running the automatic error Web page detection, the debugging is complete. The composition and final size of our dataset are shown in Table 9. Combining browsing and searching techniques, we managed to collect 49438 valid Web pages that occupy approximately 17 GB.

Use Case 3: Web Categorization.
Here, we demonstrate the use of the presented dataset with a practical case related to Web categorization. The classification of Web pages, also called Web categorization, determines whether a Web page or a Web site belongs to a particular category. For example, judging whether a page is about "arts," "business," or "sports" is an instance of subject classification [23]. Beyond the underlying programming code, which is complex to analyze, visual appearance is also an important part of a Web page, and many topics have a distinctive visual appearance. For example, Web design blogs have a highly designed visual appearance, whereas newspaper sites have a great deal of text and images [1]. In this sense, we present an automatic categorization of Web pages according to topic or subject, based exclusively on their visual appearance. We leverage the dataset generated in this work, formed by webshots belonging to 6 categories: arts and entertainment, business and economy, education, government, news and media, and science and environment. Therefore, the problem becomes a multiclass categorization. We implemented a deep learning model with a convolutional neural network. In essence, this is a learning process with the collected webshots in order to achieve an acceptable accuracy and then make predictions. We hoped to capture features (difficult to identify manually) that can distinguish categories, predict to which of them a Web page would belong, analyze the difficulty of the topic classification of Web pages, and verify whether there are particular patterns for each category. The following results were selected from a series of experiments in which different models and architectures were tested using the entire dataset and parts of it. The code developed, as well as the weights of the adjusted deep learning model, is publicly accessible (https://osf.io/8zfh2). The best results were obtained with the transfer learning technique and the images of the browsing dataset. This may be because in [...]
The images are stored within the directory structure shown in Figure 13. The main folder of the dataset is divided into training and validation folders, whose subfolders represent the categories and have the same names as the topics considered in this work.
After organizing the images, a preprocessing step is advisable to normalize the pixel values (integers between 0 and 255) to a scale between 0 and 1. It is also necessary to resize the images to the 224 × 224 pixels recommended for the model, because the images in the dataset have different dimensions (width and height). Both are common practices that help speed up training. Several models were tested with a variety of options to achieve greater accuracy. The final model exploits the transfer learning technique using ResNet [24], a competitive CNN pretrained on the ImageNet dataset (more than [...] acceptably. Although we increased the data, tuned the hyperparameters, and applied regularization techniques such as dropout, accuracy did not improve and overfitting was not significantly reduced. For a better understanding of the results, Table 11 shows the confusion matrix with the validation data.
If we focus on the categories of arts and entertainment, government, and news and media, the model is correct in most cases, although the number of successes is low. This test uses the validation data, a total of 444 images, and achieves an accuracy of 38.29% according to the confusion matrix. For the remaining categories, the model becomes significantly confused. Classifying these categories is a complex problem. The composition of today's Web pages is becoming increasingly complex, and the content has a high variability of visual features, even within the same category.
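To put the reported accuracy in context, it can be converted into a count of correct predictions and compared with random guessing. This is only a rough sketch: the chance level of one in six assumes the six categories are equally represented in the validation data, which the text does not state.

```python
total_images = 444          # validation images, from the text
reported_accuracy = 0.3829  # 38.29%, from the text
n_categories = 6

# Number of correctly classified webshots implied by the reported accuracy.
correct = round(reported_accuracy * total_images)

# Accuracy expected from uniform random guessing over six balanced classes.
chance = 1 / n_categories

print(f"correct: {correct} of {total_images}")
print(f"chance level: {chance:.2%}")
```

Under that assumption the model performs clearly better than guessing, while still misclassifying most webshots.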

Conclusion and Future Work
We created a large dataset on Web pages that combines different types of data: text, numbers, and images. We automated the workflow with scripts in Python and R to collect URLs and their respective webshots, while scraping allowed us to extract attributes from each Web page. The methodology designed can be adapted to problems requiring the collection, organization, analysis, and publication of large amounts of data. In addition, we developed three Web intelligence applications using this dataset. First, the qualitative and quantitative attributes of the dataset allowed us to obtain useful information about the structure of the Web pages. Statistical analysis of these attributes showed a very heterogeneous distribution, high variability, and a tendency towards low values. This suggests that Web design follows an implicit rule of optimization, since the higher the values, the longer the page download and display time, leading to user discomfort. Second, we were able to automatically collect a total of 58174 webshots, although the final dataset was reduced to 49438 after the elimination of error Web pages. We implemented an automatic detection of error Web pages based on a CNN model trained from scratch, which achieved an accuracy of 92.47% on the searching dataset. This approach could make debugging more efficient and help address the significant presence of invalid pages on the Internet, which affects webmasters, search engines, and users in general. Third, Web categorization based exclusively on webshots using a multiclass CNN model proved to be a complex problem. The difficulty increases when the categories cover a wide range of topics and within each topic there is great variability in the visual appearance of Web pages. However, the remarkable accuracy of the model for the government category, as well as the arts and entertainment category, allows us to infer the existence of distinctive visual patterns, which can be a baseline for future research. The results could be improved by increasing the dataset and by preprocessing webshots to the same size and resolution through cropping and scaling. Finally, our work may motivate the extension of the dataset with further categories, URLs, webshots, and other attributes; the exploration of alternative URL sources, i.e., a search engine other than Google and a Web directory other than BOTW; and the improvement of the accuracy achieved in multiclass Web categorization with deeper and more recent convolutional neural networks.

Figure 1: Methodology created to produce the Web page dataset.

(4) The links are stored in a text file (https://osf.io/hjwgm) and imported into a spreadsheet, where a filter is applied to select only those links belonging to a particular country.
3.3. Attribute Collection by Scraping.
After collecting and storing the URLs from the two previously described sources, we implemented a Python script (https://osf.io/de78f) to (a) sequentially read the URL links stored in the text file (https://osf.io/gk3p2); (b) make a connection through the browser to each of these links; and (c) download and analyze the source code of the Web page, obtaining the attributes specified in the dataset: download time in seconds, total size in bytes, number of images, script files, CSS files, tables, iFrame tags, and style tags. This script includes the Web scraping library
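Step (c) can be illustrated with Python's built-in html.parser module. This is a minimal sketch and not the published script: the class AttributeCounter, the counting rules (for example, treating only script tags with a src attribute as script files), and the sample HTML are assumptions, and download time and page size are left out because they require a network connection.

```python
from html.parser import HTMLParser


class AttributeCounter(HTMLParser):
    """Count the tags behind the tag-based quantitative attributes of a page."""

    def __init__(self):
        super().__init__()
        self.counts = {"images": 0, "scripts": 0, "css": 0,
                       "tables": 0, "iframes": 0, "styles": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.counts["images"] += 1
        elif tag == "script" and attrs.get("src"):
            self.counts["scripts"] += 1  # external script files only (assumption)
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.counts["css"] += 1      # linked CSS files
        elif tag == "table":
            self.counts["tables"] += 1
        elif tag == "iframe":
            self.counts["iframes"] += 1
        elif tag == "style":
            self.counts["styles"] += 1


# Hypothetical page source used only for the demonstration.
sample_html = """
<html><head><link rel="stylesheet" href="a.css"><style>p {}</style>
<script src="a.js"></script></head>
<body><img src="x.png"><img src="y.png"/><table></table>
<iframe src="z.html"></iframe></body></html>
"""

counter = AttributeCounter()
counter.feed(sample_html)
page_counts = counter.counts
```

In a full pipeline, the same counter would be fed the source code downloaded for each URL read from the text file.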

Figure 2: A sample of the dataset, including one example of each category.

Figure 4: Distribution of the qualitative attributes about Web pages: (a) category and (b) continent.

Figure 5: Distribution of the quantitative attributes about Web pages.

Figure 6: A sample of the error Web pages.

Figure 7: Methodology for detecting error Web pages.

Figure 9: CNN architecture for error Web page detection.

Figure 10: Accuracy and loss in training and validation phases.

Figure 11: Prediction of error (a) and valid (b) Web page.

Figure 12: Prediction for a group of Web pages.

Table 1: Main characteristics of the datasets reviewed.
academic work of the institution. Meanwhile, Nordhoff et al.

Table 2: Structure of the dataset.

Table 3: Summary of statistical indicators for quantitative attributes.

Table 5: Separation of dataset for training and validation.

Table 6: Results of the binary Web categorization.

Table 7: Confusion matrix for the "arts and entertainment" category.

Table 8: Confusion matrix for the rest of categories and overall result.

Table 11: Confusion matrix for validation data.