Bidders Recommender for Public Procurement Auctions Using Machine Learning: Data Analysis, Algorithm, and Case Study with Tenders from Spain

Recommending the identity of bidders in public procurement auctions (tenders) has a significant impact in many areas of public procurement, but it has not yet been studied in depth. A bidders recommender would be a very beneficial tool because a supplier (company) can search for appropriate tenders and, vice versa, a public procurement agency can automatically discover unknown companies which are suitable for its tender. This paper develops a pioneering algorithm to recommend potential bidders using a machine learning method, particularly a random forest classifier. The bidders recommender is described theoretically, so it can be implemented or adapted to any particular situation. It has been successfully validated with a case study: an actual Spanish tender dataset (free public information) which has 102,087 tenders from 2014 to 2020 and a company dataset (nonfree public information) which has 1,353,213 Spanish companies. Quantitative, graphical, and statistical descriptions of both datasets are presented. The results of the case study were satisfactory: the winning bidding company is within the recommended companies group in 24% to 38% of the tenders, according to different test conditions and scenarios.


Introduction
The largest adjudicators of a country, by number of projects and by cost, are public procurement agencies. For example, public authorities in the European Union spend around 14% of GDP (around €2 trillion) on public procurement [1] every year. Public procurement is defined as the purchase of goods, works, or services by a public agency. Because of its size, public procurement is clearly important to politicians, citizens, researchers, and companies. On the other hand, the European open data market size (products and services enabled by open data) was €184.45 billion in 2019, according to the official European Data Portal [2]. High growth is expected in the near future. The availability of open data in public procurement announcements (also known as tenders) enables the building of a bidders recommender. The bidders recommender may be a strategic tool for improving the efficiency and competitiveness of organisations and is particularly suitable for the two main stakeholders: suppliers and public procurement agencies. On the one hand, it is useful to the supplier because it assists in identifying the most suitable tenders, i.e., those that they should prioritise. On the other hand, the contracting agency could automatically search for companies with a profile compatible with the tender's announcement, e.g., in selective tendering, where suppliers participate by invitation only. Thus, it could be called a "bidders search engine" or a "bidders recommender."
Many public agencies do not easily obtain competitive offers when they publish public procurement announcements. It is a serious problem with negative consequences for the project in terms of cost, quality, lifetime, sustainability, etc. A bidders recommender would produce significant benefits as follows: (i) Tenders with more bidders have lower award prices and, consequently, the public agencies will reduce costs. This relationship is quantitatively demonstrated for Spanish tenders in this paper, but there are more empirical studies, e.g., in Italy [3] and the Czech Republic [4,5]. (ii) This new tool will provide support to small- and medium-sized enterprises (SMEs), which play a crucial role in most economies. It will make it easier and more efficient for SMEs to access procurement auctions, promote inclusive growth, and support principles such as equal treatment, open access, and effective competition [6]. (iii) In scenarios of high participation, it is more difficult to generate corruption or collusive tendering (where the bidders do not compete honestly).
The main objective of this paper is to propose an algorithm to search for suppliers (companies) to invite to tender. Discovering the number and identity of bidders is challenging, since there is no suitable quantitative model to forecast the identity of a single competitor or a group of specific key competitors likely to submit a future tender [7]. So, the input of the bidders recommender has to be the tender's announcement, and the algorithm has to be generic enough to be implemented or adapted to any particular situation. The main issue is to get information about bidders and the rest of the companies in the market because, in many countries, the information is not public or free.
Some papers have proposed similar tools, but only the tenders are characterised or analysed, not the bidders, e.g., a product search service [8] or a similar-tenders search engine (comparison of one tender to all other tenders according to specific criteria) [9]. Our work is based on the profile of the winning companies rather than the characteristics of the tender. Thus, this paper is a novel study which brings a new and modern perspective to matching tenders and bidders.
The bidders recommender has used tenders that have been published in Spain. In particular, the tender dataset has 102,087 Spanish tenders from 2014 to 2020. All types of works are included, not only construction auctions (which are the favourite subjects in the public procurement literature, for several reasons).
The company dataset, which is used to search for suitable bidders, has 1,353,213 Spanish companies. The Spanish public procurement system, as well as the European and national legislation, is described in [10,11], which have also analysed Spanish tenders for other purposes.
The application of this pioneering bidders recommender by public procurement agencies or potential bidders is summarised in Figure 1. It has three sequential steps or phases, and the input is a new public procurement announcement, also known as a tender notice. The first phase forecasts the winning company of the tender using a machine learning method called a random forest classifier. This classification model has previously been trained with a large number of tenders and their respective winning companies. The second phase adds the business information of the forecast winning company to create a profile of a winning company. The business information is in the company dataset (data from the Business Register). Finally, similar or compatible companies are searched for according to their profile, where the search criteria are filters or fixed rules.
The paper is structured as follows. Section 2 summarises the literature review associated with the bidders recommender in public procurement auctions. Section 3 presents the fields of the dataset and the machine learning algorithm (called random forest classifier) which will be used in the recommender. Furthermore, the bidders recommender is explained in detail (Section 3.5) and some evaluation metrics are defined to measure the accuracy of detecting the winning company of the tender within the group of bidders. Section 4 quantitatively describes the datasets for the real case study from Spain to test the bidders recommender. It is tested under different scenarios, and the results are presented in Section 4.3. In Section 5, the recommender is discussed from a general perspective to be applied to other countries or datasets. Finally, some concluding remarks, limitations, and avenues for future research are presented in Section 6.

Literature Review
This paper involves (either directly or indirectly) diverse topics such as open government data, public procurement and its regulation, machine learning, tender evaluation, prediction techniques, business registers, and so on. The bidders recommender has a multidisciplinary nature which fills a gap in the literature. Nevertheless, the key components have an extensive literature which will be summarised in the following paragraphs.
In this article, we used open data and, especially, Open Government Data (OGD). The OGD initiatives have grown very strongly in the academic field [12][13][14]. That is to say, open data are produced by governmental entities in order to promote government transparency and accountability. Hence, there are different stakeholders, user groups, and perspectives [15,16]. OGD is a part of the public value of e-government [17], and it is a new and important resource with economic value [18,19]. For example, data.europa.eu and data.gov are online portals that provide open access datasets in a machine-readable format [20] and are generated by the public agencies of the European Union and the United States of America, respectively. However, there are challenges and risks in dealing with the data quality of open datasets (quality over quantity) [21], and this article suffers from these too. It is very important to measure the transparency and the metadata quality in the open government data portals [22][23][24].
Other public procurement fields that have recently sparked the interest of governments, policy makers, and researchers are Big and Open Linked Data (BOLD) [25], the growing awareness of public procurement as an innovation policy tool [26], and the role of e-government in sustainable public procurement [27].
2 Complexity
This article uses a machine learning algorithm. Big data and machine learning technologies can be used for econometrics [28,29], enterprises [30], tender evaluation [31], or analysis of public procurement notices [32]. Therefore, this paper follows the trends in the literature. There is extensive literature about tender evaluation (also called bidding selection methods) for the selection of the optimal supplier in public procurement [33] with different techniques such as economic scoring formulas [34], data envelopment analysis [35], or multicriteria decision making [36,37], and where multiple bidders are evaluated on the basis of price and quality [38]. In particular, the most studied public procurement auctions are related to construction, i.e., distribution of bids [39], bidding competitiveness and position performance [40], strategic bidding [41], tender evaluation and contractor selection [42,43], and empirical analysis in countries such as Slovakia [44]. There are almost no studies which include all kinds of business sectors and a large volume of tenders. This article, by contrast, takes a holistic approach thanks to its large tender dataset covering all sectors.
Another relevant subject in the public procurement literature is the detection of collusive tendering or bid rigging [45] with case studies in Spain [46], India [47], and Hungary [48].
This occurs when businesses that would otherwise be expected to compete secretly conspire to raise prices or lower the quality of goods or services for purchasers in a public procurement auction (this is called a cartel). In addition, public procurement contracts have other issues such as optimal quality [49], too many regulations [50], systemic risk [51], or corruption [52][53][54]. Corruption is a form of dishonesty undertaken by a person or organisation with the authority to acquire illicit benefit. There are empirical studies to detect corruption by analysing public tenders in many countries, for example, in China [55], Russia [56], the Czech Republic [57], and Hungary [58]. The application of algorithms by governments or enterprises to detect collusion or corruption [59], especially using machine learning methods [60][61][62], has become an almost inevitable topic and the subject of numerous studies. Indirectly, this article could create a useful tool for these topics since it is able to forecast the most probable winning bidders and, therefore, to detect unlikely winners too.
Forecasting and prediction techniques are widely studied and applied in the academic field of public procurement auctions. In [63], the mathematical relationship between scoring parameters in tendering is studied because, among other reasons, it is useful for the bid tender forecasting model [64]. There are some notable key parameters which have been analysed in the forecasting literature, especially for construction auctions, from traditional techniques to new machine learning methods, for example, the probability of bidder participation [7], an award price estimator [10,65,66], or a cost estimator [67,68]. However, as far as we know, this article is the first attempt to forecast the winning company for all tenders in a country.
In conclusion, this paper creates a smart search engine to recommend a group of companies for each tender, according to the forecast winning company. This means they have similar business, technical, and economic profiles. Therefore, it is necessary to find these profiles in the Business Registers [69,70] or other databases where the company's annual accounts are available. For instance, it is even possible to forecast corporate distress using machine learning on such reports [71]. The analysis of a company's profile has the same basis as the academic topic called bankruptcy prediction. This is the measurement of corporate solvency and the creation of prediction models [72] to forecast company failure or distress. It has been intensively discussed over the past decades [73], using traditional statistical techniques [74][75][76] or machine learning methods, such as gradient boosting [72], neural networks [77], support vector machines [78], or the comparison of different methods [79,80].

Figure 1: Bidders recommender application. (1) Forecasting phase: forecast the winning company using the classification model (random forest) previously trained. (2) Aggregation phase: add the company's information (location, employees, classification of activities, EBITDA, etc.) for the forecast winning company. (3) Searching phase: search the company dataset for companies similar to the forecast winning company. Data source shown in the figure: company's information dataset.

Materials and Methods
This section describes the necessary components to create the bidders recommender proposed in this article. It is described theoretically so that it can be implemented in any country, not only in Spain. Section 3.1 presents the origin of the tender dataset and describes its fields, and, analogously, the company dataset is presented in Section 3.2. Section 3.3 explains the random forest classifier which is used in the first phase of the bidders recommender method. In Section 3.4, the evaluation metrics are defined to measure the recommender's accuracy. Finally, the bidders recommender algorithm is described in detail in Section 3.5.

Tender Dataset.
The European and Spanish legislation on public procurement and on the reuse of public information is extensively detailed in [11]. The official website of the Public Sector Contracting Platform (P.S.C.P.) of Spain publishes the public procurement notices and their resolutions of all contracting agencies belonging to the Spanish Public Sector. The P.S.C.P. has an open data section for the reuse of this information which will be used in this article to generate the tender dataset. The information is provided by the Ministry of Finance (the link is given in the Data Availability section) and has been published as open data since 2012. The fields, their descriptions, and the process to obtain the dataset are the same as discussed in [10]. However, these fields are shown in Table 1 for the convenience of the reader. A remarkable limitation is that only the identity of the winning company is known, not the rest of the bidders, and this will be a constraint for the recommendation system.

Company Dataset.
In general, obtaining business information (companies' annual accounts) over several years is neither easy nor free. In Europe, Business Registers offer a range of services, which may vary from one country to another. However, the core services provided by all registers are to examine and store company information and to make this information available to the public [69]. European Regulation 2015/884 [81] interconnects the Business Registers of the EU countries. For more information, the European Business Registry Association [82] has a list of Business Registers from around the world.
The authors have collected a dataset of annual accounts from Spanish companies, based on the information available in the Spanish Business Register. It is a public institution, but access is not free of charge. It is the main legal instrument for recording business activity: the company documents and the submission of the annual accounts. Companies become legal entities through their registration in the Business Register.
The fields of the company dataset are explained in Table 2. They can be divided into 5 headings: general information, human resources, location, accounting measures (operating income, EBIT, and EBITDA), and different systems for classifying industries or economic activities (CNAE, NACE2, IAE, US SIC, and NAICS). It should be noted that the company's annual accounts have more fields, but the authors have not been able to access and collect them. The fields of this dataset try to characterise the company from different points of view: main business activities (CNAE, NACE2, IAE, US SIC, and NAICS), nearby market (location), work capacity (employees), size (operating income), financial performance (EBITDA), etc. Not all of the fields have been used because some are not relevant to the analysis in this paper.

Random Forest Classifier.
Random forest (RF), introduced by Breiman [83] in 2001, is an ensemble learning method for classification or regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. It is a popular learning algorithm that offers excellent performance [84], resistance to overfitting [85,86], and versatility, both in its applicability to large-scale problems and in handling different types of data [85,87]. In particular, random forest has been applied with remarkable success to tender datasets, for example in [10]. It provides its own internal generalisation error estimate, called the out-of-bag (OOB) error. A simplified algorithm of RF for classification [88] is summarised in Algorithm 1.
At each split in each tree, the improvement in the split criterion is the measure of the importance attributed to the splitting variable and is accumulated over all the trees in the forest separately for each variable. This is called "variable importance" [83].
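The out-of-bag idea mentioned above can be illustrated with a small sketch. This is not a random forest: to stay short and dependency-free, the base learner is swapped for a 1-nearest-neighbour rule, and the function names, toy data, and labels are hypothetical. Only the bootstrap and out-of-bag voting logic mirrors what an RF does.

```python
import random
from collections import Counter


def nearest_label(bag, x):
    """Label of the bagged point closest to x (squared Euclidean distance)."""
    return min(bag, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]


def oob_error(samples, n_learners=25, seed=0):
    """Out-of-bag error of a bagged ensemble of 1-nearest-neighbour learners."""
    rng = random.Random(seed)
    n = len(samples)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        in_bag = set(idx)
        bag = [samples[i] for i in idx]
        for i, (x, _) in enumerate(samples):
            if i not in in_bag:  # only learners that never saw sample i may vote on it
                votes[i][nearest_label(bag, x)] += 1
    scored = [(i, v) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no sample was ever out of bag; increase n_learners")
    wrong = sum(v.most_common(1)[0][0] != samples[i][1] for i, v in scored)
    return wrong / len(scored)


# Toy data (hypothetical): two well-separated clusters labelled "A" and "B".
toy = [((float(i % 3), float(i % 2)), "A") for i in range(10)]
toy += [((10.0 + i % 3, 10.0 + i % 2), "B") for i in range(10)]
err = oob_error(toy)
print(f"OOB error: {err:.3f}")
```

Because every sample is scored only by learners whose bootstrap sample excluded it, the estimate needs no separate validation subset.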

Evaluation Metrics.
It is necessary to define some error metrics to compare similar variables of the datasets and calculate the prediction error of the bidders recommender.
The use of metrics based on medians and relative percentages is useful because the dataset has outliers of great weight, and such metrics help to counteract the effect of these outliers. To compare variables of the dataset, the median absolute percentage error (MdAPE) was used, as defined in the following equation:
$$\mathrm{MdAPE} = \underset{t=1,\dots,n}{\mathrm{median}} \left( \left| \frac{A_t - F_t}{A_t} \right| \right) \times 100, \tag{1}$$
where $A_t$ is the actual value for period $t$, $F_t$ is the expected value for period $t$, and $n$ is the number of periods.
The following error metrics measure the prediction error of the RF classifier method for multiclass classification on imbalanced datasets [89]. Multiclass classification occurs when the input is classified into one, and only one, nonoverlapping class. An imbalanced dataset occurs when there is a disproportionate ratio of observations in each class.
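As a small worked example of the MdAPE, the sketch below computes it for three hypothetical price pairs. The function name and the numbers are illustrative assumptions, not values from the datasets.

```python
from statistics import median


def mdape(actual, forecast):
    """Median absolute percentage error, in percent.

    Actual values must be non-zero, since each error is relative to A_t.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return median(abs(a - f) / abs(a) * 100 for a, f in zip(actual, forecast))


# Hypothetical values: A_t (e.g. tender prices) and F_t (e.g. award prices).
actual_values = [100.0, 200.0, 400.0]
expected_values = [90.0, 150.0, 300.0]
print(mdape(actual_values, expected_values))  # absolute errors are 10%, 25%, 25%
```

Using the median rather than the mean means a single extreme pair cannot dominate the result, which is the motivation given above.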
Let $\hat{y}_i$ be the predicted value of the $i$-th sample ($1 \le i \le n$), $y_i$ be the corresponding true value, $\varpi_i$ be the corresponding sample weight, and $L$ be the set of classes ($l \in L$). Accuracy (2) is the proportion of correct predictions over $n$ samples:
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 1(\hat{y}_i = y_i), \tag{2}$$
where $1(\cdot)$ is the indicator function, which returns 1 if the classes match and 0 otherwise.

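Balanced accuracy (3) avoids inflated performance estimates on imbalanced datasets:
$$\text{balanced-accuracy}(y, \hat{y}, \varpi) = \frac{1}{\sum_{i} \hat{\varpi}_i} \sum_{i} 1(\hat{y}_i = y_i)\, \hat{\varpi}_i, \tag{3}$$
where $1(\cdot)$ is the indicator function and
$$\hat{\varpi}_i = \frac{\varpi_i}{\sum_{j} 1(y_j = y_i)\, \varpi_j}.$$
Let $y_l$ be the subset of true values with class $l$ and, analogously, $\hat{y}_l$ the subset of predicted values with class $l$. The precision (average macro) is calculated as follows:
$$\mathrm{precision}_{\mathrm{macro}} = \frac{1}{|L|} \sum_{l \in L} \frac{|y_l \cap \hat{y}_l|}{|\hat{y}_l|}.$$
Finally, the out-of-bag (OOB) error is a method of measuring the prediction error in RF and other machine learning models that use bootstrap aggregating (bagging). The OOB error is the average error for each training sample $z_i = (x_i, y_i)$, calculated using predictions from the trees that do not contain $z_i$ in their respective bootstrap sample. This allows the RF classifier to be fitted and validated while being trained [88].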
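The three classification metrics can be sketched in a few lines. This is a minimal illustration assuming unit sample weights, in which case balanced accuracy reduces to the mean of the per-class recalls; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """With unit weights: mean over true classes of (correct / class size)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of correct / predicted-as-class.

    A class that is never predicted contributes 0 (a convention assumed here).
    """
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    classes = set(y_true) | set(y_pred)
    per_class = [hits[c] / predicted[c] if predicted[c] else 0.0 for c in classes]
    return sum(per_class) / len(classes)


# Imbalanced toy example: class "A" dominates and "B" is never predicted.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred), macro_precision(y_true, y_pred))
```

In this example the plain accuracy is 5/6 while the balanced accuracy is only 2/3, which shows how a dominant class can inflate the unbalanced estimate.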

Creation of the Bidders Recommender Algorithm.
The flowchart for the creation of the bidders recommender is summarised in Figure 2. The two data sources and the steps for its development are illustrated. It is important to note that the application of the bidders recommender is one thing (see Figure 1), but its creation and setting is another. The steps are quite similar, but they are not the same.
The creation of the bidders recommender has the following four sequential steps. It is based on initially training the classification model, then forecasting the winning company, and aggregating its business information. Finally, it requires searching for similar companies according to the profile, where the search criteria are filters or fixed rules.
(1) Training and Forecasting Phase. Train the classification model (random forest classifier) over the tender dataset. Typically, 80% of the data is for the training subset and 20% is for the testing subset. Then, forecast the winning company for each tender of the testing subset by applying the previous classification model. The following input and output variables (described in Table 1) have been used by the random forest classifier:
(i) Input variables: Procedure_code, Subtype_code, Name_Organisation, Date, CCAA, Province, Municipality, Latitude, Longitude, Tender_Price, CPV, and Duration.
(ii) Output variables (forecast): N winning companies (variable called CIF_Winner) for each tender. Typically, N = 1, but it is also possible to predict the N most probable companies to win the tender.
At this point, the $\mathrm{accuracy}_{n=N}$ of the testing subset can be calculated. It will be the minimum accuracy of the bidders recommender because these N forecast winning companies will be inserted into the recommended companies group.
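The accuracy over the N most probable companies can be sketched as follows. The sketch assumes that the classifier yields one score per candidate company for each tender; the function names, CIF values, and scores are hypothetical.

```python
def top_n(scores, n):
    """Return the n company identifiers with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def accuracy_top_n(score_rows, winners, n):
    """Share of tenders whose real winner is among the n top-scored companies."""
    hits = sum(w in top_n(s, n) for s, w in zip(score_rows, winners))
    return hits / len(winners)


# Hypothetical class scores for three tenders and their real winners.
score_rows = [
    {"CIF_A": 0.6, "CIF_B": 0.3, "CIF_C": 0.1},
    {"CIF_A": 0.2, "CIF_B": 0.5, "CIF_C": 0.3},
    {"CIF_A": 0.4, "CIF_B": 0.1, "CIF_C": 0.5},
]
winners = ["CIF_A", "CIF_C", "CIF_B"]

for n in (1, 2):
    print(n, accuracy_top_n(score_rows, winners, n))
```

Raising N can only keep or increase this accuracy, since the top-N set for a larger N contains the set for a smaller N.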
(2) Aggregation Phase. Add the business fields from the company dataset (described in Table 2) to the forecast winning company estimated in the previous step.
(3) Searching Phase. In the company dataset, search for companies similar to the forecast winning company. This creates a recommended companies group for each tender. The search criteria (filters) are a basic mechanism to modulate the number of recommended companies, and they are described below. Each filter has a constant factor (a numeric value from 0 to infinity) to increase or decrease the size of the search. For the forecast company, the minimum annual values of $\text{Operating\_Income}_{\text{forecast co.}}$, $\text{EBIT}_{\text{forecast co.}}$, and $\text{EBITDA}_{\text{forecast co.}}$ over the last 5 available years were selected. For the companies being searched, the $\text{Operating\_Income}_{\text{co.}}$, $\text{EBIT}_{\text{co.}}$, and $\text{EBITDA}_{\text{co.}}$ of the tender's year were selected.
(4) Evaluation Phase.Check if the real winner company is within the recommended companies group created for each tender (phase 3).
This evaluation metric is called $\mathrm{accuracy}_{n=M}$. Logically, $\mathrm{accuracy}_{n=M} \ge \mathrm{accuracy}_{n=N}$ because the N forecast winning companies (phase 1) are automatically within the recommended companies group. Furthermore, the mean and median number of recommended companies per tender are calculated. Large groups are more likely to contain the real winner company but, obviously, the smart search engine is then less useful because it recommends too many companies. Therefore, the bidders recommender selects winning companies from the tender dataset but also incorporates new companies available in the market (company dataset) that have a similar profile to the forecast winning company. Creating this profile to search for similar companies is a very complex issue, which has been simplified. For this reason, the searching phase (3) has basic filters or rules. Moreover, it is possible to modify or add other filters according to the available company dataset used in the aggregation phase.
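The evaluation phase can be sketched as a membership check plus two group-size statistics. The function name, the dictionary keys, and the toy groups are hypothetical.

```python
from statistics import mean, median


def evaluate(recommended_groups, winners):
    """Hit rate of the real winner within each group, plus group-size statistics."""
    hits = sum(w in group for group, w in zip(recommended_groups, winners))
    sizes = [len(group) for group in recommended_groups]
    return {
        "accuracy": hits / len(winners),
        "mean_size": mean(sizes),
        "median_size": median(sizes),
    }


# Hypothetical recommended groups for four tenders and their real winners.
groups = [{"A", "B", "C"}, {"D"}, {"E", "F"}, {"A", "G", "H", "I"}]
real_winners = ["B", "X", "E", "A"]
report = evaluate(groups, real_winners)
print(report)
```

Reporting the group size alongside the hit rate makes the trade-off described above visible: looser filters raise the accuracy but also the number of companies recommended.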
The fields available in the company dataset (filters) will strongly depend on the country. In our case study, the filters are the following:
(i) Economic resources to finance the project: $\text{Operating\_Income}_{\text{co.}}$, $\text{EBIT}_{\text{co.}}$, and $\text{EBITDA}_{\text{co.}}$.
(ii) Human resources to do the work: $\text{Employees}_{\text{co.}}$.
(iii) Kind of specialised work which the company can do: NACE2, IAE, SIC, and NAICS.
(iv) Geographical distance between the company's location and the tender's location: $\text{Distance}_{\text{tender-co.}}$. It will be shown that this is a fundamental parameter. Intuitively, proximity has business benefits such as lower costs.
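A filter-based search of this kind can be sketched as follows. The exact filter rules are not reproduced in this text, so the rule forms below are assumptions made only for illustration: a candidate must reach a factor times the forecast company's operating income and employees, share its NACE2 code, and lie within a factor times the forecast company's distance to the tender. The `Company` class, the factor names `f_oi`, `f_e`, and `f_d`, and all values are hypothetical.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Company:
    cif: str
    operating_income: float
    employees: int
    nace2: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def recommend(forecast, companies, tender_lat, tender_lon, f_oi=1.0, f_e=1.0, f_d=1.0):
    """Apply the assumed filter rules and return the recommended CIFs."""
    max_km = f_d * haversine_km(tender_lat, tender_lon, forecast.lat, forecast.lon)
    group = [forecast.cif]  # the forecast winner is always in the group
    for c in companies:
        if c.cif == forecast.cif:
            continue
        if (c.operating_income >= f_oi * forecast.operating_income
                and c.employees >= f_e * forecast.employees
                and c.nace2 == forecast.nace2
                and haversine_km(tender_lat, tender_lon, c.lat, c.lon) <= max_km):
            group.append(c.cif)
    return group


# Hypothetical tender in Madrid and four hypothetical companies.
winner = Company("W", 1_000_000, 20, "41", 39.8628, -4.0273)
market = [
    winner,
    Company("NEAR", 2_000_000, 30, "41", 40.3083, -3.7327),
    Company("FAR", 5_000_000, 80, "41", 41.3874, 2.1686),
    Company("OTHER", 3_000_000, 40, "62", 40.4000, -3.7000),
]
print(recommend(winner, market, 40.4168, -3.7038))
print(recommend(winner, market, 40.4168, -3.7038, f_d=10.0))
```

Raising a factor such as `f_d` widens the search and enlarges the group, which is how the factors modulate the number of recommended companies in this sketch.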

Application of the Bidders Recommender.
The application of the bidders recommender (see Figure 1) by public agencies or potential bidders for a new tender was summarised in Section 1. It has three phases, which are very similar to those of its creation. The first phase (forecasting) is to predict the most probable company to win the tender using the already trained random forest classifier model.
The second phase (aggregation) is exactly the same: add the business fields from the company dataset to the forecast winning company. Finally, the third phase (searching) simply applies the filters (numeric factors) that were previously fixed during the creation, in order to search for the recommended companies.

Experimental Analysis
A real case study from Spain is presented to evaluate the bidders recommender.Section 4.1 summarises the preprocessing of the two data sources: tender dataset and company dataset.Section 4.2 provides a quantitative description of both datasets and their relationship such as the correlation.In Section 4.3, the bidders recommender is applied under two different scenarios with five different settings in each one.Finally, the results are presented and analysed for these ten different tests.

Data Preprocessing.
Data preprocessing of the tender dataset is necessary because the information has not been verified automatically to correct human errors, such as incorrect formatting, wrong values, empty fields, and so on. Data preprocessing can be divided into the following 5 consecutive tasks: extraction, reduction, cleaning, transformation, and filtering. They are described in detail in [10] because the data source and the data preprocessing are the same in both articles. At first, there were 612,090 tenders. After data preprocessing, there were 110,987 tenders.
Data preprocessing of the company dataset is a simple task since the data source is already a database. Therefore, it is not necessary to verify or check the data. The company dataset has 1,353,213 Spanish companies listed.
Finally, the tender dataset has been merged with the company dataset. This relationship is possible thanks to the CIF field (company ID number) which both datasets have. The merged dataset has 102,087 tenders and their respective winner companies. About 8,900 tenders have been lost because the winning company's CIF has not been found. The possible reasons include a foreign company, a wrong CIF value, a winning company's CIF not stored in the database, etc.
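The merge on the CIF field can be sketched as a dictionary lookup that also collects the tenders whose winner is not found. The function name and the `Tender_ID` key are hypothetical, and the records are toy values.

```python
def merge_on_cif(tenders, companies_by_cif):
    """Join each tender with its winner's company record; collect unmatched tenders."""
    merged, lost = [], []
    for tender in tenders:
        company = companies_by_cif.get(tender["CIF_Winner"])
        if company is None:
            lost.append(tender)  # e.g. foreign company or wrong CIF value
        else:
            merged.append({**tender, **company})
    return merged, lost


# Toy records: the second winner's CIF is missing from the company dataset.
toy_tenders = [
    {"Tender_ID": 1, "CIF_Winner": "A1"},
    {"Tender_ID": 2, "CIF_Winner": "ZZ"},
]
toy_companies = {"A1": {"Employees": 12}}
merged_rows, lost_rows = merge_on_cif(toy_tenders, toy_companies)
print(len(merged_rows), len(lost_rows))
```

Keeping the unmatched tenders in a separate list makes it possible to inspect why they were lost instead of silently dropping them.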

Statistical Analysis of the Datasets.
Firstly, the most relevant information of the tender dataset will be explained, quantitatively.Secondly, the company dataset will also be explained, and, finally, the correlations between both datasets will be analysed.
Table 3 shows the quantitative description of the tender dataset: total numbers, means, medians, maximum, percentages, etc. The dataset has 19 fields or variables: 15 announcement fields and 4 award fields.
There are 102,087 tenders from 2014 to 2020 spread across Spain, and any CPV code is possible. Therefore, there is a large number of heterogeneous tenders which will be used in the bidders recommender.
Looking at Table 3, the following issues are observed:
(i) There are a lot of winning companies and tendering organisations. On average, each public procurement agency creates 17.72 tenders and each company wins 4.80 tenders.
(ii) There is a great dispersion of prices (for both Tender_Price and Award_Price) considering the median, the mean, and the maximum. Furthermore, there is a remarkable difference between Tender_Price and Award_Price, looking at the differences between their medians (€12,535. [...] large number of tenders with only one bidder could be a sign of anomaly (collusion, corruption, economic disorder, or others). However, according to the European public reports [90], this ratio is similar to that of other countries, for example, Poland (37.5%), Romania (34%), or the Czech Republic (26.6%).
Table 4 shows the quantitative description of the company dataset. There are 1,353,213 companies, and 61.44% of them are active. The dataset has 23 fields (see the description in Table 2): general information of the company, location, employees, 3 economic indicators (operating income, EBIT, and EBITDA), and 5 systems of classification of economic activities (CNAE, NACE2, IAE, SIC, and NAICS).
Looking at Table 4, the following issues are discussed:
(1) The Spanish companies are small, for 3 reasons. First of all, 91.58% are limited companies (private companies limited by shares). Secondly, the mean number of employees is 11.51 per company. Thirdly, in the year 2018, the median operating income was only €299,130, the median EBIT was only €10,472.40, and the median EBITDA was only €18,733.35.
(2) The highest number of economic fields (operating income, EBIT, and EBITDA) was recorded in the year 2016 (about 700,000 companies), followed by 2015 and then 2017.
(3) The 5 Provinces with the greatest weight add up to 45.38% of the total number of companies. So, the companies are concentrated in certain locations.

Figure 3 shows the frequency histogram of the number of tenders won by the same company. The reader must not confuse this histogram with the number of tenders by received offers (bidders), which is described in Table 3. The most frequent number of tenders won by the same company is 1. This means that about 10,000 companies have won only 1 tender, which is roughly 47% of the total number of winning companies. About 3,800 companies (18%) have won 2 tenders and so on (the trend is decreasing). Therefore, only 53% of companies have won 2 or more tenders. This distribution is important for the bidders recommender. It is more difficult to forecast the winning company successfully if a lot of companies have won only 1 tender because there are no patterns, trends, or relationships between tenders.
Figure 4 shows the relationship between the received offers of bidders for each tender and the underbid (also called discount). Actually, the underbid is the evaluation metric called MdAPE (median absolute percentage error) between the tender price and the award price, which is explained in Section 3.4. The trend is clear: the underbid increases until stabilising at around 35%. Hence, we have quantitatively demonstrated that tenders with more bidders have lower award prices. In other words, the award price is lower in a tender with more competition, and the public procurement agencies will save money. So, the objective of the agencies should be to encourage the participation of companies in order to receive more offers. For this reason, the bidders recommender is a very useful tool for these agencies because they can effectively increase the number of participants in each tender.
To obtain new, relevant information through the variables in the merged dataset (the tender variables plus company variables), the Spearman correlation method was used.Figure 5 shows the Spearman correlation matrix (a symmetric matrix with respect to the diagonal).It is mathematically described in [10], and it is also used for the same purpose.
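For readers unfamiliar with the Spearman coefficient, it can be sketched as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the toy series are hypothetical, and the sketch does not handle constant inputs, for which the coefficient is undefined.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# A monotone but non-linear toy relationship gives a coefficient close to 1.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

Because only the ranks matter, the coefficient is insensitive to the heavy outliers in prices that the datasets contain.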
Looking at Figure 5, the most important correlations are the following:
(1) Tender_Price vs. Award_Price (0.97): this high correlation is in accordance with common sense since high bids are associated with high awards and low bids with low awards.
(2) Type_code vs. Subtype_code (0.77): each type of contract has its associated subtypes of contract.
(3) City_Tender vs. Province_Tender (0.43): the public procurement agency is in a city which belongs to a Province. So, the relationship city-province is always the same.
(4) Underbid vs. Received_Offers (0.54): the underbid (or discount) is the absolute percentage error (APE %) between Tender_Price and Award_Price. When the public procurement agency receives more offers from bidding companies, the underbid is bigger. This important correlation will be explained in detail in the following section.
(5) CPV vs. Duration (0.33): each type of work is usually associated with a temporal range (duration) for its realisation.
(6) CPV vs. CPV_Aggregated (0.99): this correlation is obvious because CPV_Aggregated is the first 2 digits of the CPV number.
(7) Latitude_Tender vs. Latitude_Company (0.57) and Longitude_Tender vs. Longitude_Company (0.55): this means that both locations (tender and company) are close and therefore the distance tender-company will be an input parameter for the bidders recommender.
(8) Employees, Operating_Income_LAY_-0, EBIT_LAY_-0, and EBITDA_LAY_-0 are strongly correlated with each other. Big companies have a lot of employees, and these companies can earn more profits.

Bidders Recommender Validation.
There are two related validations: first, the validation of the classification model (random forest) applied in phase 1 (train and forecast) of the bidders recommender; second, and more importantly, the validation of the bidders recommender results, which is phase 4 (evaluation). The latter checks whether the real winning company is within the recommended companies group.
For validating the classification model, Figure 6 shows three different ratios between the randomly chosen training and testing subsets (train : test, in percentage): 90 : 10, 80 : 20, and 70 : 30. Furthermore, it shows the behaviour of the metrics (accuracy, precision, balanced accuracy, and OOB) for different numbers of trees generated in the random forest classifier. The accuracy for N = 1 ($\text{Accuracy}_{n=1}$) is the most important metric for this study and, in each graph, it stays roughly constant at around 18%, 17%, and 15%, respectively. As expected, the accuracy is lower when the training data percentage decreases.
Hence, the number of trees is not relevant, and the choice of ratio also has a minimal impact. The classifier used in this article is RandomForestClassifier from scikit-learn, a machine learning library for the Python programming language, with 75 trees and a train : test ratio of 80 : 20.
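The configuration just described (75 trees, 80 : 20 split) can be sketched as follows. This is only an illustration under stated assumptions: the features are synthetic data from `make_classification` rather than the tender and company variables, the target has 4 classes instead of the identity of the winning company, and all other hyperparameters are left at the scikit-learn defaults because the text does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged tender/company features (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# 80 : 20 train/test ratio, chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# 75 trees as in the text; oob_score=True also yields the out-of-bag estimate.
clf = RandomForestClassifier(n_estimators=75, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
balanced = balanced_accuracy_score(y_test, y_pred)
oob = clf.oob_score_
print(f"accuracy={accuracy:.3f} balanced={balanced:.3f} oob={oob:.3f}")
```

Repeating the fit for several values of `n_estimators` and `test_size` would reproduce the kind of comparison shown in Figure 6.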
The bidders recommender results were validated over two scenarios with five different setups. In the first scenario, the testing subset is 20% of the dataset, chosen randomly. In the second scenario, the dataset is ordered by tender date and the testing subset is the latest 20%, i.e., the most recent tenders. The second scenario is therefore more appropriate for testing a real search engine. Each scenario has the same five setups (filter settings), from very low (restrictive) filters to very high.
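The difference between the two scenarios can be sketched with pandas. The helper names and the `Date` and `Tender_ID` columns are hypothetical; the only elements taken from the text are the 20% test fraction, the random choice in scenario 1, and the ordering by tender date in scenario 2.

```python
import pandas as pd


def split_random(df, test_frac=0.20, seed=0):
    """Scenario 1: the test subset is drawn at random."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


def split_latest(df, date_col="Date", test_frac=0.20):
    """Scenario 2: the test subset is the most recent tenders."""
    ordered = df.sort_values(date_col)
    cut = len(ordered) - int(round(len(ordered) * test_frac))
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Ten made-up tenders, deliberately stored from newest to oldest.
dates = pd.date_range("2014-01-01", periods=10, freq="180D")[::-1]
tenders = pd.DataFrame({"Tender_ID": range(10), "Date": dates})

train1, test1 = split_random(tenders)
train2, test2 = split_latest(tenders)
print(len(train1), len(test1), len(train2), len(test2))
```

In scenario 2 every training tender predates every test tender, so the model never sees information from the future, which is closer to how a deployed recommender would be used.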
The filters are described in detail in Section 3.5. Basically, there are six factors ($F_{OI}$, $F_{EBIT}$, $F_{EBITDA}$, $F_{E}$, $F_{CEA}$, and $F_{D}$), to which numeric values must be assigned. With two scenarios and five setups each, there are 10 combinations to test the bidders recommender.
The validation of the bidders recommender is shown in Table 5. The evaluation metric to measure the success of the recommender is the accuracy: the percentage of tenders where the winning company is within the recommended companies group. For scenario 1, when N = 1 (only the most probable company is predicted to win the tender), the accuracy is 17.07%. When N = 5 (the 5 most probable companies to win the tender), the accuracy rises to 31.58%. Finally, for each tender the bidders recommender searches for a group of compatible companies, which automatically includes the previous 5 companies. The accuracy then ranges from 33.25% to 38.52%, according to the settings applied. The reason for the increase in accuracy is simple: there are more recommended companies. Consequently, the mean (and median) number of recommended companies is higher. Analogously for scenario 2, $\text{Accuracy}_{n=1} = 10.25\%$, $\text{Accuracy}_{n=5} = 23.12\%$, and $\text{Accuracy}_{n=M}$ ranges from 24.79% to 30.52%. This accuracy is significantly lower than that in scenario 1, which could be for multiple reasons. For example, recent tenders have less business information because the annual accounts of the winning company are published the following year. In particular, the company dataset has no information about operating income, EBIT, and EBITDA in 2019 and 2020 (see Table 4), although there are many tenders in 2019 and 2020 (see Table 3).
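The accuracy for N = 1 and N = 5 can be read as a top-N hit rate over the class probabilities returned by the classifier. The sketch below shows one way to compute it with NumPy; the function name, the four company labels, and the probability matrix are made up for illustration, and in practice the matrix would come from something like the classifier's `predict_proba`.

```python
import numpy as np


def top_n_accuracy(proba, classes, winners, n):
    """Share of tenders whose real winner is among the n most probable companies.

    proba:   array (n_tenders, n_companies) of class probabilities.
    classes: company identifier for each column of proba.
    winners: real winning company of each tender.
    """
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    # Column indices of the n highest probabilities for each tender.
    top_idx = np.argsort(proba, axis=1)[:, ::-1][:, :n]
    top_companies = classes[top_idx]
    hits = (top_companies == np.asarray(winners)[:, None]).any(axis=1)
    return hits.mean()


classes = ["A", "B", "C", "D"]
proba = [
    [0.50, 0.30, 0.15, 0.05],  # real winner A is ranked 1st
    [0.10, 0.60, 0.25, 0.05],  # real winner C is ranked 2nd
    [0.40, 0.35, 0.20, 0.05],  # real winner D is ranked 4th
    [0.05, 0.15, 0.20, 0.60],  # real winner D is ranked 1st
]
winners = ["A", "C", "D", "D"]

acc_n1 = top_n_accuracy(proba, classes, winners, n=1)
acc_n2 = top_n_accuracy(proba, classes, winners, n=2)
print(acc_n1, acc_n2)
```

Enlarging the candidate group can only keep or raise the hit rate, which is why the accuracy grows from N = 1 to N = 5 and again when the filtered companies are added.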
One interesting area of analysis is the size of the companies group generated by the bidders recommender. This recommender will be more efficient if the group is small and the accuracy is high. Figure 7 shows the boxplots, disaggregated by CPV, for scenarios 1 and 2 (medium setup). CPV is the system for classifying the type of work in public contracts. The total mean is very similar in both scenarios: 430.48 potential bidders (median 31) and 430.33 potential bidders (median 33), respectively. The median value, disaggregated by CPV, is usually below 50 companies. However, the mean value of each CPV varies greatly.

Discussion
The main objective is to find and recommend companies for a new tender announcement. However, it is not easy to measure the performance of the bidders recommender: each company is unique and different from the rest, so the searching, comparison, and recommendation of companies is relative (subjective evaluation). Accuracy has been selected as the evaluation metric to measure the performance: the percentage of tenders where the winning company is within the recommended companies group.
Table 5 shows the results of the bidders recommender: the accuracy and the mean and median number of recommended companies over two scenarios with five different setups (very low, low, medium, high, and very high). The main determining factor for good performance is the top 5 forecast companies (called $\text{Accuracy}_{n=5}$). This means that the 5 most probable companies to win a tender can be incorporated into the recommended companies group (called $\text{Accuracy}_{n=M}$). Figure 7 shows the boxplots for the size of the recommended companies group, disaggregated by the type of tender's work (CPV). There are considerable differences in the size, mean, and median values for each CPV. Other interesting analyses would be to disaggregate by geographic regions, business sectors, or markets.
As seen in this article, the bidders recommender depends strongly on the fields of public procurement announcements and the information available to characterise the bidders. Therefore, the recommender cannot be the same for each country, since public procurement systems are not unified or standardised for several reasons: regulations, laws, diverse information systems, different tender criteria, distinct levels of technological maturity in public administration, etc. However, this paper establishes the basis for creating a bidders recommender which can be adapted to each country according to the two basic data sources: tender information and company information. This is because the recommender is an open framework in which other available fields or data sources can easily be added or modified. The selection and optimisation of the recommender's parameters can significantly improve it, although this is a laborious task and particular to each country. In summary, the recommender is an effective tool for society because it enables and increases bidders' participation in tenders with less effort and fewer resources. Furthermore, it will serve to modernise public procurement systems with a new approach based on machine learning methods and data analysis. Thus, the beneficiaries are the government, the citizens, and the two main users:
(1) Public Contracting Agencies. When they publish a tender notice, the algorithm automatically recommends suppliers which have a suitable profile for the tender. The agencies could contact these suppliers directly and invite them to participate if they are really interested in the tender.
(2) Potential Bidders. They will be able to search suitable tenders effortlessly, according to the type of tender and the profile of previous winning companies.

Conclusions and Future Research
The public procurement systems of many countries continue to use the inefficient mechanisms and tools of the 20th century for the publication of tenders and the attraction of offers and bidders. However, more and more new technologies (open data, big data, machine learning, etc.) are emerging in the public administration sector to improve its systems, proceedings, and services. This article demonstrates how it is possible to create new tools using these technologies.
In particular, this paper develops a pioneering algorithm to recommend potential bidders. It is a multidisciplinary system which fills a gap in the literature. The bidders recommender proposed here is a promising and strategic instrument for improving the efficiency of public procurement agencies and should also facilitate access to the tenders for the suppliers. The recommender brings a new perspective to bringing tenders and bidders together. The bidders recommender is described theoretically and also validated experimentally, using a case study from Spain. Two datasets have been used: a tender dataset (102,087 Spanish tenders from 2014 to 2020) and a company dataset (1,353,213 Spanish companies).
The company dataset is difficult to collect because it is nonfree public information in Spain, so it is a valuable dataset. Quantitative, graphical, and statistical descriptions of both datasets have been presented.
The results of the case study have been successful in terms of accuracy: the winning bidding company is within the recommended companies group in 24% to 38% of the tenders. The accuracy range is due to the two test scenarios (the test subset being chosen either from the most recent tenders or at random), each of which has five different settings for the bidders recommender. Hence, the recommender has been validated across 10 test combinations, and the results are successful and promising, opening the research up to other countries and datasets. The main limitation of this research is inherent to the design of the recommender's algorithm, because it necessarily assumes that winning companies will behave as they behaved in the past. Companies and the market are living entities which are continuously changing. On the other hand, only the identity of the winning company is known in the Spanish tender dataset, not the rest of the bidders. Moreover, the fields of the company dataset are very limited. Therefore, there is little knowledge about the profile of the other companies which applied for the tender. In other countries the rest of the bidders may be known, and it would be easy to adapt the bidders recommender to this more favourable situation.
This paper opens the door to future research for creating bidder recommendation systems. In particular, for this recommender, some research can be done to improve it, as follows:
(i) The training and forecasting phase of the algorithm (step 1) to predict the winning company is based on the random forest classifier. Alternative methods of machine learning can be studied to increase the accuracy.
(ii) The aggregation phase (step 2) can use other fields of business information to create the profile of the winning company for the tender.
(iii) The searching phase (step 3) implements basic rules or filters to search similar companies. It would be interesting to explore more sophisticated methods, for example, clustering to group similar companies.
(iv) There is no ranking of recommended companies. This means that the algorithm only recommends companies without any associated probabilities, so the user cannot choose the companies that are most likely to win the tender. This can be solved by applying a voting system or some kind of distance in the searching phase (step 3) of the algorithm.

Data Availability
The processed data used to support the findings of this study are available from the corresponding author upon request.
The raw data from Spain are available at the Ministry of Finance, Spain (open data of Spanish tenders are hosted at http://www.hacienda.gob.es/es-ES/GobiernoAbierto/Datos%20Abiertos/Paginas/licitaciones_plataforma_contratacion.aspx).

Figure 1: Flowchart of the application of the bidders recommender for a new tender.

Algorithm 1: Simplified algorithm of random forest for classification.
(1) For $b = 1$ to $B$ (number of trees):
    (a) Draw a bootstrap sample $Z^{*}$ of size $N$ from the training data.
    (b) Grow a random forest tree $T_b$ to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{\min}$ is reached.
        (i) Select $m$ variables at random from the $p$ variables.
        (ii) Pick the best variable/split point among the $m$.
        (iii) Split the node into two daughter nodes.
(2) Output the ensemble of trees $\{T_b\}_{1}^{B}$.
To make a prediction at a new point $x$, let $\hat{C}_b(x)$ be the class prediction of the $b$-th random forest tree. Then, $\hat{C}_{\mathrm{rf}}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_{1}^{B}$.

(a) $\text{OperatingIncome}_{\text{co.}} \geq F_{OI} \cdot \text{OperatingIncome}_{\text{forecast co.}}$
(b) $\text{EBIT}_{\text{co.}} \geq F_{EBIT} \cdot \text{EBIT}_{\text{forecast co.}}$
(c) $\text{EBITDA}_{\text{co.}} \geq F_{EBITDA} \cdot \text{EBITDA}_{\text{forecast co.}}$
(d) $\text{Employees}_{\text{co.}} \geq F_{E} \cdot \text{Employees}_{\text{forecast co.}}$
(e) $\sum_{i=1}^{C} \mathbb{1}\left[\text{Code}\{\}_{\text{co.}} = \text{Code}\{\}_{\text{forecast co.}}\right] \geq F_{CEA} \cdot C$, where $\mathbb{1}[\text{Code}]$ is the indicator function (returns 1 if the codes match and 0 otherwise), $C$ is the total number of codes of the forecast company, and $\text{Code}\{\}$ is the identification number of the different systems of classifications of economic activities registered by the forecast company: $\text{Code} = \{\text{NACE2}, \text{IAE}, \text{SIC}, \text{NAICS}\}$.
(f) $\text{Distance}_{\text{tender}-\text{co.}} \leq F_{D} \cdot \text{Distance}_{\text{tender}-\text{forecast co.}}$
Therefore, it is necessary to set up the bidders recommender by assigning numeric values to the previous six factors: $F_{OI}$, $F_{EBIT}$, $F_{EBITDA}$, $F_{E}$, $F_{CEA}$, and $F_{D}$. The three economic filters (operating income, EBIT, and EBITDA) are annual values.
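The six filters (a) to (f) can be sketched as a single predicate that compares a candidate company with the forecast company. This is a minimal illustration, not the authors' implementation: the class and field names, the factor values, the activity codes, and the figures in the example are all hypothetical, and activity codes are modelled as a set of (system, code) pairs so that filter (e) becomes a count of shared codes.

```python
from dataclasses import dataclass


@dataclass
class Company:
    operating_income: float        # annual value
    ebit: float                    # annual value
    ebitda: float                  # annual value
    employees: int
    activity_codes: frozenset      # (system, code) pairs, e.g. ("NACE2", "4211")
    distance_to_tender_km: float


@dataclass
class Factors:
    f_oi: float
    f_ebit: float
    f_ebitda: float
    f_e: float
    f_cea: float
    f_d: float


def passes_filters(co: Company, forecast: Company, f: Factors) -> bool:
    """True if `co` satisfies filters (a) to (f) with respect to `forecast`."""
    # Number of the forecast company's codes that `co` also has.
    shared = len(co.activity_codes & forecast.activity_codes)
    return (
        co.operating_income >= f.f_oi * forecast.operating_income               # (a)
        and co.ebit >= f.f_ebit * forecast.ebit                                 # (b)
        and co.ebitda >= f.f_ebitda * forecast.ebitda                           # (c)
        and co.employees >= f.f_e * forecast.employees                          # (d)
        and shared >= f.f_cea * len(forecast.activity_codes)                    # (e)
        and co.distance_to_tender_km <= f.f_d * forecast.distance_to_tender_km  # (f)
    )


# Made-up forecast company, factor values, and candidates.
forecast = Company(1_000_000, 100_000, 150_000, 50,
                   frozenset({("NACE2", "4211"), ("SIC", "1611")}), 40.0)
factors = Factors(f_oi=0.5, f_ebit=0.5, f_ebitda=0.5, f_e=0.5, f_cea=0.5, f_d=2.0)

similar = Company(600_000, 60_000, 80_000, 30,
                  frozenset({("NACE2", "4211")}), 70.0)
too_far = Company(600_000, 60_000, 80_000, 30,
                  frozenset({("NACE2", "4211")}), 120.0)

print(passes_filters(similar, forecast, factors))  # meets all six filters
print(passes_filters(too_far, forecast, factors))  # fails the distance filter (f)
```

Lower values of the first five factors and a higher value of $F_{D}$ admit more companies, so a setup can be tuned to trade the size of the recommended group against the accuracy.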

Figure 2: Flowchart of the creation of the bidders recommender.

Figure 3: Histogram of frequency (number of companies) based on the total number of tenders in the dataset won by the same company (bidder). The graph is divided into two for better visualisation.

Figure 4: Relation between the received offers of bidders and the underbid (median absolute percentage error between tender price and award price).

Figure 5: Correlation matrix between the variables of the two datasets (tenders and companies). Spearman's rank correlation coefficient is the method applied.

Figure 7: Boxplots for the size of the recommended companies group generated by the bidders recommender, disaggregated by CPV. Scenario 1 (blue colour) and scenario 2 (brown colour) both have a medium setup.

Table 1: Most relevant data fields in the Spanish Public Procurement Notices (tenders) used in the dataset.
Status of the tender during the development of the procedure: prior notice, in time, pending adjudication, awarded, resolved, or cancelled. | Not used (similar to Result_code)
Public procurement agency that made the tender: name, identifier (NIF or DIR3), website, address, postal code, city, country, contact name, telephone, fax, e-mail, etc. CCAA is the Autonomous Community, which is a first-level division in Spain. Latitude and longitude have been calculated from the postal code, and they are not official fields in the notice.
CPV (Common Procurement Vocabulary) is a European system for classifying the type of work in public contracts, defined in the Commission Regulation (EC) No 213/2008: http://data.europa.eu/eli/reg/2008/213/oj. The numerical code consists of 8 digits, subdivided into divisions (first 2 digits of the code), groups (first 3 digits), classes (first 4 digits), and categories (first 5 digits). | CPV; CPV_Aggregated (first 2 digits of the CPV number)
Contract type | Type of contract defined by legislation (Law 9/2017): works, services, supplies, public works concession, works concession, public services management, services concession, public sector and private sector collaboration, special administrative, private, patrimonial, or others. | Type_code
Contract subtype | Code to indicate a subtype of contract. If it is a type of service contract: based upon the 2004/18/CE Directive, Annex II. If it is a type of works contract: works contract codes defined by the Spanish DGPE. | Subtype_code
Contract execution place | The place of the contract's execution, through the Nomenclature of Statistical Territorial Units (NUTS), created by Eurostat [47]. | Not used (assumed equal to postalzone)
Type of procedure | Procedure by which the contract was awarded: open, restricted, negotiated with advertising, negotiated without publicity, competitive dialogue, internal rules, derived from framework agreement, project contest, simplified open, association for innovation, derivative of association for innovation, based on a dynamic acquisition system, bidding with negotiation, or others.

Table 2: Data fields in the company's information database.
CIF | CIF (for the Spanish term Certificado de Identificación Fiscal) is the company registration number. This identifier provides formal registration on the company tax system in Spain. In many countries, a company would be issued with a separate VAT number, while in Spain, the CIF also forms the VAT number.
Operating_Income | It measures the amount of profit realised from a business's operations, after deducting operating expenses (cost of goods sold, wages, depreciation, etc.). Value per year. Operating income = gross income − operating expenses = net profit + interest + taxes.
EBIT | Earnings before interest and taxes (EBIT) is a company's net income before interest and income tax expenses have been deducted. It is an indicator of a company's profitability. EBIT can be calculated as revenue minus expenses excluding tax and interest. The most important difference between operating income and EBIT is that EBIT includes any nonoperating income the company generates. Value per year.
CNAE | CNAE (for the Spanish term Clasificación Nacional de Actividades Económicas) is the national classification of economic activities from Spain for statistical purposes. The last version of the CNAE was adopted in 2009 (Royal Decree-Law 475/2007). It is equivalent to the European classification NACE2. It has primary and secondary codes.
NACE | NACE (for the French term Nomenclature statistique des Activités Économiques dans la Communauté Européenne) is the statistical classification of economic activities in the European Community. The current version is revision 2 and was established by Regulation (EC) No 1893/2006. It is the European implementation of the United Nations (UN) classification ISIC (revision 4). There is a correspondence between NACE and ISIC. It has primary and secondary codes.
SIC | … United States (US) but also used by agencies in other countries. In the US, the SIC has been replaced by NAICS, but some US government departments and agencies continued to use SIC codes. It has primary and secondary codes.
NAICS | The North American Industry Classification System (NAICS2017) is a classification of business establishments by type of economic activity (process of production). It has largely replaced the older SIC. It has primary and secondary codes.

Table 3: Quantitative description of the tender dataset.

Table 4: Quantitative description of the company dataset.

Table 5: Testing the bidders recommender for two scenarios: results of the accuracy and number of recommended companies per tender for five different setups.