A Flexible Reinforced Bin Packing Framework with Automatic Slack Selection

Slack-based algorithms are popular bin-focused heuristics for the bin packing problem (BPP). Existing methods select slacks only by predetermined policies, ignoring dynamic exploration of the global data structure, so the information in the data space is not fully utilized. In this paper, we propose a novel slack-based flexible bin packing framework, the reinforced bin packing framework (RBF), for the one-dimensional BPP. RBF jointly considers an RL-system, an instance-eigenvalue mapping process, and a reinforced-MBS strategy. In our work, the slack is generated with a reinforcement learning strategy, in which performance-driven rewards capture the intuition of learning the current state of the container space, the action is the choice of the packing container, and the state is the remaining capacity after packing. During the construction of the slack, an instance-eigenvalue mapping process is designed to generate a representative and classified validation set. Furthermore, the resulting slack coefficient is integrated into the MBS-based packing process. Experimental results show that, in comparison with fit algorithms, MBS, and MBS', RBF achieves state-of-the-art performance on the BINDATA and SCH_WAE datasets. In particular, it outperforms its baselines MBS and MBS', increasing the average number of optimal solutions by 189.05% and 27.41%, respectively.


Introduction
As a classical discrete combinatorial optimization problem [1,2], the bin packing problem (BPP) [3,4] aims to minimize the number of bins used to pack items, and it is NP-hard [5,6].
In the past few decades, four main families of approaches have been extensively studied to solve the BPP: exact approaches [7][8][9], approximation algorithms [4,10], heuristic algorithms, and metaheuristic algorithms [11,12]. Exact algorithms typically exploit lower-bound information for pruning and are suitable for small-scale instances. When the scale of the datasets increases, the BPP becomes challenging for approximation algorithms. Metaheuristic algorithms are difficult to apply because of their rigorous requirements on parameter tuning and their computational complexity [13]. In contrast, heuristic algorithms are popular bin packing methods due to their efficiency on NP-hard problems.
As one of the typical heuristic algorithms, minimum bin slack (MBS) is particularly useful for problems where an optimal solution requires most of the bins, if not all, to be exactly filled [14]. It is also useful for problems where the sum of the item requirements is less than or equal to twice the bin capacity. In MBS, the packing sequence of the items is selected by a predetermined strategy, which ignores the sampling deviation among the items to be packed and cannot explore the global data space. Therefore, the MBS algorithm may quickly fall into local optimal solutions, ignoring the exploration of the global item space during training. In the iterative training stage, the deviation of the locally optimal solution accumulates continuously, and the search drifts steadily away from the global optimal solution space. This may produce a significant difference between the algorithm's packing result and the optimal solution, so the desired performance is not achieved [14].
In order to solve the problems of MBS described above, we propose a reinforced bin packing framework, dubbed RBF, to solve the BPP, where a reinforcement learning (RL) method, i.e., the Q-learning algorithm, is exploited to select a high-quality slack for the packing process. RBF treats Q-learning as a prior detector of data spatial information. To select data samples as representatives of the datasets, it explores the intrinsic spatial distribution of sample bins by interacting with the environment and estimating the optimal slack of the global bins. The learned slacks are finally exploited in the improved MBS algorithm to pack items. The proposed RBF can be distinguished from previous work by the following characteristics: (1) a reinforcement learning algorithm is exploited to generate the slack automatically, which is then integrated into the MBS algorithm; with high-quality slacks obtained automatically, rather than by manual design or empirical speculation, our method prevents the bin packing process from falling into a local optimal solution, which is a quite challenging problem, especially for large-scale datasets. (2) An instance-eigenvalue mapping function is introduced to efficiently select a representative and classified validation set from the input instances based on their similarity. This enables RBF to reduce the learning cost while generating a dynamic slack during the packing process. The rest of this paper is organized as follows. Related work is presented in Section 2. The formulation of the BPP is given in Section 3. In Section 4, we briefly overview the design of RBF and then detail its key components: the RL-system, the reinforced-MBS strategy, and the instance mapping process. Experimental results and theoretical analyses are presented in Section 5. Finally, conclusions are drawn in Section 6.

Exact Approaches.
The exact approaches establish a mathematical model and obtain the optimal solution of the problem by solving that model with optimization algorithms. CPLEX [36] solves the problem with mixed integer programming. Polyakovskiy and M'Hallah [15] characterized the properties of the two-dimensional nonoriented BPP with due dates, which packs a set of rectangular items, and experimentally showed that a tight lower bound improves an existing bound on maximum lateness for 24.07% of the benchmark instances. Since the quality of the solution depends on whether the model is reasonable, exact approaches are only applicable to small-scale instances.
Subsequent improvements focused on reformulating constraints in novel ways. Chitsaz et al. [18] proposed an algorithm that separates the subcontour elimination constraints of fractional solutions to solve production, inventory, and inbound transportation decision problems; the inequalities and separation procedures were used in a branch-and-cut algorithm. A similar idea appears in Mara's work [20], where an exact algorithm based on the classic ϵ-constraint method addresses N single-objective problems by using reduction with test sets instead of an optimizer. Besides, one classic method in this group is the arc-flow formulation method [9], which represents all the patterns in a very compact graph based on an arc-flow formulation with side constraints and can be solved exactly by general-purpose mixed integer programming solvers. Generally, as the scale of the problem grows, the phenomenon of "combinatorial explosion" leads to heavy computational overhead in the optimization process, so it is difficult to apply exact algorithms to large-scale combinatorial optimization problems.

Approximation
Algorithms. Approximation algorithms are popular because their time complexity is polynomial, although they do not guarantee finding the optimal solution. Typical approximation algorithms include greedy algorithms and local search. Based on the observation and arrangement of Earth observation satellites, the authors in [21] proposed an index-based multiobjective local search to solve multiobjective optimization problems. Kang and Park [37] considered the problem of variable-size bin packing and described two greedy algorithms, where the objective was to minimize the total cost of the used bins when the unit cost per bin does not increase as the bin size increases. Moreover, the survey [38] presented an overview of approximation algorithms for the classical BPP and pointed out that, although approximation algorithms are universal, under polynomial time complexity there is always a gap between their solutions and the optimal solution. In short, approximation algorithms are constrained to polynomial time and cannot guarantee the quality of their solutions.

Heuristic Algorithms.
Heuristic algorithms are based on intuitive and empirical design. Several new heuristics for solving the one-dimensional bin packing problem are presented in [39]. Coffman and Garey [10] reviewed various heuristic algorithms, such as NF (Next Fit), FF (First Fit), BF (Best Fit), and WF (Worst Fit) [23]. These are typical online packing algorithms [40,41] and are called fit algorithms. Their corresponding offline packing algorithms are NFD [24], FFD, BFD, and WFD [23], which differ from online packing algorithms in that offline algorithms rely on overall information for sorting. The fit algorithms, for example, FF, WF, and BF, give priority to bins that have already been packed with items, and a new bin is activated only when there is no suitable nonempty bin for the current item. The strategy adopted by the fit algorithms ensures that each arriving item can always find a bin to accommodate it. However, it cannot guarantee that the item is the target item of the optimal solution under the current situation. To address this issue, Gupta and Ho [14] proposed MBS, which centers on bins and tries to find the collection of items that best fills each bin. One problem with this method is that its sequence selection strategy often falls into a local region of the input space, which makes accurate estimation of the slack difficult and may thus yield a locally optimal solution. To solve these problems, several methods have been proposed. Fleszar and Hindi [42] found that an effective hybrid method integrates perturbation MBS' and a good set of lower bounds into variable neighbourhood search (VNS), improving its ability within reasonably short processing times. However, due to the complexity and uncertainty of combinatorial optimization problems, heuristic algorithms that rely on empirical criteria are not always reliable.
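To make the fit family concrete, the following is a minimal sketch of FF and its offline counterpart FFD; the function names and the toy data are illustrative, not taken from the paper:

```python
def first_fit(items, capacity):
    """Pack each arriving item into the first open bin that can hold it."""
    bins = []  # remaining free capacity of each open bin
    for w in items:
        for i, free in enumerate(bins):
            if w <= free:
                bins[i] -= w
                break
        else:
            bins.append(capacity - w)  # no bin fits: activate a new bin
    return len(bins)

def first_fit_decreasing(items, capacity):
    """Offline variant: sort items by decreasing weight before packing."""
    return first_fit(sorted(items, reverse=True), capacity)
```

For example, `first_fit([5, 4, 3, 2, 1], 10)` opens a second bin only when the item of weight 3 no longer fits, yielding 2 bins, which matches the lower bound for a total load of 15.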

Metaheuristic Algorithms.
Metaheuristic algorithms are widely used to find optimal solutions of the BPP. Early typical representatives include genetic algorithms [28] and simulated annealing algorithms [29]. The former is a promising tool for the BPP, and one significant improvement is mainly used: grouping genetic algorithms (GGAs). Dokeroglu and Cosar [43] proposed a set of robust and scalable hybrid parallel algorithms. In GGA-CGT (grouping genetic algorithm with controlled gene transmission) [44], the transmission of the best genes in the chromosomes is promoted while keeping the balance between selection pressure and population diversity. Kucukyilmaz and Kiziloz proposed the island-parallel GGA (IPGGA) in [45], which realizes the choice of communication topology, determines the migration and assimilation strategies, adjusts the migration rate, and exploits diversification technologies. Crainic et al. [46] proposed a two-level tabu search for the three-dimensional BPP that reduces the size of the solution space. Kumar and Raza [47] incorporated the concept of Pareto optimality for the BPP with multiple constraints and proposed a family of solutions along the trade-off surface. However, due to the lack of particle diversity in the later stages of genetic algorithms as well as PSO algorithms, premature convergence always occurs [28].

RL-Based Methods.
Machine learning has been extensively studied in recent years to tackle the NP-hard BPP. Ruben Solozabal's model tackled the BPP with RL. It trained multistacked long short-term memory cells as a recurrent neural network agent that embeds information from the environment; however, once the neural network overhead is introduced, the performance of the model is merely comparable to the FF algorithm. Inspired by Pointer Network [48], deep learning technology was successfully applied to learn and optimize the placing order of items [32], solve the classic TSP [33], and tackle the 3D BPP. These methods utilize RL to prevent the solution from converging to a local optimum, but they exploit neural networks [49] within RL to solve the BPP, which increases the computational cost and time complexity. Heuristic algorithms rely on empirical criteria, consider predetermined strategies, and ignore the dynamic exploration of the global data space of the BPP. RL-based methods can intelligently mine data information from the environmental space through trial and error. This suggests that RL can help existing heuristic algorithms fully explore the effective information in the sample space, which inspired our method.

Formulation of the BPP
The classic one-dimensional BPP is formalized as follows. It is assumed that there are n items to be packed into bins with equal capacity C. The general objective is to find a packing that arranges all items (J_1, J_2, ..., J_n) with the minimum number of bins, of which the formal mathematical description can be defined as

minimize z = Σ_{i=1}^{n} y_i. (1)

Therein, y_i indicates whether the ith bin is used: a value of 1 indicates that the bin is used, and a value of 0 indicates that it is not. Note that once bin B_i is used, the total load of the items placed in B_i cannot exceed the capacity C. Thus, we have

Σ_{j=1}^{n} w_j x_ij ≤ C y_i, i = 1, ..., n, (2)

where w_j is the load of the jth item and x_ij indicates whether the jth item is packed into the ith bin.
In particular, x_ij = 1 if the jth item is placed into the ith bin; otherwise, x_ij = 0. Furthermore, an equally fundamental constraint is that each item is placed into exactly one bin:

Σ_{i=1}^{n} x_ij = 1, j = 1, ..., n. (3)

The detailed explanation of the parameters of this formalization is given in Table 1.
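As a sanity check on this formulation, a candidate packing can be represented by assigning each item j a bin index; the following hypothetical helpers (not part of the paper) verify the capacity constraint and count the bins used. The mapping representation satisfies the one-bin-per-item constraint by construction:

```python
def is_feasible(assignment, weights, capacity):
    """Capacity constraint: the load of every used bin must not exceed C.
    assignment[j] is the bin index of item j, so the constraint that each
    item goes into exactly one bin holds by construction."""
    loads = {}
    for j, i in enumerate(assignment):
        loads[i] = loads.get(i, 0) + weights[j]
    return all(load <= capacity for load in loads.values())

def bins_used(assignment):
    """Objective: z = number of distinct bins holding at least one item."""
    return len(set(assignment))
```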

Design of RBF
In this section, the design of the proposed RBF framework is presented. First, the overview of RBF is outlined; then, its key components, such as the RL-system, the reinforced-MBS strategy, and the instance mapping process, are detailed.

Overview.
The classical MBS algorithm follows two steps: (1) utilize the lexicographic search optimization procedure [14], also referred to as the L algorithm, to find the item set J_j that should be allocated to bin B_i; (2) utilize Step 1 to traverse all items to be packed, where the minimum bin slack is C − Σ_{j=1}^{n} w_j and w_j is the load of the packed jth item. These steps mean that in the classical MBS algorithm the slack is used to jump out of the local optimal trap randomly, while the exact distribution of the sample space is ignored. To resolve the instability of the random slack, a new bin packing framework, RBF, is presented, where the slack is learnable and adjusted according to the samples' structure.
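The two steps above can be sketched as follows. For clarity, this illustrative version replaces the lexicographic L procedure with an exhaustive subset search, which finds the same minimum-slack set on small inputs but is exponential in general; the real MBS prunes via lexicographic ordering:

```python
from itertools import combinations

def min_slack_subset(items, capacity):
    """Exhaustive stand-in for the lexicographic L procedure: among all
    subsets that fit in one bin, return one with minimum slack C - load."""
    best, best_slack = (), capacity
    for r in range(1, len(items) + 1):
        for combo in combinations(range(len(items)), r):
            load = sum(items[k] for k in combo)
            if load <= capacity and capacity - load < best_slack:
                best, best_slack = combo, capacity - load
    return best, best_slack

def mbs(items, capacity):
    """Repeatedly fill one bin with a minimum-slack subset of the items."""
    remaining = list(items)
    bins = 0
    while remaining:
        chosen, _ = min_slack_subset(remaining, capacity)
        for k in sorted(chosen, reverse=True):
            remaining.pop(k)
        bins += 1
    return bins
```

For instance, with items [5, 5, 4, 3, 3] and capacity 10, the search fills {5, 5} with zero slack, then {4, 3, 3}, using 2 bins.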
The framework of RBF is illustrated in Figure 1. It consists of an RL-system, a reinforced-MBS strategy, and an instance-eigenvalue mapping process, defined as follows: (1) RL-system: the RL-system generates a suitable slack with a reinforcement learning strategy, where the best action selection strategy is controlled by the Q-agent. (2) Reinforced-MBS strategy: with the slack coefficient provided by the RL-system, the reinforced-MBS strategy carries out the packing process. (3) Instance-eigenvalue mapping: instead of using the whole dataset directly, the instance-eigenvalue mapping generates the representative and classified validation set for the RL-system based on the similarity of the input instances. The main idea of RBF is to utilize the RL-system to learn the slack according to the spatial variation of the sample dataset, so that the slack adapts to the distribution of bins and the remaining items in the data space during the iterative packing process. With the instance-eigenvalue mapping, the representative and classified validation set of the input instances is generated. The validation set is then fed into the RL-system, where an adaptive slack is generated by the Q-agent. The slack coefficient is finally applied in the reinforced-MBS strategy for the packing process.

Instance-Eigenvalue Mapping.
To reduce the amount of calculation for the slack, representative items are selected for the Q-agent, which can then learn the data space without traversing all instances. Here, an instance classification method, called instance-eigenvalue mapping, is proposed and defined as follows, where x is a given instance, x̄_i is the average value of the items in the ith instance of the dataset, x_i^min and x_i^max, respectively, represent the minimum and maximum values of the items in the ith instance, and y_i^x denotes the instance eigenvalue of the ith instance.
According to the value of the instance eigenvalue, all instances are reordered, and the dataset U is divided into K different subsets U_1, U_2, ..., U_K. The last instance of each subset is taken to form a validation set, which is then utilized to iteratively learn the slack. Therefore, at each time step t in RBF, the Q-agent uses the validation set instead of the whole instance set, reducing the repetitive work of the system.
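A sketch of this validation-set construction might look as follows. Since the paper's exact eigenvalue formula was not preserved in this text, the `instance_eigenvalue` below is only one plausible reading (the mean's position within the instance's min-max range, built from the quantities named above); the split into K subsets follows the description directly:

```python
def instance_eigenvalue(instance):
    """Hypothetical eigenvalue: mean of the items scaled into the
    instance's own min-max range. The paper's exact formula may differ."""
    lo, hi = min(instance), max(instance)
    mean = sum(instance) / len(instance)
    return (mean - lo) / (hi - lo) if hi > lo else 0.0

def build_validation_set(dataset, k):
    """Sort instances by eigenvalue, split them into k subsets, and take
    the last instance of each subset as the validation set."""
    ordered = sorted(dataset, key=instance_eigenvalue)
    size = -(-len(ordered) // k)  # ceiling division: subset size
    return [ordered[min(s + size, len(ordered)) - 1]
            for s in range(0, len(ordered), size)]
```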

RL-System.
The validation set is integrated into the RL-system with a Q-learning algorithm [50], where the Q-agent learns an appropriate strategy and then improves the MBS strategy by selecting a high-quality slack. The process of the RL-system can be described as a Markov decision process (MDP), represented as a tuple (S, A, P, R, γ). In the MDP decision-making process, S is the state set, A is the action set, P is the transition probability between states, R is the return obtained after taking an action to reach the next state, and γ is the discount factor. To adapt to the packing circumstances, for example, the current distribution of containers and the remaining items, we propose a slack learning algorithm; the detailed process is shown in Algorithm 1, and the parameters are illustrated in Table 2. By observing the current state S_t of the environment, the Q-agent selects the action A_t that maximizes the value of the reward function R_t according to the observed state. As the Q-agent continually interacts with the environment, a suitable data selection strategy for the slack coefficient is explored. In each packing iteration, the algorithm returns both a reward r and a new state S_{t+1} to the Q-agent, where the change of states depends on the state transition probability p(S_{t+1} | S_t, A_t). The agent receives the performance-driven reward R_t, and the sum of discounted rewards at time step t is represented as

G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}.

Therein, γ ∈ [0, 1) weighs the future reward within the discounted sum: the closer γ is to 0, the more the agent favors short-term benefits; the closer γ is to 1, the more it favors long-term benefits. The goal of the Q-agent at each time step t is to select an action A_t that maximizes the future discounted reward G_t by finding an optimal policy π*. Here, π* is the strategy of taking the optimal action A_t at state S_t, while π is the strategy of taking action A_t at state S_t.
Under the policy π, Q^π(s, a) is defined as the expectation of the state-action value function. When the agent takes A_t at S_t, Q^π(s, a) is represented as

Q^π(s, a) = E_π[G_t | S_t = s, A_t = a],

where E_π is the expectation under π. The maximum state-action value function over all policies is

Q*(s, a) = max_π Q^π(s, a).

The update rule of the Q(S_t, A_t) value is

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)],

where α ∈ (0, 1) is the learning rate of the RL agent. At each time step t, the Q-agent observes the current state S_t and selects the action A_t from a discrete set of behaviors A = {1, 2, ..., k}, where k equals the number of items to be packed. At the beginning, the action A_t is randomly initialized; that is, the action corresponding to a random number between 1 and k is selected. Then, the RL-system selects the action that maximizes the Q(S_t, a) value at each time step t:

A_t = argmax_a Q(S_t, a).

The agent uses a greedy learning strategy [51] to choose actions: it selects actions according to the optimal value of the Q table with probability 1 − θ and selects randomly with probability θ. The state is the remaining capacity of the bin after each round of packing. At each time step t, the remaining items are preferably packed into as few bins as possible. When the bin is exactly full, the agent is given a reward R_t; if the bin overflows, the agent is punished severely and told that such a state is not allowed. The slack ϵ is defined in terms of a constant d, the immediate reward r achieved by the Q-agent, and an initial value a used in the first iteration. By returning the reward value R_t, the slack ϵ is adjusted accordingly in each packing round; it varies within a range as the reward value R_t changes. Ultimately, the environment for the new round is updated to the bin capacity C minus the slack ϵ.
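The tabular update rule and the θ-greedy selection described above can be sketched with a plain dictionary Q-table; the state encoding here is illustrative (in RBF the state is the remaining bin capacity):

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def choose_action(Q, s, actions, theta=0.1, rng=random):
    """theta-greedy selection: exploit the best Q value with probability
    1 - theta, explore a uniformly random action with probability theta."""
    if rng.random() < theta:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

With an empty table, one update for (s=0, a=1, r=1.0) sets Q[(0, 1)] to alpha * r = 0.1, after which greedy selection at state 0 returns action 1.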
The Q-agent captures this intuition through the performance-driven reward R_t. At each time step t, the agent's reward is defined in terms of the following quantities: count(n) is the number of bins that are exactly filled, count(C − ϵ) is the number of bins that are filled up to the slack space, α is the weight coefficient of the positive reward, Count(m) is the number of overflowing bins, β is the punishment coefficient of the negative reward (less than 0), and ι is a constant regulating the value of the entire reward function R_t.
The slack coefficient ω is the slack parameter learned by the agent through RL; it is calculated as ω = (C − ϵ)/C by minimizing the number of bins used on the validation set. Then, the slack coefficient is passed into our reinforced-MBS algorithm, whose idea is detailed in Algorithm 2.
In Algorithm 2, the improved L lexicographic search procedure is utilized to find the set S_k of items that should be assigned to bin B_k during the iterative process. The improved L procedure is shown in Algorithm 3.

Experimental Evaluation
In this section, experiments are carried out to verify the effectiveness and robustness of the proposed RBF. First, the experimental evaluation indexes are introduced, and then the datasets and experimental results are presented.

(Algorithm 1, slack learning. Input: training data itemList with n items, container list BinList with capacity C, remaining capacity ResidualCapacity of the bin, learning rate α, discount factor γ, and iteration number MAX-EPISODE. Initialize the Q-table; for each episode, set S_t = 0 and initialize the container list BinList[1, ..., n]; while not terminated, select an action A_t according to state S_t and the Q-table, calculate the immediate reward R_{t+1}, obtain the next state S_{t+1}, and update the slack ϵ.)

Competition Ratio.
The competition ratio is defined as CR_t = SOL/OPT, where SOL represents the number of bins used by the concrete algorithm and OPT is the number of bins in the optimal solution for the packing instance. A competition ratio equal to 1 means that the algorithm has found the optimal solution. Generally, OPT(σ) has a lower limit, as shown in formula (14), where ⌈⌉ is the ceiling function: due to the packing constraints, the number of bins used in each packing iteration cannot be less than the ratio of the total load of the items to the capacity of a single bin,

OPT(σ) ≥ ⌈Σ_{j=1}^{n} w_j / C⌉. (14)
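Under these definitions, the competition ratio and the lower bound on OPT can be computed as follows (a minimal sketch; the function names are illustrative):

```python
import math

def competition_ratio(sol, opt):
    """CR = SOL / OPT; equal to 1 when the optimal packing is found."""
    return sol / opt

def opt_lower_bound(weights, capacity):
    """OPT can never be below ceil(total item load / bin capacity)."""
    return math.ceil(sum(weights) / capacity)
```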

FSOL.
For a dataset, FSOL represents the number of instances for which the algorithm achieves a feasible optimal solution, in other words, the number of instances whose CR_t is 1. For a specific algorithm alg and dataset data, FSOL is written as FSOL_alg(data).

Realization Rate.
Realization rate (RT) is defined by formula (15) as the proportion of optimally solved instances, RT = FSOL/INS, where INS is the number of instances in the packing dataset.

Gap.
Gap refers to the deviation between the number of bins used by the algorithm and the optimal number for the packing. The relative Gap is exploited to evaluate the performance of the algorithms and is calculated as Gap = (SOL − OPT)/OPT. The BINDATA [54] and SCH_WAE [55] datasets are used in the experiments for evaluation. Therein, the BINDATA dataset includes three subsets: Bin1data, Bin2data, and Bin3data. The details of the datasets, such as the number of instances, the weights of the items, the capacity of the bins, and the number of items per instance, are shown in Table 3.
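Taking the natural readings of the RT and Gap definitions above (RT as the fraction of optimally solved instances, Gap as the relative deviation; the paper may report these as percentages), they can be computed as:

```python
def realization_rate(fsol, ins):
    """RT: fraction of the dataset's instances solved optimally."""
    return fsol / ins

def relative_gap(sol, opt):
    """Relative Gap: deviation of the achieved bin count from the optimum."""
    return (sol - opt) / opt
```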

Experimental Results and Analysis.
The performance of RBF is compared with that of the classical fit algorithms, the MBS algorithm, and the MBS' algorithm on the BINDATA and SCH_WAE datasets shown in Table 3. For each instance of each dataset, the number of items in each category is the same. The experimental results reported in this paper are averages of ten runs per hyperparameter setting. Table 4 lists the FSOL, RT, and CR_t results of the compared algorithms on BINDATA, while Table 5 lists the Gap values. In comparison with the classical heuristic algorithms, such as NFD, FFD, WFD, AWFD, BFD, MBS, and MBS', RBF obtains the maximum FSOL on BINDATA, while its CR_t and Gap are the minimum. Furthermore, the improvement of FSOL_RBF, represented as IMP_RBF(alg) and defined by formula (17), is further calculated, where alg ∈ {MBS, MBS'} and data ∈ {Bin1data, Bin2data}. In particular, for Bin1data and Bin2data, IMP_RBF(MBS) is 165.08% and 179.2%, respectively, while IMP_RBF(MBS') is 5.53% and 41.3%, respectively. For the dataset Bin3data, RBF is the only method that obtains 2 optimal solutions out of the total 10 cases, while the others obtain zero optimal solutions.
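On its face, the improvement measure of formula (17) is the relative increase in the number of optimal solutions over a baseline; a hypothetical helper (the exact formula in the paper may differ):

```python
def improvement(fsol_rbf, fsol_baseline):
    """IMP_RBF(alg): percentage increase of FSOL over a baseline algorithm."""
    return (fsol_rbf - fsol_baseline) / fsol_baseline * 100.0
```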

Robustness and Stability.
The construction of the validation set is a key procedure of RBF. This experiment verifies the validity of the eigenvalue mapping function on Bin1data. Since 10 instances of Bin1data are selected by the eigenvalue mapping function to form the validation set, different selection policies are applied for comparison in the packing process: the first policy selects the first 10 instances of Bin1data, the second selects the last 10 instances, and the third selects 10 random instances to form the validation set. The packing results with the different selection policies are presented in Table 10. It can be seen that the slack learned by the Q-agent differs across selection policies. In particular, with the selection policy of the eigenvalue mapping function, RBF achieves the maximum FSOL and RT and the minimum CR_t and Gap. The results verify the validity of the eigenvalue mapping function, which helps RBF achieve better performance.

Conclusion and Future Work
In this paper, we propose the reinforced bin packing framework (RBF) to tackle the one-dimensional BPP. The proposed RBF consists of three main components: the RL-system, the instance-eigenvalue mapping process, and the reinforced-MBS strategy. The RL-system automatically constructs a slack selection policy, with the Q-agent selecting high-quality slacks for the heuristic algorithm integrated in RBF. The instance-eigenvalue mapping process generates the representative and classified validation set based on the similarity of the input instances, which greatly reduces the computational overhead and improves the generalization performance of the model. Finally, with the slack coefficient provided by the RL-system, the reinforced-MBS strategy carries out the packing process. We evaluate our models on BPP tasks, where RBF exhibits excellent packing ability, and the experimental results validate its superior performance compared to state-of-the-art proposals on the BINDATA and SCH_WAE datasets. Compared to its baseline methods, MBS and MBS', the average number of optimal solutions achieved by RBF increases by 189.05% and 27.41%, respectively. For future work, we plan to investigate slack selection policies and new mechanisms to learn them automatically. We also foresee extending our method to more complex multiagent reinforcement learning frameworks, where exploiting new aspects of the multiagent communication environment is crucial to boost packing performance.
Data Availability

The datasets used in this paper are one-dimensional bin packing datasets, the BINDATA and SCH_WAE datasets, which can be found at http://people.brunel.ac.uk/∼mastjjb/jeb/orlib/binpackinfo.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.