Tool-Body Assimilation Model Based on Body Babbling and Neurodynamical System

We propose the newmethod of tool use with a tool-body assimilationmodel based on body babbling and a neurodynamical system for robots to use tools. Almost all existing studies for robots to use tools require predeterminedmotions and tool features; themotion patterns are limited and the robots cannot use novel tools. Other studies fully search for all available parameters for novel tools, but this leads to massive amounts of calculations. To solve these problems, we took the following approach: we used a humanoid robot model to generate randommotions based on human body babbling.These richmotion experiences were used to train recurrent and deep neural networks for modeling a body image. Tool features were self-organized in parametric bias, modulating the body image according to the tool in use. Finally, we designed a neural network for the robot to generate motion only from the target image. Experiments were conducted with multiple tools for manipulating a cylindrical target object. The results show that the tool-body assimilation model is capable of motion generation.


Introduction
Humans are capable of expanding their ability by using tools.Robots are expected to become more useful to society through the use of tools.With the development of robotics technology, robots have become very complex, with increasing numbers of applicable sensors and degrees of freedom.Therefore, complicated calculations are required to build conventional robot tool use models.Modeling robot tool use based on human cognitive development has been proposed as an approach to mitigate this problem [1].Among the many factors of human cognitive development, tool-body assimilation, studied in the field of neuropsychology, has begun to gather attention [2].Tool-body assimilation occurs when humans use a tool and treat it as an expansion of their own bodies.Iriki et al. recorded neurons called "bimodal neurons" before and after monkeys were trained to use a tool.Bimodal neurons respond both to tactile stimulation on the hand and to visual stimulation.Through tool use training, the visual receptive field of bimodal neurons expands from the monkey's hand to the surroundings of the grasping tool.This result shows that tool-body assimilation occurs at the neuron level.We aim to achieve robot tool use by this approach; by modeling tool-body assimilation, it is possible to alter the behavior of a posttrained body model by adding new neurons and expressing a "body model that is using a tool." Tool use with tool-body assimilation is also gathering attention in the field of robotics.Nabeshima et al. [3] used visual and touch stimuli to connect bodily and sensory information.After training the relationships between visual and touch stimuli, dynamic touch was performed to predict the inertial parameters of the tool.The resulting simulation model allowed the robot to perform a pulling task with a target object located in an invisible area.However, the inertial parameters used as tool features were determined in advance, and the model was incapable of adapting to nonrigid bodies.Hikita et al. [4] treated tools as an expansion of the robot's body in the form of saliency maps, basing their research on the experimental results of Iriki et al.However, no actual motion was generated.
There are various studies regarding tool manipulation, which is limited to not only focusing on tool-body assimilation, but also affordance.Affordance is the value that the environment provides to animals [5].For example, a solid horizontal plane at knee height affords sitting.In the field of robotics, affordance research is divided into two approaches.One is the use of human affordance knowledge, while the second is the acquisition of affordance by the robot itself.Montesano et al. [6], Saxena et al. [7], and Song et al. [8] took the former approach, while Stoytchev [9], Detry et al. [10], and Tikhanoff et al. [11] adopted the latter approach.Montesano et al. trained a Bayesian network by giving the network information about the relations between actions and the resulting effects.Saxena et al. taught a robot where to grasp during tool use and generalized the model by categorizing unknown tools with similar shapes, allowing the grasping of unknown tools.Song et al. succeeded in training a model that grasps differently for different objectives.For example, this allows the robot to hold a cup from its side instead of grasping the top; open side of the cup when drinking is set as the objective.However, because it is very difficult for humans to directly teach a robot about the affordance of the environment, it is unrealistic to use this approach to teach about various numbers of tools.In addition, depending on the robot's bodily structure, it may be impossible to execute actions based on the trained affordance.In the study by Stoytchev, a robot learned the relationships between movements, tool shapes, and target objects by generating motions based on predetermined motion sets.Each tool was recognized by its respective colors, and the target object was moved to the target position by referring to the learned relationships.However, although the robot successfully learned the affordance of the tools, the generated motions were limited to the predetermined motions used during training.Detry et al. succeeded in achieving almost all possible graspable parts of various tools by testing almost all graspable positions.However, this is an unrealistic approach because large amounts of time will be required when this is done under real conditions, and the resulting model will not be able to adapt to unknown tools.Tikhanoff et al. performed object pulling tasks and taught pulling motions to the robot.In this case, the robot decided what tool and which part of the tool to use, based on the position of the target object.Because humans define the decision algorithm by using mathematical equations, appropriate modeling is required for new tasks.
The concerns described above are summarized as follows: (1) Limited motion owing to predetermined motion patterns, (2) Inability to adapt to new tools owing to predetermined tool and motion features, (3) Large calculations required in generating target motions owing to the need to fully search all available parameters.
To overcome these concerns, this research will undertake the following approach with tool-body assimilation: (1) body babbling with a humanoid model, (2) recurrent neural network and deep neural networks, (3) Motion generation by minimizing the error of the final state predicted from initial states.
Our purpose is to suggest the new method for tool use by robots by applying the concept of tool-body assimilation in cognitive science with body babbling and a neurodynamical system rather than modeling human's cognitive mechanism.First we will describe the first step of the approach.Many past studies have built motion models in advance for motion generation and used teaching data that are designed by humans to train robots.This approach can adapt to conditions and tool functions that are known.However, the approach cannot adapt to unknown conditions.In addition, the creation of the teaching data is strongly influenced by the creator's knowledge and intentions, causing bias in the data.Because of this, it is difficult to effectively make use of the bodily constraints of the robot.It is hypothesized that the body has a significant influence on the development of intelligence [12].During the early days of infants, the relationship between motion and the nervous system has yet to be modeled; thus, movements and interactions between the body and the environment cannot be reproduced.The infant learns the reproducible motions and the accompanying sensory information from this interaction.As the infant learns, passive motions are replaced with autonomous motions, and as the infant interacts with the environment, relations between primitive movements and sensations are learned.By learning these relationships, various information is gained, such as the postures of the body that are easy to move.Even in the case of tool use, knowledge related to the tool is earned and generalized through trial and error [13].In the field of robotics, a phenomenon called body babbling, which is one of the steps in the infant development process, has recently began to gather attention [14][15][16][17].Humans are capable of unlimited numbers of movement patterns due to having bodies with many degrees of freedom.However, the actual movement patterns that are used are limited owing to the constraints of the body structure, causing some degrees of freedom to be easier to move than others.In the case of tool use, there are unlimited ways to use tools.However, the actual use of a tool is limited to very few methods because tool use is also influenced by bodily constraints.Thus, it can be said that the way humans use tools has a strong relationship with the body [18].When considering a model with humanlike superior tool use capabilities, by using a body model that is similar to a human body, it is expected that the model will generate a human-like motion.In addition, considering human cognitive development, many trials of movements must be performed, and a body that is able to withstand these trials is needed.If this is done with real robots, the robot will break down.Therefore, the body babbling will be performed by a robot in a simulator.
We will now describe the second step of the approach.Past efforts in robotics have required the model of robot movement, object manipulation, and features to be designed in advance.This method is highly effective in fixed environments such as factory production management.However, when considering robots that work in changing environments such as human living spaces, it is unrealistic to assume that the environment does not change.It is possible to partially overcome this problem by predicting the changes in the environment.However, the model would still not be able to adapt to unpredictable environmental changes.To overcome this issue, we took an approach in which the robot autonomously acquires features of data and learns the relationships between input and output from the sensory data that do not require preset features.With this approach, the model can adapt to new situations.Specifically, in this research, tool and motion features in the form of image features from camera data are self-organized by using deep neural networks (DNNs) with an autoencoder.The autoencoder can generate feature values automatically.This is done by training the autoencoder to give output values that are equal to the input values.The relations between the robot motion and the earned features are used to train a multiple time-scale recurrent neural network (MTRNN).With these two methods, it is possible to build a model that allows the robot to adapt to environmental changes and its own movements without predetermined information.In addition, as a characteristic of neural networks, these can estimate the output even when nontraining data are given based on the experiences that the neural networks have gained from training data.
We will now describe the third step of the approach.In the early days of infants, they tend to predict tool functions by using dynamic touch [5].Dynamic touch refers to the movement of the body to acquire the characteristics of an object by moving the object [13].Through tool manipulation experiences, humans tend to learn to use tool with toolbody assimilation.Nishide et al. [19] used a neural network model to build a tool-body assimilation model with dynamic touch.Because the tool features were self-organized with no settings determined in advance, the model is also applicable to untrained tools.The model predicted the features of the new tool, updated the parametric bias (PB) nodes connected to the neural network, and performed object pulling tasks.In their study, dynamic touch was required to differentiate between different tools.Michaels et al. 's experimental results imply that human estimates the tool function from the shape of tool as an overlay of tool use experiences [20].If robots can also estimate tool function from the shape of tool, it should be possible to estimate tool function from the shape of unknown tools.Unlike Nishide et al. 's approach, our approach is to perform the estimation of the tool function from the image of the tool.Moreover, if the robot estimates the tool function from target state and generates motion with final states close to the target state, it will be useful because there is no necessity to teach the process of motion.In Nishide et al. 's study it is necessary to search motion parameters to generate target motion before motion generation.This results in large calculations.We further developed this model, allowing the model to predict tool functions from tool shapes, and designed the model to generate motion by minimizing errors between the final goal state and the final state predicted by the MTRNN.
The rest of this paper is composed as follows.In Section 2, the DNNs are discussed.In Section 3, an overview of the tool-body assimilation model is provided.In Section 4, the experimental setup is presented.In Section 5, the experimental results are given.In Section 6, we present our conclusions and describe future studies.

Deep Neural Networks
DNNs are multilayer neural networks, usually with five to 10 layers.DNNs allow highly precise image and speech recognition, so they have attracted attention in recent years [21,22].It is possible to reduce the dimensionality of data and to automatically extract features without predetermined information by using an autoencoder with a sand-glass-type DNN.It is possible to extract features from a hidden layer by substituting the unknown data in a trained DNN.DNNs are used in field of image recognition and speech recognition; however, they are not often used in the field of robotics [23,24].This is because for training DNNs require enormous, multidimensional training data.It is time consuming to acquire training data with robots, causing robots to break owing to use for a long time.To overcome these problems, we used a robot model in a simulator to perform random movements in the form of body babbling.This will be described in detail in Section 3, describing the tool-body assimilation model.

Training of Deep Neural Networks.
A recent advance in learning for deep networks is to use layer-wise unsupervised pretraining methods [25].Applying these methods before running fine-tuning methods overcome the difficulties related to deep learning.Martens [26] proposed the approach that developed a 2nd-order optimization method based on the Hessian free approach without using pretraining.This method is easy to use, effective, and efficient in training.Therefore, we adapted the learning method proposed by Martens.

Hessian-Free Optimization.
Newton's method is the canonical 2nd-order optimization.The method is iteratively updates of the parameters  ∈ R  of an objective function  by computing search vectors  and updating  as  +  for some .The main idea of Newton's method is that  can be locally approximated around each , up to the 2nd-order, calculated as where  = () is the Hessian matrix of  at .However,  can be indefinite so that (1) may not have a minimum.Moreover, if the values of  are large, the approximation of  becomes inaccurate.Therefore,  uses damping parameter  for calculation to readjust  =  +  for some constant  ≥ 0.
In standard Newton's method,   () is optimized by computing the  ×  matrix  and then solving the system  = ∇() ⊤ .However, the calculation costs are very expensive when  is large even with modestly sized neural networks.Instead of this calculation, Martens develops the basis of the 2nd-order optimization approach with the linear conjugate gradient algorithm (CG) [26].This optimizing quadratic objectives require only matrix vector products with .Details about the implementation are provided in several other studies [26][27][28].

Competition with
Self-Organizing Map.Some studies have used a self-organizing map (SOM) for the extraction of features [19,29].Arie et al. proved that SOM is the high compatibility with MTRNN.SOM is an unsupervised learning neural network proposed by Kohonen [30].It is composed of input and output neurons.The neurons in the output layers are two-dimensionally arranged and possess weight vectors, .The weight vectors are set to have the same dimensions () as the input vector, V.The image features are defined by the following formula: where  is the dimension of the SOM and  ∈ .
It is possible to reduce the dimensionality of data by using an SOM.However, if there are many motion patterns for the robot, it is difficult to extract features using an SOM.Therefore, we used an autoencoder with the DNN for the extraction of features.This will be discussed in detail in Section 5, which describes the experimental results.

Tool-Body Assimilation Model
In this section, we provide an overview of the tool-body assimilation process.This process consists of three phases: (1) learning of the body model with random movement based on body babbling by a humanoid robot, (2) learning of tool dynamics features, and (3) generation of motion by using only a target image.Figure 1 shows an overview of the model.The model consists of three modules: Figure 2 shows the learning process of tool-body assimilation.First, body babbling is performed with a bare hand, producing motor data and camera images.The image features are extracted with the DNN.Next, the MTRNN is trained by using these motor data and image features.Upon training, the MTRNN learns the body model of the robot.Next, by training only the PB nodes with different tool images, it is possible for the robot to learn the tool features without changing the body model.PB neurons are capable of learning how to modulate the body depending on the tool being used.This combination of MTRNN and PB nodes thus expresses "body model that uses tool." The number of PB nodes is much fewer than that needed for the body model (in this study, the number of PB nodes is five and body model requires 98 nodes), and the PB nodes represent the tool characteristics.Because the tool is treated as a part of the body, we expect the robot to use tools based on the experience of how to move the body.PB nodes do not represent the complicated methods of tool use itself but only show how to modulate the original body model.Therefore, even if the number of tools is increased, the number of neurons of PB nodes will still be fewer than in the body model.The time to learn the PB nodes is shorter than that needed for the body model.However, we suspect that the proposed model cannot learn all possible types of tools.This is mainly because of the difficulties in representing tools that cannot be treated as a modulation of the body.For example, tools with rotating motions such as a mixer will be a challenging tool to use.

Learning of Body Model Based on Body
Babbling by a Humanoid Robot.In this phase, the robot, possessing a humanoid robot model, performs body babbling with a target object in its bare hand to learn its own body model.Figure 2(a) shows the learning process of body model.Firstly, the robot performs body babbling using bare hand, and motor data and camera images are produced.Then image  features are extracted by the DNN.The relationship between motor and image features is learned by the MTRNN; that is, the robot learns its own body model.Body babbling is a behavior observed in infants.Body babbling is considered as an explorative motion.Such motions that are related to human's intrinsic aspects such as motivation and preference are phenomena that are difficult to model.Hence, we modeled this as random motions in a similar way to the previous cognitive robotics studies [14][15][16][17].By using the concept of body babbling, predetermined parameters are not required for performing motion by a robot.Body babbling requires numerous numbers of movements.Simulated experiments are effective, because it is difficult to perform many trials with real robots.During body babbling, sequential images and motor data are acquired.The features are extracted from the images by an image feature extraction function.
To adapt to unknown situations, the robot should have the ability to extract image features by itself.To achieve this, we propose the use of an autoencoder with the DNN extraction of image feature.We describe DNN in the next section.In this research, the input data to the DNN consists of the raw image pixels from the robot model's camera.
For the robot's body model, we implemented the MTRNN proposed by Yamashita and Tani [31].The MTRNN is a kind of recurrent neural network (RNN) [32], which can predict the next state, IO( + 1), given the current state, IO().This MTRNN is capable of learning multiple sequential data.The MTRNN is composed of three types of neurons: fast context (  ) nodes, slow context (  ) nodes, and input-output (IO) nodes.Each type of node has a different time constant, representing the firing rate of the nodes.The faster (  ) nodes learn the movement primitives of the data, whereas the slower (  ) nodes learn the sequence of the data (Figure 3).The structure of the MTRNN proposed in this research is shown in Figure 4.During the learning of the MTRNN, the back propagation through time (BPTT) algorithm [33] is applied.
First, the output neurons are calculated by forward calculation.The internal value of the th neuron at step ,   (), is calculated as where   is the time constant of the th neuron,   () is the input value of the th neuron from the th neuron,   is the weight value of the th neuron from the th neuron, and  is the number of neurons connecting to the th neuron.The output of the   () is calculated by multiplying the internal value and the sigmoid function: The input value is calculated as where   () is the teacher signal and  is the feedback rate (0 ≤  ≤ 1).The input value is calculated by adding the output value of the previous step to the teacher signal.This is done to avoid a diverging error during training.The inputs of the context layer are the outputs of the previous step.The weight value is updated using the training error.The training error is calculated as The weight from the th neurons to the th neurons is updated with the training error /  : where  is the training coefficient and  is the number of updates.
The initial value of   ,   (0), is also calculated by the BPTT algorithm: After training, (0) represents the parameters of each learned sequence.Each sequence can be recovered by substituting the (0) values into the MTRNN.By applying the BPTT algorithm to   (0) with the fixed weights of MTRNN, it can be used as a recognition unit.After training, the PB space that represents the tool dynamic features is formed.In other words, PB nodes are able to learn the visual changes resulting from differences in tool types.Then, the PB node applies bias to the body model according to the tool dynamic features that it learns, changing the body model's dynamics accordingly.This means that there is no need to retrain the robot's body model when a new tool is introduced.The value of the PB node is calculated by using the same method as for   (0), and the weights from the PB nodes to other nodes are updated in the same manner as for other weights of the MTRNN.

Generation of Motion from Goal Image.
In this phase, firstly, the initial image and joint angle are provided to the robot.With this the robot will be able to understand the environment's and the robot's own current state.Secondly, a target image is shown to the robot.During this time, the weights of the MTRNN and PB nodes are fixed.The   (0) and PB values are calculated using the BPTT algorithm.Next, the differences between the target image and the associated image from the MTRNN are minimized.Using the PB and   (0) values calculated with the algorithm, the MTRNN generates motor sequence data (Figure 2(c)).In Nishide et al. 's research, it was necessary to apply dynamic touch for tool type recognition.However, in this research, it is possible to recognize a tool from an image of the grasped tool.

Robot Model in Simulation.
To evaluate the tool-body assimilation model, we built a humanoid robot model in the robotics simulator OpenHRP3 [34].The model's size and degrees of freedom (DOFs) were based on the humanoid robot ACTROID [35].The range of motion of the model's joint angles was based on human [36].To reduce calculation time, only the right arm, torso, and head of the robot were used.In addition, the left arm was removed and the legs were replaced with a box (Figure 5, Table 1).

Experimental Evaluation.
In this experiment, an object pulling task with the robot's bare hand, an I-shaped tool, a T-shaped tool, an L-shaped tool, a J-shaped tool, "⊢" shaped tool, a C-shaped tool was used to evaluate the model (Figures 6 and 7).The size of the object was 0.08 [] in diameter.The L-shaped tool was treated as an unknown tool, which is highly similar to the learned tools, and a J-shaped, "⊢" shaped, and C-shaped tool, which have high dissimilarities with the learned tools, are only used for evaluation and not for training.This task is commonly used in the study of robotic  tool use and tool-body assimilation [3,4,9,11,19].The robot performed body babbling in the presence of a target object (a cylinder) on a table for 6 [] using its hand and a tool.

Motion by Body Babbling.
To evaluate the effectiveness of this approach, the movement of the robot was confined to the plane of the desk (two-dimensional movements).In doing this, out of the seven DOFs of the robot's arm, only three DOFs were used.The robot's arm had two initial positions: to the left and to the right of the target object (Figure 8).For each initial position, the robot executed 75 sets of body babbling.The motions were generated by connecting the initial pose, the second pose, and the third pose.The second pose and third pose are selected randomly.The poses are connected by calculating the required acceleration with fifthorder linear interpolation.The accelerations of the beginning and end of the movement are calculated to be 0. Therefore, the movements become smooth and it is possible to control the motions according to the target motions.Although two initial positions are used for the teaching data in this study, parts of the motion paths in the training data are coded during the training of the RNN.This is synonymous to the learning of the varieties of trajectories.During body babbling, the robot obtained the teaching data that were used during training (Figure 9).The acquired data consisted of motor and image sequential data.The motor data of the three movable DOFs were recorded for 30 steps during the 6 [] of random motion, that is, 7.5 [steps/].Image data constituted a gray-scale image of 32 × 24 pixels captured by a visual sensor on the robot.Twenty-five dimensions of the image features extracted from the image data by using an autoencoder with DNN, and three dimensions of the joint angles were used for the input data to train the MTRNN.The image data and joint angle were then normalized to [0.1, 0.9] and [0.0, 1.0] for use as the data for teaching the MTRNN.Table 2 shows the construction of the DNN.Table 3 shows the construction of the MTRNN.

Extraction of Image Features by Self-Organizing Map.
In previous studies, SOM has commonly been used as an image feature extraction method [19,29,37,38].To compare image features extraction by DNN, image features were also extracted by SOM. Figure 10 shows a visualization of the reference vectors of the SOM.Reference vectors are the visualization of image features.Reference vectors represent the patterns that are extracted from the input data.The characteristics of reference vectors are that the units that are mapped close to each other will have close resemblances to each other.In addition, each input data is classified to the locations of the reference vectors that are similar to the data.The dimensions of this SOM were 5 × 5.The results show  that the difference between the bare hand and tools is not learned accurately and that the motion patterns are not learned accurately.We changed the dimensions of SOM to 10 × 10.However, the motion patterns were not learned accurately after increasing the SOM dimension.Even if feature extraction is done well by increasing the dimension, it is difficult to learn by RNN because of the greater dimension.When more tools are introduced, the various tool conditions were included in each vector, causing the tool feature classification to fail.When only a few tools are used, it is possible to learn the image features accurately with SOM.This is shown in Figure 11, where image features of the bare hand and Tshaped tool were extracted by SOM.

Extraction of Image Features by DNN.
Original images were recovered by substituting the image features extracted by DNN (Figure 12).The image of the bare hand and tools were accurately recovered.In addition, the position of the target object was recovered.Moreover, even with unknown tools it was possible to recover the shapes of the unknown tools.
In the case of the SOM, extracted features of the target object were unclear; therefore, it was difficult to recover the position of the object accurately.DNN does not use classifications but instead makes use of autoencoders that are trained to produce the same output as the input data.With this, it is possible to reduce the dimensionality of large numbers of high-dimensional data followed by high reproduction performance.In addition, there is no need to fine tune the parameter settings with DNN.Thus, if there are large amounts of training data, the DNN is superior to the SOM in the extraction of image features.

Self-Organized Tool Function from PB Values.
The principal component analysis (PCA) results of the PB values for the tool-body assimilation model are shown in Figure 13.The figure shows that self-organization failed for the features of each tool.Next, after body babbling was performed, as an additional condition we chose the motion patterns in which the bare hand and tools of the robot contacted the target object when moving between the two positions.PCA results of the PB values for the tool-body assimilation model are shown in Figure 14.The PB values are clustered based on the tool used during motion generation, that is, bare hand, Ishaped tool, and T-shaped tool.By these different conditions, we can see that the robot often moves without touching the target object when the additional conditions are not implemented.To generate motions there is no absolute need to gain tool functions.However, in this case the robot's adaptability  to novel tools will decrease.To gain tool functions, we believe that it is important to actually have a sense of purpose to use the tool.The learning result implies that it is difficult to acquire the functions of the tool to just move the arm without purpose.We hypothesize that it is necessary to move with a purpose for acquiring tool functions.This idea is also described in a previous study [39].When infants manipulate a grasped tool, it seems to be a random movement; however, we suspect that the intended movements are not performed correctly because their sensorimotor is not precisely formed.This problem will be solved by implementing explorative movements, and not random ones, for efficient learning.
Parts of the clusters are overlapping in Figure 14.In these overlapping regions, the robot manipulated the target object with part of its hand or a common part of tool (e.g., part of the "I" is common between the T-shaped and I-shaped tools) when the robot performed body babbling with a tool.
Figure 15 shows the clustering of the PB values after training of the model.As shown, each cluster is formed at a different area of the graph.Some parts of the clusters are overlapping.This is because when the robot generates the motion close to the target state by using parts of the tool that are similar to other tools, it distinguishes these as having the same tool function and chooses similar PB values.When the robot uses a different tool function, it chooses different PB values.PB values earned through recognition have large overlapping areas compared to the PB values earned during training.This is because the robot can generate the motion that uses the same tool function even if tool shape is different.It is considered that the robot generates motion with final states close to the target state which the robot has many experience with.The plots of PB values of unknown tools (L-shaped tool, J-shaped tool, "⊢" shaped tool, and C-shaped tool) overlap with each other and are hard to observe.This is because PB reflects not only the effect of tool shapes but also the tool functions used during each motion.For clarity's sake, we   plotted the center of gravity of the PB of each tool (Figure 16).Variances of the PB values of the bare hand, I-, T-, L-, J-, "⊢", and C-shaped tool are 1.41, 3.17, 4.00, 3.87, 5.43, 6.89, and 3.32, respectively.Figure 16 shows that the PB values of the bare hand and I-shaped tools are similar.This is because the similar shapes of the bare hand and I-shaped tools lead to almost equivalent tool functions.With this observation, it can be said that similar shapes lead to similar usable tool functions during motions.It is observed that L-shaped tool has the intermediate tool functions of I-shaped and T-shaped tool.This is because L-shaped tool is similar to I-shaped and T-shaped tools.In the case of the J-shaped and C-shaped tools, the PC1 values are similar to the PC1 values of the L-shaped tool.Here, the difference of PC2 values of the Lshaped and C-shaped tools is smaller than the difference of PC2 values of the L-shaped and J-shaped.Thus, it can be said that the function of C-shaped tool is more similar to L-shaped tool than J-shaped tool.It is observed that the PB values of the "⊢" shaped tool are differ from other tools except the PC1 values of bare hand and I-shaped tool.Thus, it can be said that "⊢" shaped tool has part of the functions of bare hand and I-shaped tool and different functions when comparing other tools.With these observations, it can be said that the robot recognizes that the tools each have different tool functions.

Generated Motion.
We evaluated the performance of our system by counting the number of the tasks which successfully moved the object to a position within the radius of  pixels from the goal position in the visible area (32 × 24 pixels).Figure 17 shows the relationships between the success rate and .The success rate within  = 2 was about 20 to 35 percent; however within  = 5 it was more than 50 percent even if the robot used the untrained L-shaped tool, J-shaped tool, "⊢" shaped tool, and C-shaped tool.As one of the ways to improve the success rate, it may be better to learn gradually from coarse to fine images as shown by the experimental results of Kawai et al. [40].
Examples of motion generated by the robot when given a target image are shown in Figures 18,19,20,and 21.As shown, motions close to the target state were generated even if the robot uses an untrained tool.In Figure 17, it can be observed that it is possible to manipulate the object with the same accuracy as the learned tools even when using unknown tools.In addition, Figure 22 shows an example of motion that starts from an untrained initial pose and object position.As shown, motions close to the target state were generated even if the robot starts from an untrained initial state.As a matter of interest, in Figure 19 the robot generated the pulling  motion with a J-shaped tool after first avoiding contact of the protruding part of the tool and the object.If the robot did not avoid the contact at first, the object will be repelled and the robot would not be able to manipulate the object correctly.Among the generated motions that have a final state close to a given target state, some have different movement courses compared to the teaching data (Figure 23).Because motions other than the learned ones are generated, it can be said that the model does not overfit.Figure 14 shows that the PB values formed different clusters for each tool.Figure 16 shows that robot recognizes different PB values for each tool.In such cases, it can be said that the robot have acquired the tool function (affordance) and generated the motion by using the tool function.

Conclusion
This study's objective is to achieve robot tool use without the need for predetermined features and models by having the robot self-organize the required features inspired by human development and cognitive mechanisms.By using the concept of tool-body assimilation, it is possible to treat a tool as an extension of the body.Therefore, it is possible to represent the body and tool use models in one single model.Previous models implemented additional models per additional tool to be used in addition to the body model.However, our proposed model made it possible to represent all this with only one model.Related studies have required preset motions, preset tool and motion features, and full searches of all possible motions during motion generation.To overcome these issues, we propose the following approach: (1) body babbling with a humanoid model that does not require preset motions, (2) learning algorithm that does not require preset sensorimotor integration and tool features, with the concept of tool-body assimilation by using MTRNN and image feature extraction by an autoencoder with DNN, and (3) recognition of motion from the goal state.The evaluation experiment is an object manipulation task conducted with OpenHRP3, a robotics simulator.As a result, when given an image of a final state, the robot is able to generate a motion similar to the final state.
As next steps, we plan to extend the study to a sevendegrees-of-freedom model, design research settings that consider more of the human body, and set a more specific task for quantitative assessments.

Figure 1 :
Figure 1: Overview of tool manipulation model.

Figure 5 :
Figure 5: Robot based on ACTROID with T-shaped tool.

Figure 7 :
Figure 7: Tools used in experiment.

Figure 8 :
Figure 8: Hand posture used in motion pattern.

Figure 9 :
Figure 9: Body babbling with hand and T-shaped tool.

Figure 11 :
Figure 11: Reference vector of SOM (bare hand and T-shaped tool only).

Figure 13 :
Figure 13: PCA PB space trained from the target image.

Figure 14 :
Figure 14: PCA PB space trained from the target image with additional conditions.

Figure 17 :Figure 18 :
Figure 17: Relationships between the success rate and the position error.

Table 1 :
DOF and link length of the robot.Range of motion of the joint angle. *