Common Laws Driving the Success in Show Business

In this paper, we investigate whether gender bias affects success and whether there are common laws driving success in show business. We design an experiment in which the gender and the productivity of an actor or actress over a certain period are the independent variables, and we use deep learning techniques to predict success, extract the latent features, and understand the data. Three models are trained: the first on the data of actors, the second on the data of actresses, and the third on the mixed data. Three benchmark models are constructed under the same conditions. The experimental results show that our models are more general and more accurate than the benchmarks. An interesting finding is that a model trained only on the data of actors (or only on the data of actresses) achieves similar performance on the data of the other gender, without performance loss. This indicates that gender bias is only weakly related to success. By visualizing the feature maps in the embedding space, we see that the prediction models have learned some common laws even though they are trained on different data. Based on these findings, a more general and accurate model for predicting success in show business can be built.


Introduction
"Do I need to change jobs?" is one of the major concerns of most actors and actresses, since show business is highly competitive [1]. The Matthew effect [2], or the so-called "rich-get-richer" phenomenon, has been shown to exist in show business, which demonstrates the scarcity of resources [1]. Luck has been shown to be a key element in driving success [3]. It is well known that the rich-get-richer effect is quite arbitrary and unpredictable [4]. Hence, most actors and actresses face the problem of avoiding famine and building a sustainable acting career [1]. Some studies have found that productivity is a key metric for evaluating the success of an actor or actress and that boosting it can be more of a network effect [5,6] than a consequence of acting skills; in other words, success is not highly related to acting skills [1]. Other studies show a relationship between the dynamic collaboration network and success [7]: success is a collective phenomenon [8]. The startup network has been shown to have predictive power in show business [9], and future success can be predicted by monitoring the behavior of a small set of individuals [10]. A great deal of work has been done to study the laws of success [11][12][13][14][15][16][17][18][19].
Recently, a study showed that success in show business is predictable, using a heuristic threshold-based binary classifier that achieves an accuracy of up to 85% [1]. That study finds a strong gender bias in the waiting time statistics, the location of the annus mirabilis, and the career length distribution of the data. This raises two questions: Is gender bias one of the key elements driving success? Can we find common laws driving success in show business? Since we want to build a general prediction model, the common laws that determine the growth and the shape of the series are more important than the differences.
To answer these questions, we design this study. The data we use were collected from the Internet Movie Database (IMDb), http://www.imdb.com, in [1]. They consist of millions of profile sequences of actors and actresses from the birth of film in 1888 up to the present day [1]. Each sequence records the yearly time series of credited jobs over the entire working life of the actor or actress [1]. As in [1], we consider only the number of credited jobs, regardless of the impact of the work, the screen time, and so on. The original feature space is a non-Euclidean space, so we must use representation learning to map these features to a Euclidean space. To do this, we construct a deep model that consists of an encoder and a classifier. Since gender is an independent variable in our experiment, we train three models: (1) MAO, (2) MAE, and (3) MM. They all have the same structure but are trained on different datasets (MAO is trained on the data of actors, MAE on the data of actresses, and MM on the mixed data). Our problem can be reformulated as follows: (1) If MAO achieves non-degraded performance on the data of actresses, comparable to MAE, and MAE achieves non-degraded performance on the data of actors, comparable to MAO, then there are common features in the series that are unrelated to gender. (2) If MM achieves similar and nonsuperior performance compared with MAO and MAE, then the features that carry gender bias are not dominant features in this prediction problem; that is to say, gender bias may cause some differences in resource allocation, but it is weakly related to success. The contributions of this paper can be summarized as follows:
(1) We found that there are some common laws/features driving success in show business by extracting and understanding the data.
(2) Using these common features, a more general prediction model with an accuracy of up to 90% can be built.
(3) Our experiment shows that gender bias is weakly related to success, even though a recent study shows that it strongly affects the waiting time statistics, the location of the annus mirabilis, the career length distribution, etc.

2.1. Data.
The data we use consist of the careers of 1,512,472 actors and 896,029 actresses from 1888 up to 2016 and were collected from the Internet Movie Database (IMDb), http://www.imdb.com. Each career is viewed as a profile sequence: the yearly time series of acting jobs in films or TV series over the entire working life of the actor or actress [1]. We follow [1] but relax their selection constraint, selecting the sequences of actors and actresses with a working life L ≥ 5 years and at least 5 credited jobs in the annus mirabilis (AM ≥ 5). Sequences obtained with more relaxed cutoffs are too short to be analyzed; they are treated as outliers and are not included in the experiment. The subset we use thus consists of 37,896 (2.51%) sequences of actors and 22,025 (2.46%) sequences of actresses, which is larger than the data used for the prediction model in [1]. We divide this subset into four groups for the experiment: (1) Group 1: actors with AM ≥ 5 and L ≥ 20, including 21,994 sequences; (2) Group 2: actresses with AM ≥ 5 and L ≥ 20, including 9,034 sequences; (3) Group 3: actors with AM ≥ 5 and 5 ≤ L < 20, including 15,902 sequences; (4) Group 4: actresses with AM ≥ 5 and 5 ≤ L < 20, including 12,991 sequences. Groups 1 and 2 can be regarded as very successful actors and actresses and are mainly used to train the prediction model. Groups 3 and 4 can be regarded as actors and actresses who are not very successful; they may need a prediction model more than the previous groups, and their data are used to test the prediction model.
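The selection rule and the four groups can be illustrated with a minimal sketch. It assumes that a career is stored as a list with one entry per year of the working life (so L is the list length) and that the annus mirabilis is the year with the most credited jobs; the function name `assign_group` is hypothetical and not part of the original study.

```python
from typing import List, Optional


def assign_group(jobs_per_year: List[int], is_actress: bool) -> Optional[int]:
    """Return the group number (1-4) of a career, or None if it is excluded.

    Assumption: jobs_per_year has one entry per year of the working life.
    """
    working_life = len(jobs_per_year)        # L: career length in years
    am_jobs = max(jobs_per_year, default=0)  # credited jobs in the annus mirabilis
    if am_jobs < 5 or working_life < 5:
        return None                          # below the relaxed cutoff
    if working_life >= 20:
        return 2 if is_actress else 1        # Groups 1 and 2 (mainly for training)
    return 4 if is_actress else 3            # Groups 3 and 4 (5 <= L < 20, for testing)
```

For example, a 7-year career of an actress whose best year has 6 credited jobs would fall into Group 4 under these assumptions.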

Data Preprocessing.
To make an early prediction, we need to preprocess the data before training the model. First, following [1], we truncate each sequence into several subsequences, also called subcareer series. For each sequence, we randomly sample several subsequences with a sampling rate r. The subsequences sampled before the annus mirabilis are regarded as class 1, and the subsequences sampled after the annus mirabilis are regarded as class 2. Hence, this is a binary classification problem. The aim of this sampling is to obtain samples of class 1, since we only have the entire working life of each actor or actress. An example of the sampling process with a sampling rate r = 4 is shown in Figure 1. NatComm19 uses the function given in [1] (equation (1)) to transform these subsequences into scalars for training, where w_T is the number of credited jobs at year T and T is the length of the subsequence. This transformation loses some information, such as the increasing or decreasing trend. In this paper, we revise equation (1) so that it yields a new sequence D rather than a scalar, which preserves this information. Then, we use the new sequence D to train the model. Since gender is an independent variable, we construct three prediction models, which are trained on different subsets of the whole data. The details of the separation of training data and test data for each model are shown in Table 1.
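The sampling and labeling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each subcareer is taken to be a prefix of the career with a uniformly random end year, the annus mirabilis is the first year with the maximum number of credited jobs, and a prefix that already contains that year counts as class 2. The function name `sample_subcareers` is hypothetical.

```python
import random
from typing import List, Tuple


def sample_subcareers(jobs_per_year: List[int], r: int,
                      rng: random.Random) -> List[Tuple[List[int], int]]:
    """Sample r truncated subcareers from one career and label each of them.

    Label 1: the subcareer ends before the annus mirabilis (AM).
    Label 2: the subcareer already contains the AM (assumed convention).
    """
    am_index = jobs_per_year.index(max(jobs_per_year))  # first year with most jobs
    samples = []
    for _ in range(r):
        end = rng.randint(1, len(jobs_per_year))  # subcareer covers years [0, end)
        label = 1 if end <= am_index else 2
        samples.append((jobs_per_year[:end], label))
    return samples


career = [1, 2, 0, 7, 3, 1]  # toy career; the AM is the fourth year
for subcareer, label in sample_subcareers(career, r=4, rng=random.Random(0)):
    print(subcareer, "-> class", label)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which matters when the same subcareers must be reused to compare several models.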

Prediction Model.
Recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [20,21] in particular, are powerful tools for time series prediction with sequential data. Compared with a standard feedforward neural network, an RNN is a kind of neural network that has feedback connections (memory), as shown in Figure 2. It can process not only single data points but also entire sequences of data. For example, LSTM has been applied to tasks such as speech recognition [22], sign language translation [23], object cosegmentation [24,25], and airport passenger management [26]. Hence, we use an RNN with LSTM units to build an end-to-end prediction model, where each LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. Figure 3 shows the structure of our model. The model can be divided into two sequential parts: (1) an encoder and (2) a binary classifier. The encoder consists of an LSTM layer with 30 hidden units that returns the output at the last time step. The classifier consists of a fully connected layer, a softmax layer, and a classification layer with cross entropy as the loss function. Our model is trained in a supervised fashion on a set of training sequences using a gradient-descent-based optimization algorithm. Since the sequences have different lengths, as shown in Figure 4, the feature space of these sequences is a non-Euclidean space, and it is difficult to train a classifier in this feature space. Hence, each input sequence is embedded by the encoder into a Euclidean space as H, where H is an n-dim sequence. Through the encoder, the dimension of the feature is also reduced. Then, the cross-entropy loss between C and $\hat{C}$ is minimized to obtain the optimized parameters, where C is the real label and $\hat{C}$ is the label predicted by the classifier.
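To make the encoder-classifier structure concrete, the following is a minimal forward-pass sketch in pure Python. It uses the standard LSTM formulation (sigmoid gates, memory update, output equal to the output gate times tanh of the memory), which is assumed here rather than taken from the paper; the weights are random placeholders, no training is performed, and the names `TinyLSTMEncoder`, `classify`, and `cross_entropy` are hypothetical.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Single LSTM layer over scalar inputs; weights are random placeholders."""

    def __init__(self, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.n = n_hidden
        # One weight row per unit and gate: forget (f), input (p), output (o),
        # candidate memory (g). Each row = [input weight, recurrent weights..., bias].
        self.params = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 2)]
                   for _ in range(n_hidden)]
            for gate in "fpog"
        }

    def _pre(self, gate: str, j: int, x: float, h: List[float]) -> float:
        w = self.params[gate][j]
        return w[0] * x + sum(wi * hi for wi, hi in zip(w[1:-1], h)) + w[-1]

    def encode(self, sequence: List[float]) -> List[float]:
        """Map a sequence of any length to an n-dimensional vector (last output)."""
        h = [0.0] * self.n
        c = [0.0] * self.n
        for x in sequence:
            new_h, new_c = [], []
            for j in range(self.n):
                f = sigmoid(self._pre("f", j, x, h))            # forget gate
                p = sigmoid(self._pre("p", j, x, h))            # input gate
                o = sigmoid(self._pre("o", j, x, h))            # output gate
                candidate = math.tanh(self._pre("g", j, x, h))  # new memory content
                c_j = f * c[j] + p * candidate                  # memory update
                new_c.append(c_j)
                new_h.append(o * math.tanh(c_j))
            h, c = new_h, new_c
        return h


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer followed by softmax; returns class probabilities."""
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Cross-entropy loss of one sample; true_class is a 0-based class index."""
    return -math.log(probs[true_class])
```

With `encoder = TinyLSTMEncoder(30)`, both `encoder.encode([1, 2, 3])` and `encoder.encode([1] * 40)` return vectors of length 30, which is the point of the embedding step: sequences of different lengths end up in the same fixed-dimensional space before classification.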
In forward propagation, an LSTM does not simply compute a weighted sum of the input signal; it applies a nonlinear function. Each j-th LSTM unit maintains a memory $c_t^j$ at time $t$ and an output gate weight $o_t^j$. The memory cell $c_t^j$ is updated by partially forgetting the existing memory and adding a new memory content $c_t^{j\prime}$:

$$c_t^j = f_t^j \, c_{t-1}^j + p_t^j \, c_t^{j\prime},$$

where $f_t^j$ is the weight of the forget gate and $p_t^j$ is the weight of the input gate. The details of each layer's configuration are shown in Table 2. The training settings for the prediction model are as follows: the maximum number of epochs is 15, the minibatch size is 100, the optimizer is Adam, and the gradient threshold is 1. More complex models, such as deeper models and models with more complex structures (biLSTM), have also been tested, but they yield no obvious performance improvement.
That is to say, these are all fairly "off the shelf" classifiers. Since simpler is better, we use the simplest model to show the results. Tables 3-5 show the comparison between our model and the recent study NatComm19 [1] on the test data. MM_ours denotes the prediction model trained on the mixed data of actors and actresses, MAO_ours denotes the prediction model trained on the data of actors only, and MAE_ours denotes the prediction model trained on the data of actresses only. MM_NatComm19 denotes the model of NatComm19 [1] trained on the mixed data of actors and actresses, with learned threshold d = 6.1523; MAO_NatComm19 denotes the model of NatComm19 [1] trained on the data of actors only, with learned threshold d = 6.9580; and MAE_NatComm19 denotes the model of NatComm19 [1] trained on the data of actresses only, with learned threshold d = 5.6640. All models are trained on the training data with the cutoff values AM ≥ 5 and L ≥ 20. Our models outperform NatComm19 on all quantitative metrics in all subsets of the test data. Our models are also more general than NatComm19 and still maintain their performance, whereas the performance of the three NatComm19 models degrades to near the baseline. The details of the baseline model can be found in [1]. There is almost no difference between the performance of our three models. Interestingly, the difference between the performance of the three NatComm19 models is also negligible.
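The threshold-based benchmark can be pictured with a small sketch. How the threshold d is learned in [1] is not described in this paper, so an exhaustive search over the observed scores is used here as a stand-in; the direction of the rule (class 2 when the scalar feature is at least d) and the function name `fit_threshold` are likewise assumptions made only for illustration.

```python
from typing import List, Tuple


def fit_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (d, accuracy) for the rule: class 2 if score >= d, else class 1.

    scores must be non-empty; labels contain 1 or 2.
    """
    best_d, best_acc = float("inf"), -1.0
    # Candidate thresholds: every observed score, plus +inf (everything is class 1).
    for d in sorted(set(scores)) + [float("inf")]:
        correct = sum((2 if s >= d else 1) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # keep the smallest threshold among ties
            best_d, best_acc = d, acc
    return best_d, best_acc


# Toy scalar features for four subcareers: two of class 1, two of class 2.
d, accuracy = fit_threshold([1.0, 2.0, 6.5, 7.0], [1, 1, 2, 2])
print(d, accuracy)
```

A single scalar threshold can only separate the classes along one hand-crafted feature, which is one way to read the contrast with the learned embedding used by the LSTM-based models.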

Figure 3: The workflow of our model. It has an end-to-end structure and can be divided into two parts: (1) encoder; (2) binary classifier. The encoder of our model is a single LSTM layer that embeds different sequences into an n-dim embedding space. The binary classifier is a fully connected neural network that classifies these n-dim embeddings.

… learn some common features that are used to classify. Since the model of NatComm19 uses a learnable threshold to classify in the original feature space, as shown in Figure 5, the case of MAE_NatComm19 and MAO_NatComm19 shows that the distribution and the shape of the original feature space are similar for the data of actors and the data of actresses, as shown in Figure 6. MM_ours achieves similar and nonsuperior results compared with MAE_ours and MAO_ours, and MM_NatComm19 also achieves similar and nonsuperior results compared with MAE_NatComm19 and MAO_NatComm19. This shows that the features that carry gender bias are not dominant features in this prediction problem; that is to say, gender bias may cause some differences in aspects such as resource allocation, but it is weakly related to success. To further validate our conclusion, we visualize the embedding space in Figure 7. At first sight, the three models seem to learn different features. However, this is caused by the randomness of the neural network, and the order of these features has no meaning, much as in an eigendecomposition. From the weight of each embedding feature, obtained from the fully connected layer, we can see that most of these embedding features are unimportant. Interestingly, all three models have only one dominant feature. The range of the corresponding feature is also similar in the three models, [−1, s], where s is a positive scalar. We therefore believe that they have learned a similar feature that is used to classify.

Figure 5: The workflow of the model in NatComm19 [1]. d is a learnable scalar threshold. The target of this model is to find an optimal d that separates the two classes in the original feature space.

Figure 6: Feature maps of the original feature space. Note. There are a few outliers (sequences with a length over 100). They are caused by a few films that in some sense exist but have not been released. Since they are so rare and are the correct data, they are also considered as in …

Conclusion
In this paper, we design a data-driven study to find out whether gender bias is a key element of success in show business and to find common laws/features driving that success. The experimental results show that there are some common features between the success of an actor and the success of an actress and that gender bias is weakly related to success. We use this property to build a general model for predicting success in show business. Compared with the benchmark, the improvement of the model is obvious. In the future, we plan to study further whether gender bias is a key element and to look for common laws driving success in other fields.

Data Availability
The data used in this study can be accessed at https://doi.org/10.17605/OSF.IO/NDTA3.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 7: Feature maps of the testing data of actors and actresses in the embedding space obtained by different models. The blue line denotes class 1, and the red line denotes class 2. The curves of different datasets show the same distribution and shape in the same embedding space, and the boundary between the two classes is clearer than in the original feature space. Although the embedding spaces of different models seem different, they are actually equivalent because they are different approximations of the global optimum obtained by the neural network. The curves of each feature's weight show that one feature dominates the classification. Note that, as in an eigendecomposition, the order of these weights has no meaning. The dominant feature of each model shows a similar range, and there is a clear boundary between the two classes in this feature. This further indicates that the three models have learned a similar feature.