This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming at only relying on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning and relate it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.

In Section

Targeted Learning resurrects the pillars of statistics such as the facts that a model represents actual knowledge about the data generating experiment and that a target parameter represents the feature of the data generating distribution we want to learn from the data. In this manner, Targeted Learning defines a truth and sets a scientific standard for estimation procedures, while current practice typically defines a parameter as a coefficient in a misspecified parametric model (e.g., logistic linear regression, repeated measures generalized linear regression) or small unrealistic semi parametric regression models (e.g., Cox proportional hazards regression), where different choices of such misspecified models yield different answers. This lack of truth in current practice, supported by statements such as “All models are wrong but some are useful,” allows a user to make arbitrary choices even though these choices result in different answers to the same estimation problem. In fact, this lack of truth in current practice presents a fundamental drive behind the epidemic of false positives and lack of power to detect true positives our field is suffering from. In addition, this lack of truth makes many of us question the scientific integrity of the field we call statistics and makes it impossible to teach statistics as a scientific discipline, even though the foundations of statistics, including a very rich theory, are purely scientific. That is, our field has suffered from a disconnect between the theory of statistics and the practice of statistics, while practice should be driven by relevant theory and theoretical developments should be driven by practice. For example, a theorem establishing consistency and asymptotic normality of a maximum likelihood estimator for a parametric model that is known to be misspecified is not a relevant theorem for practice since the true data generating distribution is not captured by this theorem.

Defining the statistical model to actually contain the true probability distribution has enormous implications for the development of valid estimators. For example, maximum likelihood estimators are now ill defined due to the curse of dimensionality of the model. In addition, even regularized maximum likelihood estimators are seriously flawed: a general problem with maximum likelihood based estimators is that the maximum likelihood criterion only cares about how well the density estimator fits the true density, resulting in a wrong trade-off for the actual target parameter of interest. From a practical perspective, when we use AIC, BIC, or cross-validated log-likelihood to select variables in our regression model, that procedure is ignorant of the specific feature of the data distribution we want to estimate. That is, in large statistical models it is immediately apparent that estimators need to be targeted towards their goal, just like a human being learns the answer to a specific question in a targeted manner, and maximum likelihood based estimators fail to do that.

In Section

We refer to our papers and book on Targeted Learning for overviews of relevant parts of the literature that put our specific contributions within the field of Targeted Learning in the context of the current literature, thereby allowing us to focus on Targeted Learning itself in the current outlook paper.

Our research takes place in a subfield of statistics we named Targeted Learning [

A statistical model

In various applications, careful understanding of the experiment that generated the data might show that even these rather large statistical models assuming the data generating experiment equals the independent repetition of a common experiment are too small to be true: see [

In a study in which one observes a single community of

In group sequential randomized trials, one often may use a randomization probability for a next recruited

Indeed, many realistic statistical models only involve independence and conditional independence assumptions and known bounds (e.g., it is known that the observed clinical outcome is bounded between

An important by-product of requiring that the statistical model needs to be truthful is that one is forced to obtain as much knowledge about the experiment before committing to a model, which is precisely the role a good statistician should play. On the other hand, if one commits to a parametric model, then why would one still bother trying to find out the truth about the data generating experiment?

The target parameter is defined as a mapping

For example, if

For example, suppose that the true conditional probability of death is given by some logistic function

However, this particular statistical estimand

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing

In general, causal models or, more generally, sets of nontestable assumptions can be used to define underlying target quantities of interest and corresponding statistical target parameters that equal this target quantity under these assumptions. Well known classes of such models are models for censored data in which the observed data is represented as a many to one mapping on the full data of interest and censoring variable, and the target quantity is a parameter of the full data distribution. Similarly, causal inference models represent the observed data as a mapping on counterfactuals and the observed treatment (either explicitly as in the Neyman-Rubin model or implicitly as in the Pearl structural causal models), and one defines the target quantity as a parameter of the distribution of the counterfactuals. One is now often concerned with providing sets of assumptions on the underlying distribution (i.e., of the full-data) that allow identifiability of the target quantity from the observed data distribution (e.g., coarsening at random or randomization assumption). These nontestable assumptions do not change the statistical model

The estimation problem is defined by the statistical model (i.e.,

The empirical mean of the influence curve

Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so called tangent space generated by all scores of parametric submodels through

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the

Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part

In our running example, we can define

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of

The construction of targeted estimators of the target parameter requires construction of an estimator of infinite dimensional nuisance parameters, specifically the initial estimator of the relevant part

In order to optimize these estimators of the nuisance parameters

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector that compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one has available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [

In our running example, we have that

Similarly, one can define a super-learner of the conditional distribution of

The super-learner’s performance improves by enlarging the library. Even though for a given data set, one of the candidate estimators will do as well as the super-learner, across a variety of data sets, the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learner provides an important step in creating a robust estimator whose performance relies not on being lucky but on generating a rich library so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.
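The cross-validated weighting described above can be illustrated with a minimal sketch. It is not the authors' software: the three candidate learners, the squared-error loss, the function names, and the Frank-Wolfe search over convex weights are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

# Hypothetical candidate library: each learner maps training data to a predictor.
def fit_mean(X, y):
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def fit_linear(X, y):
    design = lambda Z: np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xn: design(Xn) @ beta

def fit_quadratic(X, y):
    design = lambda Z: np.column_stack([np.ones(len(Z)), Z, Z ** 2])
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xn: design(Xn) @ beta

def cv_predictions(library, X, y, n_folds=5, seed=0):
    """Column j holds learner j's predictions on folds it was not trained on."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    Z = np.zeros((len(y), len(library)))
    for v in range(n_folds):
        train, valid = folds != v, folds == v
        for j, fit in enumerate(library):
            Z[valid, j] = fit(X[train], y[train])(X[valid])
    return Z

def convex_weights(Z, y, n_iter=500):
    """Minimize cross-validated squared error over convex combinations.

    Starts at the single best learner (the cross-validation selector) and
    takes Frank-Wolfe steps with exact line search, so the risk never increases.
    """
    n, k = Z.shape
    risks = ((Z - y[:, None]) ** 2).mean(axis=0)
    w = np.zeros(k)
    w[np.argmin(risks)] = 1.0
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y) / n
        s = np.zeros(k)
        s[np.argmin(grad)] = 1.0          # best vertex of the simplex
        d = s - w
        curvature = np.sum((Z @ d) ** 2) / n
        if curvature <= 1e-15:
            break
        step = np.clip(-(grad @ d) / curvature, 0.0, 1.0)
        w = (1.0 - step) * w + step * s
    return w

def super_learner(library, X, y, n_folds=5, seed=0):
    Z = cv_predictions(library, X, y, n_folds, seed)
    w = convex_weights(Z, y)
    fits = [fit(X, y) for fit in library]  # refit every learner on all data
    predict = lambda Xn: np.column_stack([f(Xn) for f in fits]) @ w
    return w, predict, Z
```

Because the search starts at the cross-validation selector and each step can only lower the cross-validated risk, the weighted combination in this sketch does at least as well, on that criterion, as the best single candidate in the library.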

An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so called (mean zero) efficient influence curve

The efficient influence curve is also called the canonical gradient and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter

In our running example, it can be shown that the efficient influence curve of the additive treatment effect

As noted earlier, the influence curve

The efficient influence curve is a function of

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semi parametric models

In our running example, we have

Due to this identity (

In fact, combining

This is a good moment to review the roadmap for Targeted Learning. We have formulated a roadmap for Targeted Learning of a causal quantity that provides a transparent roadmap [

defining a full-data model such as a causal model and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [

defining the target quantity of interest as a target parameter of the full-data distribution;

establishing identifiability of the target quantity from the observed data distribution under possible additional assumptions that are not necessarily believed to be reasonable;

committing to the resulting estimand and the statistical model that is believed to contain the true

a subroadmap for the TMLE discussed below to construct an asymptotically efficient substitution estimator of the statistical target parameter;

establishing an asymptotic distribution and corresponding estimator of this limit distribution to construct a confidence interval;

honest interpretation of the results, possibly including a sensitivity analysis [

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling: that is, one obtains a parameterization

The TMLE [

Typically, the risk at a candidate parameter value

Secondly, one computes the efficient influence curve

In our running example, we have

The dimension of

One fits the unknown parameter

In our running example, we have

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameters as a parameter of a smaller

We have used this framework to develop TMLE in a large number of estimation problems that assumes that

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications on TMLE and more, we refer to [

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so called sequential randomization assumption, this target quantity is identified by the so called G-computation formula for the postintervention distribution corresponding with a stochastic intervention

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example, one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures: these packages are publicly available on CRAN under the names tmle, ltmle, and SuperLearner.

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and in response to that we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

For example, suppose that in our running example

We hope that the above clarifies that Targeted Learning is an ongoing exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle/adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems. By being honest in the formulation, typically new challenges come up asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data generating experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning we began to explore.

Therefore we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions that state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [

In the previous sections the main characteristics of TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [

First and foremost, it must be emphasized that rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines like machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL all these elements should be related to or defined in terms of (properties of) the data generating distribution and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of/approach to learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. 
Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson’s goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML-estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, usually of unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies, which wrongly suggest a uniform and united field with foundations that are fixed, and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions, research methods, and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory, statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today, this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result these approaches dominate key issues and controversies in epistemology such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low

In EDA Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962 in the famous opening passage from The Future of Data Analysis: “

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the “inferential” coup, which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution and an exponent or forerunner of today’s erosion of models, the view that all models are wrong, the classical notion of truth is obsolete, and pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the antithesis sketched here very briefly. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey’s heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages along with inferential methods that are themselves sometimes hybrid. In the current empirical methodology EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson but understood the ultimate consequences of it. It was Galton who had shown that variation and change are intrinsic in nature, and that we have to look for the deviant, the special or the peculiar. 
It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton’s heritage came under some pressure from the successes of parametric Fisherian statistics, built on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence if not synthesis. The 19th century dialectical German philosopher G.W.F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval and has great influence on the research of efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap compelling for practical reasons as well.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on these matters the contradiction sketched briefly above often emerges, and in the popular literature the differences are usually amplified, leading to annexation of Big Data by one of the two disciplines.

Of course the gap between both has many aspects, both philosophical and technical, that have been left out here. However, it must be emphasized that for the main part Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest and a rehabilitation of the concept model is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques ranging from logistic regression to ensemble techniques, random forest, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective and the variation in the results of the various techniques usually substantial, SL uses a weighted combination of their predictions, with the weights chosen by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. 
The statistical inference is then completed by calculating standard errors on the basis of “influence-curve theory” or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher’s unshakable insight that randomness is intrinsic, which implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data, nor full census research, nor any other attempt to take into account the whole of reality or a world encoded or encrypted in data can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integral parts of any concept of Data Science.
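As a rough illustration of the two-step procedure described above, the following minimal sketch estimates an average treatment effect on simulated data and computes an influence-curve-based standard error. It rests on several assumptions that are not part of the original text: a binary treatment `A`, a binary outcome `Y`, two baseline covariates `W`, and a simulated data-generating mechanism chosen purely for illustration. For brevity, a plain logistic regression stands in for the super learner as the initial estimator, and the helper names (`fit_logistic`, `tmle_ate`) are hypothetical. It is a sketch, not a reference implementation.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, n_iter=50):
    """Logistic regression by Newton-Raphson, with an optional fixed offset."""
    n, d = X.shape
    off = np.zeros(n) if offset is None else offset
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = expit(X @ beta + off)
        grad = X.T @ (y - p)
        # Tiny ridge term only for numerical stability of the solve.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(d)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def tmle_ate(W, A, Y, bound=0.01):
    """Sketch of TMLE for the average treatment effect (binary A and Y)."""
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)

    # Step 1: initial estimators (logistic regression used here as a
    # stand-in for the super learner; this is an assumption of the sketch).
    bQ = fit_logistic(np.column_stack([ones, A, W]), Y)
    Q1 = np.clip(expit(np.column_stack([ones, ones, W]) @ bQ), 1e-6, 1 - 1e-6)
    Q0 = np.clip(expit(np.column_stack([ones, zeros, W]) @ bQ), 1e-6, 1 - 1e-6)
    QA = np.where(A == 1, Q1, Q0)
    Xg = np.column_stack([ones, W])
    g1 = np.clip(expit(Xg @ fit_logistic(Xg, A)), bound, 1 - bound)

    # Step 2: targeting. Fit the fluctuation parameter eps of a
    # one-dimensional logistic submodel through the initial fit.
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)
    HA = np.where(A == 1, H1, H0)
    eps = fit_logistic(HA[:, None], Y, offset=logit(QA))[0]
    Q1s = expit(logit(Q1) + eps * H1)
    Q0s = expit(logit(Q0) + eps * H0)
    QAs = np.where(A == 1, Q1s, Q0s)

    # Plug-in (substitution) estimate and influence-curve standard error.
    psi = np.mean(Q1s - Q0s)
    ic = HA * (Y - QAs) + (Q1s - Q0s) - psi
    se = np.std(ic, ddof=1) / np.sqrt(n)
    return psi, se, ic

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.5 * W[:, 0] - 0.3 * W[:, 1]))
Y = rng.binomial(1, expit(-0.5 + A + 0.7 * W[:, 0]))

psi, se, ic = tmle_ate(W, A, Y)
print(f"ATE estimate: {psi:.3f}")
print(f"95% CI: ({psi - 1.96 * se:.3f}, {psi + 1.96 * se:.3f})")
```

In this sketch the targeting step is what makes the empirical mean of the estimated influence curve (numerically) zero, which is the property that justifies using its sample variance for the standard error. In practice the initial fits would come from a cross-validated library of learners rather than a single regression.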

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with

Targeted Learning was developed in response to high dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing not to be restricted by a priori specification of the target parameters of interest. Targeted Learning of data adaptive target parameters, discussed above, is therefore a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of building large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population since we sample all units of the target population. In these cases, it is important to document the connections between the units so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., sample size is 1), so that asymptotic theory for estimators based on influence curves and the state-of-the-art advances in weak convergence theory are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields such as computer science, statistics, probability theory, and scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory and corresponding education of our next generations and somehow think that it is just a matter of applying algorithms to data.

The authors declare that there is no conflict of interests regarding the publication of this paper.

The authors thank the reviewers for their very helpful comments which improved the paper substantially. This research was supported by an NIH Grant 2R01AI074345.