“While we can predict house prices with accuracy, we cannot use such ML models to answer questions like whether one needs more dining rooms.”
Artificial intelligence has become a force in many fields. From supporting advances in health and education to bridging gaps through speech recognition and translation, machine intelligence is becoming more vital to us every day. Sendhil Mullainathan, a professor at the University of Chicago Booth School of Business, and Jann Spiess, an assistant professor at the Stanford Graduate School of Business, observed that machine learning, specifically supervised machine learning, is more empirical than procedural. For instance, face recognition algorithms do not apply rigid, hand-written rules about pixel patterns. Au contraire, these algorithms use large datasets of photographs to learn what a face looks like. In other words, the machine uses the images to estimate a function f(x) that predicts the presence (y) of a face from pixels (x).
Another discipline that relies heavily on empirical approaches is econometrics. Econometrics is the application of statistical procedures to economic data to provide empirical analysis of economic relationships. With machine learning already applied to data for tasks like forecasting, can empirical economists employ ML tools in their work?
New Methods For New Data
Today, we see a considerable change in what constitutes the data individuals can work with. Machine learning enables statisticians and analysts to work with data considered too high-dimensional for standard estimation methods, such as online posts and reviews, images, and language information. Statisticians could barely use such data types in procedures such as regression. In a 2016 study, however, researchers used images from Google Street View to measure block-level income in New York City and Boston. Moreover, a 2013 study developed a model that used online posts to predict the outcome of hygiene inspections. Thus, we see how machine learning can augment how we research today. Let’s look at this in further detail.
Traditional estimation methods, like ordinary least squares (OLS), are already used to make predictions. So how does ML fit into this? To see this, we return to Sendhil Mullainathan and Jann Spiess’ work, which was written in 2017, when the former taught and the latter was a PhD candidate at Harvard University. The paper took the example of predicting house prices, for which the authors selected ten thousand owner-occupied houses (chosen at random) from the 2011 American Housing Survey’s metropolitan sample. They included 150 variables on the house and its location, such as the number of bedrooms. They then used multiple tools (OLS and ML) to predict log unit values on a separate set of 41,808 housing units, which served as the out-of-sample test.
Applying OLS here requires deliberate choices about which variables to include in the regression. Adding every interaction between variables (e.g. between base area and the number of bedrooms) is not feasible because that would produce more regressors than data points. ML, however, searches for such interactions automatically. For instance, in regression trees, the prediction function takes the form of a tree that splits at each node on a single variable. Such methods allow researchers to work with a rich class of functions that includes interactions.
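The claim about regressors outnumbering data points can be checked with simple arithmetic. The sketch below assumes the 150 variables are treated as 150 numeric columns and counts only pairwise interactions; categorical variables or higher-order interactions would push the count higher still.

```python
from math import comb

n_houses = 10_000     # training sample size reported in the article
n_variables = 150     # house and location variables reported in the article

# Number of distinct pairwise interaction terms x_i * x_j with i < j.
pairwise_interactions = comb(n_variables, 2)

# Regressors if OLS used every main effect plus every pairwise interaction.
total_regressors = n_variables + pairwise_interactions

print(pairwise_interactions)            # 11175
print(total_regressors)                 # 11325
print(total_regressors > n_houses)      # True
```

With more regressors than observations, OLS has no unique solution, which is why the choice of interactions has to be curated by hand or delegated to a method that searches for them.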
One problem here is that a tree with this many interactions would overfit, i.e. it would fit the sample at hand so closely that it would predict poorly on other data sets. This problem can be addressed by regularisation. In the case of a regression tree, a depth must be chosen based on the tradeoff between a worse in-sample fit and less overfitting. The level of regularisation is selected by empirically tuning the ML algorithm, that is, by creating an out-of-sample experiment within the original sample.
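The tuning step described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: it uses synthetic one-variable data, a small hand-written regression tree, and a single hold-out split standing in for the "out-of-sample experiment within the original sample". All names (`fit_tree`, `predict`, `mse`) and the data-generating process are assumptions made for the example.

```python
import math
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(points, depth):
    """Greedy regression tree on (x, y) pairs; a leaf is the mean of y."""
    mean = sum(y for _, y in points) / len(points)
    if depth == 0 or len(points) < 4:
        return mean
    pts = sorted(points)
    best_cost, best_i = None, None
    for i in range(2, len(pts) - 1):          # keep at least two points per side
        if pts[i - 1][0] == pts[i][0]:
            continue                          # cannot split between equal x values
        cost = sse([y for _, y in pts[:i]]) + sse([y for _, y in pts[i:]])
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return mean
    threshold = (pts[best_i - 1][0] + pts[best_i][0]) / 2
    return (threshold,
            fit_tree(pts[:best_i], depth - 1),
            fit_tree(pts[best_i:], depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

def mse(tree, points):
    return sum((predict(tree, x) - y) ** 2 for x, y in points) / len(points)

# Synthetic sample (an assumption for illustration): a smooth signal plus noise.
rng = random.Random(0)
sample = []
for _ in range(300):
    x = rng.random()
    sample.append((x, math.sin(6 * x) + rng.gauss(0, 0.3)))

# Out-of-sample experiment inside the original sample: fit on one part,
# score each candidate depth on the held-out part.
fit_part, holdout_part = sample[:200], sample[200:]
results = {}
for depth in range(9):
    tree = fit_tree(fit_part, depth)
    results[depth] = (mse(tree, fit_part), mse(tree, holdout_part))
    print(depth, round(results[depth][0], 4), round(results[depth][1], 4))

best_depth = min(results, key=lambda d: results[d][1])
print("depth chosen by hold-out error:", best_depth)
```

The in-sample error can only fall as the tree gets deeper, so it cannot be used to pick the depth. The hold-out error is what exposes the point at which extra depth starts fitting noise.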
Thus, picking the ML-based prediction function involves two steps: selecting the best loss-minimising function at a given level of complexity, and finding the optimal level of complexity by empirically tuning it. Trees and their depths are just one such example. Mullainathan and Spiess stated that the technique would work with other ML tools such as neural networks. On their data, they also tested other ML methods, including forests and LASSO, and found them to outperform OLS (trees tuned by depth, however, were not more effective than traditional OLS). The best prediction performance came from an ensemble that combined several separate algorithms (the paper used LASSO, tree and forest). Thus, econometrics can guide design choices to help improve prediction quality.
There are, of course, a few problems associated with the use of ML here. The first is the lack of standard errors on the coefficients in ML approaches. Let’s see how this can be a problem. The Mullainathan-Spiess study randomly divided the sample of housing units into ten equal partitions and then re-estimated the LASSO predictor on each one (with the regulariser kept fixed). The results revealed a serious problem: a variable used by the LASSO model in one partition may be unused in another. There were very few stable patterns across the partitions.
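The partition exercise can be mimicked on synthetic data. The sketch below is an assumption-laden illustration, not the paper's code or data: it builds two strongly correlated predictors and one irrelevant one, splits the rows into ten equal partitions, and fits a LASSO with a fixed penalty on each using a small hand-written coordinate-descent routine (no intercept, since the simulated data has mean zero). Which variables end up selected depends on the random seed.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; exactly zero inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso(X, y, penalty, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + penalty * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)                       # equals y - Xb while b is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, penalty) / z
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new
    return beta

# Synthetic data (hypothetical): x1 and x2 share a common factor, x3 is noise.
rng = random.Random(1)
rows = []
for _ in range(1000):
    shared = rng.gauss(0, 1)
    x1 = shared + rng.gauss(0, 0.3)
    x2 = shared + rng.gauss(0, 0.3)
    x3 = rng.gauss(0, 1)
    rows.append(([x1, x2, x3], shared + rng.gauss(0, 1)))

rng.shuffle(rows)                            # random assignment to partitions
penalty = 0.2                                # regulariser kept fixed throughout
supports, coefficients = [], []
for k in range(10):
    part = rows[k * 100:(k + 1) * 100]
    beta = lasso([r[0] for r in part], [r[1] for r in part], penalty)
    coefficients.append(beta)
    supports.append(tuple(j for j, b in enumerate(beta) if b != 0.0))
    print(k, supports[-1], [round(b, 3) for b in beta])
```

Because x1 and x2 carry nearly the same information, a partition can lean on either one with little change in fit, so the set of selected variables is a poor guide to which variable "matters".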
This instability does not affect prediction accuracy much, but it means the fitted model cannot tell us which variables matter, for instance when two variables are highly correlated and can stand in for each other. In traditional estimation methods, such correlations show up as large standard errors. Due to this, while we can predict house prices accurately, we cannot use such ML models to answer questions like whether a variable, e.g. the number of dining rooms, is unimportant just because the LASSO regression did not use it. Regularisation also leads to problems: it favours less complex but potentially wrong models. It can also raise concerns about omitted variable bias.
Finally, it is essential to understand the type of problem ML solves. ML revolves around predicting an outcome y from variables x. However, many economic applications revolve around estimating a parameter β that underlies the relationship between x and y. ML algorithms are not built for this purpose. The danger here is taking an algorithm built to produce ŷ and presuming that its β̂ has the properties usually associated with estimation output.
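The distinction can be written out in symbols. This is a stylised sketch: the loss function L, the function class F and the linear form of the model are assumptions chosen for illustration, while x, y, β, ŷ and β̂ follow the text above.

```latex
% Prediction: pick the function whose predictions are closest to y on new data.
\[
  \hat{f} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}\big[ L\big(f(x),\, y\big) \big],
  \qquad \hat{y} = \hat{f}(x)
\]

% Estimation: posit a model and recover its parameter, with a measure of uncertainty.
\[
  y = x'\beta + \varepsilon,
  \qquad \text{object of interest: } \hat{\beta} \text{ and } \operatorname{se}(\hat{\beta})
\]
```

A method can do well on the first problem without its coefficients having any of the guarantees that the second problem requires.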
Still, ML does improve prediction, so one can benefit from it by looking for problems where improved predictions have immense applied value.
One such category is within the new kinds of data (language, images) mentioned earlier. Analysing such data involves prediction as a pre-processing step. This is particularly relevant when data on economic outcomes is missing. For example, a 2016 study trained a neural network to predict local economic outcomes from satellite data in five African countries. Economists can also use such ML methods in policy applications. An example provided in Mullainathan and Spiess’ paper was deciding which teacher to hire. This involves a prediction task (estimating the teacher’s value added) and helps make informed decisions. These tools, therefore, make it clear that AI and ML should not go unnoticed in today’s world.