**How Markets Fool Algorithms, and Us**

Data science used to be the domain of statisticians, scientists, and Wall Street quants, but thanks to the ubiquity of data and open source libraries, all of us can now develop powerful predictive models. Of course, these models also have the power to breed overconfidence, especially in the stock market, where data, models, and randomness abound.

The discipline of Statistics was designed to help us avoid getting fooled by data and models, but not all of us have the time or inclination for a rigorous study of Statistics. Fortunately, as the following simple simulations show, it's easy to get a feel for how misleading predictive models can be by training algorithms on random data.

**"Fooled By Randomness" - Nassim Taleb**

Curve fitting, or the process of finding the function that best tracks a series of data points, isn't always a bad thing. In domains like the game of chess, or face and voice recognition, curve fitting can be an efficient approach, and it's fundamentally what Deep Learning's neural networks do.

But curve fitting can be dangerously misleading when trying to predict future market behavior from models fitted to past market data since:

- most market activity is random noise
- people's behavior changes, from central bankers pursuing QE after 2008, to investors burned in a prior bear market deciding to remain in cash too long.
- successful algorithms change future market behavior via feedback loops, making the future perpetually uncertain for all algorithms. Much like the observer effect in physics, if algorithms observe some profitable pattern to buy low/sell high, they will exploit this "alpha" via trading until prices adjust and little to no opportunity remains (the alpha decay that happens continuously in trading).

**Part 1: Datasets with one Dependent Variable (Y), and one Independent Variable (X)**

Let's start with a simple linear regression of 100 Y values randomly chosen between 0.9 and 1.1 with an expected mean of 1. Linear regression's curve fitting algorithm finds the slope and intercept of the line that best approximates the data points. In other words, it finds the slope and intercept that minimizes the error between the Y points predicted by the line, and the observed Y values.

As can be seen in the graph's equation above, the slope of the red regression line is nearly zero, and the intercept is close to 1. But what is more noteworthy is the tiny value of R-Squared, which tells us that the proportion of the variance in the observed Ys that is explained by X is only 0.000007. In other words, the tiny R-Squared tells us to have very low confidence in this model's ability to predict the next value of Y when X=101.

So, it's not a useful model, but at least it didn't mislead us to use it to make predictions.
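The simulation above is easy to reproduce. Here is a minimal sketch in pure Python (the seed and therefore the exact figures are illustrative assumptions, not the exact numbers behind the graphs):

```python
import random

random.seed(42)  # any seed; the qualitative result is the same

# 100 Y values uniformly distributed in [0.9, 1.1], plotted against X = 1..100
n = 100
xs = list(range(1, n + 1))
ys = [random.uniform(0.9, 1.1) for _ in range(n)]

# Ordinary least squares: the slope and intercept that minimize squared error
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# R-Squared: the proportion of Y's variance explained by the fitted line
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.6f}  intercept={intercept:.4f}  R^2={r_squared:.6f}")
```

Whatever the seed, the slope should come out near zero, the intercept near 1, and R-Squared tiny: X explains almost none of Y's variability.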

However, if we increase the model's complexity and solution space by adding polynomial terms, or by adding more factors, then the algorithm will be able to find curves that track the data more closely, and the R-Squares will rise. For example, notice how adding an X-squared term in the next Polynomial Regression increased the R-Squared dramatically from 0.000007 to 0.1017, suggesting that 10% of the variability in Y is explained by X.

Increasing the polynomial order to 5 further increases the R-Squared to 0.1684. Such curve fitting might not be misleading if there were a persistent nonlinear relationship between Y and X, but this data is random.
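The rising R-Squares can be reproduced with a short sketch (this one assumes numpy is available and rescales X to [0, 1] for numerical stability; the seed and exact values are illustrative):

```python
import random

import numpy as np

random.seed(42)
n = 100
xs = np.linspace(0.01, 1.0, n)  # X rescaled to [0, 1] to keep polyfit well-conditioned
ys = np.array([random.uniform(0.9, 1.1) for _ in range(n)])

def poly_r_squared(degree):
    """Fit a least-squares polynomial of the given degree and return in-sample R^2."""
    coeffs = np.polyfit(xs, ys, degree)
    preds = np.polyval(coeffs, xs)
    ss_res = float(((ys - preds) ** 2).sum())
    ss_tot = float(((ys - ys.mean()) ** 2).sum())
    return 1 - ss_res / ss_tot

r2 = {d: poly_r_squared(d) for d in (1, 2, 5)}
for d, v in r2.items():
    print(f"degree {d}: R^2 = {v:.4f}")
```

Because each higher-degree polynomial nests the lower ones, the in-sample R-Squared can only rise as terms are added, even though the data is pure noise.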

In the following animation you can see how switching to more complex polynomial models increases Y prediction errors for X=101, while simultaneously increasing the confidence in the predictions as indicated by the higher R-Squares:

**Part 2: Data sets with Multiple Independent Variables (Xs)**

Most real-world phenomena are caused by multiple independent variables, or factors, so it usually makes sense to build multi-factor models. But adding factors also increases the risk of curve fitting, since each additional factor increases the model's solution space exponentially, and hence its ability to find a curve that better tracks the observed points, but not necessarily the *out-of-sample* points. This is especially true in the case of markets, with all their random noise.

The following animations show why you should be wary of curve fitting and model complexity when modeling data sets with a high degree of randomness. In these simulations the Independent and Dependent Variables were given 100 random values between -10 and +10 with expected means of 0. Then, multiple regressions were performed which generated optimized Slope values for each X factor, and these Slopes (or Weights) were used to plot the blue Multiple Regression line, alongside the observed Y points plotted in red.

First, here's the simplest model, which regresses Y against only 2 Independent Variables. Notice how low the R-Squares tend to be, indicating that the models do a poor job at explaining the variability in Y's value, and that they shouldn't be relied on for predicting future outcomes.

Next, the number of independent variables is increased to 16, providing the algorithm with a much larger solution space in the random noise to search for a better-fitting curve, and the R-Squares rise as a result:
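The same effect can be sketched with numpy (not the exact code behind the animations; the seed is an illustrative assumption). Regressing one random Y first on 2 of 16 random factors, then on all 16, shows the in-sample R-Squared rising with factor count:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 16

# 100 random values in [-10, 10] for Y and for each of 16 X factors
y = rng.uniform(-10, 10, n)
X = rng.uniform(-10, 10, (n, k))

def r_squared(num_factors):
    """Regress Y on the first num_factors columns of X (plus an intercept);
    return the in-sample R^2."""
    A = np.column_stack([np.ones(n), X[:, :num_factors]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

r2_2, r2_16 = r_squared(2), r_squared(16)
print(f"2 factors:  R^2 = {r2_2:.4f}")
print(f"16 factors: R^2 = {r2_16:.4f}")
```

Because the 16-factor model nests the 2-factor model, its in-sample R-Squared is guaranteed to be at least as high: the extra factors let the algorithm pattern-match deeper into the noise.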

**Part 3: Curve-Fitting Price and Compounded Return Data**

It's even easier to fool ourselves by curve-fitting price and compounded return data sets. Below the random Y-values and 16 X-factors were treated as returns and compounded independently. Then, the compounded Y values were regressed against the 16 compounded X-factors.

The models appear to explain Y's variability well, right?

Of course, since the data is random, **the curve-fitting algorithm is simply finding patterns in the noise that will not be repeated in the future.** Below, the data is extended for 30 points beyond the in-sample training data, which shows how worthless these curve-fitted models are for explaining out-of-sample data.

Hopefully this helps to highlight how easy it is for randomness to fool algorithms, and us. To me, the most sobering aspect of the above forecast is that most actual market activity is also random noise, implying that we can fool ourselves just as easily when backtesting with actual market data.
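This failure mode can be simulated in a few lines (a numpy sketch; the return range and seed are illustrative assumptions, not the article's exact parameters). Random returns are compounded into trending series, a 16-factor regression is fitted to the first 100 points, and the same fitted coefficients are then scored on the next 30:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, k = 100, 30, 16
total = n_train + n_test

# Treat random draws as per-period returns and compound them into level series
def compound(returns):
    return np.cumprod(1 + returns)

y = compound(rng.uniform(-0.05, 0.05, total))
X = np.column_stack([np.ones(total)] +
                    [compound(rng.uniform(-0.05, 0.05, total)) for _ in range(k)])

# Fit the multiple regression on the in-sample window only
beta, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)

def r_squared(A, target):
    """Score the frozen in-sample coefficients against a target window."""
    resid = target - A @ beta
    return 1 - float(resid @ resid) / float(((target - target.mean()) ** 2).sum())

print(f"in-sample R^2:     {r_squared(X[:n_train], y[:n_train]):.4f}")
print(f"out-of-sample R^2: {r_squared(X[n_train:], y[n_train:]):.4f}")
```

The in-sample fit looks impressive because compounded noise trends, so the regression finds spurious co-movement between the series; out-of-sample the fit typically collapses, often to a negative R-Squared, i.e. worse than simply predicting the mean.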

**Conclusions?**

- The simpler your market models, the better.
- When backtesting with market data, avoid curve fitting the entire data set.
- Backtest with a disciplined approach by building models on in-sample data, followed by testing them on out-of-sample data, and once tested, resist improving your models. Once out-of-sample data has been used it becomes in-sample, and **any additional improvements made to models or factors from that point on will essentially be curve fitting**, which may well degrade the model's ability to predict future outcomes while fueling overconfidence in the predictions.
- Avoid using Cross-Validation tools naively, like Scikit-Learn's `StratifiedShuffleSplit`; otherwise you will end up training your model on all the data, resulting in an ideal model (that is, if you can go backward in time).
- Be your own skeptic! There are many subtle ways the future, out-of-sample data can bleed into and taint the past, in-sample data, which can fool model developers who don't think critically enough about why their new model's Sharpe Ratio is so high in backtests. For example, if you build and test a multi-factor model on your out-of-sample data, you may well gain insight into how each factor independently performed out-of-sample. *How can you avoid using these out-of-sample insights when developing other models that incorporate some of these factors?*
- Human intelligence evolved to make sense of reality's complexity by finding patterns, and without our cognitively efficient heuristics we'd struggle to even walk across a room. But, like algorithms, we also find patterns in random noise, and we are very good at rationalizing why the patterns we perceive are true - so again, play your own devil's advocate, especially when backtests look promising.
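The cross-validation point above can be made concrete with a toy experiment (pure Python; the trend-plus-noise series and the mean-only "model" are illustrative assumptions). On trending data, a shuffled split scatters future points into the training set and makes the evaluation look far better than an honest chronological split:

```python
import random

random.seed(42)

# A trending series: think of a compounding price level over 100 periods
series = [t + random.uniform(-1.0, 1.0) for t in range(100)]

def mean_absolute_error(train, test):
    """Fit the simplest possible 'model' (predict the training mean), score on test."""
    prediction = sum(train) / len(train)
    return sum(abs(v - prediction) for v in test) / len(test)

# Honest chronological split: train on the first 70 points, test on the last 30
chrono_mae = mean_absolute_error(series[:70], series[70:])

# Naive shuffled split: future points leak into the training set
shuffled = series[:]
random.shuffle(shuffled)
shuffled_mae = mean_absolute_error(shuffled[:70], shuffled[70:])

print(f"chronological test MAE: {chrono_mae:.1f}")
print(f"shuffled test MAE:      {shuffled_mae:.1f}")
```

The shuffled evaluation reports a much smaller error because the training set already "knows" the future range of the series, which is exactly the optimism a naive cross-validation split produces on time-ordered data.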

Write an algorithm to search for faces in clouds, and if given enough noisy data, it will find them.

Wade Vagle, CFA, CAIA

Get in touch at Wade@SchoolsThatLast.com