Forecasting S&P500 Stock Returns with R

Disclaimer: I am not a data scientist, but I am an R, investing, and data science enthusiast.

1. Introduction

Is it possible to predict stock returns? We will focus on the S&P 500 stock index, which tracks the performance of roughly 500 of the largest U.S. companies, and look for historical patterns that could be used to predict its future returns.

Why predict S&P 500 returns? Millions of people are invested in U.S. stocks through pension plans and individual investment accounts. Many of these investors want to protect their investments by buying before prices rise and selling before prices fall.

We will perform the analysis in three broad steps:

  • Data preparation,
  • Model fitting, and
  • Evaluation of model performance

Data

  • S&P 500 index levels, converted to monthly returns, January 2012 - October 2021 (FRED symbol: SP500)

This data is available on the FRED website.

2. Initial Setup - Load the packages and dataset

# Load libraries
suppressPackageStartupMessages({
    
    # General purpose
    library(tidyverse) # for mutate(), select(), filter()
    library(tidyquant) # for ROC(), endpoints()
    library(data.table) # for fread()
    library(lubridate) # for days()
    library(magrittr) # for is_less_than()
    library(janitor) # for clean_names()
    library(kableExtra) # for kable()
    
    # Models
    library(tidymodels)
    library(feasts) # for STL()
    library(lmtest) # for bptest() for heteroskedasticity
    library(forecast) # for forecast()
    
    # Time series
    library(tsibble) # for as_tsibble()
    library(fable)  # for ARIMA(), ETS(), MEAN(), NAIVE(), SNAIVE(), TSLM()
    library(timetk) # for tk_ts(), tk_tbl()
    
    # Plotting
    library(ggthemes) # for theme_hc()
    library(formattable) # for formattable()
    
    # Data processing
    library(sweep) # for sw_tidy(), sw_glance(), sw_augment()
})


# Variables
symbols <- "SP500"

# Download data
# data <- symbols %>%
#   tq_get(get = "economic.data",
#          from = "2010-12-31", to = "2020-01-31")


# Load data
tickers <- readxl::read_xlsx("C:/Users/user/Desktop/Aaron/R/Projects/Fundamentals-Data/data/FRED Tickers.xlsx", sheet = 1, range = "A1:A300") %>%
      drop_na() 
    
# Load data 
data <- fread("C:/Users/user/Desktop/Aaron/R/Projects/Fundamentals-Data/data/keep/Prices from FRED (2021 11 15).csv") %>%
      as_tibble() %>%   
      mutate(date = ymd(date)) %>%  
      filter(symbol %in% symbols)   

3. Data Preparation and Preprocessing

Prepare the data

# Reformat the data
SP500_tbl <- data %>%
      complete(date = seq.Date(from = min(date), to = max(date), by = "days")) %>%
      arrange(date) %>%
      fill(price, symbol) %>%
      pivot_wider(names_from = symbol, values_from = price) %>%
      slice(endpoints(date, on = "month")) %>%
      mutate(ROC.SP500 = ROC(SP500)) %>%
      drop_na() %>%
      filter(date > "2011-12-31" & date <= "2021-10-31")

# Coerce the data to a time series
SP500_ts <- SP500_tbl %>%
    tk_ts(select = ROC.SP500, start = .$date[1] %>% as.yearmon(), frequency = 12)

# Coerce the data to a tsibble
SP500_tsi <- as_tsibble(SP500_ts, index = date) %>%
    rename(ROC.SP500 = value, date = index)

Univariate Analysis

Let’s look at the S&P 500 series as a univariate time series. Is it possible to predict future returns of the S&P 500 with reasonable accuracy just by looking at its own history over the (admittedly arbitrary) period from January 2012 through October 2021? Is there predictable seasonality, for example? Let’s take a look at the data in its original form and do some exploratory analysis.

# Line plot of returns
SP500_tbl %>%
    ggplot(aes(x = date, y = ROC.SP500)) + #, color = ROC.SP500 > mean(ROC.SP500))) +
    geom_line(show.legend = FALSE, color = "steelblue") +
    geom_hline(aes(yintercept = mean(ROC.SP500)), color = "firebrick2", 
               linetype = "dashed", size = 0.7) +
    labs(title = "S&P 500 monthly Returns, 2011-2019",
         subtitle = "(Mean return in red)",
         y = "Return", x = "") +
    scale_y_continuous(labels = scales::percent_format()) +    
    scale_x_date(date_labels = "%Y %b") +
    theme_minimal()

# Get summary data
summary(SP500_tbl)
##       date                SP500        ROC.SP500       
##  Min.   :2012-01-31   Min.   :1310   Min.   :-0.13367  
##  1st Qu.:2014-07-07   1st Qu.:1925   1st Qu.:-0.00351  
##  Median :2016-12-15   Median :2219   Median : 0.01775  
##  Mean   :2016-12-14   Mean   :2441   Mean   : 0.01100  
##  3rd Qu.:2019-05-23   3rd Qu.:2910   3rd Qu.: 0.03282  
##  Max.   :2021-10-31   Max.   :4605   Max.   : 0.11942

The simple line plot doesn’t seem to reveal any clear trends or seasonality. Let’s look at distributions of returns grouped by year. The violin plot is useful here.

The mean monthly return over the period from January 2012 through October 2021 was about 1.1% (the red dashed line in the plot above).
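
As a quick check on that figure, we can compute the mean directly (a minimal sketch; note that ROC() defaults to continuously compounded returns, so a rough annualized figure is simply twelve times the monthly mean).

# Quick check of the average monthly return (sketch)
SP500_tbl %>%
    summarize(mean_monthly  = mean(ROC.SP500),
              approx_annual = 12 * mean(ROC.SP500))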

# Violin plot of returns    
SP500_tbl %>%
    mutate(yr = year(date)) %>%
    ggplot(aes(x = date, y = ROC.SP500)) +
    geom_violin(aes(group = yr), draw_quantiles = c(0.25, 0.5, 0.75), 
                show.legend = FALSE, fill = "lightblue2", alpha = 0.5) +
    geom_hline(yintercept = 0, color = "darkgrey", size = 0.7) +
    geom_hline(aes(yintercept = mean(ROC.SP500)), color = "firebrick2", 
               size = 0.7, linetype = "longdash") +
    labs(title = "Violin plots of Returns by Year",
         subtitle = "Mean in red",
         x = "", y = "Return") +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
    geom_point(aes(color = factor(yr)), size = 0.9,
               show.legend = FALSE) +
    theme_hc() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          panel.grid = element_blank()) +
    facet_grid(~ yr, scale = "free")

The violin plots show periods of higher and lower volatility: relatively wide return distributions early in the sample, narrower distributions in the middle years, and wide distributions again toward the end of the period, most dramatically in 2020 (which contains the largest monthly decline in the sample, about -13%). The range of returns (the distance from the bottom of a violin to its top) widens and narrows without any clear regularity; no obvious pattern of volatility stands out.
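
One way to make this volatility pattern more concrete is a rolling standard deviation of returns. Below is a minimal sketch; the 12-month window is an arbitrary choice, and rollapply() comes from the zoo package, which is attached with the tidyquant stack.

# Rolling 12-month standard deviation of monthly returns (sketch)
SP500_tbl %>%
    mutate(roll_sd = zoo::rollapply(ROC.SP500, width = 12, FUN = sd,
                                    fill = NA, align = "right")) %>%
    ggplot(aes(x = date, y = roll_sd)) +
    geom_line(color = "steelblue") +
    scale_y_continuous(labels = scales::percent_format()) +
    labs(title = "Rolling 12-month volatility of S&P 500 returns",
         x = "", y = "Rolling standard deviation") +
    theme_minimal()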

Let’s look at some more distributions of returns, but this time grouped by month. For example, the distribution for January includes the January returns from each year (2012-2021).

# Get summary data
SP500_tbl %>%
      group_by(month = month(date)) %>%
      summarize(mean = mean(ROC.SP500),
                median = median(ROC.SP500),
                stdev = sd(ROC.SP500),
                downdev = DownsideDeviation(ROC.SP500, MAR = 0)) %>%
      mutate(mean_to_sd = mean / stdev,
             mean_to_dd = mean / downdev) %>%
      mutate_all(~ round(., 3)) %>%
      kable() %>%
      kable_styling("striped")
 month    mean  median   stdev  downdev  mean_to_sd  mean_to_dd
     1   0.011   0.008   0.044    0.023       0.245       0.473
     2   0.011   0.028   0.044    0.031       0.242       0.348
     3   0.002   0.012   0.055    0.043       0.032       0.040
     4   0.025   0.009   0.038    0.002       0.660      10.446
     5   0.002   0.013   0.037    0.030       0.045       0.057
     6   0.014   0.012   0.026    0.008       0.541       1.683
     7   0.024   0.021   0.020    0.005       1.228       5.075
     8   0.007   0.010   0.038    0.023       0.176       0.286
     9  -0.004   0.002   0.028    0.022      -0.140      -0.174
    10   0.012   0.021   0.047    0.026       0.248       0.446
    11   0.030   0.028   0.030    0.000       1.011         Inf
    12   0.001   0.010   0.040    0.033       0.013       0.016

# Density plot of returns by month
SP500_tbl %>%
    group_by(month(date)) %>%
    mutate(medians = median(ROC.SP500),
           means = mean(ROC.SP500)) %>%
    ggplot(aes(x = ROC.SP500, group = month(date))) +
    geom_density(aes(x = ROC.SP500, fill = factor(months(date, abbr = TRUE), levels = month.abb)), show.legend = FALSE, alpha = 0.4) +
    labs(title = "Density Plot of Returns by Month",
         subtitle = "(Mean in red)\n(Median in green)",
         x = "Return", y = "Count") +
    facet_wrap(~ factor(months(date, abbr = TRUE), levels = month.abb), ncol = 4) +
    geom_vline(xintercept = 0) +
    geom_vline(aes(xintercept = medians), color = "green", 
               linetype = "longdash", size = 0.8) +
    geom_vline(aes(xintercept = means), color = "firebrick2", 
               linetype = "longdash", size = 0.8) +

    scale_x_continuous(labels = scales::percent_format()) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90),
          panel.grid = element_blank())

If we look at the ratio of average return to standard deviation for returns grouped by month, July and November stand out, with April and June also relatively strong. More importantly, the ratio of mean return to downside deviation is especially high for April, July, and November (June is also favorable). In other words, the return-to-risk trade-off appears to be especially good for investors in these months.

The density plots suggest that February, July, and November in particular experience a high proportion of returns greater than zero. This can be seen in the plots: the means and medians of these months’ returns lie well to the right of the zero line (in black), and their density curves peak on the positive side of it. This would suggest that stocks held in these months tend to experience positive monthly returns.

Let’s look at a calendar plot that shows returns across months and years.

# Calendar plot
SP500_tbl %>% 
    mutate(date = date - lubridate::days(1)) %>%
    mutate(month = month(date),
           year = year(date)) %>%
    mutate(bin = cut(ROC.SP500, breaks = c(-Inf, -0.02, 0, 0.02, Inf), labels = c("Below -2%", "(-2%, 0%)", "(0%, 2%)", "Above 2%"))) %>%
    ggplot(aes(x = factor(months(date, abbr = TRUE), levels = month.abb), y = substr(year, 1, 4), fill = bin)) +
    geom_tile(color = "white") +
    scale_fill_manual(values = c("firebrick2", "pink", "lightblue", "mediumturquoise")) +
    labs(title = "Calendar plot", subtitle = "S&P 500 Monthly Returns", 
         x = "", y = "", fill = "") +
    theme_hc() +
    theme(panel.grid.major = element_blank(),
          panel.border= element_blank(),
          axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "top",
          axis.ticks = element_blank())

A look at the calendar plot suggests that February, April, July, and November experienced positive returns in the large majority of the roughly ten years observed. Returns in these months appear to be consistently positive.
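
We can check this impression directly by counting, for each calendar month, how many of the observed months had a positive return (a short sketch using the same SP500_tbl).

# Share of positive returns by calendar month (sketch)
SP500_tbl %>%
    mutate(month = month(date, label = TRUE)) %>%
    group_by(month) %>%
    summarize(n_months       = n(),
              n_positive     = sum(ROC.SP500 > 0),
              share_positive = mean(ROC.SP500 > 0)) %>%
    arrange(desc(share_positive))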

# Breusch-Pagan test for heteroskedasticity
# - First create a linear model and use residuals to test for heteroskedasticity
lmMod <- tslm(SP500_ts ~ trend) # initial model
bptest(lmMod) %>%
    sw_glance() %>%
    pull(p.value) %>%
    is_less_than(0.05) %>%
    if_else("Heteroskedastic (alpha = 5%)", 
            "Not heteroskedastic (alpha = 5%)")
## [1] "Heteroskedastic (alpha = 5%)"

Looking at the S&P 500 monthly returns, most monthly returns over the period were positive. There also appeared to be stretches of high volatility followed by low volatility and then high volatility again. The Breusch-Pagan test supports this impression: it gave evidence that the series is heteroskedastic (i.e., that its volatility is not constant over time).
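
As an additional informal check (a sketch, not part of the analysis above), we can run a Ljung-Box test on the squared returns: significant autocorrelation in squared returns is a classic sign of volatility clustering. The choice of 12 lags is arbitrary.

# Ljung-Box test on squared returns as an informal check for volatility clustering
Box.test(SP500_ts^2, lag = 12, type = "Ljung") %>%
    sw_glance() %>%
    pull(p.value) %>%
    is_less_than(0.05) %>%
    if_else("Volatility clustering indicated (alpha = 5%)",
            "No volatility clustering indicated (alpha = 5%)")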

Let’s check for autocorrelation.

# Estimate the number of differences needed for stationarity
# (ndiffs() uses a KPSS unit-root test by default)
ndiffs(SP500_ts)
## [1] 0
# View the ACF and PACF plots
SP500_ts %>% 
    ggtsdisplay(theme = theme_minimal(), main = "S&P 500 returns monthly")

# Box Test for Autocorrelation
# - Test of whether any of a group of autocorrelations of a time series are different from zero. Instead of testing randomness at each distinct lag, it tests the "overall" randomness based on a number of lags.
Box.test(SP500_ts, type = "Ljung") %>%
    sw_glance() %>%
    pull(p.value) %>%
    is_less_than(0.05) %>%
    if_else("Autocorrelation exists (alpha = 5%).", 
            "No autocorrelation exists (alpha = 5%).")
## [1] "No autocorrelation exists (alpha = 5%)."

We checked for stationarity by estimating how many first differences would be needed to make the series stationary. The result was 0, so the returns series is likely already stationary (i.e., mean-reverting).

The autocorrelation function (ACF) plot shows no significant correlations with any of the prior 12 lagged months, and the same holds for the partial autocorrelation function (PACF) plot. The Ljung-Box test likewise found no evidence of autocorrelation in the series. Overall, the S&P 500 monthly returns series looks close to white noise (which would imply that the index level itself follows a random walk). To probe this further, we could take several time slices of the series and check whether the means and volatilities of each subseries match those of the entire series.
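
Here is a minimal sketch of that sub-period check, splitting the sample into four roughly equal time slices (the number of slices is arbitrary) and comparing their means and standard deviations.

# Compare means and volatilities across sub-periods (sketch)
SP500_tbl %>%
    mutate(slice = ntile(date, 4)) %>%  # four roughly equal time slices
    group_by(slice) %>%
    summarize(start = min(date),
              end   = max(date),
              mean  = mean(ROC.SP500),
              stdev = sd(ROC.SP500))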

Let’s now check for seasonality. The S&P 500 returns appear to be close to white noise, but let’s decompose the returns and look at the seasonal component anyway. We’ll also test informally whether the seasonal component is useful for forecasting by building two ETS models: one with a seasonal component and one without. If the RMSE of the model with the seasonal component is meaningfully lower than that of the model without it, the seasonal component likely contributes to forecasting performance.

# Seasonal decomposition
SP500_tsi %>%
    model(STL(ROC.SP500 ~ season())) %>%
    components() %>%
    rename(original = ROC.SP500,
           seasonal = season_year) %>%
    clean_names() %>%
    select(-model) %>%
    pivot_longer(-c(date, season_adjust), names_to = "component") %>%
    arrange(factor(component, levels = unique(component)), date) %>% 
    ggplot(aes(x = date, y = value)) +
    geom_line() +
    scale_y_continuous(labels = scales::percent_format()) +
    facet_wrap(~ component) +
    labs(title = "Plot of Decomposed S&P 500 Returns",
         x = "", y = "") +
    theme_minimal()

# Informal test for significance of seasonality
# Exponential time series model with "Additive" Error, "None" trend, and "Additive" season types
SP500_tsi %>% 
    model(ETS(ROC.SP500 ~ season("A") + trend("N") + error("A"))) %>% 
    forecast::accuracy() %>% 
    pull(RMSE)
## [1] 0.03657014
# Exponential time series model with "Additive" Error, "None" trend, and "None" season types
SP500_tsi %>% 
    model(ETS(ROC.SP500 ~ season("N") + trend("N") + error("A"))) %>%
    forecast::accuracy() %>% 
    pull(RMSE)
## [1] 0.03775592

The in-sample RMSE of the ETS model with the seasonal component (0.0366) is only marginally lower than that of the model without it (0.0378). Since adding seasonal terms will almost always reduce in-sample error at least a little, such a small difference suggests the seasonal component adds little, if any, genuine forecasting value.
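
A more rigorous comparison would be out-of-sample. As a sketch (not part of the original workflow), the two ETS specifications could be compared on rolling one-step-ahead forecasts with stretch_tsibble(); the 60-month initial window is an arbitrary choice.

# Rolling one-step-ahead comparison of the two ETS specifications (sketch)
SP500_tsi %>%
    stretch_tsibble(.init = 60, .step = 1) %>%
    model(ets_seasonal  = ETS(ROC.SP500 ~ error("A") + trend("N") + season("A")),
          ets_no_season = ETS(ROC.SP500 ~ error("A") + trend("N") + season("N"))) %>%
    forecast(h = 1) %>%
    fabletools::accuracy(SP500_tsi) %>%
    select(.model, RMSE, MAE)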

Model - Univariate Time Series

Although there appears to be no seasonality or significant trend, let’s fit several forecast models and check our prediction results. We will fit the following models:

  • Naive (random walk) - This method uses the last observation as the forecast for the next period.

  • Naive with drift (random walk with drift) - This method uses the last observation plus the average historical change to create a forecast for the next period.

  • Mean (historical average) - This method simply uses the historical average as the forecast for the next period.

  • Seasonal naive - This method uses the last observed value from the same season (here, the same month one year earlier) as the forecast.

  • ARIMA (Autoregressive Integrated Moving Average) - This method models lags of the series as well as current and lagged errors to create forecasts.

  • TSLM (Time series linear model) - This method uses a linear regression with trend and seasonal terms to create forecasts.

  • ETS (Exponential smoothing: Error, Trend, Seasonal) - This method uses an exponential smoothing state space model to create forecasts.

  • Mixed - This method simply averages the forecasts from the seven models above.

# Split the time series into training and test sets
initial_time_split <- initial_time_split(SP500_tsi, prop = 0.8) 
training_data <- initial_time_split %>% training() 
test_data <- initial_time_split %>% testing()

SP500_recipe <- training_data %>%
    recipe(ROC.SP500 ~ .) %>%
        prep()

# Extract the data from the SP500_recipe object
SP500_training <- SP500_recipe %>% juice()

# Apply the recipe to the testing data
SP500_testing <- SP500_recipe %>% bake(test_data)


# Fit multiple time series models
fit <- SP500_training %>%
    model(
        # Naive, Random Walk Forecasts
        # Forecasts equal to last observed value (appropriate for many financial series)
        rw = RW(ROC.SP500),
        # Drift method
        # Forecasts equal to last value plus average change over series (appears as line)
        rw.drift = RW(ROC.SP500, drift = TRUE),
        # Forecasts equal to mean of historical data
        mean = MEAN(ROC.SP500),
        # Seasonal Naive
        # Forecasts equal to last value from same season
        snaive = SNAIVE(ROC.SP500 ~ lag("year")),
        # ARIMA
        # Forecasts based on lagged values of series as well as lagged errors
        arima = ARIMA(ROC.SP500),
        # TSLM (Time Series Linear Model)
        # Applies a trend, seasonal, and error terms to the data
        tslm = TSLM(ROC.SP500 ~ trend() + season()),
        # ETS (Exponential Time Series)
        # Uses an exponential model with trend and seasonality to create forecasts
        ets = ETS(ROC.SP500)) %>%
        mutate(mixed = (rw + rw.drift + mean + snaive + arima + tslm + ets) / 7)

# Optimal ARIMA model parameters
fit %>% select(arima) %>% report()
## Series: ROC.SP500 
## Model: ARIMA(0,0,1) w/ mean 
## 
## Coefficients:
##           ma1  constant
##       -0.3032    0.0092
## s.e.   0.1244    0.0022
## 
## sigma^2 estimated as 0.0009633:  log likelihood=194.01
## AIC=-382.01   AICc=-381.75   BIC=-374.38
# Create the forecasts
fcast <- fit %>% 
    forecast(h = nrow(SP500_testing))

# Plot the forecasts
fcast %>% 
    filter(.model %in% c("rw", "snaive", "rw.drift", "mean")) %>%
  autoplot(SP500_training) +
  labs(title = "Forecasts for S&P 500 monthly returns",
       x = "Year", y = "Monthly return") +
  guides(colour = guide_legend(title = "Forecast")) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_minimal() +
    facet_wrap(~ .model)

fcast %>% 
    filter(.model %in% c("arima", "tslm", "ets", "mixed")) %>%
  autoplot(SP500_training) +
  labs(title = "Forecasts for S&P 500 monthly returns",
       x = "Year", y = "Monthly return") +
  guides(colour = guide_legend(title = "Forecast")) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_minimal() +
    facet_wrap(~ .model)

# Model performance
forecast::accuracy(fcast, SP500_testing) %>% 
      arrange(desc(RMSE)) %>%
      mutate_if(is.numeric, ~ round(., 3)) %>%
      kable() %>%
      kable_styling(bootstrap_options = c("striped"),
                    full_width = FALSE)
.model    .type      ME   RMSE    MAE      MPE     MAPE  MASE  RMSSE    ACF1
snaive    Test    0.008  0.077  0.066  370.768  411.014   NaN    NaN  -0.025
tslm      Test    0.012  0.058  0.046  132.466  132.466   NaN    NaN   0.044
mixed     Test    0.006  0.057  0.044  150.533  151.580   NaN    NaN   0.044
arima     Test    0.008  0.056  0.043  107.178  112.863   NaN    NaN   0.006
ets       Test    0.008  0.055  0.043  106.764  112.474   NaN    NaN   0.006
mean      Test    0.008  0.055  0.043  106.882  112.837   NaN    NaN   0.006
rw        Test   -0.003  0.055  0.040  114.838  138.226   NaN    NaN   0.006
rw.drift  Test   -0.003  0.055  0.040  114.838  138.226   NaN    NaN   0.006

# Check residuals
interp.residuals <- fit %>% 
    residuals(type="response")

# Plot residuals
interp.residuals %>%
    ggplot(aes(x = as_date(date), y = .resid)) +
    geom_point() +
    geom_smooth(method = "loess", se = TRUE, level = 0.95) +
    scale_y_continuous(labels = scales::percent_format()) +
    scale_x_date(date_labels = "%Y") +
    facet_wrap(~ .model, ncol = 3) +
    labs(title = "Residuals plot", x = "", y = "Residual") +
    theme_hc()

# Plot density of residuals
interp.residuals %>%
    ggplot(aes(x = .resid)) + 
    geom_density(aes(color = .model), show.legend = FALSE) +
    scale_y_continuous(labels = scales::label_number(accuracy = 1)) +
    scale_x_continuous(labels = scales::percent_format()) +
    facet_wrap(~ .model, ncol = 3) +
    labs(title = "Density plots of residuals", x = "Residual", y = "Frequency") +
    theme_hc()

It appears that the Random Walk and the Random Walk with Drift models are essentially tied as the best of the eight models applied: they produced among the lowest RMSE and the lowest MAE on the out-of-sample test data. The performance metrics for these two models are virtually identical, perhaps because the estimated drift (the average historical change) was approximately zero.

Relatively speaking, the Random Walk and Random Walk with Drift models outperform the others, but are they useful for forecasting returns? Not really: an RMSE of about 5.5% (0.055) is not an acceptable forecast error when set against an average monthly return of roughly 1.1% over the sample period. It is also telling that the MAPE (Mean Absolute Percentage Error) for these two models is around 138%; perhaps a MAPE below 20% would be acceptable.
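
For context (a quick sketch), the test-set RMSE can be compared with the raw standard deviation of returns over the test period; when the two are similar, the models are doing little better than predicting a "typical" month.

# Standard deviation of returns over the test period, for comparison with RMSE
sd(SP500_testing$ROC.SP500)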

4. Conclusions

Based on the preceding analysis, it appears that S&P 500 monthly returns cannot be modeled effectively with the handful of forecast models employed here; the series contains too much randomness for these models to capture.

Other Considerations

Type of analysis

We framed this as a supervised learning problem with a numerical target (monthly returns), but we could have framed it as a classification problem instead. For example, we could have labeled each monthly return as a logical variable (positive return = TRUE, negative return = FALSE) and fit random forest and logistic regression models, among others.
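
For illustration only, here is a minimal sketch of that classification framing; the lagged-return and month-of-year features are arbitrary choices, not something examined above.

# Label months as positive/negative and fit a logistic regression (sketch)
class_data <- SP500_tbl %>%
    mutate(positive = as.integer(ROC.SP500 > 0),
           lag1     = lag(ROC.SP500, 1),
           lag2     = lag(ROC.SP500, 2),
           month    = factor(month(date))) %>%
    drop_na()

logit_fit <- glm(positive ~ lag1 + lag2 + month,
                 data = class_data, family = binomial())

summary(logit_fit)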

Limitations

While the results are interesting, they aren’t necessarily useful for several reasons:

  • Data Insufficiency - First, there doesn’t appear to be a sufficient amount of data. Ideally, we would use several decades of data spanning multiple economic cycles and policy regimes. Moreover, the breadth of the data (we used only the S&P 500 index) would need to cover a wider range of stocks before drawing such broad conclusions.

  • Transaction Costs - Another limitation is that transaction costs were not included in the monthly return calculations; in practice, they could meaningfully reduce returns and change the conclusions reached here.

  • Model Insufficiency - Seven forecast models (plus a simple average of them) were fit to the data, but many other forecasting approaches could be applied, and doing so could change the results and conclusions of the analysis.

  • Academic Findings - Many researchers have analyzed stock returns over longer horizons and broader groups of stocks (not just 500 of the largest) and have generally found that stock prices follow random walks (i.e., that returns behave like white noise).

Aaron Hardy
Data analyst, programmer, and investor

My research interests include programming, finance, dashboards, and algorithmic modeling.