Do Popular Market Index Returns Follow a Normal Distribution?

Introduction

Many statistical tests assume normality within the data set.  In other words, before you perform any type of parametric test, one should first view the data sets and run experiments to test this assumption.  This article will show you how to pull several market indices’ historical returns and see if they follow a normal distribution using the Shapiro-Wilk test.

The main reason this article is testing the normal distribution validity of portfolio returns is to set up Monte Carlo simulations to project future retirement balances and their respective probabilities in Tableau.  

To view the interactive application and estimate your own retirement savings, click here.  

Sources: 

Market Index Background

We were able to pull the following historical index information using a convenient package in R called tidyquant.  We will test the normality of the various benchmarks listed below.

Dow Jones Industrial Average: DJI
“The Dow Jones Industrial Average, Dow Jones, or simply the Dow, is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States.” This historical information for this index includes data from Dec 1985 to Dec 2020.  For more information on the complete article source, please click here.

S&P 500: GSPC
“The S&P 500, or simply the S&P, is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices.” This historical information for this index includes data from Dec 1927 to Dec 2020.  For more information on the complete article source, please click here.

NASDAQ: IXIC
“The NASDAQ Composite is a stock market index that includes almost all stocks listed on the Nasdaq stock market. Along with the Dow Jones Industrial Average and S&P 500 Index, it is one of the three most-followed stock market indices in the United States.” This historical information for this index includes data from Dec 1971 to Dec 2020.  For more information on the complete article source, please click here.

Russell 2000: RUT
“The Russell 2000 Index is a small-cap stock market index of the smallest 2,000 stocks in the Russell 3000 Index. It was started by the Frank Russell Company in 1984. The index is maintained by FTSE Russell, a subsidiary of the London Stock Exchange Group.” This historical information for this index includes data from Dec 1987 to Dec 2020.  For more information on the complete article source, please click here.

Pull Historical Information Into R

Before we being, we need to load some libraries.

library(tidyverse)
library(lubridate)
library(ggpubr)
library(rstatix)
library(tidyquant)

We’ll set up some date parameters and pull the various index historical data using their respective ticker symbols.

from  <- today() - years(100)
from

benchmarks_tbl <- tq_get(c("^DJI", "^GSPC", "^IXIC", "^RUT"), 
                      get  = "stock.prices",
                      from = from) %>%
  group_by(symbol) %>%
  ungroup() %>%
  select(symbol, date, adjusted)

benchmarks_tbl %>% View()







In the next section, we create the beginning and end dates for each of the years to join the adjusted prices to the data set later in the code flow.

min_max_date_by_year_by_index <-benchmarks_tbl %>%
  mutate(benchmark_year = year(date)) %>%
  group_by(symbol, benchmark_year) %>%
  summarise(max_date = max(date),
            min_date_Y1 = min(date)) %>%
  ungroup() %>%
  select(symbol, benchmark_year, min_date_Y1, max_date) %>%
  mutate(min_date_Y2 = lag(max_date)) %>%
  select(symbol, benchmark_year, min_date_Y1, min_date_Y2, max_date)

Before adding this information to the adjusted prices, we also need to distinguish the index origin date from other start dates.  

rank_dates <- min_max_date_by_year_by_index %>%
  group_by(symbol) %>%
  mutate(date_rank = order(benchmark_year, decreasing=FALSE)) %>%
  
  mutate(min_date_final = case_when(
    date_rank == 1 ~ min_date_Y1,
    TRUE ~ min_date_Y2)
  ) %>%
  select(symbol, benchmark_year, date_rank, min_date_final, max_date)

rank_dates %>% View()








Finally, we can join the min_date_final and the max_date data columns back to our original table and calculate each year’s returns.

annual_returns_tbl <- rank_dates %>%
  left_join(benchmarks_tbl, by = c("symbol" = "symbol", "min_date_final" = "date")) %>%
  rename(open_adjusted = adjusted) %>%
  left_join(benchmarks_tbl, by = c("symbol" = "symbol", "max_date" = "date")) %>%
  rename(close_adjusted = adjusted) %>%
  mutate(annual_return = (close_adjusted/open_adjusted)-1) %>%
  mutate(symbol_desc = case_when(symbol == "^DJI"  ~ "Dow Jones",
                                 symbol == "^GSPC" ~ "S&P 500",
                                 symbol == "^IXIC" ~ "NASDAQ",
                                 symbol == "^RUT"  ~ "Russell 2000",
                                 TRUE ~ "Other")) %>%
  select(symbol, symbol_desc, everything())


Visualize Index Return Distribution Using geom_density

Now that our data set is ready, we can view the distribution of each of the data sets using ggplot. “The density plot provides a visual judgment about whether the distribution is bell-shaped” - Normality Tests in R.

annual_returns_tbl %>%
  ggplot(aes(x=annual_return, fill = symbol_desc)) +
  scale_fill_manual(values=c("#4f6980", "#849db1", "#a2ceaa", "#638b66")) +
  geom_density(alpha=0.9) +
  facet_wrap(~ symbol_desc) +
  theme_classic()

For the most part, at least, visually, each of the indices follows a normal distribution pattern.


  
Visualize Index Return Distribution Using ggqqplot

Another visualization test we can perform is called the qqplot.  The "QQ plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution.  A 45-degree reference line is also plotted. In a QQ plot, each observation is plotted as a single dot. If the data are normal, the dots should form a straight line” - Normality Tests in R.

ggqqplot(annual_returns_tbl, x = "annual_return",
         color = "symbol_desc", 
         palette = c("#4f6980", "#849db1", "#a2ceaa", "#638b66")) +
  facet_wrap(~ symbol_desc)















Other than a few dots, most of the data falls within the bands, indicating again, most of the data represents a normal distribution.

Test Significance of Normality Using shapiro_test

The last and probably the most robust test thus far is the Shapiro-Wilk test. “Shapiro-Wilk’s method is widely recommended for normality test, and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores (Ghasemi and Zahediasl 2012)” - Normality Tests in R.  The test setup and results can be viewed below.

annual_returns_stats_tbl <- annual_returns_tbl %>% 
  group_by(symbol_desc) %>%
  summarise(mean_return = mean(annual_return),
            st_dev_return = sd(annual_return),
            shaprio = shapiro_test(annual_return)) %>%
  ungroup()

annual_returns_stats_tbl





Remember, the NULL hypothesis for this test assumes that the data distribution is normal.  Since no p-values are < 0.05, we assume that all index returns follow a normal distribution return.

Final Thought and Conclusions

Based on the visualizations and the test results, we assume that all four index returns follow a normal distribution pattern.  Therefore, we can confidently use the mean and standard deviations to build our Monte Carlo simulation later in Tableau.  

To view the interactive visualization and estimate your retirement account balance, click here.

Popular posts from this blog

MySQL Part 1: Getting MySQL Set Up in goormIDE