S&P 500

Introduction

This is an analysis of S&P 500 companies and the correlations between their stock prices.

The analysis is based on the New York Stock Exchange dataset, Version 3 (license CC0: Public Domain), authored by Dominik Gawlik and posted on Kaggle.com 5 years ago.
The dataset contains a table of stock prices for 501 stocks in the S&P 500 index (a well-known stock index comprising some of the most important companies in the US), from 1/4/2010 to 12/30/2016. For each of the 501 stocks and up to 1762 working days in this interval, the table includes the opening price, the closing price, the maximum and minimum price for the day, and the transaction volume, for a total of 4,256,320 entries. A second table gives split-adjusted prices: when a stock splits, its quoted price changes abruptly, so adjusted prices are needed to keep each series continuous for comparison. In the split-adjusted table, 467 stocks have prices for the full time interval.
The companies are grouped into 11 sectors and 124 sub-industries. 

The analysis goals were threefold: understanding negative correlations, determining which stocks influence one another's prices, and exploring whether high correlations between stocks coincide with their classification by sectors and sub-industries.

Negative Correlations

Below is a diagram of all negative correlations among the S&P 500 stocks for the interval 2010-2016. 
As the diagram shows, 67 of the 501 stocks have at least one negative correlation. They fall into two clusters such that a stock in either cluster has negative correlations only with stocks from the other cluster; one additional stock has negative correlations with stocks from both clusters.
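The two-cluster structure described above amounts to the graph of negative correlations being almost bipartite. A minimal sketch of how one might test this, assuming the negative correlations are available as a list of stock pairs (the node names and the helper two_color below are hypothetical, not taken from the original analysis):

```python
from collections import deque

def two_color(edges):
    """Assign each node to side 0 or 1 by breadth-first search.

    Returns (side, conflicts), where conflicts lists the edges whose
    endpoints ended up on the same side (empty if the graph is bipartite).
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    side = {}
    for start in adjacency:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in side:
                    side[neighbor] = 1 - side[node]
                    queue.append(neighbor)

    conflicts = [(a, b) for a, b in edges if side[a] == side[b]]
    return side, conflicts

# Hypothetical toy graph: A and B form one cluster, X and Y the other,
# and M is negatively correlated with members of both clusters.
edges = [("A", "X"), ("A", "Y"), ("B", "X"), ("M", "A"), ("M", "X")]
side, conflicts = two_color(edges)
print(conflicts)  # edges that break the two-cluster split
```

In this toy graph the only conflicting edge involves M, which plays the role of the single stock connected to both clusters; removing M's edges leaves a graph with no conflicts.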

The correlations were computed between the stock prices' day gains or losses (the differences between closing and opening prices).
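As an illustration of this computation, here is a minimal sketch assuming the prices are held in two pandas tables with one row per trading day and one column per stock. The tickers and prices are made up, and the use of Pearson correlation is an assumption, since the text does not name the correlation coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices: rows are trading days, columns are tickers.
open_px = pd.DataFrame({
    "AAA": [10.0, 10.5, 10.2, 10.8, 11.0],
    "BBB": [20.0, 19.5, 19.9, 19.4, 19.0],
    "CCC": [5.0, 5.1, 5.0, 5.3, 5.2],
})
close_px = pd.DataFrame({
    "AAA": [10.4, 10.3, 10.7, 11.1, 10.9],
    "BBB": [19.6, 19.8, 19.5, 19.1, 19.2],
    "CCC": [5.1, 5.0, 5.2, 5.3, 5.1],
})

# Day gain or loss: closing price minus opening price.
day_gain = close_px - open_px

# Pairwise Pearson correlations between the stocks' day gains.
corr = day_gain.corr()

# Keep each unordered pair once (strict upper triangle); the other
# entries become NaN and fail the "< 0" comparison below.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
negative_pairs = pairs[pairs < 0]
print(negative_pairs)
```

The pairs left in negative_pairs would be the edges of a diagram like the one described here.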

Diagram of negative correlations between S&P stock prices, 2010-2016

The 12 companies in the smaller cluster (Citizens Financial Group, Navient, Synchrony Financial, CSRA, Fortive, Qorvo, WestRock (a packaging company), Envision Healthcare, Willis Towers Watson, TripAdvisor, Charles Schwab, and Michael Kors), together with Mallinckrodt, the stock that has negative correlations with both clusters, are mostly in financials, IT, consumer discretionary, and other tertiary industries.

The 54 companies in the larger cluster include almost all utilities (86% of them), one third of the consumer staples companies (including all of the tobacco companies), 27.5% of the real estate companies, some pharmaceutical companies (Allergan, Lilly, Pfizer), some energy companies, and others, like Deere (a heavy equipment manufacturer). Surprisingly, Netflix is also in this second cluster.

The opposition expressed by negative correlations mostly overlaps with the opposition between primary sector companies that provide something concrete and tertiary industry companies that provide advice or luxury goods.

More domain knowledge is needed for any further analysis.

Influences between Stocks

Below is a diagram of influences between S&P 500 stocks for 2010-2016.

Influencing was defined in a somewhat technical way: one stock was said to influence another when the correlation between its price change on a given day and the other stock's price change on the next day was statistically significant (judged by comparing the p-value with various reference values). Other definitions are possible and are left for a future study.
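A minimal sketch of a next-day correlation test of this kind, using scipy.stats.pearsonr on synthetic data. The function name lagged_influence, the threshold alpha=0.001, and the synthetic series are all assumptions made for illustration; the reference values actually used in the analysis are not stated.

```python
import numpy as np
from scipy.stats import pearsonr

def lagged_influence(leader, follower, alpha=0.001):
    """Correlate leader's change on day t with follower's change on day t+1.

    Returns (r, p, significant); alpha is a hypothetical reference value.
    """
    r, p = pearsonr(leader[:-1], follower[1:])
    return r, p, bool(p < alpha)

# Synthetic daily price changes: follower echoes leader one day later,
# plus noise; independent has no built-in link to leader.
rng = np.random.default_rng(0)
n_days = 500
leader = rng.normal(size=n_days)
follower = np.empty(n_days)
follower[0] = rng.normal()
follower[1:] = 0.5 * leader[:-1] + rng.normal(size=n_days - 1)
independent = rng.normal(size=n_days)

r_dep, p_dep, sig_dep = lagged_influence(leader, follower)
r_ind, p_ind, sig_ind = lagged_influence(leader, independent)
print(r_dep, p_dep, sig_dep)
print(r_ind, p_ind, sig_ind)
```

With several hundred stocks there are a very large number of ordered pairs to test, so a strict threshold (or a multiple-testing correction) would be needed to keep spurious influences rare.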

There are 37 stocks that influence others, 32 stocks that are influenced, and, since there is some overlap, 52 stocks that either influence or are influenced (so 17 do both), out of 467 stocks for which we have complete data.

Influences between S&P 500 stocks, 2010-2016

There are two pairs of stocks, one influencing the other, with no connections to other stocks. All the other 48 stocks influence or are influenced by each other.

One stock influencing another

The stock price of CF Industries (a producer of fertilizers) influences that of Tyson Foods.

A stranger pairing, perhaps spurious, is between Humana (a healthcare company) and Activision Blizzard (a computer games producer).

Structure of influencing

Among the other 48 stocks, the top-level influencers are Valero Energy, Cigna (a healthcare company), and Yahoo! (an Internet company).

These 3 stocks influence the stock prices of second-level companies such as SCANA, WEC Energy, American Electric Power, Altria, Edison Int’l, American Water Works Company, and lower-level companies such as Duke Energy, ConEd, and Realty Income Corp. The second level also includes Applied Materials, BNY Mellon, Citigroup, CMS Energy, Dominion Resources, Eversource Energy, Goldman Sachs, Martin Marietta Materials, Micron Technology, PG&E, Royal Caribbean Cruises, State Street, and Xcel Energy, for a total of 19 second-level companies. Among these 19 companies are 10 utilities, 4 financial companies (including 3 banks), and 2 semiconductor manufacturers.

In turn, these stocks influence the prices of the next cluster (count=15), consisting of 9 financial institutions, 4 utilities, a rental company, and Kohl’s. Within this cluster, all stock prices influence each other (directly or indirectly) as well, except for Kohl’s, Northern Trust, and NextEra Energy, which are separate.

On the lowest (fourth) level are the 11 companies whose stock prices are influenced by the above. This level contains 4 real estate companies, 5 financial companies, a prescription benefit management company, and a media company (Viacom).

Prolific influencers

The top influencers by count are utilities, starting with ConEd (which influences the most other stocks, 14) and the American Water Works Company. The top influenced company by count is ConEd again, followed by Comerica, KeyBank, and Duke Energy (2 of which are financial institutions). 

The structure of influences among S&P 500 stocks offers an unusual picture of the US economy. Utilities, financial companies, and real estate companies seem to be the most sensitive to each other’s price movements.

S&P 500 stocks by sector

Clustering and Classifications

I applied clustering methods based on either the correlation between stocks (the r-value) or the significance of the correlation (the p-value for the correlation being nonzero), with a variable number of clusters. The results were extremely similar for the two similarity measures.
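The text does not say which clustering methods were used. As one possible illustration, the sketch below applies average-linkage hierarchical clustering from SciPy to the distance 1 - r on synthetic series; the choice of algorithm, the distance, and the data are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic daily changes: 4 series driven by one common factor and
# 3 series driven by another, each with its own noise.
rng = np.random.default_rng(1)
n_days = 400
factor_a = rng.normal(size=n_days)
factor_b = rng.normal(size=n_days)
series = np.array(
    [factor_a + 0.5 * rng.normal(size=n_days) for _ in range(4)]
    + [factor_b + 0.5 * rng.normal(size=n_days) for _ in range(3)]
)

# Turn correlations into distances: highly correlated series are close.
corr = np.corrcoef(series)
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)  # remove rounding noise on the diagonal

# Hierarchical clustering on the condensed distance matrix, cut so
# that at most n_clusters groups remain.
n_clusters = 2
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=n_clusters, criterion="maxclust")
print(labels)
```

Varying n_clusters corresponds to the variable number of clusters mentioned above; a p-value-based similarity could be substituted for 1 - r in the same way.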

When grouping stocks into just 2 clusters, one of them consisted of all the utilities and the other contained all the other S&P 500 stocks. This is a sign that the clusters largely overlap with the official classification of stocks into sectors and sub-industries (and that utilities behave differently from all other S&P 500 stocks).

I used two measures for the fit between the clusters I obtained and the given classifications: adjusted mutual information (henceforth AMI, better when close to 1) and the p-value of the chi-square test for independence of the categories (better when close to 0).
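Both measures are available in common Python libraries. The following is a minimal sketch using scikit-learn's adjusted_mutual_info_score and SciPy's chi2_contingency on made-up labels; the helper fit_scores and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import adjusted_mutual_info_score

def fit_scores(class_labels, cluster_labels):
    """Return (AMI, chi-square p-value) for two labelings of the same stocks."""
    ami = adjusted_mutual_info_score(class_labels, cluster_labels)
    # Contingency table: rows are classes, columns are clusters.
    table = pd.crosstab(pd.Series(class_labels, name="class"),
                        pd.Series(cluster_labels, name="cluster"))
    chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())
    return ami, p_value

# Hypothetical example: 60 stocks in 3 sectors, clustered almost
# (but not exactly) along sector lines.
sectors = ["Utilities"] * 20 + ["Financials"] * 20 + ["Energy"] * 20
clusters = [0] * 20 + [1] * 18 + [2] * 2 + [2] * 20

ami, p_value = fit_scores(sectors, clusters)
print(ami, p_value)
```

Note that the chi-square approximation is less reliable when many cells of the contingency table have small expected counts, which can happen with 124 sub-industries and many clusters.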

The p-value started at 1E-100 for sectors and 1E-40 for sub-industries with 2 clusters, and became indistinguishable from 0 in floating-point arithmetic at 5 clusters for the classification into sectors and at 10 clusters for the classification into sub-industries.

It follows that the stocks' natural groupings into clusters are highly correlated to the classifications (into sectors and sub-industries).

The AMI score started around 0.16 for sectors and 0.05 for sub-industries, for 2 clusters, grew rapidly, and reached a maximum of 0.67 at 10 clusters for sectors and of 0.56 at 52 clusters for sub-industries. Scores above 0.5 indicate a good fit, especially considering that the first two scores, 0.16 and 0.05, already correspond to a perfect fit: in the case of 2 clusters, every sector or sub-industry is completely contained in one of the clusters. The AMI score between the sectors and the sub-industries was also only 0.55, again in spite of a perfect overlap. Thus, AMI scores over 0.5 indicate a perfect or an almost perfect fit.
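The point that a perfectly nested labeling can still score far below 1 can be checked on synthetic labels. The sketch below assumes 11 equally sized sectors of 40 stocks each (made-up sizes) and 2 clusters, with one sector forming a cluster on its own:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical labels: 11 sectors of 40 stocks each; sector 0 is one
# cluster and all remaining sectors form the other cluster.
sectors = [s for s in range(11) for _ in range(40)]
clusters = [0 if s == 0 else 1 for s in sectors]

# Check that every sector is completely contained in a single cluster.
clusters_per_sector = {
    s: {c for s2, c in zip(sectors, clusters) if s2 == s}
    for s in set(sectors)
}
nested = all(len(cs) == 1 for cs in clusters_per_sector.values())

ami = adjusted_mutual_info_score(sectors, clusters)
print(nested, ami)
```

The score stays low because AMI is normalized by the entropies of both labelings: the 11-way sector labeling carries much more information than a 2-way split can capture, even when the split never cuts through a sector.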

Anomalous Stocks

Here I will focus on the models with 10 and 52 clusters (significant in view of their AMI scores) and describe anomalous stocks in these models: stocks that were clustered differently from the rest of their sector or sub-industry.

A list of such special stocks includes Yum! Brands, News Corp, Charter, Agilent, Teradata, Monster, First Solar, Northern Trust, Regions Financial, Host Hotels and Resorts, and Weyerhaeuser. All these companies are different from others in their sub-industries, either in scale or in focus.

The list also includes Amazon, Expedia, Garmin, Priceline, and TripAdvisor, all 5 of which fit better in the IT sector, instead of “Consumer Discretionary”. Newmont Mining, the only gold mining company in the index, apparently fits better with the energy sector than with the mining industry. Costco, Kroger, and Whole Foods fit better in the “Consumer Discretionary” sector than in the “Staples” sector (since they are perhaps too expensive to be staples).

The “Internet Software & Services”, “Consumer Finance”, "Restaurants", “Specialty Stores”, and “Oil & Gas Refining & Marketing & Transportation” sub-industries are split in two almost equal parts by clustering, so perhaps a new classification is needed in some of these cases.

The IT sector is split as well, with a large part that should rather be grouped with the industrial sector. The financial sector is also split roughly in half, with banks on one side (forming their own cluster) and insurance companies on the other (as part of a larger cluster).

More domain knowledge is needed to understand what makes these stocks special and why these sub-industries are evenly split between clusters. For example, for restaurant chains the distinction is between Chipotle and Yum! on one hand and McDonald's, Starbucks, and Darden on the other. The former are clustered together with some industrial stocks, while the latter behave more similarly to retail stores.

In each case, clustering leads to a more accurate and interesting classification of S&P 500 companies than the official classification.

Universal Features

The dataset and the distribution of the correlations between stocks present some universal features:

Correlations between all stocks from one day to the next have a non-normal distribution.
The p-values for these correlations being nonzero follow a power law (a Pareto distribution).
The same is true for the stocks' self-correlations from one day to the next. In fact, the self-correlations (the r values) have the exact same distribution as the general correlations (p=2.2E-6), but the p-values might not (p=0.12).

These observations reflect facts about correlations of random vectors. For the same-day correlations, on the other hand, the p-values follow a different law (Gaussian to a first approximation).

Distribution of day-to-day correlations

Distribution of p-values for day-to-day correlations

Distribution of day-to-day self-correlations

Distribution of p-values for day-to-day self-correlations
