The analysis is based on the New York Stock Exchange dataset, Version 3 (license CC0: Public Domain), authored by Dominik Gawlik and posted on Kaggle.com 5 years ago.
The dataset contains a table of stock prices for 501 stocks in the S&P 500 index (a well-known stock index comprising some of the most important companies in the US), from 1/4/2010 to 12/30/2016. For each of the 501 stocks and up to 1762 working days in this interval, the table includes the opening price, the closing price, the maximum and minimum price for the day, and the transaction volume, for a total of 4,256,320 entries. A second table adjusts the prices for stock splits: a split changes the share price abruptly, so split-adjusted prices keep each series continuous and comparable over time. The split-adjusted table gives prices over the full time interval for 467 stocks.
The companies are grouped into 11 sectors and 124 sub-industries.
The correlations were computed between the stocks' daily gains or losses (the closing price minus the opening price for each day).
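The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the tickers AAA, BBB, and CCC and their prices are made up, and the helper names daily_gains and pearson_r are hypothetical.

```python
from math import sqrt

def daily_gains(opens, closes):
    """Day gain or loss: closing price minus opening price, one value per day."""
    return [c - o for o, c in zip(opens, closes)]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy prices, invented for illustration (not taken from the dataset).
prices = {
    "AAA": {"open": [10.0, 10.5, 10.2, 10.8], "close": [10.4, 10.3, 10.9, 10.7]},
    "BBB": {"open": [20.0, 20.6, 20.1, 21.0], "close": [20.8, 20.2, 21.5, 20.8]},
    "CCC": {"open": [5.0, 4.8, 5.1, 4.9], "close": [4.7, 5.0, 4.6, 5.0]},
}

# One series of daily gains per stock, then a correlation for every pair.
gains = {ticker: daily_gains(p["open"], p["close"]) for ticker, p in prices.items()}
corr = {(a, b): pearson_r(gains[a], gains[b]) for a in gains for b in gains}
```

In this toy data, AAA and BBB move together while CCC moves against both, so corr[("AAA", "CCC")] is negative. With the real dataset, the same pairwise computation would run over all 501 stocks.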
Diagram of negative correlations between S&P stock prices, 2010-2016
The smaller cluster contains 12 companies: Citizens Financial Group, Navient, Synchrony Financial, CSRA, Fortive, Qorvo, WestRock (a packaging company), Envision Healthcare, Willis Towers Watson, TripAdvisor, Charles Schwab, and Michael Kors. These companies, together with Mallinckrodt (the one stock negatively correlated with both clusters), are mostly in financials, IT, consumer discretionary, and other tertiary industries.
The companies in the larger cluster (count=54) include almost all utilities (86% of them), some consumer staples (one third of them, 100% of tobacco), some real estate companies (27.5% of them), some pharmaceutical companies (Allergan, Lilly, Pfizer), some energy companies, and others, like Deere (a heavy equipment manufacturer). Surprisingly, Netflix is also in this second cluster.
More domain knowledge is needed for any further analysis.
Influence was defined in a somewhat technical way: it takes into account the correlation between one stock's price change on a given day and another stock's price change on the next day, and it requires that correlation to be significant (judged by comparing its p-value with various reference values). Other definitions are possible and are left for a future study.
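One way to read this definition is sketched below. The article does not say which significance test or threshold was used, so everything here is an assumption: the p-value comes from the Fisher z-transform (a normal approximation, standing in for whatever test was actually applied), the threshold alpha=0.01 is arbitrary, and the function names are hypothetical. The two price-change series are synthetic.

```python
import random
from math import atanh, erfc, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lagged_correlation(leader, follower, lag=1):
    """Correlate leader[t] with follower[t + lag]; return (r, number of pairs)."""
    x = leader[:-lag]
    y = follower[lag:]
    return pearson_r(x, y), len(x)

def approx_p_value(r, n):
    """Two-sided p-value for r != 0, using the Fisher z-transform
    (normal approximation; an assumed stand-in for the article's test)."""
    if abs(r) >= 1.0:
        return 0.0
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

def influences(leader, follower, alpha=0.01, lag=1):
    """Assumed rule: the lagged correlation is significant at level alpha."""
    r, n = lagged_correlation(leader, follower, lag)
    return approx_p_value(r, n) < alpha

# Synthetic price changes: the follower partly echoes the leader one day later.
rng = random.Random(0)
leader = [rng.gauss(0, 1) for _ in range(200)]
follower = [0.0] + [0.5 * leader[t - 1] + rng.gauss(0, 1) for t in range(1, 200)]

r, n = lagged_correlation(leader, follower)
p = approx_p_value(r, n)
```

Because the follower is built to echo the leader with a one-day delay, influences(leader, follower) holds here. Applying the rule to every ordered pair of stocks would give a directed graph of influences.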
Influences between S&P 500 stocks, 2010-2016
A stranger pairing, perhaps spurious, is between Humana (a healthcare company) and Activision Blizzard (a computer games producer).
These 3 stocks influence the stock prices of second-level companies such as SCANA, WEC Energy, American Electric Power, Altria, Edison Int’l, and American Water Works Company, as well as lower-level companies such as Duke Energy, ConEd, and Realty Income Corp. The second level also includes Applied Materials, BNY Mellon, Citigroup, CMS Energy, Dominion Resources, Eversource Energy, Goldman Sachs, Martin Marietta Materials, Micron Technology, PG&E, Royal Caribbean Cruises, State Street, and Xcel Energy (total count=19). Among these 19 companies are 10 utilities, 4 financial companies (including 3 banks), and 2 semiconductor manufacturers.
In turn, these stocks influence the prices of the next cluster (count=15), consisting of 9 financial institutions, 4 utilities, a rental company, and Kohl’s. Within this cluster, all stock prices influence each other (directly or indirectly) as well, except for Kohl’s, Northern Trust, and NextEra Energy, which are separate.
On the lowest (fourth) level are the 11 companies whose stock prices are influenced by the above. This level contains 4 real estate companies, 5 financial companies, a prescription benefit management company, and a media company (Viacom).
The top influencers by count are utilities, starting with ConEd (which influences the most other stocks, 14) and the American Water Works Company. The top influenced company by count is ConEd again, followed by Comerica, KeyBank, and Duke Energy (2 of which are financial institutions).
S&P 500 stocks by sector
I applied clustering methods based on either the correlation between stocks (the r-value) or the significance of the correlation (the p-value for the correlation being nonzero), with a variable number of clusters. The results were extremely similar for the two similarity measures.
When grouping stocks into just 2 clusters, one of them consisted of all the utilities and the other contained all the other S&P 500 stocks. This is a sign that the clusters largely overlap with the official classification of stocks into sectors and sub-industries (and that utilities behave differently from all other S&P 500 stocks).
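The article does not name the clustering algorithm, so the sketch below uses average-linkage agglomerative clustering with distance 1 - r as an assumed stand-in. The tickers and correlation values are invented to mimic the pattern described above (two utilities that correlate strongly with each other and weakly with everything else), and the function names are hypothetical.

```python
def symmetric_corr(names, pairs):
    """Build a full symmetric correlation table from a dict of pairwise values.
    Pairs that are not listed default to 0."""
    corr = {a: {b: (1.0 if a == b else 0.0) for b in names} for a in names}
    for (a, b), r in pairs.items():
        corr[a][b] = corr[b][a] = r
    return corr

def agglomerative_clusters(names, corr, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - r."""
    clusters = [[name] for name in names]

    def distance(c1, c2):
        # Mean of 1 - r over all cross-cluster pairs.
        total = sum(1.0 - corr[a][b] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented example: U1, U2 stand for utilities, B1, B2 for banks, T1 for a tech stock.
names = ["U1", "U2", "B1", "B2", "T1"]
corr = symmetric_corr(names, {
    ("U1", "U2"): 0.8,
    ("B1", "B2"): 0.7,
    ("B1", "T1"): 0.4, ("B2", "T1"): 0.4,
    ("U1", "B1"): 0.1, ("U1", "B2"): 0.1,
    ("U2", "B1"): 0.1, ("U2", "B2"): 0.1,
})

two_clusters = agglomerative_clusters(names, corr, 2)
```

With these made-up values, the two-cluster result separates the utilities from the other three stocks. Clustering on p-values instead of r-values would only require a different distance function.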
I used two measures for the fit between the clusters I obtained and the given classifications: adjusted mutual information (henceforth AMI, better when close to 1) and the p-value of the chi-square test for independence of the categories (better when close to 0).
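As an illustration of the second measure, the sketch below builds the contingency table of clusters against sectors and computes the chi-square statistic with its degrees of freedom. The labels are invented and the function names are hypothetical. The p-value itself is not computed here: it is the upper-tail probability of the chi-square distribution at this statistic, which libraries such as SciPy (scipy.stats.chi2_contingency) provide, as scikit-learn provides adjusted_mutual_info_score for the AMI. Whether the analysis used those particular functions is an assumption.

```python
from collections import Counter

def contingency_table(clusters, sectors):
    """Rows are cluster labels, columns are sector labels, cells are stock counts."""
    counts = Counter(zip(clusters, sectors))
    rows = sorted(set(clusters))
    cols = sorted(set(sectors))
    return [[counts[(r, c)] for c in cols] for r in rows]

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for independence.
    Assumes every row and column total is nonzero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Invented labels: 8 stocks, 2 clusters, 3 sectors, each sector inside one cluster.
clusters = [0, 0, 0, 1, 1, 1, 1, 1]
sectors = ["Utilities"] * 3 + ["Financials"] * 3 + ["IT"] * 2

table = contingency_table(clusters, sectors)
stat, dof = chi_square_statistic(table)
```

In this toy case every sector sits entirely inside one cluster, so the statistic reaches its maximum for a two-row table, which equals the number of stocks (8).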
With 2 clusters, the p-value was about 1E-100 for sectors and 1E-40 for sub-industries. It became indistinguishable from 0 in floating-point arithmetic at 5 clusters for the classification into sectors and at 10 clusters for the classification into sub-industries.
It follows that the stocks' natural groupings into clusters are strongly associated with the official classifications (into sectors and sub-industries).
With 2 clusters, the AMI score started around 0.16 for sectors and 0.05 for sub-industries. It then grew rapidly, reaching a maximum of 0.67 at 10 clusters for sectors and of 0.56 at 52 clusters for sub-industries. Scores above 0.5 indicate a good fit, especially considering that the first two scores, 0.16 and 0.05, already correspond to a perfect fit: in the case of 2 clusters, every sector or sub-industry is completely contained in one of the clusters. The AMI score between the sectors and the sub-industries was also only 0.55, again in spite of a perfect overlap. Thus, AMI scores over 0.5 indicate a perfect or an almost perfect fit.
Here I will focus on the models with 10 and 52 clusters (significant in view of their AMI scores) and describe anomalous stocks in these models: stocks that were clustered differently from the rest of their sector or sub-industry.
The “Internet Software & Services”, “Consumer Finance”, “Restaurants”, “Specialty Stores”, and “Oil & Gas Refining & Marketing & Transportation” sub-industries are split in two almost equal parts by clustering, so perhaps a new classification is needed in some of these cases.
The IT sector is split as well, with a large part that should rather be grouped with the industrial sector. The financial sector is also split roughly in half, with banks on one side (forming their own cluster) and insurance companies on the other (as part of a larger cluster).
More domain knowledge is needed to understand what makes these stocks special and why these sub-industries are evenly split between clusters. For example, for restaurant chains the distinction is between Chipotle and Yum! on the one hand and McDonald's, Starbucks, and Darden on the other. The former are clustered together with some industrial stocks, while the latter behave more like retail stores.
In each case, clustering leads to a more accurate and interesting classification of S&P 500 companies than the official classification.
The dataset and the distribution of the correlations between stocks present some universal features:
Correlations between all stocks from one day to the next have a non-normal distribution.
The p-values for these correlations being nonzero follow a power law (a Pareto distribution).
The same is true for the stocks' self-correlations from one day to the next. In fact, the self-correlations (the r values) have the exact same distribution as the general correlations (p=2.2E-6), but the p-values might not (p=0.12).
These observations reflect facts about correlations of random vectors. For the same-day correlations, on the other hand, the p-values follow a different law (Gaussian to a first approximation).
Distribution of day-to-day correlations
Distribution of p-values for day-to-day correlations
Distribution of day-to-day self-correlations
Distribution of p-values for day-to-day self-correlations