Association Rule Mining (ARM) is a technique for identifying relationships among items in a large dataset. It operates on a dataset made up of transactions, where each transaction is a set of items. The goal of ARM is to discover rules that predict the occurrence of an item based on the occurrence of other items within the same transaction. These rules are evaluated with three key metrics. Support measures the fraction of all transactions that contain every item in the rule. Confidence measures how often the rule's conclusion holds in the transactions that satisfy its conditions. Lift measures how much more often the conditions and conclusion occur together than would be expected if they were statistically independent.
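To make the three metrics concrete, the following is a minimal sketch that computes them by hand for a single rule. The function names and the five toy transactions are made up for illustration and are not taken from the project data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    cons = support(transactions, consequent)
    confidence = both / ante   # estimate of P(consequent | antecedent)
    lift = confidence / cons   # ratio to the baseline P(consequent)
    return both, confidence, lift


# Hypothetical toy transactions (each one is a set of terms).
transactions = [
    {"bitcoin", "halving", "miner"},
    {"bitcoin", "price"},
    {"wall", "street", "bitcoin"},
    {"wall", "street"},
    {"smart", "contract"},
]

sup, conf, lift = rule_metrics(transactions, {"wall"}, {"street"})
print(f"wall -> street: support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

In this toy data ‘wall’ and ‘street’ always appear together, so the confidence is 1, and the lift is the reciprocal of how often ‘street’ appears overall.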
To apply ARM, the first step is to combine and preprocess the data so that it is in a suitable format for the technique. This involves transforming the datasets into a list of transactions, where each transaction is a set of words. An ARM algorithm, Apriori, is then run on the data to identify frequent word sets within the transactions. This requires setting thresholds for support and confidence so that the discovered rules occur often enough to be meaningful and are reliable enough to be relevant. The final step is to generate association rules from the frequent word sets and evaluate them by their support, confidence, and lift. This analysis helps identify common themes or topics across the different platforms (Medium, News, and Reddit), which could provide insight into content trends, user interests, and the discourse surrounding blockchain.
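The steps above can be sketched in plain Python. This is a simplified illustration of the Apriori idea (level-wise candidate generation with pruning, followed by rule generation), not the implementation used for the results in this report. The names `apriori` and `generate_rules`, the toy transactions, and the toy support threshold of 0.4 are assumptions chosen so that the example fits a five-transaction dataset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for cand in current:
            sup = sum(cand <= t for t in transactions) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, sup_both in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                cons = itemset - ante
                conf = sup_both / frequent[ante]
                if conf >= min_confidence:
                    lift = conf / frequent[cons]
                    rules.append((set(ante), set(cons), sup_both, conf, lift))
    return rules


# Hypothetical toy transactions; each one is the set of words in a document.
toy_transactions = [
    {"smart", "contract", "ethereum"},
    {"smart", "contract"},
    {"bitcoin", "halving"},
    {"bitcoin", "price"},
    {"smart", "contract", "bitcoin"},
]
frequent_sets = apriori(toy_transactions, min_support=0.4)
rules = generate_rules(frequent_sets, min_confidence=0.5)
for ante, cons, sup, conf, lift in rules:
    print(sorted(ante), "->", sorted(cons), round(sup, 2), round(conf, 2), round(lift, 2))
```

The pruning step is what keeps Apriori tractable on text data: a word set can only be frequent if all of its subsets are frequent, so most candidate sets are discarded without ever being counted.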
As previously mentioned, ARM requires a specific data format in order for the algorithm to work; in this case, that format is unlabeled transaction data. To obtain this, the preprocessed, lemmatized data for each platform was used to create a list of all unique terms for each article or post, and each of these term lists becomes one transaction. Below is an image of the different sets of transaction data that were created:
NewsAPI, Reddit, and Medium Transaction Datasets for ARM
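The conversion from lemmatized text to transactions can be sketched as follows. The helper name `to_transaction` and the two sample documents are hypothetical, and the sketch assumes that each document is already a single string of space-separated lemmas.

```python
def to_transaction(lemmatized_text):
    """Return the unique terms of one document, in order of first appearance."""
    # dict.fromkeys drops repeated terms while preserving insertion order.
    return list(dict.fromkeys(lemmatized_text.split()))


# Hypothetical lemmatized documents (one string per article or post).
documents = [
    "bitcoin halving reduce miner reward bitcoin",
    "smart contract run ethereum smart contract",
]
doc_transactions = [to_transaction(doc) for doc in documents]
for transaction in doc_transactions:
    print(",".join(transaction))
```

Removing repeated terms matters because ARM only records whether a term is present in a transaction, not how many times it occurs.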
NewsAPI
Support Threshold: 0.02 (2% of transactions)
Confidence Threshold: 0.5 (50% probability)
The top rules by confidence include pairs of items with 100% confidence, meaning that whenever the antecedent appears in a transaction, the consequent appears as well. Notably, ‘wall’ → ‘street’ and ‘street’ → ‘wall’ form a perfect bi-directional association, which most likely reflects the phrase ‘Wall Street’ recurring in news articles. Lift values are particularly high for these rules (ranging up to 50), indicating that the items co-occur far more often than would be expected if they were independent. The coverage for these rules (the support of the antecedent alone) is equal to the rule support, which means the antecedent never appears without the consequent in the dataset and reinforces the strength of the relationship. Not every rule is this extreme: ‘halving’ → ‘bitcoin’ has a lift of approximately 4.22, indicating that ‘bitcoin’ is over 4 times more likely to appear in a transaction that contains ‘halving’ than in a transaction chosen at random.
Reddit
Support Threshold: 0.01 (1% of transactions)
Confidence Threshold: 0.5 (50% probability)
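The top rules in the Reddit dataset reveal a similar pattern, with pairs of items such as ‘arent’ → ‘us’ and ‘us’ → ‘arent’ also exhibiting a perfect confidence of 1. The lift values are even higher, reaching 100, which suggests extremely strong associations within the context of Reddit posts. Because lift is measured relative to how often the consequent appears on its own, a lift of 100 at a confidence of 1 also means that the terms involved are rare, appearing in only about 1% of transactions.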
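The maximum lift values are tied to the support thresholds. For a rule X → Y, lift equals confidence divided by the support of Y; confidence is at most 1, and the support of Y is at least the minimum support, so lift cannot exceed 1 / minimum support. The sketch below illustrates this with made-up transactions in which two terms co-occur in exactly 1 of 100 posts; the function names and data are hypothetical.

```python
def lift(transactions, x, y):
    """Lift of the rule x -> y for single items, computed from raw counts."""
    n = len(transactions)
    n_x = sum(x in t for t in transactions)
    n_y = sum(y in t for t in transactions)
    n_xy = sum(x in t and y in t for t in transactions)
    return n * n_xy / (n_x * n_y)


def max_lift(min_support):
    """Upper bound on lift for any rule that passes the support threshold."""
    return 1.0 / min_support


# Hypothetical data: 'a' and 'b' appear together once and nowhere else.
posts = [{"a", "b"}] + [{"c"} for _ in range(99)]

print("lift(a -> b):", lift(posts, "a", "b"))
print("lift(b -> a):", lift(posts, "b", "a"))  # lift is symmetric
print("bound at support 0.01:", max_lift(0.01))
print("bound at support 0.02:", max_lift(0.02))
```

Under this bound, the highest lifts reported for NewsAPI (50 at a support threshold of 0.02) and Reddit (100 at 0.01) are the largest values attainable at those thresholds, so they point to term pairs that sit right at the minimum support and always occur together.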
Medium
Support Threshold: 0.02 (2% of transactions)
Confidence Threshold: 0.5 (50% probability)
In the Medium dataset, ‘contract’ → ‘smart’ indicates that every transaction that includes ‘contract’ also includes ‘smart’, with a lift of over 38, suggesting a strong association in topics related to ‘smart contracts.’ The count values reflect the number of transactions that contain the association, providing a measure of frequency and reinforcing the importance of the rules that have a higher count.
Lastly, the combined dataset analysis, with a support of 0.01 and confidence of 0.5, also shows strong associations between certain pairs, such as ‘xrp’ → ‘flip’ and ‘flip’ → ‘xrp’ with a lift of almost 100. This might indicate discussions or articles about a particular event or feature specific to the XRP cryptocurrency.
In the context of security, the perfect bi-directional association between ‘wall’ and ‘street’ from the NewsAPI dataset, and similar patterns observed across other platforms, point towards an intense focus on the financial aspects of blockchain technology. This suggests a heightened awareness of, and concern for, the security of investments and transactions in blockchain spaces. The ethical implications here revolve around the responsibility of entities operating within these markets to ensure transparency, prevent fraud, and protect investors from the risks associated with blockchain technologies.
Moreover, the strong association between ‘halving’ and ‘bitcoin’ indicates a significant emphasis on cryptocurrency mechanisms and their implications for market behavior and security. Halving events, which reduce the reward for mining new blocks, can have profound implications for the security of the blockchain, potentially increasing its robustness against attacks but also raising concerns about miner incentivization and the long-term viability of the network.
The association between ‘contract’ and ‘smart’, particularly evident in the Medium dataset, emphasizes the growing importance of smart contracts in discussions around blockchain. These automated contracts promise to revolutionize traditional contract law, offering enhanced security through blockchain-based verification processes. However, they also introduce new ethical considerations regarding the coding and execution of legally binding agreements, the potential for exploits or unintended consequences, and the need for new frameworks to address these issues.
Finally, the lift values observed, especially in Reddit discussions, highlight the intensity and specificity of interests within communities engaged in blockchain-related discussions. These strong associations suggest a concentrated focus on certain topics or events, reflecting the community's perception of their importance to the security and ethical landscape of the blockchain ecosystem.
In conclusion, the exploration of association rules within datasets from NewsAPI, Reddit, and Medium has illuminated the intricate connections between blockchain technology, security concerns, and ethical debates. The patterns of discourse revealed through this analysis not only underscore the complexity of the blockchain ecosystem but also highlight the critical need for ongoing dialogue, research, and policy development to navigate the challenges and opportunities presented by this technology.
NewsAPI Network
Reddit Network
Medium Network