ARM
Association Rule Mining (ARM) is a data mining technique for finding interesting relationships among items in large databases. It is commonly used in market basket analysis, where it identifies items that occur together in transactions more often than chance would suggest.
The "If X, then Y" rule forms the foundation of ARM: X is the antecedent and Y is the consequent.
The second image illustrates how transactional data, such as grocery purchases, can be analyzed to find valuable patterns.
The most important ARM measures are support, confidence, and lift. Support is the fraction of transactions in which an itemset appears. Confidence measures a rule's reliability: the proportion of transactions containing the antecedent that also contain the consequent. Lift is the ratio of that confidence to the consequent's overall support, so it indicates how much more likely the consequent is when the antecedent is present than it is on its own. ARM-generated rules aid decision-making by exposing dependencies between variables. Apriori is among the most popular ARM algorithms; it iteratively discovers frequent itemsets and generates rules from them. It takes a "bottom-up" approach: it first scans the transactions to identify frequent individual items, then extends them into larger itemsets until no further frequent itemsets can be found. The second diagram shows this concept graphically: items that frequently co-occur (e.g., butter and bread) can be mined to generate association rules.
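The measures and the bottom-up search described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in the project: the toy baskets, the function names (support, apriori, rules_from), and the thresholds are all assumptions made for the example.

```python
from itertools import combinations

# Toy grocery baskets (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting `min_support`."""
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]  # start with single items
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent enough.
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune
        # any candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

def rules_from(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, s, confidence, lift))
    return rules

frequent_itemsets = apriori(transactions, min_support=0.4)
for a, c, s, conf, lift in rules_from(frequent_itemsets, min_confidence=0.7):
    print(sorted(a), "->", sorted(c),
          f"support={s:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline frequency; a lift below 1 means the two items appear together less often than independence would predict.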
ARM is used in this project to identify patterns in movies based on attributes such as genres, directors, and MPA ratings. Once the movie details are transformed into a transactional format, association rules help us identify relationships among items, such as how likely a particular genre is to be well rated or which directors are linked to a given type of film. The Apriori algorithm helps extract meaningful relationships that are useful for predicting movie success and understanding viewer preferences. The insights ARM provides into the dependencies among movie attributes can support better marketing, recommendation, and production-planning decisions.
Before reformatting the data for Association Rule Mining (ARM), the dataset contained structured categorical data in the form of genres, MPA ratings, and directors. Each record in the original dataset was a movie whose attribute lists were stored as string representations. A movie could have several genres and several directors alongside its MPA rating, so the data was not yet in a transactional format. The data was tabular, with a separate column for each categorical attribute, whereas ARM requires each row to be a single transaction composed of categorical items.
The data was then converted into a transactional format in which each movie's genres, directors, and MPA rating were combined into a single list, so that each movie is treated as one transaction. To prepare the data, unnecessary characters were stripped, and only a limited number of genres and directors were retained to keep transactions concise. The lists were then alphabetized for consistency. This transformation allowed the Apriori algorithm to treat each movie as a basket of items (its genres, MPA rating, and directors). The processed data was then used for ARM, where frequent itemsets and association rules were generated to uncover relationships between different elements in movies.
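The transformation described above might look roughly like the following sketch. The column names (genres, directors, mpa), the sample rows, the director names, and the caps MAX_GENRES and MAX_DIRECTORS are hypothetical; the source does not state the actual column names or how many genres and directors were kept.

```python
import ast

# Hypothetical raw rows: list-valued attributes are stored as strings.
raw_movies = [
    {"genres": "['Drama', 'Crime', 'Thriller', 'Mystery']",
     "directors": "['Jane Doe']",
     "mpa": "R"},
    {"genres": "['Animation', 'Adventure']",
     "directors": "['A. Director', 'B. Director', 'C. Director']",
     "mpa": " PG "},
]

MAX_GENRES = 3      # assumed cap, chosen only for this example
MAX_DIRECTORS = 2   # assumed cap, chosen only for this example

def parse_list(text):
    """Turn a string such as "['Drama', 'Crime']" into a list of clean strings."""
    try:
        values = ast.literal_eval(text)
        if not isinstance(values, (list, tuple)):
            values = [values]
    except (ValueError, SyntaxError):
        # Fall back to stripping the brackets and splitting on commas.
        values = text.strip("[]").split(",")
    cleaned = [str(v).strip(" '\"") for v in values]
    return [v for v in cleaned if v]

def to_transaction(row):
    """Combine one movie's attributes into a single alphabetized item list."""
    genres = parse_list(row["genres"])[:MAX_GENRES]
    directors = parse_list(row["directors"])[:MAX_DIRECTORS]
    items = genres + directors + [row["mpa"].strip()]
    return sorted(set(items))

transactions = [to_transaction(row) for row in raw_movies]
for t in transactions:
    print(t)
```

Each resulting list is one basket, so the output can be passed to any Apriori implementation that accepts a list of item lists.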
Ranking by support identifies the most frequent itemsets, that is, the combinations of genre and rating that appear most often in the data. One of the highest-support rules in this example is Drama → R, which indicates that Drama and an R rating appear together in a large share of movies. Support alone does not show that most dramas are R-rated; that is what confidence measures. Other high-support rules include Action → Adventure and Action → Not Rated, which indicate how frequently these attribute pairs occur together.
Ranked by confidence, the top 15 rules show which rules are most reliable in terms of conditional probability. The most confident rule is that a movie that is both Animation and PG has an 84% probability of also being Adventure. Rules such as Dark Comedy → Comedy and {Crime, Drama} → R reveal patterns in how genres relate to content ratings.
The top 15 rules by lift show the strongest associations, those in which the presence of the antecedent most increases the chance of the consequent. For example, the rule {Adventure, PG} → Animation has a high lift (11.17), meaning that a PG-rated adventure film is about 11 times more likely to be an animation than a film drawn at random from the dataset. In the same way, {Animation, PG} strongly predicts Adventure, which indicates that these genres and this rating naturally occur together.
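The three rankings discussed above amount to sorting the same rule set by a different measure each time. A minimal sketch follows; the Rule tuple, the top_rules helper, the placeholder item names, and every number are invented for illustration and are not results from the movie dataset.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: tuple
    consequent: tuple
    support: float
    confidence: float
    lift: float

# Placeholder rules with made-up values (not taken from the project).
rules = [
    Rule(("item_a",), ("item_b",), 0.30, 0.55, 1.2),
    Rule(("item_c", "item_d"), ("item_e",), 0.05, 0.90, 8.0),
    Rule(("item_d", "item_e"), ("item_c",), 0.05, 0.70, 10.0),
    Rule(("item_f",), ("item_g",), 0.12, 0.60, 2.5),
]

def top_rules(rules, metric, n=15):
    """Return the `n` rules with the highest value of `metric`."""
    return sorted(rules, key=lambda r: getattr(r, metric), reverse=True)[:n]

for metric in ("support", "confidence", "lift"):
    best = top_rules(rules, metric, n=1)[0]
    print(f"top by {metric}: {best.antecedent} -> {best.consequent}")
```

The same rule can rank very differently under each measure: a rule covering few movies (low support) can still have the highest lift, which is why the three rankings are read side by side.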
The network graph visualizes the top 15 rules by lift, showing the strongest co-occurrences among movie attributes such as genres, MPA ratings, and themes. Adventure and Animation are strongly linked to PG, reflecting how often movies in these genres receive that rating. Action films are also highly correlated with PG-13, a standard rating for the genre. Another clear trend links Horror and Mystery, two genres that are thematically close. Dark Comedy is highly correlated with the R rating, consistent with the mature themes common in that genre. Documentary and Biography films are frequently Not Rated, presumably because many are independent or festival films. These results reveal trends in film categorization that are valuable for recommendation engines, audience targeting, and content marketing, and that can improve classification and movie discovery.
Association rule mining in this project helps uncover hidden patterns in the movie dataset, revealing relationships among genres, ratings, and movie categories. By analyzing frequent co-occurrences, we can find trends that help identify what constitutes a successful movie in terms of viewers' expectations and ratings. This method can inform data-driven decisions on film recommendation sites, content strategy, and marketing by showing which attributes are most likely to co-occur in successful films.