This project explores the use of machine learning to extract simple logical rules that distinguish edible from poisonous mushrooms. The dataset consists of descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from the Audubon Society Field Guide to North American Mushrooms.
Each species in the dataset is labelled as either edible or poisonous (species of unknown edibility are grouped with the poisonous class). The guide emphasises that there is no straightforward rule for determining whether a mushroom is edible, unlike the "leaflets three, let it be" rule for poison oak and ivy. The objective is to investigate how well machine learning algorithms can extract simple logic, specifically logical disjunction (inclusive OR), which combines multiple statements and is true if at least one of them is true.
Linear separability is relevant to disjunctive logic: a disjunction of binary conditions can be represented by a linear decision boundary, so if the data points are linearly separable, a simple rule exists that classifies them accurately. The project demonstrates that the features 'odour' and 'spore_print_color' make it possible to tell whether a mushroom is poisonous. For instance, a foul-smelling mushroom with green spores is likely to be poisonous. The origin of these rules is not clear; they may have been inferred from human intuition and experience.
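The link between disjunction and linear separability can be illustrated with a minimal sketch. The weights and bias below are chosen by hand for illustration (they are not taken from the project), and the two inputs stand for hypothetical binary indicators such as "odour is foul" and "spore print is green".

```python
from itertools import product


def linear_or(x1, x2, w=(1.0, 1.0), b=-0.5):
    """Linear decision rule: predict True when w . x + b > 0.

    The weights and bias are illustrative assumptions, picked so that
    the boundary x1 + x2 = 0.5 separates (0, 0) from the other inputs.
    """
    return w[0] * x1 + w[1] * x2 + b > 0


# Check the linear rule against inclusive OR on the full truth table.
for x1, x2 in product([0, 1], repeat=2):
    assert linear_or(x1, x2) == bool(x1 or x2)
    print(x1, x2, linear_or(x1, x2))
```

Because a single line separates the one negative case from the three positive cases, a linear model such as logistic regression can in principle represent a rule of this form.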
In terms of data treatment, all nominal features are encoded, and a variant dataset applies ordinal encoding to the 'odour' feature only. The 'Veil_type' feature is removed. The project uses several models: logistic regression, support vector machines (SVM), naive Bayes, K-nearest neighbours (KNN), and decision trees. Models are scored on accuracy and the F1 score. Feature importance is determined from model weights under different regularisation penalties and from the 'feature_importances_' attribute of the Decision Tree Classifier (DTC). Feature selection is performed using recursive feature elimination (RFE) and SelectFromModel. Decision tree attributes such as depth and the number of nodes are also explored.
It is unclear whether the logical rule for the 'odour' feature in the mushroom dataset stems from human intuition. It can be inferred, however, that the rule is based on empirical observations of the physical characteristics of mushrooms and their edibility. Feature importance analyses with other models, such as random forest and gradient boosting, also indicate that 'odour' is one of the most important features for predicting mushroom edibility. Interestingly, with less sophisticated machine learning approaches, 'odour' does not rank high in feature importance, and 'gill size' is often ranked higher.
A decision tree model, by contrast, is able to identify the importance of the 'odour' feature, especially when it is encoded as ordinal. Identifying a similar relationship in the nominal dataset requires a more sophisticated screening of decision tree models, using multi-parameter grid search and cross-validation.
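A multi-parameter grid search with cross-validation over decision trees might look like the sketch below. The data are again synthetic, generated from an assumed rule (foul odour OR green spore print), and the parameter grid is a hypothetical example rather than the grid used in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the real mushroom dataset); values are assumed.
rng = np.random.default_rng(0)
n = 400
odour = rng.choice(["none", "almond", "foul"], size=n)
spore = rng.choice(["brown", "white", "green"], size=n)
X = OneHotEncoder().fit_transform(np.column_stack([odour, spore])).toarray()
y = ((odour == "foul") | (spore == "green")).astype(int)

# Hypothetical grid; the project's actual grid is not given in the text.
param_grid = {
    "max_depth": [1, 2, 3, None],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation (stratified for classifiers)
    scoring="f1",
)
search.fit(X, y)

best_tree = search.best_estimator_
print("best parameters:", search.best_params_)
print("mean cross-validated F1:", search.best_score_)
print("depth:", best_tree.get_depth())
print("number of nodes:", best_tree.tree_.node_count)
```

The depth and node count of the selected tree give a rough measure of how simple the learned rule is: a two-condition disjunction needs only two splits.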
Overall, the project investigates to what extent predictive models can construct simple logical rules for determining the edibility of mushrooms. The analysis suggests that, although the logical rules have no clear origin, they can be inferred from empirical observations, and that specific features such as 'odour' play a crucial role in distinguishing edible from poisonous mushrooms.
link: notebook page
link: html version of notebook