Data Mining @UPM Sem I-2019/2020

This year, the Data Mining course is one of the pioneer courses for substitutional blended learning, in which, according to UPM's rules, up to 3 weeks of lectures can be delivered fully online. This is implemented through UPM's learning management platform, PutraBlast, which runs on Moodle version 3.8. This approach has given both the instructor (Associate Professor Dr. Nurfadhlina Mohd Sharef) and the students flexibility in carrying out teaching and learning activities.

Lessons have been conducted through various modes: face to face, blended and online. A flipped classroom approach and digital tools such as Flipgrid, Padlet, WhatsApp, PutraBlast and Quizizz are used to make the lessons engaging. This semester, emphasis has been placed on the students' innovation development skills to complement their theoretical understanding. Many peer-based learning activities were conducted, including celebrating diversity, where teams with varied demographic backgrounds were formed and given tasks that required learning about each other's cultures and backgrounds. One of the tasks was to discuss the train services in each of their countries. This exercise lets students identify the strengths and weaknesses of each service and so practise their analytical skills.

Among the memorable fun moments in the course was the visit to Petrosains, which was meant to inspire the students about the digital innovation solutions that stakeholders require. The students were inspired by the work of real-world data scientists, and the course is indebted to Pelontar XI, a basketball association in Putrajaya, which shared its game records. These records were used as the dataset for the assignment and project, for which students came up with a variety of wonderful data mining solutions. To deliver the specified tasks, the students had to familiarise themselves with the game of basketball, for example through https://www.breakthroughbasketball.com/stats/9_stats_basketball_coach_should_track.html. The direct engagement of the basketball association through a briefing, shared materials and attendance at the students' project presentations really boosted the students' confidence that they are capable of solving real-world problems. This hybrid of service learning and challenge-based learning suits the course well.

There are 2 assignments and 1 project in the course, besides a designed array of problem-based learning activities. Videos are the main way for students to demonstrate their acquired knowledge and skills, especially in using tools such as PowerBI and Python for data exploration, data analysis and machine learning model development. Industry datasets are used, mainly from Kaggle. Through the videos, students co-curate learning because their output can be shared with their peers, and making the videos public allows the students to contribute their knowledge to the wider public. The students are also trained to identify stakeholders' problems, craft experiment objectives, develop machine learning models and perform analysis that relates their data mining solutions to the problem.

Data Exploration

Data exploration, also known as descriptive data analysis, is the first step in developing a data mining solution. It is done before the machine learning models are developed so that the developer can learn about the attributes, the relationships among the attributes, and the relationships between the attributes and the target variable, and can identify the preprocessing needed. Applicable techniques include filling, replacing or removing missing data, data transformation and data discretization. To prepare students for this task (Assignment 1), they shared their understanding of data mining concepts using Flipgrid, worked in pairs for discussion, and were given a video demonstration by the instructor, besides references to the past semester's work. Students work in a team to produce the demonstration.
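The exploration steps described above could be sketched in pandas roughly as follows. This is only an illustration: the small table of game records, its column names (`points`, `turnovers`, `rebounds`, `won`) and the choice of median filling and three equal-width bins are all assumptions, not the actual Pelontar XI dataset or the students' method.

```python
import pandas as pd

# Hypothetical game records; column names and values are illustrative only.
games = pd.DataFrame({
    "points": [78, 85, None, 92, 66, 71],
    "turnovers": [14, 9, 11, None, 18, 15],
    "rebounds": [30, 41, 35, 44, 28, 33],
    "won": [0, 1, 1, 1, 0, 0],  # target variable
})

# Identify needed preprocessing: count missing values per attribute.
missing_counts = games.isna().sum()

# Fill missing values with each column's median (one of several options).
cleaned = games.fillna(games.median(numeric_only=True))

# Data transformation: min-max scale rebounds into the range [0, 1].
reb = cleaned["rebounds"]
cleaned["rebounds_scaled"] = (reb - reb.min()) / (reb.max() - reb.min())

# Data discretization: split points into three equal-width bins.
cleaned["points_level"] = pd.cut(
    cleaned["points"], bins=3, labels=["low", "mid", "high"]
)

# Relationship between each attribute and the target variable.
target_corr = cleaned[["points", "turnovers", "rebounds"]].corrwith(cleaned["won"])

print(missing_counts)
print(cleaned)
print(target_corr)
```

Removing the incomplete rows with `games.dropna()` would be the alternative to filling; which one is appropriate depends on how much data is missing and why.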

Data Analysis

This is the second step of the data mining task, which allows deeper exploration of the dataset. More complex visualisations are used, and students are asked to apply data similarity and clustering techniques to identify correlated behavior. Predictive data analytics is then performed in Assignment 2 and the Project. Students are asked to identify a dataset from Kaggle and develop a data mining solution for it. They have to craft a problem statement for suitable stakeholders, demonstrate data analysis techniques and develop data mining solutions. Tools such as PowerBI, RapidMiner and Python are used, including libraries such as NumPy, pandas and Bokeh. Students are also encouraged to develop simulator models (for example https://demos.datasciencedojo.com/demo/titanic/), which can be built using Python Flask or RapidMiner Server. Dashboards can also be generated with PowerBI and Tableau.
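As a rough illustration of the data similarity and clustering step, the sketch below computes pairwise Euclidean distances with NumPy and runs a bare-bones k-means on them. The six 2-D points are made up, and starting the centroids from the first `k` rows is a simplification chosen to keep the example deterministic; a library implementation would normally use a better initialisation.

```python
import numpy as np


def euclidean_distances(a, b):
    """Pairwise Euclidean distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def k_means(points, k, n_iter=20):
    """Minimal k-means; centroids start at the first k rows (a simplification)."""
    centroids = points[:k].astype(float).copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = euclidean_distances(points, centroids).argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Made-up 2-D records forming two visibly separate groups.
points = np.array([[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]], dtype=float)

similarity = euclidean_distances(points, points)  # smaller distance = more similar
labels, centroids = k_means(points, k=2)

print(similarity.round(2))
print(labels)
print(centroids)
```

Records that end up in the same cluster can then be inspected together to see which attributes they share.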

Model Development

Once the data is preprocessed, machine learning models are developed according to the objectives specified earlier. Examples of models for descriptive analytics are association rules (e.g., FP-growth and Apriori) and clustering (e.g., k-means), while classification and prediction models serve predictive analytics (e.g., support vector machine, naive Bayes, neural network and decision tree). Various training and test splits can be tried, and the effect of feature selection can be investigated. For deep learning architectures, various hyperparameter settings can be specified to tune the model's performance. Results are reported on the testing dataset using metrics such as root mean squared error and accuracy. For clustering, the Davies-Bouldin index and the elbow method are used.
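The hold-out evaluation described above might look like the following NumPy sketch. The synthetic two-class data, the 30% test ratio and the helper names are assumptions made for illustration, and the nearest-centroid classifier is only a simple stand-in for the models named in this section (support vector machine, naive Bayes, neural network, decision tree).

```python
import numpy as np


def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error, used for numeric predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


class NearestCentroid:
    """Stand-in classifier: predict the class whose training mean is closest."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


# Synthetic data: two well-separated classes of 20 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)), rng.normal(5.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.3)
model = NearestCentroid().fit(X_train, y_train)
test_accuracy = accuracy(y_test, model.predict(X_test))

print("test accuracy:", test_accuracy)
print("example RMSE:", rmse([1, 2, 3], [1, 2, 5]))
```

Repeating the split with different `test_ratio` values, or after dropping some columns of `X`, is one simple way to compare training and test splits and to observe the effect of feature selection.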
