LEARNING OBJECTIVES
KNOWLEDGE
explain how larger datasets are gathered or obtained, and how the quality of the data, i.e. its discursive power, is assessed.
describe how the reliability, predictive power and generalizability of the processed data and of its visualizations are assessed.
account for advanced data gathering and data processing techniques.
SKILLS
cleaning and preparing larger datasets for analysis.
writing simple code snippets in a scripting language, e.g. Python.
handling data in a scripting tool.
structured debugging and problem-solving during the scripting process.
visualizing data via relevant types of data diagrams.
COMPETENCES
analyzing datasets both with inductive/explorative approaches and with hypothesis-driven approaches.
assessing the quality of data, findings and data visualizations.
presenting transparent descriptions of applied data mining processes.
assessing the coherence of the data processing in relation to the result presented.
reflecting on the relation between data, findings and discourses.
Introduction to Scripting
Scripting refers to writing small programs or scripts that automate tasks or control other software applications. It is often used for repetitive tasks, such as data manipulation, system administration, and web development.
Key Characteristics:
Interpreted Languages: Scripting languages (e.g., Python, JavaScript, R, Bash) are usually interpreted, meaning they are executed line by line rather than compiled into machine code.
Automation: Scripting is ideal for automating workflows. For instance, automating data collection from multiple sources or cleaning datasets.
Efficiency: Scripts make it easier to perform complex operations with minimal human intervention.
Example in Techno-Anthropology:
In the field of techno-anthropology, scripting can be used to automate the collection of large datasets from social media platforms to analyze how users engage with technology. For example, using Python libraries such as BeautifulSoup or Selenium, one could scrape data from websites to examine user interaction patterns.
Tools and Languages:
Python: Ideal for scientific computing and data manipulation.
R: Widely used in data analysis and statistics.
Bash: Popular for task automation in Unix-based systems.
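To make the automation idea concrete, here is a minimal Python sketch; the records and the cleanup rules are invented for the example, but the pattern (normalize, deduplicate, keep order) is typical of scripted data cleaning:

```python
# Hypothetical raw records, e.g. names scraped from several sources.
raw_records = ["  Alice ", "BOB", "alice", "Carol", "bob "]

def normalize(records):
    """Strip whitespace, lower-case, and drop duplicates (order kept)."""
    seen = set()
    cleaned = []
    for r in records:
        name = r.strip().lower()
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

cleaned = normalize(raw_records)  # ['alice', 'bob', 'carol']
```

A few lines like these, run on thousands of records, are exactly the kind of repetitive work scripting is meant to take over.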
Introduction to Data Mining
Data Mining refers to the process of discovering patterns, trends, and relationships in large datasets using algorithms and statistical techniques. It is a key technique within the broader field of data science and often serves as a precursor to machine learning.
Key Components:
Data Preparation: Cleaning and organizing data to make it usable for analysis. This includes dealing with missing values, removing duplicates, and standardizing formats.
Pattern Recognition: Identifying meaningful patterns or anomalies in datasets. These can be trends over time, clustering of data points, or correlations between variables.
Association and Classification: Data mining methods can categorize data into groups (classification) or find associations between different data attributes.
Example in Techno-Anthropology:
Imagine exploring patterns in user behavior across different technological systems within a community. By mining data on technology usage (e.g., smart homes, wearables), a techno-anthropologist could uncover cultural and social practices influencing or influenced by these technologies.
Techniques:
Clustering: Grouping similar data points together (e.g., k-means, DBSCAN).
Classification: Assigning data points to predefined categories (e.g., decision trees, SVM).
Association Rule Mining: Discovering relationships between variables in datasets (e.g., Apriori algorithm).
Tools:
Weka: A comprehensive suite for data mining.
RapidMiner: A user-friendly platform for data analysis.
Python (with libraries such as Pandas, scikit-learn): A versatile language for data mining and machine learning.
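To show what clustering actually does, here is a deliberately tiny k-means sketch in plain Python; in practice one would use scikit-learn, and the points and starting centroids below are invented:

```python
import math

def kmeans(points, centroids, iterations=10):
    """Tiny k-means sketch: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                     if cluster else c
                     for cluster, c in zip(clusters, centroids)]
    return centroids, clusters

# Two visually obvious groups of 2D points; the interpretation
# (e.g., screen time vs. number of devices) is ours, not the data's.
points = [(1, 1), (1.5, 2), (1, 2), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Note the techno-anthropological caveat built into the comment: the algorithm finds groups, but what those groups *mean* is an interpretive decision.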
Introduction to Machine Learning
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data without explicit programming. ML algorithms analyze large datasets to make predictions, identify trends, or detect anomalies. The focus is on enabling systems to improve their performance over time as they gain experience.
Key Concepts:
Supervised Learning: Involves training a model on labeled data. The goal is to predict outcomes for unseen data (e.g., predicting housing prices from features like size and location).
Unsupervised Learning: Deals with unlabeled data, aiming to discover hidden structures or patterns (e.g., clustering users by their purchasing habits).
Reinforcement Learning: Involves training an agent to make decisions by rewarding desirable behaviors and penalizing undesirable ones.
Example in Techno-Anthropology:
Machine learning can be applied to study human-technology interaction by predicting how certain groups will adapt to new technological systems. For instance, a supervised learning algorithm could analyze historical data from smart city projects to predict the success of new urban technologies within specific socio-cultural contexts.
Techniques:
Regression Analysis: Predicting a continuous outcome (e.g., house prices, energy consumption).
Decision Trees & Random Forests: Used for classification and regression.
Neural Networks: Inspired by the human brain, used for tasks like image and speech recognition.
Deep Learning: A subfield of ML involving neural networks with many layers, particularly useful in tasks like natural language processing (NLP) and computer vision.
Tools:
TensorFlow and PyTorch: Deep learning frameworks.
scikit-learn: A robust library for classical machine learning algorithms.
Keras: Simplifies building neural networks.
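A minimal supervised-learning example in plain Python: fitting a straight line with ordinary least squares. The housing numbers are invented, and with scikit-learn this would be a `LinearRegression`; the point is that "learning" here means estimating parameters from labeled examples and then predicting an unseen case:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b -- the simplest
    supervised model: learn parameters from labeled examples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# "Training data": house size (m^2) vs. price -- invented numbers.
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]
a, b = fit_line(sizes, prices)
prediction = a * 100 + b  # predict the price of an unseen 100 m^2 house
```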
Interdisciplinary Reflections: Techno-Anthropological Perspectives
The convergence of scripting, data mining, and machine learning represents a powerful socio-technical dynamic in understanding human-technology interactions. Techno-anthropologists are well-placed to examine not only the technical aspects of these technologies but also their implications for society and culture.
Data Ethics: As data mining and machine learning grow more pervasive, questions about data privacy, surveillance, and the socio-political power of algorithms arise. How are human behaviors and identities shaped when algorithms determine social outcomes, such as creditworthiness or job prospects?
Algorithmic Bias: Machine learning models can perpetuate societal biases if trained on biased data. From a techno-anthropological perspective, this raises concerns about fairness, accountability, and the role of human oversight in technological systems.
Cultural Context: Data-driven technologies are embedded within specific cultural and political systems. How does the deployment of machine learning differ across societies, and how do cultural norms influence the design and implementation of these technologies?
Power Dynamics: Techno-anthropology critically examines the power relations that are embedded in the development and use of data-driven systems. Who controls the data, and who benefits from the insights gained from data mining and machine learning? What are the implications for marginalized communities?
Suggestions for Further Reading:
Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank, and Mark A. Hall.
Artificial Intelligence: A Guide for Thinking Humans by Melanie Mitchell (for ethical and philosophical insights into AI).
Ethics of Data and Analytics by Kirsten Martin (for a critical take on the ethical dimensions of data mining and ML).
Postphenomenology and Technoscience: The Peking University Lectures by Don Ihde (to explore human-technology relations from a phenomenological standpoint).
KNOWLEDGE
The process of gathering large datasets often involves a variety of methods depending on the source and type of data. Data can be gathered from primary sources such as surveys, sensors, or experiments, and secondary sources like public databases, web scraping, or social media platforms. Key challenges include ensuring the data’s reliability, validity, and representativeness.
Gathering Large Datasets:
Web Scraping: Extracting data from websites using automated scripts (e.g., Python's BeautifulSoup or Selenium).
APIs (Application Programming Interfaces): Accessing data from social media platforms, public databases, or other online services (e.g., Twitter API).
Sensor Data: Using IoT devices to collect real-time environmental, physiological, or behavioral data.
Surveys and Interviews: Gathering structured and unstructured data from human respondents.
Assessing Data Quality:
Data quality assessment involves evaluating the completeness, consistency, accuracy, timeliness, and relevance of the data.
Completeness: Are all necessary data points present? Missing data can lead to biased results.
Consistency: Is the data consistent across different sources or time periods?
Accuracy: Is the data error-free and precise? Inaccurate data can distort analysis.
Timeliness: How recent and up-to-date is the data?
Relevance: Does the data reflect the variables or phenomena under study?
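The five criteria above can be operationalized in a few lines of Python. The records, field names, and thresholds in this sketch are invented; relevance is omitted because it cannot be checked mechanically:

```python
import datetime

records = [
    {"id": 1, "age": 34, "country": "DK", "updated": "2024-01-10"},
    {"id": 2, "age": None, "country": "DK", "updated": "2024-02-01"},
    {"id": 3, "age": 250, "country": "dk", "updated": "2019-06-30"},
]

def quality_report(records, today=datetime.date(2024, 3, 1)):
    """Count violations of completeness, accuracy, consistency, timeliness."""
    report = {"incomplete": 0, "implausible_age": 0,
              "inconsistent_country": 0, "stale": 0}
    for r in records:
        if r["age"] is None:                      # completeness
            report["incomplete"] += 1
        elif not 0 <= r["age"] <= 120:            # accuracy (plausibility)
            report["implausible_age"] += 1
        if r["country"] != r["country"].upper():  # consistency of coding
            report["inconsistent_country"] += 1
        updated = datetime.date.fromisoformat(r["updated"])
        if (today - updated).days > 365:          # timeliness
            report["stale"] += 1
    return report
```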
Discursive Power of Data:
The discursive power of data refers to its ability to shape narratives, construct social realities, and influence decisions. In the context of techno-anthropology, data's discursive power can determine how social behaviors, technological trends, or political issues are framed and understood.
Cultural Context: Data is not neutral. Its collection, classification, and use are shaped by cultural norms, social priorities, and political agendas. For example, data about social media activity in one country might be interpreted differently based on local political or cultural contexts.
Framing and Representation: What data is collected (and what is not) reflects the priorities and power dynamics of those who design and deploy data collection tools. This can reinforce certain narratives while marginalizing others.
Once data has been processed and visualized, it is crucial to assess its reliability, predictive power, and generalizability to ensure the conclusions drawn are valid and meaningful.
Reliability:
Reliability refers to the consistency of the data and results obtained. A reliable dataset should yield the same results under consistent conditions.
Internal Consistency: Data should not contradict itself. Statistical techniques like Cronbach's Alpha or Split-Half Reliability can assess internal consistency, particularly in surveys or psychological tests.
Reproducibility: If the data collection process or analysis were repeated, would the same results be obtained? Reproducibility is critical for ensuring the reliability of scientific findings.
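Cronbach's alpha, mentioned above, is simple enough to compute directly. A sketch in plain Python with invented survey answers; alpha near 1 indicates that items measure the same underlying construct:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one list of respondent scores per item.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)"""
    k = len(item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]
    item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Three survey items answered by five respondents (invented data
# in which respondents answer the items very consistently).
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 1],
    [2, 4, 4, 5, 2],
]
alpha = cronbach_alpha(items)  # close to 1: high internal consistency
```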
Predictive Power:
Predictive power refers to the ability of a dataset or model to accurately forecast outcomes based on patterns within the data. This is essential for making informed predictions about future events or behaviors.
R-squared: In regression models, R² measures the proportion of variance in the dependent variable that can be predicted by the independent variables.
ROC Curves: For classification tasks, Receiver Operating Characteristic (ROC) curves help assess the predictive power of a model, especially by analyzing the trade-off between sensitivity and specificity.
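R² itself is a short computation once you have observed and predicted values; a sketch with invented numbers:

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot: the share of variance explained."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_res / ss_tot

observed = [10, 12, 14, 16]
predicted = [11, 12, 13, 16]  # a hypothetical model's output
r2 = r_squared(observed, predicted)  # 0.9: 90% of variance explained
```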
Generalizability:
Generalizability refers to the extent to which findings from the data can be applied to broader contexts beyond the specific sample studied.
Cross-Validation: One method to assess generalizability is cross-validation, which involves splitting the data into training and testing sets to ensure that the model performs well on unseen data.
Bias and Representativeness: Is the dataset representative of the broader population or phenomena? Overfitting to a specific dataset can reduce generalizability.
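The splitting step behind cross-validation can be sketched in a few lines; real work would use scikit-learn's `KFold`, and the striped split below is one simple choice among several:

```python
def k_fold_splits(data, k=3):
    """Yield (train, test) pairs: each fold serves as the test set once."""
    folds = [data[i::k] for i in range(k)]  # simple striped split
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(9))
for train, test in k_fold_splits(data, k=3):
    assert len(train) == 6 and len(test) == 3
    assert sorted(train + test) == data  # no sample lost or duplicated
```

A model that performs well on every held-out fold, not just on the data it was trained on, is more likely to generalize.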
Visualizations:
Data visualizations (e.g., graphs, heatmaps, interactive dashboards) help reveal patterns, but their interpretation also needs careful consideration.
Clarity and Accuracy: Visualizations should accurately represent the underlying data without misleading by exaggerating trends or minimizing variability.
Interpretability: Data visualizations should be easy to interpret by non-experts. Misleading visualizations can lead to false conclusions, even if the data is robust.
As data collection and processing become more complex, advanced methods are required to handle big data, unstructured data, and real-time data streams.
Advanced Data Gathering Techniques:
Big Data Harvesting: Using distributed frameworks and cloud platforms such as Hadoop or Google Cloud to collect and store massive amounts of data.
Crowdsourcing: Engaging the public to collect or annotate data, such as using platforms like Amazon Mechanical Turk for labeling large datasets.
Internet of Things (IoT) Data: Capturing data from connected devices in real-time (e.g., smart homes, wearables). This data is often vast and continuous, requiring specialized infrastructure for storage and analysis.
Advanced Data Processing Techniques:
Natural Language Processing (NLP):
NLP is essential for processing unstructured text data. Techniques such as topic modeling (e.g., Latent Dirichlet Allocation), sentiment analysis, and entity recognition allow for the extraction of meaningful patterns from text.
Example: Analyzing millions of social media posts to identify trends in public sentiment toward technology adoption.
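Most of these text techniques start by turning raw text into counts, a "bag of words". A minimal sketch (the sentence is invented):

```python
import re
from collections import Counter

text = ("Users love the new app. The app, users say, "
        "makes technology feel personal.")

# Lower-case, keep only alphabetic tokens, count occurrences.
words = re.findall(r"[a-z']+", text.lower())
bag_of_words = Counter(words)  # e.g. 'app' and 'users' each occur twice
```

Topic modeling, sentiment analysis, and the classifiers used later in the course all build on word counts of this kind.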
Stream Processing:
Tools like Apache Kafka and Apache Spark enable real-time data processing, which is crucial for time-sensitive applications like financial markets or smart cities. These systems allow for the continuous ingestion and analysis of data without storing it in traditional databases.
Deep Learning for Image and Video Data:
Deep learning models, particularly Convolutional Neural Networks (CNNs), are used for processing large volumes of visual data. These techniques are useful in contexts like facial recognition, medical imaging, and autonomous vehicles.
Example: Techno-anthropologists studying surveillance technologies might apply deep learning to analyze patterns in CCTV footage and understand societal implications.
Data Fusion:
Combining data from multiple sources (e.g., sensor data, social media, public records) to provide a more holistic view. This involves resolving conflicts in data formats, temporal alignment, and dealing with missing or incomplete data.
Example: In a smart city project, integrating traffic sensor data, weather data, and social media updates can improve real-time traffic predictions and city management.
Dimensionality Reduction:
Large datasets often contain thousands of variables, some of which may be irrelevant or redundant. Techniques like Principal Component Analysis (PCA) and t-SNE reduce the complexity of the data while preserving the most important information.
Example: Reducing the number of features in a survey dataset about technology use can make it easier to interpret the results without losing critical insights.
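PCA and t-SNE are best left to libraries, but a much cruder relative, dropping near-constant columns, shows the core idea in a few lines: discard dimensions that carry (almost) no information. The survey columns below are hypothetical:

```python
from statistics import pvariance

def drop_low_variance(columns, threshold=0.01):
    """Keep only columns whose variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

survey = {  # hypothetical survey columns
    "hours_online": [1, 5, 3, 8, 2],
    "owns_phone":   [1, 1, 1, 1, 1],   # constant -> zero variance
    "num_devices":  [2, 4, 3, 6, 2],
}
reduced = drop_low_variance(survey)  # 'owns_phone' is removed
```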
In summary, data gathering, assessment, and processing are critical steps in analyzing socio-technical systems from a techno-anthropological perspective. Advanced methods, from NLP to deep learning and data fusion, enable us to extract insights from complex, unstructured, and massive datasets. The challenge, however, lies in not just the technical execution but also understanding the broader implications—ethical, cultural, and political—of how data is collected, processed, and interpreted. By critically engaging with these processes, we can uncover how technological systems are shaping, and being shaped by, human behaviors, values, and institutions.
Thinking like a programmer – basic algorithmic thinking.
Introduction to programming basics with Python as a case
Basic handling/processing of data
Challenges in creating/collecting data sets
Potentials and challenges in advanced data processing
There is no one "perfect" book for learning programming, and following a single prescribed textbook can be limiting. Different people learn to code at different speeds and with different resources, so to remain flexible we will not follow a textbook. Instead, we will work through a list of programming topics at a pace that fits us best.
A lot of relevant information can also be found online, in the form of forums, blog posts, articles, or YouTube videos explaining particular concepts. Below is a list of resources you can use to prepare for class, to review material after class, or to expand your knowledge beyond what we cover in class.
Books:
E. Matthes, Python Crash Course 2nd edition (2019) (part 1: basics) - AUB permalink, online resources - abbreviated as PythonCrashCourse later.
S. Gowrishankar & A. Veena, Introduction to Python Programming 1st edition (2018) - AUB permalink - abbreviated as IntroPythonProg later.
C. P. Milliken, Python Projects for Beginners: A Ten-Week Bootcamp Approach to Python Programming 1st edition (2020) - AUB permalink - abbreviated as Python10Weeks later.
Other resources:
Python Programming Exercises, Gently Explained by Al Sweigart
Python documentation:
PyCharm resources:
Forums/communities:
From Moodle:
#00 Introduction to programming and Python
#01 Strings, numbers and boolean values
#02 Lists and for loops
#03 Flow control and conditional tests
#04 Dictionaries, nesting & while-loops
#05 Functions and Object-oriented programming
#06 Working with modules
#07 Data visualization
#08 Intro to AI and machine learning
#09 Review and exam preparation
#10 Exam: November 29, 2024.
Oral, individual examination based on the student's understanding of the course’s contents.
The exam will be based on three problems that will be developed during the course.
At the exam, you will randomly select one of the three problems, present your solution and answer questions related to the problem.
You are encouraged to make your own code and solutions, as well as work in groups to solve the problems.
Grading will be based on the completeness of the solution, and your understanding of the code used.
For the exam:
Bring your computer with the code and presentation/visual aids to present your solution.
A projector/screen will be available.
A flow chart (also written as one word, flowchart) is a type of diagram made of boxes and arrows. It can be used to show:
An algorithm, a step-by-step list of directions that need to be followed to solve a problem
A process, a series of stages in time where the last stage is the product, result or goal.
The flow chart uses boxes, arrows and other elements:
Boxes show the process operations, the various steps and actions.
Arrows show the order of the steps, and/or different options.
Other elements representing materials involved, decisions, people, time or process measurements.
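The correspondence between flowchart elements and code can be shown directly. In this small hypothetical data-cleaning algorithm, each comment names the flowchart element the line implements: process boxes become statements, decision diamonds become if-tests, and arrows become the flow of control:

```python
def keep_row(value):
    # [Start] -> (process box): receive the value to check
    # <decision> is the value missing?
    if value is None:
        return False          # arrow to [Discard]
    # <decision> is the value in the plausible range 0-100?
    if not 0 <= value <= 100:
        return False          # arrow to [Discard]
    return True               # arrow to [Keep] -> [End]
```

Sketching the chart first and translating box by box is a reliable way to practice the algorithmic thinking this part of the course is about.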
Flow charts:
Python basics, data types, identifiers and simple operations.
PythonCrashCourse: ch2
IntroPythonProg: ch2
Python10Weeks: ch2
Points of interest
Remember to include a symbol key if you are using specialised symbols in your charts.
At the exam, you will choose one of the following exercises at random, show and explain what you have done, and answer questions about the program. Each of the problem descriptions below has four parts; for a passing grade (02), you should be able to complete part 1.
Completing one extra part (2, 3 or 4) gives a grade of 04 or 07, depending on the answers to follow-up questions.
Completing two extra parts gives a grade of 07 or 10. Finally, completing all four parts can give a grade of 10 or 12. The final grade within each interval depends on the answers to follow-up questions.
Program 1: Functional programming skills.
Write a program that selects a random word from a text file and asks the user to guess letters. Every time the user guesses a letter that is in the word, it is shown in the corresponding place(s) in the word; otherwise a new part of the hangman is drawn. The game ends either when the user guesses the word or when the entire hangman has been drawn.
start program: session05_hangman.py.
Grading:
Be able to explain the start program and answer relevant questions (minimum for passing).
Create a Hangman class that implements all functions as methods.
Create a UserData class with methods for reading data from a file and for saving a time stamp, user name, number of guesses, time used to guess the word, and whether the word was guessed correctly.
Use both classes in your program.
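As a study aid, here is a sketch of the core game logic. The function and variable names are our own, not those of session05_hangman.py, and file handling, drawing, and the game loop are left out:

```python
def reveal(word, guessed_letters):
    """Show guessed letters in place, '_' elsewhere."""
    return "".join(c if c in guessed_letters else "_" for c in word)

def hangman_round(word, guess, guessed, wrong_guesses):
    """One turn: record the guess, count it as wrong if needed,
    and report whether the whole word has been found."""
    guessed = guessed | {guess}
    if guess not in word:
        wrong_guesses += 1
    won = all(c in guessed for c in word)
    return guessed, wrong_guesses, won
```

Turning functions like these into methods of a Hangman class is exactly what part 2 of the grading asks for.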
Program 2: Working with and plotting data
Write a program that can load data from a CSV file, clean the data set, and visualize the data in scatter plots, linear regressions and box plots. Finally, export the figures to both vector-based and pixel-based formats.
start program: session07_data_analysis_workshop.py (uses helmets_data.csv) or session07_DatasaurusDozen.py (uses DatasaurusDozen.csv).
New data file: DataProb2.csv
Be able to explain one of the start programs and answer relevant questions (minimum for passing).
Load DataProb2.csv into a pandas DataFrame, clean the data set and prepare it for analysis. Remove all invalid numbers and categories, as well as any numeric value over 100. Eliminate all categories of the 'Profile' variable that have fewer than 10 elements.
Make a scatter plot with linear regressions for Profiles A and B, for the variables JFC_fitting and Speech, in the same graph. Create a new categorical variable from the variable 'RespRate': the new variable should be 'low' for values of 'RespRate' below 0.2 and 'high' for values above 0.2. Use this new variable to make a figure with four box plots, one for each combination of factors, that is A-low, A-high, B-low and B-high.
Save the figures that you generated as vector-based graphics, in either pdf or svg format, and in a pixel-based format, as png or jpeg.
PS. You are allowed to use any other data set, as long as you can show that you are able to:
Load, clean and save the data set.
Plot scatter plots and linear regressions
Create new categorical variables from numeric data, and use the new categorical variables in box plots arranged by groups
Export figures in both vector-based and pixel-based formats.
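A stdlib-only sketch of the load-and-clean step. The actual assignment uses pandas; the column names below come from the problem description, but the sample values are invented, and the category-size filter from part 2 is left out for brevity:

```python
import csv
import io

# In-memory stand-in for DataProb2.csv (values are invented).
raw = """Profile,Speech,RespRate
A,42.0,0.15
B,not_a_number,0.30
A,250,0.25
B,57.5,0.10
"""

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    try:
        speech = float(row["Speech"])
    except ValueError:
        continue                      # drop invalid numbers
    if speech > 100:
        continue                      # drop numeric values over 100
    row["Speech"] = speech
    # derive the low/high category from RespRate, as in part 3
    row["RespRateCat"] = "low" if float(row["RespRate"]) < 0.2 else "high"
    rows.append(row)
```

With pandas the same cleaning collapses to a few `to_numeric`, boolean-mask, and `pd.cut`-style operations, but the logic to implement is the same.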
Program 3: Working with text data.
Write a program that can read text data into Python and generate a "bag of words": a Python dictionary with each unique word as a key and its frequency of occurrence as the value. Use the bag of words to make a word cloud, and use a machine learning model to identify specific features of the loaded text.
Start program: session08_language_detection.py (data downloaded from a URL) or session08_sarcasm_dectection.py (uses Sarcasm.json)
Be able to explain one of the start programs and answer relevant questions (minimum for passing).
Adapt the CountVectorizer used in the programs so that its tokens consist of: a) words of 5 to 20 letters; b) words of 3 letters or less; c) combinations of 3 to 6 words.
Considering that both programs deal with classification, compare the performance of the classifier under the different CountVectorizer settings.
Make figures showing the different tokens produced by the CountVectorizer settings, using the wordcloud module. Use a mask so that the word cloud is contained in a circle.
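To clarify what the three token settings in part 2 ask for, here is what the resulting tokens look like when produced with plain regular expressions. In scikit-learn itself, a) and b) would be set via CountVectorizer's token_pattern parameter and c) via ngram_range; the example sentence is invented:

```python
import re

text = "techno anthropology studies how we live with machines"

long_words = re.findall(r"\b[a-z]{5,20}\b", text)   # a) words of 5-20 letters
short_words = re.findall(r"\b[a-z]{1,3}\b", text)   # b) words of 3 letters or less

# c) combinations of 3 to 6 consecutive words (word n-grams)
words = text.split()
ngrams = [" ".join(words[i:i + n])
          for n in range(3, 7)
          for i in range(len(words) - n + 1)]
```

Comparing classifier performance across these settings shows how strongly tokenization choices shape what a text model can "see".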