Creating and using machine learning models, especially in image processing applications, generally requires a large amount of organized data. This data needs to be structured so that the computer can efficiently ‘comb’ through it and search for patterns. This activity introduces you to two of the most useful data structures for building and organizing your data: lists and datasets.
Understand the structure of Python lists.
Understand how lists can be applied and used for dataset development and organization in Python.
Understand and explain the process of creating lists, arrays, and datasets using Python.
Understand and explain the process for indexing list elements and how this process can be applied to dataset management.
Understand the important aspects of datasets and evaluate dataset balances.
Gain experience in building and manipulating lists, list items, and datasets in Python.
Access to a computer and a large screen (if you want to share with others)
A Google account to access Google Colab
A pen or pencil
Lists are incredibly important for sorting through data and information. Understanding them is necessary for training machine learning models, which use massive amounts of data. Many built-in functions and libraries, such as NumPy, are available to help with sorting lists. Organizing all of the necessary data for training takes a lot of time and effort, so being able to sift through large datasets effectively is an essential skill.
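As a minimal sketch of the kind of sorting described above, the example below uses Python's built-in sorted() function and the list.sort() method. The list names and values (image_sizes, labels) are hypothetical and made up purely for illustration.

```python
# Hypothetical example data: image file sizes in kilobytes.
image_sizes = [512, 128, 2048, 256]

# sorted() returns a new list and leaves the original list unchanged.
ascending = sorted(image_sizes)
print(ascending)        # [128, 256, 512, 2048]

# reverse=True sorts from largest to smallest.
largest_first = sorted(image_sizes, reverse=True)
print(largest_first)    # [2048, 512, 256, 128]

# list.sort() sorts the list in place; text strings sort alphabetically.
labels = ["dog", "cat", "bird"]
labels.sort()
print(labels)           # ['bird', 'cat', 'dog']
```

Using sorted() is usually the safer choice when the original order of the data still matters, since it does not modify the list it is given.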
Variable - A label or name that refers to a particular value stored in the memory of a computer. It can sometimes be beneficial to consider a variable as a placeholder for a value.
List - A built-in Python datatype used to store sequences of data in an ordered collection.
Index - The numerical position of an element within an ordered sequence, such as a list. Because Python uses zero-based indexing, the first element in a list has index 0 and is accessed with list[0].
Element - An individual item stored within a data structure such as a list, tuple, or dictionary. In lists and tuples, each element is located at an index; in dictionaries, values are looked up by key.
Array - A data structure that stores multiple values under a single variable name, with each value accessed by its own index.
Matrix - A two-dimensional data structure, often forming a rectangular array, in which each element is indexed by a combination of row and column indexes.
Text String - An ordered sequence of characters used to represent textual data such as a title, name, etc.
NumPy (Numerical Python) - An open-source library for scientific computing. NumPy provides support for large, multi-dimensional arrays and matrices, which can be built from Python lists.
Dataset - A structured collection of data representing patterns relating to various categories and applications.
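The vocabulary above can be tied together in a short sketch. It uses only built-in Python (a nested list stands in for a matrix, and collections.Counter counts labels as one simple way to check dataset balance). All names and values here (labels, pixels, counts) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# A variable named `labels` that refers to a list of text strings.
labels = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Indexing: Python counts from zero, so index 0 is the first element.
first = labels[0]       # 'cat'
last = labels[-1]       # 'dog' (negative indexes count from the end)
middle = labels[1:3]    # ['dog', 'cat'] (a slice: index 1 up to, not including, 3)

# A matrix sketched as a list of lists: 2 rows by 3 columns.
pixels = [
    [0, 255, 0],
    [255, 0, 255],
]
value = pixels[1][2]    # row index 1, column index 2 -> 255

# Dataset balance: count how many examples belong to each category.
counts = Counter(labels)
print(counts["cat"], counts["dog"], counts["bird"])  # 3 2 1
```

In this made-up dataset, "cat" appears three times as often as "bird", so the categories are not evenly balanced.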
Read through the Python Lists and Datasets - Handout and familiarize yourself with the concepts used in this activity.
Read through and interact with the attached Python Lists and Datasets Colab notebook.
Complete the Python Lists and Datasets - Knowledge Assessment.
Many resources are available to help further your understanding of the concepts presented in this activity. The resources listed below will help you get started.