Theory
While Python includes built-in types like list and dict, the collections module offers specialized, high-performance containers designed for specific data handling tasks. They are often faster than hand-rolled equivalents and make the code's intent clearer.
•Counter: A dictionary subclass specifically built for counting hashable objects. It is ideal for frequency analysis, such as counting words in a document or occurrences in a list.
•defaultdict: A dictionary that never raises a KeyError. It uses a "factory function" (like list, int, or set) to automatically initialize a value when you access a missing key.
•deque: Short for "Double-Ended Queue," this is a list-like container designed for fast appends and pops from both the beginning and the end.
•namedtuple: A factory function for creating tuple subclasses with named fields. It allows you to access data using descriptive names (like p.x) instead of vague index numbers (like p[0]) without the memory overhead of a full class.
Code Snippets
Python
from collections import Counter, defaultdict, deque, namedtuple
# 1. Counter: Frequency analysis
# Perfect for when you need to know "How many of each?"
text = "python is fast and python is fun"
c = Counter(text.split())
print(c.most_common(1)) # Output: [('python', 2)] - the most frequent word
# 2. defaultdict: Auto-initializing missing keys
# This avoids checking 'if key in dict' before appending
data = [('tech', 'Python'), ('tech', 'Rust'), ('art', 'Sketching')]
d = defaultdict(list)
for category, item in data:
    d[category].append(item) # Automatically creates the list on first use
# 3. deque: Efficient queuing
# Fast O(1) operations for both ends
dq = deque([1, 2, 3], maxlen=3) # Fixed size
dq.append(4) # 1 is pushed out automatically from the left
dq.rotate(1) # Shifts elements right: [2, 3, 4] becomes [4, 2, 3]
# 4. namedtuple: Self-documenting data structures
# Combines the immutability of a tuple with the readability of a class
User = namedtuple("User", ["name", "role", "level"])
admin = User("Alice", "Superuser", 10)
print(f"{admin.name} is a {admin.role}") # Access via name instead of index
Think About It: Why use deque instead of a list for a queue?
Answer: Removing the first item of a list (list.pop(0)) is slow ($O(n)$) because Python must shift every other item in memory to fill the gap. A deque.popleft() is nearly instantaneous ($O(1)$) because it only updates a pointer.
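A minimal timing sketch of the difference (exact numbers vary by machine, and the gap widens as the queue grows):
Python
import time
from collections import deque

n = 100_000
lst = list(range(n))
dq = deque(range(n))

start = time.perf_counter()
while lst:
    lst.pop(0) # O(n) per call: every remaining item shifts left
print(f"list.pop(0) drain: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
while dq:
    dq.popleft() # O(1) per call: only a pointer moves
print(f"deque.popleft() drain: {time.perf_counter() - start:.3f}s")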
Common Mistake:
The defaultdict "Invisible Key" Trap:
The Scenario: You are building a system that tracks active users. You check if a user is in your "Active" dictionary, and if they aren't, you assume they are logged out.
Code Snippet:
Python
from collections import defaultdict
# We use a defaultdict to store user session data
sessions = defaultdict(dict)
# MISTAKE: Checking if a user exists by just accessing the key
def is_user_logged_in(user_id):
    if sessions[user_id]: # This line CREATES an empty dict for user_id!
        return True
    return False
print(f"Total sessions before check: {len(sessions)}") # 0
is_user_logged_in("user_99")
print(f"Total sessions after check: {len(sessions)}") # 1 (A "Ghost" user was created)
Think about it: If you have 1 million visitors and you check each one this way, your dictionary will grow to 1 million empty entries, potentially crashing your server's memory just by "checking" status. Always use if user_id in sessions: instead.
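A safe version of the check uses membership testing, which reads without inserting:
Python
def is_user_logged_in(user_id):
    return user_id in sessions # 'in' only looks; it never creates a key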
The "Fixed Size" deque Over-rotation:
Theory: When using a deque with a maxlen, developers often forget that adding elements beyond that limit silently discards data from the opposite end. If you rotate a full deque and then append, you might lose the very data you just moved.
Code Snippet:
#MISTAKE: Losing data in a fixed-size deque
from collections import deque
cache = deque([1, 2, 3], maxlen=3)
cache.rotate(1) # [3, 1, 2]
cache.append(4) # 3 is dropped from the LEFT! [1, 2, 4]
# If you expected to keep the '3' you just rotated to the front, it's gone.
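One hedge, if you must rotate and then add: append on the left instead, so the silent discard happens on the right end, away from the element you just moved:
Python
from collections import deque
cache = deque([1, 2, 3], maxlen=3)
cache.rotate(1)     # [3, 1, 2]
cache.appendleft(4) # discard happens on the RIGHT: [4, 3, 1] - the 3 survives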
Edge Case Scenario:
namedtuple is still a Tuple:
Theory: Students often love namedtuple because of the names, but they forget it is still a Tuple. This means once you create it, you cannot change it.
Code Snippet:
Python
from collections import namedtuple
User = namedtuple("User", "name id")
u = User("Alice", 1)
# THE TRAP: Trying to change a value
# u.name = "Bob"
# Output: AttributeError: can't set attribute
Think About It: If you need to change values frequently, should you use a namedtuple?
Answer: No. You should use a Dataclass or a Dictionary. namedtuple is best for data that should stay "locked" and never change.
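If you only need an occasional modified copy, namedtuple offers _replace(), which builds a new tuple instead of mutating the original:
Python
from collections import namedtuple
User = namedtuple("User", "name id")
u = User("Alice", 1)
u2 = u._replace(name="Bob") # returns a NEW User; u is untouched
print(u2) # User(name='Bob', id=1)
print(u)  # User(name='Alice', id=1)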
Counter Addition & Subtraction:
Theory: You can actually perform math on Counter objects. This is incredibly useful for inventory systems (e.g., Stock - Sales = Remaining).
Python
from collections import Counter
inventory = Counter(apple=10, banana=5)
sales = Counter(apple=2, banana=7)
remaining = inventory - sales
print(remaining) # Output: Counter({'apple': 8})
Think about it: Notice that 'banana' disappeared! Python's Counter automatically removes any key where the count drops to zero or becomes negative.
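If you need to see the shortfall instead of losing it, the in-place Counter.subtract() method keeps zero and negative counts:
Python
from collections import Counter
inventory = Counter(apple=10, banana=5)
sales = Counter(apple=2, banana=7)
inventory.subtract(sales) # in-place; negatives are preserved
print(inventory) # Counter({'apple': 8, 'banana': -2})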
Theory
In Python, Dictionaries are the "general-purpose" way to store data using keys and values. However, as your code grows, using dictionaries can become messy because they don't have a fixed structure. Dataclasses (introduced in Python 3.7) provide a way to define "structured" data that is much easier to read and maintain.
Think of a Dictionary like a loose folder where you can throw any piece of paper. A Dataclass is like a pre-printed form with specific boxes you must fill out.
•Dictionaries: Best for data where you don't know the keys ahead of time (like an API response).
•Dataclasses: Best when you know exactly what fields your object should have (like a User, a Product, or a Student).
Code Snippets
Python
from dataclasses import dataclass
from typing import List
# --- The Dictionary Way ---
# No structure; easy to make typos in the keys
user_dict = {"name": "Alice", "id": 1}
print(user_dict["name"]) # You have to use strings to get data
# --- The Dataclass Way ---
# Clear structure; Python helps you catch mistakes
@dataclass
class Student:
    name: str # We define exactly what 'type' of data this is
    id: int
    grades: List[int]
# Creating an object is cleaner
s1 = Student(name="Bob", id=101, grades=[85, 90])
# Getting data is easier with "Dot Notation"
print(s1.name) # No more typing ["name"] every time!
Think About It: If you misspell a key in a dictionary (e.g., user["nmae"]), what happens compared to misspelling a field in a Dataclass (e.g., user.nmae)?
Answer: With a Dictionary, the code only crashes when it finally runs (Runtime Error). With a Dataclass, modern code editors (like VS Code) will highlight the mistake in red before you even run the code because it knows the "schema" of the class.
Think About It: Can you easily compare two dictionaries for equality like dict1 == dict2?
Answer: Yes, Python compares the content. Dataclasses also do this automatically (they generate __eq__ so instances compare field by field), and their type hints let editors and static checkers catch schema mistakes that dictionaries can't, though the hints are not enforced at runtime.
Common Mistake:
Mutability in Dataclass Defaults:
Theory:
A major trap in Dataclasses is trying to assign a mutable object (like a list) as a default value directly. Python will raise a ValueError because all instances would end up sharing the same list.
#MISTAKE: Direct mutable default
from dataclasses import dataclass, field
@dataclass
class Team:
    # members: list = [] # This would raise ValueError
    members: list = field(default_factory=list) # The correct way
Edge Case Scenario:
The __post_init__ Validation
Theory: Dataclasses are "structured," but they don't automatically validate data types at runtime. To actually enforce rules (like "Price cannot be negative"), you must use the __post_init__ method.
Code Snippet:
#EDGE CASE: Logic validation after creation
from dataclasses import dataclass
@dataclass
class Product:
    name: str
    price: float

    def __post_init__(self):
        if self.price < 0:
            raise ValueError("Price cannot be negative")
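Continuing the snippet above, the guard fires the moment an invalid object is created:
Python
p = Product("Laptop", 999.0)  # OK
bad = Product("Broken", -5.0) # raises ValueError: Price cannot be negative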
Theory
In programming, Big O Notation is a way to measure how "efficient" your code is. It doesn't measure speed in seconds (because every computer has a different processor), but rather how the number of operations grows as your data gets larger.
Imagine you are looking for a name in a phonebook:
•$O(n)$ (Linear Time): You start at page 1 and read every single name until you find it. If the book has 100 names, it takes 100 steps. If it has 1 million, it takes 1 million steps.
•$O(1)$ (Constant Time): You have a magic index that tells you exactly which page and line the name is on. Whether the book has 10 names or 10 million, it only ever takes one step to find it.
The table below shows why choosing the right container matters. Notice how much faster Dicts and Sets are for finding or deleting data compared to Lists (times are average-case for CPython):

Operation                  List      Dict / Set
Access by index / key      O(1)      O(1)
Search (x in container)    O(n)      O(1)
Append / add               O(1)      O(1)
Insert at the beginning    O(n)      n/a
Delete by value / key      O(n)      O(1)
Code Snippets
Python
import time
# --- O(1) Example (Dictionary) ---
# Looking up 'apple' is instant, regardless of dictionary size.
my_dict = {"apple": 1, "banana": 2, "cherry": 3}
print(my_dict["apple"]) # Instant lookup
# --- O(n) Example (List) ---
# Python must check every item from the start until it finds 'cherry'.
my_list = ["apple", "banana", "cherry"]
if "cherry" in my_list: # This is O(n)
    print("Found it!")
Think About It: If you have 1 million usernames and need to check if "Student123" exists, should you store them in a List or a Set?
Answer: A Set. Checking if an item exists in a list takes longer as the list grows ($O(n)$). In a set, Python uses "Hashing" to find the item instantly ($O(1)$), no matter how many items are there.
Common Mistake:
The $O(n^2)$ "System Freeze":
The Scenario: You have a list of 10,000 "Banned Users" and you are processing 10,000 "New Comments." You want to block comments from banned users.
Python
import time
banned_users = [f"user_{i}" for i in range(10000)] # 10k items
new_comments = [f"user_{i}" for i in range(10000)] # 10k items
#MISTAKE (O(n^2)): Searching a list inside a loop
start = time.time()
for user in new_comments:
    if user in banned_users: # This check is O(n) - it scans the whole list!
        pass
print(f"Time with List (Slow): {time.time() - start:.2f} seconds")
#FIX (O(n)): Convert to a Set first
start = time.time()
banned_set = set(banned_users) # O(n) to create, but 'in' is now O(1)
for user in new_comments:
    if user in banned_set:
        pass
print(f"Time with Set (Instant): {time.time() - start:.5f} seconds")
Using the List method, this batch can stall your site for several seconds every time it processes comments. With a Set, the same loop finishes in a few milliseconds.
Edge Case Scenario:
Big O isn't everything (The "Constant Factor"):
Sometimes a "slow" $O(n)$ operation is actually faster than a "fast" $O(1)$ operation if the dataset is tiny.
•The Reality: A set lookup is $O(1)$, but building the set first is $O(n)$: Python must "hash" every item up front. If you only have 3 items, simply scanning a List ($O(n)$) is often faster because it has zero setup cost.
•The Lesson: Don't over-engineer. Only optimize with Sets/Dicts when your data starts growing.
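A rough timing sketch of that setup cost (exact numbers depend on your machine); rebuilding the set inside the timed statement makes the hashing overhead visible:
Python
import timeit
small = ["a", "b", "c"]
# Plain list scan: no setup, at most three comparisons
print(timeit.timeit('"c" in small', globals=globals()))
# Hash every item into a set first, then do one O(1) lookup
print(timeit.timeit('"c" in set(small)', globals=globals()))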
Theory
In Python, when you "copy" a list or a dictionary that contains other lists or dictionaries inside it (nested data), Python doesn't always make a completely fresh copy of everything. There are two ways to copy data, and understanding the difference is crucial to avoiding bugs where data changes unexpectedly.
•Shallow Copy: This creates a new "outer" container, but it still points to the exact same "inner" objects. Think of it like a new folder that contains the original physical documents; if you draw on the document, it changes for everyone.
•Deep Copy: This creates a brand new outer container AND brand new copies of every single object inside it. It’s like taking a photocopy of the entire folder and all its documents; what you do to your copy won't affect the original.
Code Snippets
Python
import copy
# --- The Problem: Shallow Copy ---
original = [[1, 2], [3, 4]]
shallow = list(original) # Or original.copy()
# Changing the 'inner' list in the copy
shallow[0][0] = "BOOM"
print(f"Original: {original}") # Output: [['BOOM', 2], [3, 4]] - Corrupted!
print(f"Shallow: {shallow}") # Output: [['BOOM', 2], [3, 4]]
# --- The Solution: Deep Copy ---
original_2 = [[1, 2], [3, 4]]
deep = copy.deepcopy(original_2)
deep[0][0] = "SAFE"
print(f"Original 2: {original_2}") # Output: [[1, 2], [3, 4]] - Safe!
print(f"Deep: {deep}") # Output: [['SAFE', 2], [3, 4]]
Think About It: If you have a simple list like [1, 2, 3] (no nested lists), do you need a Deep Copy?
Answer: No. Since integers are "Immutable" (they cannot be changed), a Shallow Copy is perfectly fine and faster. Deep Copy is only necessary when you have "Mutable" objects (like lists or dicts) inside other objects.
Think About It: Does copy.deepcopy() use more memory than a shallow copy?
Answer: Yes. It has to create brand new objects in memory for every nested level, which can be expensive for massive data structures.
Common Mistake:
The "Shared Data" Corruption:
The Scenario: You are building a game. You have a "Standard Enemy" template and you want to create a "Boss" by copying the template and giving it 100x more health.
Code Snippet:
import copy
# Template for an enemy: [Name, [Stats: HP, Defense]]
enemy_template = ["Goblin", [10, 5]]
#MISTAKE: Using a Shallow Copy (Slicing)
boss = enemy_template[:]
boss[0] = "Boss Goblin" # Top level is fine...
boss[1][0] = 1000 # BUT the inner list is SHARED!
print(f"Boss HP: {boss[1][0]}") # 1000
print(f"Regular Goblin HP: {enemy_template[1][0]}") # 1000! (The small Goblin is now a God)
Think about it: By using a shallow copy, you've accidentally linked the health of every enemy in your game together. If the player hits one, they hit them all! Deep Copy is the only way to ensure they are truly independent.
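The fix is a deep copy, which gives the boss its own independent stats list:
Python
import copy
enemy_template = ["Goblin", [10, 5]]
boss = copy.deepcopy(enemy_template) # the inner stats list is duplicated too
boss[0] = "Boss Goblin"
boss[1][0] = 1000
print(f"Boss HP: {boss[1][0]}") # 1000
print(f"Regular Goblin HP: {enemy_template[1][0]}") # 10 - the template is safe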
Edge Case Scenario:
The Memory Cost of deepcopy():
Theory: While copy.deepcopy() is safe, it is a "heavyweight" operation. It has to visit every single item and create a brand new spot in your computer's RAM for it.
Code Snippet:
Python
import copy
import sys
# A list with 10,000 sub-lists
big_data = [[x] for x in range(10000)]
# Deep copy creates brand new memory for every single sub-item
deep_version = copy.deepcopy(big_data)
print(f"Original Size: {sys.getsizeof(big_data)} bytes")
# Deep copying can double or triple your memory usage instantly!
Think about it: If you have a massive dataset (like 1GB of data), avoid deepcopy. Instead, try to write your code in a way that doesn't need to modify the original data.
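A lighter alternative is a copy-on-write style sketch: shallow-copy the outer list and replace only the inner objects you actually change (continuing with big_data from above):
Python
patched = list(big_data)          # new outer list; inner lists still shared
patched[1] = [big_data[1][0] * 2] # a fresh inner list ONLY where modified
print(big_data[1]) # [1] - original untouched
print(patched[1])  # [2]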
Theory:
In the real world, data is rarely just a simple list of names. Whether you are dealing with weather reports, social media feeds, or online store inventories, the data usually looks like a "List of Dictionaries"—where each item in a list is its own dictionary, and those dictionaries might even contain more lists!
This is called Nested Data. To work with it effectively as a beginner, you need two skills:
1. Accessing: Reaching deep into the "folders" to find a specific value.
2. Manipulation: Changing or extracting data from every item at once using loops or comprehensions.
Code Snippets:
Python
# A common "List of Dictionaries" structure
users = [
{"name": "Alice", "roles": ["admin", "editor"], "active": True},
{"name": "Bob", "roles": ["viewer"], "active": False},
{"name": "Charlie", "roles": [], "active": True}
]
# --- 1. Basic Access ---
# Get the first role of the first user
print(users[0]["roles"][0]) # Output: admin
# --- 2. Safe Access with .get() ---
# Some users might not have a 'roles' key. Using .get() prevents a crash.
for user in users:
    roles = user.get("roles", ["Guest"]) # If 'roles' is missing, use 'Guest'
    print(f"{user['name']} has roles: {roles}")
# --- 3. Extracting Data (Flattening) ---
# Goal: Get a single list of EVERY role from every user
all_roles = [role for user in users for role in user.get("roles", [])]
print(all_roles) # Output: ['admin', 'editor', 'viewer']
Think About It: If you have a massive list of users and some are missing the "email" field, what happens if you run print(user["email"]) inside a loop?
Answer: The program will immediately crash with a KeyError as soon as it hits the first user without an email. This is why professional developers almost always use .get("email", "No Email Provided") when working with nested data.
Think About It: Why do we use user.get("roles", []) inside the comprehension instead of just user["roles"]?
Answer: To prevent the code from crashing with a KeyError. If a user dictionary is missing the "roles" key, .get() returns an empty list [] as a fallback, allowing the loop to continue safely.
Common Mistake:
The "KeyError" in Nested Comprehensions:
Theory:
When "flattening" or extracting data from nested structures, developers often forget that a single missing key in one sub-dictionary will crash the entire comprehension.
Code Snippet:
Python
#MISTAKE: Assuming all nested dicts are identical
data = [{"id": 1, "meta": {"tags": ["AI"]}}, {"id": 2}]
# tags = [t for item in data for t in item["meta"]["tags"]] #Crashes on item 2
#FIX: Use .get()
tags = [t for item in data for t in item.get("meta", {}).get("tags", [])]
Edge Case Scenario:
The "Deep Update" Trap:
Theory: When updating a dictionary inside a list, using .update() only works on the immediate keys. If you try to update a nested dictionary within that structure, you might accidentally overwrite the entire sub-dictionary instead of merging the values.
Code Snippet:
Python
#EDGE CASE: Overwriting vs. Merging in Nested Data
users = [{"id": 1, "settings": {"theme": "dark", "notify": True}}]
# Attempting to update ONLY the theme
new_settings = {"theme": "light"}
#TRAP: This replaces the WHOLE 'settings' dict, losing 'notify'
# users[0]["settings"] = new_settings
# Result would be: {"theme": "light"} -> 'notify' is GONE.
#FIX: Manual merge or a deep update
users[0]["settings"].update(new_settings)
# Result: {"theme": "light", "notify": True}
The Scenario: You are a lead developer at a fast-growing online store. Your system recently crashed because the "Order Summary" report became too slow as the number of orders grew. You need to refactor the legacy code to handle 100,000+ orders efficiently.
The Question:
1. Structured Data: Define a Product using a Dataclass (instead of a dictionary) to store name, ID, and price.
2. Performance & Collections: Create an audit tool that:
  • Finds the top 3 most frequently purchased products (using Counter).
  • Groups all orders by Category (using defaultdict).
  • Allows fast processing of a "Waiting List" of orders (using deque).
3. Data Integrity: Demonstrate a Deep Copy to ensure that when you apply a "Trial Discount" to a sub-list of orders, the original prices remain unchanged.
4. Efficiency: Optimize a search to find if a specific "VIP Product ID" exists in a list of 100,000 items (improving it from $O(n)$ to $O(1)$).
The Answer:
Python
from collections import Counter, defaultdict, deque
from dataclasses import dataclass, field
import copy
import time
# 1. Dataclasses for Structured Data
@dataclass
class Product:
    pid: str
    name: str
    price: float
    category: str
# 2. Setup Mock Data
products = [
Product("P1", "Laptop", 1200.0, "Tech"),
Product("P2", "Mouse", 25.0, "Tech"),
Product("P3", "Desk Lamp", 45.0, "Office"),
Product("P4", "Monitor", 300.0, "Tech")
]
# Raw order log (a flat list of product IDs)
raw_orders = ["P1", "P2", "P1", "P3", "P2", "P1", "P4"]
# --- TASK A: Frequency Analysis (Counter) ---
order_counts = Counter(raw_orders)
print(f"Top 3 Products: {order_counts.most_common(3)}")
# --- TASK B: Grouping Data (defaultdict) ---
category_map = defaultdict(list)
for p in products:
    category_map[p.category].append(p.name)
print(f"Products by Category: {dict(category_map)}")
# --- TASK C: Deep Copy for Data Integrity ---
# Suppose we want to simulate a 50% discount on the first product ONLY
simulation_list = copy.deepcopy(products)
simulation_list[0].price *= 0.5
print(f"Original Laptop Price: {products[0].price}") # Still 1200.0
print(f"Simulated Discount Price: {simulation_list[0].price}") # 600.0
# --- TASK D: Performance Optimization (Big O) ---
# Check if VIP product 'P4' is in stock (Simulating 100,000 items)
large_inventory_list = [f"P{i}" for i in range(100000)]
inventory_set = set(large_inventory_list) # Convert to Set for O(1)
start = time.time()
"P99999" in inventory_set # O(1) Lookup
print(f"Optimized Search Time: {time.time() - start:.8f}s")
The Explanation:
1. Why Dataclasses over Dictionaries?
In the code, products[0].price is much safer than products[0]["price"]. If you accidentally type .pice, the IDE catches it immediately. In a dictionary, a typo leads to a "Silent Error" or a crash only when the code runs.
2. The Power of Counter and defaultdict
• Counter: Without it, you’d need a for loop and a dict to manually check if key in counts. Counter does this in one line and is optimized in C.
• defaultdict: We didn't have to write if category not in category_map: category_map[category] = []. It makes the code cleaner and less prone to "Missing Key" errors.
3. Shallow vs. Deep Copy
When we ran the "Discount Simulation," we used deepcopy. If we had used a shallow copy (products.copy()), changing the price of the laptop in the simulation would have permanently changed the price in the original inventory, causing an accounting nightmare!
4. The $O(1)$ Performance Win
By converting the large_inventory_list to a set, we changed the search time from Linear ($O(n)$) to Constant ($O(1)$).
• In a List, if the item is at the end of 100,000 items, Python does 100,000 checks.
• In a Set, Python uses a "Hash Map" to go directly to the item's location, doing roughly 1 check regardless of size.