Strings are covered in Lesson 7 / Chapter 6. The following complements the information there; it is not a replacement or summary!
Strings are one form of collection, because a single string can be a collection of many characters. A fellow Pythonista has put together a cheat sheet which is worth printing out for reference, or just making some notes from.
Strings are an IMMUTABLE Data Structure.
Any letter, symbol, space or return is a character (often referred to as a char).
On a computer, a character is a number under the hood: each character has a number associated with it. One such mapping is ASCII, which was developed in the 1960s to standardise character encoding across the electronics industry. Until then, individual manufacturers encoded data differently, making sharing files, even plain text, very difficult across different platforms.
ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters.
The ASCII table shows the characters and their underlying numbers in both decimal and hex.
We can find the number value, or ordinal value, of a character using the built-in Python function ord:
>>> ord('a')
97
This only works with a single character, as each char has its own underlying value.
We can do the reverse using another built-in function, chr, which takes an int and gives us the corresponding char.
>>> chr(97)
'a'
This is why a string can be considered a collection of one, many or no chars. Because it is a collection, we can index into it and iterate through it with a loop.
You might think looking at the numbers representing a letter is cool, but even cooler is when we look at the binary code for characters and see what the difference between uppercase and lowercase really is.
You could even check the binary representation of the chars for "0... 9" and you would find a relationship between the binary representation and the number it represents. Just an interesting bit of "by the way".
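To see the relationships described above for yourself, here is a small sketch. The helper name to_binary is made up for this example; it simply combines the built-in ord and format functions.

```python
def to_binary(char):
    """Return the number behind a single character as an 8-bit binary string."""
    return format(ord(char), '08b')

# Compare an uppercase letter with its lowercase partner.
for upper, lower in [('A', 'a'), ('Z', 'z')]:
    print(upper, to_binary(upper), '|', lower, to_binary(lower))

# Uppercase and lowercase differ by a single bit, worth 32.
case_gap = ord('a') - ord('A')
print(case_gap)

# For the digit chars, the lowest four bits hold the digit's own value.
digit_values = [ord(d) & 0b1111 for d in '0123456789']
print(digit_values)
```

Look at the third bit from the left in each pair: it is 0 for uppercase and 1 for lowercase, and every other bit is the same.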
Each char in a string with a len of >= 1 has an index that points to that particular char.
For the string s shown, we can index forward into it or use the backward direction index:
>>> s[3]
'h'
>>> s[-5]
'y'
but if we go beyond those numbers we get an error.
IndexError: string index out of range
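As a sketch of where the limits are, assume s holds 'Python' (an assumption for this example, chosen only because it agrees with the two results above). Valid indices run from -len(s) up to len(s) - 1.

```python
s = 'Python'  # assumed value, consistent with s[3] == 'h' and s[-5] == 'y'

forward = s[3]     # counting from the front, starting at 0
backward = s[-5]   # counting from the back, starting at -1

# Index 6 is one step past the last char, so Python raises IndexError.
try:
    s[len(s)]
    error_message = None
except IndexError as err:
    error_message = str(err)

print(forward, backward, error_message)
```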
String slicing stride syntax in Python allows you to extract characters from a string using the format s[start:end:step]. The start index is inclusive, the end index is exclusive, and the step determines the interval between characters; for example, a step of 2 retrieves every second character, while a negative step reverses the string. This form of indexing also works with lists and tuples.
>>> name[::-1] # reverse the string
'draugydoB ehT'
# Get every second element, starting with index 0: the elements at index 0, 2, 4 ...
>>> name[::2]
'TeBdgad'
# Get every second element in the range from index 0 to index 4
>>> name[0:5:2]
'TeB'
# Get every third element from index 1 onwards
>>> name[1::3]
'hBya'
# Get every second element from index 2 up to, but not including, index 9
>>> name[2:9:2]
'eBdg'
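The same stride syntax can be tried on lists and tuples. This is a minimal sketch; the value of name is assumed to be "The Bodyguard" (it matches the reversed string shown above), and the other names are invented for the example.

```python
name = "The Bodyguard"          # assumed value of name
letters = list(name)            # a list of single-character strings
numbers = (10, 20, 30, 40, 50)  # a tuple

every_second = letters[::2]       # same rule as name[::2], but returns a list
reversed_numbers = numbers[::-1]  # a negative step walks backwards
middle = numbers[1:4]             # start inclusive, end exclusive

print(every_second)
print(reversed_numbers)
print(middle)
```

Note that slicing a list gives back a list and slicing a tuple gives back a tuple, just as slicing a str gives back a str.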
We know we can use the + and * operators with strings, and we know- deep in our minds- that when we use these operators with strings, they are different from the same-looking ones we use with numbers. They are indeed different.
Under the hood, the plus and multiply operators are implemented for strings because there is a string method that tells Python what to do when one of these operators is used with a string. The str class has a special __add__ and a special __mul__ method that allow it to use + and *. We'll learn more about these dunder methods (double underscore methods) later in OOP.
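A quick way to convince yourself is to call the dunder methods directly. This is only for illustration; in normal code you would always write + and *.

```python
joined = "ab" + "cd"
joined_via_dunder = "ab".__add__("cd")   # what + does for two strings

repeated = "ab" * 3
repeated_via_dunder = "ab".__mul__(3)    # what * does for a str and an int

# The str class defines no __sub__ method, so there is nothing for - to call.
has_subtract = hasattr(str, "__sub__")

print(joined, joined_via_dunder, repeated, repeated_via_dunder, has_subtract)
```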
We may know from experience that the - (minus) and the / (divide) operators will not work.
>>> "hello" - 'o' # does not result in "hell"
Traceback (most recent call last):
File "<pyshell>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Comparison operators such as > and < do work with strings. Look at the following code and then look at the ASCII table to discover why the code is doing what it is doing:
>>> 'a' > 'Z'
True
>>> 'a' > 'z'
False
These operators are also implemented under the hood for strings, with their own defined dunder methods to implement them.
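The comparisons above can be reproduced by hand with ord. A short sketch, with variable names chosen just for this example:

```python
print(ord('Z'), ord('a'), ord('z'))

direct = 'a' > 'Z'
via_dunder = 'a'.__gt__('Z')     # the dunder method behind >
via_ord = ord('a') > ord('Z')    # the same comparison done on the numbers

# Longer strings are compared char by char until two chars differ:
# 'a' == 'a', 'p' == 'p', then 'p' < 'r' decides the result.
word_check = 'apple' < 'apricot'

print(direct, via_dunder, via_ord, word_check)
```

All the uppercase letters have smaller numbers than all the lowercase letters, which is why 'a' is "greater than" 'Z'.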
So, one of the most commonly used functions with strings is the len function.
>>> len('hello')
5
But what exactly is a method, then? All methods are functions, but not all functions are methods. Confused?
With a function, we pass an argument within the parens:
>>> max("hello") # single str
'o'
>>> max(1, 2, 3, 4) # many ints
4
>>> max('a', 'D', 'z', 'S') # many strs
'z'
So the max function can accept a wide variety of args... but with a method we use dot notation.
>>> 'hello'.upper() # returns a copy- does not change the original
'HELLO'
>>> str.upper('hello')
'HELLO'
You can see from this example that calling the .upper() method on the string effectively passes the string to str.upper as an argument. Strings can be changed to upper, lower, title and so on. If we tried to use the .upper method on an int, float or bool, we would get an error. This is because methods belong to the class. The str class has these methods associated with it, but the int, float, bool and other classes do not have these methods.
This means that, while methods are functions, they are functions specific to a class.
To distinguish between a method and a function, methods will use dot notation which implicitly passes the string, while functions pass args explicitly.
These are some of the commonly used string methods.
Python's help function will let you look up any of these, as long as you tell it where the method lives, using dot notation.
>>> help(str.upper)
Help on method_descriptor:
upper(self, /)
Return a copy of the string converted to uppercase.
So, this tells us that the upper method lives inside the str class.
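isnumeric() is a useful method that checks whether every character in a string is a numeric character. It returns False for strings containing a decimal point or a minus sign, such as '3.14' or '-5', and it accepts some Unicode numeric characters, such as '½', that float() rejects. It is therefore best suited to checking for plain digit strings such as '42' before casting; for floats in general, a try/except is still the reliable check.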
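It is worth testing isnumeric() on a few inputs before relying on it. The sketch below compares it with a try/except conversion; to_float_or_none is a hypothetical helper written for this example, not a built-in.

```python
def to_float_or_none(text):
    """Return text converted to a float, or None if float() rejects it."""
    try:
        return float(text)
    except ValueError:
        return None

for candidate in ['42', '3.14', '-7', 'abc']:
    print(repr(candidate), candidate.isnumeric(), to_float_or_none(candidate))
```

Only '42' passes isnumeric(), even though float() happily converts '3.14' and '-7' as well.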
Strings are immutable. If we want to change one character inside a word, we cannot mutate that char.
>>> word = 'hello'
>>> word[1] = 'a'
Traceback (most recent call last):
File "<pyshell>", line 1, in <module>
TypeError: 'str' object does not support item assignment
If we want to effectively do that job, we need to splice together a new string using copied parts of the existing string. We can then assign that new string to be stored in the original variable, and "throw away" the original string.
>>> new_word = word[0] + 'a' + word[2:]
>>> word = new_word
>>> word
'hallo'
or more succinctly, we could do all this in one line
>>> word = word[0] + 'a' + word[2:]
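The same splicing idea can be wrapped in a small function. This is a sketch only: replace_at is a made-up name, and it assumes a non-negative index that is inside the string.

```python
def replace_at(text, index, new_char):
    """Return a NEW string with the char at index swapped for new_char.

    Assumes 0 <= index < len(text); negative indices are not handled.
    """
    return text[:index] + new_char + text[index + 1:]

word = 'hello'
changed = replace_at(word, 1, 'a')

# The original string is untouched; we simply built another one.
print(word, changed)
```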
Strings are iterable, meaning we can use a for loop to iterate, or step through a string. We could also use a while loop, but for loops are a convenient construct created for iterating through finite sequences.
word = 'hello'
for letter in word:
    print(ord(letter), end='|')
Output is:
104|101|108|108|111|
word = 'HELLO'
index = 0
while index < len(word):
    print(ord(word[index]), end='|')
    index += 1
Output is
72|69|76|76|79|
You can see that the for loop is simply an easy way to traverse an iterable object, allowing us to use a variable name to describe each object as we traverse the string. There are instances, however, where we need the index as well as the char, such as when comparing adjacent chars in a string; a while loop (or a for loop over the indices) is then the usual choice.
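Comparing adjacent chars might look like the sketch below. The function name has_double_letter is invented for this example; the loop stops one short of the end so that index + 1 is always a valid index.

```python
def has_double_letter(word):
    """Return True if any two neighbouring chars in word are the same."""
    index = 0
    while index < len(word) - 1:
        if word[index] == word[index + 1]:
            return True
        index += 1
    return False

print(has_double_letter('hello'))
print(has_double_letter('world'))
```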
The range function is a very useful builtin that is often used when processing strings and other iterables.
The range function can take 1, 2 or 3 args. If we pass it one arg, it can generate a list of numbers from 0 up to, but not including, the single arg:
>>> print(list(range(4)))
[0, 1, 2, 3]
If we pass it two args, it can generate a list of numbers from arg1 up to but not including arg2:
>>> print(list(range(3, 14)))
[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
If we pass it three args, it can generate a list of numbers from arg 1 up to but not including arg2, in increments of arg3:
>>> print(list(range(3, 14, 3)))
[3, 6, 9, 12]
so the general syntax is
range(start_num, stop_num, increment)
To use the range function to help us iterate through a string, we would need to get the length of the str and then iterate using the indices of the chars in the string.
>>> s = 'hello'
>>> limit = len(s)
>>> for idx in range(limit):
...     print(s[idx])
...
h
e
l
l
o
where idx represents the index of the char. Remember that range(limit) in this case gives us something like:
[0, 1, 2, 3, 4]
This may seem trivial, but is in fact very important to understand: when a str is passed to a function, the function receives a reference to the same str object, not a copy. Because strings are immutable, however, nothing the function does can change the caller's string; any "change" inside the function builds a new string. This is important because mutable data types allow a function to mutate the object passed.
str: immutable, so a function cannot change the caller's object
By contrast:
list: mutable, so a function can change the caller's object in place
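A small experiment you can run to observe what a function can and cannot do to its arguments. All of the names here (shout, add_item, received_id) are invented for the example.

```python
def shout(text):
    text = text + '!'     # builds a new str and rebinds the local name only
    return text

def add_item(items):
    items.append('new')   # changes the list object itself, in place

def received_id(arg):
    return id(arg)        # identity of the object the function was handed

greeting = 'hello'
shouted = shout(greeting)

basket = ['old']
add_item(basket)

# True means the function was handed the very same object, not a copy.
same_object = received_id(greeting) == id(greeting)

print(greeting, shouted, basket, same_object)
```

The caller's greeting is still 'hello' after the call, while basket has grown. The difference comes from str being immutable and list being mutable.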
Strings are very commonly printed, and there are a few things we can do with strings expressly when using the print function. Backslashes mark the beginning of escape sequences, which represent characters that may be difficult to type directly. For example, backslash "n" represents a new line: the output continues on a new line wherever \n is encountered:
# New line escape sequence
>>> print(" The BodyGuard\n is the best album" )
The BodyGuard
is the best album
# Tab escape sequence: \t indicates a tab
>>> print(" The BodyGuard \t is the best album" )
The BodyGuard is the best album
# To include a backslash in a string, use another backslash
>>> print(" The BodyGuard \\ is the best album" )
The BodyGuard \ is the best album
# The r prefix tells Python to treat the string as a raw string, so the backslash is kept as written
>>> print(r" The BodyGuard \ is the best album" )
The BodyGuard \ is the best album
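Counting chars with len is a simple way to see what an escape sequence really is. A minimal sketch, with variable names chosen for this example:

```python
newline = "\n"    # one char: a line break
raw = r"\n"       # two chars: a backslash followed by the letter n
escaped = "\\n"   # the same two chars, written with an escaped backslash

print(len(newline), len(raw), len(escaped))
print(raw == escaped)
```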
In Python, RegEx (short for Regular Expression) is a tool for matching and handling strings.
Python provides a built-in module called re, which allows you to work with regular expressions. It provides several functions, including search, split, findall, and sub.
First, import the re module:
import re
The search() function searches for specified patterns within a string. Here is an example that explains how to use the search() function to search for the word "Body" in the string "The BodyGuard is the best album".
s1 = "The BodyGuard is the best album"
# Define the pattern to search for
pattern = r"Body"
# Use the search() function to search for the pattern in the string
result = re.search(pattern, s1)
# Check if a match was found
if result:
    print("Match found!")
else:
    print("Match not found.")
results in:
Match found!
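The if/else test works because re.search returns a match object when the pattern is found and None when it is not. A short sketch showing both cases:

```python
import re

s1 = "The BodyGuard is the best album"

found = re.search(r"Body", s1)
missing = re.search(r"Bodyguard", s1)   # lowercase g, so there is no match

print(found.span())   # start and end index of the matched text
print(missing)
```

The span can be used as a slice: s1[4:8] gives back the matched text.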
Regular expressions (RegEx) are patterns used to match and manipulate strings of text. There are several special sequences in RegEx that can be used to match specific characters or patterns.
A simple example of using the \d special sequence in a regular expression pattern with Python code:
pattern = r"\d\d\d\d\d\d\d\d\d\d" # Matches any ten consecutive digits
text = "My Phone number is 1234567890"
match = re.search(pattern, text)
if match:
    print("Phone number found:", match.group())
else:
    print("No match")
results in:
Phone number found: 1234567890
The match.group() method is used in Python's re module to retrieve the part of the string where the regular expression pattern matched.
Here's a detailed explanation:
Purpose
Extract Matched Text: match.group() returns the exact substring that matched the pattern.
Usage
When you use functions like re.search() or re.match(), they return a match object if the pattern is found. You can then use match.group() to get the matched text.
Here match.group() retrieves the substring 1234567890 from the text, which is the part that matched the pattern.
The regular expression pattern is defined as r"\d\d\d\d\d\d\d\d\d\d", which uses the \d special sequence to match any digit character (0-9); the \d sequence is repeated ten times to match ten consecutive digits.
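Typing \d ten times is easy to get wrong. Regular expressions also have a repeat count written in curly braces, so the same pattern can be sketched more compactly as \d{10}:

```python
import re

text = "My Phone number is 1234567890"

long_form = re.search(r"\d\d\d\d\d\d\d\d\d\d", text)
short_form = re.search(r"\d{10}", text)      # {10} means "exactly ten times"

too_short = re.search(r"\d{10}", "call 12345")  # only five digits

print(long_form.group(), short_form.group(), too_short)
```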
A simple example of using the \W special sequence in a regular expression pattern with Python code:
pattern = r"\W" # Matches any non-word character from raw string
text = "Hello, world!"
matches = re.findall(pattern, text)
print("Matches:", matches)
results in:
Matches: [',', ' ', '!'] # comma, space, closing exclamation mark
The regular expression pattern is defined as r"\W", which uses the \W special sequence to match any character that is not a word character (a-z, A-Z, 0-9, or _). The string we're searching for matches in is "Hello, world!".
The findall() function finds all occurrences of a specified pattern within a string.
s2 = "The BodyGuard is the best album of 'Whitney Houston'."
# Use the findall() function to find all occurrences of the "st" in the string s2
result = re.findall("st", s2)
# Print out the list of matched words
print(result)
results in:
['st', 'st'] # a list of all the findings
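If you want the whole words that contain "st" rather than just the two letters, one possible approach is to let \w* soak up the word characters on either side. This is a sketch of one way to do it, not the only one:

```python
import re

s2 = "The BodyGuard is the best album of 'Whitney Houston'."

# \w* matches zero or more word characters before and after "st"
words_with_st = re.findall(r"\w*st\w*", s2)

print(words_with_st)
```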
The re module's split() function splits a string into a list of substrings based on a specified pattern.
# Use the split function to split the string by the "\s"
split_array = re.split(r"\s", s2)
# The split_array contains all the substrings, split by whitespace characters
print(split_array)
results in:
['The', 'BodyGuard', 'is', 'the', 'best', 'album', 'of', "'Whitney", "Houston'."]
Here's a detailed explanation:
re.split: This function splits a string by the occurrences of a pattern.
r"\s": This is a regular expression pattern that matches any whitespace character (spaces, tabs, newlines, etc.).
s2: This is the string that you want to split. So we split the string wherever we found a space (in this case).
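One thing to watch for: \s matches a single whitespace character, so a run of two whitespace chars produces an empty string in the result. Adding + (one or more) treats each run as one separator. A small sketch with a made-up test string:

```python
import re

spaced = "The \tBodyGuard\n\nis the best"   # contains two whitespace runs

one_at_a_time = re.split(r"\s", spaced)
runs_together = re.split(r"\s+", spaced)

print(one_at_a_time)
print(runs_together)
```

For this input, the second result is the same as what the ordinary spaced.split() string method returns.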
The re module's sub() function is used to replace all occurrences of a pattern in a string with a specified replacement.
s2 = "The BodyGuard is the best album of 'Whitney Houston'."
# Define the regular expression pattern to search for
pattern = r"Whitney Houston"
# Define the replacement string
replacement = "legend"
# Use the sub function to replace the pattern with the replacement string
new_string = re.sub(pattern, replacement, s2, flags=re.IGNORECASE)
# re.IGNORECASE makes the search case-insensitive,
# so it matches "Whitney Houston" in any letter case
# The new_string contains the original string with the pattern replaced by the replacement string
print(new_string)
results in:
The BodyGuard is the best album of 'legend'.
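To see what the flag actually changes, try the same substitution with and without it on text in a different letter case. The sketch also shows the optional count argument, which limits how many replacements are made; the sample strings are invented for this example.

```python
import re

shouted = "the best album of WHITNEY HOUSTON"

strict = re.sub(r"Whitney Houston", "legend", shouted)
relaxed = re.sub(r"Whitney Houston", "legend", shouted, flags=re.IGNORECASE)

# count=1 replaces only the first occurrence of the pattern
limited = re.sub(r"e", "E", "legend", count=1)

print(strict)
print(relaxed)
print(limited)
```

Without the flag nothing matches, so sub() hands back the string unchanged.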