Data Science‎ > ‎With Python‎ > ‎

2. Numpy



Numpy, a package widely used in the data science community which lets us work efficiently with arrays and matrices in Python. 

We can achieve significant performance speedups with NumPy over native Python code, particularly when our computations follow the Single Instruction, Multiple Data (SIMD) paradigm.

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.

numpy arrays are specialized data structures. This means you don't only get the benefits of an efficient in-memory representation, but efficient specialized implementations as well.

E.g. if you are summing up two arrays the addition will be performed with the specialized CPU vector operations, instead of calling the python implementation of int addition in a loop.

The optimized C code numpy/scipy use behind the scenes consists of the Automatically Tuned Linear Algebra Software (ATLAS), BLAS (Basic Linear Algebra Subprograms) and LAPACK - Linear Algebra PACKage. These libraries have been developed and worked on for ages, are the gold standard of linear algebra vector and matrix computations

Indexing the array is over 100 times faster than indexing the Series. This shows up in arithmetic too, because Pandas aligns Series on their indexes before doing operations.
Why is Pandas so much slower than NumPy? The short answer is that Pandas is doing a lot of stuff when you index into a Series, and it’s doing that stuff in Python.


Creating Arrays

Create a list and convert it to a numpy array

mylist = [1, 2, 3]
x = np.array(mylist)
x
array([1, 2, 3])

Or just pass in a list directly

y = np.array([4, 5, 6])
y
array([4, 5, 6])

Pass in a list of lists to create a multidimensional array

m = np.array([[7, 8, 9], [10, 11, 12]])
m
array([[ 7,  8,  9],
       [10, 11, 12]])

Use the shape method to find the dimensions of the array. (rows, columns)

m.shape
(2, 3)

arange returns evenly spaced values within a given interval.

n = np.arange(0, 30, 2) # start at 0 count up by 2, stop before 30
n
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

reshape returns an array with the same data with a new shape.

n = n.reshape(3, 5) # reshape array to be 3x5
n
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])

linspace returns evenly spaced numbers over a specified interval.

o = np.linspace(0, 4, 9) # return 9 evenly spaced values from 0 to 4
o
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ])

resize changes the shape and size of array in-place.

o.resize(3, 3)
o
array([[ 0. ,  0.5,  1. ],
       [ 1.5,  2. ,  2.5],
       [ 3. ,  3.5,  4. ]])


Numpy also has several built-in functions and shortcuts for creating arrays.

ones returns a new array of given shape and type, filled with ones.

np.ones((3, 2))
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

zeros returns a new array of given shape and type, filled with zeros.

np.zeros((2, 3))
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])


eye returns a 2-D array with ones on the diagonal and zeros elsewhere.

np.eye(3)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

diag extracts a diagonal or constructs a diagonal array.

np.diag(y)
array([[4, 0, 0],
       [0, 5, 0],
       [0, 0, 6]])

Create an array using repeating list (or see np.tile)

np.array([1, 2, 3] * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])

Repeat elements of an array using repeat.

np.repeat([1, 2, 3], 3)
array([1, 1, 1, 2, 2, 2, 3, 3, 3])


Combining Arrays

p = np.ones([2, 3], int)
p
array([[1, 1, 1],
       [1, 1, 1]])

Use vstack to stack arrays in sequence vertically (row wise).

np.vstack([p, 2*p])
array([[1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2]])

Use hstack to stack arrays in sequence horizontally (column wise).

np.hstack([p, 2*p])
array([[1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2]])


Operations

Use +-*/ and ** to perform element wise addition, subtraction, multiplication, division and power.

print(x + y) # elementwise addition     [1 2 3] + [4 5 6] = [5  7  9]
print(x - y) # elementwise subtraction  [1 2 3] - [4 5 6] = [-3 -3 -3]

print(x * y) # elementwise multiplication  [1 2 3] * [4 5 6] = [4  10  18] 
print(x / y) # elementwise divison         [1 2 3] / [4 5 6] = [0.25  0.4  0.5]

print(x**2) # elementwise power  [1 2 3] ^2 =  [1 4 9]

Dot Product:

[x1 x2 x3]y1y2y3=x1y1+x2y2+x3y3[x1 x2 x3]⋅[y1y2y3]=x1y1+x2y2+x3y3


x.dot(y) # dot product  1*4 + 2*5 + 3*6
32

z = np.array([y, y**2])
print(len(z)) # number of rows of array
2

Let's look at transposing arrays. Transposing permutes the dimensions of the array.

z = np.array([y, y**2])
z
array([[ 4,  5,  6],
       [16, 25, 36]])

The shape of array z is (2,3) before transposing.

z.shape
(2, 3)

Use .T to get the transpose.

z.T
array([[ 4, 16],
       [ 5, 25],
       [ 6, 36]])

The number of rows has swapped with the number of columns.

z.T.shape
(3, 2)

Use .dtype to see the data type of the elements in the array.

z.dtype
dtype('int64')

Use .astype to cast to a specific type.

z = z.astype('f')
z.dtype
dtype('float32')


Math Functions

Numpy has many built in math functions that can be performed on arrays.

a = np.array([-4, -2, 1, 3, 5])

a.sum()
3

a.max()
5

a.min()
-4

a.mean()
0.59999999999999998

a.std()
3.2619012860600183

argmax and argmin return the index of the maximum and minimum values in the array.

a.argmax()
4

a.argmin()
0


Indexing / Slicing

s = np.arange(13)**2
s
array([  0,   1,   4,   9,  16,  25,  36,  49,  64,  81, 100, 121, 144])

Use bracket notation to get the value at a specific index. Remember that indexing starts at 0.

s[0], s[4], s[-1]
(0, 16, 144)

Use : to indicate a range. array[start:stop]

Leaving start or stop empty will default to the beginning/end of the array


s[1:5]
array([ 1,  4,  9, 16])

Use negatives to count from the back.

s[-4:]
array([ 81, 100, 121, 144])

A second : can be used to indicate step-size. array[start:stop:stepsize]

Here we are starting 5th element from the end, and counting backwards by 2 until the beginning of the array is reached.

s[-5::-2]

array([64, 36, 16,  4,  0])

Let's look at a multidimensional array.

r = np.arange(36)
r.resize((6, 6))
r
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Use bracket notation to slice: array[row, column]

r[2, 2]
14

And use : to select a range of rows or columns

r[3, 3:6]
array([21, 22, 23])

Here we are selecting all the rows up to (and not including) row 2, and all the columns up to (and not including) the last column.

r[:2, :-1]
array([[ 0,  1,  2,  3,  4],
       [ 6,  7,  8,  9, 10]])

This is a slice of the last row, and only every other element.

r[-1, ::2]
array([30, 32, 34])

We can also perform conditional indexing. Here we are selecting values from the array that are greater than 30. (Also see np.where)

r[r > 30]
array([31, 32, 33, 34, 35])

Here we are assigning all values in the array that are greater than 30 to the value of 30.

r[r > 30] = 30
r
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])


Copying Data

Be careful with copying and modifying arrays in NumPy!

r2 is a slice of r

r2 = r[:3,:3] r2

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])

Set this slice's values to zero ([:] selects the entire array)

r2[:] = 0
r2
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

r has also been changed!

r
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

To avoid this, use r.copy to create a copy that will not affect the original array

r_copy = r.copy()
r_copy
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Now when r_copy is modified, r will not be changed.

r_copy[:] = 10
print(r_copy, '\n')
print(r)
[[10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]] 

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]


Iterating Over Arrays

Let's create a new 4 by 3 array of random numbers 0-9.

test = np.random.randint(0, 10, (4,3))
test
array([[3, 4, 7],
       [3, 1, 1],
       [8, 8, 4],
       [8, 0, 0]])

Iterate by row:

for row in test:
    print(row)
[3 4 7]
[3 1 1]
[8 8 4]
[8 0 0]

Iterate by index:

for i in range(len(test)):
    print(test[i])
[3 4 7]
[3 1 1]
[8 8 4]
[8 0 0]

Iterate by row and index:

for i, row in enumerate(test):
    print('row', i, 'is', row)
row 0 is [3 4 7]
row 1 is [3 1 1]
row 2 is [8 8 4]
row 3 is [8 0 0]

Use zip to iterate over multiple iterables.

test2 = test**2
test2
array([[ 9, 16, 49],
       [ 9,  1,  1],
       [64, 64, 16],
       [64,  0,  0]])

for i, j in zip(test, test2):
    print(i,'+',j,'=',i+j)
[3 4 7] + [ 9 16 49] = [12 20 56]
[3 1 1] + [9 1 1] = [12  2  2]
[8 8 4] + [64 64 16] = [72 72 20]
[8 0 0] + [64  0  0] = [72  0  0]










Comments