Data Wrangling

Abaixo temos algumas anotações acerca de cursos estudos sobre o processo complexo e delicado de transformar e manipular dados provenientes de formatos "crus" para outro formato com a intenção de entregar as informações contidas no dados em informação útil, sendo o processo necessário para a construção de gráficos, estatísticas descritivas e também para a modelagem de dados.

Below we have some notes about course studies on the complex and delicate process of transforming and manipulating data from "raw" formats to another format with the intention of delivering the information contained in the data into useful information, the process being necessary for the construction of graphics, descriptive statistics, and also for data modeling.

31 examples to master sql

SQL is a programming language used to manage data stored in tabular form (tables) in relational databases.

Relational Data

A relational database consists of multiple tables that relate to each other. The relationship between tables is formed in the sense of shared columns.

Database Management

There are many different relational database management systems (MySQL, PostgreSQL, SQL Server). The SQL syntax between them may differ slightly.

Practical Approach

Create a database and tables
Insert data into tables
Delete data from tables
Update tables
Query tables using a wide variety of selected statements

1. Connect to the server

First, we connect to the Terminal's MySQL server and create a database.

sudo mysql -u root

We will be asked to enter the password. Now we are connected to the MySQL server on our machine.

2. Create a Database

The following command creates a database called "retail.

create database retail;

use retail;

We are not yet in the retail database that does not yet contain tables.

3. Create tables

First, we will create a table called "client" using the create table command.

create table customer (

cust_id int primary key,

age int,

location varchar (20),

gender varchar (20));

We define the name of the columns and associated data types within the parenthesis. The column cust_id is specified as the primary key.

The primary key is the column that uniquely identifies each row. It's like the index of a Pandas DataFrame.

4. Create another table

Let's create the second table that is called "orders.

create table orders ( order_id int primary key,

date date, amount decimal(5,2),

cust_id int,

foreign key (cust_id) references customer(cust_id)

on delete cascade);

In the beginning, we mentioned that relational tables are related to each other through shared columns. A column that lists two tables is a foreign key.

The foreign key is what concerns one table to another.

The foreign key contains the primary key of another table.

The column cust_id in the orders table is a foreign key and relates the order table to the customer table. We specify this condition when creating the table.

5. View Tables

The retail database contains two tables now. We can see the existing tables in a database using the show tables command.

show tables;

+------------------+

| Tables_in_retail |

+------------------+

| customer |

| orders |

+------------------+

Note: SQL commands end with a semicolon (";").

6. Table Description — info( )

The desc or describe commands provide an overview of the table in column names, data types, and additional information.

mysql> desc orders;

+----------+--------------+------+-----+---------+-------+

+----------+--------------+------+-----+---------+-------+

| order_id | int(11) | NO | PRI | NULL | |

| cust_id | int(11) | YES | MUL | NULL | |

+----------+--------------+------+-----+---------+-------+

7. Modify Tables

We can modify existing tables. For example, the change table command can add a new column or delete an existing one. For example, let's add a column to the order table called "is_sale."

alter table orders add is_sale varchar(20);

We type dread the column name and data type, along with the additional keyword.

desc orders;

+----------+--------------+------+-----+---------+-------+

+----------+--------------+------+-----+---------+-------+

| order_id | int(11) | NO | PRI | NULL | |

| cust_id | int(11) | YES | MUL | NULL | |

+----------+--------------+------+-----+---------+-------+

The column is_sale has been added to the order table.

8. Delete Feature

We can also use the changing table to delete a column with a slight change in syntax.

alter table orders drop is_sale;

The drop keyword is used instead of the add. It is because we also don't need to write the data type to drop a column.

9. Enter data

We have tables, but they don't contain data. One way to populate tables is the insertion statement.

insert into customer values (1000, 42, 'Austin', 'female);

The specified values are entered in the columns in the same order. So, we need to maintain consistent order.

10. Insert Multiple Lines

We can insert multiple lines at the same time by selecting each row.

insert into customer values

(1001, 34, 'Austin', 'male'),

(1002, 37, 'Houston', 'male'),

(1003, 25, 'Austin', 'female'),

(1004, 28, 'Houston', 'female'),

(1005, 22, 'Dallas', 'male');

11. Deleting Lines

The declaration deletion can be used to delete existing rows in a table. First, we need to identify the rows to be deleted by providing a condition. For example, the statement below will delete the row with an order id of 17.

delete from orders

where order_id = 17;

If we do not specify a condition, all rows in the given table are deleted.

12. Update a Line

We can also update an existing line. For example, let's update a row in the order table.

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 1 | 2020-10-01 | 24.40 | 1001 |

+----------+------------+--------+---------+

It is the first row in the order table. We want to change the order value to 27.40.

update orders

set amount = 27.40 #alterar essa coluna

where order_id = 1;

select * from orders limit 1;

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 1 | 2020-10-01 | 27.40 | 1001 |

We write the updated values after the keyword set. The rows to be updated are identified by providing the conditions after the keyword where.

13. Replicate a Table Structure

If we want to create a table by copying the structure of an existing table, we can use the create table statement with a similar keyword.

create table orders_copy like orders;

show tables;

+------------------+

| Tables_in_retail |

+------------------+

| customer |

| orders |

| orders_copy |

+------------------+

The table orders_copy has the same structure as the order table but contains no data.

14. Replicate a whole Table

We can also create a copy of an existing table with the data using the create table and select claims together.

create table new_orders

select * from orders;

It Looks like a combination of two separate statements. The first row creates the table, and the second row fills it with the data in the order table.

15. Drop Tables

We can use the drop statement to delete tables in a database.

drop table orders_copy, new_orders;

show tables;

+------------------+

| Tables_in_retail |

+------------------+

| customer |

| orders |

We successfully dropped the tables created in the previous example.

We have two relational tables in a database. The following examples will demonstrate how we can retrieve data from these tables using selected queries.

16. View Table Features

The most straightforward query is to view all the columns in a table.

select * from orders

limit 3;

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 1 | 2020-10-01 | 27.40 | 1001 |

| 2 | 2020-10-01 | 36.20 | 1000 |

| 3 | 2020-10-01 | 65.45 | 1002 |

+----------+------------+--------+---------+

The "*" selects all columns, and the limit keyword places a constraint on the number of rows to display.

17. Select Specific Columns

We can select only some of the columns by typing the column name instead of "*

select order_id, amount

from orders

limit 3;

+----------+--------+

| order_id | amount |

+----------+--------+

| 1 | 27.40 |

| 2 | 36.20 |

| 3 | 65.45 |

+----------+--------+

18. Condition Where

We can specify a condition for rows to be selected using the where clause. For example, the following query will return all orders placed in 2020–10–01.

select * from orders

where date = '2020-10-01';

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 1 | 2020-10-01 | 27.40 | 1001 |

| 2 | 2020-10-01 | 36.20 | 1000 |

| 3 | 2020-10-01 | 65.45 | 1002 |

+----------+------------+--------+---------+

19. Multiple Where Conditions

The clause accepts multiple conditions. Let's add another state to the query in the previous example.

select * from orders

where date = '2020-10-01' and amount > 50;

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 3 | 2020-10-01 | 65.45 | 1002 |

+----------+------------+--------+---------+

20. Sort

We may want to sort query results that can be done using order by clause. For example, the following query will return orders in 2020–10–02 and sort them based on value.

select * from orders

where date = '2020-10-02'

order by amount;

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 5 | 2020-10-02 | 18.80 | 1005 |

| 6 | 2020-10-02 | 21.15 | 1009 |

| 4 | 2020-10-02 | 34.40 | 1001 |

| 7 | 2020-10-02 | 34.40 | 1008 |

| 8 | 2020-10-02 | 41.10 | 1002 |

+----------+------------+--------+---------+

21. Ascending Sort

The order by clause sorts the lines in ascending order by default. However, we can change it to go down with the keyword desc.

select * from orders

where date = '2020-10-02'

order by amount desc;

+----------+------------+--------+---------+

+----------+------------+--------+---------+

| 8 | 2020-10-02 | 41.10 | 1002 |

| 4 | 2020-10-02 | 34.40 | 1001 |

| 7 | 2020-10-02 | 34.40 | 1008 |

| 6 | 2020-10-02 | 21.15 | 1009 |

| 5 | 2020-10-02 | 18.80 | 1005 |

+----------+------------+--------+---------+

22. Count (fundamental)

SQL is a universal language that can also be used as a data analysis tool. It provides many functions for analyzing and transforming data during the query from a database.

For example, we can count the number of unique days in the order table.

select count(distinct(date)) as day_count

from orders;

+-----------+

| day_count |

+-----------+

| 4 |

+-----------+

The order table contains orders on four different days—the keyword "as" is used to rename the column in the query result. Otherwise, the column name would be "count(data)."

23. Group By

There are four different days in the order table. We can also find out how many orders per day. Group by will help us in this task.

select date, count(order_id) as order_count

from orders

group by date;

+------------+-------------+

| date | order_count |

+------------+-------------+

| 2020-10-01 | 3 |

| 2020-10-02 | 5 |

| 2020-10-03 | 6 |

| 2020-10-04 | 2 |

+------------+-------------+

We count the orders and group them up to the dates column.

24. Average daily value

We will calculate the average order value for each day and order the results based on the average value in descending order.

select date, count(order_id) as order_count

from orders

group by date;

+------------+-------------+

| date | order_count |

+------------+-------------+

| 2020-10-01 | 3 |

| 2020-10-02 | 5 |

| 2020-10-03 | 6 |

| 2020-10-04 | 2 |

+------------+-------------+

25. Group and Filter

We want to modify the query in the previous example and include only days with an average value greater than 30.

select date, avg(amount)

from orders

group by date

having avg(amount) > 30

order by avg(amount) desc;

+------------+-------------+

| date | avg(amount) |

+------------+-------------+

| 2020-10-01 | 43.016667 |

| 2020-10-04 | 42.150000 |

| 2020-10-03 | 37.025000 |

+------------+-------------+

It is important to note that the order of the statements in the query matters. For example, it gives an error if we place the order by clause before they have a clause.

26. Maximum Daily Value

We want to find out the maximum order amount per day.

select date, max(amount)

from orders

group by date;

+------------+-------------+

| date | max(amount) |

+------------+-------------+

| 2020-10-01 | 65.45 |

| 2020-10-02 | 41.10 |

| 2020-10-03 | 80.20 |

| 2020-10-04 | 50.10 |

+------------+-------------+

27. Combine Various Functions

We want to combine multiple aggregate functions into a selected statement. To demonstrate this, let's elaborate on the previous example.

We want to see the difference between the maximum order and the minimum order for each customer. We also want to sort the results based on the difference in the ascending order and display the first three.

select cust_id, max(amount) - min(amount) as dif

from orders

group by cust_id

order by dif desc

limit 3;

+---------+-------+

| cust_id | dif |

+---------+-------+

| 1007 | 46.00 |

| 1009 | 28.95 |

| 1002 | 24.35 |

+---------+-------+

The dif column is obtained by subtracting the minimum value from the maximum value.

28. Grouping Count

We're moving to the client's desk now. But, first, let's find out how many female and male clients we have in each city.

select location, gender, count(cust_id)

from customer

group by location, gender;

+----------+--------+----------------+

| location | gender | count(cust_id) |

+----------+--------+----------------+

| Austin | female | 2 |

| Austin | male | 1 |

| Dallas | female | 2 |

| Dallas | male | 2 |

| Houston | female | 2 |

| Houston | male | 1 |

+----------+--------+----------------+

29. Relationship

Customer and order tables are related to each other based on the cust_id. Therefore, we can query data from both tables using JOINS.

We want the average order value for each city in the customer table.

select customer.location, avg(orders.amount) as avg

from customer

join orders

on customer.cust_id = orders.cust_id

group by customer.location;

+----------+-----------+

| location | avg |

+----------+-----------+

| Austin | 33.333333 |

| Dallas | 34.591667 |

| Houston | 44.450000 |

Because we select columns from two different tables, column names are specified with the associated table name. The second, third, and fourth rows of the above query join the order and customer table based on the cust_id in each table.

30. Aggregation and Filter

We want to see the average age of customers who have an order in 2020–10–03.

select avg(c.age) as avg_age

from customer c

join orders o

on c.cust_id = o.cust_id

where o.date = '2020-10-03';

+---------+

| avg_age |

+---------+

| 30.0000 |

+---------+

We can use codenames for the names on the table as well. It is useful when we need to type table names often.

31. Nested Condition

We want to see the location of the customer who has the most significant number of orders.

select c.location, o.amount

from customer c

join orders o

on c.cust_id = o.cust_id

where o.amount = (select max(amount) from orders);

+----------+--------+

| location | amount |

+----------+--------+

| Dallas | 80.20 |

+----------+--------+

We have a selected statement nested in this query.

Concluding

I believe the 30 examples in this article provide a comprehensive introduction to SQL. In addition, we cover the following topics:

Creating a database with relational tables
Modifying tables
Inserting data into tables
Deleting data from tables
Writing queries to retrieve data from tables

It is best to move on to the more advanced operations since you are comfortable working with the basics.

SQL em Pandas

Notebook de Estudo

Acesse o Jupyter Notebook para consultar os conceitos que serão abordados sobre SQL Join em Pandas. Obs: as funções, outputs e termos importantes estão em negrito para facilitar a compreensão — pelo menos a minha.

Importar Bibliotecas

import pandas as pd

import numpy as np

Para que possamos consultar os dados em um banco de dados relacional, utilizamos a linguagem SQL, que é uma linguagem de manipulação de dados em bancos de dados relacionais.

Quando temos que trabalhar com mais de uma tabela, executamos comandos de SQL Join — uma junção de tabelas, que busca dados de 2, 3, 4 ou 5 tabelas de maneira simultânea.

Importar Imagem

A imagem resume alguns tipos junções através de SQL Joins:

from IPython.display import Image

Image(url = 'https://upload.wikimedia.org/wikipedia/commons/9/9d/SQL_Joins.svg')

Inner Join

Veja na figural acima, que temos duas tabelas — tabela A e tabela B. O tipo de JOIN que retorna os dados que são comuns às duas tabelas. Para isso temos o tipo de JOIN específico — INNER JOIN.

Right Join

O JOIN seguinte é o RIGHT JOIN — temos a tabela A e B, desejamos retornar os dados comuns às duas tabelas, mas também todos os dados pertencentes à tabela B (do lado direito)

Left Join

O mesmo raciocínio serve para o RIGHT JOIN— temos a tabela A e B, desejamos retornar os dados comuns às duas tabelas, mas também todos os dados pertencentes à tabela A do (lado esquerdo)

Full Outer Join

Podemos buscar todos os dados com FULL OUTER JOIN — temos a tabela A e B, desejamos retornar os dados comuns às duas tabelas, mas também todos os dados pertencentes à tabela A e à tabela B.

Como fazer isso em Pandas?

Criar DataFrames

Primeiramente, criaremos alguns conjuntos de dados a partir de dicionários, ou seja, pares de chave:valor. Com o dicionário criado, podemos convertê-lo em DataFrame e indicar quais são os nomes das colunas.

Imagine que coordenamos uma escola que registra notas para cada turma de alunos e, cada uma dessas turmas, têm uma planilha individual. Como faríamos a junção dessas planilhas para realizar o processo de análise? Pois bem…

DF_1: disciplina, nome e sobrenome

Suponha que sejam arquivos csv ou planilha de uma turma de uma escola qualquer, onde temos as disciplinas e o nome e sobrenome dos alunos — aqui vamos criar pequenos DataFrames para fins didáticos:

dados1 = {

'disciplina_id': ['1', '2', '3', '4', '5'],

'nome': ['Bernardo', 'Alan', 'Mateus', 'Ivo', 'Gerson'],

'sobrenome': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']}

df_a = pd.DataFrame(dados1,

columns = ['disciplina_id', 'nome', 'sobrenome']);

df_a

DF_2: disciplina, nome e sobrenome

Temos novamente outro DataFrame de disciplinas, nome e sobrenome:

dados2 = {

'disciplina_id': ['4', '5', '6', '7', '8'],

'nome': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],

'sobrenome': ['Sampaio', 'Fernandes', 'Arantes', 'Menezes', 'Martins']}

df_b = pd.DataFrame(dados2,

columns = ['disciplina_id', 'nome', 'sobrenome']); df_b

DF_3: disciplina e nota

Temos o código das disciplinas e a nota do teste correspondente:

dados3 = {

'disciplina_id': ['1','2','3','4','5','7','8','9','10','11'],

'teste_id': [81, 75, 75, 71, 76, 84, 95, 61, 57, 90]}

df_n = pd.DataFrame(dados3,

columns = ['disciplina_id','teste_id']); df_n

O que fazer a partir de agora?

Agora juntaremos essas três tabelas para realizar o processo de análise — fazemos isso o tempo todo ao trabalhar com análise de dados relacionais, até mesmo com a análise de dados não estruturados, visto que dificilmente receberemos um arquivo pronto, uniforme e limpo.

Portanto, precisamos realizar o processo de limpeza, unir tudo em um único Dataframe, fazero pré-processamento e depois a modelagem preditiva — nesse caso poderíamos prever a nota de alunos em algum teste. Para isso, primeiramente, precisamos deixar os dados limpos e organizados para auxiliar na manipulação.

Não há exagero algum em afirmar que até 80% do tempo de um cientista de dados é usado em tarefas de organização, manipulação, limpeza e transformação de dados. Cada problema de negócios vai requerer uma abordagem diferente para dados diferentes, de fontes diferentes…

Junção de Tabelas com Pandas

Veremos agora como fazer a junção de DataFrames utilizando Pandas.

Vamos começar concatenando os dois DataFrames df_a e df_b no sentido das linhas, ou seja, vamos pegar todas as linhas do 1° DataFrame e concatenar com a função concat do Pandas, todas as linhas do 2° Dataframe.

Concatenar por Linhas

A função concat receberá uma lista de DataFrames:

df_new = pd.concat([df_a, df_b]); df_new

Observe como ficaram os índices, temos o índice do primeiro DataFrame df_a de 0 a 4, depois os índices do segundo DataFrame df_b de 0 a 4 novamente. É importante posteriormente fazer o ajuste desses índices!

Concatenar por colunas

Uma vez que fizemos a concatenação por linhas, também podemos fazer a concatenação por colunas — basta adicionar o parâmetro axis=1:

pd.concat([df_a, df_b], axis=1)

Veja que agora fizemos a junção das tabelas no sentido das colunas. Isso que estamos fazendo, nos permite concatenar dois arquivos csv da maneira que for mais conveniente ou necessária para nossa análise.

Join de DataFrames por Coluna

Agora faremos o JOIN de dois DataFrames através de uma coluna específica — disciplina_id, ou seja, vamos utilizar uma das colunas para orientar nosso JOIN entre tabelas.

pd.merge(df_new, df_n, on = 'disciplina_id')

Geralmente, é necessário que tenhamos uma coluna comum entre duas tabelas — correspondência, ou quantas tabelas estivermos trabalhando simultaneamente. Senão, o JOIN servirá meramente a caráter de concatenação de valores.

Passamos para a função merge do Pandas o df_new — DataFrame que fizemos a concatenação por linhas, df_n — referente a lista de disciplinas e testes e, então, apontamos a junção on pela coluna disciplina_id correspondente a ambos:

Acima, concluímos a junção de três tabelas diferentes! Geramos df_new da concatenação por semelhança de df_a com df_b, e partir dessa contenação, realizamos um JOIN de df_new com df_n que contém informações diferentes — nota dos alunos e uma coluna correspondente disciplina_id.

Isso que foi feito acima, também pode ser expandido para essa outra linha de código, que aliás é a mesma coisa:

pd.merge(df_new, df_n, left_on = 'disciplina_id', right_on = 'disciplina_id')

Uma opção é utilizar apenas o parâmetro on da função merge, a outra opção é utilizar o left_on ou right_on e depois omitir um ou outro para fazer um dos SQL JOINS vistos lá em cima:

Quando especificamos os dois left_on e right_on, é a mesma coisa do que usar apenas o parâmetro on do merge. Podemos ver acima que os resultados foram iguais.

Joins Específicos

Contudo, em alguns casos, desejamos aplicar um filtro um pouco mais específico:

OUTER JOIN (união de todos os dados)
INNER JOIN (dados comuns às duas tabelas).

Com a função merge especificamos o parâmetro how, de como queremos a junção: outer ou inner.

Outer Join

Merge entre df_a e df_b, com a correspondência de disciplina_id e de maneira outer.

Se não houver correspondência, o lado ausente conterá NaN:

pd.merge(df_a, df_b, on = 'disciplina_id', how = 'outer')

Inner Join

A junção interna produz apenas o conjunto de registros correspondentes entre df_a e df_b. Talvez fosse esse tipo de JOIN que buscávamos:

pd.merge(df_a, df_b, on = 'disciplina_id', how = 'inner')

Esse merge busca apenas onde encontra correspondência de disciplina_id de código 4 e 5 entre df_a e df_b, retornando estas exatas correspondências de ambas as tabelas. Os códigos 1, 2, 3, 6, 7, 8 não formam uma interseção, logo, estes últimos não são retornados em INNER JOIN.

Left Join

Podemos também realizar, apenas um LEFT JOIN, ou seja, buscamos a interseção de df_a e df_b, mas também todos os dados relativos à tabela da esquerda df_a.

pd.merge(df_a, df_b, on = 'disciplina_id', how = 'left')

Right Join

Podemos também realizar, apenas um LEFT JOIN, ou seja, buscamos a interseção de df_a e df_b, mas também todos os dados relativos à tabela da esquerda df_b.

pd.merge(

df_a,

df_b,

on = 'disciplina_id',

how = 'right')

Adicionar Sufixos

Podemos adicionar sufixos no momento que fizermos as junções, isso pode ser útil para identificarmos qual é a tabela da esquerda, qual é a tabela da direita e assim colocar as colunas com um nome correspondente — muito útil no processo de análise para sabermos com qual tabela trabalhamos.

pd.merge(

df_a,

df_b, on = 'disciplina_id',

how = 'left',

suffixes = ('_left', '_right'))

Join baseado em Índices

Podemos ainda, fazer JOIN baseado em índice. Basta apontar quais índices são verdadeiros — o resultado é basicamente parecido, mas com a notação baseada em índice.

pd.merge(

df_a,

df_b,

right_index = True,

left_index = True)

Se você trabalha com banco relacional, você comumente usa os comandos SELECT, se trouxer os dados para o Pandas, utilizará a função concat ou merge e indicando how fazer a junção.

Portanto, com a biblioteca Pandas somos capazes de executar a mesma manipulação que seria feita em um Banco Relacional, juntando vários arquivos ou tabelas em uma única estrutura de dados e seguir com nosso processo de análise.

Operações em colunas com pandas

1. Ordenar Valores

A classificação com DataFrame Pandas por uma ou mais colunas pode ser feita de forma ascendente ou descendente.

df.sort_values(by='column_name', ascending=False)

O primeiro argumento passado em by é a coluna ‘column_name’ do nosso DataFrame que gostaríamos ordenar.

Se você precisarmos classificar por várias colunas, devemos passar os valores para uma lista de colunas e uma outra lista de ordenação correspondente às colunas.

df.sort_values(by = ['col_1', 'col_2'], ascending = [True, False])

Este exemplo ordenou as colunas do DataFrame de forma distinta, um a em ordem ascendente e a outra em descendente.

2. Renomear Colunas

Há uma série de maneiras que você pode renomear colunas usando Pandas, vamos renomear tanto uma, como várias colunas.

pd.rename(columns = { 'original_col_name': 'new_col_name'})

Renomear várias colunas:

pd.rename(columns = {

'original_col1_name': 'new_col1_name',

'original_col2_name': 'new_col2_name',

'original_col3_name': 'new_col3_name'})

Outra maneira de renomear cabeçalhos de coluna usando o módulo Python Pandas é definir o atributo:

pd.columns = ['col_1', 'col_2', 'col_3']

3. Excluir Colunas

Podemos excluir uma ou mais colunas de um DataFrame Pandas ao mesmo tempo. Primeiro, vamos começar com a exclusão de uma coluna:

df.drop('column_name', axis=1)

Especificando axis garante que estamos dropando uma coluna no sentido correto de “coluna” e não em linha. Para dropar várias colunas, atualizar a passagem com uma lista de nomes de colunas.

df.drop(['col_1', 'col_2', 'col_3'], axis=1)

4. Agrupar Colunas

Agrupar dados com Pandas é uma maneira de resumir seus dados. Isso pode ser usado como base para plotar gráficos ou apenas para fornecer insights.

df.groupby('col_1').count()

O método pode ser usado em qualquer número de colunas e usado para agregar cada uma de diferentes maneiras. O código abaixo agrupará duas colunas e agregará a estas operações de soma e média, respectivamente.

df.groupby(['col_1', 'col_2']).agg(['sum', 'mean'])

5. Aplicar função a uma coluna

Uma maneira de aplicar uma função a todas as linhas em uma coluna de DataFrame pandas é usando o método .apply( ):

df['col'].apply(função específica)

Acima, uma coluna específica foi selecionada para a função (genérica neste caso) foi aplicada. As funções aplicadas podem ser embutidas, por exemplo, a função raiz quadrada numpy ou uma função definida pelo usuário.

df['col'].apply(lambda x: x**2 + 5)

Usando a função lambda, para cada linha na coluna, cada valor será quadrado e, em seguida, terá cinco adicionados aos valores dessa coluna.

Resumindo

Classificar por colunas df.sort_values()
Colunas de renomear df.rename()
Excluir colunas df.drop()
Grupo e agregado por colunas df.groupby().agg()
Aplique uma função a todas as linhas em uma coluna df.apply()

20 funções úteis em pandas

Pandas é uma biblioteca de análise de dados Python predominantemente usada, fornecendo muitas funções e métodos para agilizar o processo de análise de dados.

Neste passo a passo, veremos 20 funções de Pandas com exemplos, todos eles agregarão valor ao seu processo de análise de dados. Vamos direto ao ponto.

Importando bibliotecas

import numpy as np

import pandas as pd

1. query()

Comumente, precisamos filtrar um DataFrame com base em uma condição ou aplicar uma máscara para obter determinados valores. “query” é uma função de consulta fácil.

Primeiramente criamos um DataFrame para exemplificar nosso tutorial daqui para frente.

values_1 = np.random.randint(10, size=10)

values_2 = np.random.randint(10, size=10)

years = np.arange(2010,2020)

groups = ['A','A','B','A','B','B','C','A','C','C']

df = pd.DataFrame({

'group':groups,

'year':years,

'value_1':values_1,

'value_2':values_2})

pd.DataFrame( )

É muito simples usar a função de consulta, apenas requer a passagem da condição de filtragem.

df.query('value_1 < value_2')

df.query( )

2. Adicionar nova coluna: .insert()

Quando queremos adicionar uma nova coluna em um DataFrame, ela é adicionada no final por padrão. No entanto, Pandas oferece a opção de adicionar a nova coluna em qualquer posição usando função de inserção.

Precisamos especificar a posição passando um índice como primeiro argumento. Este valor deve ser um valor inteiro. Os índices das colunas começam de zero, assim como os índices de linha. O segundo argumento é o nome da coluna e o terceiro argumento é o objeto que inclui valores que podem ser Série ou um objeto semelhante a uma matriz.

# Nova coluna

new_col = np.random.randn(10)

# Inserindo a nova coluna na posição

2df.insert(2, 'new_col', new_col)df

3. Soma Cumulativa: .cumsum()

O DataFrame contém alguns valores anuais de 3 grupos diferentes (A, B, C). Podemos estar interessados apenas em valores mensais, mas há alguns casos em que também precisamos de uma soma acumulada. Pandas fornece uma função fácil de usar para calcular a soma cumulativa que é .cumsum().

# definindo soma cumulativa

df['cumsum_2'] = df[['value_2','group']].groupby('group').cumsum()

# ordenando grupos

df.sort_values(by=['group'])

.cumsum( )

Se aplicarmos apenas cumsum, grupos (A, B, C) serão ignorados. Esse tipo de valor cumulativo pode ser inútil em alguns casos porque não somos capazes de distinguir entre grupos, então devemos aplicar groupby e, em seguida, função cumsum.

4. Amostagem: .sample()

O método de amostragem permite selecionar valores aleatoriamente a partir de uma série ou DataFrame. É útil quando queremos selecionar uma amostra aleatória de uma distribuição

sample1 = df.sample(n=3);

sample1

Especificamos o número de valores com parâmetro n, mas também podemos passar uma razão para parâmetro frac. Por exemplo, 0,5 retornará metade das linhas.

sample2 = df.sample(frac=0.5); sample2

Exemplo amostra fracionada em 50%

Para obter amostras reprodutíveis, podemos usar o parâmetro random_state.

O valor inteiro passado para random_state gerará a mesma amostra será reproduzida toda vez que o código for executado.

5. Onde: .where()

“where” é usado para substituir valores em linhas ou colunas com base em uma condição.

Os valores de substituição padrão são NaN, mas também podemos especificar o valor a ser colocado como uma substituição.

df['new_col'].where(df['new_col'] > 0 , 0)

A forma como “where” funciona é que os valores que se encaixam na condição são selecionados e os demais valores são substituídos pelo valor especificado. where (df[‘new_col’]>0, 0) seleciona todos os valores em “new_col” que são maiores que 0 e os demais valores são substituídos por 0. Assim, onde também pode ser considerado como uma operação de máscara.

Um ponto importante é que “where” para Pandas e NumPy não são exatamente os mesmos. Podemos alcançar o mesmo resultado, mas com sintaxes ligeiramente diferente.

Com o DataFrame.where, os valores que se encaixam na condição são selecionados como estão e os outros valores são substituídos pelo valor especificado. Np.where requer também especificar o valor para aqueles que se encaixam na condição. As duas linhas a seguir retornam o mesmo resultado:

# aplicação com Pandas DataFrame

df['new_col'].where(df['new_col'] > 0, 0)

# aplicação com NumPy array

np.where(df['new_col'] > 0, df['new_col'], 0)

6. Contém: isin()

Usamos com frequência métodos de filtragem ou seleção ao trabalhar com DataFrames.

O método .isin é uma espécie de filtragem avançada. Podemos filtrar valores com base em uma lista de seleções.

years = ['2010','2014','2017']df[df.year.isin(years)]

Exemplo de filtragem .isin( )

7. Loc and iloc

Loc e iloc são usados para selecionar linhas e colunas.

loc: seleção por rótulos
iloc: selecão por posições

O método loc é usado para selecionar dados por rótulos. Os rótulos das colunas são os nomes das colunas. Precisamos ter cuidado com rótulos de linha. Se não atribuirmos nenhum índice específico, Pandas criará um índice inteiro por padrão. Assim, os rótulos de linha são inteiros a partir de 0 e subindo. As posições de linha que são usadas com iloc também são inteiros a partir de 0.

Selecionando as primeiras 3 linhas e as primeiras 2 colunas com iloc:

df.iloc[:3, :2]

Retornando 3 primeiras linhas e 2 primeiras colunas com iloc

Selecionando as primeiras 3 linhas e as primeiras 2 colunas com loc:

df.loc[:2, ['group', 'year']]

Retornando 3 primeiras linhas e 2 primeiras colunas com loc

8. Variação percentual: .pct_change

Esta função é usada para calcular a variação percentual através dos valores em uma série. Considere que temos uma série que contém [2, 3, 6]. Se aplicarmos pct_change a esta série, a série devolvida será [NaN, 0.5, 1.0], ou seja, há um aumento de 50% do primeiro para o segundo e 100% do segundo para o terceiro.

pct_change é umaunção é útil para comparar a porcentagem de mudança em uma série de elementos temporizados.

df.value_1.pct_change()

Variação percentual de valores de um atributo (Series)

9. Ranqueamento: .rank()

A função de classificação atribui classificação aos valores. Suponha que temos uma série que contém [1, 7, 5, 3]. As classificações atribuídas a esses valores serão [1, 4, 3, 2].

# criando atributo de ranqueamento "rank_1" através de "value_1"

df['rank_1'] = df['value_1'].rank(); df

Ranqueando valores de uma coluna com .rank

10. Derreter: .melt()

Melt é usado para converter DataFrames “largos” em “estreitos”, ou seja, DataFrame largo é quando temos alto número de colunas. Em alguns casos, representar essas colunas como linhas pode se encaixar melhor em nossa tarefa. Considere o seguinte dataframe:

Criando df_wide de dimensão “larga”

Temos três cidades (A, B e C) e medições diferentes feitas em 5 dias diferentes. Decidimos representar esses dias como filas em uma coluna. Haverá também uma coluna para mostrar as medidas. Podemos facilmente conseguir isso usando a função melt:

df_wide.melt(id_vars=['city'])

“Derretendo” o DataFrame com a função .melt( )

Nomes de colunas de variável e valor são dados por padrão. Podemos usar var_name e value_name parâmetros da função melt para atribuir novos nomes de colunas.

11. Explodir: .explode()

Vamos assumir que o conjunto de dados inclui várias entradas de um recurso em uma única observação (3a linha de measurement), mas desejamos analisá-los em linhas separadas.

Criando linha com mais de uma observação.

df1.explode('measurement').reset_index(drop=True)

Explodindo a linha em mais linhas com .explode

12. .nunique()

nunique( ) conta o número de entradas únicas sobre colunas ou linhas. É muito útil em características categóricas especialmente nos casos em que não sabemos o número de categorias de antemão. Vamos dar uma olhada no nosso dataframe lá do início:

df.year.nunique()

df.group.nunique()

nunique( ) retorna valores únicos de anos e de grupos

Podemos aplicar diretamente a função nunique ao dataframe e ver o número de valores únicos em cada coluna:

df.nunique()

Se o parâmetro do eixo for definido para (axis=1), nunique retorna o número de valores únicos em cada linha.

13. Busca com: .lookup()

Ele pode ser usado para procurar valores no DataFrame com base nos valores em outras linhas, pares de colunas. Suponha que tenhamos o seguinte DataFrame:

Para cada dia dos 8 dias, temos medições de 4 pessoas e uma coluna que inclui os nomes dessas 4 pessoas.

Queremos criar uma nova coluna que mostre a medição da pessoa na coluna “person”. Assim, para a primeira linha, o valor na nova coluna será de 4 (o valor na coluna “Alex”).

df['Person_point'] = df.lookup(df.index, df['Person']); df

14. Inferir tipos de dados: infer_objects

Pandas suporta uma ampla gama de tipos de dados, um dos quais é objeto. O objeto abrange valores texte ou mistos (numéricos e não numéricos). No entanto, não é preferível usar o tipo de dados do objeto se uma opção diferente estiver disponível. Certas operações são executadas mais rapidamente com tipos de dados mais específicos. Por exemplo, preferimos ter tipo de dados inteiro ou flutuante para valores numéricos.

infer_objects tenta inferir melhores tipos de dados para colunas de objetos. Considere o seguinte dataframe:

df.dtypes

Todos os tipos de dados são objeto. Vamos ver quais são os tipos de dados inferidos:

df.infer_objects().dtypes

Pode parecer trivial, mas definitivamente será útil quando houver muitas colunas.

15. Uso de memória: .memory_usage()

Memory_usage() retorna quanta memória cada coluna usa em bytes. É útil especialmente quando trabalhamos com grandes dataframes. Considere o dataframe a seguir com 1 milhão de linhas.

df_large = pd.DataFrame(

{'A': np.random.randn(1000000),

'B': np.random.randint(100, size=1000000)})

df_large.shape

(1000000, 2)

E o uso de memória para cada coluna em bytes:

df_large.memory_usage()

Uso de memória

Uso de memória de todo o DataFrame em megabytes:

df_large.memory_usage().sum() / (1024**2)

Uso de memória do DataFrame

16. Estatística básica: .describe()

Descrever a função calcula estatísticas básicas para colunas numéricas que são contagem, média, desvio padrão, valores mínimos e máximos, mediana, primeiro e terceiro quartil. Assim, fornece um resumo estatístico do DataFrame.

Estatísticas básicas

17. Mesclar: .merge()

A Mesclagem combina DataFrames com base em valores em colunas compartilhadas. Considere os dois DataFrames a seguir:

Podemos mesclá-los com base em valores compartilhados em uma coluna. O parâmetro que define a condição para a fusão é o parâmetro “on”.

df1 e df2 são mesclados com base nos valores comuns em column_a.

O parâmetro da função de mesclagem permite combinar dataframes de diferentes maneiras. Os possíveis valores de como são ‘interior’, ‘externo’, ‘esquerda’, ‘direita’.

interior: apenas linhas com os mesmos valores na coluna especificada pelo parâmetro (valor padrão de como parâmetro)
exterior: todas as linhas
esquerda: todas as linhas do DataFrame esquerdo
direito: todas as linhas do DataFrame direito

18. Select_dtypes

Select_dtypes função retorna um subconjunto das colunas do DataFrame com base no conjunto de condições em tipos de dados. Permite incluir ou excluir certos tipos de dados usando parâmetros de inclusão e exclução.

df.select_dtypes(include='int64')

df.select_dtypes(exclude='int64')

19. Substituir: .replace()

Como o nome sugere, ele permite substituir valores em um dataframe.

df.replace('A', 'A_1')

O primeiro parâmetro é o valor a ser substituído e o segundo é o novo valor.

Também podemos passar em um dicionário para várias substituições ao mesmo tempo.

df.replace({'A':'A_1', 'B':'B_1'})

20. Aplicar: .applymap()

A função applymap( ) é usada para aplicar uma função a um quadro de dados elementwise. Observe que se uma versão vetorizada de uma operação estiver disponível, ela deve ser preferida em vez do apply( ).

Por exemplo, se quisermos múltiplos cada elemento por um número, não precisamos e não devemos usar a função applymap( ). Uma operação simples vetorizada (por exemplo, df * 4 ) é muito mais rápida nesse caso.

No entanto, pode haver alguns casos em que não temos a opção de operação vetorializada. Por exemplo, podemos alterar o estilo de um dataframe usando a propriedade Style de dataframes pandas. A função a seguir altera a cor dos valores negativos como vermelho.

def color_negative_values(val):

color = 'red' if val < 0 else 'black'

return 'color: %s' % color

Precisamos usar a função applymap para aplicar essa função a um dataframe:

df.style.applymap(color_negative_values)

8 ways to manipulate pandas dataframe

Pandas is a data analysis and manipulation library for Python. Pandas' central data structure is the DataFrame that stores tabular-shaped data with labeled rows and columns.

Pandas provide a variety of functions for modifying or manipulating their central structure. In this article, we'll go through 8 different ways to transform DataFrames.

We started by importing the libraries.

import numpy as np

import pandas as pd

As examples, we will use the following dataframe:

1. Add column

The first and most important way to transform is to add or drop columns. For example, a new column can be added as follows:

df['new'] = np.random.random(5)

We generate the values in a list and assign a name to the new column. But, first, we must make sure that the size of the array is compatible with the size of the DataFrame.

2. Add line

We can use the loc method to add a single row to a DataFrame.

df.loc[5,:] = ['Jack', 3, 3, 4, 5, 1]

The loc function specifies the rows and columns with their labels. The expression [5, :] indicates a line labeled five and all its columns. We can use the drop function with axis parameter = 0 to drop the line.

df.drop(5, axis=0, inplace=True)

We just dropped the line that is added in the previous step.

3. Drop

The drop function is used to drop a column.

df.drop('new', axis = 1, inplace = True)

We passed the column name to be discarded. The axis parameter is set to 1 to indicate that we are dropping in the direction of the column. Finally, the inplace parameter needs to be True to save the changes directly to the df dataframe.

4. Insert

The insert function adds a column to a specific position.

df.insert(0, 'new', np.random.random(5))

The first parameter specifies the index/position of the column. The second and third parameters are the column name and its values, respectively.

5. Melt

The melt function converts a large data frame (high number of columns) to a long (increased number of rows). Consider following the DataFrame:

Contains consecutive daily measurements for three people in long format:

pd.melt(df, id_vars='names').head()

The column passed to the id_vars remains the same, and the other columns are combined with it.

6. Concat

The concat function combines multiple data frames along an axis (index or columns). For example, consider the following two data frames.

Here's how combining them with concat:

pd.concat([df, df2], axis=0, ignore_index=True)

We set the axis parameter to zero to match along with the index. The parameter ignore_index resets the index in sequential order. Otherwise, the original indexes are retained (0,1,2,0,1,2). To combine along with the columns, the axis parameter is set to 1.

7. merge

The merge function also combines data frames based on shared values in a given column or columns.

We can use them based on the id column:

customer.merge(order, on='id')

By default, only existing values on both data frames are taken. However, we can change it with the parameter "by." For example, if we set it ''left', all values in the left data frame are taken, and only the corresponding ones are taken from the proper data frame.

8.get_dummies

Some machine learning models cannot handle categorical variables. In these cases, we must encode categorical variables so that each category is represented as a column.

The function get_dummies encodes the name and the "ctg" columns.

pd.get_dummies(df)

For example, in the first row, the name is Jane, and the ctg is A. Thus, the columns representing these values are value 1, and all other columns are 0.

9. pivot_table

The function pivot_table transforms a data frame into a format that explains the relationship between variables.

The data frame on the left contains two categorical characteristics (i.e., columns) and a numeric feature.

We want to see the average value of the categories in both columns. Thus, the function pivot_table transforms the data frame to see the average deals or any other aggregation.

df.pivot_table(index='name', columns='ctg', aggfunc='mean')

What we saw here

We've covered eight different ways to transform a data frame. Some of them only change the structure, while some perform a calculation during the transformation.

We also saw how to add or drop rows/columns. In addition, join and concatenation functions transform data frames by combining them. All of these operations are critical parts of the data analysis and manipulation process.

Boosting Joings using sql With Pandas

Study notebook

Visit Jupyter Notebook to see the concepts that we will cover about SQL Join in Pandas. Note: Important functions, outputs, and terms are bold to facilitate understanding — at least mine.

Import libraries

The first step is to import Pandas so we can use the packages, methods, and attributes, and the NumPy package to create the Arrays:

import pandas as pd

import numpy as np

To query the data in a relational database, we use the SQL language to manipulate data in relational databases. For example, when we work with more than one table, we run SQL Join commands, which fetches data from 2, 3, 4, or 5 tables simultaneously.

Import image

The image summarizes some types of joins through SQL Joins:

from IPython.display import Image

Image(url = 'https://upload.wikimedia.org/wikipedia/commons/9/9d/SQL_Joins.svg')

Inner Join

Look at the top center figure, in which we have two tables — table A and table B. The join type returns the data that is common in both tables. For this, we have the specific join type — Inner Join;

Right Join

The following Join is Right Join — we have tables A and B, we want to return the shared data in both tables, but also all the data belonging to table B (on the right side);

Left Join

The same reasoning is for Left Join — we have tables A and B, we want to return the shared data in both tables, but also all the data belonging to table A (on the left side);

Full Outer Join

We can search all data with Full Outer — we have tables A and B; we want to return the shared data to the tables and all the data belonging to table A and B.

How to do this with Pandas?

Create DataFrames

First, we will create some datasets from dictionaries, that is, key: value pairs. With the dictionary created, we can convert it to a DataFrame and indicate the column names.

Imagine that we coordinate a school that records grades for each class of students, and each of these classes has an individual spreadsheet. How would we join these worksheets to perform the analysis process?

Data1: discipline code, first and last name

Suppose we have CSV files or spreadsheets from a class at any school, where we have the subjects and the students’ first and last names — here we’ll create small DataFrames for teaching purposes:

data1 = {'subject_id': ['1', '2', '3', '4', '5'], 'name': ['Bernardo', 'Alan', 'Mateus', 'Ivo', 'Gerson'], 'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']}

df_a = pd.DataFrame(data1, columns = ['subject_id', 'name', 'surname']); df_a

Data 2: code discipline, first and last name

We again have another DataFrame of disciplines, first and last name:

data2 = {'subject_id': ['4', '5', '6', '7', '8'], 'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'], 'surname': ['Sampaio', 'Fernandes', 'Arantes', 'Menezes', 'Martins']}

df_b = pd.DataFrame(data2, columns = ['subject_id', 'name', 'surname']);
df_b

Data 3: discipline and test grade

We have the code of the disciplines and the corresponding test grade:

data3 = {'discipline_id': ['1','2','3','4','5','7','8','9','10','11'], 'test_id': [81, 75, 75, 71, 76, 84, 95, 61, 57, 90]}

df_n = pd.DataFrame(data3, columns = ['subject_id','test_id']);
df_n

What to do from now on?

Now we’ll put these three tables together to perform the analysis process — we do this all the time when working with relational data analysis, even with unstructured data analysis, since we will hardly receive a ready, uniform, and clean file.

So we need to perform the cleanup process, merge everything into a single Dataframe, do the preprocessing, and then predictive modeling — in which case we could predict the grade of students in some tests. But, first, we need to leave the data clean and organized to assist in handling it.

There is no exaggeration in stating that up to 80% of a data scientist’s time is used to organize, manipulate, clean, and transform data. Each business problem will require a different approach to different data from various sources.

Joining tables with Pandas

We will now see how to join DataFrames using Pandas.

Let’s start by concatenating the two df_a and df_b DataFrames in the direction of the lines; that is, let’s take all the 1st DataFrame and join with the Pandas concat function, all lines of the 2nd Dataframe.

Concatenate by lines

The concat function will receive a list of DataFrames:

df_new = pd.concat([df_a, df_b]); df_new

Notice how the indexes turned out. We have the first DataFrame df_a index from 0 to 4, then the second DataFrame df_b from 0 to 4 again. It is crucial to adjust these indexes later!

Concatenate by columns

Once we’ve done line concatenation, we can also concatenate by columns — add the axis=1 parameter:

pd.concat([df_a, df_b], axis = 1)

See that we have now made the tables join in the columns direction. This join type allows us to concatenate two CSV files that are most convenient or necessary for our analysis.

Join of two DataFrames per Column

Now we’ll join two data frames through a specific column — subject_id; that is, we’re going to use one of the columns to guide our Join between tables.

Generally, we need to have a common column between two tables — matching how many tables we are working on simultaneously. If not, join will serve merely to concatenate values.

pd.merge(df_new, df_n, on = 'subject_id')

Attention here

We passed to the merge function df_new — df_new is the DataFrame that we did concatenating by rows df_a and df_b, the df_n — refers to the list of disciplines and tests, and then we point the junction on by the subject_id column corresponding to both:

Above, we conclude the joining of three different tables! First, we generate df_new through concatenation by the similarity of df_a with df_b, and from this, we joined df_new with df_n that contains extra information — student grade and a corresponding column subject_id.

It that we did above can also be expanded to this other line of code below:

pd.merge(df_new, df_n, left_on = 'subject_id', right_on = 'subject_id')

One option is to use the parameter of the merge function, and the other option is to use left_on or right_on and then omit one or the other to make one of the SQL joins.

When we specify the left_on and right_on, it’s the same as just the merge on parameter. So we can see above that the results were the same.

Specific Joins

However, in some cases, we want to apply a slightly more specific filter:

Outer Join (union of all data)
Inner Join (data common to both tables).

With the merge function, we specify how we want the Join: outer or inner.

Outer Join

Merge between df_a and df_b, with the subject_id and outer way:

# If there is no match, the missing side will contain NaN

pd.merge(df_a, df_b, on = 'subject_id', how = 'outer')

Inner Join

The inner join produces only the corresponding record set between df_a and df_b. So maybe that’s the kind of join we were looking for:

pd.merge(df_a, df_b, on = 'subject_id', how = 'inner')

This merge searches find subject_id of code 4 and 5 between df_a and df_b, returning these exact matches from both tables. However, codes 1, 2, 3, 6, 7, 8 do not form an intersection, so they are not returned with Inner Join.

Left Join

We can also perform only a Left Join. Therefore, we seek the intersection of df_a and df_b and all the data relating to the table on the left df_a.

pd.merge(df_a, df_b, on = 'subject_id', how = 'left')

Right Join

We can also perform only a Right Join. We seek the intersection of df_a and df_b and all the data related to the table on the left df_b.

pd.merge(df_a, df_b, on = 'subject_id', how = 'right')

Add suffixes

We can add suffixes the moment we make the joins. Suffixes can be useful to identify which table is on the left, the table on the right, and thus put the columns with a corresponding name — useful in the analysis process to know which table we are working on.

pd.merge(df_a, df_b, on = 'subject_id', how = 'left', suffixes = ('_left', '_right'))

Join based on Indexes

We can also do join based on the index. Just point out which indexes are true — the result is basically similar, but with index-based notation.

pd.merge(df_a, df_b, right_index = True, left_index = True)

If you work with a relational database, you commonly use select commands. For example, if you bring the data to Pandas, you will use the Concat function or merge function and indicating how to join.

Therefore, we can perform the same manipulation done in a Relational Bank with the Pandas library, joining multiple files or tables in a single data structure and proceeding with our analysis process.

And there we have it. I hope you have found this useful. Thank you for reading. 🐼

Pandas Cheat Sheet for Data Science

Study notebook

Visit Jupyter Notebook to see the concepts that we will cover about SQL Join in Pandas. Note: Important functions, outputs, and terms are bold to facilitate understanding — at least mine.

Import libraries

The first step is to import Pandas so we can use the packages, methods, and attributes, and the NumPy package to create the Arrays:

import pandas as pd

import numpy as np

Import image

The image summarizes some types of joins through SQL Joins:

from IPython.display import Image

Image(url = 'https://upload.wikimedia.org/wikipedia/commons/9/9d/SQL_Joins.svg')

Inner Join

Right Join

The following Join is Right Join — we have tables A and B, we want to return the shared data in both tables, but also all the data belonging to table B (on the right side);

Left Join

The same reasoning is for Left Join — we have tables A and B, we want to return the shared data in both tables, but also all the data belonging to table A (on the left side);

Full Outer Join

We can search all data with Full Outer — we have tables A and B; we want to return the shared data to the tables and all the data belonging to table A and B.

How to do this with Pandas?

Create DataFrames

First, we will create some datasets from dictionaries, that is, key: value pairs. With the dictionary created, we can convert it to a DataFrame and indicate the column names.

Data1: discipline code, first and last name

Suppose we have CSV files or spreadsheets from a class at any school, where we have the subjects and the students’ first and last names — here we’ll create small DataFrames for teaching purposes:

data1 = {'subject_id': ['1', '2', '3', '4', '5'], 'name': ['Bernardo', 'Alan', 'Mateus', 'Ivo', 'Gerson'], 'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']}

df_a = pd.DataFrame(data1, columns = ['subject_id', 'name', 'surname']);
df_a

Data 2: code discipline, first and last name

We again have another DataFrame of disciplines, first and last name:

data2 = {'subject_id': ['4', '5', '6', '7', '8'], 'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'], 'surname': ['Sampaio', 'Fernandes', 'Arantes', 'Menezes', 'Martins']}

df_b = pd.DataFrame( data2, columns = ['subject_id', 'name', 'surname']);

df_b

Data 3: discipline and test grade

We have the code of the disciplines and the corresponding test grade:

data3 = {'discipline_id': ['1','2','3','4','5','7','8','9','10','11'], 'test_id': [81, 75, 75, 71, 76, 84, 95, 61, 57, 90]}

df_n = pd.DataFrame(data3, columns = ['subject_id','test_id']);
df_n

What to do from now on?

Joining tables with Pandas

We will now see how to join DataFrames using Pandas.

Concatenate by lines

The concat function will receive a list of DataFrames:

df_new = pd.concat([df_a, df_b]); df_new

Notice how the indexes turned out. We have the first DataFrame df_a index from 0 to 4, then the second DataFrame df_b from 0 to 4 again. It is crucial to adjust these indexes later!

Concatenate by columns

Once we’ve done line concatenation, we can also concatenate by columns — add the axis=1 parameter:

pd.concat([df_a, df_b], axis = 1)

See that we have now made the tables join in the columns direction. This join type allows us to concatenate two CSV files that are most convenient or necessary for our analysis.

Join of two DataFrames per Column

Now we’ll join two data frames through a specific column — subject_id; that is, we’re going to use one of the columns to guide our Join between tables.

Generally, we need to have a common column between two tables — matching how many tables we are working on simultaneously. If not, join will serve merely to concatenate values.

pd.merge(df_new, df_n, on = 'subject_id')

Attention here

It that we did above can also be expanded to this other line of code below:

pd.merge(df_new, df_n, left_on = 'subject_id', right_on = 'subject_id')

One option is to use the parameter of the merge function, and the other option is to use left_on or right_on and then omit one or the other to make one of the SQL joins.

When we specify the left_on and right_on, it’s the same as just the merge on parameter. So we can see above that the results were the same.

Specific Joins

However, in some cases, we want to apply a slightly more specific filter:

Outer Join (union of all data)
Inner Join (data common to both tables).

With the merge function, we specify how we want the Join: outer or inner.

Outer Join

Merge between df_a and df_b, with the subject_id and outer way:

# If there is no match, the missing side will contain NaN

pd.merge(df_a, df_b, on = 'subject_id', how = 'outer')

Inner Join

The inner join produces only the corresponding record set between df_a and df_b. So maybe that’s the kind of join we were looking for:

pd.merge(df_a, df_b, on = 'subject_id', how = 'inner')

Left Join

We can also perform only a Left Join. Therefore, we seek the intersection of df_a and df_b and all the data relating to the table on the left df_a.

pd.merge(df_a, df_b, on = 'subject_id', how = 'left')

Right Join

We can also perform only a Right Join. We seek the intersection of df_a and df_b and all the data related to the table on the left df_b.

pd.merge(df_a, df_b, on = 'subject_id', how = 'right')

Add suffixes

pd.merge(df_a, df_b, on = 'subject_id', how = 'left', suffixes = ('_left', '_right'))

Join based on Indexes

We can also do join based on the index. Just point out which indexes are true — the result is basically similar, but with index-based notation.

pd.merge(df_a, df_b, right_index = True, left_index = True)

If you work with a relational database, you commonly use select commands. For example, if you bring the data to Pandas, you will use the Concat function or merge function and indicating how to join.

And there we have it. I hope you have found this useful. Thank you for reading. 🐼

Guia de análise exploratória de dados

Nesse tutorial vamos explorar o dataset de aluguel de bicicletas, realizar transformações e reorganizar os dados como se realmente estivéssemos preparando os dados para modelagem e criação de modelos.

Acesse o Jupyter Notebook para consultar os conceitos que serão abordados sobre Análise Exploratória de Dados. Obs: as funções, outputs e termos importantes estão em negrito para facilitar a compreensão — pelo menos a minha.

É comum que recebamos os dados para resolver um problema qualquer e precisemos analisar e explorar esses dados, buscar relacionamentos, buscar como as variáveis estão organizadas, encontrar ou não a necessidade de transformar os dados — tudo isso é feito através da Análise Exploratória.

Objetivo da Análise

Analisar dados de aluguel de bikes usadas como táxis na cidade de Nova Iorque. A cidade de Nova Iorque é um verdadeiro caos urbano, o trânsito é caótico. Por conta disso, é muito comum encontrar bikes que são usadas como táxis, que comportam mais de uma pessoa. Esse dataset representa esse tipo de aluguel.

Importar Biblioteca

Vamos importar os pacotes com os quais trabalharemos:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inlinepd.__version__

Carregar Dataset

Chamamos a função read_csv do Pandas e passamos para o objeto df:

df = pd.read_csv('data/taxis_bikes_nycity.csv')

Checar Tipo

Podemos conferir que temos um objeto do tipo DataFrame para manipular:

print(type(df))

Visualizar Dados

A primeira coluna é o índice criado automaticamente por Pandas, as demais colunas foram coletadas da cidade de New York — data, distância do trajeto e tempo do trajeto.

df.head(10)

Checar Tipo dos Dados

Vamos ver como Pandas determinou os tipos de cada coluna ao carregá-los:

df.dtypes

Data object #stringDistancia float64 #decimalTempo object #stringdtype: object

Entretanto, a primeira coluna não se trata de um formato string, mas sim um formato de data. No momento que pandas fez a importação, ele não detectou que aquela coluna era uma data — isso afeta nosso processo de análise.

Checar Colunas

Retorna uma lista de colunas:

df.columns

Index(['Data', 'Distancia', 'Tempo'], dtype='object')

Checar Índices

Inicia em 0 e termina em 81, saltando de 1 em 1.

df.indexRangeIndex(start=0, stop=81, step=1)

Olhar apenas para uma Coluna

Chamamos o DataFrame df e filtramos pela coluna Data:

df['Data'].head()

0 8/2/151 8/3/152 8/4/153 8/6/154 8/7/15

Name: Data, dtype: object

Converter tipo da coluna Data

A coluna Data foi carregada como tipo string, mas podemos mudar esse comportamento no momento que carregamos os dados. Para isso, basta incluir o parâmetro parse_dates da função read_csv:

df = pd.read_csv('data/taxis_bikes_nycity.csv', parse_dates = ['Data'])

df['Data'].head()

0 2015-08-021 2015-08-032 2015-08-043 2015-08-064 2015-08-07Name: Data, dtype: datetime64[ns]

Como podemos ver, o dtype de ‘Data’, retorna como datetime64, a conversão foi executada com sucesso. Essa alteração é fundamental de ser feita, visto que cada tipo de objeto tem atributos e métodos específicos.

Definir Data como Índice

Definiremos a coluna Data como sendo índice da nossa tabela df. Se formos analisar os dados ao longo do tempo, faz sentido usar a Data como índice.

Ao colocar a Data como sendo o índice, temos ao nosso favor todo o poder da indexação que é oferecido por Pandas:

df.set_index('Data', inplace = True)

df.head(10)

Agora podemos realizar uma série de fatiamentos, montar séries temporais para acompanhar esses dados ao longo do tempo — temos muito mais possibilidades. Essa conversão depende, é claro, do objetivo.

Criar plot

Quando chamamos a função plot, ela automaticamente reconheceu que a data está como índice e criou uma série temporal.

df.plot()

distância percorrida ao longo do tempo

Podemos ver acima a evolução da distância do trajeto de Bikes ao longo do tempo. Nitidamente conseguimos extrair a informação de que há uma média de distância por volta de 5 e em um dado momento há um pico relevante.

Consultar help

Assim somos capazes de visualizar todos os parâmetros, alguns exemplos de como usar a função, valores que podemos usar, etc.

?df.plot

Alterar Tipo de Gráfico

Faremos algumas alterações no gráfico, iniciando pelo seu tipo. Vamos ajustar o parâmetro kind para gráfico de barras:

df.plot(kind = 'bar')

plt.show()

Essa poluição em preto abaixo do gráfico, são as legendas por conta da alta compactação de informação.

Alterar gráfico

Ajustar o parâmetro king para um gráfico de área e determinando o nome do eixo y:

df.plot(kind = 'area')

plt.ylabel("Count")

Atenção!

O pico que vemos no final do gráfico, embora ele mostre que alguma coisa relevante ocorreu, se mantivermos esse pico, teremos problemas futuramente na modelagem preditiva — temos uma média de distância de constante, o pico é um outlier que foge completamente da média.

O ideal seria pegar o outlier, analisá-lo separadamente e removê-lo do conjunto de dados — assim teremos a média preservada para treinar o conjunto de dados.

Resumo Estatístico

Obter o resumo estatístico desse conjunto de dados com um simples método:

df.describe()

Slicing

Podemos fazer o fatiamento por data, visto que Data agora é um índice:

df['2015-11']

Perceba que o filtro foi feito pelo mês e ano, então o retorno foi de todos os dias do mês 11, do ano 2015 — Essa é uma grande vantagem de ter colocado a data como índice.

Comprimento

Podemos ter o comprimento total do filtro com a função len:

len(df['2015-11'])13

Até aqui, executamos cada linha do Jupyter Notebook. Porém, se fecharmos o browser, deveremos executá-lo por completo novamente.

Salvar em CSV

Portanto, uma dica é salvar o resultado intermediário em um arquivo csv. Para isso, chamamos a função to_csv, do DataFrame df e gravamos tudo armazenado em df em um arquivo csv chamado dataframe_saved_v1

df.to_csv('diretorio/dataframe_saved_v1.csv')

Portanto, podemos ir salvando os resultados intermediários para não perdermos o que foi feito e não ser necessário executar o notebook todo novamente.

Terminamos nosso dia de trabalho e usamos .to_csv

Notebook de Estudo —2

Carregar Dados Salvos

Vamos carregar o arquivo que salvamos através da função to_csv:

df = pd.read_csv('diretorio/dataframe_saved_v1.csv')

df.head()

Verificar Tipo

Muito cuidado aqui! Repare que a coluna Data foi carregada e retornou ao formato de object!

Ao carregarmos o csv precisamos determinar o parse_dates todas as vezes:

df.dtypesData

objectDistancia float64Tempo objectdtype: object

Recarregar Arquivo

Dessa vez, vamos carregar o arquivo determinando novamente parse_dates:

df = pd.read_csv('data/dataframe_saved_v1.csv', parse_dates = ['Data'])

df.dtypesData

datetime64[ns]Distancia float64Tempo objectdtype: object

Ao verificar novamente o tipo, podemos ter a segurança em seguir com a Data no formato datetime64.

Ajustar Nome das Colunas

Aqui temos um código para ajustar o nome das colunas, não sendo necessário neste caso — os nomes já estão na nomenclatura correta.

Basta criar uma lista com as strings que desejamos e atribuímos essa lista ao atributo columns do DataFrame df:

cols = ['Data', 'Distancia', 'Tempo']

df.columns = cols

colsdf.head()

Definir Data como Índice

Vamos definir a coluna Data como índice com a função set_index mais uma vez e verificar a alteração com head:

df.set_index('Data', inplace = True)

df.head()

Plotar gráfico

df.plot()

Converter Tempo em Segundos

Durante a fase de análise exploratória podemos identificar algumas necessidades, como por exemplo, quebrar a coluna de tempo em partes.

Primeiramente, usamos a palavra reservada def, damos nome à função de calcula_total_segundos e passamos um parâmetro time.

A função verificará se o parâmetro passado é do tipo NAN, caso seja, retornará NAN e seguirá adiante.

Depois apontaremos para as horas, minutos e segundos da coluna onde há o formato time, no caso a coluna tempo de duração da viagem— fará o split da coluna por “ : “, dividindo a coluna a cada vez que encontrar “:” e extraindo o tempo, de modo a separar em hrs, mins e seconds.

Finalmente é feito um cálculo para que possamos converter hrs, mins e seconds em segundos totais. Ou seja, ao invés de ter a duração no formato de tempo cronometrado, teremos o tempo padronizado em segundos totais:

def calcula_total_segundos(time): if time is np.nan: return np.nan hrs, mins, seconds = str(time).split(':') seconds = int(seconds) + 60 * int(mins) + 60 * 60 * int(hrs) return seconds

Mapear Função

Vamos aplicar a função à coluna Tempo, com a função map, e gravamos em uma nova coluna chamada Segundos em df:

df['Segundos'] = df.Tempo.map(calcula_total_segundos)

df.head(10)

Temos a coluna Tempo intacta e agora a nova coluna Segundos, com a conversão para segundos totais da duração do trajeto — Muito útil.

Resumo Estatístico

Resumo estatístico para a coluna distância e segundos que estão em formato numérico:

df.describe()

Preencher Valores Missing

Caso ache necessário, podemos preencher os valores NaN com 0 e depois chamar o resumo estatístico:

df.fillna(0).describe()

Os valores estatísticos agora mudaram porque agora preenchemos os valores ausentes com 0 — temos mais técnicas para resolver isso, veremos depois.

Converter Segundos em Minutos

Agora podemos quantificar quantos Minutos duraram o Tempo do trajeto, visto que já temos a contagem por Segundos.

Vamos pegar a coluna Segundos e aplicamos a função map. Como parâmetro passaremos uma função anônima utilizando a expressão lambda — para cada valor de x, ou seja, para cada elemento, pegar o elemento e dividí-lo por 60 para cada valor em Segundos e gravar a saída em Minutos:

df['Minutos'] = df['Segundos'].map(lambda x: x / 60)

Preencher valores NAN com 0 e solicitar o resumo estatístico:

df.fillna(0).describe()

Agora temos mais uma coluna, a coluna Minutos. Portanto, estamos quebrando a coluna Tempo e até aqui já temos o total de Segundos e o total de Minutos quantificados

Plot

Vamos aproveitar que o DataFrame já está mais robusto e traçar um gráfico de dispersão entre Distancia e Minutos:

df.plot(x = 'Distancia', y = 'Minutos', kind = 'scatter')

Claramente há uma relação positiva entre a variável Distancia e a variável Minutos. À medida que Distancia aumenta, leva-se mais Minutos para aquele trajeto — isso mostra que há coerência nos dados.

Verificar correlação — Pearson

Podemos chamar o método corr para verificar a correlação entre as variáveis, criando a tabela de correlação de variáveis. A diagonal 1 é apenas a relação de uma variável consigo própria — a mais alta possível.

Para o cálculo da correlação padrão, é usado coeficiente de Pearson.

df.corr()

Verificar correlação — Spearman

Alterando o coeficiente para Spearman:

df.corr(method = 'spearman')

Verificar correlação — Kendall

Alterando o coeficiente para Kendall:

df.corr(method = 'kendall')

Plotar Boxplot

É um gráfico que nos ajuda a visualizar outliers, valores que fogem muito da média da distribuição — valores extremos.

No gráfico, os círculos representam as viagens que duraram muito mais tempo do que a média, sendo preciso investigar esses casos.

df.boxplot('Minutos', return_type = 'axes')

Criar outras Variáveis

Podemos criar uma variável que indica a minutagem por quilômetro:

df['Min_Por_Km'] = df['Minutos'] / df['Distancia']df.fillna(0).describe()

Aplicar Histograma

O histograma nos mostra a distribuição de frequência de uma variável, neste caso a nossa nova variável Min_Por_Km:

df.hist('Min_Por_Km')

Veja que temos a grande maioria dos trajetos entre 7.0 e 7.5 minutos — isso nos mostra como os dados estão distribuídos. Dependendo do modelo de Machine Learning, talvez precise normalizar os dados.

Ajustar Colunas do Histograma

Podemos alterar a quantidade de colunas com bins:

df.hist('Min_Por_Km', bins = 20)

Salvar Gráfico

Caso o gráfico seja útil para análise, podemos usar a função .savefig() para colocar em um relatório ou apresentacão em PowerPoint. Portanto, criamos o histograma, definimos seus limites, colocamos título, optamos por grade e depois salva o arquivo .png:

df.hist('Min_Por_Km', bins = 20, figsize = (10, 8))

plt.xlim((5, 11))

plt.ylim((0, 12))

plt.title("Histograma Minutos Por Km")

plt.grid(False)

plt.savefig('graficos/hist_minutos_por_km.png')

Plot de uma Variável

Um simples plot de uma única variável do conjunto:

df['Distancia'].plot()

df.head(15)

Aplicar Janelas de Tempo

Aqui temos um conceito muitíssimo interessante, as janelas. Esse conceito é aplicado com maior frequência ao utilizar o Apache Spark — trabalhar com dados em tempo real.

Dentro das janelas aplicamos as rolling statistics, que são estatísticas calculadas especificamente dessa janela. Ou seja, os dados estão passando, pegamos uma janela, olhamos para esses dados e aplicamos estatísticas a eles.

df['Distancia'].plot()pd.Series(df['Distancia']).rolling(window = 2).mean().plot()

Primeiramente convertemos a coluna em uma Series, aplicamos a função rolling — para aplicar uma rolling statistics dentro dessa janela com duração de duas horas e aplicamos a média.

Muito Interessante!

A linha azul referia-se à coluna Distância que já estava plotada, agora a nova linha laranja refere-se à média criada na janela!

Em um único gráfico, temos a distância percorrida pelas corridas ao longo do tempo e também a média das distâncias percorridas — Tudo no mesmo gráfico.

Aplicar soma à Janela de Tempo

Usar outra estatística à janela de tempo de duas horas:

df['Distancia'].plot()pd.Series(df['Distancia']).rolling(window = 2).sum().plot()

A soma acompanha muito mais a distribuição dos dados do que a própria média.

Indexação

Por fim, podemos utilizar a indexação.

df.index

Acima temos os índices que são as datas que definimos em set_index anteriormente e podemos aplicar o filtro de indexação por Data:

df['2015-11':'2015-12']

range de datas

Plot de um Intervalo

Aplicamos a indexação por Data, indicamos a coluna Distância e finalizamos com um plot. Tudo apenas com um simples comando:

df['2015-11':'2016-1-1']['Distancia'].plot()

Localização de Índice

Podemos fazer a localização de um índice utilizando uma Data específica:

df.loc['2015-8-12']

Distancia 1.42

Tempo 0:10:35

Segundos 635

Minutos 10.5833

Min_Por_Km 7.45305Name: 2015-08-12 00:00:00, dtype: object

Salvar em CSV

Finalmente, quando nosso trabalho for concluído ou der o nosso ponto, podemos salvar nosso DataFrame em um novo arquivo.

df.to_csv('data/dataframe_saved_v2.csv')

Resetar o Índice

Caso seja necessário, podemos resetar o index, veja que a coluna Data volta a ser uma coluna de dados comun e o índice volta a ser uma numeração:

df.reset_index()

Durante a análise exploratória mexemos nos dados, coletamos insights ou a informação que buscamos, depois retornamos os dados ao formato original se desejarmos. O Cientista de Dados é quem decide o que é melhor para o seu processo.

A Slap in the face of pandas

We will cover operations with the majestic Pandas pack. In this first part of the operations roundup, we’ll look at one-dimensional Series operations and finish our studies deepening into DataFrames.

Study notebook - 1

Visit Jupyter Notebook to see the concepts that we will cover about Pandas Indexing. Note: Important functions, outputs, and terms are bold to facilitate understanding — at least mine.

Import libraries

The first step is to import Pandas so we can use the methods and attributes and the NumPy package to create the Arrays:

import sys

import pandas as pd

import numpy as np

print(sys.version)

pd.__version__

3.7.3 # python version

1.2.0 # pandas version

Install a specific library

If at any time it's necessary to install a specific version of Pandas or any other package, use:

!pip uninstall pandas

!pip install pandas == 1.2.0

Operations with Series

Let’s start by creating a series with the NumPy arange function, asking to return 26 elements in this object — array, and then pass that array as a parameter of the Pandas Series function, resulting in a one-dimensional object.

serie1 = pd.Series(np.arange(26)); serie1

0 0

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

dtype: int64

Above, the first column refers to the index and the second column is the data itself.

Object contents

When calling the object index, it will show that we have a range of starting at 0, ending at 26 (exclusive), and jumping from 1 to 1—shows how we created the index for this One-Dimensional Series object.

serie1.index

RangeIndex(start=0, stop=26, step=1)

Series are usually employed when working with time series — the most common application.

Import string package

This package offers two attributes: ascii_lowercase the attribute ascii_uppercase. One feature for lowercase letters and the other for uppercase letters. These two attributes allow us to return all letters of the alphabet.

import string

lcase = string.ascii_lowercase

ucase = string.ascii_uppercase

print(lcase)

abcdefghijklmnopqrstuvwxyz

print(ucase)

ABCDEFGHIJKLMNOPQRSTUVWXYZ

Convert objects

To make our job a little easier, let’s convert these two objects into lists through the list function applied to each object:

lcase = list(lcase)

ucase = list(ucase)

print(lcase)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

print(ucase)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

Determine index

Now let’s indicate that the lcase object will be the index of the series that we created there at the beginning. That is, the lowercase letters of the alphabet will be our index:

serie1.index = lcase; serie1

a 0

b 1

c 2

d 3

e 4

f 5

g 6

h 7

i 8

j 9

k 10

l 11

m 12

n 13

o 14

p 15

q 16

r 17

s 18

t 19

u 20

v 21

w 22

x 23

y 24

z 25

dtype: int64

Slicing Series

Slice allows us to search for data through indexes that are vowels or consonants, or that is between a range of letters, will depend only on our objective:

serie1['f':'r']

f 5

g 6

h 7

i 8

j 9

k 10

l 11

m 12

n 13

o 14

p 15

q 16

r 17

dtype: int64

Therefore, we can change the type of index to what we want. For example, we can use words, date, whatever, and then do the data filter by this indexing or specific item search:

serie1['f']

Applying functions in Series

Here we will call the matplotlib and the command %matplotlib inline to create the graphics within the Notebook Jupyter itself:

import matplotlib.pyplot as plt

%matplotlib inline

Define seed

Use seed to reproduce the same results:

np.random.seed(784)

Create set

Create a random dataset of 10 elements, a NumPy array:

array1 = np.random.randint(1, 30, 10);

array1

array([22, 10, 5, 7, 24, 22, 10, 21, 11, 16, 23])

Create a Pandas Series

Use the array to create a Series in Pandas. But if we already have a list of data in NumPy, why make a list with Pandas?

Pandas offer far more slicing, organization, and manipulation options than NumPy — for that reason converting the array into a Pandas Series.

data = pd.Series(array1); data

0 22

1 10

2 5

3 7

4 24

5 22

6 10

7 21

8 11

9 16

10 23

dtype: int64

Pandas Series

Pandas make our lives much easier by manipulating some objects, such as the series. We can also define the data type with the dtype parameter into pd.Series:

data = pd.Series(array1, dtype = np.float16);

data

0 22.0

1 10.0

2 5.0

3 7.0

4 24.0

5 22.0

6 10.0

7 21.0

8 11.0

9 16.0

10 23.0

dtype: float16

Calculate Statistics

Remembering that the index does not enter into the calculation of statistics, it serves only as an address:

data.mean() # average

14.52

data.median() # central value

14.5

data.mode() # frequent values

0 24.0

data.unique() # unique values

array([22., 10., 5., 7., 24., 21., 11., 16., 23., 13., 17., 6., 1., 15., 25., 2., 3., 14., 29., 12., 8., 19., 28.])

Value count

Count of each singular value of the set:

dados.value_counts()

24.0 4

10.0 3

5.0 3

6.0 2

15.0 2

16.0 2

17.0 2

1.0 2

22.0 2

25.0 2

3.0 2

14.0 2

29.0 2

2.0 1

19.0 1

13.0 1

23.0 1

11.0 1

21.0 1

7.0 1

12.0 1

8.0 1

28.0 1

dtype: int64

Full summary

Statistical summary of the collection:

dados.describe()

count 40.000000

mean 14.523438

std 8.515625

min 1.000000

25% 6.750000

50% 14.500000

75% 22.250000

max 29.000000

dtype: float64

Pandas Histogram

We can also create a histogram from the dataset, that is, the series in Python Pandas, offers us several possibilities for processing one-dimensional data:

dados.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x114fdf780>

Study notebook — 2

Let’s continue studying the one-dimensional Series in Pandas, which form the basis of Real-Time Data Analysis. In this other Jupyter Notebook, we will cover the missing values.

First, let’s see a subtle difference between Series and DataFrames, but that makes all the difference — this can be decisive when working with these types of objects in the data analysis process.

Create NumPy array

We will create an array of numeric values in NumPy, determining the elements and mainly including missing NaN values:

array1 = np.array([1, 2, 3, np.nan]);

array1

array([1., 2., 3., nan])

However, when performing mathematical operations with this array1, we have to look at the missing values!

array1.mean()

nan

NumPy tried to do the computation by applying the average to the set but came across the NaN element that can’t handle. Hence, the final average returns as NaN — it’s the alternative chosen by NumPy developers, signaling that there’s at least “something wrong.”

Create Pandas array

Now let’s create an array in Pandas — Series. This Series will contain the same list of elements that we used to create the array in NumPy:

serie2 = pd.Series([1, 2, 3, np.nan]); serie2

0 1.0

1 2.0

2 3.0

3 NaN

dtype: float64

As we can see above, we have the indexes and their respective elements specified in the function.

Average

We will try to calculate the average of this object which, in NumPy, returned NaN:

serie2.mean()

2.0

In Pandas, were summed up all available numeric elements — returning the average calculation this time.

Concatenating the Series

Use the seed so that we can reproduce the same examples:

np.random.seed(567)]

Creating series 1

Calling the rand function of the NumPy random module — will generate a list of values and that will be passed as a parameter of the Pandas Series function:

serie3 = pd.Series(np.random.rand(5)); serie3

0 0.304782

1 0.953033

2 0.964709

3 0.343762

4 0.993886

dtype: float64

Creating series 2

We will use the same procedure to create another Series:

serie4 = pd.Series(np.random.rand(5)); serie4

0 0.302074

1 0.876231

2 0.705643

3 0.681150

4 0.548266

dtype: float64

Concatenating Two Series

Now let’s concatenate two series, moving to concat function, as a parameter, a list of series:

combo = pd.concat([serie3, serie4]); combo

0 0.304782

1 0.953033

2 0.964709

3 0.343762

4 0.993886

0 0.302074

1 0.876231

2 0.705643

3 0.681150

4 0.548266

dtype: float64

Pay attention to indexing when concatenating the two series in the combo object without adjusting any parameters. That is, the indexes should represent a unique address for each element of the dataset.

At this point, if we want to look for something in index 0, we will return two values.

combo[0]

0 0.304782

0 0.302074

dtype: float64

Count elements on the object

We can solve this by using the count() function in the combo object, where it will count how many elements there are in the combo object created, a listing.

This count will be passed as a parameter to the range function and will finally be written as the index of the combo object:

combo.index = range(combo.count()); combo

0 0.304782

1 0.953033

2 0.964709

3 0.343762

4 0.993886

5 0.302074

6 0.876231

7 0.705643

8 0.681150

9 0.548266

dtype: float64

Reindex

Another exciting option is to use the reindex function directly on the combo object. Only specific elements of the dataset will be reindexed in the reindex function.

We can see that we are reindexing indexes 0, 2, 16, and 21. However, in the combo object, there are no indexes 16 and 21 — nan will return. It is because only index 0 and 2 are on the combo object.

combo.reindex([0, 2, 16, 21])

0 0.304782

2 0.964709

16 NaN

21 NaN

dtype: float64

Fill values

If you do not want the NaN value because it causes problems, we can use the attribute fill_value = 0 to fill the NaN values with 0.

combo.reindex([0, 2, 16, 21], fill_value = 0)

0 0.304782

2 0.964709

16 0.000000

21 0.000000

dtype: float64

Other methods for NaN

Let’s create combo again, using reindex, and we will apply new methods to missing values:

new_combo = combo.reindex([0, 2, 16, 21]); new_combo

0 0.304782

2 0.964709

16 NaN

21 NaN

dtype: float64

ffil

This ffill method means forward fill —fill the next ones. So when we run the fill, we can see that we took the last valid element that was filled and replicated to the following.

new_combo.ffill()

0 0.304782

2 0.964709

16 0.964709

21 0.964709

dtype: float64

bfill

This bfill method means backward fill —fill the previous ones. So when we run the bfill, we can see that we took the first valid element that was filled and replicated to the previous ones.

new_combo.bfill()

0 0.304782

2 0.964709

16 NaN

21 NaN

dtype: float64

fillna

The fillna will fill the indexes with the value specified in the function:

new_combo.fillna(12)

0 0.304782

2 0.964709

16 12.000000

21 12.000000

dtype: float64

Operations with Series — Different Indexes

Another advantage of the Panda Series is that we can do operations directly with the Series.

Creating Series

Here below, we will create two Series s1 and s2:

s1 = pd.Series(np.random.randn(5)); s1

0 0.724774

1 0.124588

2 1.023399

3 2.221557

4 1.130909

dtype: float64

s2 = pd.Series(np.random.randn(5)); s2

0 -0.651538

1 -1.179986

2 0.437101

3 -0.607102

4 1.310954

dtype: float6

As we already know, Series are one-dimensional datasets, we have an index on the left side, and we have the data in the right-hand column.

Summing Series

The first element of s1 is added to the first element of s2, so on. Therefore, the indexes of each series must match:

s1 + s2

0 0.073236

1 -1.055398

2 1.460500

3 1.614455

4 2.441862

dtype: float64

What happens when the indexes do not match?

Change index

Let’s change the index of the s2 series. First, we create a range of values 3 through 8, convert it to a list, and place it as the index of s2:

s2.index = list(range(3,8)); s2

3 -0.651538

4 -1.179986

5 0.437101

6 -0.607102

7 1.310954

dtype: float64

Note that now, the s2 (3–7) indices are different from the s1(0–4) indexes

When we try to add up again, we have NaN values. During the data manipulation process, a data scientist can generate NaN values.

s1 + s2

0 NaN

1 NaN

2 NaN

3 1.570019

4 -0.049077

5 NaN

6 NaN

7 NaN

dtype: float64

Fill in values

When a relationship between the indexes is not found, Pandas fills in with the NaN value. To solve the nan generation problem, we can fill in with specific values.

Below, we will do the reindexing of s1 and the reindexing of s2 and fill nan values with 0. It is a possible alternative to solving the problem

s1.reindex(range(10), fill_value = 0) + s2.reindex(range(10), fill_value = 0)

0 0.724774

1 0.124588

2 1.023399

3 1.570019

4 -0.049077

5 0.437101

6 -0.607102

7 1.310954

8 0.000000

9 0.000000

dtype: float64

Creating Series

We are creating another series of range from 1 to 4 and indexes a, a, and c:

s1 = pd.Series(range(1, 4), index = ['a','a','c']); s1

a 1

a 2

c 3

dtype: int64

Creating s2 as the same set, changing only the indexes:

s2 = pd.Series(range(1,4),index = ['a','a','b']); s2

a 1

a 2

b 3

dtype: int64

Multiplying Series

The rule is the same. If a match is not found between the indexes, the NaN value is filled in.

s1 * s2

a 1.0

a 2.0

a 4.0

b NaN

c NaN

dtype: float64

Copying Series

Let’s make a copy of our series using the copy method and copy s1 into the s1_copy:

s1_copy = s1.copy()

Then, we can directly change the copy object:

s1_copy['a'] = 3; s1_copy

a 3

c 3

dtype: int64

The original object remains intact:

a 1

a 2

c 3

dtype: int64

Pandas map() function

Here we will use the map function of the Pandas package that will be applied directly to object s1, the Pandas series we created.

s1.map(lambda x: x ** 2) # raise each value squared

a 1

a 4

c 9

dtype: int64

s1.map({1:2,2:3,3:12}) # replace values

a 2

a 3

c 12

dtype: int64

s1.map({2:3,3:12}) # replace values

a NaN

a 3.0

c 12.0

dtype: float64

Study Notebook — 3

Visit the Jupyter Notebook to see the concepts we will cover about Pandas Indexing, where Pandas can show all their value. Note: Important functions, outputs, and terms are bold to facilitate understanding — at least mine.

Indexing offers several possibilities for us to organize, manipulate, modify the data according to the project's needs.

We will work with DataFrames, tables, and spreadsheets, which we can bring to the Python language, put in an object of type DataFrame, and then manipulate in various ways.

Import packages

import sys

import string

import pandas as pd

import numpy as np

print(sys.version)

Install library-specific version

!pip uninstall pandas

!pip install pandas == 1.2.0

pd.__version__

'1.2.0'

Create dictionary

A dictionary is a set of key: value pairs. We have the legend before the colon, in this case, a series of strings, and after the colon, a list of numbers:

dict1 = {

'Coluna1': [190, 231, 784, 127],

'Coluna2': [545, 278, 104, 347],

'Coluna3': [665, 224, 901, 503],

'Coluna4': [123, 456, 789, 763]}

dict1

Adicionar texto alternativo

Convert to DataFrame

We will pass the dict1 dictionary as a parameter to create a DataFrame with Pandas:

Pandas already understand what a dictionary is by converting the dictionary into a table. Everything key became a column. Everything worth filled the columns.

df = pd.DataFrame(dict1); df

Data types

Fetch the data types we have by calling the dtypes attribute:

df.dtypes

Coluna1 int64

Coluna2 int64

Coluna3 int64

Coluna4 int64

dtype: object

Return Columns

Use column headings as the index to return column data:

df['Coluna2']

0 545

1 278

2 104

3 347

Name: Coluna2, dtype: int64

Combine column and row index

Return a specific element within a column:

df['Coluna2'][0]

545

Location by index — iloc

Data search by numeric index, in this case, position 0 — first row and all its corresponding columns:

df.iloc[0]

Coluna1 190

Coluna2 545

Coluna3 665

Coluna4 123

Name: 0, dtype: int64

Slicing

Return everything until index 2 — unique.

df[:2]

Adicionar texto alternativo

Index structure

Shows the beginning, end, and compass of the index:

df.index

RangeIndex(start=0, stop=4, step=1)

Insert a column to DataFrame

We can insert one more column into this DataFrame. We call the insert method, at position 0, the id name column, with values 1 through 4.

df.insert(0, 'ID', [1, 2, 3, 4]); df

Set up index

We call df — our DataFrame and indicate that ID will be our index, using the set_index:

df2 = pd.DataFrame(df.set_index('ID')); df2

Manipulating and Organizing DataFrames

Let’s create a list of values and pass them on to the columns object.

columns = list(range(26))

Comprehension List

Translating: (for) each element x (in) list of uppercase (string.ascii_uppercase), print element x:

ucase = [x for x in string.ascii_uppercase]

Translating: (for) each element x (in) list of lowercase (string.ascii_lowercase), print element x:

lcase = [x for x in string.ascii_lowercase]

Create DataFrame

Let’s create a DataFrame and pass a list of objects:

alphabet = pd.DataFrame([lcase, ucase, columns]); alphabet

Transposing the Matrix

View data from another perspective

alphabet = alphabet.T; alphabet.head()

Index

Call the index attribute to view the indexes of this table:

alphabet.index

RangeIndex(start=0, stop=26, step=1)

Columns

View columns with the columns feature:

alphabet.columns

RangeIndex(start=0, stop=3, step=1)

Change the column’s

The title at this instant is like 0, 1, and 2 — let’s specify the strings:

alphabet.columns = ['lowercase', 'uppercase', 'number']; alphabet

Adicionar texto alternativo

Now that we have labels for each column, we can search for data:

alphabet.lowercase

0 a

1 b

2 c

3 d

4 e

5 f

6 g

7 h

8 i

9 j

10 k

11 l

12 m

13 n

14 o

15 p

16 q

17 r

18 s

19 t

20 u

21 v

22 w

23 x

24 y

25 z

Name: lowecase, dtype: object

Slicing notation:

alphabet['lowercase']

0 a

1 b

2 c

3 d

4 e

5 f

6 g

7 h

8 i

9 j

10 k

11 l

12 m

13 n

14 o

15 p

16 q

17 r

18 s

19 t

20 u

21 v

22 w

23 x

24 y

25 z

Name: lowercase, dtype: object

Date as the Index

We can also put as an index a set of dates — we generate the dates with the date_range:

alphabet.index = pd.date_range('7/1/2019', periods = 26); alphabet

Date Range

Once we have the index as data, we can use the dates to select a range through indexing.

alfphabet['2019-07-12':'2019-07-19']

Create DataFrame

Let’s create a DataFrame. But, first, let’s call the DataFrame function of Pandas and pass a dictionary as a parameter:

letters = pd.DataFrame({'lowercase':lcase, 'uppercase':ucase})

letras.head()

Add a column to DataFrame.

Add column using another notation. For example, let’s generate numbers with randint, values from 1 to 5–26 elements:

np.random.seed(25)

letters['numbers'] = np.random.randint(1, 50, 26);

letters

Check Data Types

letters.dtypes

lowercase object

uppercase object

Numeros int64

dtype: object

Turn index into letters

letters.index = lcase;

letters

Increasing Order DataFrame

We can organize the DataFrame by one of the columns — Arranging DataFrame in ascending order with the sort_values:

letters.sort_values('numbers')

Sort by index

In this case, the Dataframe will be ordered by the index value:

letters.sort_index()

List specific columns

We can list specific columns of this DataFrame, and we want to return only two columns:

lettes[['lowecase','upperscase']].head()

iloc

Finally, we can make an iloc — indicate the range. Although the index is in letter format, internally, the Python language interprets the position of numeric values:

letras.iloc[5:10]

Pandas are an excellent tool. We’ve done so far several data organization and manipulation activities with Series or DataFames Pandas or NumPy Arrays — always focusing on an object. If we want to focus on more concurrent objects, we should consider some SQL techniques — it will stay for another time.

And there we have it. I hope you have found this helpful. Thank you for reading. 🐼

Leonardo Anello

Pandas dataframe filters

Pandas is a popular library of Data Analysis and Manipulation for Python. The central data structure of Pandas is Series and DataFrames, which stores tabular-shaped data with labeled rows and columns.

A standard operation in data analysis is to filter values based on one condition or multiple conditions. Pandas provide a variety of ways to filter data points (i.e., rows). In this article, we'll cover eight different ways to filter a DataFrame.

Filters

Logical operators
Multiple logical operators
isin()
String accessories
(~) does not belong
Queries
nlargest & nsmallest
loc & iloc

Notebook

See the Notebook for the concepts we’ll cover on building machine learning models and LinkedIn for other Data Science and Machine Learning tutorials.

Importing libraries

import pandas as pd

import numpy as np

Creating an aleatory DataFrame for Analysis

df = pd.DataFrame({

'name':['Jane','John','Ashley','Mike','Emily','Jack','Catlin'],

'ctg':['A','A','C','B','B','C','B'],

'val':np.random.random(7).round(2),

'val2':np.random.randint(1, 10, size=7)})

Using logical operators

We can use logical operators in column values to filter rows. For example, we select rows where the value in the "val" column is greater than 0.5:

df[df.val > 0.5]

Logical operators also work on strings

Only the names that come after 'Jane' in alphabetical order are returned:

df[df.name > 'Jane']

Multiple logical operators

Pandas allow combining multiple logical operators. For example, we can apply conditions in the val and val2 columns as below:

df[df.val > 05.) & df.val2 == 1)]

isin

The isin method is another way to apply multiple conditions for filtering. For example, we can filter the names that exist in a given list.

names = ['John', 'Catlin', 'Mike']

df[df.name.isnin(names)]

Str Accessory

Pandas is a highly efficient library for textual data as well. The functions and methods under the str attachment provide flexible ways to filter strings-based lines. For example, we can select the names that start with the letter "J”:

df[df.name.str.startswith('J')]

The function contains under the str attachment returns the values that have a specific set of characters. We can pass a more extended set of characters to the function, depending on the strings in the data:

df[df.name.str.contains('y')]

Tilde (~)

The tilde operator is used for "no" logic in filtering. If we add the tilde operator before the filter expression, the lines that do not fit the condition are returned.

df[~df.name.str.startswith('J')]

Query

The query function offers a little more flexibility when writing the conditions for filtering. For example, we can pass the requirements like a rope. For example, the following code returns rows that belong to category B and have a value greater than 0.5 in the val column:

df.query('ctg == "B" and val > 0.5')

Nlarger or nsmallest

We don't have a specific range for filtering in some cases, but we need the highest or smallest values. So, the nlarger and nsmallest functions allow you to select rows with the largest or smallest values in a column, respectively.

df.nlargest(3, 'val')

We specify the number of major or more minor values to select and the column name.

df.nsmallest(2, 'val2')

Loc and iloc

The loc and iloc methods are used to select rows or columns based on index or label.

loc: Select rows or columns using labels
iloc: Select rows or columns using indexes

Thus, we can use them for filtering. However, we can only select a specific part of the DataFrame without specifying a condition.

df.iloc[3:5, :

If the DataFrame has an entire index, the indexes and labels of the rows are the same. Thus, both loc and iloc performed the same thing in the ranks.

df.loc[3:5, :]

Let's better update the DataFrame index to better demonstrate the difference between loc and iloc.

df.index = ['a','b','c','d','e','f','g'] df

We can't pass integers to the loc method now because the labels of the indexes are letters.

Conclusion

We cover eight different ways to filter rows on a DataFrame, and all of them are helpful for specific cases. Pandas is a powerful library for data analysis and manipulation. It provides numerous functions and methods for handling tabular-shaped data. However, as in any other tool, the best way to learn Pandas is through practice.

And there we have it. I hope you have found this helpful. Thank you for reading. 🐼

Operations with numpy arrays

Here we’ll do some operations with arrays through the NumPy library and take the opportunity to compare NumPy with List Comprehensions and Lambda Functions and even see the difference in performance between the different kinds of operations.

Study notebook — 1

Go to Jupyter Notebook to see the concepts that will be covered about Operations with Arrays in NumPy. Note: Important functions, outputs, and terms are bold to facilitate understanding — at least mine.

View versions

Let’s start by viewing the Python version and the NumPy package version. This is for almost any package — always check the version.

import sys

import numpy as np

print(sys.version)

np.__version__

3.8.1 # Python version

'1.19.5' # NumPy version

Create an array in NumPy

Now we create an array with the arange function of numpy and not range built-in. The detail is subtle.

array1 = np.arange(15); array1

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

Help?

When we have any questions about any object in Python, can we call help using? And the name of the object.

?array1

A complete help of that object is opened — the full description of what the object is specifically, how to use it, parameters that we can use, attributes and methods, other packages capable of generating that same type of object, etc.…

Mathematical Methods directly in arrays

Once we have the array created with NumPy, we can call the mathematical methods or any methods with. + TAB.

Average

array1.mean()

7.0

Sum

array1.sum()

105

Minimum value

array1.min()

Maximum value

array1.max()

Standard-deviation

array1.std()

4.320493798938574

Using List Comprehension in NumPy Arrays

Let’s use the NumPy array1 in the list comprehension the same way we would use any other type of array.

Translating: for (for) each value of x within (in) array1, multiply the value of x by itself:

[x * x for x in array1]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196]

Translating: for (for) each value of x within (in) array1, if (if) the value of x divisible by two results in zero, return the value of x multiplied by itself — that is, the operation only with even numbers.

[x * x for x in array1 if x % 2 == 0]

[0, 4, 16, 36, 64, 100, 144, 196]

Therefore, when working with NumPy, remember that you can also work with List Comprehension.

Using Lambda Function in NumPy Arrays

Here we call Lambda, where it will return the value of x only when the division of x by 2 is == 0, for each element of array1 NumPy — then apply the filter to print the numerical results in the list:

list(filter(lambda x: x % 2 == 0, array1))

[0, 2, 4, 6, 8, 10, 12, 14]

This that we did above can be further simplified:

array1 % 2 == 0

array([ True, False, True, False, True, False, True, False, True, False, True, False, True, False, True])

We’re using Array1 (NumPy), asking for the rest of the division for 2 == 0 — we return exactly True| False; that is, with this simple notation, we were able to do the same with the map function, generating True| False for each value of array1 that met that condition.

♠ Slicing notation — Powerful! ♠

What if we put the above condition as a slicing notation within array1?

array1[array1 % 2 == 0]

array([0, 2, 4, 6, 8, 10, 12, 14])

NumPy opens up a sea of possibilities. We have been able to replace everything that has been done so far with this simple notation of slicing. This will work when we’re working with NumPy — very important for Data Science when working with Python.

Evaluating Performance

We will use the %, timeit operator. It belongs, specifically, to the Jupyter Notebook. It allows you to measure the execution time of a command.

Translating: For (for) each value of x in array1, return if (if) x divisible by 2 results in 0, that is, x for even. And, of course, calculate the time of the operation.

%timeit [x for x in array1 if x % 2 == 0]

6.03 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It took 6.03 microseconds with the list comprehension in Pure Python, and already with the notation of slicing, the time gets to be 3x faster with numpy slicing.

%timeit array1[array1 % 2 == 0]

2.16 µs ± 65.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using Logical Operators

We can also use logical operators for operations with numpy arrays.

Ex.1

Array1 is greater than 8? Then, a check is made for each element of the array:

array1 > 8

array([False, False, False, False, False, False, False, False, False, True, True, True, True, True, True])

Ex.2

This notation can be placed within the index, returns the values that meet the condition — not True| False:

array1[array1 > 8]

array([9, 10, 11, 12, 13, 14])

Ex.3

We can use a logical operator and. We concatenate two logical operations that must be true:

(array1 > 9) & (array1 < 12)

array([False, False, False, False, False, False, False, False, False, False, True, True, False, False, False])

Ex.4

We can use a logical operator or. We concatenate two logical operations that at least one must be true:

(array1 > 13) | (array1 < 12)

array([ True, True, True, True, True, True, True, True, True, True, True, True, False, False, True])

Ex.5

We can also place the logical operator or within the slicing notation:

array1[(array1 > 13) | (array1 < 12)]

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14])

Ex.6

We can create a NumPy array with list comprehension:

array2 = np.array([x ** 3 for x in range(15)]; array2

array([0, 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000,

1331, 1728, 2197, 2744])

Ex.7

Array NumPy with list comprehension, returning true for each value of x in the range:

array2 = np.array([True for x in range(15)]); array2

array([ True, True, True, True, True, True, True, True, True, True, True, True, True, True, True])

Ex.8

Use this same array2 above as the index for array1.

array1[array2]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

Study notebook — 2

Let’s continue from here with another Jupyter Notebook to see the concepts covered about concatenation, join, and split arrays in NumPy.

We’re talking about the split-apply-combine technique, where we can split an array, apply a technique, an analysis, or some calculation to parts of this array, and then combine the results — this is very useful in various situations in data manipulation and organization.

Concatenating Arrays

First, we will create array1 with the one's function, which creates an array filled with values 1:

array1 = np.ones(4)

We now create array2, with another function, arange — elements up to 15 positions:

array2 = np.arange(15)

Next, we call the concatenate function, which belongs to NumPy, to concatenate between array1 and array2:

array_conc = np.concatenate((array1, array2)); array_conc

array([1., 1., 1., 1., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14.])

Above, the first part of the array output can quickly identify the set of 1 generated by array1 in np. ones and the other part of array2, from 0 to 14.

Joining Arrays

Let’s create two arrays, a and b:

a = np.ones((3,3))

b = np.zeros((3,3))

print(a)

[[1. 1. 1.]

[1. 1. 1.]

[1. 1. 1.]]

print(b)

[[0. 0. 0.]

[0. 0. 0.]

[0. 0. 0.]]

vstack & hstack

Let’s apply the vstack and hstack functions. If we don’t know anything about this function or anything else, we call help!

# stack arrays in vertical sequence

?np.vstack

# stack of arrays in horizontal sequence

?np.hstack

vstack function

Fill snares towards the lines:

np.vstack((a,b))

array([[1., 1., 1.],

[1., 1., 1.],

[0., 0., 0.],

[0., 0., 0.]])

hstack function

Fills in the direction of the columns:

np.hstack((a,b))

array([[1., 1., 1., 0., 0., 0.],

[1., 1., 1., 0., 0., 0.],

[1., 1., 1., 0., 0., 0.]])

Creating more arrays

We’ll create more arrays to run other stack examples:

a = np.array([0, 1, 2])

b = np.array([3, 4, 5])

c = np.array([6, 7, 8])

column_stack function

This function stacks — stacks one-dimensional arrays in the direction of columns — vertically forming a two-dimensional array:

np.column_stack((a, b, c))

array([[0, 3, 6],

[1, 4, 7],

[2, 5, 8]])

Split Arrays

Below we use the arange function to create an array of 16 elements and apply the reshape to make the one-dimensional array two-dimensional:

array3 = np.arange(16).reshape((4,4)); array3

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11],

[12, 13, 14, 15]])

hsplit function

This function splits horizontally, from array3, passing parameter 2 — we divide array3 into two parts [p1 and p2].

[array3_p1, array3_p2] = np.hsplit(array3, 2)

array3_p1

array([[ 0, 1],

[ 4, 5],

[ 8, 9],

[12, 13]])

array3_p2

array([[ 2, 3],

[ 6, 7],

[10, 11],

[14, 15]])

vsplit function

This function does a vertical-level split from array3, passing parameter 2 — we divide array3 into two parts [p1 and p2].

[array3_p1, array3_p2] = np.vsplit(array3, 2)

array3_p1

array([[0, 1, 2, 3],

[4, 5, 6, 7]])

array3_p2

array([[8, 9, 10, 11],

[12, 13, 14, 15]])

Record and Load data with NumPy

We create the data object from array3:

data = array3

.save

use the save function — save, it is useful for saving the array in the operating system at the end of the process. You don’t have to redo the process next time:

np.save('data_saved_v1', data)

.load

use the load function — load:

loaded_data = np.load('data_saved_v1.npy'); loaded_data

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11],

[12, 13, 14, 15]])

This material is beneficial and comfortably serves as a necessary to manipulate the NumPy library. And there we have it. I hope you have found this helpful. Thank you for reading. 🐼

Leonardo Anello

the Power of Dynamic Tables in Excel

If you already work with data, you’ve probably faced a series of challenges in analyzing it efficiently.

The amount of information can be enormous, and extracting useful insights can be difficult. That’s where Pivot Table in Excel comes in as a powerful tool to help you analyze large sets of data quickly and efficiently.

A pivot table in Excel allows you to turn raw data into valuable and easily accessible information. With it, you can group, summarize, compare, and visualize information in various ways, allowing you to make decisions based on accurate insights.

In this article, we will discuss the importance of pivot tables in Excel and how they can be used to improve your data analysis.

We will also explain how to create a pivot table and how to use it to achieve the desired results.

Why is the pivot table in Excel important?

The pivot table in Excel is important because it allows you to analyze large amounts of data quickly and efficiently.

With it, you can group information in various ways, which allows for a more detailed and accurate analysis. In addition, pivot tables in Excel are highly interactive, which means you can explore your data from different perspectives and gain valuable insights.

Another advantage of the pivot table in Excel is the ability to create charts and dashboards. With these, you can visualize information in a clear and intuitive way, making it easier to make decisions.

How to create a pivot table in Excel?

Creating a pivot table in Excel is simple and can be done in a few easy steps. Before you begin, it’s important to have in mind the goal of your data analysis. This will help you choose the information you want to include in your pivot table and how you want to organize it.

Step 1: Preparing your Data

Before creating a pivot table, we need to have our data organized and ready for use. To do this, it is important that the data is in a separate spreadsheet and that each column has a descriptive title. In addition, it is important that there are no blank rows or empty cells in the middle of the data.

Step 2: Selecting your Data

After having our data organized, we need to select it to create the pivot table. To do this, simply click on a cell within the data and then press Ctrl + Shift + Arrow to select all the data or, even better, use the shortcut Ctrl + T to select all the data.

Next, you can use the keyboard shortcuts Alt + T + V + T and select the data range to turn it into a table.

Valuable Tip

Before inserting a pivot table, it is important to format the data as a table to ensure that any updates in the data are automatically updated in the pivot table.

To do this, simply select all the data (Ctrl + T), then press Alt + T + T + A.

After formatting as a table, it is recommended to assign a name in the Table Design tab for better organization.

Step 3: Setting up the Pivot Table

In the window that appears, click “OK”. This will open the “Pivot Table” window with several options to configure your table. Here, you can choose the type of data you want to group, the columns that will be used as rows or columns, and how the data will be presented.

Step 4: Adding Fields to the Report

After configuring your pivot table, it’s time to add fields to your report. To do this, simply drag the desired fields from the “Fields” area to the “Rows”, “Columns”, or “Values” areas.

Step 5: Adjusting the Layout

Here, you can choose how the data is presented and grouped, allowing you to perform a more detailed analysis of the data.

A pivot table allows you to analyze data to determine the best salesperson, the most sold product by each salesperson, the type of payment each salesperson most uses, and even compare performance in specific months.

After adding the fields to the report, you can adjust the layout of your pivot table. This includes adjusting colors, formats, subtotals and other formatting options you desire.

You can also add filters with data segmentation through Alt + T + EN to only display relevant data for your analysis.

With these simple steps, you can create a dynamic table in Excel for a more efficient data analysis. Additionally, with the customization and formatting options, you can make your dynamic table even more attractive and easy to understand.

Learning how to create a dynamic table in Excel is an important skill for anyone who works with data analysis and needs a more efficient way to visualize and understand their data.

Thank you for taking your time to read it.

Leonardo Anello

in/anello92

Page updated

Google Sites

Report abuse