Data Preprocessing

Data Collection

Data Source Description

One of the best-known sources of historical Formula 1 race results is the Ergast F1 API. Its extensive dataset and RESTful interface give researchers and developers access to Formula 1 statistics and records across many seasons. The data includes comprehensive information on races, lap times, driver and constructor standings, and more, which makes it useful for any data-driven Formula 1 project.

Data source link (Ergast F1 API): http://ergast.com/mrd/

Example API Endpoints

http://ergast.com/api/f1/2023/5/results (race results)
http://ergast.com/api/f1/2023/drivers (driver information)
http://ergast.com/api/f1/2023/constructors (constructor information)
http://ergast.com/api/f1/2023/5/pitstops (pit stop information)
http://ergast.com/api/f1/2023/circuits (circuit information)
http://ergast.com/api/f1/2023/status (race finish status information)
http://ergast.com/api/f1/2023/5/qualifying (qualifying results)

Dataset Description

F1_race_results dataset

The dataset contains comprehensive results from Formula 1 races. In addition to the driver's ID and code, there are columns for the finishing position, points, grid position, laps completed, and status at the finish, plus a URL for more information. Further columns give the race's name, date, and season, which allows results to be examined across years and races. Fastest-lap details are missing from the first rows inspected, but the dataset does include columns for them. This dataset provides a solid basis for studying trends, performances, and the development of teams and drivers across the seasons.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/results.json?limit=1000

Here, the season ranges from 2000 to 2023.
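The per-season collection described above can be sketched as follows. This is a minimal illustration, not the code used for the project: the helper names (results_url, season_urls, fetch_json) are hypothetical, only the standard library is used, and it assumes the endpoint is reachable and returns JSON in the usual Ergast layout.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://ergast.com/api/f1"


def results_url(season, limit=1000):
    """Build the race results URL for one season (hypothetical helper)."""
    return f"{BASE_URL}/{season}/results.json?limit={limit}"


def season_urls(first=2000, last=2023):
    """Return one results URL per season, both bounds inclusive."""
    return [results_url(season) for season in range(first, last + 1)]


def fetch_json(url, timeout=30):
    """Download a URL and decode the JSON body. Needs network access."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URLs needs no network; fetching would be done per URL,
# for example: payload = fetch_json(season_urls()[0])
urls = season_urls()
```

Keeping URL construction separate from the download makes it possible to check the request list before any network traffic is generated.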

F1_drivers dataset

The dataset covers every Formula 1 driver who competed in the championships from 2000 to 2023. It includes each driver's ID, full name (including middle name and any additional surnames), birthdate, nationality, and season(s) of participation. It also includes a link to each driver's Wikipedia page, their driver code, and their permanent racing number (if available). The dataset supports analysis of driver careers, performance comparisons between periods, and the demographic makeup of Formula 1 drivers throughout the given period.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/drivers.json?limit=1000

F1_constructors dataset

This dataset contains detailed information on the Formula 1 constructors that competed in the championships between 2000 and 2023. Each entry belongs to a specific season and carries metadata such as the namespace, series designation, and a URL to the Ergast API for comprehensive constructor data. The dataset also records the API query limit and offset as well as the total count of constructors for that season. The "MRData.ConstructorTable.Constructors" column holds a JSON-like string whose per-team information may include constructor IDs, links to Wikipedia pages, and other pertinent details. The dataset is a useful resource for studying the development of Formula 1 teams, their influence on the sport, and its competitive dynamics over the past twenty years.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/constructors.json?limit=1000
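Since the constructor details sit inside a JSON-like string, they have to be parsed before they can be analyzed. The sketch below shows one way to do that. It assumes the string is either valid JSON or a Python-literal rendering (single-quoted keys and strings); the function names and the sample cell are illustrative and not taken from the project.

```python
import ast
import json


def parse_nested(cell):
    """Parse a JSON-like string into Python objects."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Assumption: non-JSON cells use Python literal syntax instead.
        return ast.literal_eval(cell)


def flatten_constructors(season, cell):
    """Return one flat record per constructor, tagged with its season."""
    return [{"season": season, **constructor} for constructor in parse_nested(cell)]


# Illustrative cell; a real cell would hold every constructor of the season.
sample_cell = "[{'constructorId': 'ferrari', 'name': 'Ferrari', 'nationality': 'Italian'}]"
rows = flatten_constructors(2023, sample_cell)
```

The same approach would apply to the other nested columns described on this page, such as "MRData.CircuitTable.Circuits" and "MRData.StatusTable.Status".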

F1_driver_standings dataset

The dataset details the driver standings for the Formula 1 seasons from 2000 to 2023, providing a structured summary of championship outcomes. Each row holds the metadata for one season, including the namespace, series identity, and a URL to the Ergast API, along with the limit, offset, and total number of entries. The actual standings data is stored in the "MRData.StandingsTable.StandingsLists" column in a JSON-like format. This column appears to hold comprehensive information on the driver rankings for each season, including driver IDs, points, positions, and possibly their teams. Across the more than twenty seasons covered, this dataset is a valuable tool for studying patterns in driver performance, team dominance, and the development of competitiveness.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/driverStandings.json?limit=1000

F1_constructor_standings dataset

The dataset encapsulates the competitive scene of the teams throughout two decades, chronicling the rankings of Formula 1 constructors for each season from 2000 to 2023. With metadata like the namespace of the Ergast API, the series designation, and a URL to view full standings through the API, each item in the dataset signifies a season. For every season, the "MRData.StandingsTable.StandingsLists" column in the dataset contains a JSON-like string with comprehensive standings information, the offset, the maximum number of records, and the total number of constructor entries. In the high-stakes world of Formula 1 racing, this comprehensive dataset is essential for studying team performance trends, their dominance or challenges over the years, and the shifting dynamics of the constructors' championship.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/constructorStandings.json?limit=1000

F1_circuits dataset

The dataset contains details about the venues that have staged Formula 1 races from 2000 to 2023, including information about the circuits utilized in the championships throughout that time. With metadata such as namespace and series identification as well as a direct URL to the Ergast API for accessing comprehensive circuit information, each entry in the dataset corresponds to a given season. There is a JSON-like string in the "MRData.CircuitTable.Circuits" column that probably contains circuit IDs, URLs for more information, and maybe other pertinent details about each circuit. The dataset also specifies the limit and offset for query results and the overall number of circuits in a season. Formula One racing has been a fascinating story for over twenty years, and this dataset is a great way to learn about the many different tracks that have played a part in it.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/circuits.json?limit=1000

F1_race_status dataset

The dataset offers a detailed account of the various statuses assigned to drivers and their outcomes in Formula 1 races spanning from 2000 to 2023. Each row in the dataset is tied to a specific season and includes metadata such as the namespace, series information, and a link to the Ergast API for accessing detailed status data. The "MRData.StatusTable.Status" column contains a JSON-like string with data including status IDs, the count of occurrences, and the status descriptions themselves, which likely range from 'Finished', 'Retired', to various technical issues that led to a driver's race ending. This dataset is crucial for analyzing the reliability of cars, the frequency of certain types of race incidents, and how these factors have evolved over the years in the dynamic environment of Formula 1 racing.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/status.json?limit=1000

F1_pitstops dataset

The dataset records the pit stops made during Formula 1 races. Although the collection window is 2000 to 2023, the first entries begin with the 2011 season. Each record includes the driver ID, lap number, time of day of the pit stop, time spent in the pits (in seconds), and the stop number for that driver. The dataset also includes the season, race name, and date of the event. This data supports analysis of team pit stop strategy, its impact on race results, and the development of more efficient pit stops over time.

The API endpoint utilized to extract the information for the given dataset is given by:

http://ergast.com/api/f1/{season}/{round}/pitstops.json?limit=1000
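Unlike the other endpoints, this one takes a round as well as a season, so a full season requires one request per race. The sketch below builds such a URL and converts a duration string to seconds. The helper names are hypothetical, and the handling of "M:SS.sss" values is an assumption about how stops longer than a minute may be formatted.

```python
BASE_URL = "http://ergast.com/api/f1"


def pitstops_url(season, round_number, limit=1000):
    """Build the pit stop URL for one race (hypothetical helper)."""
    return f"{BASE_URL}/{season}/{round_number}/pitstops.json?limit={limit}"


def duration_to_seconds(text):
    """Convert 'SS.sss' or 'M:SS.sss' (assumed formats) to seconds."""
    total = 0.0
    for part in text.split(":"):
        # Each colon-separated part is worth 60 times the part after it.
        total = total * 60 + float(part)
    return total
```

For example, pitstops_url(2023, 5) gives the pit stop URL listed among the example endpoints with ".json?limit=1000" appended.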

Data Cleaning

Data cleaning makes sure the data is accurate, consistent, and usable, and is an important part of getting the dataset ready for analysis. The steps followed for data cleaning are given below:

Eliminating Duplicate Values: Removing duplicate observations ensures that each observation is distinct and that repeated data points do not bias the analysis by carrying extra weight. This step removes rows that are identical across all columns or a chosen subset of columns.
Renaming Columns to Meaningful Names: Default or obscure column names are replaced with short, descriptive ones that reflect each column's content, which makes the dataset easier to read and the analysis easier to follow.
Dropping Irrelevant Columns: Columns that are irrelevant to the analysis or that duplicate other data are identified and removed. Keeping only the information related to the analysis objectives reduces the dataset's size, which improves processing performance and lowers memory use.
Filling Missing Values with Numerical Placeholders: Handling missing data properly is essential to keep the analysis valid and free of bias. Missing values are filled with placeholders (such as the mean, median, or a specific value like -1) so that observations which would otherwise be discarded are preserved, with care taken not to skew the data distribution.
Merging Datasets on Common Attributes: Datasets that share common keys (such as names or IDs) are merged into one larger dataset, which gives a more complete picture of the data and increases the amount of information available for analysis.
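The five steps above can be sketched in plain Python on a list of row dictionaries. This is an illustration under stated assumptions, not the project's cleaning code: the column names, the rename map, the dropped column, and the sample rows are all hypothetical, and -1 is used as the placeholder. With pandas, the same steps would correspond to drop_duplicates, rename, drop, fillna, and merge.

```python
RENAMES = {"Driver.driverId": "driver_id", "grid": "grid_position"}  # hypothetical
DROPPED = {"url"}  # hypothetical irrelevant column
PLACEHOLDER = -1


def clean_results(rows, drivers):
    """Deduplicate, rename, drop, fill, and merge result rows."""
    drivers_by_id = {driver["driver_id"]: driver for driver in drivers}
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # step 1: skip rows identical across all columns
            continue
        seen.add(key)
        # steps 2 and 3: rename columns and drop irrelevant ones
        record = {RENAMES.get(k, k): v for k, v in row.items() if k not in DROPPED}
        # step 4: replace missing values with the numeric placeholder
        record = {k: PLACEHOLDER if v is None else v for k, v in record.items()}
        # step 5: merge driver attributes on the shared driver_id key
        extra = drivers_by_id.get(record["driver_id"], {})
        cleaned.append({**extra, **record})
    return cleaned


sample_rows = [
    {"Driver.driverId": "driver_a", "grid": 2, "points": 15, "url": "x"},
    {"Driver.driverId": "driver_a", "grid": 2, "points": 15, "url": "x"},
    {"Driver.driverId": "driver_b", "grid": None, "points": 10, "url": "y"},
]
sample_drivers = [{"driver_id": "driver_a", "nationality": "Spanish"}]
cleaned_rows = clean_results(sample_rows, sample_drivers)
```

In this sketch a row without a matching driver is kept as it is, which mirrors a left join; an inner join would drop it instead.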

Raw Data from Ergast F1 API

Cleaned Dataset from Ergast F1 API

Removing Null values from the dataset

Data Visualization

The above sunburst chart displays the nationalities of Formula 1 drivers. It reflects both the varied routes that drivers take to reach the top of the sport and the widespread interest in it around the world. Countries with larger segments tend to have robust motorsport cultures, developmental programs, and a substantial talent pool. Insights gained from this diversity can help teams, sponsors, and promoters make better talent scouting and marketing decisions, which in turn can help expand Formula 1's fan base around the world.

The above correlation heatmap visually summarizes several Formula 1 performance metrics. It highlights a moderately negative association between starting grid position and points, indicating that starting nearer the front usually leads to more points scored in a race. In contrast, there is scarcely any association between grid position and fastest lap time, suggesting that a driver's starting position is largely unrelated to setting the fastest lap in a race. The data also shows a weak negative association between points and fastest lap time, but it is not statistically significant.
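Each cell of such a heatmap is a pairwise correlation coefficient. The sketch below assumes the Pearson coefficient (the page does not state which measure was used) and computes it with the standard library. The sample pairs a top-ten grid with the points awarded for finishing in the same order, purely to show the sign of the relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Illustrative data: grid slots 1-10 and points for finishing in that order.
grid = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
points = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]
r = pearson(grid, points)  # negative: a larger grid number means fewer points
```

A negative value here matches the reading of the heatmap: grid position is numbered from the front, so lower numbers go with higher points.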

The above bar plot, labeled "Top 5 F1 Drivers with Most Wins (2021-2023)", gives a concise view of the winning record of the top Formula 1 drivers over three seasons. Max Verstappen leads by a large margin, showing that he was the most successful driver in the sport during this period. Lewis Hamilton comes in second, also with a good number of wins. There is a clear gap between the top two drivers and the next three, as Sergio Pérez, Charles Leclerc, and Carlos Sainz all have fewer victories. In addition to highlighting the achievements of individual drivers, this visualization also provides clues about the teams' performance capabilities during these seasons.
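The ranking behind a plot like this can be derived by counting first-place finishes per driver within the chosen seasons. The sketch below is a minimal illustration: the field names and the sample rows are made up, and positions are converted with int() on the assumption that they may be stored as strings.

```python
from collections import Counter


def top_winners(results, seasons, n=5):
    """Return the n drivers with the most wins as (driver_id, wins) pairs."""
    wins = Counter(
        row["driver_id"]
        for row in results
        if row["season"] in seasons and int(row["position"]) == 1
    )
    return wins.most_common(n)


# Made-up rows for illustration only; not real race results.
sample_results = [
    {"season": 2021, "driver_id": "driver_a", "position": "1"},
    {"season": 2022, "driver_id": "driver_a", "position": "1"},
    {"season": 2022, "driver_id": "driver_b", "position": "2"},
    {"season": 2023, "driver_id": "driver_b", "position": "1"},
    {"season": 2019, "driver_id": "driver_c", "position": "1"},
]
ranking = top_winners(sample_results, range(2021, 2024))
```

The 2019 win is excluded because it falls outside the 2021 to 2023 window, and second-place finishes are never counted.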

Formula 1 teams' win totals over the past twenty years are graphically shown in the above bar plot "Formula 1 Glory: Top Team Victories (2000-2023)". It is clear from looking at the standings that Ferrari, Mercedes, and Red Bull were the most successful teams of that era. Their win totals far exceeded those of their rivals, demonstrating their unwavering commitment to quality and supremacy in the sport. Their color gradient is a visual tribute to their dominance and heritage in Formula 1 racing history, highlighting the magnitude of their victories.

"Total Pit Stops vs. Average Pit Stop Duration by Efficiency (2000 to 2023)" is a scatter plot that compares the overall number of pit stops with the average duration of those stops, broken down by efficiency levels. It shows that more pit stops do not always mean longer average durations, which means that teams are still able to be efficient even when they manage more stops. Team tactics and their execution during the past 20 years in Formula 1 can be better understood with the use of color coding for efficiency and the size fluctuation in the plot points, which reflect the total number of pit stops.

Formula One drivers' performance trajectories over the last two decades are graphically shown in the above line chart "Driver Performance Comparison Across Seasons (2000-2023)". It follows the ever-changing careers of different drivers and shows how some, like Michael Schumacher, reach their pinnacle during championship years. Formula 1 is a very competitive sport, and this chart shows not only the longevity and consistency of drivers but also the rise of young talents who accumulate points throughout seasons.

The "Constructor Podium Finishes Over Seasons" heatmap displays the podium finishes of every Formula 1 constructor from 2000 to 2023. Color intensity scales with the number of podiums, so constructors with frequent podium results, such as Ferrari, Mercedes, and Red Bull, stand out for their competitive success and consistency. The heatmap illustrates the rise and fall of constructor performance over time, revealing peaks and valleys of competitiveness and dominance.

The above violin plot, named "Distribution of Race Finishing Positions by Driver (2021-2023)", gives a detailed look at each driver's finishing positions across races in the given seasons. The width of each violin shows how often a driver finished in a given position; thinner sections indicate lower frequency. Lewis Hamilton and Max Verstappen, for example, consistently finish races higher up the order, as shown by violins that are broadest at the top positions. Other drivers show a wider distribution, indicating more varied results. The plot summarizes the drivers' performance spectrum, distinguishing those whose results are consistently dominant from those whose results are inconsistent or volatile.

"Formula 1 Constructor Points in the 2023 Season" is a polar pie chart that shows how the points are distributed among the Formula 1 teams. Indicative of their success and high points tally relative to others, the chart is dominated by a handful of constructors with noticeably larger portions. As a representation of the season's competitive hierarchy and performance discrepancies, this image makes it easy to see how Red Bull, Mercedes, and Ferrari stand out from the pack.

The line chart "Top 3 Teams Total Points Over Seasons (2000-2023)" shows how the fortunes of Formula 1's most dominant teams (Mercedes, Red Bull, and Ferrari) rose and fell over the seasons. The graph shows when each team was dominant, with peaks marking seasons of strong performance and troughs marking periods of poor performance. The rising and falling points totals reflect the shifting technical and strategic battles between the teams from season to season.
