The Python language is one of the main tools of a data scientist. We'll cover the main data manipulation tools in Python here, and that knowledge will be the basis for us to apply Machine Learning in Data Streaming.
We will study some modules for data analysis: data structures, the NumPy library, the Pandas library, and Python data preprocessing techniques.
NumPy is the mathematical library for the Python language; it is the basis of almost everything that is done in Machine Learning, Deep Learning, and Artificial Intelligence. Pandas is like Excel for programming; much of what is done in Microsoft Excel can be done in Pandas via code.
Do we need to stick to a programming language to work as data scientists?
Answer: We need to learn how to use an analytical tool, and Python is one of today's top tools.
Data Science is the junction of different areas of knowledge. Working as a Data Scientist without knowing at least one analytical tool makes no sense. There are several analytical tools on the market, but programming languages offer a number of advantages. Although we can do the work in other programming languages, such as Julia, Java, Scala, C++, Go, and JavaScript, R and Python are by far the most used in the market.
Answer: do we need to choose? Neither language is perfect. Each offers advantages and disadvantages, and nothing prevents us from working with R in the statistical analysis stages and with Python, as a complete programming language, throughout the rest of the analysis process before creating predictive models. It is worth learning both.
R and Python are free languages and can be used freely, reducing software licensing costs.
Although R and Python require more manual work to create an analysis solution, they offer greater flexibility.
We can easily apply all knowledge acquired with R and Python in other analytical solutions.
Microsoft, Oracle, and IBM solutions support R and Python as a way to extend the functionality of their solutions.
Both languages have a large, active community and much robust documentation.
These days, we have two significant problems to solve in terms of data analysis:
The first problem is the fact that there is more and more data to analyze.
The second problem is that we have less and less time. So, we need real-time analytics solutions.
The growth in data volume and the high speed at which it is generated make these problems a significant challenge. The good news is that the Python language can help us at every step of the data analysis process, which is precisely why we use Python for data analysis.
Python is an easy-to-learn, interpreted language, very popular worldwide, and commonly used in applications and data analysis. It has an active community with support, good documentation, and speedy language development.
Python's main competitors are the R language and the proprietary platforms SAS, MATLAB, and Stata. However, depending on the type of application, we can also use other languages alongside Python, such as C, C++, and Fortran, which generally offer much higher performance. Thus, Python can be used to develop high-performance applications.
Many companies aim to adopt more than one language in their Analytics projects. Typically, what is seen in the market is the adoption of Matlab or R for specific project steps in statistical analysis or construction of predictive models, and Java, C++, or C# languages to build applications that will run in a large-scale production environment.
Increasingly, the Python language has evolved to become a language with all the characteristics that allow it to be used throughout the process, that is, to use only one language from data collection to making your product or service available at the other end.
The Python language also has disadvantages. Because it is an interpreted language, Python is a little slower than languages like Java and C++, for example. Remember that our goal is not analytics itself; the goal is to solve business problems, and we should always keep that in mind. The resolution of these problems is delivered through a product or service, which results from the analysis process. This product or service consumes CPU, memory, and disk space, which should also be weighed.
Python is not ideal for concurrent or multithreaded applications, and this can be a problem when we work with Big Data Analytics.
To work with the Python language, we have two main options:
IDE (Integrated Development Environment): PyCharm, Spyder, Canopy, WinPython. If you build large Python systems, PyCharm is the ideal option.
Programming via browser: Jupyter Notebook and Jupyter Lab. The main advantage of programming via the browser is that we do not need to install anything beyond the Python interpreter.
When we talk about the Python language, we are talking about two different environments. The first environment is the pure language. When we go to Python.org, we download the interpreter and install it on the machine. At this point, we have the default language with its commands and built-in functions.
If we work only with the standard language, we will be limited to what the language offers fundamentally, which is already a lot.
Thinking about these limitations, several developers around the world work voluntarily on Python packages. They take the pure Python language, develop software in Python, and package it for others to use.
When we install Anaconda Python, it installs hundreds of packages at once. If we worked only with pure Python, that is, using only the default interpreter, we would have to install each package every time we needed one of these features.
PyData Stack is a set of Python packages specific to working with Data Science.
At the base, we have the NumPy package for numerical computing. NumPy is one of the most impressive packages in the Python language. It is a mathematical library; it offers a series of ready-made mathematical functions that, if NumPy did not exist, we would have to develop from scratch. In addition, NumPy performs all of these mathematical tasks with high performance.
In the layer just above NumPy, we have Pandas, a kind of Excel for the Python language, next to Matplotlib and Bokeh, two data visualization packages. During the analysis process, it is essential to build graphs. Then, going up a bit, we have the scikit-image package for computer vision, statsmodels for statistics, and scikit-learn for machine learning.
NumPy is the basic package for mathematical computing in Python. We will hardly use NumPy alone, but rather together with other Python packages, as it provides the fundamental basis for preparing data for analysis. The primary purpose of NumPy is to serve as a container for the data so that we can use it in the analysis process and move the data between different algorithms.
Arrays, NumPy's list-like structures, are much more efficient than the basic options of the Python language. If we want to create a list of elements, we can do this with pure Python, but if we do the same thing with NumPy, the performance will be remarkably better. Learning how to use NumPy will also bring better performance to the other activities we perform in Data Science.
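As a quick, minimal sketch (the array size is an arbitrary assumption), we can time the same summation done with a pure-Python list and with a NumPy array:

import time
import numpy as np

n = 10_000_000
py_list = list(range(n))   # pure-Python list of integers
np_array = np.arange(n)    # equivalent NumPy array

start = time.perf_counter()
sum(py_list)               # summation interpreted element by element
py_time = time.perf_counter() - start

start = time.perf_counter()
np_array.sum()             # vectorized summation implemented in C
np_time = time.perf_counter() - start

print(f"pure Python: {py_time:.3f}s | NumPy: {np_time:.3f}s")

On most machines, the NumPy version finishes far faster, which illustrates the point above.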
Pandas is an entirely free, high-performance data analysis package developed in 2008. It has quickly become the standard library for data manipulation and analysis in the Python language, widely adopted by professionals using Python for Data Science.
With Pandas, we can create DataFrames and Series. Throughout the analysis process, we need tables with rows and columns, like the ones we find in databases or in Excel. We put the data in this table format with DataFrames in Pandas, which gives us a series of manipulation capabilities as if we were using Excel itself, only with programming.
It can process data in different formats, such as time-series data, arrays, structured, or unstructured data.
It greatly facilitates the work of loading and importing data into CSV files or databases.
Provides functions for various preprocessing steps, such as subsetting, slicing, filtering, merging, grouping, sorting, and reshaping.
Allows you to handle missing data easily.
We can use it to convert data as well as apply statistical modeling.
It is fully integrated with other Python packages such as Scipy, NumPy, and Scikit-Learn for Machine Learning.
So, if we combine the NumPy and Pandas packages, we have a complete, entirely free platform for data manipulation.
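As an illustration only (the column names and values are made up), here is how a Series and a DataFrame look in Pandas, with a couple of Excel-like operations done in code:

import numpy as np
import pandas as pd

# A Series: a one-dimensional labeled array
sales = pd.Series([250, 300, 180], index=["Jan", "Feb", "Mar"], name="sales")

# A DataFrame: a table with rows and columns, as in a spreadsheet
df = pd.DataFrame({
    "product": ["A", "B", "C", "A"],
    "region": ["North", "South", "North", "South"],
    "revenue": np.array([100.0, 220.5, 90.0, 130.0]),
})

# Excel-like operations, done via programming
filtered = df[df["revenue"] > 95]                   # filter rows
by_region = df.groupby("region")["revenue"].sum()   # group and aggregate

print(sales)
print(filtered)
print(by_region)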
Making a quick comparison between the R language and the Python language, on the left side we have what R proposes: vectors, matrices, and DataFrames.
On the right side, we have the objects in NumPy and Pandas: what we call a vector in R corresponds to a one-dimensional array in NumPy, and what we call matrices or arrays in R corresponds to multidimensional arrays, which can also be created in NumPy, with all of these objects delivered with high performance.
So far, we've been able to address some of the primary concepts of Data Science, some tools and modules that are commonly used in a data scientist's routine.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Below are some notes from courses on the fantastic universe of data science (or rather, data consciousness). This is an interdisciplinary area, located at the interface between statistics and computer science, which uses the scientific method, processes, algorithms, and systems to extract knowledge and make decisions from data of different types, whether noisy, fuzzy, structured, or unstructured. Thus, it is an area dedicated to the organized study and analysis of scientific, marketing, financial, social, geographic, historical, biological, and psychological data, among many others.
We’ve seen two of the principal frameworks for working with Big Data: Apache Hadoop for distributed storage and processing, and Apache Spark for distributed processing. However, both Apache Hadoop and Apache Spark have processing as their primary purpose, so we use HDFS to store the data and then process that data to deliver a result.
Hadoop HDFS is not a database. The goal is to store, process, and deliver results. However, in other Big Data applications, it may be necessary to store the data to consult, manipulate it in the future, update it frequently, and considering these scenarios, another database category has been developed — NoSQL databases.
NoSQL (non-relational) databases have as their primary objective to store data in unstructured format; Big Data consists of about 80% unstructured data. Suppose we need to create an application where we have to store unstructured data. In that case, we can use one of the many databases in the NoSQL category, unlike Apache Hadoop HDFS, whose main focus is not being a database.
The traditional relational databases belonging to the RDBMS category are Oracle, SQL Server, PostgreSQL, MySQL, DB2, and so on; their primary purpose is to store structured data.
Traditional databases are designed only to handle datasets that can be stored in rows and columns, and therefore can be queried through queries using a structured query language (SQL).
Many of the applications we find in the corporate environment, such as ERP applications, CRM, logistics, employee control, HR, and many others, essentially use relational databases.
Relational databases are designed to work with rows and columns, that is, data in a very well-defined and well-structured format. Relational databases are not able to handle unstructured or semi-structured data because they were not designed for this.
Relational databases lack the functionality required to meet Big Data requirements; high-volume, high-speed data — NoSQL Databases fill this gap. The goal is to choose the right tool to solve the business problem we are working with. There are already more than 200 non-relational databases, which shows the growth of this category.
Several applications running on mobile devices, as well as Machine Learning, Artificial Intelligence, and Analytics applications that need to manipulate unstructured data, use NoSQL databases. This category of databases can be a differentiator when building an architecture for storing and processing large data sets.
NoSQL databases are distributed databases designed to meet Big Data requirements. NoSQL is therefore a class of non-relational databases that does not fall under the RDBMS classification and does not rely on the SQL language.
NoSQL databases provide a much more scalable and efficient architecture than relational databases and make it easier to query semi-structured or unstructured data.
Although the relational model and the SQL language have been the standard for data storage for decades, relational databases are no longer the preferred choice when flexibility and scalability are required.
Traditional relational database solutions do not meet the needs of this new Big Data world. A company can certainly continue using relational databases for its routine daily applications, but trying to do Big Data Analytics without a NoSQL database does not make sense.
On the other hand, if the company wants to enter this new world and extract the best insights from Big Data, create predictive models, and make predictions, NoSQL databases can certainly meet these requirements or be part of a Big Data processing infrastructure.
Graph databases: This category of NoSQL databases often fits social networking scenarios, where nodes represent entities and edges represent the interconnections between them. In this way, it is possible to traverse the graph following the relationships. This category has been used to deal with problems related to recommendation systems and access control lists, using its ability to handle highly interconnected data.
Document databases: This category of NoSQL databases allows you to store millions of documents. For example, we can store many details about an employee, such as their resume, as a document, and later search for potential candidates for a job using a specific field, such as phone number or knowledge of a technology.
Key-value databases: The data is stored in a key-value format, where keys identify the values (data). We can store billions of records efficiently, and the writing process is fast. We can search the data using the associated keys.
Column databases: Also called column-oriented databases; data is organized into groups of columns (column families), both for storage and for querying.
Cassandra is considered a hybrid distributed database, belonging to more than one category of NoSQL databases, and is an efficient alternative for storing and querying most non-relational data.
Apache Cassandra is a freely distributed, high-performance, extremely scalable, and fault-tolerant NoSQL database. Cassandra is used to store vast amounts of data (Big Data) quickly. In addition, it works very well when you need to search data in an indexed way and is an excellent solution when you need high performance for reading and writing.
There are similarities between Big Data technologies. Virtually all solutions are free: we can use them freely, modify the code, and use them in commercial projects without any concern about licenses. The open-source nature makes these technologies evolve at breakneck speed with the force of volunteering, unlike proprietary technologies.
This database has a premise that system or hardware failures always occur; with this in mind, the solution is designed to quickly recover from a failure and thus be much more fault-tolerant when compared to proprietary solutions.
This adaptability and efficiency have transformed NoSQL databases into an excellent solution to treat big data and overcome problems related to processing large volumes of data.
Schemaless data representation: With the advancement of Big Data, since the data is unstructured, it becomes challenging to define a schema before storing the data. NoSQL databases do not have this schema constraint; we store the data in its native format and only later worry about using it and applying some analysis.
Development Time: We load the data in its native format. We do not waste time preparing the relational database, designing the schema, the relationships between tables, and the modeling, all of which are common in relational databases.
Speed: Because of these characteristics specific to each category, NoSQL Databases are usually very fast.
Scalability: We can quickly add more infrastructure according to the project’s needs.
MongoDB is open-source and one of the leaders among NoSQL databases. It is document-oriented, one of the four categories of NoSQL databases.
A document-oriented NoSQL database replaces the concept of “row” as in relational databases with a more flexible model, the “document.” It’s essential to make this parallel, understand that it’s another category, another strategy, and another thought in terms of data storage — you need to be open-minded to receive a new way of storing data.
Indexing: MongoDB supports secondary indexes, allowing faster queries to build.
Aggregation: MongoDB enables the construction of complex aggregations of data, optimizing performance.
Special Data Types: Because NoSQL was modeled for Big Data, it handles data generated at high volume, high variability, and high speed. MongoDB supports time-to-live collections for data that expires at a specific time, such as sessions. This characteristic is handy when working with certain data sources.
Data Storage: MongoDB supports the storage of large amounts of data. Some characteristics present in relational databases are not present in MongoDB, such as joins and multi-row transactions.
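A minimal sketch of that document model using the pymongo driver; the connection URI, database, collection, and field names are all assumptions about a hypothetical local setup:

from pymongo import MongoClient

# Assumed local MongoDB instance; adjust the URI for your environment
client = MongoClient("mongodb://localhost:27017")
db = client["hr"]                       # hypothetical database
employees = db["employees"]             # hypothetical collection

# Documents are schemaless: each one can carry different fields
employees.insert_one({
    "name": "Ana",
    "phone": "+55 11 99999-0000",
    "skills": ["python", "spark"],
    "resume": {"education": "MSc", "years_experience": 7},
})

# Secondary index on a specific field to speed up candidate searches
employees.create_index("skills")

# Query candidates by a specific field, as described above
for doc in employees.find({"skills": "python"}):
    print(doc["name"], doc["phone"])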
A comparison between MongoDB, which belongs to the NoSQL category, and a typical database in the RDBMS category:
It is a new way of storing data, entirely geared toward Big Data.
We can use MongoDB for Big Data, content management, social and mobile infrastructure, user data management, and as an Enterprise Data Hub. We have Data Warehouses, Data Lakes, NoSQL databases, and various other storage structures according to the Big Data project’s needs.
NoSQL has driven the movement toward non-relational databases, and CouchDB is entirely web-oriented. The web is one of the primary data sources of Big Data, so it makes sense to have a database focused on the internet environment.
In CouchDB, data is stored in JavaScript Object Notation (JSON) documents, with fields made up of strings, numbers, dates, ordered lists, and associative maps.
It is distributed between server and client, each of which can have independent copies of the same database. CouchDB enables users to store, replicate, synchronize, and process large amounts of data (Big Data) distributed across mobile devices, servers, data centers, and distinct geographic regions in any deployment configuration, including cloud environments.
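As a hedged sketch of storing a JSON document through CouchDB's HTTP interface with the requests package (the URL, credentials, database name, and document fields are assumptions about a local test installation):

import requests

# Assumed local CouchDB instance with default port and example credentials
base = "http://localhost:5984"
auth = ("admin", "password")      # hypothetical credentials
db = "events"                     # hypothetical database name

# Create the database (a second call simply reports that it already exists)
requests.put(f"{base}/{db}", auth=auth)

# Store a JSON document with strings, numbers, dates, and lists
doc = {
    "user": "maria",
    "visited_at": "2021-06-01T10:30:00Z",
    "pages": ["/home", "/products"],
    "duration_seconds": 42,
}
resp = requests.post(f"{base}/{db}", json=doc, auth=auth)
print(resp.json())                # includes the generated document id and revision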
And there we have it. I hope you have found this useful. Thank you for reading. 🐼
In this article, we will cover what the Business Analytics process is, the flow of activities to be done in a specific order to achieve our goal.
While traditional statistical applications focus on relatively small data sets, Data Science involves vast amounts of data, typically what we call Big Data. Thus, when we talk about Data Science, we are generally talking about applying analysis techniques to large data sets, which requires applying Computer Science to the storage and processing of those datasets.
If we’re working with a petabyte of data, it will require a different infrastructure than a Gigabyte. So at this point, the Data Scientist must know a little about Computer Science and eventually about programming to extract as much of the equipment and computational resources as possible.
Based on this, Business Analytics should be seen as a process, that is, a flow of activities and operations to solve particular problems. For each of these steps, we have techniques, procedures, and tools that can be used.
A company that wants to succeed from the Business Analytics application should understand this and, above all, foster the construction of this process as if it were a production line.
Of course, according to the data and the type of business problem, the Data Scientist will use different techniques at each stage, but if we know how all this works, the process is much easier to implement to achieve the desired result.
The Business Analytics process involves several interrelated steps:
The starting point is the precise definition of the business problem to be solved, which will guide everything that a Data Scientist will do. A lack of clarity in the description of the business problem is the certainty of an unsuccessful Business Analytics process. It does not matter if we use the best technique on the planet, the best procedure, or high-precision tools; if the problem is not well defined, it is a waste of time and resources.
It is necessary to delve into the business area through rounds of conversations, meetings, and reflection on the problem.
With the problem well defined, we have to think about storing the data efficiently and about the data pre-processing steps, because all of this is critical to the success of the analysis. That is where the Big Data issue comes in; if the data set is large, we will not be able to process all of it on a single machine, and we need to use a cluster of computers that behaves as if it were just one. In this way, we can use all the memory and processing capacity of this set of machines.
Know how to apply pre-processing ideas according to the problem and the data set we have at hand. In this step, we have the Data Engineer taking care of the storage infrastructure and the Data Scientist working on the pre-processing steps.
We must select the appropriate response variables and decide on the number of variables we should investigate. When we collect data from the source to solve a problem, it is likely that not all variables will be relevant.
Therefore, we have to apply some techniques to choose the best variables among those available and pass them on. The Data Scientist's judgment is required here; if we do not make the correct choice of variables, all future steps will be compromised.
The data needs to be screened for outliers, and missing values must be addressed (either omitted or appropriately imputed using one of several available methods).
Therefore, we define the problem, collect the data, store the data, apply some initial pre-processing, tabulate, and identify the variables most relevant to the problem. We also have to identify values outside the pattern, values that deviate from the average.
For some problems, we want to identify the outlier itself. For other types of problems, we need to remove it because it influences the mean of the data, and influencing the mean will influence the predictive model.
The same applies to missing values; perhaps someone did not fill in some attribute at the source, and we are left without a record that may be important for our analysis. A decision needs to be made: remove the records with missing values, or apply an imputation technique with some suitable value. Therefore, once again, we have processes, procedures, and tools for this step.
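A small illustrative sketch with pandas (the values and the screening rule are made-up examples, not a recommendation for a specific threshold):

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a suspicious income value
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 29],
    "income": [3200.0, 2900.0, 3100.0, np.nan, 95000.0],
})

# Option 1: remove the records that contain missing values
dropped = df.dropna()

# Option 2: impute missing values, here with the column median
imputed = df.fillna(df.median(numeric_only=True))

# A simple screening for outliers: standardized distance from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
print(z)        # values far from zero point to candidate outliers
print(dropped)
print(imputed)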
Before applying predictive models and sophisticated methods, the data needs to be visualized and summarized. This step is often neglected, yet it is an opportunity to visually identify patterns or problems that we cannot spot during coding.
Chart building is part of the analysis process. Visualization helps us understand how data is organized. As a result, we can know the data better, map any problems, and make better decisions with exploratory analysis.
The data summary involves typical summary statistics such as the mean, percentiles, median, standard deviation, and correlation, as well as more advanced summaries such as principal components. For example, by looking at the mean and the standard deviation, we can get a general idea of how the data is organized and make better decisions during the process.
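A brief sketch of this exploration step with pandas and Matplotlib, using synthetic data just for illustration:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric dataset
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x": rng.normal(50, 10, 500),
    "y": rng.normal(100, 25, 500),
})

# Typical summary statistics: mean, percentiles, median, standard deviation
print(df.describe())
print(df.corr())                 # correlation between the variables

# Quick visual checks: distribution and relationship between variables
df["x"].hist(bins=30)
plt.title("Distribution of x")
plt.show()

df.plot.scatter(x="x", y="y", title="x vs y")
plt.show()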
Appropriate predictive modeling methods need to be applied. Depending on the problem, this involves linear regression, logistic regression, regression/classification trees, nearest neighbor methods, and clustering.
By now, we have the data practically ready and pre-processed. After all the steps mentioned so far, we will apply an algorithm to work with predictive modeling from the historical data.
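A minimal sketch with scikit-learn, using a synthetic dataset in place of the pre-processed historical data; logistic regression is just one of the possible methods mentioned above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the pre-processed historical dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out part of the history to evaluate the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # learn from historical data

predictions = model.predict(X_test)       # predict on new observations
print("accuracy:", accuracy_score(y_test, predictions))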
Finally, the analytics insights need to be implemented; we need to act on the results. In this last step, the Data Scientist works with the business area, explaining the insights obtained from the predictive model, presenting the conclusions, and proposing alternatives.
The final business decision belongs to the Director. It is up to the Data Scientist to support the decision-making area with communication and storytelling, knowing how to transmit the information collected from the data and supporting the company's decision making.
We can change the business problem, and the dataset may be different, but the process is the same. Of course, when the problem and the data are changed, we should apply other techniques, tools, and procedures, but the process will invariably be the one we saw.
And there we have it. I hope you have found this useful. Thank you for reading. 🐼
When it comes to high-performance computing, we think of sophisticated, expensive, and overly technological servers. However, you can get results as good or even superior from a solution called a computer cluster.
A cluster is a technology that makes simpler computers work together as a single machine or a single system. In a computer cluster, each node is configured to perform the same task, controlled and coordinated by software.
If we had ten machines forming the cluster, the software responsible for storage across these ten machines would be Hadoop HDFS, a file system developed within the Apache Hadoop framework that controls, manages, and coordinates storage across the machine environment. So, we take multiple computers, connect them to the same network, and put a layer of software on top to manage all these machines so they can behave like just one.
Hadoop HDFS does this for distributed storage, and MapReduce does this for distributed processing, just as Apache Spark does for distributed processing. Hive or HBase uses HDFS to process and store data on multiple machines.
The concept of a computer cluster is a group of machines working together, controlled and coordinated by software. Each computer that is part of the cluster is called a node, and there is no limit to the number of nodes.
As data scientists, we will process the data in Hadoop regardless of whether we process through a machine or 5,000 machines — this has to be transparent to the end-user. But, on the other hand, the data engineer will ensure good performance, safety and that all devices are operating correctly.
Nodes in a cluster must be interconnected, preferably by a well-known network technology, for maintenance and cost-control purposes. The most commonly used type of network for connecting a cluster is Ethernet, a local network that interconnects the machines so they can communicate.
Cluster computing is a viable solution because cluster nodes can be composed of low-cost, medium-configuration machines. Therefore, a company can take a few dozen simple configuration machines and assemble a cluster of computers and use the processing capacity of these dozens of devices as if they were a large server.
In general, these low-cost machines may have a lower total cost than buying a single large server, which is likely to have a much higher price.
Depending on the application, there are several types of cluster, each built with a different ultimate goal. When working with databases for critical applications within a company, the database is rarely placed on only one machine, especially in large companies that require a system running 24 hours a day.
A high availability cluster is configured so that two servers work together, so that the database does not stop working if one of them goes down. For example, a company that offers a website, e-commerce, or web service runs its web servers in high availability with half a dozen machines. Then, in the face of unforeseen events or access peaks, one machine covers for the other, and we have stable performance.
This type of cluster is targeted at applications that are very processing-intensive. For example, systems used in scientific research can benefit from this type of cluster because they need to analyze large amounts of data and perform very complex calculations quickly. Therefore, the focus of this type of cluster is to allow processing-heavy applications to deliver satisfactory results promptly.
The goal is to process as much as possible in the shortest possible time. When an application requires this, we can deliver a high-performance cluster.
Therefore, depending on the application and the number of instructions required for the high processing of a large volume of data, we will increasingly need gigaflops and rely on a high-performance cluster.
This high-performance Cluster is typically a type of Cluster that we set up for Apache Hadoop. After all, we process machine learning models or even data analysis processes that need to be executed promptly.
The goal is that the application that is served by the Cluster does not stop working. With Apache Hadoop, we typically set up a high-performance cluster to process the application in the shortest possible time.
However, in some cases, the application is so critical that we cannot interrupt processing; we can also configure Apache Hadoop in high availability mode. If one server crashes, the other will continue to fulfill the requests.
A high availability cluster is mission-critical. Therefore, it is up to the company whether the Data Lake where we put Apache Hadoop with the Cluster needs high availability or not.
This type of cluster is widely used for web servers, the computers that run the web service to respond to page requests. For example, when we open the browser and type in the LinkedIn address, we are directed to the web server where all the LinkedIn pages are located.
During the day, there are access spikes, and the number of web servers should increase to balance the load. When access returns to normal, the servers are gradually shut down so that we pay only for the infrastructure we use.
We can build a cluster that has all of this. It is very common to have high performance and high availability in a Hadoop cluster, while load balancing is not very common there; load balancing is more typical of web servers and databases.
In practice, many applications can use these clusters: web servers, databases, or any system that requires high performance, high availability, or load balancing.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼🤏
When Hadoop was released, it supplied two Big Data needs: distributed storage and distributed processing. However, this alone is not enough to work with Big Data; other tools are required, i.e., other functionalities to meet different business, application, and architecture needs.
The entire Hadoop Ecosystem was built to work with billions of records, and that is what we have at our disposal in the Apache Hadoop Ecosystem.
Over the years, other software has begun to appear to run seamlessly together with Hadoop and its Ecosystem. Several products benefit, for example, from the HDFS file system that allows you to manage multiple machines as if they were just one.
These other Ecosystem products benefit by running on the Hadoop Distributed File System. With this, we can create an Ecosystem focused on the needs of an application, an architecture, or whatever we want to do with Big Data. Just think of the Ecosystem as the apps of the iOS or Android operating systems; without them, the smartphone would only serve to make and receive calls, its initial goal.
Applications serve to enhance the Operating System’s capability, so we can apply the same reasoning to the Hadoop Ecosystem components to complement the operation of Hadoop.
Data transfer (Flume, Sqoop, Kafka, Falcon)
File System (HDFS)
Data Storage (HBase, Cassandra)
Serialization (Avro, Trevni, Thrift)
Job Execution (MapReduce, YARN)
Data Interaction (Pig, Hive, Spark, Storm)
Intelligence (Mahout, Drill)
Search (Lucene, Solr)
Graph Processing (Giraph)
Security (Knox, Sentry)
Operation and Development (Oozie, Zookeeper, Ambari)
In addition to using Hadoop, we can use other systems that will run on Hadoop and, with that, assemble a single architecture powerful enough to extract from Big Data the best it has to offer.
To manage and organize the zoo formed by Hadoop's many animals, the guardian ZooKeeper was created, responsible for keeping everything working. ZooKeeper has become a standard for coordinating Hadoop services, HBase, and other distributed structures.
ZooKeeper is a high-performance open-source solution for coordinating services in distributed applications, that is, across large cluster environments.
Coordinating and managing a service in a distributed environment is a complicated process. ZooKeeper solves this problem with its simple architecture, handling the management of the Hadoop cluster. Thus, ZooKeeper allows developers to focus on the logic of the main application without worrying about the distributed nature of the application.
The Hadoop Ecosystem also has a workflow manager, which allows us to schedule jobs and manage them through the Cluster. Remember that we are dealing with running on computer clusters, where all of this requires different management than companies are typically used to doing in more traditional Database environments.
When we create the Cluster to store and process large data sets, other concerns will arise. Hadoop came to solve the big data problem, but it brought cluster management, which requires other practices, different techniques, etc.
Apache Oozie is a workflow scheduling system used primarily to manage MapReduce jobs. Apache Oozie is integrated with the rest of the Hadoop components to support various Hadoop jobs (such as Java MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
Oozie is a workflow processing system that allows users to define a series of jobs written in different languages, such as MapReduce, Pig, and Hive, and then intelligently link them to each other. In addition, Oozie enables users to specify that a particular job can only start after previous jobs that access the same data are complete.
Apache Hive is a Data Warehouse that works with Hadoop and MapReduce. Hive is a data storage system for Hadoop that makes it easy to aggregate data for reporting and analysis of large data sets (Big Data). It is a system for managing and querying unstructured data in a structured format!
Hadoop alone may not be enough for the architecture of a Big Data application. Hadoop lacks some components that we need day-to-day, such as a tool that allows us to aggregate data and generate reports from what is stored in Hadoop HDFS. Apache Hive came to meet this need, being another component that runs on top of Apache Hadoop.
Hive enables queries over data using a SQL-like language called HiveQL (HQL). This system provides fault tolerance for data storage and relies on MapReduce for execution, meaning Hive alone is of no use. Instead, we need Hive running on a Hadoop infrastructure because it needs the data stored in HDFS and depends on MapReduce, so Hive is a kind of Hadoop plugin.
It allows JDBC/ODBC connections to be easily integrated with other business intelligence tools like Tableau, Microstrategy, Microsoft Power BI, and more. Hive is batch-oriented and has high latency for query execution. Like Pig, it generates MapReduce Jobs that run in the Hadoop Cluster.
In the backend, Apache Hive itself generates a MapReduce job. We do not need to create this MapReduce job ourselves; Hive makes our life easier by serving as a user-friendly interface for collecting the data from the infrastructure.
Therefore, Hive uses MapReduce for execution and HDFS for data storage and research; it provides the specific HQL language for Hive engine queries, which supports the basics of the SQL language.
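A hedged sketch of that idea through PySpark with Hive support enabled (it assumes a Spark installation configured to reach the Hive metastore; the table and column names are hypothetical):

from pyspark.sql import SparkSession

# Assumes Spark is configured to reach the Hive metastore and HDFS
spark = (
    SparkSession.builder
    .appName("hive-example")
    .enableHiveSupport()
    .getOrCreate()
)

# HQL looks like SQL; behind the scenes it runs as distributed jobs
spark.sql("CREATE TABLE IF NOT EXISTS sales (product STRING, amount DOUBLE)")

result = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
""")
result.show()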
Sqoop is a project in the Apache Hadoop ecosystem whose responsibility is to import and export data from relational databases. Sqoop was developed to transfer data between Hadoop and RDBMSs, bringing the data into Hadoop without further development.
At some point during the analysis process, we will need data stored in relational databases. We can collect unstructured data from a social network and structured sales data from the relational database, record both in Hadoop, apply some analysis, and try to find out the relationship between the company's interactions on social networks and the sales process. In the end, we create a predictive model to help the company's managers and decision-makers.
Sqoop allows us to move data from traditional databases like Microsoft SQL Server or Oracle to Hadoop. For example, you can import individual tables or entire databases into HDFS, and the developer can determine which columns or rows will be imported.
We can control how Apache Sqoop imports or exports the data to and from HDFS. Sqoop uses a JDBC connection to connect to relational databases, can create tables directly in Apache Hive, and supports incremental imports, which is useful if part of a table was missed or the table has grown since the last import.
Apache Pig is a tool that is used to analyze large data sets that represent data flows. We can perform all data manipulation operations on Hadoop using Apache Pig.
To write data analysis programs, Pig offers a high-level language known as Pig Latin. This language provides several operators that programmers can use to create their own reading, writing, and processing data functions.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All of these scripts are converted internally to mapping and reduction tasks. Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts those scripts to Jobs MapReduce. We have 2 components of Apache Pig:
Pig Latin Script Language is a procedural language of data flow and contains syntax and commands applied to implement business logic.
Runtime Engine, the compiler that produces sequences of MapReduce programs: it uses HDFS to store and fetch data, interacts with Hadoop systems, and validates and compiles scripts into sequences of MapReduce jobs.
Apache HBase is one of the most impressive products in the Hadoop ecosystem. It is a non-relational NoSQL database designed to work with large data sets on Hadoop HDFS, of the type that uses the key-value model. Each value is identified by a byte-array key; in addition, tables do not have schemas, unlike relational databases, where a fixed schema is a fairly common characteristic.
The goal of HBase is to store huge tables with billions of records. When working with a relational database, it is common to have a table with many columns or multiple related tables with fewer columns. With HBase, we typically have a single table, usually with 3 to 4 columns and billions of records. This is not ideal for every project, but it fits projects requiring this type of access, with fewer variables and a vast number of records stored on HDFS.
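A minimal sketch of that access pattern with the happybase Python client (it assumes an HBase Thrift server running locally; the table, column family, and row key are hypothetical):

import happybase

# Assumes an HBase Thrift server running locally and the happybase package
connection = happybase.Connection("localhost")

# One wide table with a single column family, as in the design described above
connection.create_table("events", {"info": dict()})   # fails if it already exists
table = connection.table("events")

# Each value is identified by a row key plus column-family:qualifier
table.put(b"user42-20210601", {b"info:page": b"/home", b"info:duration": b"42"})

# Fast lookups by key
row = table.row(b"user42-20210601")
print(row[b"info:page"])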
It takes advantage of the fault tolerance provided by the Hadoop file system (HDFS), which is one of the characteristics of ecosystem products. So we do not have to worry about fault tolerance, as this has already been implemented in HDFS; HBase is a product that runs on HDFS, and we only worry about the features of the add-on product itself.
Only one Master node runs at a time, with ZooKeeper maintaining high availability. The Master is responsible for managing cluster operations, such as region assignment, load balancing, and splitting. It does not take part in read and write operations.
We can have one or more RegionServers. A RegionServer is responsible for storing tables, performing reads, and buffering writes. The client communicates with the RegionServer to process read and write operations.
We have an architecture very similar to Hadoop architecture, where we have the Master and the Slave. In the case of HBase, the difference is subtle, but it’s as if we have the Master doing the management and the Slave in general terms.
In HBase, partitioning is automatic; in an RDBMS-category database, partitioning can be automated or manual, performed by the administrator.
HBase scales linearly and automatically with new nodes; if the cluster no longer has enough processing capacity, we add new nodes to gain horizontal scalability with more machines. In the case of an RDBMS, scalability is vertical, achieved by adding more hardware to a single server: more memory, more disk space, more processing, and so on.
HBase uses commodity hardware, the same characteristic as a Hadoop cluster. However, RDBMS requires more robust and, therefore, more expensive hardware.
HBase has fault tolerance; in an RDBMS, fault tolerance may or may not be present. Thus, some problems will be solved with relational databases, while other problems, especially those related to Big Data, will be solved with Apache HBase.
Apache Flume is a service that works in a distributed environment to efficiently collect, aggregate, and move large amounts of data, with a flexible and straightforward architecture based on streaming data flows. Take Twitter: how do we collect that network data and bring it into HDFS? Apache Flume is an option.
Flume is for when we need to bring data from different sources in real-time to Hadoop. Flume is a service that basically allows you to send data directly to Hadoop HDFS.
Flume’s data model allows it to be used in online analytics applications. We can have social networks, Facebook, Twitter, server logs, or any other data source on the left side. We collect this data with Apache Flume, record it in Apache HDFS to eventually apply MapReduce, or even use Apache HBase and overall architecture.
We set up a Big Data infrastructure to store data and then process it. Apache Flume is designed to bring and store the data in a distributed environment.
At some point, we will need to apply Machine Learning, today’s leading technology. Machine Learning allows us to perform predictive modeling, predicting through historical data and automating the process.
Apache Mahout is an open-source library of machine learning algorithms, scalable and focused on clustering, classification, and recommendation systems. Therefore, Apache Mahout may be an option when we need to apply Machine Learning to a large data set stored in Hadoop.
Suppose we need high-performance machine learning algorithms in an open-source, free solution; we have a large data set (Big Data); we use analytics tools like R and Python; we process data in batch; and we want a library that is mature in the market. In that case, Mahout can meet our needs.
Apache Kafka manages real-time data flows generated from websites, applications, and IoT sensors. This agent application is a central system that collects high-volume data and makes it available in real-time for other applications.
Producers are the data sources that produce the data, and consumers are the applications that consume the data. All of this can happen in real time: Apache Kafka collects data produced from clicks, logs, and stock quotes and delivers it to the consumers, which process it, apply machine learning, and then let the data be discarded.
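A hedged sketch of a producer and a consumer using the kafka-python package (the broker address, topic name, and event fields are assumptions; it requires a running Kafka broker):

import json
from kafka import KafkaConsumer, KafkaProducer

# Producer side: a data source publishing click events to a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed local broker
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("clicks", {"user": "ana", "page": "/home"})
producer.flush()

# Consumer side: an application reading the stream in (near) real time
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    event = json.loads(message.value)
    print(event)          # here we would analyze the event and let it go
    break                 # stop after one message, just for the example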
We generate an amount of data never seen before, and this only tends to increase exponentially. Kafka proposes to analyze data in real-time rather than storing it — it no longer makes sense to talk only about data stored in tables, with rows and columns. The volume of data is now so large that it needs to be seen as what it really is: a constant stream that needs to be analyzed in real-time.
In other words, we collect the data from Twitter, forward the data through some application, analyze the set, deliver the result and let the data go away, that is, analyze in real-time without storing the data. It makes no sense to store the data because it is already data from the past.
Kafka should evolve a lot, because we have no way to store so much data, and in some cases it makes no sense for the business goal to store and later analyze data from the recent past.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
We will see the main features of Apache Spark and the point where some professionals’ first big mistake occurs.
Compare Spark to Hadoop.
Hadoop is a framework divided into several modules. We have Hadoop HDFS, the distributed file system for distributed data storage, and Hadoop MapReduce for distributed data processing. Apache Spark rivals Hadoop MapReduce and not Hadoop HDFS.
Hadoop MapReduce processes large data sets distributed across a computer cluster, while intermediate results are written to disk. Writing or reading to disk consumes much more computational resources.
In contrast, Apache Spark reads the data from Hadoop HDFS, performs the computation in memory, and writes the result to an in-memory cache; that is Spark's significant differential, working straight from computer memory.
Memory reading is much faster than disk reading so that Apache Spark can be up to 100x faster than Hadoop MapReduce.
However, a computer (still) has a memory limit. If we consider a cluster, the memory is distributed: the total is the memory of all the machines in the cluster combined. We may have terabytes of distributed RAM; this is very common in a cluster of computers.
But what if we have petabytes of data to process? Will terabytes of memory be enough to process them? If we do not have enough memory, we need some way to spill to disk; that is, Hadoop MapReduce can be the right choice if the volume of data is monstrous.
Even on disk, Apache Spark offers slightly better performance than Hadoop MapReduce, but depending on the purpose and application of the data, Hadoop MapReduce can be an excellent option. Therefore, it all depends on the ultimate goal and the computational resources at our disposal.
Spark performs mapping and reduction, a programming paradigm for working with large data sets. First, we map, reduce, and produce a result; then we can reapply mapping and reduction to that result. In this way, we turn a large data set into a much smaller one, which lets us analyze, summarize, aggregate, and perform several other activities. In essence, Spark performs MapReduce operations like the famous Hadoop MapReduce.
We can use Hadoop and Spark together! Hadoop HDFS for distributed storage and Spark for distributed data processing. However, HDFS is not the only option for Spark. We can use S3, local storage, relational and non-relational databases — if the set is too large, HDFS ends up being the best alternative.
Spark allows you to build an analytics workflow; that is, we can run a kind of data analytics pipeline, a series of operations: Spark collects the data, transforms it, aggregates it, applies the Machine Learning model, and delivers results. We can build this entire pipeline with Apache Spark.
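A minimal PySpark sketch of such a pipeline: collect, transform, aggregate, and deliver (the HDFS path and column names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Collect: read data (HDFS, S3, or local storage all work here)
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Transform and aggregate entirely in memory across the cluster
result = (
    df.filter(F.col("amount") > 0)
      .groupBy("product")
      .agg(F.sum("amount").alias("total"))
      .orderBy(F.desc("total"))
)

# Deliver the results
result.show()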
Apache Spark uses computer memory differently and efficiently. Hadoop MapReduce's questionable performance inspired the creators of Apache Spark: when analyzing MapReduce, they realized the application was very inefficient in its use of computer memory, and that keeping intermediate results in memory, rather than writing them to disk, would speed up processing time. That is how Apache Spark's efficiency emerged.
In practice, everything is programming. So, we build the process in the way that is most convenient for the use case.
There’s no license cost. There is the cost of learning, and there is an Apache Spark learning curve. In general, it requires computational resources, and in return, there is no license to pay.
Spark does not have an installer. Instead, we download a file, unpack it, copy the directory to a location on our machine, and then configure the environment variables.
Within the Spark directory, we’ll find a structure that’s worth exploring to find out what we have at our disposal. As the directory itself suggests, we’ll find the Apache Spark executable binaries in the bin directory. It has binaries for Windows, Mac, and Linux operating systems in the same directory — it’s worth investigating these scripts.
conf: a series of Spark configuration files, properties, and parameters;
data: sample data for graphs, streaming, and mllib, with a series of text files; these are sample dataset files so that we can use the examples provided by Apache Spark;
examples: several code samples in several languages, with examples of algorithm applications;
jars: a jar is a Java package; this directory holds a series of Java class libraries;
kubernetes: for using containers;
licenses: some usage details, despite Spark being free; relevant later for a business environment or commercial application;
python: holds the executables for PySpark; when we use PySpark, it searches this directory for the necessary scripts;
sbin: has scripts to boot a cluster of Spark computers; to have a large processing environment, we can configure Spark in cluster mode to manage multiple machines and their memory;
yarn: the option we have for managing resources in a Spark cluster.
It is all used as we build a cluster, run PySpark, run spark-shell, and so on.
We need to ensure that we have the necessary software, especially Java, installed and configured, checking from the terminal. If you don't have Java, forget Spark:
java -version
We also need Python. In our case, it is Anaconda Python:
python
And lastly, PySpark. In this case, we will start PySpark inside the directory where the Jupyter notebooks are:
cd Documents/Directory/Exercises/0
pwd
“Documents/Directory/Exercises/0”
And within the directory specified above, we will run PySpark:
pyspark
PySpark will open in the default browser. Running PySpark demonstrates that we have an environment installed and configured. Now we need to work directly with PySpark rather than with the plain Jupyter Notebook.
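As a quick sanity check inside the notebook (a minimal sketch; getOrCreate() reuses the session the pyspark shell may already have created):

from pyspark.sql import SparkSession

# Reuses the session created by the pyspark shell, or creates a local one
spark = SparkSession.builder.appName("sanity-check").getOrCreate()
print(spark.version)

# A trivial distributed computation to confirm everything is working
print(spark.range(1, 11).count())   # should print 10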
Our environment is ready and functional. We have everything we need and get to work.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Here we will cover 10 Jupyter Lab extensions that are very useful for drastically improving a data scientist's productivity and routine.
First, we run the command to install a Jupyter Lab extension.
jupyter labextension install @jupyterlab/...
If we are users of VS Code, Sublime, or Atom, we can also search for what to install directly in a “manager.” Jupyter Lab provides this feature.
We can go directly to the 4th tab in the left navigation, which is the extension manager, and search for the extension that meets our needs.
Sometimes a debugging feature is needed for coding. We may want to run a loop step by step to see exactly what is happening and what is being executed. Most IDE tools support this debugging feature with “step over” and “step into,” but unfortunately not Jupyter:
@jupyterlab/debugger
This extension allows us to add this missing feature to Jupyter Lab.
image: https://blog.jupyter.org/a-visual-debugger-for-jupyter-914e61716559
If we have a long notebook, we can make it nicer for a presentation:
@jupyterlab/toc
image: https://github.com/jupyterlab/jupyterlab-toc/raw/master/toc.gif
With this extension, the table of contents is generated automatically based on markdown cells (using “##” headings to specify their nesting levels). It is a good way to use the notebook in a more organized fashion.
Diagram.net (formerly Draw.IO), a great tool for drawing diagrams:
jupyterlab-drawio
image: https://github.com/QuantStack/jupyterlab-drawio/raw/master/drawio.gif
One of the amazing features of Jupyter Notebook/Lab is that it provides many useful magic commands. We can test how long our code takes to run. It runs our code snippet hundreds or thousands of times and takes the average, to make sure it gives a fair and accurate result.
%timeit
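For example, in a notebook cell (the snippet being timed is just an arbitrary illustration):

%timeit sum(range(1_000))   # reruns the snippet many times and reports timing stats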
However, sometimes we don't need to be that scientific. Besides, it would be nice to know how long each cell takes to run. In that case, using %timeit for every cell is absolute overkill.
jupyterlab-execute-time can help in this case.
Above, we have the cell's execution time and the last time it was executed. This is a very convenient feature for indicating the order in which cells were run.
As data scientists or data engineers, we sometimes have to deal with spreadsheets. However, Jupyter does not natively support reading Excel files, which forces us to open multiple tools, switching between Jupyter for coding and Excel for viewing the data.
jupyterlab-spreadsheet solves this problem perfectly.
The extension embeds xls/xlsx spreadsheet viewing into Jupyter Lab, so we can have everything we need in a single place.
image: https://github.com/quigleyj97/jupyterlab-spreadsheet/raw/main/screenshot.png
Python is not an execution-efficient programming language, which means it may consume more CPU/memory resources compared to others.
We may want to monitor our system's hardware resources to make sure our Python code does not stall or freeze the Operating System.
jupyterlab-topbar-extension
The extension displays CPU and memory usage in a top bar of the Jupyter Lab interface so that we can monitor them in real time.
image: https://github.com/jtpio/jupyterlab-system-monitor/raw/main/doc/screencast.gif
JupyterLab Kite is a free AI-powered code completion service. It is available in almost every popular IDE, such as Sublime, VS Code, and PyCharm.
image: https://github.com/kiteco/jupyterlab-kite
With this extension, we can code in Jupyter Lab more fluently.
Unfortunately, this feature is not available in Jupyter Lab by default. However, this extension brings it back:
jupyterlab-variableInspector
image: https://github.com/lckr/jupyterlab-variableInspector/raw/master/early_demo.gif
Matplotlib is a must-have Python library for a Data Scientist. It is a basic but powerful tool for data visualization in Python.
To enable it, we use the magic command so that 3D plots become interactive:
jupyter-matplotlib
%matplotlib widget
image: https://github.com/matplotlib/ipympl/raw/master/matplotlib.gif
While Matplotlib is the most basic and powerful library for data visualization, Plotly is one of the favorites in this area. Plotly wraps many common chart types, so we can generate amazing plots in a few lines of code.
For Jupyter Lab to fully support and display interactive Plotly charts, the following needs to be installed:
jupyterlab-plotly
Recent years have witnessed major investments in business infrastructure that have improved companies' ability to collect data. The wide availability of this data has led to increasing interest in methods for extracting useful information and knowledge from it: the domain of data science.
Today, practically every aspect of business is open to data collection and often even instrumented for it: operations, manufacturing, supply chain management, customer behavior, marketing campaign performance, workflow procedures, and so on.
Probably the biggest application of data mining techniques is in marketing, for tasks such as targeted marketing, online advertising, and recommendations for cross-selling. Data mining is used in customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value.
Financial institutions use data mining for credit scoring and trading, and in operations through fraud detection and workforce management.
Major retailers, from Walmart to Amazon, apply data mining throughout their businesses, from marketing to supply chain management.
The terms “Data Science” and “Data Mining” are often used interchangeably. At a higher level, data science is a set of fundamental principles that guide the extraction of knowledge from data.
It is important to understand Data Science even if you never apply it yourself. Data-analytic thinking allows you to evaluate project proposals. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by obtaining knowledge from data, you should be able to evaluate the proposal systematically and decide whether it is good or bad.
Data Science, Engenharia e Tomada de Decisão Orientada em Dados envolve princípios, processos e técnicas para compreender fenômenos por meio da análise (automatizada) de dados. O objetivo primordial de data science é o aprimoramento da tomada de decisão, uma vez que isso sustenta a saúde do negócio.
Consider a second, more typical business scenario and how it might be approached from a data perspective.
Suppose you have just started a great analytics job and your company has a major problem with customer retention in its wireless products and services business. In a particular region, 20% of cell phone customers leave the service when their contracts expire, and it is becoming increasingly difficult to acquire new customers.
Since the cell phone market is now saturated, the huge growth of the wireless market has tapered off. Customers switching from one company to another is called churn, and it is expensive all around: one company spends on incentives to attract a customer, while another loses revenue when that customer leaves.
We have been called in to help understand the problem and find a solution. Attracting new customers is much more expensive than retaining existing ones, so a sizable marketing budget is allocated to preventing churn. Marketing has already designed a special retention offer.
Our task is to devise a precise, step-by-step plan for how the data science team should use the company's vast data resources to decide which customers should receive the special retention offer before their contracts expire.
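As a purely illustrative sketch of the churn-scoring step (the file name, column names, and model choice below are assumptions, not the company's actual pipeline):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical data: numeric usage/contract features plus a known churn label
df = pd.read_csv("customers.csv")
X = df.drop(columns=["churned"])
y = df["churned"]                        # 1 = left when the contract expired

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Score customers by churn risk; the highest-risk ones receive the retention offer
churn_risk = model.predict_proba(X_test)[:, 1]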
In fact, customer retention has been one of the major uses of data mining technologies, especially in the telecommunications and finance industries. These, more generally, were among the earliest and broadest adopters of data mining technologies.
The main kinds of decisions are:
decisions for which "discoveries" need to be made within the data;
decisions that "repeat", especially at massive scale, so that decision making can benefit from even small increases in accuracy based on data analysis.
Walmart's competitor Target made the news with a data-driven decision-making case of its own. Like most retailers, Target cares about consumers' shopping habits, what drives them, and what can influence them. Consumers tend to be inert in their habits, and getting them to change is difficult.
Decision makers at Target knew, however, that the arrival of a new baby in a family is one moment when people significantly change their shopping habits. Most retailers know this, so they compete with one another trying to sell baby products to new parents. Since most birth records are public, retailers obtain information on births and send special offers to the new parents.
Target wanted to get ahead of the competition. They were interested in whether they could predict that people were expecting a baby. If they could, they would gain an advantage by making offers before their competitors. Using data science techniques, Target analyzed historical data on customers who were later known to have been pregnant, and was able to extract information that predicted which consumers were expecting a baby so its offers could be sent earlier.
Recently, "big data" technologies such as Hadoop, HBase, and MongoDB have received considerable media attention. Essentially, the term big data refers to datasets that are too large for traditional processing systems and therefore require new technologies to process them.
Hadoop, for example, is a widely used open-source framework for highly parallelizable computations. It is one of the current "Big Data" technologies for processing massive datasets that exceed the capacity of relational database systems. Hadoop is based on the MapReduce parallel processing model introduced by Google.
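To make the MapReduce idea concrete, here is a toy word-count sketch in plain Python; it shows only the programming model, not Hadoop itself.

from collections import defaultdict

records = ["big data", "data science", "big data tools"]

# Map step: emit a (word, 1) pair for every word in every record
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle/Reduce step: group by key and sum the counts per word
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 2, 'data': 3, 'science': 1, 'tools': 1}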
The previous sections suggest one of the fundamental principles of data science: data, and the ability to extract useful knowledge from them, should be regarded as important strategic assets. Viewing them as assets lets us think explicitly about the extent to which we should invest in them.
The best data science team can generate little value without the right data; often, the right data cannot substantially improve decisions without suitable data science talent. As with all assets, investment is frequently necessary. It is important to understand data science even if you do not intend to do it yourself, because data analysis is now crucial to business strategy and health.
Companies are increasingly driven by data analysis, so there is a great professional advantage in being able to interact competently with and within such companies. Companies in many traditional industries are exploiting new and existing data resources for competitive advantage. They employ data science teams to bring advanced technologies to bear on increasing revenue and decreasing costs.
We should emphasize that data science, like computer science, is a young field. Success in today's data-driven business environment requires the ability to think about how these fundamental concepts apply to particular business problems: to think analytically about data and to demand better results from the decisions that are made.
Seaborn is a popular Python data visualization library based on matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics. Here are the 15 most common functions of the Seaborn library, ordered from most to least common (a short usage sketch follows the list):
1. distplot( ): The distplot function is a versatile tool for visualizing univariate distributions. It plots histograms, kernel density estimates (KDE), and rug plots, and it can also fit parametric distributions such as the normal and exponential. In recent Seaborn releases, distplot has been superseded by displot( ) and histplot( ) (the documentation link below points to displot).
https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot
2. countplot( ): The countplot function creates a bar chart that displays the frequency of each category in a categorical dataset. This function is useful for understanding the distribution of categorical data.
https://seaborn.pydata.org/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot
3. jointplot( ): The jointplot function creates a scatterplot of two variables with bivariate and univariate representations of the distributions of the variables. This function can also display a hexbin plot, a kde plot, and fit a regression line.
https://seaborn.pydata.org/generated/seaborn.jointplot.html?highlight=jointplot#seaborn.jointplot
4. boxplot( ): A box plot is a standardized way of displaying the distribution of a dataset based on the five-number summary (minimum, first quartile, median, third quartile, and maximum). Seaborn also offers violin plots and swarm plots, which give more information about the distribution of the data.
https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot
5. barplot( ): The barplot function creates a bar chart that displays the mean of a continuous variable for each category in a categorical dataset. This function is useful for comparing the means of different groups.
https://seaborn.pydata.org/generated/seaborn.barplot.html?highlight=barplot#seaborn.barplot
6. lineplot( ): The lineplot function creates a line chart that displays the relationship between two continuous variables. This function is useful for visualizing trends in the data.
https://seaborn.pydata.org/generated/seaborn.lineplot.html?highlight=lineplot#seaborn.lineplot
7. pairplot( ): The pairplot function creates a matrix of scatterplots to visualize the relationships between multiple variables. It also creates histograms and KDE plots for each variable.
https://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot
8. violinplot( ): A violin plot is a combination of a box plot and a kernel density plot. It displays the distribution of the data, including the median, quartiles, and the density of the data.
https://seaborn.pydata.org/generated/seaborn.violinplot.html?highlight=violinplot#seaborn.violinplot
9. swarmplot( ): Swarmplot is similar to a violin plot but instead of showing the density of the data, it shows every single observation. This function is useful for visualizing the distribution of the data when the number of observations is not too large.
https://seaborn.pydata.org/generated/seaborn.swarmplot.html?highlight=swarm#seaborn.swarmplot
10. stripplot( ): The stripplot function creates a scatterplot where one variable is categorical. This function is useful for visualizing the relationship between a continuous variable and a categorical variable.
https://seaborn.pydata.org/generated/seaborn.stripplot.html?highlight=stripplot#seaborn.stripplot
11. lmplot( ): The lmplot function fits a linear regression model to a dataset and creates a scatterplot with the regression line. This function is useful for visualizing the relationship between two continuous variables and fitting a regression line to the data.
https://seaborn.pydata.org/generated/seaborn.lmplot.html?highlight=lmplot#seaborn.lmplot
12. heatmap( ): The heatmap function creates a color-encoded matrix representation of a dataset. This function is useful for visualizing the relationship between multiple variables.
https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap
13. PairGrid( ): The PairGrid class creates a matrix of subplots for plotting pairwise relationships in a dataset; pairplot( ) is built on top of it. It is useful for visualizing the relationships between multiple variables and comparing their distributions.
https://seaborn.pydata.org/generated/seaborn.PairGrid.html
14. FacetGrid( ): The FacetGrid class is used to create multiple plots in the same figure, where each plot shows the same relationship conditioned on different levels of a categorical variable. It is useful for exploring the relationship between variables across categories.
https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid
15. regplot( ): The regplot function creates a scatterplot and fits a simple linear regression model to the data. It is useful for visualizing the relationship between two continuous variables together with the fitted regression line.
https://seaborn.pydata.org/generated/seaborn.regplot.html
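The sketch below shows a few of the functions listed above in use, assuming Seaborn's built-in "tips" sample dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                        # built-in sample dataset

sns.countplot(data=tips, x="day")                      # frequency of each category
plt.show()

sns.boxplot(data=tips, x="day", y="total_bill")        # distribution per category
plt.show()

sns.lmplot(data=tips, x="total_bill", y="tip")         # scatter plot with a fitted regression line
plt.show()

sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)   # correlation heatmap
plt.show()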
These functions cover a wide range of data visualization needs, from univariate distributions to bivariate relationships, and they provide a rich set of tools for visualizing and analyzing data with Seaborn.
The high-level interface and attractive default styles make it easy to create informative and attractive graphics.
Thank you for taking the time to read it.