This post is for people who want to install and run Apache Spark on a computer with Windows 10 (it may also help with earlier versions of Windows, or even Linux and Mac OS), and who want to try out and learn how to interact with the engine without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in this post a couple of years ago: Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.

Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce thanks to its in-memory computing capabilities. It is possible to write Spark applications in Java, Python, Scala and R, and it comes with built-in libraries for structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).

To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where we can find this file, please share it with us.

I struggled a little with this issue. After I set everything up, I tried to run spark-shell from the command line and got an error that was hard to debug. The shell tries to find the folder tmp/hive, and it was not able to set up the SQL Context.

I looked at my C drive and found that the C:\tmp\hive folder had been created. If it has not been, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options in the Sharing tab of the folder's properties, but I did it from the command line using winutils.

The winutils.exe file must be inside a bin folder inside the Hadoop home folder. In my case HADOOP_HOME points to C:\tools\WinUtils and all the binaries are inside C:\tools\WinUtils\bin.
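The layout described above can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes Hadoop looks for the binary at HADOOP_HOME\bin\winutils.exe, as the paragraph describes.

```python
import os
from pathlib import PureWindowsPath


def expected_winutils(hadoop_home):
    # Build <HADOOP_HOME>\bin\winutils.exe using Windows path rules.
    # (hypothetical helper name, not part of Spark or Hadoop)
    return PureWindowsPath(hadoop_home) / "bin" / "winutils.exe"


def winutils_present(env=os.environ):
    # Return True only if HADOOP_HOME is set and the binary exists there.
    home = env.get("HADOOP_HOME")
    if not home:
        return False
    return os.path.isfile(str(expected_winutils(home)))


# Example with the folder used in this post.
print(expected_winutils(r"C:\tools\WinUtils"))
print("winutils.exe found:", winutils_present())
```

Running it before starting spark-shell gives a quick hint whether HADOOP_HOME points one level too deep (at bin itself) or is not set at all.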

Can you please elaborate on how it affects the Spark functionality? I am very annoyed by this point: why can't it be placed in some other folder? I have Windows 10 and I don't have permission to install anything under C:\.

The blog has helped me a lot with the errors that occurred during installation, but I am still facing a problem when launching spark-shell after installing Spark on Windows.

Can anybody please help with a solution as soon as possible? Thanks in advance.

Never mind. I figured out that the problem was the Java installation location. I changed the installation directory from Program Files to another directory without spaces, and everything seems to work fine now. Thanks!
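The problem described in this comment can be caught before launching spark-shell by checking JAVA_HOME for spaces. The snippet below is just an illustrative sketch; the function name and the example paths are hypothetical and do not come from the thread.

```python
import os


def java_home_warning(env=os.environ):
    # Return a warning string if JAVA_HOME is missing or contains spaces,
    # otherwise None. (hypothetical helper for illustration)
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        return "JAVA_HOME is not set"
    if " " in java_home:
        return "JAVA_HOME contains spaces: " + java_home
    return None


# Check the current environment.
print(java_home_warning())
```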

I know that there is a very similar post to this one (Failed to locate the winutils binary in the hadoop binary path); however, I have tried every step that was suggested and the same error still appears.

If you are running Spark on Windows with Hadoop, you need to make sure your Windows Hadoop installation is set up properly. To run Spark you need to have winutils.exe and winutils.dll in the bin folder of your Hadoop home directory.

It seems winutils.exe is missing on your Windows machine. Can you try this:

1. Download winutils.exe from

2. Set your HADOOP_HOME environment variable at the OS level to the full path of the folder that contains the bin folder with winutils.exe.

Your question went into a thread that was over three years old. You would have a better chance of receiving a prompt and satisfactory resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.

The findspark Python module can be installed by running python -m pip install findspark in either the Windows command prompt or Git Bash, provided Python was installed in item 2. You can find the command prompt by searching for cmd in the search box.

In the same environment variable settings window, look for the Path or PATH variable, click Edit and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7 you need to separate the values in Path with a semicolon (;).
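To make the semicolon rule concrete, here is a small sketch of how a new folder is appended to a Path value. The helper name is hypothetical and the starting Path value is invented for the example; it only manipulates a string and does not change any system settings.

```python
def add_to_path(path_value, new_dir, sep=";"):
    # Split the existing value on the separator, dropping empty entries.
    entries = [entry for entry in path_value.split(sep) if entry]
    # Windows paths are case-insensitive, so compare lower-cased entries
    # to avoid adding the same folder twice.
    if new_dir.lower() not in (entry.lower() for entry in entries):
        entries.append(new_dir)
    return sep.join(entries)


# Invented starting value, plus the Spark bin folder from this guide.
current = r"C:\Windows;C:\Windows\System32"
spark_bin = r"D:\spark\spark-2.2.1-bin-hadoop2.7\bin"
print(add_to_path(current, spark_bin))
```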

To run Jupyter notebook, open Windows command prompt or Git Bash and run jupyter notebook. If you use Anaconda Navigator to open Jupyter Notebook instead, you might see a Java gateway process exited before sending the driver its port number error from PySpark in step C. Fall back to Windows cmd if it happens.

My system is throwing the following error when I try to start the NameNode for my latest Hadoop 2.2 version. My system did not find the winutils.exe file in my Hadoop bin folder. I tried the code below to fix the issue, but it hardly helped. Please help me sort this out.

Note: You can also use the Hive default scratch directory, which is c:\tmp\hive. In this case, you need to create the directory manually and call winutils.exe chmod -R 777 c:\tmp\hive to set up the correct permissions.
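The two steps in this note (create the directory, then run winutils.exe chmod) could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the location of winutils.exe has to be supplied by you, since it depends on where you placed the file.

```python
import os
import subprocess


def chmod_scratch_dir_command(winutils_exe, scratch_dir=r"c:\tmp\hive"):
    # Build the argument list for: winutils.exe chmod -R 777 <scratch_dir>
    return [winutils_exe, "chmod", "-R", "777", scratch_dir]


def prepare_scratch_dir(winutils_exe, scratch_dir=r"c:\tmp\hive"):
    # Create the directory if needed, then set its permissions.
    # Only meaningful on Windows with a real winutils.exe.
    os.makedirs(scratch_dir, exist_ok=True)
    subprocess.run(chmod_scratch_dir_command(winutils_exe, scratch_dir), check=True)


# Show the command without running it; the winutils path is an assumption.
print(chmod_scratch_dir_command(r"C:\tools\WinUtils\bin\winutils.exe"))
```

Calling prepare_scratch_dir with the real path to winutils.exe performs both steps; check=True makes a failed chmod raise an error instead of passing silently.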

This article explains how Databricks Connect works, walks you through the steps to get started with Databricks Connect, explains how to troubleshoot issues that may arise when using Databricks Connect, and describes the differences between running with Databricks Connect and running in a Databricks notebook.

For example, when you run the DataFrame command using Databricks Connect, the logical representation of the command is sent to the Spark server running in Databricks for execution on the remote cluster.

Run large-scale Spark jobs from any Python, R, Scala, or Java application. Anywhere you can import pyspark, require(SparkR) or import org.apache.spark, you can now run Spark jobs directly from your application, without needing to install any IDE plugins or use Spark submission scripts.

Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Java library dependencies in Databricks Connect, because client sessions are isolated from each other in the cluster.

Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs, and DataFrame objects defined in a notebook.

For Python development with SQL queries, Databricks recommends that you use the Databricks SQL Connector for Python instead of Databricks Connect. The Databricks SQL Connector for Python is easier to set up than Databricks Connect. Also, Databricks Connect parses and plans job runs on your local machine, while the jobs themselves run on remote compute resources. This can make it especially difficult to debug runtime errors. The Databricks SQL Connector for Python submits SQL queries directly to remote compute resources and fetches results.

You must install Python 3 on your development machine, and the minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster. The following table shows the Python version installed with each Databricks Runtime.
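The minor-version requirement can be checked locally with a short script. This is a minimal sketch: the helper names are hypothetical, and the cluster version in the example ("3.9") is only a placeholder, so look up the real value for your Databricks Runtime.

```python
import sys


def parse_major_minor(version):
    # "3.9.16" -> (3, 9); only the first two components matter here.
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def client_matches_cluster(cluster_python, client_python=None):
    # Compare major.minor of the client against the cluster.
    # Defaults to the interpreter that is running this script.
    if client_python is None:
        client = tuple(sys.version_info[:2])
    else:
        client = parse_major_minor(client_python)
    return client == parse_major_minor(cluster_python)


# "3.9" is a placeholder for the cluster's Python version.
print("client matches cluster:", client_matches_cluster("3.9"))
```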

Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. This can help to reduce the time spent resolving related technical issues.

The Databricks Connect major and minor package version must always match your Databricks Runtime version. Databricks recommends that you always use the most recent package of Databricks Connect that matches your Databricks Runtime version. For example, when you use a Databricks Runtime 12.2 LTS cluster, you must also use the databricks-connect==12.2.* package.
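As an illustration of the version-matching rule, the package specifier can be derived from the runtime version string. The helper below is hypothetical and only builds the string in the same form as the databricks-connect==12.2.* example above; it does not install anything.

```python
def databricks_connect_requirement(runtime_version):
    # Accepts strings such as "12.2" or "12.2 LTS" and keeps major.minor.
    numeric = runtime_version.split()[0]
    major, minor = numeric.split(".")[:2]
    return "databricks-connect=={}.{}.*".format(major, minor)


print(databricks_connect_requirement("12.2 LTS"))
```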

With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.
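If you prefer to check for an existing PySpark installation from Python rather than from the command line, the standard library's importlib.metadata (Python 3.8 or later) can report the installed version. Note that this is an alternative technique, not the show command referred to above, and the helper name is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    # Return the installed version string, or None if the package
    # is not installed in the active environment.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyspark")
if version is None:
    print("pyspark is not installed in this environment")
else:
    print("pyspark", version, "is installed and should be uninstalled first")
```

Run it with the virtual environment activated, so that it inspects the same environment you plan to install Databricks Connect into.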

With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.

Accept the license and supply configuration values. For Databricks Host and Databricks Token, enter the workspace URL and the personal access token you noted in Step 1.

SQL configs or environment variables. The following table shows the SQL config keys and the environment variables that correspond to the configuration properties you noted in Step 1. To set a SQL config key, use sql("set config=value"). For example: sql("set spark.databricks.service.clusterId=0304-201045-abcdefgh").
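When several keys have to be set, the statements can be generated from a dictionary. This is only a sketch: the helper name is hypothetical, and it builds strings in the set config=value form shown above without contacting a cluster. Passing each string to sql(...) assumes you already have a working Spark session.

```python
def set_statements(configs):
    # Turn {"key": "value"} pairs into "set key=value" strings.
    return ["set {}={}".format(key, value) for key, value in configs.items()]


# The cluster ID is the example value from the text above.
statements = set_statements(
    {"spark.databricks.service.clusterId": "0304-201045-abcdefgh"}
)
for statement in statements:
    print(statement)  # pass each one to sql(...) on an existing session
```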

To shut down JupyterLab, click File > Shut Down. If the JupyterLab process is still running in your terminal or Command Prompt, stop this process by pressing Ctrl + c and then entering y to confirm.

If running with a virtual environment, which is the recommended way to develop for Python in VS Code, in the Command Palette type select python interpreter and point to your environment that matches your cluster Python version.

Configure the Spark lib path and Spark home by adding them to the top of your R script. Set to the directory where you unpacked the open source Spark package in step 1. Set to the Databricks Connect directory from step 2.

