While you can use a public cloud to set up Hadoop/Spark and develop your own Hadoop/Spark programs, it is strongly recommended that you set up a local environment to develop and test your programs before moving to the cloud. Cloud usage time is rounded to hours, so testing there can be costly.
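To see why hourly rounding makes short test runs expensive, here is a rough Python sketch. It assumes usage is rounded up to whole hours for each separate run; the hourly rate and the run lengths are hypothetical numbers chosen for illustration, not actual cloud prices.

```python
import math

def billed_hours(run_minutes):
    """Round one run up to whole hours (assumed billing rule)."""
    return math.ceil(run_minutes / 60)

def billed_cost(run_minutes_per_test, hourly_rate):
    """Total cost when every run is billed separately in whole hours."""
    return sum(billed_hours(m) for m in run_minutes_per_test) * hourly_rate

# Hypothetical: five 10-minute test runs on a cluster that costs 2.00 per hour.
runs = [10, 10, 10, 10, 10]
print(billed_cost(runs, hourly_rate=2.00))  # 50 minutes of work billed as 5 hours
```

Under these assumptions, 50 minutes of actual testing is charged as five full hours, which is the kind of overhead that local testing avoids.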
Hadoop Tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Spark quick start: https://spark.apache.org/docs/latest/quick-start.html
Example Java code: WordCount.java, WordCount2.java
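Before running a word-count job on a cluster, it can help to check the expected output on a small input locally. The sketch below is plain Python and is not the course's WordCount.java; it only mimics the map step (emit a count of 1 per word) and the reduce step (sum the counts per word). The function names and the sample input are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Sum the counts for each word, as a word-count reducer would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    """Run the map step over all lines, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(pairs)

sample = ["hello world", "hello hadoop"]
print(word_count(sample))  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

Comparing a small local result like this against the output of your Hadoop or Spark job is a cheap sanity check before using cloud time.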
Windows:
Set up Hadoop on Windows:
https://medium.com/analytics-vidhya/hadoop-how-to-install-in-5-steps-in-windows-10-61b0e67342f8
Set up Spark on Windows:
https://phoenixnap.com/kb/install-spark-on-windows-10
Linux (native or VirtualBox/VMware Fusion):
VirtualBox download (may not work well on Macs with M2 chips)
Hadoop installation on a single node on Linux:
https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster
Spark installation on single node on Linux:
https://data-flair.training/blogs/spark-installation-standalone-mode/
https://medium.com/@sandeepsinh/spark-installation-on-single-node-7e4cf4514c29
macOS:
Single Node/Pseudo Distributed Hadoop Cluster on macOS:
Installing Hadoop on MacOS (M1/M2)
Apache Spark on a Single Node/Pseudo Distributed Cluster in macOS:
Setting Up Apache Spark (macOS): A Comprehensive Guide
We have Google Cloud credit approved for this class. There are two ways to set up your account to use the credit. Note that the setup depends on which Gmail account is used, so we suggest using incognito mode in Chrome, or a private window in Firefox or Safari, to make sure you log into the right Gmail account.
1. Use your cs.stonybrook.edu email (safe option if you have a CS account).
Please first use the Student Coupon Retrieval Link to make a request for your CS email account. You will be asked for a name and email address (CS email). A confirmation email will be sent to your CS email with a coupon code. After you click on the verification link, a credit code will be sent to your email. It is safe to use “Click [here] to redeem” to apply the credit to your CS email account.
2. Stony Brook email (you can get the credit, but you have to use your personal Gmail to apply the credit and use GCP).
Our university is a Google Workspace for Education school, and the university's Google Workspace for Education administrator has not turned on the Google Developer Console within the Google Workspace for Education console. I submitted an IT ticket asking the university to turn it on, but that may take a long time. Instead, you can retrieve the coupon code using your stonybrook.edu email and redeem it with your personal Gmail account.
Please first use the Student Coupon Retrieval Link to make a request for your stonybrook.edu email account. You will be asked for a name and email address (stonybrook.edu email). A confirmation email will be sent to you with a coupon code. After you click on the verification link, a credit code will be sent to your email. Please DO NOT use the “Click [here] to redeem” link.
Instead, first log in with your personal Gmail account. Once logged in to Gmail, navigate to console.cloud.google.com/education, then copy and paste the coupon code to redeem it.
Please also accept the terms after logging in. You will then be able to create your first project.
Please let me know if you have any further questions.
Please follow the screenshots on Google Drive (sort the files by name) to see how to start a GCP cluster. Instructions on using Python are also available.
Additional documentation:
How to use Spark & Hadoop in GCP: https://medium.com/codex/how-to-use-spark-hadoop-in-gcp-8620ed3e35bd (PDF)
Create a Hadoop cluster: https://cloud.google.com/bigtable/docs/creating-hadoop-cluster