Now that you know what Spark is for, we’ll show you how to set up and test a Spark development environment on Windows, Linux (Ubuntu), and macOS. Whichever of these common operating systems you are using, this article should give you what you need to start developing Spark applications.
What is a Spark development environment?
The development environment is an installation of Apache Spark and other related components on your local computer that you can use for developing and testing Spark applications prior to deploying them to a production environment.
Spark provides support for Python, Java, Scala, and R. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so to run Spark all you need is an installation of Java. If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later). If you want to use R, you will need an installation of R on your machine.
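A quick way to confirm the prerequisites is to check the versions from a terminal (this assumes java and python are already on your PATH; the exact version strings will differ on your machine):

$ java -version      # Spark itself only needs a working JVM
$ python --version   # only needed if you plan to use the Python API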
What are the Spark setup options?
The options for getting started with Spark are:
- Downloading and installing Apache Spark on your laptop.
- Downloading a quick start VM or distribution.
- Running a web-based version in Databricks Community Edition, a free cloud environment.
I’ll explain each of these options below.
Downloading Spark Locally
To download and run Spark locally, the first step is to make sure that you have Java installed on your machine, as well as a Python version if you would like to use Python. Next, visit the project’s official download page, select the package type of “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract.
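As a rough sketch, assuming you downloaded the package pre-built for Hadoop 2.7 (substitute the file name of the version you actually downloaded), extracting it looks like this:

$ tar -xzf spark-2.2.0-bin-hadoop2.7.tgz     # extract the downloaded tarball
$ cd spark-2.2.0-bin-hadoop2.7
$ ls bin/     # pyspark, spark-shell and spark-submit live in this directory

Adding that bin directory to your PATH makes the pyspark and spark-shell commands used in the testing section below available from any location.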
Building Spark from Source
If you choose to build and install Spark manually, I suggest using Vagrant, which provides an isolated environment and prevents the host OS from getting cluttered or corrupted.
The detailed steps are available on my GitHub Page.
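As a minimal sketch of that workflow (the box name below is only an example; the exact Vagrantfile and build steps are in the repository):

$ vagrant init ubuntu/xenial64   # generate a Vagrantfile for an example Ubuntu box
$ vagrant up                     # download the box and boot the isolated VM
$ vagrant ssh                    # log in and build/install Spark inside the VM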
Downloading the quick start VM or Distribution
You can download a quick start VM from Hortonworks or Cloudera. Once downloaded, you need VMware or Oracle VirtualBox to host the image. These VMs are pre-configured, so you don’t have to do any additional installation or configuration.
For distributions you can choose Hortonworks, Cloudera, or MapR.
Running Spark in the Cloud
Databricks offers a free Community Edition of its cloud service as a learning environment. You just need to sign up and follow the steps.
Testing the Installation
Run pyspark on the console and you will be logged into the Spark shell for Python:

>>> print("Hello World!")
Hello World!
Run spark-shell on the console and you will be logged into the Spark shell for Scala:

scala> println("Hello World!")
Hello World!
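To try spark-submit as well, you can run one of the example programs bundled with the pre-built package (the path below assumes you are inside the extracted Spark directory):

$ bin/spark-submit examples/src/main/python/pi.py 10

Somewhere in the output you should see a line similar to “Pi is roughly 3.14”, which confirms that applications submitted through spark-submit run correctly.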
You now have a Spark development environment up and running on your computer.
In the testing section you also briefly saw how to run a Spark application using spark-submit. In the next article we will expand on this process, building a simple but complete data cleaning application.