Setting up Spark development environment


Now that you know what Spark is for, we’ll show you how to set up and test a Spark development environment on Windows, Linux (Ubuntu), and macOS. Whichever of these operating systems you use, this article should give you everything you need to start developing Spark applications.

What is a Spark development environment?

The development environment is an installation of Apache Spark and other related components on your local computer that you can use for developing and testing Spark applications prior to deploying them to a production environment.

Spark provides support for Python, Java, Scala, and R. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so all you need to run Spark is a Java installation. If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later). If you want to use R, you will need an R installation on your machine.
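
Before going any further, it is worth checking that these prerequisites are actually in place. A quick sanity check from a terminal (assuming java and, optionally, python are on your PATH):

$ java -version     # Spark 2.x runs on Java 8
$ python --version  # only needed for the Python API

If either command fails, install the missing runtime before continuing.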

What are the Spark setup options?

The options for getting started with Spark are:

  • Installing manually, either by downloading a pre-built package or by building from source
  • Downloading a quick start VM or a distribution
  • Running Spark in the cloud

I’ll explain each of these options below.

Installing Manually

Downloading Spark Locally

To download and run Spark locally, first make sure that you have Java installed on your machine, as well as Python if you want to use the Python API. Next, visit the project’s official download page, select the package type “Pre-built for Apache Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract.
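
As a rough sketch, the extract-and-run steps at the console look like this (the file name below assumes the Spark 2.4.0 / Hadoop 2.7 package; substitute the version you actually downloaded):

$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ cd spark-2.4.0-bin-hadoop2.7
$ export SPARK_HOME=$(pwd)           # optional: lets other tools find Spark
$ export PATH=$SPARK_HOME/bin:$PATH  # optional: run pyspark/spark-shell from anywhere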

Building Spark from Source

You can also build and configure Spark from source. Download a Spark source package from GitHub to get just the source, and follow the instructions in the README file for building.
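
As a sketch, a Maven-based build from a source checkout looks like the following; the README in the repository remains the authoritative reference, and the build can take quite a while:

$ git clone https://github.com/apache/spark.git
$ cd spark
$ ./build/mvn -DskipTests clean package   # uses the Maven wrapper bundled with Spark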

Installation/Configuration Steps

If you choose to install Spark manually, I suggest using Vagrant, which provides an isolated environment on your host OS and prevents the host OS from getting corrupted.
The detailed steps are available on my GitHub page.
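
As a minimal sketch of that workflow (the ubuntu/bionic64 box is just an example; VirtualBox or another provider must already be installed):

$ vagrant init ubuntu/bionic64  # generates a Vagrantfile for the box
$ vagrant up                    # downloads the box and boots the VM
$ vagrant ssh                   # log in, then install Java and Spark inside the VM
$ vagrant destroy               # tear the environment down when finished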

Downloading a Quick Start VM or Distribution

You can download a quick start VM from Hortonworks or Cloudera. Once downloaded, you need VMware or Oracle VirtualBox to host the image. These VMs are pre-configured, so you don’t have to do any additional installation or configuration.
For full distributions you can choose Hortonworks, Cloudera, or MapR.

Running Spark in the Cloud

Databricks offers a free Community Edition of its cloud service as a learning environment. You just need to sign up and follow the steps.

Testing the installation

  • Type pyspark at the console and you will be dropped into the Spark shell for Python:
>>> print("Hello World!")
Hello World!
  • Type spark-shell at the console and you will be dropped into the Spark shell for Scala:
scala> println("Hello World!")
Hello World!
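
You can also verify that batch submission works with spark-submit, which runs a complete application rather than an interactive shell. A minimal sketch, assuming you are in the Spark installation directory and using the pi.py example bundled with Spark:

$ ./bin/spark-submit examples/src/main/python/pi.py 10

This runs the example locally and prints an estimate of pi, confirming that spark-submit can locate your installation and execute an application.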

Summary

You now have a Spark development environment up and running on your computer.

In the testing section you also briefly saw how to run a Spark application using spark-submit. In the next article we expand on this process, building a simple but complete data cleaning application.
