Fast computers and cheaper memory have stimulated the rapid growth of a new way of doing data computation. During this time, parallel computation infrastructures have evolved from experimental in a lab to become everyday tools of data scientists who need to analyze and get insights from data. However, barriers to the widespread use of parallelism are still at least one of three common large subdivision of computing – hardware, algorithms and software.
Imagine old days when we should deal with all of those three at same time, from high speed intercommunication network switches, parallelizing sequential algorithms and various software stacks from compilers, libraries, frameworks and middleware. Many parallelism models have been introduced for decades, like data partitioning in old day FORTRAN or other SIMD machines, shared memory parallelisms and message passing (remember C/C++ MS-MPI cluster in old days). I am part of generation who faced the “Dark Age” period of distributed numerical computing. Now is much better, I hope.
Apache Spark – is fast and general-purpose cluster computing system. Spark promises to make our life easier in writing distributed programs like other normal programs by abstracting away the “nitty-gritty” details of distributed systems – like my previous experiences with message passing (MPI). I know it is too early to make prediction on the success of Spark, but I’m biased with my previous distributed system experiences and liked to continue that in this post ☺.
We all need speed in data computation. Imagine if your forecasting analytic on large business datasets requires a day to complete while your business people expected it to produce results in hours or minutes – nowcasting vs forecasting. On the speed side, Spark extended the MapReduce model to supports more types of computations like batch, iterative/recursive algorithms, interactive queries and micro-batch streaming processing. Spark makes it easy and inexpensive (as price of CPU and GPU become cheaper) to run those processing types and reduces the burdens of maintaining infrastructure, tools and frameworks. Spark is designed to be friendly for developers, offered language bindings to Python, Java, Scala, R (via SparkR) and SQL (SparkSQL), and of course shipped with ready to use libraries such as GraphX and MLLib. A growing supports from Deep Machine Learning practitioners are also happening, like H2O Sparkling Water, DL4J and Prediction.IO. It also integrated closely with other big data tools, like Hadoop, YARN, HBase, Cassandra, Mesos, Hive, etc. Spark ecosystem is growing very fast.
Spark started in 2009 as a research project in UC Berkeley RAD Lab (AMPLab). The researchers in AMPLab that previously work with Hadoop MapReduce found that MapReduce was inefficient for iterative and interactive computing jobs. You can refer to some research papers for better scientific proofs, or following a thriving OSS developer community around Spark, including famous startups like DataBricks.
Let me share my hacking experiences on Apache Spark. Spark is written in Scala and requires JVM to run. If you want to work with Python later, you may need to install Python package like Anaconda that combine all frameworks you need for scientific computing – include the famous Jupiter Notebook. I started by downloading Spark binary then later source codes to build on my Mac machine. A straightforward maven based compilation took sometime (~24 minutes) till I can run spark shell. But I was impatient; so during the compilation I just downloaded and used the binary version (now version 1.4.0) to test some commands. The good fact was I can use Spark without Hadoop, even in my single Mac machine to practice its basic principles. When Spark was ready in my machine, I just followed the README.md file (good habits of a geek) to test it, for example in Scala shell:
scala> sc.parallelize(1 to 1000).count()
or in Python shell:
Spark comes with several sample programs in the `examples` directory. To run one of them, I used `./bin/run-example <class> [params]`. Here for example for SparkPi:
First thing I learnt about Spark was to make custom driver program that launches various parallel operations on my single machine Spark instance. The driver program (we can write in Python, Scala, Java, and R), contains main function and defines distributed datasets on the cluster, then applies data transformation actions to them. Spark shells are obvious examples of driver programs that access Spark through a SparkContext object, which represents a connection to a Spark’s computing cluster. In the any shell, a SparkContext is predefined for us as sc object, like in above examples. Spark default distribution (now version 1.4.0) provides spark-shell (for Scala) and pyspark (for Python) for interactive computing with sc object.
Second thing I learnt was about Spark’s main abstractions for working with distributed data, the RDD (Resilient Distributed Dataset), a distributed immutable collection of objects. In a clustered environment, each RDD is split into multiple partitions, which may be computed on different nodes. Programming in Spark is expressed as either creating new RDDs from data sources, transforming existing RDDs, or perform actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across cluster and parallelizes the operations we want to perform on them.
RDDs can contain any type of Python, Java, Scala, or R (through SparkR) objects, including user- defined classes. Users can create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program. Once created, RDDs offer two types of operations: transformations and actions. Transformations construct a new RDD from a previous one. Spark context object directly provides a lot of functions to perform RDD transformations. Actions, on the other side, compute a result based on an RDD, and either returns it to the driver program or save it to external storage systems like HDFS, HBase, Cassandra, ElasticSearch etc.
For example – Python filtering of README.md file:
>>> lines = sc.textFile(“README.md”)
>>> pythonLines = lines.filter(lambda line: “Python” in line)
And Scala filtering version for the same file:
scala> val lines = sc.textFile(“README.md”)
scala> val pythonLines = lines.filter(line => line.contains(“Python”))
Finally, third thing I learnt was about a lazy fashion of Spark execution. Although we can define new RDDs any time, Spark computes them only in a lazy fashion—that is, the first time they are used in an action. Spark’s RDDs are by default recomputed each time we run an action. To reuse an RDD in multiple actions, we can ask Spark to persist data in a number of different places using RDD.persist(). After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible. The behavior of not persisting by default may again seem unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD, there’s no reason to waste storage space when Spark could instead stream through the data once and just compute the result. In real practice, we will often use persist() to load a subset of data into memory and query it repeatedly.
Example of persisting previous RDD in memory:
As this is just a quick intro to Spark, lot more to hack if you are curious. To learn more, read the official Spark programming guide. If you prefer MOOC style, I recommend eDX- BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark from DataBricks. Books can also help your learning curve, you can try these :
- Learning Spark: Lightning-Fast Big Data Analysis
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale
- Machine Learning with Spark
Lastly, hacking specific computation problems is always better way to learn. Good luck with your Spark hacking!