
Using Python with Spark
Now, people who are already familiar with Spark might ask, "Why are you using Python in this book?" Well, I'm using Python as the scripting language I'll be working with, and there are some pretty good reasons, so I want to defend my choice a little bit. For one thing, it's a lot easier to get up and running with Python under Spark. You don't have to deal with dependencies, running Maven, or figuring out how to get JAR files where they need to be, and whatnot. With Python, you just type your code and run it; there's no compilation step, and that makes life a lot simpler.
In this book, I really want to focus on the concepts behind Spark: how to deal with RDDs, what you can do with them, and how you can put these different pieces together. I don't want to spend time on the mechanics of compiling and distributing JAR files, or on the Java and Scala tooling around that. Also, there's a better chance you're already familiar with Python than with Scala, which is a newer language, and like I said, Java is just a little bit more complicated.
However, I will say that Scala is the more popular choice with Spark, and that's because Spark itself is written in Scala. If you write Scala code, it's native to Spark; it doesn't need the extra layer that Python code goes through to run in Spark's JVM environment. I've never really run into problems with Python on Spark, but in theory, Scala should be a little bit faster and a little bit more reliable. The other thing to consider is that new features and libraries tend to come out in Scala before they appear in Spark's other languages. A good case in point is the GraphX library we just talked about: as of writing this book, it is only available in Scala, but support for GraphX in Python and Java is pretty far along, and by the time you're reading this, there's a good chance it will be available already. Similarly, Spark Streaming is, as of now, only partially implemented in Python, but it too is moving forward quickly, and I believe the three languages will reach feature parity pretty soon, so that shouldn't be too much of a concern going forward.
Another reason Python isn't a bad choice for this book is that Python code in Spark looks a lot like Scala code in Spark. Even though they're very different languages, within Spark they look very much the same, because both lean heavily on functional programming: where Scala uses its anonymous-function syntax, Python uses lambda functions to do the same thing. Aside from having to declare some variables as vals in Scala, where Python leaves them untyped, the code looks very similar in most cases; equivalent snippets in the two languages often look almost identical.
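To make that similarity concrete, here is a small hypothetical example (not from the original text): squaring a list of numbers with an anonymous function in each language. The Spark versions are shown as comments, since they assume a running SparkContext (`sc`); the plain-Python lines below them use the same lambda-based functional style and run on their own:

```python
# PySpark (assumes an existing SparkContext named sc):
#   nums = sc.parallelize([1, 2, 3, 4])
#   squared = nums.map(lambda x: x * x).collect()
#
# Scala Spark (the same thing, using Scala's => syntax):
#   val nums = sc.parallelize(List(1, 2, 3, 4))
#   val squared = nums.map(x => x * x).collect()

# The same functional style in plain Python, runnable without a cluster:
nums = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, nums))
print(squared)  # [1, 4, 9, 16]
```

Notice that the two Spark snippets differ mainly in surface syntax (`lambda x:` versus `x =>`, and Scala's `val` declarations); the structure of the code is the same.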

So even if you learn Spark using Python, it's not going to be hard to transfer that knowledge to Scala, if you need to later.
So that's Spark at a very high level: what the big deal is, why it's so fast, and why people are using it. It's pretty cool stuff, and it's very easy to use. Now let's come down from that high level and get a little more technical. We're going to take a closer look at how Spark works under the hood and at what the RDD object actually is.