INTRODUCTION
How did we get here?
To really understand Spark, you first have to understand the problem of processing data that cannot fit into a single computer's memory.
Say, for example, you need to sort a 10GB file of integers on a machine with only a 32-bit address space. In the simplest case, a 32-bit machine caps each process at a 4GB memory limit.
The solution for handling something larger than that limit is not complicated, just a little cheeky: break the file into chunks that fit into memory, set up your processing as multiple stages that work on those smaller chunks, and add a final stage that stitches the intermediate results into the final output.
So, for our specific example:
- Stage 1: break the input file into 10 parts, iterate over each part, sort it in memory, and write it back to disk.
- Stage 2: read the top record of each intermediate result file and write the smallest one to the final results file. Repeat this until there are no numbers left in the 10 intermediate sorted files. (Both stages are sketched in code below.)
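To make the two stages concrete, here is a minimal, single-machine sketch of this external merge sort in Python. The file names, the one-integer-per-line format, and the `chunk_size` of a million records are illustrative assumptions, not part of the original example.

```python
import heapq
import os
import tempfile


def read_ints(f):
    """Yield integers from a file object, one per line."""
    for line in f:
        yield int(line)


def sort_chunks(input_path, chunk_size=1_000_000):
    """Stage 1: read fixed-size chunks, sort each in memory, spill to disk."""
    chunk_paths = []
    with open(input_path) as f:
        while True:
            # zip() against the file iterator reads at most chunk_size lines
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".sorted")
            with os.fdopen(fd, "w") as out:
                out.writelines(f"{n}\n" for n in chunk)
            chunk_paths.append(path)
    return chunk_paths


def merge_chunks(chunk_paths, output_path):
    """Stage 2: stream a k-way merge, always writing the smallest head record."""
    files = [open(p) for p in chunk_paths]
    try:
        with open(output_path, "w") as out:
            for n in heapq.merge(*(read_ints(f) for f in files)):
                out.write(f"{n}\n")
    finally:
        for f in files:
            f.close()


chunk_files = sort_chunks("numbers.txt")          # Stage 1
merge_chunks(chunk_files, "numbers_sorted.txt")   # Stage 2
```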
This works, but the Stage 1 processing of the 10 chunks is done serially, even though none of the iterations depends on the others. If we could add 9 more machines to process these chunks in parallel, we could cut Stage 1 time to roughly a tenth. Great! But to automate this, we would have to build some kind of distributed process that runs on all 10 machines and coordinates the stages.
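As a hedged sketch of what "processing the chunks in parallel" could look like, the snippet below uses local worker processes via `multiprocessing.Pool` to stand in for the 10 machines; the pre-split chunk file names are assumptions. Coordinating actual machines, rather than processes on one box, is exactly the job of the distributed frameworks introduced next.

```python
from multiprocessing import Pool


def sort_one_chunk(chunk_path):
    """Sort a single pre-split chunk file in place and return its path."""
    with open(chunk_path) as f:
        numbers = sorted(int(line) for line in f)
    with open(chunk_path, "w") as f:
        f.writelines(f"{n}\n" for n in numbers)
    return chunk_path


if __name__ == "__main__":
    # Hypothetical pre-split chunk files; on a real cluster each one would
    # live on (and be sorted by) a different machine.
    chunk_paths = [f"chunk_{i}.txt" for i in range(10)]
    with Pool(processes=10) as pool:
        sorted_paths = pool.map(sort_one_chunk, chunk_paths)  # Stage 1, in parallel
```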
Well - enter MapReduce and Bigtable. When Google published its two, now famous, white papers on MapReduce and Bigtable, it changed the way we look at data:
- MapReduce: Simplified Data Processing on Large Clusters
- Bigtable: A Distributed Storage System for Structured Data
Hadoop is a natural formalization of these ideas. Hadoop was great. Changed the world and all. Literally.
But the information world quickly pushed Hadoop to its performance limits as data kept growing. Petabytes and exabytes became part of daily processing reality instead of being confined to the academic glossary. We needed something faster, and Spark quickly filled that space by ironing out a few of Hadoop's design limitations. Spark is an open-source cluster-computing framework, originally built for Mesos but designed to integrate well with Hadoop too. It also brought the unique value of offering APIs in popular data languages (Python, R) while promoting Scala natively.
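To close the loop on the sorting example: with Spark, the chunking, parallel sorting, and merging are handled by the framework. The snippet below is a hypothetical PySpark version; the HDFS paths and the one-integer-per-line format are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="distributed-sort")

(sc.textFile("hdfs:///data/numbers.txt")      # read; Spark splits it into partitions
   .map(int)                                  # parse one integer per line
   .sortBy(lambda n: n)                       # distributed sort across the cluster
   .saveAsTextFile("hdfs:///data/numbers_sorted"))

sc.stop()
```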