With the emergence and increasing popularity of DAG execution engines like Spark, one of the most common questions our clients want answered is, “What is the future of MapReduce as an execution engine for Hadoop?” Many mainstream production applications use MapReduce as their foundation, and companies have made huge investments in developing Mappers, Reducers, custom readers, and other MapReduce components. So they are rightly concerned that they may need to reinvent the wheel while trying not to miss out on the promise of real-time Big Data processing with Spark.
Relax. Hadoop is not going anywhere. Or maybe it is…
When our clients express concern about Hadoop’s longevity, we assure them there is no need to worry. Their biggest investment is in creating and maintaining the Hadoop cluster, which won’t be obsolete anytime soon. Jade Global may have to upgrade, patch, and add new components to it, but hey, that’s what we are here for! Newly popular execution engines (like Spark), visualization tools (like Datameer), and interpreters are not meant to replace the core of Hadoop, which is HDFS. In fact, they are designed to work with HDFS more efficiently than MapReduce does. With so many open source projects, a strong community, and support from giant companies, Hadoop still has a few more years to thrive.
Okay, I get it. But is MapReduce Dead?
There are two reasons MapReduce will not be obsolete in the short term, meaning a couple of years down the line. First, huge investments have already been made by many companies developing MapReduce applications. Companies are not simply going to throw away all their hard work and start writing for other execution engines. If the applications work, they will keep them (and most of them do work!). The second, much less talked-about reason MapReduce will stay around is that there are still some use cases where it is the best tool available. It is still hard to find another tool that beats MapReduce’s maturity and throughput for non-iterative batch processing, not to mention its handling of complex data structures and logical flows.
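To make that non-iterative batch model concrete, here is a minimal, dependency-free Python sketch of the map, shuffle, and reduce phases using the classic word-count example. The function names are purely illustrative, not part of any Hadoop API; a real job would implement `Mapper` and `Reducer` classes and let the framework handle the shuffle across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data pipelines move data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

Each phase makes a single pass over its input, which is exactly why the model shines for one-shot batch jobs and struggles with iterative workloads, where Spark can keep intermediate data in memory.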
In the long run, Jade Global anticipates that the use of MapReduce will be very minimal. Based on our experience working with MapReduce, Spark, and Impala, we think the use of MapReduce will continue to decline in favor of other frameworks and platforms. We anticipate query engines like Impala replacing MapReduce-based tools like Pig and Hive. Projects like Sqoop, Flume, Pig, and Hive may also be reworked to use Spark under the hood instead of MapReduce. As mentioned earlier, Jade expects Spark to gain popularity over MapReduce for the majority of cases in the long run. Because Spark is better at utilizing cluster memory, it will keep getting better at real-time performance. Below is a small recommendation table Jade Global has put together to help clients determine whether they should stay with MapReduce or not. It is very general-purpose and high-level, so feel free to reach out to us if you want to discuss your own unique case!
Roadmap for Transition?
If you decide to phase out MapReduce completely or partially from your cluster, the roadmap for transitioning to other engines is not as hard as most people think. Jade has already executed smooth transitions for a handful of clients with zero downtime. The basic steps we follow during a transition from the MapReduce/Pig/Hive combo to Spark or Impala are:
- Inventory existing code, configurations, and other components.
- Identify the re-usability of components. Most Java libraries can be used directly in Spark without any changes, and most Hive SQL queries/scripts can be used on Impala with small changes.
- Iteratively replace MapReduce/Pig/Hive jobs after cycles of automated testing and bug fixes.
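The automated testing in the last step often boils down to a parity check: the replacement Spark or Impala job must produce the same results as the legacy MapReduce job before it goes live. Here is a minimal sketch of such a check in plain Python, assuming both jobs emit key/value results that can be loaded into dictionaries; `results_match` is a hypothetical helper for illustration, not a Jade tool.

```python
def results_match(legacy_output, new_output, tolerance=1e-9):
    """Compare the legacy job's output with the replacement job's output,
    key by key. Returns a list of mismatch descriptions; an empty list
    means the migration preserved the results."""
    mismatches = []
    for key in sorted(set(legacy_output) | set(new_output)):
        if key not in legacy_output:
            mismatches.append(f"{key}: only in new output")
        elif key not in new_output:
            mismatches.append(f"{key}: only in legacy output")
        elif abs(legacy_output[key] - new_output[key]) > tolerance:
            mismatches.append(f"{key}: {legacy_output[key]} != {new_output[key]}")
    return mismatches

# Hypothetical monthly aggregates from the old and new pipelines.
legacy = {"2023-01": 120, "2023-02": 95}
migrated = {"2023-01": 120, "2023-02": 94}
print(results_match(legacy, migrated))  # ['2023-02: 95 != 94']
```

Running such a comparison on every iteration lets you cut over one job at a time with confidence, which is what makes a zero-downtime transition practical.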
So, the moral of the story is that you don’t need to panic! Jade has you covered. Just contact Jade if you are currently using MapReduce, considering it, or planning to switch.