Apache Hadoop 3.0.0-alpha2-SNAPSHOT incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).
This is an alpha release to facilitate testing and the collection of feedback from downstream application developers and users. There are no guarantees regarding API stability or quality.
Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.
All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.
Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
More details are available in the HDFS Erasure Coding documentation.
We are introducing an early preview (alpha 1) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.
YARN Timeline Service v.2 alpha 1 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity. Most importantly, security is not enabled. Do not set up or use Timeline Service v.2 until security is implemented if security is a critical requirement.
More details are available in the YARN Timeline Service v.2 documentation.
The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.
Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.
More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.
MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.
See the release notes for MAPREDUCE-2841 for more detail.
The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.
However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.
Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.
These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.
Hadoop now supports integration with Microsoft Azure Data Lake as an alternative Hadoop-compatible filesystem.
A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.
This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.
A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.
HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.
MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.
The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.