These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.
Change default namespace quota of root directory from Integer.MAX_VALUE to Long.MAX_VALUE.
Remove the deprecated FSDataOutputStream constructor, FSDataOutputStream.sync() and Syncable.sync().
Remove the deprecated DFSOutputStream.sync() method.
Documented that the "fs -getmerge" shell command may not work properly over non-HDFS filesystem implementations, since the ordering of file listings varies across platforms.
test-patch.sh adds a new option "--build-native". When set to false, native components are not built; when set to true, they are. The default value is true.
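For example, a quick sketch of a local run with the native build disabled (the patch file name is a placeholder):

```bash
# Run test-patch without building the native components.
dev-support/test-patch.sh --build-native=false HADOOP-XXXX.patch
```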
This change affects wire-compatibility of the NameNode/DataNode heartbeat protocol.
Support for hftp and hsftp has been removed. They have been superseded by webhdfs and swebhdfs.
Deprecated and unused classes in the org.apache.hadoop.record package have been removed from hadoop-streaming.
Appends in HDFS can no longer be disabled.
The classes in org.apache.hadoop.record are moved from hadoop-common to a new hadoop-streaming artifact within the hadoop-tools module.
fsck no longer prints dots for progress reporting by default. To print the dots, specify the '-showprogress' option.
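For example:

```bash
# Re-enable dot-based progress output while checking the whole namespace.
hdfs fsck / -showprogress
```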
The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.
This changes the output of the 'hadoop version' command to say 'Source code repository' generically, rather than naming the type of repository.
Fixes a typo: if a configuration is set programmatically, the source of the configuration is now recorded as 'programmatically' instead of 'programatically'.
Adds a native implementation of the map output collector. The native library will build automatically with -Pnative. Users may choose the new collector on a job-by-job basis by setting mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in their job configuration. For shuffle-intensive jobs this may provide speed-ups of 30% or more.
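For example, opting in from the command line (the jar and class names are placeholders, and the job is assumed to use GenericOptionsParser so -D options apply):

```bash
# Enable the native map output collector for this job only.
hadoop jar my-job.jar MyJob \
  -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
  /data/input /data/output
```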
org.apache.hadoop.fs.permission.AccessControlException was deprecated in the last major release, and has been removed in favor of org.apache.hadoop.security.AccessControlException.
The HADOOP_HEAPSIZE variable has been deprecated (it will still be honored if set, but expect it to go away in the future). In its place, HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN have been introduced to set Xmx and Xms, respectively.
The internal variable JAVA_HEAP_MAX has been removed.
Default heap sizes have been removed. This allows the JVM to auto-tune based upon the memory size of the host. To re-enable the old default, configure HADOOP_HEAPSIZE_MAX="1g" in hadoop-env.sh.
All global and daemon-specific heap size variables now support units. If a variable is only a number, the size is assumed to be in megabytes; see the example below.
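For instance, in hadoop-env.sh (the values are illustrative):

```bash
# Explicit JVM heap bounds; units are honored, bare numbers mean megabytes.
export HADOOP_HEAPSIZE_MAX=4g    # becomes -Xmx4g
export HADOOP_HEAPSIZE_MIN=1g    # becomes -Xms1g
```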
.hadooprc gives users a convenient way to set and/or override shell-level settings.
If the mapreduce.map.memory.mb / mapreduce.reduce.memory.mb keys are left at their default value of -1, they will now be automatically inferred from the heap size (-Xmx) specified in the mapreduce.map.java.opts / mapreduce.reduce.java.opts keys.
The converse is also done: if mapreduce.map.memory.mb / mapreduce.reduce.memory.mb values are specified but no -Xmx is supplied in the java.opts keys, the -Xmx value is derived from them.
If neither is specified, a default value of 1024 MB is used.
Both conversions apply a scaling factor, given by the mapreduce.job.heap.memory-mb.ratio property, to account for the overhead between heap usage and actual physical memory usage; see the sketch below.
Existing configs or job code that already specify both sets of properties explicitly are not affected by this inference.
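A hedged illustration (the jar, class, and the 0.8 ratio value are assumptions for the example; the job is assumed to use GenericOptionsParser so -D options apply):

```bash
# Only the map JVM heap is specified; mapreduce.map.memory.mb stays at -1
# and is inferred as Xmx / ratio. With -Xmx1638m and a ratio of 0.8, the
# container size works out to roughly 2048 MB.
hadoop jar my-job.jar MyJob \
  -Dmapreduce.map.java.opts=-Xmx1638m \
  -Dmapreduce.job.heap.memory-mb.ratio=0.8 \
  /data/input /data/output
```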
The user ‘yarn’ is no longer allowed to run tasks for security reasons.
The following environment variables have been deprecated in favor of their replacements:

Old | New |
---|---|
HADOOP_HDFS_LOG_DIR | HADOOP_LOG_DIR |
HADOOP_HDFS_LOGFILE | HADOOP_LOGFILE |
HADOOP_HDFS_NICENESS | HADOOP_NICENESS |
HADOOP_HDFS_STOP_TIMEOUT | HADOOP_STOP_TIMEOUT |
HADOOP_HDFS_PID_DIR | HADOOP_PID_DIR |
HADOOP_HDFS_ROOT_LOGGER | HADOOP_ROOT_LOGGER |
HADOOP_HDFS_IDENT_STRING | HADOOP_IDENT_STRING |
HADOOP_MAPRED_LOG_DIR | HADOOP_LOG_DIR |
HADOOP_MAPRED_LOGFILE | HADOOP_LOGFILE |
HADOOP_MAPRED_NICENESS | HADOOP_NICENESS |
HADOOP_MAPRED_STOP_TIMEOUT | HADOOP_STOP_TIMEOUT |
HADOOP_MAPRED_PID_DIR | HADOOP_PID_DIR |
HADOOP_MAPRED_ROOT_LOGGER | HADOOP_ROOT_LOGGER |
HADOOP_MAPRED_IDENT_STRING | HADOOP_IDENT_STRING |
YARN_CONF_DIR | HADOOP_CONF_DIR |
YARN_LOG_DIR | HADOOP_LOG_DIR |
YARN_LOGFILE | HADOOP_LOGFILE |
YARN_NICENESS | HADOOP_NICENESS |
YARN_STOP_TIMEOUT | HADOOP_STOP_TIMEOUT |
YARN_PID_DIR | HADOOP_PID_DIR |
YARN_ROOT_LOGGER | HADOOP_ROOT_LOGGER |
YARN_IDENT_STRING | HADOOP_IDENT_STRING |
YARN_OPTS | HADOOP_OPTS |
YARN_SLAVES | HADOOP_SLAVES |
YARN_USER_CLASSPATH | HADOOP_CLASSPATH |
YARN_USER_CLASSPATH_FIRST | HADOOP_USER_CLASSPATH_FIRST |
KMS_CONFIG | HADOOP_CONF_DIR |
KMS_LOG | HADOOP_LOG_DIR |
The hadoop kerbname subcommand has been added to ease the operational pain of determining the output of auth_to_local rules.
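For example (the principal is illustrative):

```bash
# Show the local user name the auth_to_local rules produce for a principal.
hadoop kerbname hdfs/nn01.example.com@EXAMPLE.COM
```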
This deprecates the following environment variables:
Old | New |
---|---|
HTTPFS_LOG | HADOOP_LOG_DIR |
HTTPFS_CONFIG | HADOOP_CONF_DIR |
Prior to this change, distcp had hard-coded values for memory usage. Now distcp will honor memory settings in a way compatible with the rest of MapReduce.
The output of du has now been made more Unix-like, with aligned output.
The "downgrade" option has been removed from the "namenode -rollingUpgrade" startup command, since it could incorrectly finalize an ongoing rolling upgrade.
The output format of hadoop fs -du has been changed. It shows not only the file size but also the raw disk usage including the replication factor.
Jars in the various subproject lib directories are now de-duplicated against Hadoop common. Users who interact directly with those directories must be sure to pull in common’s dependencies as well.
The "mapred job -list" command now displays the Job Name as well.
WebHDFS is mandatory and cannot be disabled.
Stopping the namenode on secure systems now requires the user be authenticated.
Python is now required to build the documentation.
Patches are now auto-downloaded from the issue ID; race conditions and a bug affecting some patches have been fixed.
The io.native.lib.available property has been removed. Native libraries are now always used if they are present.
The patch improves the reporting of missing and corrupted blocks.
A partitioner is now only created if there are multiple reducers.
Remove -finalize option from hdfs namenode command.
Users may need to pay special attention to this change when upgrading to this version. Previously, some APIs (for example, setReplication) could be called, incorrectly, even after the FileSystem object had been closed. With this change, the DFS client no longer allows any operation to be invoked on a closed FileSystem object. Since calling operations on a closed FileSystem was never correct, any such usage needs to be fixed.
Removed DistCpV1 and Logalyzer.
The FSConstants class has been deprecated since 0.23 and has been removed in this release.
mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.
In algorithm version 1:

* commitTask renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/.
* recoverTask renames $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/.
* commitJob merges every task output file in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/, then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS.

commitJob's run time, measured in RPC calls, is O(n) in the number of output files; this is discussed in MAPREDUCE-4815 and can take minutes.
Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob:

* commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/.
* recoverTask is, strictly speaking, a no-op; however, to support upgrading from version 1 to version 2, it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/.
* commitJob deletes $joboutput/_temporary and writes $joboutput/_SUCCESS.

Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window during which $joboutput may contain incomplete output is much larger, so pipeline logic that consumes job output should check for the existence of the _SUCCESS marker, as in the sketch below.
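A minimal sketch of pinning the committer algorithm per job and of gating a downstream step on the marker (the jar, class, and paths are illustrative; the job is assumed to use GenericOptionsParser so -D options apply):

```bash
# Explicitly select commit algorithm version 1 for one job, if the larger
# vulnerability window of version 2 is a concern.
hadoop jar my-job.jar MyJob \
  -Dmapreduce.fileoutputcommitter.algorithm.version=1 \
  /data/input /data/joboutput

# Downstream consumer: proceed only once the _SUCCESS marker exists.
if hdfs dfs -test -e /data/joboutput/_SUCCESS; then
  echo "job output is complete; safe to consume"
fi
```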
Removed consumption of the MAX_APP_ATTEMPTS_ENV environment variable.
The shell now reports a "Permission denied" error message when it is unable to read a local file for -put/copyFromLocal.
The ZooKeeper jar has been removed from the hadoop-client dependency tree.
Related to the decommission enhancements in HDFS-7411, this change removes the deprecated configuration key “dfs.namenode.decommission.nodes.per.interval” which has been subsumed by the configuration key “dfs.namenode.decommission.blocks.per.interval”.
This feature adds support for running additional standby NameNodes, which provides additional fault-tolerance. It is designed for a total of 3-5 NameNodes.
This removes the deprecated DistributedFileSystem#getFileBlockStorageLocations API used for getting VolumeIds of block replicas. Applications interested in the volume of a replica can instead consult BlockLocation#getStorageIds to obtain equivalent information.
The getSoftwareVersion method replaces the original getVersion method, which returned only the version string.
The new getVersion method returns both the version string and the revision string.
HDFS now provides native support for erasure coding (EC) to store data more efficiently. Each individual directory can be configured with an EC policy with the hdfs erasurecode -setPolicy command. When a file is created, it inherits the EC policy from its nearest ancestor directory to determine how its blocks are stored. Compared to 3-way replication, the default EC policy saves 50% of storage space while also tolerating more storage failures.
To support small files, the current phase of HDFS-EC stores blocks in a striped layout, where a logical file block is divided into small units (64KB by default) and distributed to a set of DataNodes. This enables parallel I/O but also decreases data locality. Therefore, the cluster environment and I/O workloads should be considered before configuring EC policies; see the sketch below.
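A minimal sketch, using the subcommand named above (the directory is illustrative, and the exact flags for selecting a specific policy have varied across releases):

```bash
# Mark a directory as erasure-coded; files created under it inherit the
# EC policy instead of using 3-way replication.
hdfs erasurecode -setPolicy /data/ec-warehouse
```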
The output of the "hdfs fetchdt --print" command now includes the token renewer, appended to the end of the existing token information. This change may be incompatible with tools that parse the output of the command.
In the extremely rare event that the HADOOP_USER_IDENT and USER environment variables are not defined, we now fall back to using 'hadoop' as the identification string.
When Hadoop JVMs create other processes on OS X, they will now always use posix_spawn.
The output of the fsck command for HDFS files that are being written has changed. When fsck is run against open files with -openforwrite and -files -blocks -locations, the output will include the block currently being written for replicated files, or the block group currently being written for erasure-coded files.
The preferred block size XML element has been corrected from “\<perferredBlockSize>” to “\<preferredBlockSize>”.
The following shell environment variables have been deprecated:
Old | New |
---|---|
DEFAULT_LIBEXEC_DIR | HADOOP_DEFAULT_LIBEXEC_DIR |
SLAVE_NAMES | HADOOP_SLAVE_NAMES |
TOOL_PATH | HADOOP_TOOLS_PATH |
Snapshots can be allowed/disallowed on a directory via WebHdfs from users with superuser privilege.
Support for the deprecated dfs.umask key has been removed in Hadoop 3.0.
SortedMapWritable has changed to SortedMapWritable<K extends WritableComparable<? super K>>, so users can now declare the class with a key type, such as SortedMapWritable<Text>.
The TotalFiles metric has been removed from FSNameSystem. Use FilesTotal instead.
If the hadoop.token.files property is defined and configured with one or more comma-delimited delegation token files, Hadoop will use those token files to connect to the services named in the tokens; see the sketch below.
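A hedged sketch, assuming the property can be supplied as a generic -D option (the token file path is illustrative):

```bash
# Authenticate to HDFS using a pre-fetched delegation token file rather
# than a Kerberos login.
hadoop fs -Dhadoop.token.files=/home/alice/nn.tokens -ls /
```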
The default of the 'mapreduce.jobhistory.jhist.format' property has changed from 'json' to 'binary'. This creates smaller binary Avro .jhist files for faster JobHistory Server performance.
This change contains the content of HADOOP-10115 which is an incompatible change.
Makes the getFileChecksum API work with striped-layout EC files. Checksum computation is done at the block level in a distributed fashion. The current API does not support comparing a checksum generated for a normal file with a checksum generated for the same file in striped layout.
DistCp in Hadoop 3.0 no longer supports the -mapredSSLConf option. Use the global ssl-client.xml configuration file for swebhdfs file systems instead.
On Unix platforms, HADOOP_PREFIX has been deprecated in favor of returning to HADOOP_HOME as in prior Apache Hadoop releases.
Removed FileUtil.copyMerge.
The default port for KMS service is now 9600. This is to avoid conflicts on the previous port 16000, which is also used by HMaster as the default port.
The patch updates the HDFS default HTTP/RPC ports to non-ephemeral ports. The changes are listed below:

Daemon | Old port | New port |
---|---|---|
NameNode | 50470 | 9871 |
NameNode | 50070 | 9870 |
NameNode | 8020 | 9820 |
Secondary NameNode | 50091 | 9869 |
Secondary NameNode | 50090 | 9868 |
DataNode | 50020 | 9867 |
DataNode | 50010 | 9866 |
DataNode | 50475 | 9865 |
DataNode | 50075 | 9864 |
This patch attempts to allocate all replicas to remote DataNodes by adding the local DataNode to the excluded DataNodes. If sufficient replicas cannot be obtained, it falls back to the default block placement policy, which writes one replica to the local DataNode.
This feature introduces a new command called “hadoop dtutil” which lets users request and download delegation tokens with certain attributes.
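A rough sketch of the intended workflow (the filesystem URL and file names are illustrative):

```bash
# Request a delegation token from a service and save it to a local file,
# then print the token's attributes.
hadoop dtutil get hdfs://nn01.example.com:9820 tokens.dt
hadoop dtutil print tokens.dt
```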
With this change, the .hadooprc file is now processed after Apache Hadoop has been fully bootstrapped. This allows for usage of the Apache Hadoop Shell API. A new file, .hadoop-env, now provides the ability for end users to override hadoop-env.sh.
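For instance, a user-level ~/.hadooprc can now call Shell API functions such as hadoop_add_classpath (the jar path is illustrative):

```bash
# ~/.hadooprc -- processed after bootstrap, so the shell API is available.
hadoop_add_classpath /opt/site/lib/extra-auth.jar
```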
LocalJobRunnerMetrics and ShuffleClientMetrics were updated to use Hadoop Metrics V2 framework.
The output of “hdfs oev -p stats” has changed. The option prints 0 instead of null for the count of the operations that have never been executed.
Remove invisible synchronization primitives from DataInputBuffer.
It is now possible to add or modify the behavior of existing subcommands in the hadoop, hdfs, mapred, and yarn scripts. See the Unix Shell Guide for more information.
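A sketch of the mechanism described in the Unix Shell Guide (the subcommand name and body are illustrative):

```bash
# Defining this function (e.g. in ~/.hadooprc) adds a new subcommand,
# invoked as: hadoop hello
function hadoop_subcommand_hello
{
  echo "hello from a user-defined subcommand"
  exit 0
}
```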
Two new configuration keys, "dfs.namenode.lease-recheck-interval-ms" and "dfs.namenode.max-lock-hold-to-release-lease-ms", have been added to fine-tune the duty cycle with which the NameNode recovers old leases.
The hadoop-ant module in hadoop-tools has been removed.
Hadoop now supports integration with Azure Data Lake as an alternative Hadoop-compatible file system. Please refer to the Hadoop site documentation of Azure Data Lake for details on usage and configuration.
Adds support for Azure Active Directory tokens using a client ID and keys.
Jersey and its related libraries have been upgraded. After the upgrade from Jersey 1.12 to 1.13, a root element whose content is an empty collection is rendered as an empty object ({}) instead of null.
The ‘slaves’ file has been deprecated in favor of the ‘workers’ file and, other than the deprecation warnings, all references to slavery have been removed from the source tree.
The default permissions of files and directories created via WebHDFS and HttpFS are now 644 and 755 respectively. See HDFS-10488 for related discussion.
The rcc command has been removed. See HADOOP-12485 where unused Hadoop Streaming classes were removed.
The s3 file system has been removed. The s3a file system should be used instead.
Upgraded the following dependencies:

* Guice from 3.0 to 4.0
* cglib from 2.2 to 3.2.0
* asm from 3.2 to 5.0.4
This removes the configuration property dfs.client.use.legacy.blockreader, since the legacy remote block reader class has been removed from the codebase.