These Hadoop tutorials assume that you have installed Cloudera QuickStart, which includes the Hadoop ecosystem components such as HDFS, Spark, Hive, HBase, YARN, etc.
See also: What is Hadoop & HDFS? | Hadoop-based data hub architecture & basics | Hadoop ecosystem basics Q&As.
List files in HDFS
The following Java code uses the Hadoop API to list files in HDFS.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/**
 * Similar to: hdfs dfs -ls hdfs://xx.xxx.xxx.xx:8020/user/someuser/test
 */
public class HadoopSimple {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ip-address:8020/");

        String hdfsPath = "/user/someuser/some_folder_path";

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, true); // true means recursive

        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println(file);
        }

        fs.close();
    }
}
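Each LocatedFileStatus carries the file's metadata, so if you want output closer to "hdfs dfs -ls" you can print selected fields instead of the whole object. A minimal sketch of an alternative loop body (the choice of fields here is illustrative, not from the original code):

while (files.hasNext()) {
    LocatedFileStatus file = files.next();
    // permissions, size in bytes and the full hdfs:// path, roughly like "hdfs dfs -ls"
    System.out.printf("%s %d %s%n", file.getPermission(), file.getLen(), file.getPath());
}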
You can use the "fs" handle to perform other operations on a file, for example:
fs.rename(src, dst);                      // rename a file
fs.copyFromLocalFile(src, dst);           // copy a local file into HDFS
fs.delete(f, recursive);                  // delete a file or folder
FSDataOutputStream os = fs.append(path);  // append to an existing file
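Putting those calls together, here is a minimal runnable sketch; the NameNode address and all of the paths below are placeholders, not from the original post:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopFileOps {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ip-address:8020/"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);

        // hypothetical paths used purely for illustration
        Path oldName = new Path("/user/someuser/test/old_name.txt");
        Path newName = new Path("/user/someuser/test/new_name.txt");

        fs.rename(oldName, newName);                                     // rename within HDFS
        fs.copyFromLocalFile(new Path("file:///tmp/local.txt"),          // copy a local file into HDFS
                new Path("/user/someuser/test/"));
        fs.delete(new Path("/user/someuser/test/unwanted.txt"), false);  // false = not recursive

        fs.close();
    }
}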
List files in a local Unix file system
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HadoopSimple {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // no fs.defaultFS is set here; the file:/// URI points the FileSystem API at the local file system
        String hdfsPath = "file:///home/user/path_to_a_folder";

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, true); // true means recursive

        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println(file);
        }

        fs.close();
    }
}
Append contents to a file in HDFS
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HadoopSimple {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ip-address:8020/");

        String hdfsPath = "/user/someuser/test/";

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, true); // true means recursive

        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println(file);

            if (file.isFile() && file.getPath().getName().equalsIgnoreCase("test.txt")) {
                FSDataOutputStream os = fs.append(file.getPath());
                os.write("Some test\n".getBytes());
                os.close();
            }
        }

        fs.close();
    }
}
The above Java code is equivalent to the following command line:
echo -e "some text" | hdfs dfs -appendToFile - hdfs://ip-address:8020/user/someuser/test/test.txt
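To verify the append, you can read the file back. A minimal sketch (reusing the same placeholder NameNode address and file path) that streams the file contents to stdout, much like "hdfs dfs -cat /user/someuser/test/test.txt":

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HadoopRead {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ip-address:8020/"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/someuser/test/test.txt"));

        // copy the HDFS stream to stdout with a 4 KB buffer and close the stream when done
        IOUtils.copyBytes(in, System.out, 4096, true);

        fs.close();
    }
}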
How do you find the NameNode URI?
Option 1: On the edge node, look in /etc/hadoop/conf/core-site.xml.
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<ip-address>:8020</value>
</property>
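Alternatively, if the Hadoop client is installed on the node, you can print the value directly from the command line:

hdfs getconf -confKey fs.defaultFS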
Option 2: If you are on Cloudera, go to Cloudera Manager, click on "HDFS", and then select the NameNode to see its configuration details, including the IP address.
Option 3: If you are on Cloudera, go to Cloudera Manager, click on "HDFS", open the Actions drop-down and click "Download Client Configuration", which downloads a zip file containing all the config files, including core-site.xml.
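Once you have the client configuration files on a machine, you can point the Hadoop Configuration object at them instead of hard-coding fs.defaultFS. A minimal sketch, assuming the files were unzipped to /etc/hadoop/conf (a hypothetical location):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopWithClientConfig {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // hypothetical location of the unzipped client configuration
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        FileSystem fs = FileSystem.get(conf); // picks up fs.defaultFS from core-site.xml
        System.out.println(conf.get("fs.defaultFS"));
        fs.close();
    }
}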
What libraries do you need in the classpath?
The Hadoop examples shown in this post need the relevant JARs, shown in the pom.xml file below, on the classpath. The spark-core dependency will transitively bring in the Hadoop libraries.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.mytutorial</groupId>
    <artifactId>simple-spark</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>simple-spark</name>
    <url>http://maven.apache.org</url>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <hadoop.version>2.7.2</hadoop.version>
        <scala.version>2.10.4</scala.version>
        <scala.binary.version>2.10</scala.binary.version>
        <!-- the Spark version must be a build for Scala 2.10, e.g. 1.6.x -->
        <spark.version>1.6.0</spark.version>
    </properties>

    <repositories>
        <repository>
            <id>central</id>
            <name>Maven Central</name>
            <url>http://repo1.maven.org/maven2/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!-- Spark libraries -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>
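If you don't need Spark itself, one alternative (a sketch, not from the original post) is to depend on the Hadoop client libraries directly, reusing the hadoop.version property defined above:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>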