Big Data Analytics

Experiment 1: Understanding and using basic HDFS commands

You are tasked with managing files and directories within the Hadoop Distributed File System (HDFS) as part of a practical exercise.

Objectives:
1. File Creation and HDFS Directory Management:
   o Create three files named subject1, subject2, and subject3 on your local system.
   o Build a directory structure /college/cse and /college/ece in HDFS.
   o Copy the files subject1, subject2, and subject3 from the local file system to the /college/cse directory in HDFS.
2. File Operations in HDFS:
   o Copy the files subject1 and subject2 from /college/cse to /college/ece.
   o Move the file subject3 from /college/cse to /college/ece.
3. Verification:
   o Verify the directory structure using HDFS commands and the HDFS web interface.
4. Cleanup Tasks:
   o Remove the file subject2 from the /college/cse directory.
   o Remove the /college/ece directory, along with its contents.
   o Delete the entire /college directory from HDFS.
   o Verify each step through HDFS commands and the web interface.

Requirements: Document all HDFS commands used during the process and verify the results after each step using appropriate HDFS commands.

Solution

HDFS Introduction
The Hadoop Distributed File System (HDFS) is a core component of the Hadoop framework, designed to store large-scale data across clusters of machines. It is a scalable, fault-tolerant, distributed storage system that serves as the storage layer for processing frameworks such as MapReduce and Spark.

Key Features of HDFS
1. Distributed Storage:
   o HDFS distributes data across multiple nodes in a cluster, enabling efficient storage and parallel processing.
2. Fault Tolerance:
   o HDFS stores multiple replicas of data blocks across different nodes, ensuring data availability even in the case of node failures.
3. High Throughput:
   o It is optimized for high throughput rather than low latency, making it suited to large datasets.
4. Write Once, Read Many:
   o Data is generally written once and read multiple times, simplifying data consistency management.
5. Scalability:
   o The system can handle petabytes of data, and new nodes can be added to scale storage and processing capacity.
6. Compatibility with Commodity Hardware:
   o HDFS runs on low-cost commodity hardware, reducing infrastructure costs.

Architecture of HDFS
1. NameNode (Master Node):
   o Responsible for managing metadata (directory structure, file locations, and block replication).
   o It does not store actual data but keeps track of where data blocks are stored on DataNodes.
2. DataNodes (Worker Nodes):
   o Store actual data blocks and serve read/write requests from clients.
   o Send periodic heartbeats and block reports to the NameNode.
3. Secondary NameNode:
   o Assists the NameNode by periodically merging the edit log into a checkpoint of the metadata.
   o It is not a backup NameNode.
4. Blocks:
   o Files in HDFS are broken into blocks (default size: 128 MB in Hadoop 2.x and later; the size is configurable, and 256 MB is a common choice).
   o Each block is replicated (default: 3 times) across different DataNodes.

Advantages of Using HDFS
● Handles massive datasets.
● Cost-effective, running on commodity hardware.
● Provides high data reliability through replication.
● Integrates with the rest of the Hadoop ecosystem for big data processing.

Overview of HDFS Commands
HDFS provides a set of commands for performing file and directory operations, similar to a traditional file system but tailored for distributed environments. These commands allow users to upload files, manage directories, and verify storage.

Commonly Used HDFS Commands

1. File and Directory Operations
Command | Description | Example
hdfs dfs -mkdir | Creates a new directory in HDFS. | hdfs dfs -mkdir /data
hdfs dfs -ls | Lists the contents of a directory in HDFS. | hdfs dfs -ls /data
hdfs dfs -put | Uploads a file from the local file system to HDFS. | hdfs dfs -put localfile.txt /data
hdfs dfs -get | Downloads a file from HDFS to the local file system. | hdfs dfs -get /data/file.txt ./
hdfs dfs -cp | Copies a file or directory within HDFS. | hdfs dfs -cp /data/file.txt /backup/
hdfs dfs -mv | Moves a file or directory within HDFS. | hdfs dfs -mv /data/file.txt /archive/
hdfs dfs -rm | Deletes a file from HDFS. | hdfs dfs -rm /data/file.txt
hdfs dfs -rm -r | Recursively deletes a directory and its contents from HDFS. | hdfs dfs -rm -r /data

2. Viewing File Contents
Command | Description | Example
hdfs dfs -cat | Displays the contents of a file stored in HDFS. | hdfs dfs -cat /data/file.txt
hdfs dfs -tail | Displays the last kilobyte of a file stored in HDFS. | hdfs dfs -tail /data/file.txt

3. Disk Usage and Quota Information
Command | Description | Example
hdfs dfs -du | Displays disk usage of files or directories. | hdfs dfs -du /data
hdfs dfs -df | Displays the available space and total size of HDFS. | hdfs dfs -df -h

4. Permissions and Ownership
Command | Description | Example
hdfs dfs -chmod | Changes the permissions of files or directories in HDFS. | hdfs dfs -chmod 755 /data
hdfs dfs -chown | Changes the ownership of a file or directory. | hdfs dfs -chown user1 /data
hdfs dfs -chgrp | Changes the group ownership of a file or directory. | hdfs dfs -chgrp group1 /data

5. Administrative Commands
Command | Description | Example
hdfs dfsadmin -report | Displays a detailed report of the HDFS cluster status. | hdfs dfsadmin -report
hdfs dfsadmin -safemode | Enters, leaves, or reports HDFS safemode (used for maintenance). | hdfs dfsadmin -safemode get

Solution Steps:

1. Create Local Files
    mkdir test
    cd test
    nano subject1   # Create and edit file content
    nano subject2
    nano subject3

2. Set Up Directory Structure in HDFS
    hdfs dfs -mkdir /college
    hdfs dfs -mkdir /college/cse
    hdfs dfs -mkdir /college/ece

3. Copy Local Files to HDFS
    hdfs dfs -put subject1 /college/cse
    hdfs dfs -put subject2 /college/cse
    hdfs dfs -put subject3 /college/cse

4. Copy Files Between HDFS Directories
    hdfs dfs -cp /college/cse/subject1 /college/ece
    hdfs dfs -cp /college/cse/subject2 /college/ece

5. Move a File
    hdfs dfs -mv /college/cse/subject3 /college/ece

6. Verify Directory Structure
● Check /college/cse directory contents:
    hdfs dfs -ls /college/cse
● Check /college/ece directory contents:
    hdfs dfs -ls /college/ece

7. Remove Specific Files and Directories
1. Remove subject2 from /college/cse:
    hdfs dfs -rm /college/cse/subject2
2. Remove /college/ece Directory:
    hdfs dfs -rm -r /college/ece
3. Remove /college Directory:
    hdfs dfs -rm -r /college

8. Final Verification
● Confirm the structure of HDFS root directory:
    hdfs dfs -ls /

Verification Using Web Interface
1. Open the HDFS web UI (the NameNode UI listens on port 9870 by default in Hadoop 3; in this lab setup it is at http://192.168.5.100:9870).
2. Check directory structures after every operation:
   o Navigate to /college/cse and /college/ece to verify changes.
3. Ensure the /college directory and its contents are removed after cleanup.

This practical exercise familiarizes you with essential HDFS commands and directory management techniques.

Experiment – 2: Word count application using Mapper Reducer on single node cluster

Word Count Program in Hadoop Using MapReduce
This Java program implements a word count operation in Hadoop using MapReduce. It consists of three main classes:
1. Mapper Class: Reads input text line by line, splits it into words, and maps each word to the value 1.
2. Reducer Class: Aggregates the counts for each word and generates the final output.
3. Driver Class: Configures and executes the MapReduce job.

Code Explanation

1. WordCountMapper
● Purpose: Reads lines of text and splits them into individual words.
● Functionality:
   o For each word in the input, emits the word as the key and 1 as the value.
    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into words on runs of whitespace
            String[] words = value.toString().trim().split("\\s+");
            for (String word : words) {
                if (!word.isEmpty()) { // a blank line yields one empty token; skip it
                    context.write(new Text(word), new LongWritable(1)); // Emit word and count
                }
            }
        }
    }

2. WordCountReducer
● Purpose: Aggregates the counts of each word emitted by the Mapper.
● Functionality:
   o Sums up all the counts for a given word (key) and outputs the total count.

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get(); // Sum the counts for each word
            }
            context.write(key, new LongWritable(sum)); // Emit word and total count
        }
    }

3. WordCountDriver
● Purpose: Sets up the configuration and executes the MapReduce job.
● Functionality:
   o Specifies input and output paths.
   o Configures Mapper and Reducer classes.
   o Runs the job and waits for completion.
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args)
                throws IOException, ClassNotFoundException, InterruptedException {
            if (args.length != 2) {
                System.err.println("2 arguments required: <input path> <output path>");
                System.exit(-1);
            }
            Job job = Job.getInstance(); // replaces the deprecated new Job()
            job.setJarByClass(WordCountDriver.class);
            job.setJobName("WordCount");
            FileInputFormat.addInputPath(job, new Path(args[0]));   // Input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path
            job.setMapperClass(WordCountMapper.class);
            // job.setCombinerClass(WordCountReducer.class); // Optional combiner
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1); // Submit job and wait for completion
        }
    }

Steps to Execute the Program
1. Compile the Code:
   o Use the following command to compile the Java program:
    javac -cp `hadoop classpath` -d . WordCountMapper.java WordCountReducer.java WordCountDriver.java
2. Create a JAR File:
   o Package the compiled classes into a JAR file:
    jar -cvf WordCount.jar -C . .
3. Input File:
   o Create a text file containing the input data (e.g., input.txt).
4. Run the Job:
   o Submit the JAR file to Hadoop with the input and output paths. The output directory must not already exist.
    hadoop jar WordCount.jar WordCountDriver /path/to/input /path/to/output
5. View Results:
   o After the job completes, view the output using:
    hdfs dfs -cat /path/to/output/part-r-00000

Complete Workflow of the Word Count Program with Key-Value Pairs
The Word Count Program is a simple implementation of the MapReduce framework, which involves three main phases: Mapper, Shuffle and Sort, and Reducer. Below is a detailed explanation of the workflow, including the key-value pairs generated at each step.

1. Input Data
The input file contains text data, split into lines. For example:

Input File (input.txt):
    Hadoop is fun
    Hadoop is scalable

Input Key-Value Pair (to Mapper)
● Key: Byte offset of the line in the input file.
● Value: The line of text at the byte offset.

The first line is 13 characters plus a newline, so the second line starts at byte offset 14.

Key (Byte Offset) | Value (Line of Text)
0 | Hadoop is fun
14 | Hadoop is scalable

2. Mapper Phase
The Mapper processes each line of text and splits it into words. For each word, the Mapper emits a key-value pair:
● Key: The word itself.
● Value: 1 (indicating a single occurrence of the word).

Mapper Input
Key | Value
0 | Hadoop is fun
14 | Hadoop is scalable

Mapper Output Key-Value Pairs
Key (Word) | Value (Count)
Hadoop | 1
is | 1
fun | 1
Hadoop | 1
is | 1
scalable | 1

3. Shuffle and Sort Phase
This phase is managed by the Hadoop framework, and it performs the following:
1. Group: Groups all identical keys from the Mapper output.
2. Sort: Sorts the keys. Text keys are compared byte by byte, so uppercase letters sort before lowercase ones (Hadoop comes before fun).
3. Prepare Input for Reducer: Passes grouped values as an iterable list to the Reducer.

Shuffle and Sort Output
Key (Word) | Values (List of Counts)
Hadoop | [1, 1]
fun | [1]
is | [1, 1]
scalable | [1]

4. Reducer Phase
The Reducer aggregates the counts for each word. It sums up all values in the iterable list to calculate the total occurrence of each word.

Reducer Input
Key (Word) | Values (List of Counts)
Hadoop | [1, 1]
fun | [1]
is | [1, 1]
scalable | [1]

Reducer Output Key-Value Pairs
Key (Word) | Value (Total Count)
Hadoop | 2
fun | 1
is | 2
scalable | 1

5. Final Output
The Reducer writes the output as a file in the specified output directory in HDFS. Each line holds a key and its value, separated by a tab.

Output File (part-r-00000)
    Hadoop	2
    fun	1
    is	2
    scalable	1

Workflow Summary with Key-Value Pairs
Phase | Input Key-Value Pair | Output Key-Value Pair
Mapper | (ByteOffset, LineOfText) | (Word, 1)
Shuffle and Sort | (Word, 1) | (Word, [List of Counts])
Reducer | (Word, [List of Counts]) | (Word, TotalCount)
Final Output | N/A (Reducer handles final output writing) | (Word, TotalCount) in output file

Detailed Example for Input Data

Input File:
    Hadoop is fun
    Hadoop is scalable

End-to-End Workflow with Key-Value Pairs:
1. Mapper Phase:
    Input:
    (0, "Hadoop is fun")
    (14, "Hadoop is scalable")
    Output:
    ("Hadoop", 1) ("is", 1) ("fun", 1)
    ("Hadoop", 1) ("is", 1) ("scalable", 1)
2. Shuffle and Sort Phase:
    Grouped Output:
    Key: "Hadoop" -> [1, 1]
    Key: "fun" -> [1]
    Key: "is" -> [1, 1]
    Key: "scalable" -> [1]
3. Reducer Phase:
    Input: ("Hadoop", [1,1]) -> Output: ("Hadoop", 2)
    Input: ("fun", [1]) -> Output: ("fun", 1)
    Input: ("is", [1,1]) -> Output: ("is", 2)
    Input: ("scalable", [1]) -> Output: ("scalable", 1)
4. Final Output:
    Output File:
    Hadoop	2
    fun	1
    is	2
    scalable	1

This workflow demonstrates how the MapReduce model transforms and processes data using key-value pairs at every stage.

Experiment – 3: Working with files in Hadoop file system: Reading, Writing and Copying

Write a Java program that interacts with the Hadoop Distributed File System (HDFS) and performs the following operations:
1. Write a text file with some sample content to HDFS.
2. Read the content of the file from HDFS and print it to the console.
3. Copy the file within HDFS to a new location.
4. Delete both files from HDFS after completing the operations.
Java Program:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HDFSSimpleExample {
        public static void main(String[] args) throws Exception {
            // Initialize Hadoop configuration and FileSystem
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Define paths for write and copy operations
            Path writePath = new Path("/user/hadoop/hdfsfile.txt");
            Path copyPath = new Path("/user/hadoop/hdfsfile_copy.txt");

            // Write to HDFS; closing the stream flushes the data to the file
            try (FSDataOutputStream out = fs.create(writePath)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read from HDFS
            byte[] data = new byte[(int) fs.getFileStatus(writePath).getLen()];
            try (FSDataInputStream in = fs.open(writePath)) {
                in.readFully(0, data);
            }
            System.out.println(new String(data, StandardCharsets.UTF_8));

            // Copy within HDFS (source and destination are the same FileSystem)
            FileUtil.copy(fs, writePath, fs, copyPath, false, conf);

            // Clean up (delete the files)
            fs.delete(writePath, true);
            fs.delete(copyPath, true);

            // Close FileSystem
            fs.close();
        }
    }

Explanation for Each Command in the Program:

1. Hadoop Configuration Setup:
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
   o Configuration conf: Creates a Hadoop configuration object, which is needed to connect to Hadoop's resources.
   o FileSystem fs: Obtains a FileSystem object based on the configuration. This is used for all interactions with HDFS (reading, writing, deleting files, etc.).

2. Defining File Paths:
    Path writePath = new Path("/user/hadoop/hdfsfile.txt");
    Path copyPath = new Path("/user/hadoop/hdfsfile_copy.txt");
   o Path writePath: Specifies the path in HDFS where the file will be written (/user/hadoop/hdfsfile.txt).
   o Path copyPath: Defines the destination path for the copied file in HDFS (/user/hadoop/hdfsfile_copy.txt).

3. Writing Data to HDFS:
    try (FSDataOutputStream out = fs.create(writePath)) {
        out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }
   o fs.create(writePath): Creates a new file at the specified writePath in HDFS and returns an output stream.
   o out.write(...): Writes the text Hello, HDFS! to the file. The text is converted to bytes because HDFS works with byte data; naming the charset (UTF-8) keeps the result independent of the platform default.
   o The try-with-resources block closes the stream. Without the close, the data may not be flushed, and the file length read in the next step could be 0.

4. Reading Data from HDFS:
    byte[] data = new byte[(int) fs.getFileStatus(writePath).getLen()];
    try (FSDataInputStream in = fs.open(writePath)) {
        in.readFully(0, data);
    }
    System.out.println(new String(data, StandardCharsets.UTF_8));
   o fs.getFileStatus(writePath).getLen(): Retrieves the length of the file in bytes at writePath.
   o byte[] data = new byte[...]: Allocates an array large enough to store the file content based on its length.
   o in.readFully(0, data): Reads the file content into the data array starting at byte offset 0.
   o System.out.println(new String(data, StandardCharsets.UTF_8)): Converts the byte array back into a string and prints it, showing the content (Hello, HDFS!).

5. Copying File within HDFS:
    FileUtil.copy(fs, writePath, fs, copyPath, false, conf);
   o FileUtil.copy(...): Copies the file from writePath to copyPath. Passing the same FileSystem object as source and destination keeps the copy inside HDFS. The fifth argument (false) means the source is not deleted after the copy.
   o Note: fs.copyToLocalFile(src, dst) is not suitable here, because it copies a file from HDFS to the local file system rather than to another HDFS path.

6. Cleaning Up (Deleting Files from HDFS):
    fs.delete(writePath, true);
    fs.delete(copyPath, true);
   o fs.delete(writePath, true): Deletes the file located at writePath. The second argument (true) makes the deletion recursive; it matters only for directories and has no effect on a plain file.
   o fs.delete(copyPath, true): Deletes the copied file at copyPath.

7. Closing the FileSystem:
    fs.close();
   o fs.close(): Closes the FileSystem object to release resources and connections to HDFS.

Summary of Operations:
1. Write Operation: The program writes Hello, HDFS! into HDFS.
2. Read Operation: Reads the file content back from HDFS and prints it.
3. Copy Operation: Copies the file within HDFS from one location to another.
4. Cleanup: Deletes the files after operations to clean up the HDFS environment.
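To rehearse the same write, read, copy, delete sequence without a running cluster, the sketch below performs the four operations on the local file system using only java.nio.file. It is an analogue for practice, not HDFS code: the class name LocalFileOpsSketch, the run() method, and the temporary directory are assumptions made for illustration, and the file names simply mirror those used in the experiment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class LocalFileOpsSketch {

    // Runs write -> read -> copy -> delete in a temporary local directory
    // and returns the text that was read back from the written file.
    public static String run() throws IOException {
        Path dir = Files.createTempDirectory("hdfs-sketch"); // hypothetical scratch directory
        Path writePath = dir.resolve("hdfsfile.txt");
        Path copyPath = dir.resolve("hdfsfile_copy.txt");

        // 1. Write: convert the text to bytes with an explicit charset
        byte[] payload = "Hello, HDFS!".getBytes(StandardCharsets.UTF_8);
        Files.write(writePath, payload);

        // 2. Read: load the bytes back and decode them with the same charset
        String content = new String(Files.readAllBytes(writePath), StandardCharsets.UTF_8);

        // 3. Copy: duplicate the file, then check the copy holds the same bytes
        Files.copy(writePath, copyPath);
        if (!Arrays.equals(payload, Files.readAllBytes(copyPath))) {
            throw new IllegalStateException("copy does not match the original");
        }

        // 4. Cleanup: delete both files and the scratch directory
        Files.delete(writePath);
        Files.delete(copyPath);
        Files.delete(dir);

        return content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run());
    }
}
```

Running the class prints the text that was written in step 1. The order of the steps matters in the same way as in the HDFS program: the file has to be fully written before its length or content is read, and the copy has to exist before cleanup removes it.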
Experiment – 4: Writing User Defined Functions/Eval functions for filtering unwanted data in Pig

Given the emp4.txt file containing employee data in the following format:
    101,John,50000,1
    102,Alice,60000,2
    103,Bob,55000,1
    104,Charlie,70000,3
    105,David,48000,2

Perform the following tasks using Pig:
1. Load the data into Pig using the PigStorage function.
2. Display the entire content of the data.
3. Group the data and calculate the average salary of all employees.
4. Dump the result showing the average salary to the console.

Solution:

Steps and Pig Commands:

1. Create the File:
   o Create a file named emp4.txt and add the employee details.
   o Copy the file to HDFS.
    $ nano emp4.txt
    101,John,50000,1
    102,Alice,60000,2
    103,Bob,55000,1
    104,Charlie,70000,3
    105,David,48000,2
    $ hdfs dfs -put emp4.txt

2. Start Pig in Grunt Shell:
    pig

3. Load the Data into Pig:
    a = load 'emp4.txt' using PigStorage(',') as (eno: int, ename: chararray, sal: int, did: int);

4. Display the Data:
    dump a;
Output:
    (101,John,50000,1)
    (102,Alice,60000,2)
    (103,Bob,55000,1)
    (104,Charlie,70000,3)
    (105,David,48000,2)
    (,,,)
The final empty tuple (,,,) appears when the file ends with a blank line. AVG ignores the null salary in that tuple, so the average below is unaffected.

5. Group the Data for Aggregation:
    grouped_data = group a all;

6. Calculate the Average Salary:
    avg_salary = foreach grouped_data generate AVG(a.sal);

7. Display the Average Salary:
    dump avg_salary;
Output:
    (56600.0)

Explanation of Key Steps and Commands:
1. hdfs dfs -put emp4.txt:
   o This command copies the local file emp4.txt into HDFS so it can be processed by Pig. With no destination given, the file goes to your HDFS home directory, which is also where the relative path 'emp4.txt' in the load command is resolved.
2. load command:
   o The load command reads the data from the HDFS file into Pig for processing.
   o PigStorage(',') specifies the delimiter (comma in this case).
   o The schema defines the structure: eno (employee number), ename (name), sal (salary), and did (department ID).
3. dump a:
   o Displays the content of relation a to verify the data was loaded correctly.
4. group a all:
   o Groups all records together for the aggregation (average salary calculation).
5. AVG(a.sal):
   o Computes the average of the sal (salary) field across all employees: (50000 + 60000 + 55000 + 70000 + 48000) / 5 = 56600.
6. dump avg_salary:
   o Displays the computed average salary.

Experiment 5: Retrieving user login credentials from /etc/passwd using Pig Latin

You are given a file named passwords.txt with records in the following format:
    username,password,group
Write a Pig Latin script to extract the username and password fields and display the results.

Solution:

Step-by-Step Instructions:

1. Create the File:
Create and populate the passwords.txt file as follows:
    nano passwords.txt
Content of the file:
    Rama,R12345,root
    Krishna,K53412,superuser
    Suresh,T5342s3,root
    Laxmi,L43t56*,superuser

2. Copy the File to HDFS:
Place the file in HDFS under /user/hduser using the following command:
    hdfs dfs -put passwords.txt /user/hduser/passwords.txt

3. Start the Pig Grunt Shell:
Launch the Pig shell:
    pig

4. Write and Execute the Pig Script:
In the Pig Grunt shell, execute the following commands:
    -- Load the file from HDFS
    a = load '/user/hduser/passwords.txt' using PigStorage(',') as (uname: chararray, password: chararray, grp: chararray);
    -- Extract username and password fields
    credentials = foreach a generate uname, password;
    -- Display the extracted data
    dump credentials;

5. Output:
The output will display the extracted username and password pairs as follows:
    (Rama,R12345)
    (Krishna,K53412)
    (Suresh,T5342s3)
    (Laxmi,L43t56*)

Explanation of Pig Script:
1. Loading the Data:
    a = load '/user/hduser/passwords.txt' using PigStorage(',') as (uname: chararray, password: chararray, grp: chararray);
   o The LOAD command loads the data from the file stored in HDFS.
   o PigStorage(','): Specifies the comma as the delimiter between fields.
   o The AS clause defines the schema:
      ▪ uname: Username (data type: chararray).
      ▪ password: Password (data type: chararray).
      ▪ grp: Group (data type: chararray).
2. Extracting Specific Fields:
    credentials = foreach a generate uname, password;
   o This command iterates over each record in a and creates a new relation credentials containing only the uname and password fields.
3. Dumping the Output:
    dump credentials;
   o The DUMP command outputs the extracted data to the console.

Experiment 6: Working with HiveQL

Task: Manage Employee Data Using Hive with Advanced Queries

Problem Statement:
Given a dataset test1.txt containing employee details in the following format:
    eno,ename,salary,dno
Perform the following tasks using Hive:
1. Create a database named my_database.
2. Create a table named emp1 with fields eno, ename, salary, and dno.
3. Load the test1.txt file into the emp1 table.
4. Perform basic and advanced queries on the emp1 table as listed below:
   o Fetch all data from the emp1 table.
   o Fetch details of employees belonging to department 1.
   o Fetch employees with salaries greater than 75,000.
   o Calculate the average salary of employees in each department.
   o Find the highest salary in the table.
   o Fetch details of employees earning the highest salary.
   o Count the total number of employees in each department.
   o List employees whose names start with the letter s.

Solution Steps:

1. Prepare the Input File:
1. Create and Populate test1.txt:
    nano test1.txt
File content:
    1,john,70000,1
    2,hari,70848,2
    3,ramu,69489,2
    4,teju,97582,1
    5,shiva,79050,2
    6,giri,58170,1
    7,raju,89792,2
    8,bindu,80802,1
    9,sindhu,70179,2
    10,vicky,87982,1
2. Upload the File to HDFS:
    hdfs dfs -put test1.txt /user/student/test/test1.txt

2. Create and Load Hive Table:
1. Launch Hive Shell:
    hive
2. Create a Database:
    CREATE DATABASE my_database;
3. Use the Database:
    USE my_database;
4. Create the Table:
LOCATION must name a directory, not a file, so the table points at /user/student/test, the directory that holds test1.txt. Hive reads every file in that directory as table data.
    CREATE TABLE emp1 (
        eno INT,
        ename STRING,
        salary INT,
        dno INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/student/test';
5. Verify the Data:
Query to view all records:
    SELECT * FROM emp1;
Output:
    OK