Sunday, July 19, 2015

Most commonly used HDFS Commands


1. List all the files and directories under root hdfs directory
hdfs dfs -ls /

To list all the files and directories recursively, use the -R option of ls. The older lsr command is deprecated.
hdfs dfs -ls -R /
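
A recursive listing gets long quickly, so it often helps to filter it. The sketch below prints only the directories. It uses ls -R (the non-deprecated spelling of the recursive listing) and assumes the usual eight-column ls output with paths that contain no spaces. HDFS_CMD and hdfs_list_dirs are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: print only the directories from a recursive HDFS listing.
# HDFS_CMD is a hypothetical override for the client binary (defaults to hdfs).
HDFS_CMD="${HDFS_CMD:-hdfs}"

hdfs_list_dirs() {
    local root="$1"
    # Directory entries start with "d" in the permissions column;
    # the path is assumed to be the 8th whitespace-separated field.
    "$HDFS_CMD" dfs -ls -R "$root" | awk '$1 ~ /^d/ { print $8 }'
}

# Usage (needs a configured Hadoop client):
#   hdfs_list_dirs /
```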

2. Copy a file or directory to another directory in hdfs
hdfs dfs -cp /hdfs/src/dir /hdfs/dest/dir/
hdfs dfs -cp /hdfs/src/dir/file1 /hdfs/dest/dir/

3. Move or rename a file or directory in hdfs

hdfs dfs -mv /hdfs/src/dir /hdfs/dest/dir
hdfs dfs -mv /hdfs/current/dir/file1 /hdfs/current/dir/file2

4. Create a directory in hdfs

hdfs dfs -mkdir /hdfs/new/dir/path

If the parent directories do not exist, the -p option can be used to create all the directories in one go.
hdfs dfs -mkdir -p /hdfs/new/dir/path

5. Read a file in hdfs

hdfs dfs -cat /hdfs/dir/file1.dat

If the file is Snappy compressed, use the text command instead to read it.
hdfs dfs -text /hdfs/dir/file1.dat.snappy

6. Copy a file from local file system to hdfs file system

hdfs dfs -copyFromLocal /local/dir/path/file1.txt /hdfs/dir/path/file.txt

The put command does the same.
hdfs dfs -put /local/dir/path/file1.txt /hdfs/dir/path/file.txt

7. Copy a file from hdfs file system to local file system

hdfs dfs -copyToLocal /hdfs/dir/path/file.txt /local/dir/path/file1.txt

The get command does the same.
hdfs dfs -get /hdfs/dir/path/file.txt /local/dir/path/file1.txt
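
The put and get commands can be combined into a simple round-trip check: upload a file, download it again, and compare the two copies byte for byte. This is only a sketch. HDFS_CMD and hdfs_upload_verified are hypothetical names, and the -f option of put overwrites the destination if it already exists.

```shell
#!/usr/bin/env bash
# Sketch: upload a local file to HDFS, fetch it back, and compare the copies.
# HDFS_CMD is a hypothetical override for the client binary (defaults to hdfs).
HDFS_CMD="${HDFS_CMD:-hdfs}"

hdfs_upload_verified() {
    local local_file="$1" hdfs_path="$2" tmp_dir status
    tmp_dir="$(mktemp -d)" || return 1
    # put -f overwrites an existing destination; cmp -s compares silently.
    if "$HDFS_CMD" dfs -put -f "$local_file" "$hdfs_path" &&
       "$HDFS_CMD" dfs -get "$hdfs_path" "$tmp_dir/copy" &&
       cmp -s "$local_file" "$tmp_dir/copy"; then
        status=0
    else
        status=1
    fi
    rm -rf "$tmp_dir"
    return "$status"
}

# Usage (needs a configured Hadoop client):
#   hdfs_upload_verified /local/dir/path/file1.txt /hdfs/dir/path/file.txt
```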

8. Delete a file or directory from hdfs

Use the rm command to delete a file
hdfs dfs -rm /hdfs/dir/path/file.txt

Use rm -r to delete a directory and its contents. The older rmr command is deprecated.
hdfs dfs -rm -r /hdfs/dir/path/
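
A recursive delete is easy to point at the wrong path, so a small guard can help. The sketch below first confirms with the test command that the path is a directory and only then removes it with rm -r. HDFS_CMD and hdfs_remove_dir are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: delete an HDFS directory only after confirming it is a directory.
# HDFS_CMD is a hypothetical override for the client binary (defaults to hdfs).
HDFS_CMD="${HDFS_CMD:-hdfs}"

hdfs_remove_dir() {
    local dir="$1"
    # -test -d returns 0 only when the path exists and is a directory.
    if ! "$HDFS_CMD" dfs -test -d "$dir"; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    "$HDFS_CMD" dfs -rm -r "$dir"
}

# Usage (needs a configured Hadoop client):
#   hdfs_remove_dir /hdfs/dir/path/
```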

9. Create a zero byte file in hdfs

hdfs dfs -touchz /hdfs/dir/path/file.txt

10. Verify a directory or file using test command in hdfs

hadoop fs -test -[defsz] URI

Options:
-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-z: if the file is zero length, return 0.

Example:
hadoop fs -test -e /hdfs/dir/path
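
The test command prints nothing, so its result has to be read from the exit status. A minimal sketch of how a script might branch on it is shown below. HADOOP_CMD and hdfs_path_kind are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: branch on the exit status of `hadoop fs -test`.
# HADOOP_CMD is a hypothetical override for the client binary (defaults to hadoop).
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# Prints "directory", "file" or "missing" for the given HDFS path.
hdfs_path_kind() {
    local path="$1"
    if "$HADOOP_CMD" fs -test -d "$path"; then
        echo "directory"
    elif "$HADOOP_CMD" fs -test -f "$path"; then
        echo "file"
    else
        echo "missing"
    fi
}

# Usage (needs a configured Hadoop client):
#   hdfs_path_kind /hdfs/dir/path
```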

Top 10 Hive Developer interview questions


1) What is Hive?

Hive is an ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). It is open-source software that lets programmers query and analyse large data sets stored in HDFS.

2) What are the key components in Hive architecture?

  • Command Line Interface (cli)
  • Hive Web Interface (hwi)
  • HiveServer (hiveserver)
  • Metastore
  • Driver
  • Execution Engine

3) What is a Hive Metastore?

The Hive Metastore is the central repository in Hive. It stores schema information and other metadata in an external database.

4) What are the different modes of Hive?

The mode Hive runs in depends on the size of the data nodes in Hadoop.

The modes are:

  • Local mode
  • Map reduce mode

5) What is the use of HCatalog?

HCatalog can be used to share data structures with external systems. It gives users of other Hadoop tools access to the Hive metastore so that they can read and write data in Hive's data warehouse.

6) What are the differences between Hive and HBase?

  • Hive supports most SQL queries, but HBase does not support SQL queries
  • Hive does not support record-level insert, update, and delete operations on ordinary (non-transactional) tables
  • Hive is a data warehouse framework, whereas HBase is a NoSQL database
  • Hive runs on top of MapReduce, whereas HBase runs on top of HDFS

7) Where is table data stored in Apache Hive by default?

hdfs://namenode_server/user/hive/warehouse

8) Write a hive query to view all the databases whose name begins with "db"

hive> SHOW DATABASES LIKE 'db.*';
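
The same query can also be run from a shell script without opening the interactive prompt, using the -e option of the hive command. The sketch below builds the statement from a prefix and reuses the pattern form shown above. Pattern syntax differs between Hive versions, so treat it as an assumption. HIVE_CMD, show_databases_query and list_databases_with_prefix are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: list databases by name prefix from a shell script.
# HIVE_CMD is a hypothetical override for the client binary (defaults to hive).
HIVE_CMD="${HIVE_CMD:-hive}"

# Builds the SHOW DATABASES statement for a name prefix,
# using the same pattern form as the query above.
show_databases_query() {
    printf "SHOW DATABASES LIKE '%s.*';" "$1"
}

# Runs the statement non-interactively: -S is silent mode, -e passes the query string.
list_databases_with_prefix() {
    "$HIVE_CMD" -S -e "$(show_databases_query "$1")"
}

# Usage (needs a configured Hive client):
#   list_databases_with_prefix db
```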

9) Write a query to rename a table Student to Student_2.

hive> ALTER TABLE Student RENAME TO Student_2;

10) How to create an index on a table in Hive?

hive> CREATE INDEX index_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

The above query creates an index named index_salary which points to the salary column in the employee table. 

11) How to delete the above index named index_salary?

hive> DROP INDEX index_salary ON employee;

12) How to see the present working directory in UNIX from hive? Is it possible to run this command from hive?

Yes. Hive allows execution of UNIX commands from the hive prompt: put an exclamation mark (!) before the command and end it with a semicolon. To see the present working directory in UNIX from hive, run !pwd; at the hive prompt.

hive> !pwd;

Top 10 PIG interview questions


1. What is a PIG script?
2. Write the skeleton of a pig script.
3. What is the difference between STORE and DUMP command?
4. What is the use of FILTER in PIG?
5. Can you use joins in PIG?
6. Can you have multiple inputs to a pig script?
7. What is the use of UDF?
8. What is the use of GROUP BY in PIG?
9. What is the use of UNION in PIG?
10. What is a tuple?

Top 10 Hadoop interview questions


1. What is Big Data?
2. What is Hadoop?
3. What are the components of a Hadoop Cluster?
4. What is the single point of failure in Hadoop and why?
5. List down the functions of NameNode.
6. What is the function of Job Tracker?
7. What are the phases in a Map Reduce Job Processing?
8. How do you copy data from local to cluster and vice versa?
9. What are the different file formats in Hadoop?
10. What do you mean by decommissioning a data node in Hadoop?