Sunday, April 15, 2018

You have been asked to write a Hive table. What are the optimization techniques you should apply and make it an effective table so that query execution will be faster ?


Before writing a Hive table, below optimization techniques should be applied to make it an effective table.
1. Input and Output FileFormat.
2. Partitioning
3. Bucketing
4. Serialization, Deserialization techniques.

Lets discuss them one by one in short for interview point of view.

1. Input and Output FileFormat.

When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text with one row per line. The default row delimiter is not a tab character, but the Ctrl-A character. In hive, different binary storage formats are being used. These are as below:-
Binary storage formats: 
2. Sequence files
3. Avro datafiles
4. Parquet files
5. RCFiles 
6. ORCFiles 

Among them, ORCFiles format is much more effective because of the row columnar structure. Row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time. 
The reading operation in an ORC File format is much more faster.

2. Partitioning

Which column to make the partition table ? This depends on the cardinality of the column. Cardinality represents the no of distinct values a column can have. Less is the cardinality, more is the choice of making it as a partitioning column.

Ex: Let say I have a Employee table having columns Employee Id, Department Id,  Salary.
In an organization the no of department is calculative. That means we can easily find the distinct departments in the Organiation. So here, Department Id will be the partioning column. Also, we need to make sure that, the no of partitions in a table should be feasible, as this creates directory in the HDFS. More is the no of directories, more overhead to the name node as name node stores all the meta data information.

3. Bucketing

Bucketing creates fixed no of files in the HDFS based on the no of buckets defined during create table statement. In the above example, we can make the Employee Id as bucketing. The bucketing works with hash function of the bucketing columns. Let say we have 1000 employee ids in all the department. We have applied the partitioning to Department id followed by bucketing in employee ids with 10 buckets. Therefore bucketing will results fixed no of employees in each department based on the hash function of each employee ids. Buckets are nothing but the files inside the partition directories.

Points to remember :-
No of Buckets = No of Reducer

4. Serialization, De-serialization techniques.
These techniques are applicable for hive storage formats. The row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format is defined by a SerDe .

Internally, Hive uses a SerDe called LazySimpleSerDe for  delimited text. This is the default value when none of the Ser-De is mentioned. The “lazy” prefix comes about because it deserializes fields lazily—only as they are accessed. This is the optimized Ser-De by Hive. We can also create user defined SerDe ex. regexSerDe for reading input data. 

Saturday, April 14, 2018

What is Speculative Execution and how it enhance job execution



1. Speculative execution is an job optimization. The goal of speculative execution is to reduce job execution time 
2.  In a hadoop environment, the scheduler tracks the progress of all tasks of the same type (map and reduce) in a job, and only launches speculative duplicates for the small proportion that are running significantly slower than the average.
3. When a task completes successfully, any duplicate tasks that are running are killed since they are no longer needed.  So, if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
4. Speculative execution is turned on by default. Its applicable for both Map and Reduce phase.
5. You can se the speculative execution on a per-job basis or set it true globally which will be applicable for all jobs running in the hadoop cluster. Below are the properties to enable/disable speculative execution :-



set mapreduce.map.speculative true . Enable Speculative execution For Map tasks
set mapreduce.reduce.speculative true Enable Speculative execution For Reduce tasks

set mapreduce.map.speculative false . Disable Speculative execution For Map tasks
set mapreduce.reduce.speculative false Disable Speculative execution For Reduce tasks
Disadvantages of Speculative Execution:-
1. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are being executed in an attempt to bring down the execution time for a single job. 
2. There is a good case for turning off speculative execution for reduce tasks, since any duplicate reduce tasks have to fetch the same map outputs as the original task, and this can significantly increase network traffic on the cluster. 

Wednesday, April 11, 2018

Most Frequently asked Hive Interview Questions


1. What is Speculative Execution and how it enhance job execution ?
2. You have been asked to write a Hive table. What are the optimization techniques you should apply and make it an effective table so that query execution will be faster ?
3. How will you decide which column(s) of a table to make Partition or Bucket ?
4. On what basis bucketing is applied on a table.
5. Describe the Hive architecture in brief ?
6. Write the Hive Word Count program and describe each step.
7. What are the major differences or key differences between partioning and bucketing.
8. What are the key differences between Managed and External table in Hive ? Why do we require an external table ? Give practical example.
9. Lets say I have 3 partitions in my data. Ex. Bangalore, Pune, Hyderabad. Each partiotion is having employee ids. I want to give read/write/update access to individual partitions to specific people. Let say Bangalore partiotion is only accessible by person:A, Hyderabad to persion: B and so on. Person A cant access Hyderbad partition and so on. How do we achieve that?
10. Can we make the same coulmn as partitioning and bucketing ? What will happen if we make so ?
11. Write a Hive query to find the duplicate rows in a table.
12. Why do we use the OVER() windows function in hive ? We can still use the group by clause which is having the similar functionality.
13. What is the key difference between -copyFromLocal and -put command in hadoop ?
14. Increase in number of partitions is a over head of name node as it has to store the meta data information. However considering the increased data volume day by day how to avoid the Name node crashing or name node being down ?
15. What is a hive server ?
16. What are the hive services available ?
17. Wrire the ALTER table syntax to change the data type of a column from string to bigint.
18. Write a unix shell script to remove 5 days older log from today.
19. How bucketing enahnce joining?
20. What are the key differences between Bucket Map Join and Sort Merge Bucket Join (SMB Join)?
      

Sunday, July 19, 2015