You have been asked to write a Hive table. What are the optimization techniques you should apply and make it an effective table so that query execution will be faster ?
Before writing a Hive table, below optimization techniques should be applied to make it an effective table.
1. Input and Output FileFormat.
2. Partitioning
3. Bucketing
4. Serialization, Deserialization techniques.
Lets discuss them one by one in short for interview point of view.
1. Input and Output FileFormat.
Among them, ORCFiles format is much more effective because of the row columnar structure. Row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time.
The reading operation in an ORC File format is much more faster.
2. Partitioning
Which column to make the partition table ? This depends on the cardinality of the column. Cardinality represents the no of distinct values a column can have. Less is the cardinality, more is the choice of making it as a partitioning column.
Ex: Let say I have a Employee table having columns Employee Id, Department Id, Salary.
In an organization the no of department is calculative. That means we can easily find the distinct departments in the Organiation. So here, Department Id will be the partioning column. Also, we need to make sure that, the no of partitions in a table should be feasible, as this creates directory in the HDFS. More is the no of directories, more overhead to the name node as name node stores all the meta data information.
3. Bucketing
Bucketing creates fixed no of files in the HDFS based on the no of buckets defined during create table statement. In the above example, we can make the Employee Id as bucketing. The bucketing works with hash function of the bucketing columns. Let say we have 1000 employee ids in all the department. We have applied the partitioning to Department id followed by bucketing in employee ids with 10 buckets. Therefore bucketing will results fixed no of employees in each department based on the hash function of each employee ids. Buckets are nothing but the files inside the partition directories.
Points to remember :-
No of Buckets = No of Reducer
4. Serialization, De-serialization techniques.
These techniques are applicable for hive storage formats. The row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format is defined by a SerDe .
1. Input and Output FileFormat.
2. Partitioning
3. Bucketing
4. Serialization, Deserialization techniques.
Lets discuss them one by one in short for interview point of view.
1. Input and Output FileFormat.
When you create a table with no ROW FORMAT or STORED AS clauses, the default format
is delimited text with one row per line. The default row delimiter is not a tab character, but the Ctrl-A character. In hive, different binary storage formats are being used. These are as below:-
Binary storage formats:
2. Sequence files
3. Avro datafiles
4. Parquet files
5. RCFiles
6. ORCFiles
Binary storage formats:
2. Sequence files
3. Avro datafiles
4. Parquet files
5. RCFiles
6. ORCFiles
Among them, ORCFiles format is much more effective because of the row columnar structure. Row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time.
The reading operation in an ORC File format is much more faster.
2. Partitioning
Which column to make the partition table ? This depends on the cardinality of the column. Cardinality represents the no of distinct values a column can have. Less is the cardinality, more is the choice of making it as a partitioning column.
Ex: Let say I have a Employee table having columns Employee Id, Department Id, Salary.
In an organization the no of department is calculative. That means we can easily find the distinct departments in the Organiation. So here, Department Id will be the partioning column. Also, we need to make sure that, the no of partitions in a table should be feasible, as this creates directory in the HDFS. More is the no of directories, more overhead to the name node as name node stores all the meta data information.
3. Bucketing
Bucketing creates fixed no of files in the HDFS based on the no of buckets defined during create table statement. In the above example, we can make the Employee Id as bucketing. The bucketing works with hash function of the bucketing columns. Let say we have 1000 employee ids in all the department. We have applied the partitioning to Department id followed by bucketing in employee ids with 10 buckets. Therefore bucketing will results fixed no of employees in each department based on the hash function of each employee ids. Buckets are nothing but the files inside the partition directories.
Points to remember :-
No of Buckets = No of Reducer
4. Serialization, De-serialization techniques.
These techniques are applicable for hive storage formats. The row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format is defined by a SerDe .
Internally, Hive uses a SerDe called LazySimpleSerDe for delimited text. This is the default value when none of the Ser-De is mentioned. The “lazy” prefix comes about because it deserializes fields lazily—only as they are accessed. This is the optimized Ser-De by Hive. We can also create user defined SerDe ex. regexSerDe for reading input data.
0 comments:
Post a Comment