Pay Attention to HCatalog’s Partition Filters


It is common practice to load Hive tables into Pig through the HCatalog loader. However, Pig has no notion of “partitions”, which are a native concept in Hive, so HCatalog lets a Pig “Filter” statement act as a partition selector, as described in the HCatalog documentation. We reproduce an example here:

A = LOAD 'XXX' USING org.apache.hcatalog.pig.HCatLoader();
 
-- date is a partition column; age is not
B = FILTER A BY DATE == '20100819' AND age < 30;
 
-- both date and country are partition columns
C = FILTER A BY DATE == '20100819' AND country == 'US';

The HCatalog documentation suggests that a partition “Filter” statement should immediately follow the “Load” statement, and the example above may give the impression that both “B” and “C” will be optimized, since each specifies a partition filter.

In reality, “B” and “C” may not be optimized to load only the selected partitions; it depends on which partitions the script specifies. For example, if “B” and “C” filter on different partitions, they are most likely not optimized. A safer way to write the code above is:

A1 = LOAD 'XXX' USING org.apache.hcatalog.pig.HCatLoader();
 
-- date is a partition column; age is not
B = FILTER A1 BY DATE == '20100819' AND age < 30;
 
A2 = LOAD 'XXX' USING org.apache.hcatalog.pig.HCatLoader();
-- both date and country are partition columns
C = FILTER A2 BY DATE == '20100819' AND country == 'US';

This guarantees that both “B” and “C” immediately follow their own “Load” statement, so the partition filter can be used for each.
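One way to check whether a partition filter was actually picked up is to inspect the plan with Pig's EXPLAIN operator before running the job. The following is only a minimal sketch reusing the placeholder table name 'XXX' from the examples above; how the pushed-down filter shows up in the output is an assumption and may vary between Pig and HCatalog versions.

```pig
-- Sketch only: 'XXX' is the placeholder table name used in the examples above.
A1 = LOAD 'XXX' USING org.apache.hcatalog.pig.HCatLoader();

-- date is a partition column; age is not
B = FILTER A1 BY DATE == '20100819' AND age < 30;

-- Print the logical, physical, and MapReduce plans for B without running the job.
EXPLAIN B;
```

If the partition condition was handed to the loader, one would expect the filter that remains in the optimized plan for “B” to contain only the age < 30 condition. If the DATE == '20100819' condition still appears there, the partition was probably not pruned and the whole table is being read.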
