Practical Programming


Pay Attention to Hcatelog’s Partition Filters

It is a common practice to load Hive tables through a Hcatelog loader in Pig. However, as Pig does not understand the concepts “partitions”, which is naturally defined in Hive, HCatelog supports to use “Filter” statement from Pig to behave partition selectors like what is mentioned in here. We reproduce an example here:

1
2
3
4
5
6
7
A = LOAD 'XXX' USING  org.apache.hcatalog.pig.HCatLoader();
 
-- date is a partition column; age is not
B = FILTER A BY DATE == '20100819' AND age < 30;
 
-- both date and country are partition columns
C = FILTER A BY DATE == '20100819' AND country == 'US';

From the description from here, it suggests that the “Filter” statement for partitions should follow the “Load” statement immediately where the example above may indicate that both “B” and “C” relations might be optimized as they both specify a partition-filter.

In reality, “B” and “C” may not be optimized to only load a specific partition, depending on which partitions you specify in the script. For example, if “B” and “C” would filter different partitions, they are most likely not optimized. So, a safer way for the code above is:

1
2
3
4
5
6
7
8
A1 = LOAD 'XXX' USING  org.apache.hcatalog.pig.HCatLoader();
 
-- date is a partition column; age is not
B = FILTER A1 BY DATE == '20100819' AND age < 30;
 
A2 = LOAD 'XXX' USING  org.apache.hcatalog.pig.HCatLoader();
-- both date and country are partition columns
C = FILTER A2 BY DATE == '20100819' AND country == 'US';

which guarantees that both “B” and “C” follows a “Load” statement immediately and the partition-filter can be used.


Embedding Pig Latin in Python – Practical Guide

Introduction

Hadoop is a de-facto parallel computing platform for many industrial companies. However, it is non-trivial to write MapReduce programs on Hadoop for complex data analysis tasks. Therefore, high-level structure query languages like Pig can help data scientists better perform analysis tasks without distracting to details of MapReduce programs on Hadoop. One problem with Pig is that, it is lack of capability to deal with complex conditions such as “if”, “while” and “switch”, which are widely supported by other general-purpose programming languages. To tackle the issue, one solution is to embed Pig into a general-purpose programming language.

There are several great references for introductory materials about this topic. For example, [here], [here], [here] and [here]. In this post, we do not want to discuss the basic idea of embedding Pig into Python but demonstrate some important practices.

One important fact that needs to pay attention is that “Python” here is essentially Jython, a JVM-based Python implementation which uses the same syntax with Python. However, the standard libraries used in Python may or may not be available in Jython.

Pig User-Defined Functions

Within Pig, user-defined functions (UDF) are usually used for performing complicated tasks which are not supported well enough from Pig native functions. One example is to draw a Gaussian random variable from each row of the data source, centered by some field names.

Many languages can be used to write Pig UDFs. One easy choice is Jython. For general information about Jython UDFs, please see here.

When functions are included in the main script of the embedding code, it is by default treated as Pig UDFs. For instance,

def abc:
    print 'Hello World'
if __name__ == '__main__':
    from org.apache.pig.scripting import Pig
    P = Pig.compileFromFile(...)
    Q = P.bind(...)
    results = Q.run()

Here, function “abc” will be treated as a Pig UDF not a normal Jython function.

Python User-Defined Functions

As normal functions defined in the script will be treated as Pig UDFs, how can we write functions without putting too much program into the main script. One way is to put those functions into a separate “.py” file and used as a module. Therefore, we can use those functions by calling modules and corresponding functions. For instance,

import ComplexTask
if __name__ == '__main__':
    tasks = ComplexTask.ParseTasks(...)
    from org.apache.pig.scripting import Pig
    P = Pig.compileFromFile(...)
    Q = P.bind(...)
    results = Q.run()

Here, “ComplexTask” is another Python script “ComplexTask.py” in the same directory. In such way, we can put different functionalities into different modules.

Summary

Once we can use Python to control the workflow of Pig scripts, it would be even better that we can start and configure different workflows from different configuration files. Unfortunately, we do not have such tools so far. Thus, we probably need to read and parse XML configuration files by hand.


How to Generate Gamma Random Variables

In this post, I would like to discuss how to generate Gamma distributed random variables. Gamma random variate has a number of applications. One of the most important application is to generate Dirichlet distributed random vectors, which plays a key role in topic modeling and other Bayesian algorithms.

A good starting point is a book by Kroese et al. [1] where detailed discussion about how to generate a number of different random distributed variables. In the book (Section 4.2.6), they list the following methods for Gamma distribution:

  • Marsaglia and Tsang’s method [2]
  • Ahrens and Dieter’s method [3]
  • Cheng and Feast’s method [4]
  • Best’s method [5]

Here, we focus on Marsaglia and Tsang’s method, which is used in GSL Library and Matlab “gamrnd” command (you can check this by typing “open gamrnd”).

Gamma Distribution

As there are at least two forms of Gamma distribution, we focus the following formalism of PDF:
\[
\begin{align}
f(x; \alpha; \beta) = \frac{\beta^{\alpha} x^{\alpha – 1} e^{-\beta x} }{\Gamma(\alpha)}, \,\,\,\, x \geq 0
\end{align}
\]where \( \alpha > 0 \) is called the shape parameter and \( \beta > 0 \) is called the scale parameter.

Marsaglia and Tsang’s Method

The algorithm works as follows for \( \mathbf{X} \sim \mbox{Gamma}(\alpha, 1) \) for \( \alpha \geq 1 \):

  1. Set \( d = \alpha – 1/3 \) and \( c = 1/\sqrt{9d} \).
  2. Generate \( Z \sim \mbox{N}(0,1) \) and \( U \sim \mbox{U}(0,1)\) independently.
  3. If \( Z > -1/c \) and \( \log U < \frac{1}{2} Z^{2} + d - d V + d \times ln V \), where \( V = (1+cZ)^{3}\), return \( \mathbf{X} = d \times V\); otherwise, go back to Step 2.

Two notes:

  1. This method can be easily extended to the cases where \( 1 > \alpha > 0 \). We just generate \( \mathbf{X} \sim \mbox{Gamma}(\alpha + 1, \beta)\), then \( \mathbf{X}^{\prime} = \mathbf{X} \times U^{1/\alpha}\) where \(U \sim (0,1)\). Thus, \( \mathbf{X}^{\prime} \sim \mbox{Gamma}(\alpha, \beta)\). See details in Page 9 of [2].
  2. For \( \beta \neq 1 \), we firstly generate \( \mathbf{X} \sim \mbox{Gamma}(\alpha, 1)\), then \( \mathbf{X}/\beta \sim \mbox{Gamma}(\alpha, \beta)\).

Implementations

Here, we discuss some implementations details. We start from the original proposed C code from [2]:

1
2
3
4
5
6
7
8
float rgama(float a) {
float d,c,x,v,u;
d5a-1./3.; c51./sqrt(9.*d);
for(;;) do {x5RNOR; v51.1c*x;} while(v,50.);
v5v*v*v; u5UNI;
if( u,1.-.0331*(x*x)*(x*x) ) return (d*v);
if( log(u),0.5*x*x1d*(1.-v1log(v)) ) return (d*v); }
}

which is slightly different from the algorithm proposed above. Again, this does not handle \( 1 > \alpha > 0 \) and \( \beta \neq 1 \).

This is the Matlab sample code from [1]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
function x=gamrand(alpha,lambda)
% Gamma(alpha,lambda) generator using Marsaglia and Tsang method
% Algorithm 4.33
if alpha&gt;1
    d=alpha-1/3; c=1/sqrt(9*d); flag=1;
    while flag
        Z=randn;
        if Z&gt;-1/c
            V=(1+c*Z)^3; U=rand;
            flag=log(U)&gt;(0.5*Z^2+d-d*V+d*log(V));
        end
    end
    x=d*V/lambda;
else
    x=gamrand(alpha+1,lambda);
    x=x*rand^(1/alpha);
end

This code does everything.

This is the code snippet extracted from GSL library:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
double gsl_ran_gamma (const gsl_rng * r, const double a, const double b)
{
  /* assume a &gt; 0 */
 
  if (a &lt; 1)
    {
      double u = gsl_rng_uniform_pos (r);
      return gsl_ran_gamma (r, 1.0 + a, b) * pow (u, 1.0 / a);
    }
 
  {
    double x, v, u;
    double d = a - 1.0 / 3.0;
    double c = (1.0 / 3.0) / sqrt (d);
 
    while (1)
      {
        do
          {
            x = gsl_ran_gaussian_ziggurat (r, 1.0);
            v = 1.0 + c * x;
          }
        while (v &lt;= 0);
 
        v = v * v * v;
        u = gsl_rng_uniform_pos (r);
 
        if (u &lt; 1 - 0.0331 * x * x * x * x) 
          break;
 
        if (log (u) &lt; 0.5 * x * x + d * (1 - v + log (v)))
          break;
      }
 
    return b * d * v;
  }
}

Again, this does everything. Note this code is based on a slightly different PDF: \( {1 \over \Gamma(a) b^a} x^{a-1} e^{-x/b}\).

Reference

[1] D.P. Kroese, T. Taimre, Z.I. Botev. Handbook of Monte Carlo Methods. John Wiley & Sons, 2011.
[2] G. Marsaglia and W. Tsang. A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3):363-372, 2000.
[3] J. H. Ahrens and U. Dieter. Computer methods for sampling from gamma, beta, Poisson, and binomial distributions. Computing, 12(3):223-246, 1974.
[4] R. C. H. Cheng and G. M. Feast. Some simple gamma variate generators. Computing, 28(3):290-295, 1979.
[5] D. J. Best. A note on gamma variate generators with shape parameters less than unity. Computing, 30(2):185-188, 1983.


How to install GCC higher version in alternative directory 3

One easy way to install GCC higher version (which supports C++ 11 standard) in alternative directory (meaning without root privilege) is as follows:

  • Download the latest GCC from http://gcc.gnu.org/
  • Untar the package
  • If you don’t have prerequisites for GCC (http://gcc.gnu.org/install/prerequisites.html), you need to the following steps:
    • Go to the source directory.
    • Do “contrib/download_prerequisites”.
  • Create a new directory called “gcc-build” peer to the source directory.
  • In “gcc-build”, perform “../gcc-XXX/configure –prefix=YYY”
  • Do “make” and “make install”
  • Add “YYY/lib” or “YYY/lib64” into your LD_LIBRARY_PATH and LD_RUN_PATH

Random Number Generation with C++ 0x in GCC 2

In this post, I woud like to explore how random number would be generated by using newly added features from C++ 0x in GCC.

Rather than explaining the details of C++ 0x, I just post code here:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
#include <random>
#include <iostream>
#include <string>
#include <stdio.h>
#include <time.h>
 
using namespace std;
 
int main()
{
  int n;
  double p, lambda, shape, mu, sigma;
  mt19937 eng(time(NULL));
 
  uniform_int_distribution<int> uniform_int(3,7);
  for(int i=0; i < 10; i++)
    cout << "[Uniform INT distribution]:" << uniform_int(eng) << endl;
  cout << endl;
 
  uniform_real_distribution<double> uniform_real(0.0,1.0);
  for(int i=0; i < 10; i++)
    cout << "[Uniform REAL distribution]:" << uniform_real(eng) << endl;
  cout << endl;
 
  n = 5;
  p = 0.3;
  binomial_distribution<int> binomial(n, p);
  for(int i=0; i < 10; i++)
    cout << "[Binomial distribution]:" << binomial(eng) << endl;
  cout << endl;
 
  lambda = 4.0;
  exponential_distribution<double> exponential(lambda);
  for(int i=0; i < 10; i++)
    cout << "[Exponential distribution]:" << exponential(eng) << endl;
  cout << endl;
 
  shape = 3.0;
  gamma_distribution<double> gamma(shape);
  for(int i=0; i < 10; i++)
    cout << "[Gamma distribution]:" << gamma(eng) << endl;
  cout << endl;
 
  p = 0.5;
  geometric_distribution<int> geometric(p);
  for(int i=0; i < 10; i++)
    cout << "[Geometric distribution]:" << geometric(eng) << endl;
  cout << endl;
 
  mu = 3.0; sigma = 4.0;
  normal_distribution<double> normal(mu, sigma);
  for(int i=0; i < 10; i++)
    cout << "[Gaussian distribution]:" << normal(eng) << endl;
  cout << endl;
 
  lambda = 7.0;
  poisson_distribution<int> poisson(lambda);
  for(int i=0; i < 10; i++)
    cout << "[Poission distribution]:" << poisson(eng) << endl;
  cout << endl;
 
  p = 0.6;
  bernoulli_distribution bernoulli(p);
  for(int i=0; i < 10; i++)
    cout << "[Bernoulli distribution]:" << bernoulli(eng) << endl;
  cout << endl;
 
  return (0);
}

Note, the code is very clean in the sense that you don’t need any extra libraries at all.

Please compile with:

g++ -std=gnu++0x

The GCC I used is GCC 4.6.1

Several references:
1. http://www.johndcook.com/test_TR1_random.html (sort of out-dated)
2. http://www.johndcook.com/cpp_TR1_random.html (sort of out-dated)
3. http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.200x (current status of C++ 0x in GCC)
4. http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt01ch03s02.html ( a list of header files of C++ in GCC)