Hadoop is the de facto parallel computing platform at many companies. However, it is non-trivial to write MapReduce programs on Hadoop for complex data analysis tasks. High-level, structured query languages like Pig therefore help data scientists perform analysis tasks without being distracted by the details of MapReduce programs on Hadoop. One problem with Pig is that it lacks control-flow constructs such as “if”, “while” and “switch”, which are widely supported by general-purpose programming languages. One solution is to embed Pig into a general-purpose programming language.
There are several good introductory references on this topic, for example [here], [here], [here] and [here]. In this post, we do not discuss the basic idea of embedding Pig into Python; instead, we demonstrate some important practices.
One important fact to keep in mind is that “Python” here is essentially Jython, a JVM-based Python implementation that uses the same syntax as Python. However, the standard libraries available in Python may or may not be available in Jython.
Pig User-Defined Functions
Within Pig, user-defined functions (UDFs) are usually used for complicated tasks that Pig's native functions do not support well. One example is drawing a Gaussian random variable for each row of the data source, centered on the value of some field in that row.
Many languages can be used to write Pig UDFs. One easy choice is Jython. For general information about Jython UDFs, please see here.
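As a rough illustration of the Gaussian example above, a Jython UDF might look like the following sketch. The function name gaussian_sample, its two arguments and the schema string are hypothetical choices made for this illustration. The fallback definition of outputSchema is also an assumption of the sketch: Pig provides that decorator when it registers a Jython UDF file, and the fallback only exists so that the file can be imported and tested under a plain Python interpreter.

```python
import random

# Pig injects the `outputSchema` decorator when it registers a Jython UDF file.
# The fallback below is an assumption of this sketch, so that the file can also
# be imported by a plain Python interpreter for local testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("sample:double")
def gaussian_sample(mean, stddev):
    """Draw one value from a Gaussian centered on `mean`.

    `mean` and `stddev` are expected to come from fields of the current row.
    Nulls are passed through as None instead of raising an error.
    """
    if mean is None or stddev is None:
        return None
    return random.gauss(float(mean), float(stddev))
```

Assuming the file is saved as gaussian_udf.py (a hypothetical name), it could be registered in a Pig script with REGISTER 'gaussian_udf.py' USING jython AS myudfs; and then called as myudfs.gaussian_sample(mu, sigma) inside a FOREACH ... GENERATE statement, where mu and sigma stand for whichever fields hold the center and the spread.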
When functions are defined in the main script of the embedding code, they are treated as Pig UDFs by default. For instance,

    def abc():
        print 'Hello World'

    if __name__ == '__main__':
        from org.apache.pig.scripting import Pig
        P = Pig.compileFromFile(...)
        Q = P.bind(...)
        results = Q.run()

Here, the function “abc” will be treated as a Pig UDF, not as a normal Jython function.
Python User-Defined Functions
Since normal functions defined in the main script are treated as Pig UDFs, how can we write ordinary helper functions without packing too much code into the main script? One way is to put those functions into a separate “.py” file and use it as a module. We can then import the module and call its functions. For instance,

    import ComplexTask

    if __name__ == '__main__':
        tasks = ComplexTask.ParseTasks(...)
        from org.apache.pig.scripting import Pig
        P = Pig.compileFromFile(...)
        Q = P.bind(...)
        results = Q.run()

Here, “ComplexTask” refers to another Python script, “ComplexTask.py”, in the same directory. In this way, we can put different functionality into different modules.
Once we can use Python to control the workflow of Pig scripts, it would be even better if we could start and configure different workflows from different configuration files. Unfortunately, no such tool is available so far, so we probably need to read and parse XML configuration files by hand.
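A minimal sketch of such hand-written parsing is shown below. The XML layout (workflows, workflow and param elements), the script names, the paths and the function name parse_workflows are all hypothetical and only serve the illustration. The sketch relies on xml.etree.ElementTree from the standard library, so it is worth checking that this module is available in the Jython version bundled with your Pig installation.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout, invented for this sketch.
SAMPLE_CONFIG = """
<workflows>
  <workflow name="daily_counts" script="daily_counts.pig">
    <param name="input">/data/events</param>
    <param name="output">/tmp/daily_counts</param>
  </workflow>
  <workflow name="weekly_rollup" script="weekly_rollup.pig">
    <param name="input">/tmp/daily_counts</param>
    <param name="output">/tmp/weekly_rollup</param>
  </workflow>
</workflows>
"""


def parse_workflows(xml_text):
    """Return one dict per <workflow> element, in document order.

    Each dict holds the workflow name, the Pig script path and a dict of
    parameters that could later be handed to bind().
    """
    root = ET.fromstring(xml_text.strip())
    workflows = []
    for node in root.findall("workflow"):
        params = {}
        for param in node.findall("param"):
            params[param.get("name")] = (param.text or "").strip()
        workflows.append({
            "name": node.get("name"),
            "script": node.get("script"),
            "params": params,
        })
    return workflows


if __name__ == "__main__":
    for wf in parse_workflows(SAMPLE_CONFIG):
        print("%s -> %s %s" % (wf["name"], wf["script"], sorted(wf["params"].items())))
```

Following the advice above, this function would live in its own module rather than in the main script, so that Pig does not pick it up as a UDF. In the main script, each parsed entry could then drive one run, for example by passing wf["script"] to Pig.compileFromFile and wf["params"] to bind.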