Ricky Ho | 7 May 00:56 2009

RE: PIG and Hive

Thanks Amr,

Without knowing the details of Hive, one constraint of the SQL model is that you can never generate more than
one record from a single record.  I don't know how this is handled in Hive.  Another question is whether a Hive
script can call user-defined functions?

Using the following word count as an example, can you show me what the Pig script and the Hive script look like?

  Input: a line (a collection of words)
  Output: multiple [word, 1]

  Input: [word, [1, 1, 1, ...]]
  Output: [word, count] 
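
The two stages above can be sketched in plain Python. This is not Pig or Hive syntax, only an illustration of the data flow; the function names map_line, shuffle, reduce_counts, and word_count are hypothetical. Note that the map stage emits several records from a single input line, which is the one-to-many step in question.

```python
from collections import defaultdict


def map_line(line):
    """Map stage: one input line yields many (word, 1) records."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_counts(word, ones):
    """Reduce stage: (word, [1, 1, ...]) becomes (word, count)."""
    return (word, sum(ones))


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_counts(w, ones) for w, ones in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```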


-----Original Message-----
From: Amr Awadallah [mailto:aaa@...] 
Sent: Wednesday, May 06, 2009 3:14 PM
To: core-user@...
Subject: Re: PIG and Hive

> The difference between PIG and Hive seems to be pretty insignificant. 

Difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require underlying structure to the data, Hive does 
imply structure via a metastore. This has its pros and cons. It allows 
Pig to be more suitable for ETL-type tasks where the input data is still 
a mish-mash and you want to convert it to be structured. On the other 
hand, Hive's metastore provides a dictionary that lets you easily see 
what columns exist in which tables, which can be very handy.

(2) Pig is a new language, easy to learn if you know languages similar 
to Perl. Hive is a subset of SQL with very simple variations to enable 
map-reduce-like computation. So, if you come from a SQL background, you 
will find Hive QL extremely easy to pick up (many of your SQL queries 
will run as is), while if you come from a procedural programming 
background (w/o SQL knowledge) then Pig will be much more suitable for 
you. Furthermore, Hive is a bit easier to integrate with other systems 
and tools since it speaks the language they already speak (i.e. SQL).
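
To make the contrast in point (2) concrete, here is a rough Python sketch of the same aggregation written in both styles. It is neither Hive QL nor Pig Latin: the standard-library sqlite3 module stands in for a SQL engine, and the visits table, its columns, the sample rows, and both function names are made up for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (visitor, url) pairs.
ROWS = [("alice", "/home"), ("bob", "/home"), ("alice", "/docs")]


def declarative_counts(rows):
    """SQL style: state the result you want and let the engine plan it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, url TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT visitor, COUNT(*) FROM visits GROUP BY visitor ORDER BY visitor"
    ).fetchall()
    conn.close()
    return result


def procedural_counts(rows):
    """Dataflow style: spell out each step as a named intermediate result."""
    visitors = [visitor for visitor, _url in rows]  # project one field
    counts = Counter(visitors)                      # group and count
    return sorted(counts.items())                   # order the output


if __name__ == "__main__":
    print(declarative_counts(ROWS))
    print(procedural_counts(ROWS))
```

Both functions return the same list of (visitor, count) pairs; the difference is only in whether the steps are left to the engine or written out by hand.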

You're right that HBase is a completely different game. HBase is not 
about being a high-level language that compiles to map-reduce; it is 
about allowing Hadoop to support lookups/transactions on key/value 
pairs. HBase allows you to (1) do quick random lookups, versus scanning all 
of the data sequentially, (2) do insert/update/delete from the middle, not just 

-- amr

Ricky Ho wrote:
> Jeff,
> Thanks for the pointer.
> It is pretty clear that Hive and PIG are the same kind and HBase is a different kind.
> The difference between PIG and Hive seems to be pretty insignificant.  Layering a tool on top of them can
> completely hide their differences.
> I am viewing your PIG and Hive tutorial and hopefully can extract some technical details there.
> Rgds,
> Ricky
> -----Original Message-----
> From: Jeff Hammerbacher [mailto:hammer@...] 
> Sent: Wednesday, May 06, 2009 1:38 PM
> To: core-user@...
> Subject: Re: PIG and Hive
> Here's a permalink for the thread on MarkMail:
> http://markmail.org/thread/ee4hpcji74higqvk
> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal <sharadag@...> wrote:
>> see core-user mail thread with subject "HBase, Hive, Pig and other Hadoop
>> based technologies"
>> - Sharad
>> Ricky Ho wrote:
>>> Are they competing technologies for providing a higher level language for
>>> Map/Reduce programming ?
>>> Or are they complementary ?
>>> Any comparison between them ?
>>> Rgds,
>>> Ricky