From: David Ciemiewicz (JIRA) <jira-1oDqGaOF3Lkdnm+yROfE0A <at> public.gmane.org>
Subject: [jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
Newsgroups: gmane.comp.java.hadoop.pig.devel
Date: Monday 1st June 2009 16:27:07 UTC
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-826:
---------------------------------

    Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
             (was: DISTINCT as "Function" rather than statement - High Level Pig)

> DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
> -------------------------------------------------------------------------------
>
>                 Key: PIG-826
>                 URL: https://issues.apache.org/jira/browse/PIG-826
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following.  And this is about
> the most compact version I could come up with.
> {code}
> Logs = load 'log' using PigStorage()
>         as ( user: chararray, country: chararray, url: chararray);
> DistinctUsers = distinct (foreach Logs generate user);
> DistinctCountries = distinct (foreach Logs generate country);
> DistinctUrls = distinct (foreach Logs generate url);
> DistinctUsersCount = foreach (group DistinctUsers all) generate
>         group, COUNT(DistinctUsers) as user_count;
> DistinctCountriesCount = foreach (group DistinctCountries all) generate
>         group, COUNT(DistinctCountries) as country_count;
> DistinctUrlCount = foreach (group DistinctUrls all) generate
>         group, COUNT(DistinctUrls) as url_count;
> AllDistinctCounts = cross
>         DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
> Report = foreach AllDistinctCounts generate
>         DistinctUsersCount::user_count,
>         DistinctCountriesCount::country_count,
>         DistinctUrlCount::url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> It would be good if there were a higher-level version of Pig that permitted
> code to be written as:
> {code}
> Logs = load 'log' using PigStorage()
>         as ( user: chararray, country: chararray, url: chararray);
> Report = overall Logs generate
>         COUNT(DISTINCT(user)) as user_count,
>         COUNT(DISTINCT(country)) as country_count,
>         COUNT(DISTINCT(url)) as url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> I do want this in Pig and not as SQL.  I'd expect High Level Pig to generate
> Lower Level Pig.
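
For readers who want to check the intended result of the proposed "overall ... generate COUNT(DISTINCT(...))" form, here is a minimal Python sketch of the semantics only. It is not Pig and says nothing about how Pig would plan or run such a statement. The function name count_distinct_per_field and the sample rows are hypothetical, and NULL handling (which SQL's COUNT(DISTINCT ...) skips) is not modeled.

```python
from typing import Dict, Iterable, Sequence


def count_distinct_per_field(
    rows: Iterable[Sequence[str]], field_names: Sequence[str]
) -> Dict[str, int]:
    """Count distinct values of every field in one pass over the rows."""
    # One set per field collects the values seen so far.
    seen = {name: set() for name in field_names}
    for row in rows:
        for name, value in zip(field_names, row):
            seen[name].add(value)
    # Name the outputs like the report columns in the issue: user_count, ...
    return {f"{name}_count": len(values) for name, values in seen.items()}


# Hypothetical sample data with the (user, country, url) schema from the issue.
logs = [
    ("alice", "US", "/home"),
    ("bob", "US", "/home"),
    ("alice", "DE", "/about"),
    ("carol", "US", "/home"),
]

report = count_distinct_per_field(logs, ("user", "country", "url"))
print(report)
```

The result is a single row with one distinct count per field, which is what both the SQL query and the longer Pig script in the description are meant to produce.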

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
 