Redshift has a count() window function, but it doesn’t support counting distinct items. However, one can still count distinct items in a window by using another method.
The following table represents records with a measure of various items per day:
date | item | measure |
---|---|---|
2015-01-01 | table | 12 |
2015-01-01 | chair | 51 |
2015-01-01 | lamp | 8 |
2015-01-02 | table | 17 |
2015-01-02 | chair | 72 |
2015-01-02 | lamp | 23 |
2015-01-02 | bed | 1 |
2015-01-02 | dresser | 2 |
2015-01-03 | bed | 1 |
To get all of the data plus the number of distinct items measured per day, we would want to use this window function:
SELECT * , COUNT(DISTINCT item) OVER(PARTITION BY DATE) AS distinct_count FROM mytable;
However, this doesn’t work, because as of the time I’m writing this article, the Redshift documentation says “ALL is the default. DISTINCT is not supported.”
Instead, one will need to use the DENSE_RANK() function to get this count, with the item to be counted in the ORDER BY window clause.
SELECT * , DENSE_RANK() OVER(PARTITION BY date ORDER BY item) AS distinct_count FROM mytable;
date | item | measure | distinct_count |
---|---|---|---|
2015-01-01 | table | 12 | 3 |
2015-01-01 | chair | 51 | 3 |
2015-01-01 | lamp | 8 | 3 |
2015-01-02 | table | 17 | 5 |
2015-01-02 | chair | 72 | 5 |
2015-01-02 | lamp | 23 | 5 |
2015-01-02 | bed | 1 | 5 |
2015-01-02 | dresser | 2 | 5 |
2015-01-03 | bed | 1 | 1 |
Update: I’m told that this no longer works. It must have been an undocumented feature (bug). You now need to do another query on top of this result set.
Nice workaround. However I think you could use a better example by having duplicate items on a particular date. With the current example, regular count(*) window function would work as well.
Wow — thanks! I’ve been trying to figure out a workaround all day :-)