Two advanced SQL set operators that many analysts utilize when conducting data analysis are UNION and UNION ALL. These operators are used to run multiple queries in parallel. What this means is that each query is combined by the use of one of the union operators and will run in parallel outputting an aggregated result set with the rows from each individual result set appended to it. This is different from traditional SQL queries where we expect a result set to come from a single query with the only change in the sets shape coming in the form of additional columns related to the rows in the parent table. None of the queries in the union depend on each other to run, but they require each query to have compatible columns meaning the same number of columns in each queries SELECT clause and same data types for each sister column (i.e. the same position within the SELECT clause in each query).
To union tables, there are two operators that can be used, UNION
and UNION ALL
. UNION
selects the distinct records for the matching columns in the tables being combined. UNION ALL
selects all records for the matching columns (including duplicate records) in the tables being combined. UNION is more commonly used in standard data analysis, but it should be noted that due to the deduplication process, the query takes a performance hit making it longer to run than UNION ALL or a SQL query with table joins.
With this information in mind, let us look at examples of how you can use these clauses. The tables below display the data that will be referenced in the examples that follow. The data is from two advertising data sources, Facebook Ads and Google Ads. These two platforms do not have a shared key that would allow you to join the data, but they have similar data structures, which is enough for us to be able to combine the data sets and give an aggregated look at how paid advertising as a whole is performing.
Facebook Ad Data (facebook_ads)
ad_id | reporting_start | spent | fb_campaign_id | age | gender | interest1 | impressions | clicks | total_conversion |
---|---|---|---|---|---|---|---|---|---|
708746 | 5/27/20 | 450 | 103916 | 30-34 | M | 15 | 7350 | 974 | 80 |
708749 | 5/27/20 | 5325 | 103917 | 30-34 | M | 16 | 17861 | 5923 | 150 |
708771 | 5/27/20 | 50 | 103920 | 30-34 | M | 20 | 693 | 102 | 10 |
708815 | 5/28/20 | 1959 | 103928 | 30-34 | M | 28 | 10023 | 423 | 15 |
708818 | 5/28/20 | 185 | 103928 | 30-34 | M | 28 | 4133 | 673 | 72 |
708820 | 5/29/20 | 500 | 103929 | 30-34 | M | 29 | 9254 | 1050 | 65 |
Google Ads Data (google_ads)
date | campaign_id | investment | impressions | clicks | conversion |
---|---|---|---|---|---|
2020-05-27 | 1323 | 1252 | 1050 | 47 | 5 |
2020-05-28 | 1324 | 4923 | 5092 | 255 | 23 |
2020-05-28 | 1324 | 4923 | 5092 | 255 | 23 |
2020-05-29 | 1323 | 2467 | 2253 | 158 | 13 |
As you can see from the data in both tables, there are some similarities between the data sets in terms of the data being tracked and measurements used, but they contain slightly different naming conventions and a different count of rows and columns. None of these differences are issues with a union as you will see below.
Union
Let us say that we are looking to find the aggregate daily spend, clicks and impressions from both advertising platforms. To achieve this we will select the columns of interest from both tables and alias those columns where the naming conventions are different. Note that while the names of the columns do not have to be the same in each query, it is a best practice to align the names so that anyone navigating the code can easily find a column of interest in each query. In general, column names from first query will always be the column header value in the returned result set.
SQL Query:
SELECT reporting_start AS report_date
, spent AS investment
, impressions
, clicks
, total_conversion AS conversions
FROM facebook_ads
UNION
SELECT date AS report_date
, investment
, impressions
, clicks
, conversions
FROM google_ads
Result Set:
report_date | investment | impressions | clicks | conversions |
---|---|---|---|---|
2020-05-27 | 1252 | 1050 | 47 | 5 |
2020-05-28 | 4923 | 5092 | 255 | 23 |
2020-05-29 | 2467 | 2253 | 158 | 13 |
2020-05-27 | 50.75 | 693 | 102 | 10 |
2020-05-29 | 500.12 | 9254 | 1050 | 65 |
2020-05-27 | 450.85 | 7350 | 974 | 80 |
2020-05-27 | 5325.02 | 17861 | 5923 | 150 |
2020-05-28 | 1959.32 | 10023 | 423 | 15 |
2020-05-28 | 185.21 | 4133 | 673 | 72 |
In this query, I have aliased columns in both queries to align the taxonomy and used a specific order with the select columns based on overlapping data points. As you can see from the result set, the query ran as expected and both data sets have been combined based on the columns selected.
Errors
The only errors that will be thrown with a union clause are if there is a mismatch in the counts of columns in each query or if those columns data types are different. There must be the same amount of columns in each queries SELECT clause and the columns that match must contain the same data type. Provided below are examples of the error messages you will encounter if one or the other criteria are not met.
ERROR: each UNION query must have the same number of columns
ERROR: invalid input syntax for type *data type used in the column from the first table unioned*
Filtering
Despite the requirement of having the same amount of columns in each SELECT clause within a union, this does not mean that you must follow the same protocol when using other SQL clauses in your individual SQL queries. For example, you can use WHERE in one query, but not the others. Provided below is an example where Google Ads data has been filtered to only include data from 2020-05-28
, whereas the Facebook Ads query is returning all of the data from the table.
SQL Query:
SELECT reporting_start AS report_date
, spent AS investment
, impressions
, clicks
, total_conversion AS conversions
, 'Facebook' AS acquisition_source
FROM facebook_ads
UNION
SELECT date AS report_date
, investment
, impressions
, clicks
, conversions
, 'Google' AS acquisition_source
FROM google_ads
WHERE date = '2020-05-28'
ORDER BY acquisition_source
Result Set:
report_date | investment | impressions | clicks | conversions | acquisition_source |
---|---|---|---|---|---|
2020-05-27 | 450.85 | 7350 | 974 | 80 | |
2020-05-27 | 50.75 | 693 | 102 | 10 | |
2020-05-28 | 185.21 | 4133 | 673 | 72 | |
2020-05-29 | 500.12 | 9254 | 1050 | 65 | |
2020-05-28 | 1959.32 | 10023 | 423 | 15 | |
2020-05-27 | 5325.02 | 17861 | 5923 | 150 | |
2020-05-28 | 4923 | 5092 | 255 | 23 |
As you can see from the result set, Google Ads returns the a single row from its data table associated with the filtered date and Facebook Ads returns all rows.
UNION vs UNION ALL
Finally, let us examine how the previous result set would compare when using UNION versus UNION ALL. Using the same query, but switching UNION out for UNION ALL would deliver the following result set comparison.
UNION ALL Result Set (8 Records):
report_date | investment | impressions | clicks | conversions | acquisition_source |
---|---|---|---|---|---|
2020-05-27 | 450.85 | 7350 | 974 | 80 | |
2020-05-27 | 5325.02 | 17861 | 5923 | 150 | |
2020-05-27 | 50.75 | 693 | 102 | 10 | |
2020-05-28 | 1959.32 | 10023 | 423 | 15 | |
2020-05-28 | 185.21 | 4133 | 673 | 72 | |
2020-05-29 | 500.12 | 9254 | 1050 | 65 | |
2020-05-28 | 4923 | 5092 | 255 | 23 | |
2020-05-28 | 4923 | 5092 | 255 | 23 |
UNION Result Set (7 Records):
report_date | investment | impressions | clicks | conversions | acquisition_source |
---|---|---|---|---|---|
2020-05-27 | 450.85 | 7350 | 974 | 80 | |
2020-05-27 | 50.75 | 693 | 102 | 10 | |
2020-05-28 | 185.21 | 4133 | 673 | 72 | |
2020-05-29 | 500.12 | 9254 | 1050 | 65 | |
2020-05-28 | 1959.32 | 10023 | 423 | 15 | |
2020-05-27 | 5325.02 | 17861 | 5923 | 150 | |
2020-05-28 | 4923 | 5092 | 255 | 23 |
While the difference in the result sets are pretty minimal based on the example data, if you look at the last two rows in both result sets you can see how UNION removed the duplicate rows in the original data table, while UNION ALL included them.
As you have learned in this guide, UNION and UNION ALL are SQL set operators that allow you to join together multiple query result sets. While unions are a lot more flexible in terms of how the data is joined together, you need to be cautious of the SQL query structure to ensure the data in the result set aligns with expectations.