Union Data with UNION and UNION ALL

03/20/2022 Reading Time: 5 Minutes

Two advanced SQL set operators that many analysts utilize when conducting data analysis are UNION and UNION ALL. These operators are used to run multiple queries in parallel. What this means is that each query is combined by the use of one of the union operators and will run in parallel outputting an aggregated result set with the rows from each individual result set appended to it. This is different from traditional SQL queries where we expect a result set to come from a single query with the only change in the sets shape coming in the form of additional columns related to the rows in the parent table. None of the queries in the union depend on each other to run, but they require each query to have compatible columns meaning the same number of columns in each queries SELECT clause and same data types for each sister column (i.e. the same position within the SELECT clause in each query).

To union tables, there are two operators that can be used, UNION and UNION ALL. UNION selects the distinct records for the matching columns in the tables being combined. UNION ALL selects all records for the matching columns (including duplicate records) in the tables being combined. UNION is more commonly used in standard data analysis, but it should be noted that due to the deduplication process, the query takes a performance hit making it longer to run than UNION ALL or a SQL query with table joins.

With this information in mind, let us look at examples of how you can use these clauses. The tables below display the data that will be referenced in the examples that follow. The data is from two advertising data sources, Facebook Ads and Google Ads. These two platforms do not have a shared key that would allow you to join the data, but they have similar data structures, which is enough for us to be able to combine the data sets and give an aggregated look at how paid advertising as a whole is performing.

Facebook Ad Data (facebook_ads)

ad_id reporting_start spent fb_campaign_id age gender interest1 impressions clicks total_conversion
708746 5/27/20 450 103916 30-34 M 15 7350 974 80
708749 5/27/20 5325 103917 30-34 M 16 17861 5923 150
708771 5/27/20 50 103920 30-34 M 20 693 102 10
708815 5/28/20 1959 103928 30-34 M 28 10023 423 15
708818 5/28/20 185 103928 30-34 M 28 4133 673 72
708820 5/29/20 500 103929 30-34 M 29 9254 1050 65
date campaign_id investment impressions clicks conversion
2020-05-27 1323 1252 1050 47 5
2020-05-28 1324 4923 5092 255 23
2020-05-28 1324 4923 5092 255 23
2020-05-29 1323 2467 2253 158 13

As you can see from the data in both tables, there are some similarities between the data sets in terms of the data being tracked and measurements used, but they contain slightly different naming conventions and a different count of rows and columns. None of these differences are issues with a union as you will see below.

Union

Let us say that we are looking to find the aggregate daily spend, clicks and impressions from both advertising platforms. To achieve this we will select the columns of interest from both tables and alias those columns where the naming conventions are different. Note that while the names of the columns do not have to be the same in each query, it is a best practice to align the names so that anyone navigating the code can easily find a column of interest in each query. In general, column names from first query will always be the column header value in the returned result set.

SQL Query:

SELECT reporting_start AS report_date
, spent AS investment
, impressions
, clicks
, total_conversion AS conversions
FROM facebook_ads

UNION

SELECT date AS report_date
, investment
, impressions
, clicks
, conversions
FROM google_ads

Result Set:

report_date investment impressions clicks conversions
2020-05-27 1252 1050 47 5
2020-05-28 4923 5092 255 23
2020-05-29 2467 2253 158 13
2020-05-27 50.75 693 102 10
2020-05-29 500.12 9254 1050 65
2020-05-27 450.85 7350 974 80
2020-05-27 5325.02 17861 5923 150
2020-05-28 1959.32 10023 423 15
2020-05-28 185.21 4133 673 72

In this query, I have aliased columns in both queries to align the taxonomy and used a specific order with the select columns based on overlapping data points. As you can see from the result set, the query ran as expected and both data sets have been combined based on the columns selected.

Errors

The only errors that will be thrown with a union clause  are if there is a mismatch in the counts of columns in each query or if those columns data types are different. There must be the same amount of columns in each queries SELECT clause and the columns that match must contain the same data type. Provided below are examples of the error messages you will encounter if one or the other criteria are not met.


ERROR:  each UNION query must have the same number of columns

ERROR:  invalid input syntax for type *data type used in the column from the first table unioned*

Filtering

Despite the requirement of having the same amount of columns in each SELECT clause within a union, this does not mean that you must follow the same protocol when using other SQL clauses in your individual SQL queries. For example, you can use WHERE in one query, but not the others. Provided below is an example where Google Ads data has been filtered to only include data from 2020-05-28, whereas the Facebook Ads query is returning all of the data from the table.

SQL Query:

SELECT reporting_start AS report_date
, spent AS investment
, impressions
, clicks
, total_conversion AS conversions
, 'Facebook' AS acquisition_source
FROM facebook_ads

UNION

SELECT date AS report_date
, investment
, impressions
, clicks
, conversions
, 'Google' AS acquisition_source
FROM google_ads
WHERE date = '2020-05-28'
ORDER BY acquisition_source

Result Set:

report_date investment impressions clicks conversions acquisition_source
2020-05-27 450.85 7350 974 80 Facebook
2020-05-27 50.75 693 102 10 Facebook
2020-05-28 185.21 4133 673 72 Facebook
2020-05-29 500.12 9254 1050 65 Facebook
2020-05-28 1959.32 10023 423 15 Facebook
2020-05-27 5325.02 17861 5923 150 Facebook
2020-05-28 4923 5092 255 23 Google

As you can see from the result set, Google Ads returns the a single row from its data table associated with the filtered date and Facebook Ads returns all rows.

UNION vs UNION ALL

Finally, let us examine how the previous result set would compare when using UNION versus UNION ALL. Using the same query, but switching UNION out for UNION ALL would deliver the following result set comparison.


UNION ALL Result Set (8 Records):

report_date investment impressions clicks conversions acquisition_source
2020-05-27 450.85 7350 974 80 Facebook
2020-05-27 5325.02 17861 5923 150 Facebook
2020-05-27 50.75 693 102 10 Facebook
2020-05-28 1959.32 10023 423 15 Facebook
2020-05-28 185.21 4133 673 72 Facebook
2020-05-29 500.12 9254 1050 65 Facebook
2020-05-28 4923 5092 255 23 Google
2020-05-28 4923 5092 255 23 Google

UNION Result Set (7 Records):

report_date investment impressions clicks conversions acquisition_source
2020-05-27 450.85 7350 974 80 Facebook
2020-05-27 50.75 693 102 10 Facebook
2020-05-28 185.21 4133 673 72 Facebook
2020-05-29 500.12 9254 1050 65 Facebook
2020-05-28 1959.32 10023 423 15 Facebook
2020-05-27 5325.02 17861 5923 150 Facebook
2020-05-28 4923 5092 255 23 Google

While the difference in the result sets are pretty minimal based on the example data, if you look at the last two rows in both result sets you can see how UNION removed the duplicate rows in the original data table, while UNION ALL included them.

As you have learned in this guide, UNION and UNION ALL are SQL set operators that allow you to join together multiple query result sets. While unions are a lot more flexible in terms of how the data is joined together, you need to be cautious of the SQL query structure to ensure the data in the result set aligns with expectations.

📬 Subscribe to My Newsletter

Get the latest articles and exclusive content straight to your inbox.