User Identification in GA4: The Issue of “Artificial” Sessions

Problem

Several Sublime clients recently noticed significant discrepancies between the session data reported by Sublime and the figures shown in Google Analytics 4 (GA4). The discrepancies, ranging from 30% to 40%, were particularly alarming because Sublime bases its analyses on raw GA4 data. A thorough investigation revealed that GA4 was reporting additional sessions, users, and events that were not present in the raw data. Moreover, in GA4’s data exploration reports, data broken down by a selected dimension often did not add up to the total shown in the very same report.
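A comparison of this kind can be sketched in a few lines of Python. This is a minimal illustration, not Sublime's actual pipeline: it assumes the raw GA4 events have already been flattened into dictionaries carrying `user_pseudo_id` and `ga_session_id`, and the helper names `count_sessions` and `relative_gap`, as well as all numbers, are hypothetical.

```python
def count_sessions(events):
    """Count distinct (user_pseudo_id, ga_session_id) pairs in raw events."""
    return len({(e["user_pseudo_id"], e["ga_session_id"]) for e in events})


def relative_gap(reported, raw):
    """Share by which a reported figure exceeds the raw count."""
    return (reported - raw) / raw


# Hypothetical flattened raw events: two devices, three sessions in total.
raw_events = [
    {"user_pseudo_id": "A", "ga_session_id": 1},
    {"user_pseudo_id": "A", "ga_session_id": 1},  # second event, same session
    {"user_pseudo_id": "A", "ga_session_id": 2},
    {"user_pseudo_id": "B", "ga_session_id": 1},
]

raw_sessions = count_sessions(raw_events)

# Hypothetical session total read from the GA4 interface for the same period.
reported_sessions = 4

print(f"raw={raw_sessions}, reported={reported_sessions}, "
      f"gap={relative_gap(reported_sessions, raw_sessions):.0%}")
```

The same `relative_gap` check can be applied to a report's breakdown rows: if the rows do not sum to the report total, the total probably includes figures that are not tied to any observed row.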

Figure: Blended User Identification – 12 vs 15 sessions

Figure: Device-Based User Identification

Cause

The root cause lay in GA4’s user identification settings, which use machine-learning models to estimate the behavior of users who switch browsers or devices, or who do not consent to tracking (Consent Mode). As a result, GA4 was effectively adding “artificial” data. Switching back to user identification based on GA cookies drastically altered the results and brought them in line with the raw data. Notably, the change applied retroactively, even though changes to GA4 settings generally do not affect historical data.

Figure: Device-Based User Identification

Figure: Blended User Identification

User Identification Methods in GA4

GA4 offers more advanced identification options than its predecessor, Universal Analytics, which relied primarily on cookies tied to a specific browser:

  • Device-based: Similar to the method used in Universal Analytics, this approach relies solely on device identifiers or first-party cookies.
  • Observed: This method goes a step further by integrating data from cookies, Google Signals (if enabled), and user identifiers.
  • Blended: The most advanced option, combining all previous methods and employing machine learning to model user behavior.
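The difference between the three options can be illustrated with a small lookup that picks the first available signal for an event. This is a simplified sketch rather than GA4's actual logic: the signal names, their priority order, and the `resolve_identity` helper are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Signals each method may draw on, following the list above.
# The priority order inside each tuple is an assumption.
IDENTITY_SIGNALS = {
    "device_based": ("device_id",),
    "observed": ("user_id", "google_signals", "device_id"),
    "blended": ("user_id", "google_signals", "device_id", "modeled"),
}


def resolve_identity(event: dict, method: str) -> Optional[Tuple[str, str]]:
    """Return (signal, value) for the first signal present on the event."""
    for signal in IDENTITY_SIGNALS[method]:
        value = event.get(signal)
        if value:
            return signal, value
    return None  # the event cannot be tied to any user under this method


# A logged-in visitor: has both a user identifier and a cookie.
logged_in = {"user_id": "u-42", "device_id": "cookie-1"}
# A visitor without consent: no cookie, only a modeled (estimated) identity.
no_consent = {"modeled": "estimated-user"}

for method in IDENTITY_SIGNALS:
    print(method, resolve_identity(logged_in, method),
          resolve_identity(no_consent, method))
```

In this sketch, only the blended method returns an identity for the visitor without consent, which is the kind of modeled user that has no counterpart in the raw data.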

 

User Identification in Sublime

To identify users accurately while avoiding tracking errors and opaque modeling in GA4, Sublime uses raw GA4 data (Device-Based Reporting Identity) and links it to customer identifiers derived from sales data. As a result, Sublime can attribute user actions on the site both retrospectively and going forward, regardless of the browser or device the customer used.
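The general idea of linking device-level identifiers to customer identifiers can be sketched as follows. This is not Sublime's implementation: the record layout, the field names `customer_id` and `session_id`, and both helper functions are hypothetical, and the sketch assumes each purchase record already carries the `user_pseudo_id` of the device it was made on.

```python
from collections import defaultdict


def build_id_map(purchases):
    """Map each device-level ID to the customer ID seen on a purchase."""
    return {p["user_pseudo_id"]: p["customer_id"] for p in purchases}


def sessions_per_customer(sessions, id_map):
    """Group session IDs by customer, including sessions before the purchase."""
    grouped = defaultdict(list)
    for s in sessions:
        customer = id_map.get(s["user_pseudo_id"])
        if customer is not None:  # devices never linked to a sale stay out
            grouped[customer].append(s["session_id"])
    return dict(grouped)


# Hypothetical sales data: one customer bought on a phone and on a laptop.
purchases = [
    {"user_pseudo_id": "phone-1", "customer_id": "C1"},
    {"user_pseudo_id": "laptop-7", "customer_id": "C1"},
]

# Hypothetical device-based sessions from raw data.
sessions = [
    {"user_pseudo_id": "phone-1", "session_id": "s1"},
    {"user_pseudo_id": "laptop-7", "session_id": "s2"},
    {"user_pseudo_id": "phone-1", "session_id": "s3"},
    {"user_pseudo_id": "tablet-9", "session_id": "s4"},  # no known customer
]

print(sessions_per_customer(sessions, build_id_map(purchases)))
```

Because the mapping is keyed on observed identifiers only, every session assigned to a customer corresponds to a session that actually exists in the raw data; nothing is estimated.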

 

Conclusion

Model-based identification in GA4, particularly in combination with Google Consent Mode, can distort data. Some clients have reported not only inflated user and session numbers, but also serious difficulty attributing sessions to their real sources, with an unexpected surge in direct traffic at the expense of other channels, especially paid ones. This underscores the importance of understanding and correctly configuring identification settings before relying on the resulting data for analysis.
