User Identification in GA4: The Issue of “Artificial” Sessions
Problem
Recently, several Sublime clients noticed significant discrepancies between the session data reported by Sublime and the figures shown in Google Analytics 4 (GA4). These discrepancies, which ranged from 30% to 40%, were particularly alarming because Sublime bases its analyses on raw GA4 data. A thorough investigation revealed that the GA4 interface was reporting additional sessions, users, and events that were not present in the raw data. Furthermore, in GA4’s Explore section, data broken down by selected dimensions often did not add up to the total shown in the very same report.
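To make the size of such a gap concrete, the following minimal sketch counts sessions in raw event data and compares the result with a figure read from the GA4 interface. It assumes the raw data resembles the GA4 BigQuery export, where a session is identified by the pair `user_pseudo_id` and `ga_session_id`; the sample events and the reported figure are made up for illustration and are not client data.

```python
def count_sessions(events):
    """Count distinct sessions as unique (user_pseudo_id, ga_session_id) pairs."""
    return len({(e["user_pseudo_id"], e["ga_session_id"]) for e in events})


def relative_discrepancy(reported, raw):
    """Return how much `reported` exceeds `raw`, as a fraction of `raw`."""
    if raw == 0:
        raise ValueError("raw session count must be non-zero")
    return (reported - raw) / raw


# Hypothetical raw events: device A has two sessions, device B has one.
raw_events = [
    {"user_pseudo_id": "A", "ga_session_id": 1, "event_name": "session_start"},
    {"user_pseudo_id": "A", "ga_session_id": 1, "event_name": "page_view"},
    {"user_pseudo_id": "A", "ga_session_id": 2, "event_name": "page_view"},
    {"user_pseudo_id": "B", "ga_session_id": 1, "event_name": "page_view"},
]

raw_sessions = count_sessions(raw_events)
reported_sessions = 4  # hypothetical figure read from the GA4 interface

gap = relative_discrepancy(reported_sessions, raw_sessions)
print(f"raw={raw_sessions}, reported={reported_sessions}, gap={gap:.0%}")
```

In this toy example the interface reports one session more than the raw data contains, a gap of about 33%, which is the order of magnitude the affected clients observed.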
Cause
The root cause lay in GA4’s user identification setting (the reporting identity), which can use machine-learning models to estimate the behavior of users who switch browsers or devices, or who do not consent to tracking (Consent Mode). As a result, GA4 was effectively adding “artificial” data to its reports. Switching back to user identification via GA cookies drastically altered the results and brought them in line with the raw data. Unlike most changes in GA4, which generally do not affect historical data, this switch also applies retroactively to past periods.
User Identification Methods in GA4
GA4 has introduced more advanced identification options compared to its predecessor, Universal Analytics, which primarily relied on cookies tied to a specific browser:
- Device-based: Similar to the method used in Universal Analytics, this approach relies solely on the device identifier, that is, the first-party cookie on websites or the app instance ID in apps.
- Observed: This method goes a step further by combining the device identifier with Google Signals (if enabled) and user identifiers (User-ID).
- Blended: The most advanced option, combining all of the above and using machine learning to model the behavior of users for whom no identifier is available.
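The difference between the three options can be sketched as a simple lookup. This is an illustration of the list above, not GA4's actual implementation: the field names (`user_id`, `google_signals_id`, `device_id`), the priority order, and the `"modeled"` marker are assumptions made for the example.

```python
# Identifiers each method may use, in assumed priority order (illustrative only).
IDENTIFIER_ORDER = {
    "device-based": ["device_id"],
    "observed": ["user_id", "google_signals_id", "device_id"],
    "blended": ["user_id", "google_signals_id", "device_id"],
}


def resolve_identifier(method, hit):
    """Return the name of the identifier used for `hit` under `method`.

    Returns "modeled" when Blended has no identifier and falls back to
    estimation, and None when the hit cannot be tied to any user.
    """
    for key in IDENTIFIER_ORDER[method]:
        if hit.get(key):
            return key
    return "modeled" if method == "blended" else None


logged_in_hit = {"user_id": "u-1", "device_id": "cookie-abc"}
no_consent_hit = {}  # no cookie and no user identifier available

for method in IDENTIFIER_ORDER:
    print(method, resolve_identifier(method, logged_in_hit),
          resolve_identifier(method, no_consent_hit))
```

Only the Blended branch produces a result for the hit without identifiers, which is where the modeled, "artificial" data enters the reports.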
User Identification in Sublime
To identify users accurately while avoiding tracking errors and opaque modeling in GA4, Sublime uses raw GA4 data (device-based reporting identity) and links it to customer identifiers derived from sales data. As a result, Sublime can attribute user actions on the site to a customer both retrospectively and going forward, regardless of the browser or device the customer used.
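One way such linking could work is sketched below. This is not Sublime's actual implementation: it assumes that purchase events in the raw data carry a `transaction_id` that also appears in the sales data, and all field names and sample records are hypothetical.

```python
from collections import defaultdict


def build_device_to_customer(purchase_events, sales):
    """Map each device ID to a customer ID via a shared transaction ID."""
    txn_to_customer = {s["transaction_id"]: s["customer_id"] for s in sales}
    mapping = {}
    for event in purchase_events:
        customer = txn_to_customer.get(event["transaction_id"])
        if customer is not None:
            mapping[event["user_pseudo_id"]] = customer
    return mapping


def sessions_by_customer(sessions, device_to_customer):
    """Group sessions by customer, falling back to the device ID if unknown."""
    grouped = defaultdict(list)
    for s in sessions:
        owner = device_to_customer.get(s["user_pseudo_id"], s["user_pseudo_id"])
        grouped[owner].append(s["session_id"])
    return dict(grouped)


# Hypothetical data: one customer buys once on a phone and once on a laptop.
sales = [
    {"transaction_id": "T1", "customer_id": "C42"},
    {"transaction_id": "T2", "customer_id": "C42"},
]
purchase_events = [
    {"user_pseudo_id": "phone", "transaction_id": "T1"},
    {"user_pseudo_id": "laptop", "transaction_id": "T2"},
]
sessions = [
    {"user_pseudo_id": "phone", "session_id": "s1"},   # before any purchase
    {"user_pseudo_id": "laptop", "session_id": "s2"},
    {"user_pseudo_id": "phone", "session_id": "s3"},   # after the purchase
    {"user_pseudo_id": "tablet", "session_id": "s4"},  # never linked to a sale
]

device_map = build_device_to_customer(purchase_events, sales)
print(sessions_by_customer(sessions, device_map))
```

Because the mapping is applied to every session of a linked device, sessions recorded before the first purchase (`s1`) and after it (`s3`) both end up under the same customer, across devices, while the unlinked tablet stays identified by its device ID.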
Conclusion
Using GA4’s more advanced identification methods, particularly in conjunction with Google Consent Mode, can distort the data. Some clients have reported not only inflated user and session numbers but also significant difficulty in attributing sessions to their real sources, which showed up as an unexpected surge in direct traffic at the expense of other channels, especially paid ones. This underscores the importance of understanding and correctly configuring new identification technologies before relying on them in data analysis.