Skip to main content

Overlap Analysis

The Overlap Analysis template shows how partner customer data overlaps with Instacart's user base. It helps partners see their potential reach on Instacart by matching hashed identifiers.

The template matches customer identifiers provided by partners like Hashed Email (HEM) and Hashed IP Address with Instacart's user database to generate match statistics.

You can perform the following actions with this template:

  • Match hashed customer identifiers to Instacart's user database.
  • Work with multiple identifier types (HEM, Hashed IP Address).
  • Get data on both total and unique matches.
  • View summary statistics without revealing individual match details.
note

Because of pre-aggregated data, no minimum user count is required.

Data linking

Before linking your data to the Overlap Analysis template, ensure it meets the following requirements:

  • An IDENTIFIER_TYPE column with values HEM or HASHED_IP_ADDRESS.
  • An IDENTIFIER_VALUE column with properly hashed identifiers.
  • Data must be loaded into your Snowflake account.
  • Proper column naming and data types.
  • No personally identifiable information (PII) can be in raw form.
note

The Overlap Analysis template is the only one that requires data linking.

  1. Format your data according to the Overlap Analysis template requirements.

  2. Load your data into designated tables in your Snowflake account.

  3. Use the clean room API to register your data tables by running the following query:

    call samooha_by_snowflake_local_db.consumer.register_db('YOUR_DATABASE_NAME');
  4. Link your dataset to the clean room by running the following query:

    call samooha_by_snowflake_local_db.consumer.link_datasets(
    'CLEANROOM_NAME',
    ['YOUR_DATABASE.YOUR_SCHEMA.YOUR_TABLE']
    );
  5. Verify your data is properly linked and accessible in the clean room.

Run the query

Run the following query to calculate the overlap analysis:

set cleanroom_name = 'CLEANROOM_NAME';
call samooha_by_snowflake_local_db.consumer.run_analysis(
$cleanroom_name, -- cleanroom name
'overlap_analysis',
['YOUR_DATABASE.YOUR_SCHEMA.YOUR_TABLE'], -- Consumer table with identifiers
[], -- Provider tables (none needed)
object_construct() -- No additional parameters
);

Consumer table schema

ParameterDescriptionData typeAllowed valuesRequired or optional
Consumer tableA table with customer identifierstableTable with IDENTIFIER_TYPE (HEM, HASHED_IP_ADDRESS) and IDENTIFIER_VALUE columnsrequired
note

No additional parameters are required.

Your consumer table must have the following structure:

CREATE TABLE YOUR_DATABASE.YOUR_SCHEMA.YOUR_TABLE (
IDENTIFIER_TYPE VARCHAR, -- 'HEM' or 'HASHED_IP_ADDRESS'
IDENTIFIER_VALUE VARCHAR -- The hashed identifier
);

For example:

IDENTIFIER_TYPEIDENTIFIER_VALUE
HEMa1b2c3d4e5f6…
HEMf6e5d4c3b2a1…
HASHED_IP_ADDRESS1a2b3c4d5e6f…

Result

ResultDescription
TOTAL_RECORDSTotal records in the partner dataset
TOTAL_IP_MATCH_RECORDSTotal IP address matches
TOTAL_HEM_MATCH_RECORDSTotal email (HEM) matches
TOTAL_DISTINCT_HEM_MATCH_RECORDSUnique email (HEM) matches
TOTAL_DISTINCT_IP_MATCH_RECORDSUnique IP address matches

Example result

TOTAL_RECORDSTOTAL_HEM_MATCH_RECORDSTOTAL_IP_MATCH_RECORDSTOTAL_DISTINCT_HEM_MATCH_RECORDSTOTAL_DISTINCT_IP_MATCH_RECORDS
5000012500800087505250

Use cases

  • Audience sizing. Understand how many of your customers are also Instacart users.
  • Reach planning. Estimate potential reach for advertising campaigns on Instacart.
  • Data quality assessment. Evaluate the quality and coverage of your identifier data.
  • Partnership evaluation. Assess the value of data collaboration partnerships.

Best practices

Consider the following best practices when running this query:

  • Ensure consistent hashing algorithms across your data sources.
  • Clean and normalize identifiers before hashing.
  • Remove invalid or test data from your identifier lists.
  • Use HEM for the highest match accuracy.
  • Include IP addresses for broader reach estimation.
  • Consider using multiple identifier types for comprehensive analysis.
  • Account for potential duplicate identifiers in your source data.
  • Consider the recency of your identifier data.
  • Understand that match rates may vary based on identifier quality.
  • Ensure all identifiers are properly hashed before upload.
  • Follow your organization's data governance policies.
  • Document identifier hashing methods for audit purposes.

Methodology

This template employs various methodologies to gather, analyze, and interpret data.

Matching process

The template runs a process to match data while improving privacy.

  • Separate identifiers. Splits Hashed Emails (HEMs) and IP addresses for individual processing.
  • Hash matching. Links partner data with Instacart data through exact hash matches.
  • Match counts. Calculates total and unique matches for each type of identifier.
  • Summarized results. Provides summary stats without revealing individual match details.

Analyze Overlap Analysis results

When interpreting results, consider the following factors:

  • Match quality. The accuracy of the results depends on how current and reliable the identifiers are.
  • Hash consistency. Both datasets must use the same hashing method for successful matching.
  • Identifier coverage. Different types of identifiers (like HEM and IP) cover different parts of the audience.
  • Temporal factors. Identifiers may lose relevance over time.

Data checkpoints

The methodology includes built-in quality checks to support accurate results.

  • Input validation. Confirms the table structure and identifier formats are correct.
  • Type validation. Ensures identifiers (such as, Hashed Email or IP) match allowed types.
  • Hash verification. Checks that the identifiers are properly hashed before processing.

Technical implementation

The algorithm is designed to balance efficiency, security, and accuracy.

  • Scalability. Handles large datasets efficiently with optimized joining processes.
  • Security. Runs entirely in Snowflake's secure clean room environment.
  • Accuracy. Uses exact matching to avoid false positives.
  • Performance. Includes database indexing and partitioning to speed up execution.

Analysis capabilities

The template provides detailed insights into audience overlap.

  • Total records. Total count of records in the partner's dataset.
  • HEM matching. Total and unique matches based on Hashed Email.
  • IP matching. Total and unique matches based on Hashed IP Address.
  • Comprehensive coverage. Combines multiple identifier types for audience insights.

Limitations and considerations

  • Identifier quality. The accuracy of results depends on how recent and reliable the hashed identifiers are.
  • Hashing consistency. Hashing methods must be the same across both datasets for accurate matches.
  • Coverage patterns. Different types of identifiers may have unique match rates and coverage areas.
  • Temporal decay. The relevance of older identifiers may decrease over time.