S3 destination for batch exports

Batch exports support two S3-family destinations: AWS S3 for buckets hosted on Amazon Web Services, and S3-compatible for any other S3-compatible object storage provider such as Cloudflare R2, Google Cloud Storage, MinIO, or DigitalOcean Spaces.

Models

This section describes the models that can be exported to S3.

Note: New fields may be added to these models over time, so downstream processes should tolerate additional fields appearing in the exported files.

Events model

This is the default model for S3 batch exports. When exported in the Parquet file format, the schema is:

Field | Type | Description
uuid | STRING | The unique ID of the event within PostHog
event | STRING | The name of the event that was sent
distinct_id | STRING | The distinct_id of the user who sent the event
person_id | STRING | The ID of the person the event is attributed to, resolved at ingestion time
properties | STRING | A JSON object with all the properties sent along with the event, stored as a JSON-encoded string in Parquet
person_properties | STRING | A JSON object with the person's properties at ingestion time, stored as a JSON-encoded string in Parquet
elements_chain | STRING | The chain of DOM elements for $autocapture events. Empty for other events
timestamp | TIMESTAMP | When the event occurred, as reported by the client. Microsecond precision, UTC
created_at | TIMESTAMP | When PostHog ingested the event. Microsecond precision, UTC
_inserted_at | TIMESTAMP | Internal field used by batch exports to track export progress. Included in files but safe to ignore

Note: The types above describe the Parquet file format. Other file formats represent the same data differently. For example, in JSONLines exports the timestamp fields (timestamp, created_at, and _inserted_at) are ISO 8601 strings, and properties and person_properties are nested JSON objects rather than JSON-encoded strings.
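Because the two formats represent the same fields differently, downstream code that reads both may want to normalize rows first. The following is a minimal sketch rather than an official PostHog utility: the function name normalize_event is hypothetical, and it assumes each row has already been loaded into a Python dict (for example by a Parquet or JSONLines reader of your choice) and that JSONLines timestamps parse with datetime.fromisoformat.

```python
import json
from datetime import datetime, timezone

# Field names taken from the events schema above.
TIMESTAMP_FIELDS = ("timestamp", "created_at", "_inserted_at")
JSON_FIELDS = ("properties", "person_properties")


def normalize_event(row: dict) -> dict:
    """Return a copy of `row` with JSON fields as dicts and timestamps as aware datetimes."""
    out = dict(row)
    for field in JSON_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            # Parquet exports store these fields as JSON-encoded strings.
            out[field] = json.loads(value) if value else {}
    for field in TIMESTAMP_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            # JSONLines exports use ISO 8601 strings. The "Z" replacement is an
            # assumption to keep older Python versions happy.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            out[field] = parsed
    return out
```

After normalization, a row read from Parquet and the same row read from JSONLines should compare equal, which keeps the rest of a pipeline format-agnostic.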

Persons model

The schema of the persons model when exported in the Parquet file format is:

Field | Type | Description
team_id | BIGINT | The ID of the project (team) the person belongs to
distinct_id | STRING | A distinct_id associated with the person
person_id | STRING | The ID of the person for this (team_id, distinct_id) pair
properties | STRING | A JSON object with the latest person properties, stored as a JSON-encoded string in Parquet
person_distinct_id_version | BIGINT | Internal version of the person-to-distinct_id mapping, used by batch exports during merges
person_version | BIGINT | Internal version of the person's properties, used by batch exports during merges
created_at | TIMESTAMP | When the person was created. Microsecond precision, UTC
_inserted_at | TIMESTAMP | Internal field used by batch exports to track export progress. Included in files but safe to ignore
is_deleted | BOOLEAN | Whether the person has been deleted

Each export contains one row per (team_id, distinct_id) pair, mapped to their corresponding person_id and latest properties.

Note: The persons model only includes persons that have a person profile in PostHog. If your project has person profile processing disabled (via person_profiles: 'identified_only', person_profiles: 'never', or by sending events with $process_person_profile: false), anonymous users who have never been identified will not appear in the persons export. To count unique users including those without person profiles, you can fall back to distinct_id from the events model. See the example queries in each destination's documentation for details.

Note: As with the events model, these types describe the Parquet file format. In JSONLines exports, created_at and _inserted_at are ISO 8601 strings and properties is a nested JSON object rather than a JSON-encoded string.
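If you load several persons exports into the same downstream store, you may end up with more than one row for a given (team_id, distinct_id) pair. The sketch below shows one possible way to keep a single row per pair. It rests on assumptions that are not confirmed by this page: that a higher person_distinct_id_version (then person_version) marks the newer row, and that rows with is_deleted set should be dropped. The function name latest_person_rows is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def _version(row: dict) -> Tuple[int, int]:
    # Assumption: larger version numbers mean a more recent row.
    return (row["person_distinct_id_version"], row["person_version"])


def latest_person_rows(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per (team_id, distinct_id), preferring the highest versions."""
    latest: Dict[Tuple[int, str], dict] = {}
    for row in rows:
        key = (row["team_id"], row["distinct_id"])
        current = latest.get(key)
        if current is None or _version(row) >= _version(current):
            latest[key] = row
    # Assumption: deleted persons are not wanted downstream.
    return [row for row in latest.values() if not row.get("is_deleted")]
```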

Sessions model

You can view the schema for the sessions model in the configuration form when creating a batch export (there are a few too many fields to display here!).

Creating the batch export

  1. Go to Data > Destinations in the left sidebar.
  2. Click + New destination in the top-right corner.
  3. Choose the destination that matches your storage provider:
    • AWS S3 – Select this if you're exporting to an Amazon S3 bucket.
    • S3-compatible – Select this for any non-AWS S3-compatible provider (Cloudflare R2, Google Cloud Storage, MinIO, DigitalOcean Spaces, etc.).
  4. Click the + Create button.
  5. Fill in the necessary configuration details.
  6. Click Create to finalize.
  7. Done! The batch export will schedule its first run at the start of the next period.

S3 configuration

Both S3-family destinations share a common set of configuration fields. Some fields are specific to one destination type.

Common fields

  • Bucket name: The name of the bucket where data is exported.
  • Region: The region where the bucket is located.
  • Key prefix: A key prefix for each object created. Supports template variables.
  • Format: The file format for the export. See S3 file formats.
  • Max file size (MiB): Split exported data into multiple files when it exceeds this size. Set to 0 or leave empty for no splitting.
  • Compression: A compression method (like gzip or zstd) for exported files, or no compression.
  • Access Key ID / Secret Access Key (required): Credentials with access to the bucket. For AWS S3, these are labeled AWS Access Key ID and AWS Secret Access Key.
  • Events to exclude: A list of events to omit from the exported data.

AWS S3-only fields

  • Encryption: Server-side encryption method (AES256 or aws:kms) for AWS to encrypt data at rest.
  • AWS KMS Key ID: The KMS Key ID to use for server-side encryption. Only required when encryption is set to aws:kms.

S3-compatible-only fields

  • Endpoint URL (required): The endpoint URL for your storage provider (e.g., https://<ACCOUNT_ID>.r2.cloudflarestorage.com for Cloudflare R2 or https://storage.googleapis.com for GCS).
  • Virtual style addressing: Enable this if your provider requires virtual hosted-style bucket addressing. Check your provider's documentation – leave unchecked if unsure.
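The field rules above can be summarized as a small pre-flight check. This is only an illustrative sketch: the S3ExportConfig class, its attribute names, and the "aws_s3" / "s3_compatible" labels are hypothetical and do not correspond to a PostHog API. It encodes only the requirements stated in this section.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class S3ExportConfig:
    # Hypothetical container mirroring the form fields described above.
    destination: str  # "aws_s3" or "s3_compatible" (labels are assumptions)
    bucket_name: str
    region: str
    access_key_id: str
    secret_access_key: str
    encryption: Optional[str] = None    # AWS S3 only: "AES256" or "aws:kms"
    kms_key_id: Optional[str] = None    # AWS S3 only
    endpoint_url: Optional[str] = None  # S3-compatible only


def validate(config: S3ExportConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks valid."""
    errors: List[str] = []
    if not config.access_key_id or not config.secret_access_key:
        errors.append("access key ID and secret access key are required")
    if config.destination == "s3_compatible":
        parsed = urlparse(config.endpoint_url or "")
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("endpoint URL is required for S3-compatible destinations")
        if config.encryption or config.kms_key_id:
            errors.append("encryption settings apply to AWS S3 only")
    elif config.destination == "aws_s3":
        if config.encryption not in (None, "AES256", "aws:kms"):
            errors.append("encryption must be AES256 or aws:kms")
        if config.encryption == "aws:kms" and not config.kms_key_id:
            errors.append("KMS key ID is required when encryption is aws:kms")
    else:
        errors.append(f"unknown destination: {config.destination}")
    return errors
```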

S3 key prefix template variables

The key prefix can include template variables, which are filled in at runtime. All template variables are written between curly brackets (for example, {day}). This allows you to partition files in your S3 bucket, such as by date.

Template variables include:

  • Date and time variables:
    • year
    • month
    • day
    • hour
    • minute
    • second
  • Name of the table exported (for example, 'events' or 'persons'):
    • table
  • Batch export data bounds:
    • data_interval_start
    • data_interval_end
For example, setting {year}-{month}-{day}_{table}/ as the key prefix produces files with keys that start with a prefix like 2023-07-28_events/.
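To preview what a key prefix might expand to before saving it, the substitution can be approximated locally. This is a rough sketch, not PostHog's implementation: render_key_prefix is a hypothetical helper, the zero-padding is inferred from the 2023-07-28 example, and both the moment the date variables refer to (the at argument) and the ISO 8601 rendering of the data bounds are assumptions.

```python
from datetime import datetime, timezone


def render_key_prefix(
    template: str,
    table: str,
    at: datetime,
    data_interval_start: datetime,
    data_interval_end: datetime,
) -> str:
    """Approximate the runtime expansion of a key prefix template."""
    values = {
        # Zero-padded to match the 2023-07-28 example in the text.
        "year": f"{at.year:04d}",
        "month": f"{at.month:02d}",
        "day": f"{at.day:02d}",
        "hour": f"{at.hour:02d}",
        "minute": f"{at.minute:02d}",
        "second": f"{at.second:02d}",
        "table": table,
        # Assumption: bounds rendered as ISO 8601 strings.
        "data_interval_start": data_interval_start.isoformat(),
        "data_interval_end": data_interval_end.isoformat(),
    }
    # str.format raises KeyError for a variable that is not in the list above.
    return template.format(**values)


start = datetime(2023, 7, 28, 0, 0, tzinfo=timezone.utc)
end = datetime(2023, 7, 28, 1, 0, tzinfo=timezone.utc)
print(render_key_prefix("{year}-{month}-{day}_{table}/", "events", start, start, end))
# prints "2023-07-28_events/"
```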

S3 file formats

PostHog S3 batch exports support two file formats for exporting data: Parquet and JSONLines.

The batch export format is selected via a drop down menu when creating or editing an export.

We intend to add support for other common formats, and format-specific configuration options. You can follow the roadmap to track progress.

Compression

Each file format supports a variety of compression methods. The compression method you choose can have a significant effect on the exported file size and the overall time taken to export the data. Based on our own internal testing, we recommend Parquet with zstd compression for the best combination of speed and file size.

Note on Parquet compression: The compression type is included in the file extension, even for Parquet files. For example, files compressed with zstd will have the extension parquet.zst. Since compression is embedded in the format itself, the file should be read directly as a Parquet file and not uncompressed first.
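A consumer that picks files up by extension can use this rule to decide whether a decompression step is needed. The sketch below is illustrative only: read_strategy and its return labels are hypothetical, the object key in the usage line is made up, and the .gz suffix and the idea that non-Parquet files need decompression before parsing are assumptions based on common conventions rather than on this page.

```python
from pathlib import PurePosixPath

# ".zst" comes from the note above; ".gz" is the conventional gzip suffix (assumption).
COMPRESSION_SUFFIXES = {".zst", ".gz"}


def read_strategy(key: str) -> str:
    """Decide how to open an exported object based on its key's extensions."""
    suffixes = PurePosixPath(key).suffixes
    if ".parquet" in suffixes:
        # Compression is internal to the Parquet file: hand it to a Parquet reader as-is.
        return "read-as-parquet"
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        return "decompress-then-parse"
    return "parse-directly"


print(read_strategy("2023-07-28_events/export.parquet.zst"))
# prints "read-as-parquet"
```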

Manifest file

If you specify a max file size in your configuration, several files may be exported. In order to know when the export is complete, we send a manifest.json file (with the same prefix as the other files) once all the data files have been exported. This manifest file contains the key names of all the files exported.
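A downstream job can wait for the manifest before processing a batch. The sketch below is hedged heavily: this page does not document the manifest's JSON structure, so the "files" field name is purely an assumption that you should check against a real manifest.json. The helper names are hypothetical, and listing keys or downloading the manifest body is left to whichever S3 client you use.

```python
import json
from typing import Iterable, List


def export_is_complete(keys: Iterable[str], prefix: str) -> bool:
    """True once manifest.json exists under the same prefix as the data files."""
    return f"{prefix}manifest.json" in set(keys)


def missing_files(manifest_body: str, keys: Iterable[str], files_field: str = "files") -> List[str]:
    """List keys named in the manifest that are not present in the bucket listing.

    `files_field` is an assumed field name; adjust it to the real manifest layout.
    """
    manifest = json.loads(manifest_body)
    present = set(keys)
    return [key for key in manifest[files_field] if key not in present]
```

Checking the listing against the manifest guards against starting work while the bucket listing is still catching up with the uploaded files.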

S3-compatible storage providers

When creating a batch export, select the S3-compatible destination for any non-AWS S3-compatible storage provider. Below are configuration tips for providers we have tested.

MinIO

  • Set the Endpoint URL to your MinIO instance's host and port, for example: https://my-minio-storage:9000.

Cloudflare R2

  • Set the Endpoint URL to the following after replacing your account ID: https://<ACCOUNT_ID>.r2.cloudflarestorage.com.
  • From the Region dropdown, select one of the Cloudflare R2 regions that correspond to your bucket, like "Cloudflare R2 — Automatic (AUTO)".

Google Cloud Storage (GCS)

Accessing GCS for batch exports is similar to accessing BigQuery, as both require a Service Account:

  1. Follow the steps in the BigQuery batch export documentation to create a Service Account.
  2. Create an HMAC key for your Service Account.
  3. Grant the Service Account the Storage Object User role or a custom role with at least the following permissions:
    • storage.multipartUploads.abort
    • storage.multipartUploads.create
    • storage.multipartUploads.list
    • storage.multipartUploads.listParts
    • storage.objects.create
    • storage.objects.delete
  4. Use the HMAC key access key and secret key as Access Key ID and Secret Access Key respectively when configuring your batch export.
  5. Set the Endpoint URL to: https://storage.googleapis.com.
  6. Select the appropriate GCP region from the Region dropdown.
