S3 destination for batch exports
Batch exports support two S3-family destinations: AWS S3 for buckets hosted on Amazon Web Services, and S3-compatible for any other S3-compatible object storage provider such as Cloudflare R2, Google Cloud Storage, MinIO, or DigitalOcean Spaces.
Models
This section describes the models that can be exported to S3.
Note: New fields may be added to these models over time. Therefore, it is recommended that any downstream processes are able to handle additional fields being added to the exported files.
Events model
This is the default model for S3 batch exports. When exported in the Parquet file format, the schema is:
| Field | Type | Description |
|---|---|---|
| uuid | STRING | The unique ID of the event within PostHog |
| event | STRING | The name of the event that was sent |
| distinct_id | STRING | The distinct_id of the user who sent the event |
| person_id | STRING | The ID of the person the event is attributed to, resolved at ingestion time |
| properties | STRING | A JSON object with all the properties sent along with the event, stored as a JSON-encoded string in Parquet |
| person_properties | STRING | A JSON object with the person's properties at ingestion time, stored as a JSON-encoded string in Parquet |
| elements_chain | STRING | The chain of DOM elements for $autocapture events. Empty for other events |
| timestamp | TIMESTAMP | When the event occurred, as reported by the client. Microsecond precision, UTC |
| created_at | TIMESTAMP | When PostHog ingested the event. Microsecond precision, UTC |
| _inserted_at | TIMESTAMP | Internal field used by batch exports to track export progress. Included in files but safe to ignore |
Note: The types above describe the Parquet file format. Other file formats represent the same data differently. For example, in JSONLines exports the timestamp fields (timestamp, created_at, and _inserted_at) are ISO 8601 strings, and properties and person_properties are nested JSON objects rather than JSON-encoded strings.
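Downstream code that reads both formats may want to normalize rows into one shape first. The following is a minimal sketch, assuming each row has already been loaded into a Python dict; the helper name normalize_event_row is hypothetical and not part of PostHog. Unknown fields are passed through untouched, so new fields added to the model do not break it.

```python
import json
from datetime import datetime

JSON_FIELDS = ("properties", "person_properties")
TIMESTAMP_FIELDS = ("timestamp", "created_at", "_inserted_at")


def normalize_event_row(row: dict) -> dict:
    """Return a copy of row with JSON fields as dicts and timestamps as datetimes."""
    out = dict(row)  # keeps any additional fields as they are
    for field in JSON_FIELDS:
        value = out.get(field)
        if isinstance(value, str) and value:
            # Parquet exports store these fields as JSON-encoded strings.
            out[field] = json.loads(value)
    for field in TIMESTAMP_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            # JSONLines exports store these fields as ISO 8601 strings.
            # Assumption: UTC may be written with a trailing "Z", which older
            # Python versions of fromisoformat() reject, so it is rewritten.
            out[field] = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return out
```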
Persons model
The schema of the persons model when exported in the Parquet file format is:
| Field | Type | Description |
|---|---|---|
| team_id | BIGINT | The ID of the project (team) the person belongs to |
| distinct_id | STRING | A distinct_id associated with the person |
| person_id | STRING | The ID of the person for this (team_id, distinct_id) pair |
| properties | STRING | A JSON object with the latest person properties, stored as a JSON-encoded string in Parquet |
| person_distinct_id_version | BIGINT | Internal version of the person-to-distinct_id mapping, used by batch exports during merges |
| person_version | BIGINT | Internal version of the person's properties, used by batch exports during merges |
| created_at | TIMESTAMP | When the person was created. Microsecond precision, UTC |
| _inserted_at | TIMESTAMP | Internal field used by batch exports to track export progress. Included in files but safe to ignore |
| is_deleted | BOOLEAN | Whether the person has been deleted |
Each export contains one row per (team_id, distinct_id) pair, mapped to their corresponding person_id and latest properties.
Note: The persons model only includes persons that have a person profile in PostHog. If your project has person profile processing disabled (via person_profiles: 'identified_only', person_profiles: 'never', or by sending events with $process_person_profile: false), anonymous users who have never been identified will not appear in the persons export. To count unique users including those without person profiles, you can fall back to distinct_id from the events model. See the example queries in each destination's documentation for details.
Note: As with the events model, these types describe the Parquet file format. In JSONLines exports, created_at and _inserted_at are ISO 8601 strings and properties is a nested JSON object rather than a JSON-encoded string.
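The distinct_id fallback for counting unique users can be sketched in a few lines. This is an illustration only, assuming events and persons rows from a single project have been loaded into lists of dicts; the function name count_unique_users is hypothetical, and handling of is_deleted is deliberately left out.

```python
def count_unique_users(events, persons):
    """Count users, preferring person_id and falling back to distinct_id."""
    # The persons export has one row per (team_id, distinct_id) pair,
    # so within one project distinct_id maps to a single person_id.
    person_by_distinct_id = {p["distinct_id"]: p["person_id"] for p in persons}

    users = set()
    for event in events:
        distinct_id = event["distinct_id"]
        person_id = person_by_distinct_id.get(distinct_id)
        if person_id is not None:
            users.add(("person", person_id))
        else:
            # No person profile: count the raw distinct_id instead.
            users.add(("distinct_id", distinct_id))
    return len(users)
```

Tagging each entry with its kind keeps a person_id from colliding with a distinct_id that happens to have the same value.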
Sessions model
You can view the schema for the sessions model in the configuration form when creating a batch export (there are a few too many fields to display here!).
Creating the batch export
- Go to Data > Destinations in the left sidebar.
- Click + New destination in the top-right corner.
- Choose the destination that matches your storage provider:
  - AWS S3 – Select this if you're exporting to an Amazon S3 bucket.
  - S3-compatible – Select this for any non-AWS S3-compatible provider (Cloudflare R2, Google Cloud Storage, MinIO, DigitalOcean Spaces, etc.).
- Click the + Create button.
- Fill in the necessary configuration details.
- Click Create to finalize.
- Done! The batch export will schedule its first run at the start of the next period.
S3 configuration
Both S3-family destinations share a common set of configuration fields. Some fields are specific to one destination type.
Common fields
- Bucket name: The name of the bucket where data is exported.
- Region: The region where the bucket is located.
- Key prefix: A key prefix for each object created. Supports template variables.
- Format: The file format for the export. See S3 file formats.
- Max file size (MiB): Split exported data into multiple files when it exceeds this size. Set to 0 or leave empty for no splitting.
- Compression: A compression method (like gzip or zstd) for exported files, or no compression.
- Access Key ID / Secret Access Key (required): Credentials with access to the bucket. For AWS S3, these are labeled AWS Access Key ID and AWS Secret Access Key.
- Events to exclude: A list of events to omit from the exported data.
AWS S3-only fields
- Encryption: Server-side encryption method (AES256 or aws:kms) for AWS to encrypt data at rest.
- AWS KMS Key ID: The KMS Key ID to use for server-side encryption. Only required when encryption is set to aws:kms.
S3-compatible-only fields
- Endpoint URL (required): The endpoint URL for your storage provider (e.g., https://<ACCOUNT_ID>.r2.cloudflarestorage.com for Cloudflare R2 or https://storage.googleapis.com for GCS).
- Virtual style addressing: Enable this if your provider requires virtual hosted-style bucket addressing. Check your provider's documentation – leave unchecked if unsure.
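The requirements stated in the field lists above can be checked before you fill in the form. The sketch below is illustrative only: the dictionary keys (access_key_id, endpoint_url, kms_key_id, and so on) and the destination labels "aws-s3" and "s3-compatible" are hypothetical names chosen for this example, not PostHog API fields.

```python
AWS_ENCRYPTION_METHODS = {"AES256", "aws:kms"}


def validate_s3_config(config, destination):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # Credentials are required for both destination types.
    for field in ("access_key_id", "secret_access_key"):
        if not config.get(field):
            problems.append(f"{field} is required")

    if destination == "s3-compatible":
        # The endpoint URL is required only for S3-compatible providers.
        if not config.get("endpoint_url"):
            problems.append("endpoint_url is required for S3-compatible destinations")
    elif destination == "aws-s3":
        encryption = config.get("encryption")
        if encryption is not None and encryption not in AWS_ENCRYPTION_METHODS:
            problems.append(f"unsupported encryption method: {encryption}")
        # The KMS key ID is needed only when encryption is aws:kms.
        if encryption == "aws:kms" and not config.get("kms_key_id"):
            problems.append("kms_key_id is required when encryption is aws:kms")
    return problems
```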
S3 key prefix template variables
The key prefix can include template variables, which are filled in at runtime. All template variables are written between curly brackets (for example {day}). This allows you to partition files in your S3 bucket, such as by date.
Template variables include:
- Date and time variables: year, month, day, hour, minute, second.
- Name of the table exported (for example, 'events' or 'persons'): table.
- Batch export data bounds: data_interval_start, data_interval_end.
For example, setting {year}-{month}-{day}_{table}/ as the key prefix produces files prefixed with keys like 2023-07-28_events/.
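To preview what a key prefix expands to, the substitution can be approximated locally. This is a rough sketch, not PostHog's implementation: the function render_key_prefix and its moment parameter are hypothetical, and the rendering of data_interval_start and data_interval_end as ISO 8601 strings is an assumption. Only the zero-padded date style is taken from the example above.

```python
from datetime import datetime, timezone


def render_key_prefix(template, table, moment, data_interval_start, data_interval_end):
    """Fill a key prefix template such as "{year}-{month}-{day}_{table}/"."""
    variables = {
        # Zero-padded to match the 2023-07-28 style shown above.
        "year": f"{moment:%Y}",
        "month": f"{moment:%m}",
        "day": f"{moment:%d}",
        "hour": f"{moment:%H}",
        "minute": f"{moment:%M}",
        "second": f"{moment:%S}",
        "table": table,
        # Assumption: the bounds are rendered as ISO 8601 strings.
        "data_interval_start": data_interval_start.isoformat(),
        "data_interval_end": data_interval_end.isoformat(),
    }
    return template.format(**variables)


start = datetime(2023, 7, 28, 0, 0, tzinfo=timezone.utc)
end = datetime(2023, 7, 28, 1, 0, tzinfo=timezone.utc)
prefix = render_key_prefix("{year}-{month}-{day}_{table}/", "events", end, start, end)
# prefix == "2023-07-28_events/"
```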
S3 file formats
PostHog S3 batch exports support two file formats for exporting data:
- JSON lines.
- Apache Parquet (only the latest version of the format specification is supported).
The batch export format is selected from a dropdown menu when creating or editing an export.
We intend to add support for other common formats, and format-specific configuration options. You can follow the roadmap to track progress.
Compression
Each file format supports a variety of compression methods. The compression method you choose can have a significant effect on the exported file size and the overall time taken to export the data. From our own internal testing, we would recommend using Parquet with zstd compression for the best combination of speed and file size.
Note on Parquet compression: The compression type is included in the file extension, even for Parquet files. For example, files compressed with zstd will have the extension parquet.zst. Since compression is embedded in the format itself, the file should be read directly as a Parquet file and not uncompressed first.
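A loader that handles several exports can apply this rule from the object key alone. The sketch below is hedged: only the parquet.zst extension is confirmed above, so the set of compression suffixes and the .jsonl.gz example are assumptions, and should_decompress_first is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Assumption: suffixes a compressed, non-Parquet export file might carry.
COMPRESSION_SUFFIXES = {".gz", ".zst"}


def should_decompress_first(key):
    """Return True only for compressed files that are not Parquet."""
    suffixes = PurePosixPath(key).suffixes
    if ".parquet" in suffixes:
        # Parquet compression is internal to the file: read it directly.
        return False
    return bool(suffixes) and suffixes[-1] in COMPRESSION_SUFFIXES
```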
Manifest file
If you specify a max file size in your configuration, a single export may produce several files. So that you can tell when the export is complete, we write a manifest.json file (with the same prefix as the other files) once all the data files have been exported. This manifest file contains the key names of all the files exported.
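A consumer can use the manifest as its signal to start loading. The following is a minimal sketch under a loud assumption: the text above does not specify the manifest's JSON layout, so the top-level "files" key used here is a guess to verify against a real manifest.json. The function name data_files_from_manifest is hypothetical.

```python
import json


def data_files_from_manifest(manifest_body):
    """Return the exported keys, or None if the manifest is not there yet.

    manifest_body is the text of manifest.json, or None when the object
    does not exist, which means the export is still in progress.
    """
    if manifest_body is None:
        return None
    manifest = json.loads(manifest_body)
    # Assumption: key names are listed under a top-level "files" array.
    return list(manifest["files"])
```

Fetching the object itself is left to whichever S3 client you already use; the helper only interprets its contents.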
S3-compatible storage providers
When creating a batch export, select the S3-compatible destination for any non-AWS S3-compatible storage provider. Below are configuration tips for providers we have tested.
MinIO
- Set the Endpoint URL to your MinIO instance's host and port, for example: https://my-minio-storage:9000.
Cloudflare R2
- Set the Endpoint URL to the following, replacing <ACCOUNT_ID> with your account ID: https://<ACCOUNT_ID>.r2.cloudflarestorage.com.
- From the Region dropdown, select the Cloudflare R2 region that corresponds to your bucket, like "Cloudflare R2 — Automatic (AUTO)".
Google Cloud Storage (GCS)
Accessing GCS for batch exports follows a process similar to accessing BigQuery, because a Service Account is required:
- Follow the steps in the BigQuery batch export documentation to create a Service Account.
- Create an HMAC key for your Service Account.
- Grant the Service Account the Storage Object User role or a custom role with at least the following permissions:
  - storage.multipartUploads.abort
  - storage.multipartUploads.create
  - storage.multipartUploads.list
  - storage.multipartUploads.listParts
  - storage.objects.create
  - storage.objects.delete
- Use the HMAC key access key and secret key as Access Key ID and Secret Access Key respectively when configuring your batch export.
- Set the Endpoint URL to: https://storage.googleapis.com.
- Select the appropriate GCP region from the Region dropdown.