How-to: Handle ingestion failures

When sending usage events to Zenskar via API, ingestion can fail silently if your system doesn't handle errors correctly. This document explains how failures happen, how to detect them, and how to build a reliable pipeline that prevents data loss.


Core concepts

How ingestion can fail

Ingestion failures fall into two categories with very different implications.

Rejected by Zenskar (non-retriable)

Zenskar received the event but rejected it because the payload was invalid. The API responds with an HTTP 4xx status code and a descriptive error message. If your system doesn't inspect that response, the event is silently dropped.

Never reached Zenskar (retriable)

The event never arrived due to a network outage on your side, an intermediate routing failure, or a Zenskar service disruption. Because the event never arrived, Zenskar has no record of it and cannot alert you. If you don't retry or persist the event locally, it is lost permanently.

What is a dead letter queue (DLQ)?

A dead letter queue (DLQ) is a holding area for events that failed to ingest, regardless of the reason. Instead of discarding a failed event, your system routes it to the DLQ so it can be retried, inspected, or manually replayed later. A DLQ is the primary mechanism for guaranteeing no usage data is lost.

sequenceDiagram
    participant S as Your system
    participant Z as Zenskar API
    participant D as DLQ

    S->>Z: POST /ingest (event payload)

    alt 200 OK
        Z-->>S: 200 OK
        Note over S: Event ingested successfully

    else 4xx validation error
        Z-->>S: 4xx + error message
        S->>D: Write event + error reason (non-retriable)
        Note over D: Awaits manual inspection
        D-->>S: Corrected payload
        S->>Z: POST /ingest (corrected payload)
        Z-->>S: 200 OK

    else 5xx server error
        Z-->>S: 5xx
        loop Retry with exponential backoff
            S->>Z: POST /ingest (same payload)
            Z-->>S: 5xx
        end
        S->>D: Write event + error reason (retriable, retries exhausted)

    else No response (network error)
        S-xZ: POST /ingest (no response)
        loop Retry with exponential backoff
            S->>Z: POST /ingest (same payload)
            S-xZ: No response
        end
        S->>D: Write event + error reason (retriable, retries exhausted)
    end
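The branches in the diagram above can be condensed into a small classifier. A minimal Python sketch, assuming your HTTP client surfaces the status code (and `None` when no response arrived at all):

```python
def classify_failure(status):
    """Map an HTTP status (or None for no response) to an action.

    Returns one of: "ok", "fix_then_retry" (non-retriable),
    or "retry_with_backoff" (retriable).
    """
    if status is None:           # network error: the request never completed
        return "retry_with_backoff"
    if 200 <= status < 300:      # event ingested successfully
        return "ok"
    if 400 <= status < 500:      # rejected payload: retrying unchanged will fail again
        return "fix_then_retry"
    return "retry_with_backoff"  # 5xx: transient server-side issue
```

Keeping this decision in one function makes it easy to test the retry policy independently of any network code.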

Retriable vs. non-retriable failures

Failure type | Cause | Should you retry?
4xx validation error | Malformed or invalid payload | No: Fix the payload first, then retry.
Network / connectivity error | No response received | Yes: Retry with backoff.
5xx server error | Zenskar-side issue | Yes: Retry with backoff.

Important: Retrying a 4xx error without fixing the payload will always fail again. Route these events to the DLQ for inspection and correction before re-sending.


Quickstart guide

This walkthrough shows you how to send a usage event with basic error handling that routes failures to a DLQ. It assumes you are calling the Zenskar ingestion API directly over HTTP.

Step 1: Send the event

Send a POST request with your event payload. A valid payload looks like this:

[
  {
    "data": {
      "campaign_id": "sample_campaign_id_8",
      "impressions": 74
    },
    "timestamp": "2025-06-28 23:44:47",
    "customer_id": "c03"
  }
]
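A minimal sketch of this step using only the Python standard library. The endpoint URL and bearer-token header below are placeholders, not documented Zenskar values; substitute the ingestion URL and credentials from your account:

```python
import json
import urllib.error
import urllib.request

# Placeholder endpoint; replace with the ingestion URL from your dashboard.
API_URL = "https://api.zenskar.com/ingest"

def build_request(events, api_key):
    """Build the POST request for a batch of usage events."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(events).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

def send_events(events, api_key):
    """Send a batch of events; returns (status_code, response_body)."""
    try:
        with urllib.request.urlopen(build_request(events, api_key)) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # 4xx/5xx also carry a body: read it instead of discarding it.
        return err.code, err.read().decode("utf-8")
```

Note that `urllib` raises `HTTPError` for 4xx/5xx responses; catching it is what lets the caller inspect the error message in the next step.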

Step 2: Inspect the response

Always read the HTTP status code and response body. Do not assume success just because you received a response.

  • 200: Event accepted. No further action needed.
  • 4xx: Event rejected. Read the error message, fix the payload, then retry. Do not retry the original payload.
  • 5xx or no response: Delivery failed. Retry with exponential backoff (see Step 4).

Step 3: Route failures to a DLQ

If the event cannot be delivered (network error or 5xx) or was rejected with a validation error (4xx), write it to your DLQ immediately. Include the original payload, the error reason, a timestamp, and the failure type (retriable vs. non-retriable) so you can process the event correctly later.
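As a sketch, a file-based DLQ can be as simple as appending one JSON record per failure. The JSON Lines layout here is one possible format, not a Zenskar requirement:

```python
import json
from datetime import datetime, timezone

def write_to_dlq(event, error, retriable, path="dlq.jsonl"):
    """Append a failed event to a JSON Lines dead letter queue file."""
    record = {
        "event": event,          # original payload, unmodified
        "error": error,          # error message or failure reason
        "retriable": retriable,  # True for 5xx/network, False for 4xx
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending (rather than overwriting) means a crash mid-run never loses previously captured failures.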

Step 4: Retry retriable failures with backoff

For network errors and 5xx responses, retry using exponential backoff with jitter to avoid thundering-herd problems. A reasonable starting point:

  • Initial delay: 1 second
  • Multiplier: 2×
  • Maximum delay: 60 seconds
  • Maximum attempts: 5

After exhausting retries, move the event to the DLQ rather than discarding it.
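The retry policy above can be sketched like this; `send` is a placeholder for whatever function performs the POST and reports success:

```python
import random
import time

def backoff_delay(attempt, base=1.0, multiplier=2.0, max_delay=60.0):
    """Delay before retry `attempt` (0-based), with full jitter."""
    delay = min(base * (multiplier ** attempt), max_delay)
    return random.uniform(0, delay)  # jitter spreads concurrent retries out

def send_with_retries(send, event, max_attempts=5):
    """Call `send(event)` until it succeeds or attempts are exhausted.

    `send` should return True on success and False (or raise) on a
    retriable failure; non-retriable 4xx errors should be routed to the
    DLQ before this function is ever reached.
    """
    for attempt in range(max_attempts):
        try:
            if send(event):
                return True
        except OSError:  # network-level failure (includes urllib's URLError)
            pass
        time.sleep(backoff_delay(attempt))
    return False  # caller routes the event to the DLQ
```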

Step 5: Drain the DLQ

Periodically process events in the DLQ. For non-retriable (4xx) failures, inspect the error message, correct the payload, and re-send. For retriable failures whose retries were exhausted, re-attempt delivery.
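A drain pass might look like the following, assuming each DLQ record is a JSON line carrying `event` and `retriable` fields (adapt the field names to your own storage):

```python
import json

def drain_dlq(path, resend):
    """Re-attempt retriable events; collect non-retriable ones for review.

    `resend(event)` should return True when re-delivery succeeds.
    Returns (delivered, still_failing, needs_fixing) lists of records.
    """
    delivered, still_failing, needs_fixing = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not record["retriable"]:
                needs_fixing.append(record)   # 4xx: correct the payload first
            elif resend(record["event"]):
                delivered.append(record)
            else:
                still_failing.append(record)  # keep for the next drain pass
    return delivered, still_failing, needs_fixing
```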


How-to guides

Choose a DLQ implementation

The right approach depends on your event volume and existing infrastructure.

Method | Description | Best for
File-based logging | Write failed events to a local file | Low volume, simple setups, local development
Database table | Store failed events in a dedicated table for review and manual replay | Moderate volume, teams that want SQL-queryable failure logs
Message queue (e.g. Kafka, RabbitMQ) | Publish failed events to a dedicated DLQ topic or queue | High volume, existing queue infrastructure
Cloud-managed DLQ (e.g. AWS SQS DLQ) | Use a managed queue with built-in retry and failure handling | Cloud-native stacks, teams that prefer managed infrastructure

Make events idempotent

Before retrying, ensure your events carry a stable unique identifier (e.g. a UUID tied to the originating action). Submit this as part of the payload so that if a retry delivers a duplicate, Zenskar can deduplicate it on ingestion. This prevents double-counting usage when a network failure causes an event to be delivered more than once.
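A hypothetical sketch: the `event_id` field name here is illustrative, so use whichever identifier field your Zenskar metric schema defines for deduplication:

```python
import uuid

def with_idempotency_key(event):
    """Attach a stable unique ID so retried deliveries can be deduplicated.

    `setdefault` only generates an ID if one is not already present, so
    the same event keeps the same ID across every retry attempt.
    """
    event.setdefault("event_id", str(uuid.uuid4()))
    return event
```

Generate the ID once, when the originating action happens, never at send time; otherwise each retry would look like a distinct event.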

Validate payloads before sending

Run basic schema validation on your side before calling the API. Check that:

  • All required keys are present (data, timestamp, customer_id)
  • All values match the expected types (see the Reference section below)
  • No unexpected keys are included in the data object
  • The timestamp is in the correct format (YYYY-MM-DD HH:MM:SS)
  • The total payload size is under 1 MB

Catching these errors locally avoids unnecessary API calls and keeps your DLQ free of easily preventable failures.
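The checks above can be sketched as a local validator. The `DATA_SCHEMA` mapping is an illustrative stand-in for your real metric schema:

```python
import json
from datetime import datetime

REQUIRED_KEYS = {"data", "timestamp", "customer_id"}
# Illustrative metric schema: replace with your own keys and types.
DATA_SCHEMA = {"campaign_id": str, "impressions": int}

def validate_event(event):
    """Return a list of problems; an empty list means the event looks valid."""
    errors = [f"Missing key: {k}" for k in sorted(REQUIRED_KEYS - event.keys())]
    data = event.get("data", {})
    if isinstance(data, dict):
        for key in sorted(data.keys() - DATA_SCHEMA.keys()):
            errors.append(f"Unexpected key in payload: {key}")
        for key, expected in DATA_SCHEMA.items():
            if key in data and not isinstance(data[key], expected):
                errors.append(f"Invalid type for key: {key}")
    try:
        datetime.strptime(str(event.get("timestamp", "")), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        errors.append("Invalid timestamp format (expected YYYY-MM-DD HH:MM:SS)")
    if len(json.dumps(event)) > 1_000_000:  # the 1 MB limit applies per request
        errors.append("Payload too large")
    return errors
```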


Reference

Valid payload structure

[
  {
    "data": {
      "campaign_id": "sample_campaign_id_8",
      "impressions": 74
    },
    "timestamp": "2025-06-28 23:44:47",
    "customer_id": "c03"
  }
]

The request body must be a JSON array. Each element represents one usage event.

Field | Type | Required | Notes
customer_id | String | Yes | Must match a customer in Zenskar
timestamp | DateTime | Yes | Format: YYYY-MM-DD HH:MM:SS
data | Object | Yes | Keys and value types must match your metric schema exactly

HTTP error codes


Note: The 404 status code below is returned by Zenskar specifically for unparseable JSON bodies. This is non-standard: most APIs use 400 Bad Request for this case. If your HTTP client or logging tooling maps 404 to "resource not found," add explicit handling to avoid misclassifying this error.

Status | Meaning | Retriable? | Example error message
404 | Request body is not valid JSON (unparseable) | No: Fix the JSON | invalid character '}' looking for beginning of object key string
413 | Payload exceeds 1 MB | No: Split into smaller batches | Payload too large
422 | Payload is valid JSON but fails schema validation | No: Fix the payload | Invalid type for key: impressions. Expected Int64, got string
5xx | Zenskar server error | Yes: Retry with backoff |
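One way to encode this table, including the non-standard 404, as a retry decision (a sketch, not an official client):

```python
# Status semantics from the table above; note that 404 means "unparseable
# JSON" in Zenskar's ingestion API, not "resource not found".
NON_RETRIABLE = {404, 413, 422}

def is_retriable(status):
    """True if the failed request should be retried with backoff."""
    if status is None:             # no response at all: network-level failure
        return True
    if status in NON_RETRIABLE:    # fix the JSON/payload before re-sending
        return False
    return status >= 500           # 5xx: transient server error
```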

Validation error messages

When the API returns 422, the response body contains a message describing the exact problem.

Missing or unexpected keys

{ "error": "Missing key: impressions" }
{ "error": "Unexpected key in payload: extra_field" }

Type mismatches

{ "error": "Invalid type for key: campaign_id. Expected String, got float64" }
{ "error": "Invalid type for key: impressions. Expected Int64, got string" }
{ "error": "Invalid type for key: value. Expected Float64, got string" }
{ "error": "Invalid type for key: is_active. Expected Bool, got string" }

Date and time format errors

{ "error": "Invalid type for key: start_date. Expected Date32, got string" }
{ "error": "Invalid type for key: timestamp. Expected Date32/DateTime64, got string" }

UUID format errors

{ "error": "Invalid type for key: user_id. Expected UUID, got string" }

Nested object errors

{ "error": "Invalid type for key: data. Expected Object, got string" }
{ "error": "Invalid type for key: nested_field. Expected Int64, got string" }

Worked examples

Example 1: Type mismatch (impressions sent as a string instead of an integer)

Request:

{
  "data": { "campaign_id": "sample_campaign_id_8", "impressions": "74" },
  "timestamp": "2025-06-28 23:44:47",
  "customer_id": "c03"
}

Response:

{ "error": "Invalid type for key: impressions. Expected Int64, got string" }

Fix: Send 74 (integer), not "74" (string).


Example 2: Missing required key

Request:

{
  "data": { "campaign_id": "sample_campaign_id_8" },
  "timestamp": "2025-06-28 23:44:47",
  "customer_id": "c03"
}

Response:

{ "error": "Missing key: impressions" }

Fix: Include all required fields defined in your metric schema.


Example 3: Unexpected key in payload

Request:

{
  "data": { "campaign_id": "sample_campaign_id_8", "impressions": 74, "extra_field": "not_allowed" },
  "timestamp": "2025-06-28 23:44:47",
  "customer_id": "c03"
}

Response:

{ "error": "Unexpected key in payload: extra_field" }

Fix: Remove any fields not defined in your metric schema.