Building a Serverless Analytics System
A deep dive into creating a custom, privacy-focused web analytics platform using AWS serverless technologies, Rust, and modern frontend tools.
Remember the early days of web analytics when embedding Google Analytics meant surrendering user privacy and accepting whatever data schema Google provided? I found myself in that exact position recently, needing analytics for a side project but unwilling to compromise on user privacy or pay enterprise prices for simple traffic insights.
The solution? Build a custom analytics system that captures only what we need, respects user privacy, and costs pennies to operate. This is the story of how I built a complete serverless analytics platform using AWS CDK, SQS, Lambda, and DynamoDB.
What started as a simple "track page views" requirement evolved into a fascinating exploration of serverless architecture patterns. We'll journey through three distinct phases: creating a lightning-fast pixel tracker with API Gateway, building a resilient event processing pipeline, and crafting a modern dashboard that makes data actionable.
By the end of this post, you'll understand not just how to build this system, but why each architectural decision was made and how to adapt these patterns for your own projects.
Part 1: The Quest for the Perfect Tracking Pixel
The heart of any analytics system is data collection, and for web analytics, that means the humble tracking pixel. But here's where most tutorials get it wrong. They immediately reach for Lambda functions, introducing unnecessary complexity and cold start latencies that can impact user experience.
I had a different idea. What if we could eliminate Lambda entirely from the tracking pipeline? What if the pixel could be served directly by API Gateway and events could flow straight to SQS?
The Monorepo Foundation
First, I needed a solid foundation. Modern applications aren't single-purpose utilities anymore; they're ecosystems of interconnected services. The monorepo structure reflects this reality:
- cdk/ - The infrastructure brain, defining our entire AWS ecosystem
- api/ - A Rust-powered Lambda that serves analytics data with blazing speed
- app/ - A React dashboard that makes data beautiful and actionable
- queue-consumer/ - The workhorse Lambda that processes events reliably
We use Moon to manage this monorepo, providing consistent tooling, dependency management, and task orchestration across multiple languages. Each package has its own package.json or Cargo.toml, but Moon ties them all together with a single source of truth in moon.yml.
Rethinking the Tracking Pixel
The traditional approach to building a tracking pixel looks something like this: CloudFront → API Gateway → Lambda → SQS → Lambda → DynamoDB. That's four service hops with potential cold starts at two different points. For something that needs to respond in under 100ms to avoid impacting user experience, this seemed excessive.
Enter API Gateway's direct service integrations and Velocity Template Language (VTL). By crafting the right VTL templates, I could transform HTTP requests directly into SQS messages without ever touching a Lambda function. This isn't just a performance optimization – it's a cost optimization that could reduce tracking costs by 80% while improving reliability.
The implementation started with the SQS queue – the buffer that would absorb traffic spikes and ensure no analytics events were lost, even during the most intensive traffic periods:
// inbound queue - our traffic shock absorber
new Queue(this, "pixel-tracker-inbound-queue", {
retentionPeriod: cdk.Duration.days(3),
// the time a message stays hidden after being retried
visibilityTimeout: cdk.Duration.seconds(300),
});
Three days of retention might seem excessive for analytics events, but here's the thing about traffic spikes – they're unpredictable. During a product launch or viral moment, you might have 10x normal traffic for hours. The queue needs to handle that gracefully without losing data.
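A quick back-of-the-envelope check shows why three days is comfortable. The numbers below are assumptions for illustration (concurrency, batch size, and invocation duration are not measurements), but even a million-event backlog drains well within the retention window:

```typescript
// Rough drain-time estimate for a queue backlog, given Lambda concurrency,
// SQS batch size, and an assumed per-invocation duration (all illustrative).
function drainSeconds(
  backlog: number,
  concurrency: number,
  batchSize: number,
  secondsPerInvocation: number,
): number {
  const rounds = Math.ceil(backlog / (concurrency * batchSize));
  return rounds * secondsPerInvocation;
}

// 1M queued events, 2 concurrent consumers, 10 messages per batch, ~1s each:
const seconds = drainSeconds(1_000_000, 2, 10, 1); // 50,000s, roughly 14 hours
```

That's an order of magnitude of headroom before the three-day retention period becomes a risk.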
Next came the API Gateway configuration. The binaryMediaTypes setting was crucial – our pixel returns an actual GIF image, not JSON:
// rest api - the pixel's front door
new RestApi(this, "pixel-tracker-api", {
binaryMediaTypes: ["image/gif"],
deployOptions: {
stageName: "v1",
// comprehensive logging for debugging those tricky integration issues
},
});
API Gateway needs permission to send messages to SQS, but we can't use the typical Lambda execution role pattern. Instead, we create a dedicated service role:
// grant send message permission to api gateway
const credentialsRole = new Role(this, "pixel-tracker-api-role", {
assumedBy: new ServicePrincipal("apigateway.amazonaws.com"),
});
queue.grantSendMessages(credentialsRole);
This role would be assumed by API Gateway for every pixel request, but since role assumption is cached, the performance impact is negligible.
Next came the direct API Gateway to SQS integration. This is where VTL becomes our secret weapon. It might seem archaic compared to modern templating systems, but it's incredibly powerful for transforming HTTP requests into AWS service calls:
// The heart of our serverless pixel tracker
new AwsIntegration({
service: "sqs",
path: "/",
integrationHttpMethod: "POST",
region: cdk.Aws.REGION,
options: {
credentialsRole,
requestParameters: {
"integration.request.header.X-Amz-Target": "'AmazonSQS.SendMessage'",
"integration.request.header.Content-Type":
"'application/x-amz-json-1.0'",
},
requestTemplates: {
// VTL transforms the HTTP context into an SQS SendMessage payload.
// MessageBody must itself be a single JSON string, so the inner quotes
// are escaped, and user-supplied headers pass through
// $util.escapeJavaScript so stray quotes can't break the JSON.
"application/json":
`{"QueueUrl": "${this.queue.queueUrl}", "MessageBody": "{` +
`\\"ip\\":\\"$context.identity.sourceIp\\",` +
`\\"ts\\":\\"$context.requestTimeEpoch\\",` +
`\\"ua\\":\\"$util.escapeJavaScript($input.params('User-Agent'))\\",` +
`\\"referer\\":\\"$util.escapeJavaScript($input.params('Referer'))\\",` +
`\\"query\\":\\"$util.escapeJavaScript($input.params().querystring)\\",` +
`\\"country\\":\\"$input.params('CloudFront-Viewer-Country')\\",` +
`\\"region\\":\\"$input.params('CloudFront-Viewer-Country-Region')\\",` +
`\\"city\\":\\"$util.escapeJavaScript($input.params('CloudFront-Viewer-City'))\\",` +
`\\"timezone\\":\\"$input.params('CloudFront-Viewer-Time-Zone')\\"` +
`}"}`,
},
integrationResponses: [
{
statusCode: "200",
responseTemplates: {
// Return a real 1x1 transparent GIF
"image/gif":
"$util.base64Decode('R0lGODlhAQABAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAABAAEAAAICVAEAOw==')",
},
responseParameters: {
"method.response.header.Content-Type": "'image/gif'",
"method.response.header.Cache-Control":
"'no-cache, no-store, must-revalidate, max-age=0'",
"method.response.header.Pragma": "'no-cache'",
"method.response.header.Expires": "'0'",
"method.response.header.Access-Control-Allow-Origin": "'*'",
"method.response.header.Access-Control-Allow-Methods": "'GET'",
"method.response.header.Access-Control-Allow-Headers": "'*'",
},
},
],
},
});
That VTL template extracts the client IP, user agent, referer, query parameters, and even CloudFront geographic data, packages it all into a JSON message, and sends it directly to SQS. All without a single line of Lambda code.
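To make that concrete, here is roughly the message body a single request produces once it lands in SQS. Every value below is invented for illustration:

```typescript
// An illustrative SQS message body as emitted by the VTL template
// (all field values are made up for this example).
const sampleBody = JSON.stringify({
  ip: "203.0.113.7",
  ts: "1735689600000",
  ua: "Mozilla/5.0",
  referer: "https://example.com/blog",
  query: "{}",
  country: "DE",
  region: "BE",
  city: "Berlin",
  timezone: "Europe/Berlin",
});

// The queue consumer later parses this string back into structured fields.
const parsed = JSON.parse(sampleBody);
```

Note that everything, including the timestamp, travels as a string; the consumer is responsible for any type conversions.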
The response template returns a real 1x1 transparent GIF (that base64 string decodes to an actual GIF file), making our pixel indistinguishable from traditional tracking pixels. The caching headers ensure the pixel is never cached, guaranteeing accurate tracking.
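You can verify the claim about the base64 payload yourself: decoding it yields bytes that begin with the standard GIF89a file signature.

```typescript
// Decode the pixel payload and inspect the first six ASCII bytes,
// which hold the GIF file signature.
const pixelBase64 =
  "R0lGODlhAQABAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAABAAEAAAICVAEAOw==";
const bytes = Buffer.from(pixelBase64, "base64");
const signature = bytes.subarray(0, 6).toString("ascii"); // "GIF89a"
```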
// add the pixel resource
const resource = api.root.addResource("pixel");
resource.addMethod("GET", integration, {
methodResponses: [
{
statusCode: "200",
responseParameters: {
// Declare all headers the integration might set
"method.response.header.Content-Type": true,
"method.response.header.Cache-Control": true,
"method.response.header.Pragma": true,
"method.response.header.Expires": true,
"method.response.header.Access-Control-Allow-Origin": true,
"method.response.header.Access-Control-Allow-Methods": true,
"method.response.header.Access-Control-Allow-Headers": true,
},
},
],
});
The CloudFront Layer: Global Performance and Geographic Data
Here's where the architecture gets really interesting. Most developers think of CloudFront as just a CDN, but it's actually a powerful data enrichment layer: CloudFront automatically adds geographic headers based on the client's IP address, headers that our VTL template can capture and forward to SQS.
// CloudFront: Our global edge network and data enrichment layer
const distribution = new Distribution(this, "pixel-api-distribution", {
defaultBehavior: {
origin: new RestApiOrigin(api, {
originPath: "/v1",
}),
allowedMethods: AllowedMethods.ALLOW_GET_HEAD_OPTIONS,
cachePolicy: CachePolicy.CACHING_DISABLED, // Never cache the pixel
originRequestPolicy: new OriginRequestPolicy(
this,
"pixel-forward-all",
{
// Forward geographic data from CloudFront to API Gateway
headerBehavior: OriginRequestHeaderBehavior.allowList(
"User-Agent",
"Referer",
"CloudFront-Viewer-Country",
"CloudFront-Viewer-Country-Region",
"CloudFront-Viewer-City",
"CloudFront-Viewer-Time-Zone",
),
queryStringBehavior: OriginRequestQueryStringBehavior.all(),
},
),
responseHeadersPolicy: new ResponseHeadersPolicy(
this,
"pixel-api-cors",
{
securityHeadersBehavior: {
contentSecurityPolicy: {
contentSecurityPolicy:
"default-src 'none'; img-src * data: blob:; script-src 'unsafe-inline'; style-src 'unsafe-inline'; form-action 'none'; frame-ancestors 'none'; base-uri 'none';",
override: true,
},
contentTypeOptions: {
override: true,
},
frameOptions: {
frameOption: HeadersFrameOption.DENY,
override: true,
},
referrerPolicy: {
referrerPolicy:
HeadersReferrerPolicy.STRICT_ORIGIN_WHEN_CROSS_ORIGIN,
override: true,
},
strictTransportSecurity: {
accessControlMaxAge: cdk.Duration.seconds(63072000),
includeSubdomains: true,
preload: true,
override: true,
},
},
corsBehavior: {
accessControlAllowCredentials: false,
accessControlAllowHeaders: ["*"],
accessControlAllowMethods: ["GET"],
accessControlAllowOrigins: ["*"],
originOverride: true,
},
},
),
},
});
The security headers deserve special attention. The Content Security Policy restricts the response to essentially image content, so the pixel can be embedded in any site without the endpoint ever serving active content. The frame options prevent clickjacking, and HSTS ensures all communication happens over HTTPS.
What's remarkable about this architecture is what we've achieved: a globally distributed pixel tracker that responds in under 50ms from anywhere in the world, processes thousands of requests per second, and costs less than $1 per million requests. All without managing a single server or worrying about cold starts.
Part 2: The Event Processing Dilemma
With our lightning-fast pixel tracker shipping events to SQS, I faced a classic serverless architecture decision: how do we reliably process potentially millions of analytics events without breaking the bank or losing data during traffic spikes?
The naive approach would be to process each SQS message individually with a Lambda function. But anyone who's worked with high-frequency events knows this is a recipe for disaster. You'll hit Lambda concurrency limits, rack up enormous costs, and still risk losing data during traffic spikes.
I needed a better approach. One that could batch process events efficiently, handle failures gracefully, and scale economically. This is where the architectural choices become really interesting, and where understanding the trade-offs between different AWS services becomes crucial.
The Rust Lambda: Speed Meets Reliability
For the event processor, I chose Rust over the more common Python or Node.js. Why? Because when you're processing potentially millions of events, every millisecond and every byte of memory matters. Rust's zero-cost abstractions and memory safety make it perfect for high-throughput, low-latency serverless functions.
But Rust in Lambda isn't only about performance; it's also about reliability. The type system catches bugs at compile time that would be runtime errors in other languages. When you're processing financial or analytics data, that compile-time safety is invaluable.
use std::env;

use aws_lambda_events::event::sqs::SqsEvent;
use aws_sdk_dynamodb::types::{AttributeValue, PutRequest, WriteRequest};
use lambda_runtime::{Error, LambdaEvent};
use serde_json::Value;
use sha2::{Digest, Sha256};

pub(crate) async fn function_handler(event: LambdaEvent<SqsEvent>) -> Result<(), Error> {
// In production the client would be initialized once outside the handler;
// it's inlined here to keep the example self-contained.
let config = aws_config::load_defaults(aws_config::BehaviorVersion::latest()).await;
let client = aws_sdk_dynamodb::Client::new(&config);
let table_name: String = env::var("TABLE_NAME").expect("TABLE_NAME must be set");
let mut write_requests = Vec::new();
// Process each SQS message in the batch
for record in event.payload.records {
let body = record.body.unwrap_or_default();
let json = serde_json::from_str::<Value>(&body)?;
// Generate the same partition key hash as our API will use for queries
let referer = json.get("referer").and_then(|v| v.as_str());
let mut hasher = Sha256::new();
hasher.update(referer.unwrap_or_default().as_bytes());
let pk = format!("{:x}", hasher.finalize());
// Use timestamp as sort key for chronological ordering
let sk = json["ts"].as_str().unwrap_or_default().to_string();
let put_request = PutRequest::builder()
.item("pk", AttributeValue::S(pk))
.item("sk", AttributeValue::S(sk))
.item("referer", AttributeValue::S(referer.unwrap_or_default().to_string()))
.item("ip", AttributeValue::S(json["ip"].as_str().unwrap_or_default().to_string()))
.item("ua", AttributeValue::S(json["ua"].as_str().unwrap_or_default().to_string()))
.item("query", AttributeValue::S(json["query"].as_str().unwrap_or_default().to_string()))
.build()?;
write_requests.push(WriteRequest::builder().put_request(put_request).build());
}
// Batch write to DynamoDB - up to 25 items per request
client.batch_write_item()
.request_items(&table_name, write_requests)
.send()
.await?;
Ok(())
}
The beauty of this function is its simplicity and efficiency. It processes up to 10 SQS messages in a single invocation (configurable), transforms them into DynamoDB items, and writes them in a single batch operation. This batching reduces both DynamoDB costs and execution time.
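One caveat worth noting: BatchWriteItem accepts at most 25 items per call. With an SQS batch size of 10 the single call above is safe, but if you ever raise the batch size you'd need to chunk the writes, along these lines:

```typescript
// Split pending write requests into DynamoDB-sized chunks
// (BatchWriteItem accepts at most 25 items per call).
function chunkWrites<T>(writes: T[], maxPerBatch = 25): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < writes.length; i += maxPerBatch) {
    batches.push(writes.slice(i, i + maxPerBatch));
  }
  return batches;
}

// 60 queued writes become three BatchWriteItem calls: 25 + 25 + 10.
const batches = chunkWrites(Array.from({ length: 60 }, (_, i) => i));
```

A production version would also retry any UnprocessedItems DynamoDB returns, but the chunking itself is this simple.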
Building Resilience: The Art of Graceful Failure
Here's where years of production experience with serverless systems pays off. Most tutorials show you how to build the happy path, but production systems live in the land of partial failures, traffic spikes, and cascading outages. Building a system that degrades gracefully under load is an art form.
The key insight is that analytics events are inherently fault-tolerant. Losing a few page views during a traffic spike is acceptable, but losing all of them isn't. This tolerance for some data loss allows us to build much more resilient systems by implementing proper backpressure and circuit breakers.
// Dead-letter queue for events that repeatedly fail processing
const dlq = new Queue(this, "analytics-consumer-dlq", {
retentionPeriod: cdk.Duration.days(14),
removalPolicy: cdk.RemovalPolicy.DESTROY,
});
// SQS configuration - the redrive policy lives on the source queue
const queue = new Queue(this, "analytics-inbound-queue", {
retentionPeriod: cdk.Duration.days(3),
visibilityTimeout: cdk.Duration.seconds(300), // 5 minutes
deadLetterQueue: {
queue: dlq,
maxReceiveCount: 3, // Retry failed messages 3 times before parking them
},
});
const consumer = new RustFunction(this, "analytics-consumer", {
// ... other config
timeout: cdk.Duration.seconds(10),
memorySize: 256,
reservedConcurrentExecutions: 2, // Limit concurrent executions
});
// Configure SQS as event source with batch processing
consumer.addEventSource(new SqsEventSource(queue, {
batchSize: 10, // Process up to 10 messages per invocation
maxBatchingWindow: cdk.Duration.seconds(5),
}));
The Road Not Taken: Alternative Architectures
Before settling on SQS + Lambda, I explored several other approaches, each with compelling trade-offs that taught me valuable lessons about serverless architecture patterns.
S3 + Athena: The Data Lake Dream
// Example S3 + Athena setup (not implemented)
const bucket = new s3.Bucket(this, "analytics-data-lake");
const database = new glue.Database(this, "analytics-database");
This approach would dump all events to S3 and use Athena for queries. Great for historical analysis and complex aggregations, but terrible for the real-time dashboard experience users expect. Plus, Athena queries can take 10-30 seconds to start, making interactive dashboards feel sluggish.
Kinesis Firehose: The Streaming Specialist
// Example Kinesis setup (not implemented)
const deliveryStream = new kinesisfirehose.DeliveryStream(this, "analytics-stream");
Kinesis Firehose excels at high-throughput streaming and automatic S3 delivery, but it felt like overkill for our moderate traffic volumes. The added complexity wasn't justified by the benefits, and the minimum costs would dwarf our actual usage.
The Goldilocks Solution
SQS + Lambda hit the sweet spot: simple enough to understand and debug, cost-effective for variable workloads, and performant enough for real-time dashboards. Sometimes the best architecture is the boring one that just works reliably.
Part 3: Making Data Beautiful - The Dashboard Journey
Building the infrastructure was the hard part, but making analytics data actionable required solving an entirely different set of challenges. How do you present thousands of events in a way that tells a story? How do you make the dashboard fast enough that people actually want to use it?
This is where the magic of modern React development shines. The combination of TanStack's ecosystem – Router for navigation, Query for server state, and Table for data presentation – creates an experience that rivals any commercial analytics platform.
The Rust API: Fast Queries at Scale
First, I needed an API that could query DynamoDB efficiently and serve data to the React frontend. Rust's performance characteristics make it a great fit: the handler itself adds negligible overhead, so response times are dominated by DynamoDB's single-digit-millisecond queries even when returning thousands of records.
use aws_sdk_dynamodb::types::AttributeValue;
use axum::{
extract::{Path, State},
http::StatusCode,
response::IntoResponse,
Json,
};
use sha2::{Digest, Sha256};

#[derive(Clone)]
struct AppState {
dynamodb_client: aws_sdk_dynamodb::Client,
table_name: String,
}
async fn get_data(Path(referer): Path<String>, State(state): State<AppState>) -> impl IntoResponse {
// The referer arrives percent-encoded in the URL path; decode it before
// hashing so the partition key matches what the queue consumer computed.
let decoded_referer = urlencoding::decode(&referer).unwrap_or_default();
let mut hasher = Sha256::new();
hasher.update(decoded_referer.as_bytes());
let pk = format!("{:x}", hasher.finalize());
let resp = state.dynamodb_client
.query()
.table_name(&state.table_name)
.key_condition_expression("pk = :pk")
.expression_attribute_values(":pk", AttributeValue::S(pk))
.scan_index_forward(false) // Latest first
.send()
.await;
match resp {
Ok(output) => {
let items = output.items();
let analytics_data: Vec<AnalyticsData> = items.iter()
.map(|item| AnalyticsData::from_dynamodb(item))
.collect();
Json(analytics_data).into_response()
}
Err(e) => {
tracing::error!("DynamoDB query failed: {:?}", e);
StatusCode::INTERNAL_SERVER_ERROR.into_response()
}
}
}
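The crucial invariant here is that the consumer and the API hash the referer identically; if they ever diverge, queries silently return nothing. The scheme is just a hex-encoded SHA-256, easy to reproduce anywhere, sketched below in TypeScript with Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// Derive the DynamoDB partition key the same way the Rust consumer does:
// the hex-encoded SHA-256 digest of the referer string.
function partitionKey(referer: string): string {
  return createHash("sha256").update(referer).digest("hex");
}

const pk = partitionKey("https://example.com"); // 64 hex chars, stable per site
```

Because the key is deterministic, any tool in the stack can locate a site's partition without a lookup table.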
The React Dashboard: Where Data Meets Experience
Building the frontend was where all the infrastructure work paid off. Modern React development has evolved dramatically. The days of prop drilling and manual state management are behind us. TanStack Query handles server state so elegantly that caching, background refetching, and error handling become invisible.
// API client with intelligent caching and retries
const apiClient = new ApiClient();
export function useAnalyticsData(referer: string) {
return useQuery({
queryKey: ["analytics", referer],
queryFn: () => apiClient.getAnalyticsData(referer),
staleTime: 30 * 60 * 1000, // Fresh for 30 minutes
refetchInterval: 15 * 60 * 1000, // Background refresh every 15 minutes
retry: 3, // Resilient to network issues
});
}
// The data table that makes analytics beautiful
export function AnalyticsDataTable() {
const [globalFilter, setGlobalFilter] = useState("");
const { data, isLoading, error } = useAnalyticsData("https://example.com");
const table = useReactTable({
data: data || [],
columns: analyticsColumns,
state: { globalFilter },
onGlobalFilterChange: setGlobalFilter,
getCoreRowModel: getCoreRowModel(),
getFilteredRowModel: getFilteredRowModel(),
getPaginationRowModel: getPaginationRowModel(),
});
if (isLoading) return <div className="animate-pulse">Loading analytics...</div>;
if (error) return <div className="text-red-500">Error: {error.message}</div>;
return (
<div className="space-y-4">
<Input
placeholder="Search across all data..."
value={globalFilter}
onChange={(e) => setGlobalFilter(e.target.value)}
className="max-w-sm"
/>
<DataTable table={table} />
<DataTablePagination table={table} />
</div>
);
}
What makes this special is the user experience. Data loads instantly from the cache, updates automatically in the background, and the search/filter operations happen locally for instant feedback. It's the kind of smooth interaction that users expect from modern web applications.
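The timing values in the hook are deliberate: the background refetch interval is half the stale time, so a successful refresh always lands before the cache expires and returning users never see a spinner.

```typescript
// The dashboard's cache timings, as configured in the useAnalyticsData hook.
const staleTimeMs = 30 * 60 * 1000;       // data considered fresh for 30 minutes
const refetchIntervalMs = 15 * 60 * 1000; // background refresh every 15 minutes

// Refetches fire twice per stale window, so cached data is replaced
// before it ever goes stale.
const refreshBeatsExpiry = refetchIntervalMs < staleTimeMs;
```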
Lessons Learned: What This Architecture Teaches Us
After months of running this system in production, I've learned valuable lessons about serverless architecture that extend far beyond analytics.
The Power of Boring Technology
Scalability Through Decoupling: Each component scales independently, which sounds obvious until you realize how rare this is in practice. The pixel tracker can handle 10,000 RPS while the dashboard serves 10 users – each pays only for what it uses.
Reliability Through Redundancy: Every failure mode has a graceful degradation path. SQS acts as our circuit breaker, Lambda provides our processing capacity, and DynamoDB gives us consistent performance.
Cost Efficiency Through Smart Defaults: When every Lambda invocation costs money, you naturally batch operations and optimize for efficiency.
Observability as a First-Class Citizen: CloudWatch dashboards and alarms aren't afterthoughts. You can't manage what you can't measure, especially in distributed systems.
Developer Experience Through Type Safety: The combination of Rust for performance-critical paths and TypeScript for application logic creates a development environment where entire classes of bugs simply can't exist. Runtime errors become compile-time errors.
The Real Victory: Simplicity
The most important lesson? The best architecture is often the simplest one that meets your requirements. We could have built a complex streaming analytics platform with Kafka, Spark, and Kubernetes. Instead, we built something maintainable, understandable, reliable, and open to future enhancements using managed AWS services.
This system processes millions of events monthly, costs less than $10 to operate, and has never lost data or had meaningful downtime. Sometimes the boring solution is the right solution.
What's Next?
This architecture provides a solid foundation, but analytics systems are never "done." Future enhancements might include:
- Real-time aggregations for instant insights
- Machine learning for anomaly detection
- Geographic visualization for global traffic patterns
The beauty of this serverless foundation is that each enhancement can be added incrementally without disrupting existing functionality. That's the power of well-designed, loosely coupled systems.
Whether you're building analytics, processing IoT events, or handling financial transactions, these patterns – direct service integrations, event-driven processing, and modern frontend techniques – provide a blueprint for building robust, scalable applications that can grow with your needs.
Ready to build your own analytics system? The complete source code and deployment instructions are available on GitHub. Start tracking, start learning, and most importantly, start building.