Observability Guide

This guide documents the monitoring, debugging, and analytics infrastructure for GitTinkerer. It covers error capture via Sentry, metrics collection and export, run artifact inspection, Redis caching, rate limiting, and structured logging patterns.

Quick Start

Key Monitoring Tools

| Tool | Purpose | Location |
|------|---------|----------|
| Sentry | Error tracking and issue aggregation | sentry.io |
| Redis | Run status caching, rate limit counters | REDIS_HOST:REDIS_PORT |
| Metrics DB | Token usage, timing, completion metrics | PostgreSQL run_metrics table |
| Analytics API | Metrics aggregation and CSV export | /api/analytics/* endpoints |
| Artifacts | Run logs, diffs, PR comments | artifacts/<TIMESTAMP>/ |
| Logs | Structured request/error logs | Fastify logger (Pino) |

Environment Variables for Observability

# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1  # Fraction of transactions to trace (0.0-1.0)

# Redis cache and rate limiting
REDIS_HOST=localhost
REDIS_PORT=6379

# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here

# Rate limiting
RATE_LIMIT_MAX=5        # Max requests per window
RATE_LIMIT_WINDOW=60    # Window duration in seconds

# Logging
NODE_ENV=production     # Controls log level: info in production, debug in development

1. Sentry Integration: Error Capture

GitTinkerer captures errors at multiple layers: TypeScript service, bash scripts, and webhook handling.

Service Layer Error Capture

Location: service/src/infra/sentry/

The SentryService class initializes and manages Sentry integration:

// Initialization on service startup
const sentry = new SentryService(env.SENTRY_DSN, {
  environment: env.SENTRY_ENVIRONMENT,
  tracesSampleRate: env.SENTRY_TRACES_SAMPLE_RATE
});

// Global error hook captures unhandled exceptions
app.setErrorHandler(async (error, request, reply) => {
  sentry.captureException(error, {
    requestId: request.id,
    path: request.url
  });
});

// Request context tracking
sentry.withContext({
  userId: conversationId,
  metadata: { repo, pr_number }
});

Key Methods:

  • captureException(error, context) — Send exception with metadata (requestId, path)
  • captureMessage(msg, level, context) — Log info/warning/error with context
  • recordBreadcrumb(action, data) — Record event for post-hoc debugging
  • setUser(userId) — Associate subsequent errors with user
  • withContext(data) — Add arbitrary key-value context to error reports
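
For orientation, here is a minimal sketch of what such a wrapper might look like on top of @sentry/node. The internals are illustrative assumptions; the real class in service/src/infra/sentry/ may differ in detail.

// Illustrative sketch only: a thin wrapper over @sentry/node matching the
// method list above. Not the actual implementation.
import * as Sentry from '@sentry/node';

interface SentryOptions {
  environment: string;
  tracesSampleRate: number;
}

export class SentryService {
  private readonly enabled: boolean;

  constructor(dsn: string | undefined, options: SentryOptions) {
    this.enabled = Boolean(dsn);
    if (this.enabled) {
      Sentry.init({ dsn, ...options });
    }
  }

  isEnabled(): boolean {
    return this.enabled;
  }

  captureException(error: unknown, context: Record<string, unknown> = {}): void {
    if (!this.enabled) {
      // Disabled mode: no-op toward Sentry, but keep a console trace (see below).
      console.error('[sentry] (disabled) captureException:', error);
      return;
    }
    Sentry.captureException(error, { extra: context });
  }

  captureMessage(msg: string, level: 'info' | 'warning' | 'error'): void {
    if (!this.enabled) return;
    Sentry.captureMessage(msg, level);
  }

  recordBreadcrumb(action: string, data: Record<string, unknown>): void {
    if (!this.enabled) return;
    Sentry.addBreadcrumb({ category: action, data });
  }

  setUser(userId: string): void {
    if (!this.enabled) return;
    Sentry.setUser({ id: userId });
  }

  withContext(data: Record<string, unknown>): void {
    if (!this.enabled) return;
    Sentry.setContext('metadata', data);
  }
}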

Bash Layer Error Capture

Location: lib/sentry.sh

Bash scripts send errors directly to Sentry API:

source lib/sentry.sh

sentry_capture_message "Deploy failed" "error" "production"

Payload Structure:

{
  "message": "Deploy failed",
  "level": "error",
  "timestamp": 1672743986,
  "tags": {
    "stage": "production",
    "run_id": "<PAYLOAD_RUN_ID>",
    "repo": "owner/repo",
    "pr_number": "123",
    "source": "workflow|webhook"
  },
  "user": {
    "id": "<PAYLOAD_WEB_CONVERSATION_ID>",
    "username": "<USER>"
  },
  "extra": {
    "payload_source": "github",
    "stage": "production"
  }
}

Authentication Header:

X-Sentry-Auth: Sentry sentry_key=<key>, sentry_version=7, sentry_timestamp=<ts>, sentry_client=gittinkerer-cli/1.0
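
For reference, the store endpoint and auth header shown above can be derived from the DSN. The bash implementation in lib/sentry.sh does this with string manipulation and curl; the TypeScript sketch below is an illustrative equivalent, not the actual code.

// Illustrative sketch: derive the store endpoint and X-Sentry-Auth header from a DSN
// of the form https://<key>@<host>/<projectId>.
function buildSentryRequest(dsn: string) {
  const url = new URL(dsn);
  const key = url.username;                                   // public key portion of the DSN
  const projectId = url.pathname.replace(/^\//, '');
  const endpoint = `${url.protocol}//${url.host}/api/${projectId}/store/`;
  const timestamp = Math.floor(Date.now() / 1000);
  const authHeader =
    `Sentry sentry_key=${key}, sentry_version=7, ` +
    `sentry_timestamp=${timestamp}, sentry_client=gittinkerer-cli/1.0`;
  return {
    endpoint,
    headers: { 'X-Sentry-Auth': authHeader, 'Content-Type': 'application/json' },
  };
}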

Error Capture Flow Diagram

graph TD
    A["Service Request"] --> B{"Error Occurs?"}
    B -->|Yes| C["Global Error Hook<br/>captureException"]
    C --> D["Add Request Context<br/>requestId, path, userId"]
    D --> E["SentryService.captureException"]
    E --> F["Sentry API<br/>POST /api/&lt;project_id&gt;/store"]
    F --> G["Sentry Dashboard<br/>Error Aggregation"]

    H["Bash Script"] --> I{"Error/Exit?"}
    I -->|Yes| J["sentry_capture_message"]
    J --> K["Build API Endpoint<br/>from DSN"]
    K --> L["HTTP POST<br/>with Auth Header"]
    L --> F

    M["Webhook Handler<br/>GitHub Event"]
    M --> N{"Validation Fails?"}
    N -->|Yes| O["captureException<br/>with webhookId"]
    O --> E

    style G fill:#4CAF50
    style F fill:#2196F3
    style A fill:#FFC107
    style H fill:#FFC107
    style M fill:#FFC107

Configuring Sentry

File: service/src/config/env.ts

SENTRY_DSN: z.string().url().optional(),           // Project DSN
SENTRY_ENVIRONMENT: z.enum(['development', 'production']).default('production'),
SENTRY_TRACES_SAMPLE_RATE: z.number().min(0).max(1).default(0.1)

To Disable Sentry (e.g., local development):

  • Leave SENTRY_DSN unset
  • The service will no-op all Sentry calls but still log errors via console.error

Monitoring Sentry Alerts

Key Metrics to Watch:

  • Error Rate: Spikes > 10% above baseline
  • New Issues: Watch for newly created issues and those flagged "Regressed"
  • Affected Users: If userId context is set, track which users see errors
  • Error Distribution: By stage tag (production, staging) and repo tag

Set Alerts For:

  1. High Error Volume: 50+ errors in 5 minutes
  2. Critical Path Failures: Issues in handleIssueComment, handlePullRequest usecases
  3. Infrastructure Errors: Redis connection failures, database timeouts
  4. Rate Limit Breaches: Accumulation of 429 responses in webhook handler

2. Metrics Collection and Storage

GitTinkerer collects performance metrics at both bash and service layers, storing them in PostgreSQL for analysis.

Bash-Level Metrics (lib/metrics.sh)

Token Approximation:

# Estimates tokens from text
approximate_token_count "Long text here"
# Formula: max(word_count, ceil(char_count / 4))

Diff Metrics:

# Counts additions and removals from unified diff
calculate_diff_loc
# Additions: lines matching ^(\+[^+]|\ No newline at end of file)
# Removals: lines matching ^-[^-]
# Returns: total_loc (additions + removals)
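
The same arithmetic, re-expressed as an illustrative TypeScript sketch for readers who prefer it over bash. The real implementations live in lib/metrics.sh; the diff counter below is simplified and ignores the "\ No newline at end of file" case that the bash regex also matches.

// Illustrative re-expression of the bash helpers; not the actual implementation.
function approximateTokenCount(text: string): number {
  const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
  const charEstimate = Math.ceil(text.length / 4);
  return Math.max(wordCount, charEstimate);   // max(word_count, ceil(char_count / 4))
}

function calculateDiffLoc(unifiedDiff: string): number {
  const lines = unifiedDiff.split('\n');
  // Count added/removed lines, skipping the +++/--- file headers.
  const additions = lines.filter((l) => /^\+[^+]/.test(l)).length;
  const removals = lines.filter((l) => /^-[^-]/.test(l)).length;
  return additions + removals;                // total_loc
}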

Metrics Queue:

# Queue metrics in memory during run
record_metric "tokens_used" 1250
record_metric "files_changed" 3
record_metric "duration_ms" 5432

# Flush all queued metrics to API after run completes
flush_metrics
# POST to $METRICS_API_URL with Bearer token

Service-Level Metrics (service/src/usecases/metrics/)

Supported Metric Names:

| Metric | Unit | Meaning |
|--------|------|---------|
| duration_ms | milliseconds | Total execution time |
| exit_code | integer | Process exit code (0 = success) |
| tokens_used | count | LLM tokens consumed by completion |
| files_changed | count | Number of files modified in diff |
| cost_usd | USD | Computed cost (tokens_used × 0.002 / 1000) |

Recording Metrics:

// In usecases that need to track performance
import { MetricsService } from '../../services/metrics';

const metrics = new MetricsService(pool); // PostgreSQL pool

await metrics.recordMetric(runId, 'duration_ms', 5432);
await metrics.recordMetric(runId, 'tokens_used', 1250);
await metrics.recordMetric(runId, 'cost_usd', 2.50);
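
A minimal sketch of what recordMetric might look like, assuming the pg Pool passed to the constructor and the run_metrics schema shown in the next subsection; the actual MetricsService may differ.

// Illustrative sketch only.
import { Pool } from 'pg';

export class MetricsService {
  constructor(private readonly pool: Pool) {}

  async recordMetric(runId: string, metricName: string, metricValue: number): Promise<void> {
    // One row per metric; recorded_at defaults to now() in the schema below.
    await this.pool.query(
      `INSERT INTO run_metrics (run_id, metric_name, metric_value)
       VALUES ($1, $2, $3)`,
      [runId, metricName, metricValue]
    );
  }
}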

Database Schema

-- Metrics storage
CREATE TABLE run_metrics (
  id BIGSERIAL PRIMARY KEY,
  run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE,
  metric_name TEXT NOT NULL,
  metric_value NUMERIC,
  recorded_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX run_metrics_run_idx ON run_metrics (run_id);
CREATE INDEX run_metrics_name_idx ON run_metrics (metric_name);
CREATE INDEX run_metrics_recorded_at_idx ON run_metrics (recorded_at DESC);

Metrics Collection Flow Diagram

graph LR
    A["Bash Script<br/>bin/gittinkerer"] --> B["record_metric<br/>Queue in Memory"]
    B --> C["Calculate Metrics<br/>tokens, diff LOC, duration"]
    C --> D["Run Completes"]
    D --> E["flush_metrics<br/>POST to API"]
    E --> F["Service Handler<br/>/api/metrics/record"]

    G["Service Usecase"] --> H["MetricsService.recordMetric"]
    H --> I["INSERT run_metrics<br/>PostgreSQL"]

    F --> I

    J["Database"] --> K["run_metrics Table"]
    I --> K

    style I fill:#4CAF50
    style K fill:#2196F3
    style E fill:#FF9800
    style F fill:#FF9800

3. Analytics and CSV Export

The analytics API provides aggregation and export of metrics for operational analysis.

Analytics Endpoints

Endpoint: GET /api/analytics/metrics

Aggregate metrics across runs:

GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg&from=2026-01-01&to=2026-01-07

Response:

{
  "metric": "duration_ms",
  "aggregation": "avg",
  "value": 4852.5,
  "count": 24,
  "min": 1200,
  "max": 8950
}

Supported Aggregations: sum, avg, min, max, count, stddev
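
One way the aggregation endpoint could map the aggregation parameter onto SQL is sketched below (illustrative only; the actual handler may differ). The whitelist keeps user input out of the generated SQL, since only the metric name and date range are bound as parameters.

// Illustrative sketch of an aggregation query builder.
const AGGREGATIONS: Record<string, string> = {
  sum: 'SUM', avg: 'AVG', min: 'MIN', max: 'MAX', count: 'COUNT', stddev: 'STDDEV',
};

function buildAggregationQuery(aggregation: string): string {
  const fn = AGGREGATIONS[aggregation];
  if (!fn) throw new Error(`Unsupported aggregation: ${aggregation}`);
  // Caller binds [metricName, from, to] as $1..$3.
  return `
    SELECT ${fn}(metric_value) AS value,
           COUNT(*)            AS count,
           MIN(metric_value)   AS min,
           MAX(metric_value)   AS max
    FROM run_metrics
    WHERE metric_name = $1
      AND recorded_at BETWEEN $2 AND $3`;
}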

CSV Export Endpoint

Endpoint: GET /api/analytics/export

Export raw metrics in CSV format:

GET /api/analytics/export?metricName=duration_ms&format=csv&limit=1000

CSV Output:

id,run_id,metric_name,metric_value,recorded_at
1,550e8400-e29b-41d4-a716-446655440000,duration_ms,5432,2026-01-07T14:32:15.000Z
2,550e8400-e29b-41d4-a716-446655440001,duration_ms,4120,2026-01-07T14:35:22.000Z
3,550e8400-e29b-41d4-a716-446655440002,tokens_used,1250,2026-01-07T14:38:10.000Z

Key Analytics Queries for Operators

Monitor Average Completion Time:

GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg

Track Token Usage Trends:

GET /api/analytics/metrics?metricName=tokens_used&aggregation=sum&from=<yesterday>&to=<today>

Export Last 1000 Runs for Cost Analysis:

GET /api/analytics/export?metricName=cost_usd&format=csv&limit=1000

Identify Slow Runs:

GET /api/analytics/export?metricName=duration_ms&format=csv&limit=100
// Then filter locally for duration_ms > 10000

Analytics Flow Diagram

graph TD
    A["Metrics Recorded<br/>run_metrics Table"] --> B["Operator Query<br/>GET /api/analytics/*"]
    B --> C{"Export or<br/>Aggregate?"}

    C -->|Export CSV| D["SELECT run_id, metric_name,<br/>metric_value, recorded_at"]
    D --> E["Format as CSV<br/>Send with Content-Type: text/csv"]
    E --> F["Operator<br/>Import to Analysis Tool"]

    C -->|Aggregate| G["SELECT metric_name,<br/>Aggregation Function"]
    G --> H["Calculate:<br/>sum, avg, min, max, count"]
    H --> I["Return JSON<br/>with Statistics"]
    I --> F

    F --> J["Analysis<br/>Dashboard, Alerts"]

    style A fill:#2196F3
    style J fill:#4CAF50
    style E fill:#FF9800
    style I fill:#FF9800

4. Run Artifacts and Debugging Failed Runs

Each run produces timestamped artifacts containing logs, diffs, and metadata for post-mortem analysis.

Artifacts Directory Structure

artifacts/
├── 20260103T142626Z/                # Timestamp: YYYYMMDDTHHMMSSZ
│   ├── payload.json                 # Original GitHub webhook payload
│   ├── run.json                     # Run metadata and status
│   └── agent-run/
│       ├── prompt.txt               # Full prompt sent to LLM
│       ├── diff.patch               # Unified diff of proposed changes
│       ├── files_changed.json       # Array of modified files
│       ├── pr_comment.txt           # Comment posted to PR
│       ├── summary.md               # Human-readable summary
│       └── commit_sha.txt           # Committed SHA (if successful)
└── runs/
    └── github-deliveries/           # Webhook delivery logs

Run Metadata (run.json)

{
  "run_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "succeeded",
  "timestamp": "20260103T142626Z",
  "payload_path": "/path/to/artifacts/payload.json",
  "artifacts_dir": "/path/to/artifacts/20260103T142626Z/",
  "started_at": "2026-01-03T14:26:26.695Z",
  "finished_at": "2026-01-03T14:26:50.476Z",
  "exit_code": 0
}

Debugging Failed Runs

Step 1: Check Run Status

Query the runs table:

SELECT run_id, status, source, created_at, finished_at
FROM runs
WHERE status = 'failed'
ORDER BY created_at DESC
LIMIT 10;

Step 2: Review Artifacts

# Find artifacts by run_id or timestamp
ls -la artifacts/20260103T142626Z/

# Check run metadata
cat artifacts/20260103T142626Z/run.json

# Review diff that was proposed (if generation succeeded)
cat artifacts/20260103T142626Z/agent-run/diff.patch

# Check error details in run.json or Sentry

Step 3: Check Sentry for Errors

Look up errors by run_id tag:

https://sentry.io/organizations/your-org/issues/?query=tags.run_id:[run_id]

Step 4: Analyze Common Failure Points

| Failure Point | Check | Resolution |
|---------------|-------|------------|
| Payload validation | payload.json structure, required fields | Verify webhook event type (issue, pull_request) |
| LLM timeout | agent-run/prompt.txt, Sentry timeout errors | Check LLM API status, increase timeout |
| Git operations | Check exit_code in run.json, Sentry tags | Verify SSH keys, branch permissions |
| PR comment posting | Check pr_comment.txt exists, Sentry 403 errors | Verify GitHub token scope, PR state |
| Rate limiting | Check metrics for 429 responses | Monitor /api/rate-limit endpoint |

Artifacts Flow Diagram

graph TD
    A["GitHub Event<br/>Issue/PR Comment"] --> B["Create Run<br/>run_id, timestamp"]
    B --> C["Create artifacts/<br/>TIMESTAMP/ Directory"]

    C --> D["Store payload.json<br/>Original Webhook"]
    C --> E["Store run.json<br/>Metadata, Status"]

    F["Execute Agent"] --> G["Generate LLM Response"]
    G --> H["Create agent-run/<br/>subdirectory"]
    H --> I["Store prompt.txt<br/>diff.patch<br/>files_changed.json"]

    F --> J{"Success?"}
    J -->|Yes| K["Store pr_comment.txt<br/>commit_sha.txt"]
    J -->|No| L["Store Error Message<br/>Send to Sentry"]

    K --> M["Update run.json<br/>status=succeeded"]
    L --> N["Update run.json<br/>status=failed"]

    O["Operator Debugging"] --> P["Query runs Table"]
    P --> Q["Inspect artifacts/<br/>TIMESTAMP/"]
    Q --> R["Check run.json Status"]
    R --> S["Review Sentry<br/>by run_id Tag"]

    style B fill:#2196F3
    style M fill:#4CAF50
    style N fill:#F44336
    style S fill:#FF9800

5. Redis Cache: Run Status and Rate Limiting

Redis serves dual roles: caching run status for rapid polling and enforcing rate limits on webhook processing.

Run Status Caching

Location: service/src/services/redis.ts

Cache Configuration:

const redis = new Redis({
  host: env.REDIS_HOST,
  port: env.REDIS_PORT,
  retryStrategy: (times) => Math.min(times * 50, 5000),
  disableOfflineQueue: true  // Fail fast if disconnected
});

Cache Operations:

// Set cached status with 24-hour TTL
const TTL_SECONDS = 86400;
await redis.setex(
  `run-status:${runId}`,
  TTL_SECONDS,
  JSON.stringify({ status, timestamp })
);

// Get cached status (avoids database hit)
const cached = await redis.get(`run-status:${runId}`);

// Delete when run is archived
await redis.del(`run-status:${runId}`);

Cache Key Pattern: run-status:<runId>

TTL: 86,400 seconds (24 hours)

Usecase: service/src/usecases/runs/getRun.ts checks the cache before querying the database during rapid polling, reducing DB load during status-check storms.
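
An illustrative cache-first lookup in that spirit (assumptions: an ioredis client, a pg Pool, and the runs table columns shown elsewhere in this guide; the real getRun.ts may differ):

// Illustrative sketch only.
import Redis from 'ioredis';
import { Pool } from 'pg';

const redis = new Redis({
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
});
const pool = new Pool();   // connection settings come from the standard PG* env vars

async function getRunStatus(runId: string): Promise<{ status: string; timestamp?: string }> {
  const cached = await redis.get(`run-status:${runId}`);
  if (cached) {
    return JSON.parse(cached);                         // cache hit: skip the database
  }

  const { rows } = await pool.query(
    'SELECT status, finished_at FROM runs WHERE run_id = $1',
    [runId]
  );
  if (rows.length === 0) throw new Error(`Run not found: ${runId}`);

  const result = { status: rows[0].status, timestamp: rows[0].finished_at };
  await redis.setex(`run-status:${runId}`, 86400, JSON.stringify(result));  // repopulate, 24h TTL
  return result;
}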

Monitoring Cache Hit Rate

Metrics to Track:

| Metric | Query | Target |
|--------|-------|--------|
| Cache Hits | keyspace_hits from INFO stats | > 80% of queries |
| Cache Misses | keyspace_misses from INFO stats | < 20% of queries |
| Evictions | evicted_keys from INFO stats | Should be 0 (TTL-based expiry) |
| Memory Usage | used_memory_human from INFO memory | < 512 MB for run cache |
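
A small operator script (illustrative, assuming ioredis) can compute the hit rate from INFO stats directly:

// Illustrative sketch: compute the cache hit rate from INFO stats.
import Redis from 'ioredis';

async function cacheHitRate(): Promise<number> {
  const redis = new Redis({
    host: process.env.REDIS_HOST ?? 'localhost',
    port: Number(process.env.REDIS_PORT ?? 6379),
  });
  const stats = await redis.info('stats');          // returns "key:value" lines
  const read = (name: string) =>
    Number((stats.match(new RegExp(`${name}:(\\d+)`)) ?? [])[1] ?? 0);
  const hits = read('keyspace_hits');
  const misses = read('keyspace_misses');
  await redis.quit();
  return hits + misses === 0 ? 0 : hits / (hits + misses);
}

cacheHitRate().then((rate) => console.log(`cache hit rate: ${(rate * 100).toFixed(1)}%`));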

Redis Commands for Operators:

# Connect to Redis
redis-cli -h <REDIS_HOST> -p <REDIS_PORT>

# Check cache statistics
> INFO stats

# Monitor memory usage
> INFO memory

# Find all run-status keys
> KEYS run-status:*

# Check TTL of a specific key
> TTL run-status:550e8400-e29b-41d4-a716-446655440000

# Clear all run cache (if needed for maintenance)
> EVAL "return redis.call('del', unpack(redis.call('keys', 'run-status:*')))" 0

Rate Limiting

Location: service/src/services/redis.ts, service/src/middleware/rateLimit.ts

Two-Layer Rate Limiting Architecture:

Layer 1: Global IP-Based (Fastify Plugin)

// Registered in server.ts
app.register(fastifyRateLimit, {
  max: env.RATE_LIMIT_MAX,      // 5 requests per window
  timeWindow: `${env.RATE_LIMIT_WINDOW}s`  // 60 seconds
});

Returns HTTP 429 with headers:

  • Retry-After: <seconds>
  • X-RateLimit-Limit: 5
  • X-RateLimit-Remaining: 0

Layer 2: Per-Repository Rate Limiting

// Custom check in usecases/github/handleIssueComment.ts
const result = await rateLimitService.checkLimit(repoName);

if (!result.allowed) {
  return {
    status: 429,
    error: "Rate limited per repository",
    retryAfter: result.retryAfter,
    remainingRequests: result.remaining
  };
}

Rate Limit Configuration:

RATE_LIMIT_MAX=5        # Max requests per window
RATE_LIMIT_WINDOW=60    # Window duration in seconds
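
For illustration, a fixed-window per-repository limiter like the checkLimit call above can be built from Redis INCR/EXPIRE as sketched below. The rate-limit key pattern is hypothetical and the actual RateLimitService implementation may differ.

// Illustrative sketch only: fixed-window limiter on Redis INCR/EXPIRE.
import Redis from 'ioredis';

const WINDOW_SECONDS = Number(process.env.RATE_LIMIT_WINDOW ?? 60);
const MAX_REQUESTS = Number(process.env.RATE_LIMIT_MAX ?? 5);

export async function checkLimit(redis: Redis, repoName: string) {
  const key = `rate-limit:${repoName}`;              // hypothetical key pattern
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, WINDOW_SECONDS);         // start the window on first hit
  }
  const ttl = await redis.ttl(key);
  return {
    allowed: count <= MAX_REQUESTS,
    remaining: Math.max(0, MAX_REQUESTS - count),
    retryAfter: ttl > 0 ? ttl : WINDOW_SECONDS,
  };
}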

Rate Limiting Flow Diagram

graph TD
    A["Webhook Arrives<br/>GitHub Event"] --> B["Global Rate Limit<br/>Check IP"]
    B --> C{"IP Limit<br/>Exceeded?"}
    C -->|Yes| D["Return 429<br/>Retry-After Header"]
    C -->|No| E["Route to Handler<br/>handleIssueComment"]

    E --> F["Per-Repo Rate Limit<br/>Check Repository"]
    F --> G{"Repo Limit<br/>Exceeded?"}
    G -->|Yes| H["Return 429<br/>Custom Response"]
    G -->|No| I["Process Webhook<br/>Increment Counter"]

    D --> J["Client Backoff"]
    H --> J
    J --> K["Retry After Window"]
    K --> A

    I --> L["Record Metric<br/>webhook_processed"]
    L --> M["Send Response<br/>to GitHub"]

    style D fill:#F44336
    style H fill:#F44336
    style M fill:#4CAF50
    style I fill:#FFC107

Monitoring Rate Limits

Alerts for Operators:

  1. Sustained 429 Responses: > 10 in 5-minute window → Check for coordinated webhook deliveries
  2. Per-Repo Limit Breaches: Same repo hitting limit repeatedly → May indicate malicious activity or misconfigured webhook
  3. Redis Unavailable: Rate limit service should log and allow request with fallback, but alert on repeated failures

6. Structured Logging

GitTinkerer uses Fastify's structured logging (Pino-based) with tag-prefixed console logs and Sentry integration.

Request Log Toggles

Request logging is off by default to cut noise from frequent health checks. Enable it with:

  • LOG_REQUESTS=true to log non-health requests.
  • LOG_HEALTH_REQUESTS=true to log /health requests explicitly.
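
One way these toggles could be honored is with a Fastify onRequest hook, as in the illustrative sketch below; it assumes the app instance configured in the next subsection, and the real wiring may differ.

// Illustrative sketch only: conditional request logging driven by env toggles.
app.addHook('onRequest', async (request) => {
  const isHealth = request.url.startsWith('/health');
  const shouldLog = isHealth
    ? process.env.LOG_HEALTH_REQUESTS === 'true'
    : process.env.LOG_REQUESTS === 'true';
  if (shouldLog) {
    request.log.info({ method: request.method, url: request.url }, 'incoming request');
  }
});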

Fastify Logger Configuration

Location: service/src/server.ts

const app = Fastify({
  logger: {
    level: env.nodeEnv === 'development' ? 'debug' : 'info',
    serializers: {
      req(request) {
        return {
          id: request.id,
          method: request.method,
          url: request.url,
          remoteAddress: request.ip
        };
      },
      res(reply) {
        return { statusCode: reply.statusCode };
      }
    },
    requestIdHeader: 'X-Request-ID',
    requestIdLogLabel: 'requestId',
    genReqId(req) {
      return req.headers['x-request-id'] || generateUuid();
    }
  }
});

Log Levels and Usage

| Level | Usage | Environment |
|-------|-------|-------------|
| debug | Detailed execution flow, variable values | Development only |
| info | Service startup, successful operations | Production |
| warn | Recoverable issues, rate limit warnings | Both |
| error | Failures, unhandled exceptions | Both |

Tag-Prefixed Console Logging Patterns

Sentry Initialization:

[sentry] Sentry initialized with DSN: https://...
[sentry] Environment: production

Redis Connection:

[redis] Connecting to redis://localhost:6379
[redis] Connected successfully
[redis] Connection failed: ECONNREFUSED

Cache Operations:

[cache] Cache hit for run-status:550e8400
[cache] Cache miss for run-status:550e8400
[cache] Evicting stale entries: 3 keys

Rate Limiting:

[rate-limit] IP 192.168.1.1 limit check: 4/5 remaining
[rate-limit] Repo owner/repo limit exceeded: 0/5 remaining

Database Migrations:

[db] Running migration: 001_create_runs_table.sql
[db] Migration completed in 234ms

Structured Log Context

All request logs include context from requestIdMiddleware:

// Child logger with requestId automatically included
const child = app.log.child({ requestId: request.id });
child.info('Processing issue comment');

// Output includes requestId in all subsequent logs for this request
// {
//   "level": 30,
//   "time": "2026-01-07T14:32:15.000Z",
//   "requestId": "550e8400-e29b-41d4-a716-446655440000",
//   "msg": "Processing issue comment"
// }

Error Logging with Sentry Integration

Location: service/src/controllers/errors.ts

app.setErrorHandler(async (error, request, reply) => {
  // Structured error log
  request.log.error({
    err: error,
    url: request.url,
    method: request.method,
    statusCode: reply.statusCode
  });

  // Send to Sentry (if enabled)
  if (sentry.isEnabled()) {
    sentry.captureException(error, {
      requestId: request.id,
      tags: { handler: 'global' }
    });
  }
});

Logging Flow Diagram

graph TD
    A["Request Arrives"] --> B["Generate or<br/>Extract requestId"]
    B --> C["Create Child Logger<br/>with requestId"]

    C --> D["Middleware Layer"]
    D --> E["Log Request:<br/>method, url, ip"]
    E --> F["Route to Handler"]

    F --> G{"Error?"}
    G -->|No| H["Log Success<br/>statusCode"]
    G -->|Yes| I["Log Error<br/>Error Object"]

    I --> J["captureException<br/>to Sentry"]
    J --> K["Return Error Response"]

    H --> L["All logs tagged<br/>with requestId"]
    K --> L

    L --> M["Log Aggregation<br/>ELK/Datadog"]
    M --> N["Operator Analysis<br/>by requestId"]

    style C fill:#2196F3
    style L fill:#4CAF50
    style J fill:#FF9800
    style M fill:#FF9800

Log Analysis for Operators

Find Logs by Request ID:

# If logs are in JSON format
cat logs/*.json | grep '"requestId":"550e8400-e29b-41d4-a716-446655440000"'

Correlate with Errors:

  1. Identify error from Sentry
  2. Extract requestId from error details
  3. Query logs for that requestId
  4. Follow request flow from entry to error

Track Request Latency:

# Grep for request start and end, calculate duration
grep "Processing issue comment" logs/prod.json | jq '.time'
grep "Completed with status 200" logs/prod.json | jq '.time'
# Calculate difference manually
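
For a more automated take, the illustrative TypeScript sketch below (not part of the codebase) groups Pino JSON log lines by requestId and reports the elapsed time between each request's first and last entry:

// Illustrative helper: per-request latency from JSON (Pino) logs.
import { readFileSync } from 'node:fs';

function requestLatencies(logPath: string): Map<string, number> {
  const byRequest = new Map<string, { first: number; last: number }>();
  for (const line of readFileSync(logPath, 'utf8').split('\n')) {
    if (!line.trim()) continue;
    let entry: { requestId?: string; time?: number | string };
    try { entry = JSON.parse(line); } catch { continue; }
    if (!entry.requestId || entry.time === undefined) continue;
    const t = typeof entry.time === 'number' ? entry.time : Date.parse(entry.time);
    const seen = byRequest.get(entry.requestId);
    if (!seen) byRequest.set(entry.requestId, { first: t, last: t });
    else { seen.first = Math.min(seen.first, t); seen.last = Math.max(seen.last, t); }
  }
  const latencies = new Map<string, number>();
  for (const [id, { first, last }] of byRequest) latencies.set(id, last - first);
  return latencies;   // milliseconds per requestId
}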

7. Troubleshooting Guide for Operators

Symptom: Service Errors in Sentry

Diagnosis Steps:

  1. Check Error Frequency and Distribution:
     • Is this a recent regression or a persistent issue?
     • Is the error affecting all repositories or specific ones?
     • Check stage and repo tags in Sentry

  2. Inspect Error Context:
     • Review requestId in error breadcrumbs
     • Check userId if the error involves user-specific state
     • Note affected GitHub PR/issue numbers

  3. Correlate with Logs:
     • Use requestId to find the full request lifecycle in logs
     • Check for related errors in the same request chain
     • Identify which operation failed (LLM call, git operation, API request)

  4. Check Resource Constraints:
     • Is Redis available? (Check [redis] logs)
     • Is the database responding? (Check query latencies in logs)
     • Are external APIs (LLM, GitHub) available? (Check timeout errors in Sentry)

Symptom: Slow Completion Times (High duration_ms Metrics)

Diagnosis Steps:

  1. Query Metrics Endpoint:

    GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg

  2. Identify Slow Operations:
     • Compare duration_ms from recent runs
     • Check if the slowdown is consistent or intermittent

  3. Analyze the Bottleneck:
     • LLM Timeout: Check Sentry for timeout errors in handlePullRequest
     • Git Operations: Check exit codes in run.json files, look for permission errors
     • Database: Check PostgreSQL query logs for slow queries
     • Redis: Check connection delays with redis-cli LATENCY LATEST

  4. Export Detailed Data:

    GET /api/analytics/export?metricName=duration_ms&format=csv&limit=1000

    Analyze the CSV for percentiles and outliers (see the sketch below).
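
An illustrative TypeScript sketch for that last step, computing p50/p95/p99 over the exported values. It assumes the column order shown in the CSV Export section and a local file named export.csv (both assumptions, not part of the codebase).

// Illustrative sketch: nearest-rank percentiles from the exported CSV.
import { readFileSync } from 'node:fs';

function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const rows = readFileSync('export.csv', 'utf8').trim().split('\n').slice(1);  // drop header
const values = rows
  .map((row) => Number(row.split(',')[3]))          // metric_value column
  .filter((v) => Number.isFinite(v))
  .sort((a, b) => a - b);

console.log({ p50: percentile(values, 50), p95: percentile(values, 95), p99: percentile(values, 99) });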

Symptom: Redis Cache Misses Increasing

Diagnosis Steps:

  1. Check Redis Memory:

    redis-cli INFO memory
    # If used_memory > configured max, eviction policy may be evicting keys
    

  2. Monitor Cache Statistics:

    redis-cli INFO stats
    # High evicted_keys indicates memory pressure
    

  3. Review Cache Keys:

    redis-cli KEYS run-status:* | wc -l
    # Should be roughly equal to active runs (typically < 100)
    

  4. Check TTL Expiry:

    redis-cli TTL run-status:<runId>
    # -1 means the key has no TTL set; -2 means it has expired or does not exist
    

Resolution:

  • Increase the Redis memory allocation if used_memory is near the limit
  • Reduce the TTL if cache bloat is the issue
  • Monitor /api/rate-limit for a spike in request volume

Symptom: Rate Limiting Affecting Legitimate Webhooks

Diagnosis Steps:

  1. Check Rate Limit Metrics:
     • Query for 429 responses in metrics
     • Identify which repositories are affected

  2. Review Webhook Delivery History:
     • Check GitHub webhook deliveries in repository settings
     • Identify whether deliveries are clustered or distributed

  3. Analyze the Per-Repo Limit:
     • Check if a specific repository is hitting the limit repeatedly
     • Verify RATE_LIMIT_WINDOW and RATE_LIMIT_MAX settings

  4. Inspect Recent Logs:

    grep "\[rate-limit\]" logs/prod.json | tail -20
    

Resolution:

  • Increase RATE_LIMIT_MAX if the limit is too strict
  • Increase RATE_LIMIT_WINDOW to allow more time between bursts
  • Configure per-repo whitelisting if a specific repo needs a higher quota

Symptom: Failed Runs Not Creating Artifacts

Diagnosis Steps:

  1. Check Run Status:

    SELECT run_id, status, created_at, finished_at
    FROM runs
    WHERE artifacts_dir IS NULL
    ORDER BY created_at DESC
    LIMIT 5;
    

  2. Verify Artifacts Directory:

    ls -la artifacts/
    # Should show YYYYMMDDTHHMMSSZ directories for recent runs
    

  3. Check Sentry for File System Errors:
     • Search for errors containing "ENOSPC", "EACCES", "EIO"
     • These indicate disk space, permissions, or I/O issues

  4. Inspect the Run Handler:
     • Check service/src/usecases/runs/createRun.ts
     • Verify artifacts_dir is being set correctly

Resolution:

  • Check available disk space: df -h /path/to/artifacts
  • Verify directory permissions: chmod 755 artifacts/
  • Check for I/O errors: dmesg | tail -20

Quick Troubleshooting Checklist

| Issue | Check | Command |
|-------|-------|---------|
| Service down | Health check | curl http://localhost:3000/health |
| Sentry silent | Verify DSN | Check SENTRY_DSN env var in logs |
| Redis unavailable | Connection | redis-cli ping |
| Disk full | Storage | df -h / |
| Database down | Postgres | psql -U user -d dbname -c "SELECT 1;" |
| High latency | Network | ping sentry.io, redis-cli LATENCY LATEST |
| Rate limiting | Webhook volume | grep "429" logs/*.json \| wc -l |
| Cache stale | TTL | redis-cli TTL run-status:<runId> |

Appendix: Environment Variables Reference

Monitoring Configuration

# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production|staging|development
SENTRY_TRACES_SAMPLE_RATE=0.0-1.0  # Fraction of transactions to trace (default: 0.1)

# Redis cache and rate limiting
REDIS_HOST=localhost               # Redis server hostname
REDIS_PORT=6379                    # Redis server port
REDIS_PASSWORD=                    # Redis password (optional)

# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here

# Rate limiting
RATE_LIMIT_MAX=5                   # Max requests per time window
RATE_LIMIT_WINDOW=60               # Time window in seconds

# Logging
NODE_ENV=production                # Controls log level: info in production, debug in development
LOG_LEVEL=info                     # Override log level if needed

Database Monitoring Queries

Run Status Distribution:

SELECT status, COUNT(*) as count
FROM runs
GROUP BY status;

Metrics per Run:

SELECT run_id, COUNT(*) as metric_count, 
       MAX(recorded_at) as last_updated
FROM run_metrics
GROUP BY run_id
ORDER BY last_updated DESC
LIMIT 20;

Average Metrics by Day:

SELECT DATE(recorded_at) as day, 
       metric_name,
       AVG(metric_value) as avg_value,
       MIN(metric_value) as min_value,
       MAX(metric_value) as max_value
FROM run_metrics
WHERE metric_name = 'duration_ms'
GROUP BY DATE(recorded_at), metric_name
ORDER BY day DESC;