Observability Guide

This guide documents the monitoring, debugging, and analytics infrastructure for GitTinkerer. It covers error capture via Sentry, metrics collection and export, run artifact inspection, Redis caching, rate limiting, and structured logging patterns.

Quick Start

Key Monitoring Tools

| Tool | Purpose | Location |
|------|---------|----------|
| Sentry | Error tracking and issue aggregation | sentry.io |
| Redis | Run status caching, rate limit counters | REDIS_HOST:REDIS_PORT |
| Metrics DB | Token usage, timing, completion metrics | PostgreSQL run_metrics table |
| Analytics API | Metrics aggregation and CSV export | /api/analytics/* endpoints |
| Artifacts | Run logs, diffs, PR comments | artifacts/<TIMESTAMP>/ |
| Logs | Structured request/error logs | Fastify logger (Pino) |

Environment Variables for Observability

# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1  # Fraction of transactions to trace (0.0-1.0)

# Redis cache and rate limiting
REDIS_HOST=localhost
REDIS_PORT=6379

# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here

# Rate limiting
RATE_LIMIT_MAX=5        # Max requests per window
RATE_LIMIT_WINDOW=60    # Window duration in seconds

# Logging
NODE_ENV=production     # Controls log level: info in production, debug in development

1. Sentry Integration: Error Capture

GitTinkerer captures errors at multiple layers: TypeScript service, bash scripts, and webhook handling.

Service Layer Error Capture

Location: service/src/infra/sentry/

The SentryService class initializes and manages Sentry integration:

// Initialization on service startup
const sentry = new SentryService(env.SENTRY_DSN, {
  environment: env.SENTRY_ENVIRONMENT,
  tracesSampleRate: env.SENTRY_TRACES_SAMPLE_RATE
});

// Global error hook captures unhandled exceptions
app.setErrorHandler(async (error, request, reply) => {
  sentry.captureException(error, {
    requestId: request.id,
    path: request.url
  });
});

// Request context tracking
sentry.withContext({
  userId: conversationId,
  metadata: { repo, pr_number }
});

Key Methods:

  • captureException(error, context) — Send exception with metadata (requestId, path)
  • captureMessage(msg, level, context) — Log info/warning/error with context
  • recordBreadcrumb(action, data) — Record event for post-hoc debugging
  • setUser(userId) — Associate subsequent errors with user
  • withContext(data) — Add arbitrary key-value context to error reports
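
For orientation, here is a minimal sketch of what such a wrapper might look like on top of @sentry/node. The internals are illustrative assumptions; the real class in service/src/infra/sentry/ may differ in detail.

// Illustrative sketch only: a thin wrapper over @sentry/node matching the
// method list above. Not the actual implementation.
import * as Sentry from '@sentry/node';

interface SentryOptions {
  environment: string;
  tracesSampleRate: number;
}

export class SentryService {
  private readonly enabled: boolean;

  constructor(dsn: string | undefined, options: SentryOptions) {
    this.enabled = Boolean(dsn);
    if (this.enabled) {
      Sentry.init({ dsn, ...options });
    }
  }

  isEnabled(): boolean {
    return this.enabled;
  }

  captureException(error: unknown, context: Record<string, unknown> = {}): void {
    if (!this.enabled) {
      // Disabled mode: no-op toward Sentry, but keep a console trace (see below).
      console.error('[sentry] (disabled) captureException:', error);
      return;
    }
    Sentry.captureException(error, { extra: context });
  }

  captureMessage(msg: string, level: 'info' | 'warning' | 'error'): void {
    if (!this.enabled) return;
    Sentry.captureMessage(msg, level);
  }

  recordBreadcrumb(action: string, data: Record<string, unknown>): void {
    if (!this.enabled) return;
    Sentry.addBreadcrumb({ category: action, data });
  }

  setUser(userId: string): void {
    if (!this.enabled) return;
    Sentry.setUser({ id: userId });
  }

  withContext(data: Record<string, unknown>): void {
    if (!this.enabled) return;
    Sentry.setContext('metadata', data);
  }
}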

Bash Layer Error Capture

Location: lib/sentry.sh

Bash scripts send errors directly to Sentry API:

source lib/sentry.sh

sentry_capture_message "Deploy failed" "error" "production"

Payload Structure:

{
  "message": "Deploy failed",
  "level": "error",
  "timestamp": 1672743986,
  "tags": {
    "stage": "production",
    "run_id": "<PAYLOAD_RUN_ID>",
    "repo": "owner/repo",
    "pr_number": "123",
    "source": "workflow|webhook"
  },
  "user": {
    "id": "<PAYLOAD_WEB_CONVERSATION_ID>",
    "username": "<USER>"
  },
  "extra": {
    "payload_source": "github",
    "stage": "production"
  }
}

Authentication Header:

X-Sentry-Auth: Sentry sentry_key=<key>, sentry_version=7, sentry_timestamp=<ts>, sentry_client=gittinkerer-cli/1.0
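
For reference, the store endpoint and auth header shown above can be derived from the DSN. The bash implementation in lib/sentry.sh does this with string manipulation and curl; the TypeScript sketch below is an illustrative equivalent, not the actual code.

// Illustrative sketch: derive the store endpoint and X-Sentry-Auth header from a DSN
// of the form https://<key>@<host>/<projectId>.
function buildSentryRequest(dsn: string) {
  const url = new URL(dsn);
  const key = url.username;                                   // public key portion of the DSN
  const projectId = url.pathname.replace(/^\//, '');
  const endpoint = `${url.protocol}//${url.host}/api/${projectId}/store/`;
  const timestamp = Math.floor(Date.now() / 1000);
  const authHeader =
    `Sentry sentry_key=${key}, sentry_version=7, ` +
    `sentry_timestamp=${timestamp}, sentry_client=gittinkerer-cli/1.0`;
  return {
    endpoint,
    headers: { 'X-Sentry-Auth': authHeader, 'Content-Type': 'application/json' },
  };
}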

Error Capture Flow Diagram

graph TD
    A["Service Request"] --> B{"Error Occurs?"}
    B -->|Yes| C["Global Error Hook<br/>captureException"]
    C --> D["Add Request Context<br/>requestId, path, userId"]
    D --> E["SentryService.captureException"]
    E --> F["Sentry API<br/>POST /api/&lt;project_id&gt;/store"]
    F --> G["Sentry Dashboard<br/>Error Aggregation"]

    H["Bash Script"] --> I{"Error/Exit?"}
    I -->|Yes| J["sentry_capture_message"]
    J --> K["Build API Endpoint<br/>from DSN"]
    K --> L["HTTP POST<br/>with Auth Header"]
    L --> F

    M["Webhook Handler<br/>GitHub Event"]
    M --> N{"Validation Fails?"}
    N -->|Yes| O["captureException<br/>with webhookId"]
    O --> E

    style G fill:#4CAF50
    style F fill:#2196F3
    style A fill:#FFC107
    style H fill:#FFC107
    style M fill:#FFC107

Configuring Sentry

File: service/src/config/env.ts

SENTRY_DSN: z.string().url().optional(),           // Project DSN
SENTRY_ENVIRONMENT: z.enum(['development', 'production']).default('production'),
SENTRY_TRACES_SAMPLE_RATE: z.number().min(0).max(1).default(0.1)

To Disable Sentry (e.g., local development):

  • Leave SENTRY_DSN unset
  • The service will no-op all Sentry calls but still log errors via console.error

Monitoring Sentry Alerts

Key Metrics to Watch:

  • Error Rate: Spikes > 10% above baseline
  • New Issues: Watch for newly created issues and those flagged "Regressed"
  • Affected Users: If userId context is set, track which users see errors
  • Error Distribution: By stage tag (production, staging) and repo tag

Set Alerts For:

  1. High Error Volume: 50+ errors in 5 minutes
  2. Critical Path Failures: Issues in handleIssueComment, handlePullRequest usecases
  3. Infrastructure Errors: Redis connection failures, database timeouts
  4. Rate Limit Breaches: Accumulation of 429 responses in webhook handler

2. Metrics Collection and Storage

GitTinkerer collects performance metrics at both bash and service layers, storing them in PostgreSQL for analysis.

Bash-Level Metrics (lib/metrics.sh)

Token Approximation:

# Estimates tokens from text
approximate_token_count "Long text here"
# Formula: max(word_count, ceil(char_count / 4))

Diff Metrics:

# Counts additions and removals from unified diff
calculate_diff_loc
# Additions: lines matching ^(\+[^+]|\ No newline at end of file)
# Removals: lines matching ^-[^-]
# Returns: total_loc (additions + removals)
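
The same arithmetic, re-expressed as an illustrative TypeScript sketch for readers who prefer it over bash. The real implementations live in lib/metrics.sh; the diff counter below is simplified and ignores the "\ No newline at end of file" case that the bash regex also matches.

// Illustrative re-expression of the bash helpers; not the actual implementation.
function approximateTokenCount(text: string): number {
  const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
  const charEstimate = Math.ceil(text.length / 4);
  return Math.max(wordCount, charEstimate);   // max(word_count, ceil(char_count / 4))
}

function calculateDiffLoc(unifiedDiff: string): number {
  const lines = unifiedDiff.split('\n');
  // Count added/removed lines, skipping the +++/--- file headers.
  const additions = lines.filter((l) => /^\+[^+]/.test(l)).length;
  const removals = lines.filter((l) => /^-[^-]/.test(l)).length;
  return additions + removals;                // total_loc
}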

Metrics Queue:

# Queue metrics in memory during run
record_metric "tokens_used" 1250
record_metric "files_changed" 3
record_metric "duration_ms" 5432

# Flush all queued metrics to API after run completes
flush_metrics
# POST to $METRICS_API_URL with Bearer token

Service-Level Metrics (service/src/usecases/metrics/)

Supported Metric Names:

| Metric | Unit | Meaning |
|--------|------|---------|
| duration_ms | milliseconds | Total execution time |
| exit_code | integer | Process exit code (0 = success) |
| tokens_used | count | LLM tokens consumed by completion |
| files_changed | count | Number of files modified in diff |
| cost_usd | USD | Computed cost (tokens_used × 0.002 / 1000) |

Recording Metrics:

// In usecases that need to track performance
import { MetricsService } from '../../services/metrics';

const metrics = new MetricsService(pool); // PostgreSQL pool

await metrics.recordMetric(runId, 'duration_ms', 5432);
await metrics.recordMetric(runId, 'tokens_used', 1250);
await metrics.recordMetric(runId, 'cost_usd', 2.50);
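
A minimal sketch of what recordMetric might look like, assuming the pg Pool passed to the constructor and the run_metrics schema shown in the next subsection; the actual MetricsService may differ.

// Illustrative sketch only.
import { Pool } from 'pg';

export class MetricsService {
  constructor(private readonly pool: Pool) {}

  async recordMetric(runId: string, metricName: string, metricValue: number): Promise<void> {
    // One row per metric; recorded_at defaults to now() in the schema below.
    await this.pool.query(
      `INSERT INTO run_metrics (run_id, metric_name, metric_value)
       VALUES ($1, $2, $3)`,
      [runId, metricName, metricValue]
    );
  }
}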

Database Schema

-- Metrics storage
CREATE TABLE run_metrics (
  id BIGSERIAL PRIMARY KEY,
  run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE,
  metric_name TEXT NOT NULL,
  metric_value NUMERIC,
  recorded_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX run_metrics_run_idx ON run_metrics (run_id);
CREATE INDEX run_metrics_name_idx ON run_metrics (metric_name);
CREATE INDEX run_metrics_recorded_at_idx ON run_metrics (recorded_at DESC);

Metrics Collection Flow Diagram

graph LR
    A["Bash Script<br/>bin/gittinkerer"] --> B["record_metric<br/>Queue in Memory"]
    B --> C["Calculate Metrics<br/>tokens, diff LOC, duration"]
    C --> D["Run Completes"]
    D --> E["flush_metrics<br/>POST to API"]
    E --> F["Service Handler<br/>/api/metrics/record"]

    G["Service Usecase"] --> H["MetricsService.recordMetric"]
    H --> I["INSERT run_metrics<br/>PostgreSQL"]

    F --> I

    J["Database"] --> K["run_metrics Table"]
    I --> K

    style I fill:#4CAF50
    style K fill:#2196F3
    style E fill:#FF9800
    style F fill:#FF9800

3. Analytics and CSV Export

The analytics API provides aggregation and export of metrics for operational analysis.

Analytics Endpoints

Endpoint: GET /api/analytics/metrics

Aggregate metrics across runs:

GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg&from=2026-01-01&to=2026-01-07

Response:

{
  "metric": "duration_ms",
  "aggregation": "avg",
  "value": 4852.5,
  "count": 24,
  "min": 1200,
  "max": 8950
}

Supported Aggregations: sum, avg, min, max, count, stddev
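
One way the aggregation endpoint could map the aggregation parameter onto SQL is sketched below (illustrative only; the actual handler may differ). The whitelist keeps user input out of the generated SQL, since only the metric name and date range are bound as parameters.

// Illustrative sketch of an aggregation query builder.
const AGGREGATIONS: Record<string, string> = {
  sum: 'SUM', avg: 'AVG', min: 'MIN', max: 'MAX', count: 'COUNT', stddev: 'STDDEV',
};

function buildAggregationQuery(aggregation: string): string {
  const fn = AGGREGATIONS[aggregation];
  if (!fn) throw new Error(`Unsupported aggregation: ${aggregation}`);
  // Caller binds [metricName, from, to] as $1..$3.
  return `
    SELECT ${fn}(metric_value) AS value,
           COUNT(*)            AS count,
           MIN(metric_value)   AS min,
           MAX(metric_value)   AS max
    FROM run_metrics
    WHERE metric_name = $1
      AND recorded_at BETWEEN $2 AND $3`;
}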

CSV Export Endpoint

Endpoint: GET /api/analytics/export

Export raw metrics in CSV format:

GET /api/analytics/export?metricName=duration_ms&format=csv&limit=1000

CSV Output:

id,run_id,metric_name,metric_value,recorded_at
1,550e8400-e29b-41d4-a716-446655440000,duration_ms,5432,2026-01-07T14:32:15.000Z
2,550e8400-e29b-41d4-a716-446655440001,duration_ms,4120,2026-01-07T14:35:22.000Z
3,550e8400-e29b-41d4-a716-446655440002,tokens_used,1250,2026-01-07T14:38:10.000Z

Key Analytics Queries for Operators

Monitor Average Completion Time:

GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg

Track Token Usage Trends:

GET /api/analytics/metrics?metricName=tokens_used&aggregation=sum&from=<yesterday>&to=<today>

Export Last 1000 Runs for Cost Analysis:

GET /api/analytics/export?metricName=cost_usd&format=csv&limit=1000

Identify Slow Runs:

GET /api/analytics/export?metricName=duration_ms&format=csv&limit=100
// Then filter locally for duration_ms > 10000

Analytics Flow Diagram

graph TD
    A["Metrics Recorded<br/>run_metrics Table"] --> B["Operator Query<br/>GET /api/analytics/*"]
    B --> C{"Export or<br/>Aggregate?"}

    C -->|Export CSV| D["SELECT run_id, metric_name,<br/>metric_value, recorded_at"]
    D --> E["Format as CSV<br/>Send with Content-Type: text/csv"]
    E --> F["Operator<br/>Import to Analysis Tool"]

    C -->|Aggregate| G["SELECT metric_name,<br/>Aggregation Function"]
    G --> H["Calculate:<br/>sum, avg, min, max, count"]
    H --> I["Return JSON<br/>with Statistics"]
    I --> F

    F --> J["Analysis<br/>Dashboard, Alerts"]

    style A fill:#2196F3
    style J fill:#4CAF50
    style E fill:#FF9800
    style I fill:#FF9800

4. Run Artifacts and Debugging Failed Runs

Each run produces timestamped artifacts containing logs, diffs, and metadata for post-mortem analysis.

Artifacts Directory Structure

artifacts/
├── 20260103T142626Z/                # Timestamp: YYYYMMDDTHHMMSSZ
│   ├── payload.json                 # Original GitHub webhook payload
│   ├── run.json                     # Run metadata and status
│   └── agent-run/
│       ├── prompt.txt               # Full prompt sent to LLM
│       ├── diff.patch               # Unified diff of proposed changes
│       ├── files_changed.json       # Array of modified files
│       ├── pr_comment.txt           # Comment posted to PR
│       ├── summary.md               # Human-readable summary
│       └── commit_sha.txt           # Committed SHA (if successful)
└── runs/
    └── github-deliveries/           # Webhook delivery logs

Run Metadata (run.json)

{
  "run_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "succeeded",
  "timestamp": "20260103T142626Z",
  "payload_path": "/path/to/artifacts/payload.json",
  "artifacts_dir": "/path/to/artifacts/20260103T142626Z/",
  "started_at": "2026-01-03T14:26:26.695Z",
  "finished_at": "2026-01-03T14:26:50.476Z",
  "exit_code": 0
}

Debugging Failed Runs

Step 1: Check Run Status

Query the runs table:

SELECT run_id, status, source, created_at, finished_at
FROM runs
WHERE status = 'failed'
ORDER BY created_at DESC
LIMIT 10;

Step 2: Review Artifacts

# Find artifacts by run_id or timestamp
ls -la artifacts/20260103T142626Z/

# Check run metadata
cat artifacts/20260103T142626Z/run.json

# Review diff that was proposed (if generation succeeded)
cat artifacts/20260103T142626Z/agent-run/diff.patch

# Check error details in run.json or Sentry

Step 3: Check Sentry for Errors

Look up errors by run_id tag:

https://sentry.io/organizations/your-org/issues/?query=tags.run_id:[run_id]

Step 4: Analyze Common Failure Points

| Failure Point | Check | Resolution |
|---------------|-------|------------|
| Payload validation | payload.json structure, required fields | Verify webhook event type (issue, pull_request) |
| LLM timeout | agent-run/prompt.txt, Sentry timeout errors | Check LLM API status, increase timeout |
| Git operations | Check exit_code in run.json, Sentry tags | Verify SSH keys, branch permissions |
| PR comment posting | Check pr_comment.txt exists, Sentry 403 errors | Verify GitHub token scope, PR state |
| Rate limiting | Check metrics for 429 responses | Monitor /api/rate-limit endpoint |

Artifacts Flow Diagram

graph TD
    A["GitHub Event<br/>Issue/PR Comment"] --> B["Create Run<br/>run_id, timestamp"]
    B --> C["Create artifacts/<br/>TIMESTAMP/ Directory"]

    C --> D["Store payload.json<br/>Original Webhook"]
    C --> E["Store run.json<br/>Metadata, Status"]

    F["Execute Agent"] --> G["Generate LLM Response"]
    G --> H["Create agent-run/<br/>subdirectory"]
    H --> I["Store prompt.txt<br/>diff.patch<br/>files_changed.json"]

    F --> J{"Success?"}
    J -->|Yes| K["Store pr_comment.txt<br/>commit_sha.txt"]
    J -->|No| L["Store Error Message<br/>Send to Sentry"]

    K --> M["Update run.json<br/>status=succeeded"]
    L --> N["Update run.json<br/>status=failed"]

    O["Operator Debugging"] --> P["Query runs Table"]
    P --> Q["Inspect artifacts/<br/>TIMESTAMP/"]
    Q --> R["Check run.json Status"]
    R --> S["Review Sentry<br/>by run_id Tag"]

    style B fill:#2196F3
    style M fill:#4CAF50
    style N fill:#F44336
    style S fill:#FF9800

5. Redis Cache: Run Status and Rate Limiting

Redis serves dual roles: caching run status for rapid polling and enforcing rate limits on webhook processing.

Run Status Caching

Location: service/src/services/redis.ts

Cache Configuration:

const redis = new Redis({
  host: env.REDIS_HOST,
  port: env.REDIS_PORT,
  retryStrategy: (times) => Math.min(times * 50, 5000),
  disableOfflineQueue: true  // Fail fast if disconnected
});

Cache Operations:

// Set cached status with 24-hour TTL
const TTL_SECONDS = 86400;
await redis.setex(
  `run-status:${runId}`,
  TTL_SECONDS,
  JSON.stringify({ status, timestamp })
);

// Get cached status (avoids database hit)
const cached = await redis.get(`run-status:${runId}`);

// Delete when run is archived
await redis.del(`run-status:${runId}`);

Cache Key Pattern: run-status:<runId>

TTL: 86,400 seconds (24 hours)

Usecase: service/src/usecases/runs/getRun.ts checks the cache before querying the database during rapid polling, reducing DB load during status-check storms.
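
An illustrative cache-first lookup in that spirit (assumptions: an ioredis client, a pg Pool, and the runs table columns shown elsewhere in this guide; the real getRun.ts may differ):

// Illustrative sketch only.
import Redis from 'ioredis';
import { Pool } from 'pg';

const redis = new Redis({
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
});
const pool = new Pool();   // connection settings come from the standard PG* env vars

async function getRunStatus(runId: string): Promise<{ status: string; timestamp?: string }> {
  const cached = await redis.get(`run-status:${runId}`);
  if (cached) {
    return JSON.parse(cached);                         // cache hit: skip the database
  }

  const { rows } = await pool.query(
    'SELECT status, finished_at FROM runs WHERE run_id = $1',
    [runId]
  );
  if (rows.length === 0) throw new Error(`Run not found: ${runId}`);

  const result = { status: rows[0].status, timestamp: rows[0].finished_at };
  await redis.setex(`run-status:${runId}`, 86400, JSON.stringify(result));  // repopulate, 24h TTL
  return result;
}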

Monitoring Cache Hit Rate

Metrics to Track:

| Metric | Query | Target |
|--------|-------|--------|
| Cache Hits | keyspace_hits from INFO stats | > 80% of queries |
| Cache Misses | keyspace_misses from INFO stats | < 20% of queries |
| Evictions | evicted_keys from INFO stats | Should be 0 (TTL-based expiry) |
| Memory Usage | used_memory_human from INFO memory | < 512 MB for run cache |
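
A small operator script (illustrative, assuming ioredis) can compute the hit rate from INFO stats directly:

// Illustrative sketch: compute the cache hit rate from INFO stats.
import Redis from 'ioredis';

async function cacheHitRate(): Promise<number> {
  const redis = new Redis({
    host: process.env.REDIS_HOST ?? 'localhost',
    port: Number(process.env.REDIS_PORT ?? 6379),
  });
  const stats = await redis.info('stats');          // returns "key:value" lines
  const read = (name: string) =>
    Number((stats.match(new RegExp(`${name}:(\\d+)`)) ?? [])[1] ?? 0);
  const hits = read('keyspace_hits');
  const misses = read('keyspace_misses');
  await redis.quit();
  return hits + misses === 0 ? 0 : hits / (hits + misses);
}

cacheHitRate().then((rate) => console.log(`cache hit rate: ${(rate * 100).toFixed(1)}%`));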

Redis Commands for Operators:

# Connect to Redis
redis-cli -h <REDIS_HOST> -p <REDIS_PORT>

# Check cache statistics
> INFO stats

# Monitor memory usage
> INFO memory

# Find all run-status keys
> KEYS run-status:*

# Check TTL of a specific key
> TTL run-status:550e8400-e29b-41d4-a716-446655440000

# Clear all run cache (if needed for maintenance)
> EVAL "return redis.call('del', unpack(redis.call('keys', 'run-status:*')))" 0

Rate Limiting

Location: service/src/services/redis.ts, service/src/middleware/rateLimit.ts

Two-Layer Rate Limiting Architecture:

Layer 1: Global IP-Based (Fastify Plugin)

// Registered in server.ts
app.register(fastifyRateLimit, {
  max: env.RATE_LIMIT_MAX,      // 5 requests per window
  timeWindow: `${env.RATE_LIMIT_WINDOW}s`  // 60 seconds
});

Returns HTTP 429 with headers:

  • Retry-After: <seconds>
  • X-RateLimit-Limit: 5
  • X-RateLimit-Remaining: 0

Layer 2: Per-Repository Rate Limiting

// Custom check in usecases/github/handleIssueComment.ts
const result = await rateLimitService.checkLimit(repoName);

if (!result.allowed) {
  return {
    status: 429,
    error: "Rate limited per repository",
    retryAfter: result.retryAfter,
    remainingRequests: result.remaining
  };
}

Rate Limit Configuration:

RATE_LIMIT_MAX=5        # Max requests per window
RATE_LIMIT_WINDOW=60    # Window duration in seconds
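
For illustration, a fixed-window per-repository limiter like the checkLimit call above can be built from Redis INCR/EXPIRE as sketched below. The rate-limit key pattern is hypothetical and the actual RateLimitService implementation may differ.

// Illustrative sketch only: fixed-window limiter on Redis INCR/EXPIRE.
import Redis from 'ioredis';

const WINDOW_SECONDS = Number(process.env.RATE_LIMIT_WINDOW ?? 60);
const MAX_REQUESTS = Number(process.env.RATE_LIMIT_MAX ?? 5);

export async function checkLimit(redis: Redis, repoName: string) {
  const key = `rate-limit:${repoName}`;              // hypothetical key pattern
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, WINDOW_SECONDS);         // start the window on first hit
  }
  const ttl = await redis.ttl(key);
  return {
    allowed: count <= MAX_REQUESTS,
    remaining: Math.max(0, MAX_REQUESTS - count),
    retryAfter: ttl > 0 ? ttl : WINDOW_SECONDS,
  };
}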

Rate Limiting Flow Diagram

graph TD
    A["Webhook Arrives<br/>GitHub Event"] --> B["Global Rate Limit<br/>Check IP"]
    B --> C{"IP Limit<br/>Exceeded?"}
    C -->|Yes| D["Return 429<br/>Retry-After Header"]
    C -->|No| E["Route to Handler<br/>handleIssueComment"]

    E --> F["Per-Repo Rate Limit<br/>Check Repository"]
    F --> G{"Repo Limit<br/>Exceeded?"}
    G -->|Yes| H["Return 429<br/>Custom Response"]
    G -->|No| I["Process Webhook<br/>Increment Counter"]

    D --> J["Client Backoff"]
    H --> J
    J --> K["Retry After Window"]
    K --> A

    I --> L["Record Metric<br/>webhook_processed"]
    L --> M["Send Response<br/>to GitHub"]

    style D fill:#F44336
    style H fill:#F44336
    style M fill:#4CAF50
    style I fill:#FFC107

Monitoring Rate Limits

Alerts for Operators:

  1. Sustained 429 Responses: > 10 in 5-minute window → Check for coordinated webhook deliveries
  2. Per-Repo Limit Breaches: Same repo hitting limit repeatedly → May indicate malicious activity or misconfigured webhook
  3. Redis Unavailable: Rate limit service should log and allow request with fallback, but alert on repeated failures

6. Structured Logging

GitTinkerer uses Fastify's structured logging (Pino-based) with tag-prefixed console logs and Sentry integration.

Request Log Toggles

Request logging is off by default to cut noise from frequent health checks. Enable it with:

  • LOG_REQUESTS=true to log non-health requests.
  • LOG_HEALTH_REQUESTS=true to log /health requests explicitly.
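
One way these toggles could be honored is with a Fastify onRequest hook, as in the illustrative sketch below; it assumes the app instance configured in the next subsection, and the real wiring may differ.

// Illustrative sketch only: conditional request logging driven by env toggles.
app.addHook('onRequest', async (request) => {
  const isHealth = request.url.startsWith('/health');
  const shouldLog = isHealth
    ? process.env.LOG_HEALTH_REQUESTS === 'true'
    : process.env.LOG_REQUESTS === 'true';
  if (shouldLog) {
    request.log.info({ method: request.method, url: request.url }, 'incoming request');
  }
});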

Fastify Logger Configuration

Location: service/src/server.ts

const app = Fastify({
  logger: {
    level: env.nodeEnv === 'development' ? 'debug' : 'info',
    serializers: {
      req(request) {
        return {
          id: request.id,
          method: request.method,
          url: request.url,
          remoteAddress: request.ip
        };
      },
      res(reply) {
        return { statusCode: reply.statusCode };
      }
    },
    requestIdHeader: 'X-Request-ID',
    requestIdLogLabel: 'requestId',
    genReqId(req) {
      return req.headers['x-request-id'] || generateUuid();
    }
  }
});

Log Levels and Usage

| Level | Usage | Environment |
|-------|-------|-------------|
| debug | Detailed execution flow, variable values | Development only |
| info | Service startup, successful operations | Production |
| warn | Recoverable issues, rate limit warnings | Both |
| error | Failures, unhandled exceptions | Both |

Tag-Prefixed Console Logging Patterns

Sentry Initialization:

[sentry] Sentry initialized with DSN: https://...
[sentry] Environment: production

Redis Connection:

[redis] Connecting to redis://localhost:6379
[redis] Connected successfully
[redis] Connection failed: ECONNREFUSED

Cache Operations:

[cache] Cache hit for run-status:550e8400
[cache] Cache miss for run-status:550e8400
[cache] Evicting stale entries: 3 keys

Rate Limiting:

[rate-limit] IP 192.168.1.1 limit check: 4/5 remaining
[rate-limit] Repo owner/repo limit exceeded: 0/5 remaining

Database Migrations:

[db] Running migration: 001_create_runs_table.sql
[db] Migration completed in 234ms

Structured Log Context

All request logs include context from requestIdMiddleware:

// Child logger with requestId automatically included
const child = app.log.child({ requestId: request.id });
child.info('Processing issue comment');

// Output includes requestId in all subsequent logs for this request
// {
//   "level": 30,
//   "time": "2026-01-07T14:32:15.000Z",
//   "requestId": "550e8400-e29b-41d4-a716-446655440000",
//   "msg": "Processing issue comment"
// }

Error Logging with Sentry Integration

Location: service/src/controllers/errors.ts

app.setErrorHandler(async (error, request, reply) => {
  // Structured error log
  request.log.error({
    err: error,
    url: request.url,
    method: request.method,
    statusCode: reply.statusCode
  });

  // Send to Sentry (if enabled)
  if (sentry.isEnabled()) {
    sentry.captureException(error, {
      requestId: request.id,
      tags: { handler: 'global' }
    });
  }
});

Logging Flow Diagram

graph TD
    A["Request Arrives"] --> B["Generate or<br/>Extract requestId"]
    B --> C["Create Child Logger<br/>with requestId"]

    C --> D["Middleware Layer"]
    D --> E["Log Request:<br/>method, url, ip"]
    E --> F["Route to Handler"]

    F --> G{"Error?"}
    G -->|No| H["Log Success<br/>statusCode"]
    G -->|Yes| I["Log Error<br/>Error Object"]

    I --> J["captureException<br/>to Sentry"]
    J --> K["Return Error Response"]

    H --> L["All logs tagged<br/>with requestId"]
    K --> L

    L --> M["Log Aggregation<br/>ELK/Datadog"]
    M --> N["Operator Analysis<br/>by requestId"]

    style C fill:#2196F3
    style L fill:#4CAF50
    style J fill:#FF9800
    style M fill:#FF9800

Log Analysis for Operators

Find Logs by Request ID:

# If logs are in JSON format
cat logs/*.json | grep '"requestId":"550e8400-e29b-41d4-a716-446655440000"'

Correlate with Errors:

  1. Identify error from Sentry
  2. Extract requestId from error details
  3. Query logs for that requestId
  4. Follow request flow from entry to error

Track Request Latency:

# Grep for request start and end, calculate duration
grep "Processing issue comment" logs/prod.json | jq '.time'
grep "Completed with status 200" logs/prod.json | jq '.time'
# Calculate difference manually
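
For a more automated take, the illustrative TypeScript sketch below (not part of the codebase) groups Pino JSON log lines by requestId and reports the elapsed time between each request's first and last entry:

// Illustrative helper: per-request latency from JSON (Pino) logs.
import { readFileSync } from 'node:fs';

function requestLatencies(logPath: string): Map<string, number> {
  const byRequest = new Map<string, { first: number; last: number }>();
  for (const line of readFileSync(logPath, 'utf8').split('\n')) {
    if (!line.trim()) continue;
    let entry: { requestId?: string; time?: number | string };
    try { entry = JSON.parse(line); } catch { continue; }
    if (!entry.requestId || entry.time === undefined) continue;
    const t = typeof entry.time === 'number' ? entry.time : Date.parse(entry.time);
    const seen = byRequest.get(entry.requestId);
    if (!seen) byRequest.set(entry.requestId, { first: t, last: t });
    else { seen.first = Math.min(seen.first, t); seen.last = Math.max(seen.last, t); }
  }
  const latencies = new Map<string, number>();
  for (const [id, { first, last }] of byRequest) latencies.set(id, last - first);
  return latencies;   // milliseconds per requestId
}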

7. Troubleshooting Guide for Operators

Symptom: Service Errors in Sentry

Diagnosis Steps:

  1. Check Error Frequency and Distribution:
     • Is this a recent regression or a persistent issue?
     • Is the error affecting all repositories or specific ones?
     • Check stage and repo tags in Sentry

  2. Inspect Error Context:
     • Review requestId in error breadcrumbs
     • Check userId if the error involves user-specific state
     • Note affected GitHub PR/issue numbers

  3. Correlate with Logs:
     • Use requestId to find the full request lifecycle in logs
     • Check for related errors in the same request chain
     • Identify which operation failed (LLM call, git operation, API request)

  4. Check Resource Constraints:
     • Is Redis available? (Check [redis] logs)
     • Is the database responding? (Check query latencies in logs)
     • Are external APIs (LLM, GitHub) available? (Check timeout errors in Sentry)

Symptom: Slow Completion Times (High duration_ms Metrics)

Diagnosis Steps:

  1. Query Metrics Endpoint:

    GET /api/analytics/metrics?metricName=duration_ms&aggregation=avg

  2. Identify Slow Operations:
     • Compare duration_ms from recent runs
     • Check if the slowdown is consistent or intermittent

  3. Analyze the Bottleneck:
     • LLM Timeout: Check Sentry for timeout errors in handlePullRequest
     • Git Operations: Check exit codes in run.json files, look for permission errors
     • Database: Check PostgreSQL query logs for slow queries
     • Redis: Check connection delays with redis-cli LATENCY LATEST

  4. Export Detailed Data:

    GET /api/analytics/export?metricName=duration_ms&format=csv&limit=1000

    Analyze the CSV for percentiles and outliers (see the sketch below).
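
An illustrative TypeScript sketch for that last step, computing p50/p95/p99 over the exported values. It assumes the column order shown in the CSV Export section and a local file named export.csv (both assumptions, not part of the codebase).

// Illustrative sketch: nearest-rank percentiles from the exported CSV.
import { readFileSync } from 'node:fs';

function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const rows = readFileSync('export.csv', 'utf8').trim().split('\n').slice(1);  // drop header
const values = rows
  .map((row) => Number(row.split(',')[3]))          // metric_value column
  .filter((v) => Number.isFinite(v))
  .sort((a, b) => a - b);

console.log({ p50: percentile(values, 50), p95: percentile(values, 95), p99: percentile(values, 99) });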

Symptom: Redis Cache Misses Increasing

Diagnosis Steps:

  1. Check Redis Memory:

    redis-cli INFO memory
    # If used_memory > configured max, eviction policy may be evicting keys
    

  2. Monitor Cache Statistics:

    redis-cli INFO stats
    # High evicted_keys indicates memory pressure
    

  3. Review Cache Keys:

    redis-cli KEYS run-status:* | wc -l
    # Should be roughly equal to active runs (typically < 100)
    

  4. Check TTL Expiry:

    redis-cli TTL run-status:<runId>
    # -1 means the key has no TTL set; -2 means it has expired or does not exist
    

Resolution:

  • Increase the Redis memory allocation if used_memory is near the limit
  • Reduce the TTL if cache bloat is the issue
  • Monitor /api/rate-limit for a spike in request volume

Symptom: Rate Limiting Affecting Legitimate Webhooks

Diagnosis Steps:

  1. Check Rate Limit Metrics:
     • Query for 429 responses in metrics
     • Identify which repositories are affected

  2. Review Webhook Delivery History:
     • Check GitHub webhook deliveries in repository settings
     • Identify whether deliveries are clustered or distributed

  3. Analyze the Per-Repo Limit:
     • Check if a specific repository is hitting the limit repeatedly
     • Verify RATE_LIMIT_WINDOW and RATE_LIMIT_MAX settings

  4. Inspect Recent Logs:

    grep "\[rate-limit\]" logs/prod.json | tail -20
    

Resolution:

  • Increase RATE_LIMIT_MAX if the limit is too strict
  • Increase RATE_LIMIT_WINDOW to allow more time between bursts
  • Configure per-repo whitelisting if a specific repo needs a higher quota

Symptom: Failed Runs Not Creating Artifacts

Diagnosis Steps:

  1. Check Run Status:

    SELECT run_id, status, created_at, finished_at
    FROM runs
    WHERE artifacts_dir IS NULL
    ORDER BY created_at DESC
    LIMIT 5;
    

  2. Verify Artifacts Directory:

    ls -la artifacts/
    # Should show YYYYMMDDTHHMMSSZ directories for recent runs
    

  3. Check Sentry for File System Errors:
     • Search for errors containing "ENOSPC", "EACCES", "EIO"
     • These indicate disk space, permissions, or I/O issues

  4. Inspect the Run Handler:
     • Check service/src/usecases/runs/createRun.ts
     • Verify artifacts_dir is being set correctly

Resolution:

  • Check available disk space: df -h /path/to/artifacts
  • Verify directory permissions: chmod 755 artifacts/
  • Check for I/O errors: dmesg | tail -20

Quick Troubleshooting Checklist

| Issue | Check | Command |
|-------|-------|---------|
| Service down | Health check | curl http://localhost:3000/health |
| Sentry silent | Verify DSN | Check SENTRY_DSN env var in logs |
| Redis unavailable | Connection | redis-cli ping |
| Disk full | Storage | df -h / |
| Database down | Postgres | psql -U user -d dbname -c "SELECT 1;" |
| High latency | Network | ping sentry.io, redis-cli LATENCY LATEST |
| Rate limiting | Webhook volume | grep "429" logs/*.json \| wc -l |
| Cache stale | TTL | redis-cli TTL run-status:<runId> |

Appendix: Environment Variables Reference

Monitoring Configuration

# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production|staging|development
SENTRY_TRACES_SAMPLE_RATE=0.0-1.0  # Fraction of transactions to trace (default: 0.1)

# Redis cache and rate limiting
REDIS_HOST=localhost               # Redis server hostname
REDIS_PORT=6379                    # Redis server port
REDIS_PASSWORD=                    # Redis password (optional)

# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here

# Rate limiting
RATE_LIMIT_MAX=5                   # Max requests per time window
RATE_LIMIT_WINDOW=60               # Time window in seconds

# Logging
NODE_ENV=production                # Controls log level: info in production, debug in development
LOG_LEVEL=info                     # Override log level if needed

Database Monitoring Queries

Run Status Distribution:

SELECT status, COUNT(*) as count
FROM runs
GROUP BY status;

Metrics per Run:

SELECT run_id, COUNT(*) as metric_count, 
       MAX(recorded_at) as last_updated
FROM run_metrics
GROUP BY run_id
ORDER BY last_updated DESC
LIMIT 20;

Average Metrics by Day:

SELECT DATE(recorded_at) as day, 
       metric_name,
       AVG(metric_value) as avg_value,
       MIN(metric_value) as min_value,
       MAX(metric_value) as max_value
FROM run_metrics
WHERE metric_name = 'duration_ms'
GROUP BY DATE(recorded_at), metric_name
ORDER BY day DESC;