Observability Guide¶
This guide documents the monitoring, debugging, and analytics infrastructure for GitTinkerer. It covers error capture via Sentry, metrics collection and export, run artifact inspection, Redis caching, rate limiting, and structured logging patterns.
Quick Start¶
Key Monitoring Tools¶
| Tool | Purpose | Location |
|---|---|---|
| Sentry | Error tracking and issue aggregation | sentry.io |
| Redis | Run status caching, rate limit counters | REDIS_HOST:REDIS_PORT |
| Metrics DB | Token usage, timing, completion metrics | PostgreSQL run_metrics table |
| Analytics API | Metrics aggregation and CSV export | /api/analytics/* endpoints |
| Artifacts | Run logs, diffs, PR comments | artifacts/<TIMESTAMP>/ |
| Logs | Structured request/error logs | Fastify logger (Pino) |
Environment Variables for Observability¶
# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1 # Percentage of transactions to trace
# Redis cache and rate limiting
REDIS_HOST=localhost
REDIS_PORT=6379
# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here
# Rate limiting
RATE_LIMIT_MAX=5 # Max requests per window
RATE_LIMIT_WINDOW=60 # Window duration in seconds
# Logging
NODE_ENV=production # Controls log level (info) vs development (debug)
1. Sentry Integration: Error Capture¶
GitTinkerer captures errors at multiple layers: TypeScript service, bash scripts, and webhook handling.
Service Layer Error Capture¶
Location: service/src/infra/sentry/
The SentryService class initializes and manages Sentry integration:
// Initialization on service startup
const sentry = new SentryService(env.SENTRY_DSN, {
environment: env.SENTRY_ENVIRONMENT,
tracesSampleRate: env.SENTRY_TRACES_SAMPLE_RATE
});
// Global error hook captures unhandled exceptions
app.setErrorHandler(async (error, request, reply) => {
sentry.captureException(error, {
requestId: request.id,
path: request.url
});
});
// Request context tracking
sentry.withContext({
userId: conversationId,
metadata: { repo, pr_number }
});
Key Methods:
- `captureException(error, context)` — Send exception with metadata (requestId, path)
- `captureMessage(msg, level, context)` — Log info/warning/error with context
- `recordBreadcrumb(action, data)` — Record event for post-hoc debugging
- `setUser(userId)` — Associate subsequent errors with a user
- `withContext(data)` — Add arbitrary key-value context to error reports
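The sketch below shows how a usecase might combine these methods; the import path, function, and argument values are illustrative rather than taken from the codebase.

```typescript
// Hypothetical call sites; method names follow the list above, the import path is assumed.
import { SentryService } from '../infra/sentry';

async function handleRun(
  sentry: SentryService,
  conversationId: string,
  repo: string,
  prNumber: number
): Promise<void> {
  // Associate subsequent errors with the initiating conversation
  sentry.setUser(conversationId);
  sentry.withContext({ metadata: { repo, pr_number: prNumber } });

  // Breadcrumbs build a timeline that is attached if an exception is captured later
  sentry.recordBreadcrumb('run.start', { repo, prNumber });

  try {
    // ... run the agent ...
    sentry.captureMessage('Run completed', 'info', { repo });
  } catch (err) {
    sentry.captureException(err as Error, { repo, pr_number: prNumber });
    throw err;
  }
}
```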
Bash Layer Error Capture¶
Location: lib/sentry.sh
Bash scripts send errors directly to Sentry API:
Payload Structure:
{
"message": "Deploy failed",
"level": "error",
"timestamp": 1672743986,
"tags": {
"stage": "production",
"run_id": "<PAYLOAD_RUN_ID>",
"repo": "owner/repo",
"pr_number": "123",
"source": "workflow|webhook"
},
"user": {
"id": "<PAYLOAD_WEB_CONVERSATION_ID>",
"username": "<USER>"
},
"extra": {
"payload_source": "github",
"stage": "production"
}
}
Authentication Header:
X-Sentry-Auth: Sentry sentry_key=<key>, sentry_version=7, sentry_timestamp=<ts>, sentry_client=gittinkerer-cli/1.0
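For reference, the HTTP call the bash helper makes can be expressed in TypeScript roughly as follows; the DSN parsing and endpoint construction mirror the header and payload shown above, while the function itself is a sketch rather than the actual lib/sentry.sh logic.

```typescript
// Sketch of the legacy store-endpoint call (details of lib/sentry.sh may differ).
async function sendSentryEvent(dsn: string, payload: Record<string, unknown>): Promise<void> {
  // DSN format: https://<key>@<host>/<projectId>
  const url = new URL(dsn);
  const key = url.username;
  const projectId = url.pathname.replace('/', '');
  const endpoint = `${url.protocol}//${url.host}/api/${projectId}/store/`;

  const ts = Math.floor(Date.now() / 1000);
  const auth = [
    `Sentry sentry_key=${key}`,
    'sentry_version=7',
    `sentry_timestamp=${ts}`,
    'sentry_client=gittinkerer-cli/1.0'
  ].join(', ');

  await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-Sentry-Auth': auth },
    body: JSON.stringify(payload)
  });
}
```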
Error Capture Flow Diagram¶
graph TD
A["Service Request"] --> B{"Error Occurs?"}
B -->|Yes| C["Global Error Hook<br/>captureException"]
C --> D["Add Request Context<br/>requestId, path, userId"]
D --> E["SentryService.captureException"]
E --> F["Sentry API<br/>POST /api/<project_id>/store"]
F --> G["Sentry Dashboard<br/>Error Aggregation"]
H["Bash Script"] --> I{"Error/Exit?"}
I -->|Yes| J["sentry_capture_message"]
J --> K["Build API Endpoint<br/>from DSN"]
K --> L["HTTP POST<br/>with Auth Header"]
L --> F
M["Webhook Handler<br/>GitHub Event"]
M --> N{"Validation Fails?"}
N -->|Yes| O["captureException<br/>with webhookId"]
O --> E
style G fill:#4CAF50
style F fill:#2196F3
style A fill:#FFC107
style H fill:#FFC107
style M fill:#FFC107
Configuring Sentry¶
File: service/src/config/env.ts
SENTRY_DSN: z.string().url().optional(), // Project DSN
SENTRY_ENVIRONMENT: z.enum(['development', 'production']).default('production'),
SENTRY_TRACES_SAMPLE_RATE: z.number().min(0).max(1).default(0.1)
To Disable Sentry (e.g., local development):
- Leave SENTRY_DSN unset
- The service will no-op all Sentry calls but still log errors via console.error
Monitoring Sentry Alerts¶
Key Metrics to Watch:
- Error Rate: Spikes > 10% above baseline
- New Issues: Watch for new issues and issues marked "Regressed"
- Affected Users: If `userId` context is set, track which users see errors
- Error Distribution: By `stage` tag (production, staging) and `repo` tag
Set Alerts For:
- High Error Volume: 50+ errors in 5 minutes
- Critical Path Failures: Issues in the `handleIssueComment` and `handlePullRequest` usecases
- Infrastructure Errors: Redis connection failures, database timeouts
- Rate Limit Breaches: Accumulation of 429 responses in webhook handler
2. Metrics Collection and Storage¶
GitTinkerer collects performance metrics at both bash and service layers, storing them in PostgreSQL for analysis.
Bash-Level Metrics (lib/metrics.sh)¶
Token Approximation:
# Estimates tokens from text
approximate_token_count "Long text here"
# Formula: max(word_count, ceil(char_count / 4))
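A direct TypeScript translation of the formula, with the arithmetic worked through for a small input (the bash implementation may differ in how it splits words):

```typescript
// max(word_count, ceil(char_count / 4)), mirroring the formula above.
function approximateTokenCount(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  const chars = text.length;
  return Math.max(words, Math.ceil(chars / 4));
}

// "Long text here": 3 words, 14 chars -> max(3, ceil(14 / 4)) = max(3, 4) = 4
console.log(approximateTokenCount('Long text here')); // 4
```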
Diff Metrics:
# Counts additions and removals from unified diff
calculate_diff_loc
# Additions: lines matching ^(\+[^+]|\ No newline at end of file)
# Removals: lines matching ^-[^-]
# Returns: total_loc (additions + removals)
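A rough TypeScript equivalent of the counting rules described above; the exact regexes in lib/metrics.sh may differ slightly:

```typescript
// Count added/removed lines in a unified diff, excluding the +++/--- file headers.
function calculateDiffLoc(diff: string): { additions: number; removals: number; totalLoc: number } {
  const lines = diff.split('\n');
  // '+' or '-' followed by a non-repeated character, so '+++'/'---' headers do not count
  const additions = lines.filter((l) => /^\+[^+]/.test(l)).length;
  const removals = lines.filter((l) => /^-[^-]/.test(l)).length;
  return { additions, removals, totalLoc: additions + removals };
}
```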
Metrics Queue:
# Queue metrics in memory during run
record_metric "tokens_used" 1250
record_metric "files_changed" 3
record_metric "duration_ms" 5432
# Flush all queued metrics to API after run completes
flush_metrics
# POST to $METRICS_API_URL with Bearer token
Service-Level Metrics (service/src/usecases/metrics/)¶
Supported Metric Names:
| Metric | Unit | Meaning |
|---|---|---|
| `duration_ms` | milliseconds | Total execution time |
| `exit_code` | integer | Process exit code (0 = success) |
| `tokens_used` | count | LLM tokens consumed by completion |
| `files_changed` | count | Number of files modified in diff |
| `cost_usd` | USD | Computed cost (tokens_used × 0.002 / 1000) |
Recording Metrics:
// In usecases that need to track performance
import { MetricsService } from '../../services/metrics';
const metrics = new MetricsService(pool); // PostgreSQL pool
await metrics.recordMetric(runId, 'duration_ms', 5432);
await metrics.recordMetric(runId, 'tokens_used', 1250);
await metrics.recordMetric(runId, 'cost_usd', 2.50);
Database Schema¶
-- Metrics storage
CREATE TABLE run_metrics (
id BIGSERIAL PRIMARY KEY,
run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE,
metric_name TEXT NOT NULL,
metric_value NUMERIC,
recorded_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX run_metrics_run_idx ON run_metrics (run_id);
CREATE INDEX run_metrics_name_idx ON run_metrics (metric_name);
CREATE INDEX run_metrics_recorded_at_idx ON run_metrics (recorded_at DESC);
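As a sketch of how recording ties into this schema, a minimal recordMetric can be a single parameterized insert (assuming a node-postgres Pool; the real MetricsService may add validation and batching):

```typescript
import { Pool } from 'pg';

// Hypothetical shape; the actual MetricsService implementation may differ.
class MetricsService {
  constructor(private readonly pool: Pool) {}

  async recordMetric(runId: string, name: string, value: number): Promise<void> {
    // Parameterized insert into the run_metrics table shown above
    await this.pool.query(
      'INSERT INTO run_metrics (run_id, metric_name, metric_value) VALUES ($1, $2, $3)',
      [runId, name, value]
    );
  }
}
```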
Metrics Collection Flow Diagram¶
graph LR
A["Bash Script<br/>bin/gittinkerer"] --> B["record_metric<br/>Queue in Memory"]
B --> C["Calculate Metrics<br/>tokens, diff LOC, duration"]
C --> D["Run Completes"]
D --> E["flush_metrics<br/>POST to API"]
E --> F["Service Handler<br/>/api/metrics/record"]
G["Service Usecase"] --> H["MetricsService.recordMetric"]
H --> I["INSERT run_metrics<br/>PostgreSQL"]
F --> I
J["Database"] --> K["run_metrics Table"]
I --> K
style I fill:#4CAF50
style K fill:#2196F3
style E fill:#FF9800
style F fill:#FF9800
3. Analytics and CSV Export¶
The analytics API provides aggregation and export of metrics for operational analysis.
Analytics Endpoints¶
Endpoint: GET /api/analytics/metrics
Aggregate metrics across runs:
Response:
{
"metric": "duration_ms",
"aggregation": "avg",
"value": 4852.5,
"count": 24,
"min": 1200,
"max": 8950
}
Supported Aggregations: sum, avg, min, max, count, stddev
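An illustrative client call for this endpoint; the query parameter names (metricName, aggregation) and the Bearer auth header are inferred from the export endpoint and the response shape above, so verify them against the route definition:

```typescript
// Response fields taken from the sample response above; query parameters are assumptions.
interface MetricAggregate {
  metric: string;
  aggregation: string;
  value: number;
  count: number;
  min: number;
  max: number;
}

async function fetchAverageDuration(baseUrl: string, token: string): Promise<MetricAggregate> {
  const res = await fetch(
    `${baseUrl}/api/analytics/metrics?metricName=duration_ms&aggregation=avg`,
    { headers: { Authorization: `Bearer ${token}` } } // auth scheme assumed
  );
  if (!res.ok) throw new Error(`Analytics request failed: ${res.status}`);
  return (await res.json()) as MetricAggregate;
}
```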
CSV Export Endpoint¶
Endpoint: GET /api/analytics/export
Export raw metrics in CSV format:
CSV Output:
id,run_id,metric_name,metric_value,recorded_at
1,550e8400-e29b-41d4-a716-446655440000,duration_ms,5432,2026-01-07T14:32:15.000Z
2,550e8400-e29b-41d4-a716-446655440001,duration_ms,4120,2026-01-07T14:35:22.000Z
3,550e8400-e29b-41d4-a716-446655440002,tokens_used,1250,2026-01-07T14:38:10.000Z
Key Analytics Queries for Operators¶
Monitor Average Completion Time:
Track Token Usage Trends:
Export Last 1000 Runs for Cost Analysis:
Identify Slow Runs:
GET /api/analytics/export?metricName=duration_ms&format=csv&limit=100
// Then filter locally for duration_ms > 10000
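A sketch of that local filtering step, using the column order from the sample CSV output above (the threshold and limit values are illustrative):

```typescript
// Identify slow runs from the exported CSV (columns: id,run_id,metric_name,metric_value,recorded_at).
async function findSlowRuns(baseUrl: string, thresholdMs = 10_000): Promise<string[]> {
  const res = await fetch(
    `${baseUrl}/api/analytics/export?metricName=duration_ms&format=csv&limit=100`
  );
  const csv = await res.text();

  return csv
    .trim()
    .split('\n')
    .slice(1) // skip the header row
    .map((line) => line.split(','))
    .filter((cols) => Number(cols[3]) > thresholdMs) // metric_value column
    .map((cols) => cols[1]); // run_id column
}
```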
Analytics Flow Diagram¶
graph TD
A["Metrics Recorded<br/>run_metrics Table"] --> B["Operator Query<br/>GET /api/analytics/*"]
B --> C{"Export or<br/>Aggregate?"}
C -->|Export CSV| D["SELECT run_id, metric_name,<br/>metric_value, recorded_at"]
D --> E["Format as CSV<br/>Send with Content-Type: text/csv"]
E --> F["Operator<br/>Import to Analysis Tool"]
C -->|Aggregate| G["SELECT metric_name,<br/>Aggregation Function"]
G --> H["Calculate:<br/>sum, avg, min, max, count"]
H --> I["Return JSON<br/>with Statistics"]
I --> F
F --> J["Analysis<br/>Dashboard, Alerts"]
style A fill:#2196F3
style J fill:#4CAF50
style E fill:#FF9800
style I fill:#FF9800
4. Run Artifacts and Debugging Failed Runs¶
Each run produces timestamped artifacts containing logs, diffs, and metadata for post-mortem analysis.
Artifacts Directory Structure¶
artifacts/
├── 20260103T142626Z/ # Timestamp: YYYYMMDDTHHMMSSZ
│ ├── payload.json # Original GitHub webhook payload
│ ├── run.json # Run metadata and status
│ └── agent-run/
│ ├── prompt.txt # Full prompt sent to LLM
│ ├── diff.patch # Unified diff of proposed changes
│ ├── files_changed.json # Array of modified files
│ ├── pr_comment.txt # Comment posted to PR
│ ├── summary.md # Human-readable summary
│ └── commit_sha.txt # Committed SHA (if successful)
└── runs/
└── github-deliveries/ # Webhook delivery logs
Run Metadata (run.json)¶
{
"run_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "succeeded",
"timestamp": "20260103T142626Z",
"payload_path": "/path/to/artifacts/payload.json",
"artifacts_dir": "/path/to/artifacts/20260103T142626Z/",
"started_at": "2026-01-03T14:26:26.695Z",
"finished_at": "2026-01-03T14:26:50.476Z",
"exit_code": 0
}
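When scripting post-mortems, it helps to give run.json a type; the interface below simply mirrors the fields in the example above:

```typescript
import { readFile } from 'node:fs/promises';

// Field names mirror the run.json example above.
interface RunMetadata {
  run_id: string;
  status: string; // e.g. "succeeded" or "failed"
  timestamp: string;
  payload_path: string;
  artifacts_dir: string;
  started_at: string;
  finished_at: string;
  exit_code: number;
}

async function loadRunMetadata(path: string): Promise<RunMetadata> {
  return JSON.parse(await readFile(path, 'utf8')) as RunMetadata;
}
```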
Debugging Failed Runs¶
Step 1: Check Run Status
Query the runs table:
SELECT run_id, status, source, created_at, finished_at
FROM runs
WHERE status = 'failed'
ORDER BY created_at DESC
LIMIT 10;
Step 2: Review Artifacts
# Find artifacts by run_id or timestamp
ls -la artifacts/20260103T142626Z/
# Check run metadata
cat artifacts/20260103T142626Z/run.json
# Review diff that was proposed (if generation succeeded)
cat artifacts/20260103T142626Z/agent-run/diff.patch
# Check error details in run.json or Sentry
Step 3: Check Sentry for Errors
Look up errors by run_id tag:
Step 4: Analyze Common Failure Points
| Failure Point | Check | Resolution |
|---|---|---|
| Payload validation | `payload.json` structure, required fields | Verify webhook event type (issue, pull_request) |
| LLM timeout | `agent-run/prompt.txt`, Sentry timeout errors | Check LLM API status, increase timeout |
| Git operations | Check `exit_code` in `run.json`, Sentry tags | Verify SSH keys, branch permissions |
| PR comment posting | Check `pr_comment.txt` exists, Sentry 403 errors | Verify GitHub token scope, PR state |
| Rate limiting | Check metrics for 429 responses | Monitor /api/rate-limit endpoint |
Artifacts Flow Diagram¶
graph TD
A["GitHub Event<br/>Issue/PR Comment"] --> B["Create Run<br/>run_id, timestamp"]
B --> C["Create artifacts/<br/>TIMESTAMP/ Directory"]
C --> D["Store payload.json<br/>Original Webhook"]
C --> E["Store run.json<br/>Metadata, Status"]
F["Execute Agent"] --> G["Generate LLM Response"]
G --> H["Create agent-run/<br/>subdirectory"]
H --> I["Store prompt.txt<br/>diff.patch<br/>files_changed.json"]
F --> J{"Success?"}
J -->|Yes| K["Store pr_comment.txt<br/>commit_sha.txt"]
J -->|No| L["Store Error Message<br/>Send to Sentry"]
K --> M["Update run.json<br/>status=succeeded"]
L --> N["Update run.json<br/>status=failed"]
O["Operator Debugging"] --> P["Query runs Table"]
P --> Q["Inspect artifacts/<br/>TIMESTAMP/"]
Q --> R["Check run.json Status"]
R --> S["Review Sentry<br/>by run_id Tag"]
style B fill:#2196F3
style M fill:#4CAF50
style N fill:#F44336
style S fill:#FF9800
5. Redis Cache: Run Status and Rate Limiting¶
Redis serves dual roles: caching run status for rapid polling and enforcing rate limits on webhook processing.
Run Status Caching¶
Location: service/src/services/redis.ts
Cache Configuration:
const redis = new Redis({
host: env.REDIS_HOST,
port: env.REDIS_PORT,
retryStrategy: (times) => Math.min(times * 50, 5000),
disableOfflineQueue: true // Fail fast if disconnected
});
Cache Operations:
// Set cached status with 24-hour TTL
const TTL_SECONDS = 86400;
await redis.setex(
`run-status:${runId}`,
TTL_SECONDS,
JSON.stringify({ status, timestamp })
);
// Get cached status (avoids database hit)
const cached = await redis.get(`run-status:${runId}`);
// Delete when run is archived
await redis.del(`run-status:${runId}`);
Cache Key Pattern: run-status:<runId>
TTL: 86,400 seconds (24 hours)
Usecase: `service/src/usecases/runs/getRun.ts` checks the cache before the database during rapid polling, reducing DB load during status-check storms.
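A simplified version of that cache-aside pattern; the actual getRun usecase likely adds error handling and richer typing:

```typescript
import Redis from 'ioredis';
import { Pool } from 'pg';

const TTL_SECONDS = 86_400; // 24 hours, matching the cache configuration above

// Cache-aside read: Redis first, PostgreSQL on a miss, then repopulate the cache.
async function getRunStatus(redis: Redis, pool: Pool, runId: string): Promise<string | null> {
  const cached = await redis.get(`run-status:${runId}`);
  if (cached) return (JSON.parse(cached) as { status: string }).status;

  const { rows } = await pool.query('SELECT status FROM runs WHERE run_id = $1', [runId]);
  if (rows.length === 0) return null;

  await redis.setex(
    `run-status:${runId}`,
    TTL_SECONDS,
    JSON.stringify({ status: rows[0].status, timestamp: Date.now() })
  );
  return rows[0].status;
}
```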
Monitoring Cache Hit Rate¶
Metrics to Track:
| Metric | Query | Target |
|---|---|---|
| Cache Hits | `INFO stats` → `keyspace_hits` | > 80% of queries |
| Cache Misses | `INFO stats` → `keyspace_misses` | < 20% of queries |
| Evictions | `INFO stats` → `evicted_keys` | Should be 0 (TTL-based expiry) |
| Memory Usage | `INFO memory` → `used_memory_human` | < 512 MB for run cache |
Redis Commands for Operators:
# Connect to Redis
redis-cli -h <REDIS_HOST> -p <REDIS_PORT>
# Check cache statistics
> INFO stats
# Monitor memory usage
> INFO memory
# Find all run-status keys
> KEYS run-status:*
# Check TTL of a specific key
> TTL run-status:550e8400-e29b-41d4-a716-446655440000
# Clear all run cache (if needed for maintenance)
> EVAL "return redis.call('del', unpack(redis.call('keys', 'run-status:*')))" 0
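To turn those INFO counters into a single hit-rate number, something like the following works with ioredis (the field names are those reported by `INFO stats`):

```typescript
import Redis from 'ioredis';

// Parse keyspace_hits / keyspace_misses out of `INFO stats` and compute the hit rate.
async function cacheHitRate(redis: Redis): Promise<number> {
  const stats = await redis.info('stats');
  const read = (field: string): number => {
    const match = stats.match(new RegExp(`${field}:(\\d+)`));
    return match ? Number(match[1]) : 0;
  };

  const hits = read('keyspace_hits');
  const misses = read('keyspace_misses');
  const total = hits + misses;
  return total === 0 ? 0 : hits / total; // target: > 0.8
}
```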
Rate Limiting¶
Location: service/src/services/redis.ts, service/src/middleware/rateLimit.ts
Two-Layer Rate Limiting Architecture:
Layer 1: Global IP-Based (Fastify Plugin)¶
// Registered in server.ts
app.register(fastifyRateLimit, {
max: env.RATE_LIMIT_MAX, // 5 requests per window
timeWindow: `${env.RATE_LIMIT_WINDOW}s` // 60 seconds
});
Returns HTTP 429 with headers:
- Retry-After: <seconds>
- X-RateLimit-Limit: 5
- X-RateLimit-Remaining: 0
Layer 2: Per-Repository Rate Limiting¶
// Custom check in usecases/github/handleIssueComment.ts
const result = await rateLimitService.checkLimit(repoName);
if (!result.allowed) {
return {
status: 429,
error: "Rate limited per repository",
retryAfter: result.retryAfter,
remainingRequests: result.remaining
};
}
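One common way to implement checkLimit on top of the Redis counters mentioned earlier is a fixed window built from INCR and EXPIRE; treat this as a sketch of the idea (including the key naming) rather than the service's exact logic:

```typescript
import Redis from 'ioredis';

interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number; // seconds until the window resets
}

// Fixed-window limiter: one counter per repository, expiring after the window.
// The key prefix is an assumption for illustration.
async function checkLimit(
  redis: Redis,
  repoName: string,
  max = Number(process.env.RATE_LIMIT_MAX ?? 5),
  windowSeconds = Number(process.env.RATE_LIMIT_WINDOW ?? 60)
): Promise<RateLimitResult> {
  const key = `rate-limit:repo:${repoName}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, windowSeconds); // start the window on the first hit
  }
  const ttl = await redis.ttl(key);
  return {
    allowed: count <= max,
    remaining: Math.max(0, max - count),
    retryAfter: ttl > 0 ? ttl : windowSeconds
  };
}
```

A fixed window is simple but allows short bursts at window boundaries; a sliding-window or token-bucket variant trades extra bookkeeping for smoother limiting.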
Rate Limit Configuration:
Rate Limiting Flow Diagram¶
graph TD
A["Webhook Arrives<br/>GitHub Event"] --> B["Global Rate Limit<br/>Check IP"]
B --> C{"IP Limit<br/>Exceeded?"}
C -->|Yes| D["Return 429<br/>Retry-After Header"]
C -->|No| E["Route to Handler<br/>handleIssueComment"]
E --> F["Per-Repo Rate Limit<br/>Check Repository"]
F --> G{"Repo Limit<br/>Exceeded?"}
G -->|Yes| H["Return 429<br/>Custom Response"]
G -->|No| I["Process Webhook<br/>Increment Counter"]
D --> J["Client Backoff"]
H --> J
J --> K["Retry After Window"]
K --> A
I --> L["Record Metric<br/>webhook_processed"]
L --> M["Send Response<br/>to GitHub"]
style D fill:#F44336
style H fill:#F44336
style M fill:#4CAF50
style I fill:#FFC107
Monitoring Rate Limits¶
Alerts for Operators:
- Sustained 429 Responses: > 10 in 5-minute window → Check for coordinated webhook deliveries
- Per-Repo Limit Breaches: Same repo hitting limit repeatedly → May indicate malicious activity or misconfigured webhook
- Redis Unavailable: The rate limit service should log the failure and fall back to allowing the request, but alert on repeated failures
6. Structured Logging¶
GitTinkerer uses Fastify's structured logging (Pino-based) with tag-prefixed console logs and Sentry integration.
Request Log Toggles¶
Request logging is off by default to reduce noisy health checks. Enable it with:
- `LOG_REQUESTS=true` to log non-health requests.
- `LOG_HEALTH_REQUESTS=true` to log `/health` requests explicitly.
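A sketch of how these toggles can be wired up as a Fastify onResponse hook; the environment handling here is illustrative and the real middleware may differ:

```typescript
import type { FastifyInstance } from 'fastify';

// Log completed requests, skipping /health unless explicitly enabled.
export function registerRequestLogging(app: FastifyInstance): void {
  const logRequests = process.env.LOG_REQUESTS === 'true';
  const logHealth = process.env.LOG_HEALTH_REQUESTS === 'true';

  app.addHook('onResponse', async (request, reply) => {
    const isHealth = request.url === '/health';
    if ((isHealth && !logHealth) || (!isHealth && !logRequests)) return;

    request.log.info(
      { method: request.method, url: request.url, statusCode: reply.statusCode },
      'request completed'
    );
  });
}
```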
Fastify Logger Configuration¶
Location: service/src/server.ts
const app = Fastify({
logger: {
level: env.nodeEnv === 'development' ? 'debug' : 'info',
serializers: {
req(request) {
return {
id: request.id,
method: request.method,
url: request.url,
remoteAddress: request.ip
};
},
res(reply) {
return { statusCode: reply.statusCode };
}
},
requestIdHeader: 'X-Request-ID',
requestIdLogLabel: 'requestId',
genReqId(req) {
return req.headers['x-request-id'] || generateUuid();
}
}
});
Log Levels and Usage¶
| Level | Usage | Environment |
|---|---|---|
| `debug` | Detailed execution flow, variable values | Development only |
| `info` | Service startup, successful operations | Production |
| `warn` | Recoverable issues, rate limit warnings | Both |
| `error` | Failures, unhandled exceptions | Both |
Tag-Prefixed Console Logging Patterns¶
Sentry Initialization:
Redis Connection:
[redis] Connecting to redis://localhost:6379
[redis] Connected successfully
[redis] Connection failed: ECONNREFUSED
Cache Operations:
[cache] Cache hit for run-status:550e8400
[cache] Cache miss for run-status:550e8400
[cache] Evicting stale entries: 3 keys
Rate Limiting:
[rate-limit] IP 192.168.1.1 limit check: 4/5 remaining
[rate-limit] Repo owner/repo limit exceeded: 0/5 remaining
Database Migrations:
Structured Log Context¶
All request logs include context from requestIdMiddleware:
// Child logger with requestId automatically included
const child = app.log.child({ requestId: request.id });
child.info('Processing issue comment');
// Output includes requestId in all subsequent logs for this request
// {
// "level": 30,
// "time": "2026-01-07T14:32:15.000Z",
// "requestId": "550e8400-e29b-41d4-a716-446655440000",
// "msg": "Processing issue comment"
// }
Error Logging with Sentry Integration¶
Location: service/src/controllers/errors.ts
app.setErrorHandler(async (error, request, reply) => {
// Structured error log
request.log.error({
err: error,
url: request.url,
method: request.method,
statusCode: reply.statusCode
});
// Send to Sentry (if enabled)
if (sentry.isEnabled()) {
sentry.captureException(error, {
requestId: request.id,
tags: { handler: 'global' }
});
}
});
Logging Flow Diagram¶
graph TD
A["Request Arrives"] --> B["Generate or<br/>Extract requestId"]
B --> C["Create Child Logger<br/>with requestId"]
C --> D["Middleware Layer"]
D --> E["Log Request:<br/>method, url, ip"]
E --> F["Route to Handler"]
F --> G{"Error?"}
G -->|No| H["Log Success<br/>statusCode"]
G -->|Yes| I["Log Error<br/>Error Object"]
I --> J["captureException<br/>to Sentry"]
J --> K["Return Error Response"]
H --> L["All logs tagged<br/>with requestId"]
K --> L
L --> M["Log Aggregation<br/>ELK/Datadog"]
M --> N["Operator Analysis<br/>by requestId"]
style C fill:#2196F3
style L fill:#4CAF50
style J fill:#FF9800
style M fill:#FF9800
Log Analysis for Operators¶
Find Logs by Request ID:
# If logs are in JSON format
cat logs/*.json | grep '"requestId":"550e8400-e29b-41d4-a716-446655440000"'
Correlate with Errors:
- Identify the error in Sentry
- Extract the `requestId` from the error details
- Query logs for that `requestId`
- Follow the request flow from entry to error
Track Request Latency:
# Grep for request start and end, calculate duration
grep "Processing issue comment" logs/prod.json | jq '.time'
grep "Completed with status 200" logs/prod.json | jq '.time'
# Calculate difference manually
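The same calculation can be scripted; this sketch groups newline-delimited JSON log lines by requestId and reports the spread between the first and last timestamp (it assumes `time` is an ISO string or epoch milliseconds):

```typescript
import { readFileSync } from 'node:fs';

// Group NDJSON log lines by requestId and compute first-to-last latency per request.
function requestLatencies(logPath: string): Map<string, number> {
  const spans = new Map<string, { first: number; last: number }>();

  for (const line of readFileSync(logPath, 'utf8').split('\n')) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line) as { requestId?: string; time?: string | number };
    if (!entry.requestId || entry.time === undefined) continue;

    const t = typeof entry.time === 'number' ? entry.time : Date.parse(entry.time);
    const seen = spans.get(entry.requestId);
    if (!seen) spans.set(entry.requestId, { first: t, last: t });
    else {
      seen.first = Math.min(seen.first, t);
      seen.last = Math.max(seen.last, t);
    }
  }

  // Latency in milliseconds per requestId
  return new Map([...spans].map(([id, { first, last }]) => [id, last - first] as [string, number]));
}
```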
7. Troubleshooting Guide for Operators¶
Symptom: Service Errors in Sentry¶
Diagnosis Steps:
- Check Error Frequency and Distribution:
  - Is this a recent regression or a persistent issue?
  - Is the error affecting all repositories or specific ones?
  - Check the `stage` and `repo` tags in Sentry
- Inspect Error Context:
  - Review the `requestId` in error breadcrumbs
  - Check `userId` if the error involves user-specific state
  - Note affected GitHub PR/issue numbers
- Correlate with Logs:
  - Use the `requestId` to find the full request lifecycle in logs
  - Check for related errors in the same request chain
  - Identify which operation failed (LLM call, git operation, API request)
- Check Resource Constraints:
  - Is Redis available? (Check `[redis]` logs)
  - Is the database responding? (Check query latencies in logs)
  - Is the external API (LLM, GitHub) available? (Check timeout errors in Sentry)
Symptom: High Metrics but Slow Completion Times¶
Diagnosis Steps:
- Query Metrics Endpoint:
- Identify Slow Operations:
  - Compare `duration_ms` from recent runs
  - Check if the slowdown is consistent or intermittent
- Analyze Bottleneck:
  - LLM Timeout: Check Sentry for timeout errors in `handlePullRequest`
  - Git Operations: Check exit codes in `run.json` files, look for permission errors
  - Database: Check PostgreSQL query logs for slow queries
  - Redis: Check connection delays with `redis-cli LATENCY LATEST`
- Export Detailed Data: Analyze the exported CSV to find percentiles and outliers.
Symptom: Redis Cache Misses Increasing¶
Diagnosis Steps:
- Check Redis Memory:
- Monitor Cache Statistics:
- Review Cache Keys:
- Check TTL Expiry:
Resolution:
- Increase the Redis memory allocation if `used_memory` is near the limit
- Reduce the TTL if cache bloat is the issue
- Monitor /api/rate-limit for spikes in request volume
Symptom: Rate Limiting Affecting Legitimate Webhooks¶
Diagnosis Steps:
- Check Rate Limit Metrics:
  - Query for 429 responses in metrics
  - Identify which repositories are affected
- Review Webhook Delivery History:
  - Check GitHub webhook deliveries in the repository settings
  - Identify whether deliveries are clustered or distributed
- Analyze Per-Repo Limit:
  - Check if a specific repository is hitting the limit repeatedly
  - Verify the `RATE_LIMIT_WINDOW` and `RATE_LIMIT_MAX` settings
- Inspect Recent Logs:
Resolution:
- Increase `RATE_LIMIT_MAX` if the limit is too strict
- Increase `RATE_LIMIT_WINDOW` to allow more time between bursts
- Configure per-repo whitelisting if a specific repo needs a higher quota
Symptom: Failed Runs Not Creating Artifacts¶
Diagnosis Steps:
- Check Run Status:
- Verify Artifacts Directory:
- Check Sentry for File System Errors:
  - Search for errors containing "ENOSPC", "EACCES", "EIO"
  - These indicate disk space, permissions, or I/O issues
- Inspect Run Handler:
  - Check `service/src/usecases/runs/createRun.ts`
  - Verify `artifacts_dir` is being set correctly
Resolution:
- Check available disk space: `df -h /path/to/artifacts`
- Verify directory permissions: `chmod 755 artifacts/`
- Check for I/O errors: `dmesg | tail -20`
Quick Troubleshooting Checklist¶
| Issue | Check | Command |
|---|---|---|
| Service down | Health check | `curl http://localhost:3000/health` |
| Sentry silent | Verify DSN | Check `SENTRY_DSN` env var in logs |
| Redis unavailable | Connection | `redis-cli ping` |
| Disk full | Storage | `df -h /` |
| Database down | Postgres | `psql -U user -d dbname -c "SELECT 1;"` |
| High latency | Network | `ping sentry.io`, `redis-cli LATENCY LATEST` |
| Rate limiting | Webhook volume | `grep "429" logs/*.json \| wc -l` |
| Cache stale | TTL | `redis-cli TTL run-status:<runId>` |
Appendix: Environment Variables Reference¶
Monitoring Configuration¶
# Sentry error tracking
SENTRY_DSN=https://[key]@sentry.io/[projectId]
SENTRY_ENVIRONMENT=production|staging|development
SENTRY_TRACES_SAMPLE_RATE=0.0-1.0 # Percentage of transactions to trace (default: 0.1)
# Redis cache and rate limiting
REDIS_HOST=localhost # Redis server hostname
REDIS_PORT=6379 # Redis server port
REDIS_PASSWORD= # Redis password (optional)
# Metrics collection
METRICS_API_URL=http://localhost:3000/api/metrics/record
METRICS_API_TOKEN=bearer-token-here
# Rate limiting
RATE_LIMIT_MAX=5 # Max requests per time window
RATE_LIMIT_WINDOW=60 # Time window in seconds
# Logging
NODE_ENV=production # Controls log level (info) vs development (debug)
LOG_LEVEL=info # Override log level if needed
Database Monitoring Queries¶
Run Status Distribution:
Metrics per Run:
SELECT run_id, COUNT(*) as metric_count,
MAX(recorded_at) as last_updated
FROM run_metrics
GROUP BY run_id
ORDER BY last_updated DESC
LIMIT 20;
Average Metrics by Day: