Debugging Production When You Forgot to Add Logging


It’s 2am. Production is broken. Users are reporting timeout errors on critical functionality. You check the logs and discover… there are no useful logs for this code path. Past you, confident that this simple function would never have problems, didn’t add any logging.

Current you, debugging a production issue with no logs, regrets every life choice that led to this moment.

This situation is common enough that I’ve developed systematic approaches for debugging production problems when you lack the logging and monitoring you should have built in the first place.

What You Actually Have

Even without proper logging, production systems usually have some observability:

Error rates and patterns: Your monitoring system (you have one, right?) shows when errors spiked. This gives you timing information to correlate with deployments, traffic patterns, or external events.

Application metrics: CPU, memory, request rates, database query times. These don’t tell you what’s wrong but show symptoms—memory leak, database slowdown, threading issues.

Infrastructure logs: Even if application logging is poor, web server access logs, database slow query logs, and system logs often contain useful information.

User reports: Angry users provide data about what fails, when, under what circumstances. This qualitative information helps narrow down scenarios.

Production monitoring tool data: If you run an APM tool like New Relic, Datadog, or Sentry, it captures error context even when your custom logging is absent.

Database state: You can query the production database for data patterns, recent changes, and problematic records that might trigger bugs.

Start by gathering all available information before panicking about what you don’t have.
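Infrastructure logs in particular can stand in for missing application logs. A minimal sketch, assuming nginx-style combined-format access logs (adjust the regex for your server's actual format), that buckets 5xx responses by minute so you can correlate the spike with deploys or external events:

```python
import re
from collections import Counter

# Matches the timestamp (truncated to the minute) and the status code in a
# combined-format access log line. The format is an assumption; adapt as needed.
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})[^\]]*\] "[^"]*" (\d{3})')

def error_spikes(log_lines):
    """Count 5xx responses per minute to correlate with deploys or events."""
    buckets = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and m.group(2).startswith("5"):
            buckets[m.group(1)] += 1  # key is the timestamp down to the minute
    return buckets
```

Sorting the resulting counter by key gives you a rough error timeline without touching the application at all.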

The Reproduction Strategy

You can’t debug what you can’t reproduce. Without logs, reproduction becomes critical:

Try to reproduce in staging: If staging uses production-like data and configuration, can you trigger the issue there? Staging usually has better logging/debugging access.

Identify user-reported patterns: Do errors happen for specific users, specific data, specific times? Finding the commonality helps create reproduction cases.

Use production queries carefully: Query production database to find records, users, or states that trigger errors. Be careful—running expensive queries on production during incidents makes things worse.

Simulate production scenarios locally: If you can identify the data or state that triggers bugs, recreate that locally where you can debug freely.

Sometimes you get lucky and can reproduce easily. Often you can’t—the bug is timing-dependent, environment-specific, or requires production data you can’t access.
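When you do find a candidate pattern, a throwaway harness that replays suspect records through the code path locally is often enough to confirm it. A sketch, where `process_order` and the record shapes are hypothetical stand-ins for your real code and data:

```python
def process_order(record):
    # Hypothetical stand-in for the real code path under suspicion.
    return record["total"] / record["item_count"]

suspect_records = [
    {"total": 100.0, "item_count": 4},  # normal case pulled from production
    {"total": 100.0, "item_count": 0},  # edge case pulled from production
]

for record in suspect_records:
    try:
        process_order(record)
        print("ok:", record)
    except Exception as exc:
        print("reproduced failure:", record, "->", exc)
```

Once a record reproduces the failure locally, you can step through it in a debugger instead of guessing at production.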

The Informed Hypothesis Approach

Without logs, you’re essentially doing scientific debugging—forming hypotheses about what’s wrong and testing them:

Identify recent changes: What deployed recently? Code changes, configuration changes, infrastructure changes, dependency updates—anything that changed before errors started.

Review suspicious code: If errors point to specific endpoints or functions, review that code carefully. Look for edge cases, race conditions, null handling, error cases.

Check dependencies: External service failures, API changes, database issues, cache problems—these often cause production failures that wouldn’t happen in development.

Consider resource constraints: Production might hit memory limits, connection pool exhaustion, rate limits that never trigger in development or staging.

Think about data differences: Production data might include edge cases development data doesn’t have—null values, empty strings, very large sets, special characters, unexpected formats.

Form specific, testable hypotheses rather than vague guesses. “The bug is probably in the payment code” isn’t useful. “The payment code fails when processing refunds over $10,000 due to decimal precision handling” is testable.
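A hypothesis that specific can be tested in isolation. A sketch of the general class of bug (the invoice scenario is hypothetical): float arithmetic accumulates binary rounding error that breaks equality checks, while Decimal does exact base-10 math:

```python
from decimal import Decimal

def totals_match_float(items, expected):
    # Hypothetical check: do the line-item amounts sum to the invoice total?
    return sum(items) == expected

def totals_match_decimal(items, expected):
    # Same check with exact base-10 arithmetic; amounts passed as strings.
    return sum(Decimal(i) for i in items) == Decimal(expected)

# Floats fail the classic case; Decimal passes it.
# totals_match_float([0.10, 0.20], 0.30)          -> False
# totals_match_decimal(["0.10", "0.20"], "0.30")  -> True
```

Ten minutes with a test like this either confirms or kills the hypothesis, which beats hours of staring at production.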

The Strategic Logging Addition

You need logs to debug effectively, but deploying comprehensive logging during an incident is risky. Add logging strategically:

Log entry and exit of suspect functions: Minimal logging showing that a function was called and whether it completed successfully narrows down where failures happen.

Log external calls: API calls, database queries, cache operations—log these with timing information to identify slow or failing external dependencies.

Log state at decision points: Where code branches based on conditions, log what those conditions are. This reveals which code paths execute.

Use log levels appropriately: Debug level for detailed tracing, info for flow tracking, error for actual problems. Don’t log everything at error level.

Deploy incrementally: Add minimal logging, deploy, observe. Add more logging where needed, deploy again. Gradual addition reduces risk of logging itself causing problems.

Be cautious about logging sensitive data. Even in crisis, don’t log passwords, credit cards, personal information. Scrub or hash sensitive fields.
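The entry/exit, timing, and scrubbing points above can be combined in one small decorator. A sketch using the standard `logging` module; the logger name and the `SENSITIVE` field list are assumptions to adapt:

```python
import functools
import logging
import time

log = logging.getLogger("incident")  # logger name is arbitrary

SENSITIVE = {"password", "card_number", "ssn"}  # extend for your data

def scrub(kwargs):
    """Mask sensitive fields before they reach the logs."""
    return {k: "***" if k in SENSITIVE else v for k, v in kwargs.items()}

def traced(func):
    """Minimal entry/exit logging: enough to see where a request dies."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("enter %s kwargs=%s", func.__name__, scrub(kwargs))
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            log.info("exit %s ok in %.3fs", func.__name__,
                     time.monotonic() - start)
            return result
        except Exception:
            log.exception("exit %s FAILED after %.3fs", func.__name__,
                          time.monotonic() - start)
            raise
    return wrapper
```

Decorating only the handful of suspect functions keeps the deploy small and the log volume manageable.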

The Database Investigation

Production databases often reveal problems code analysis doesn’t:

Recent record patterns: Query for records created or modified around when errors started. Problematic data often correlates with error timing.

Data quality issues: Look for null values where you expected data, unexpected enums, strings too long or short, references to deleted records.

Stuck state machines: If your application uses status fields to track workflow, check for records stuck in unexpected states, which indicates code failures.

Orphaned records: Missing foreign-key targets, or records that should exist but don’t, can cause crashes.

Performance patterns: Slow queries, large result sets, missing indexes might not cause errors directly but can trigger timeouts that appear as application failures.

Run queries as read-only transactions during investigations to avoid accidentally breaking things further.

The External Dependency Check

Production failures often stem from external dependencies changing behavior:

API changes: Third-party APIs update without notice, change response formats, add validation, rate limit differently.

Service degradation: External services slow down or partially fail without returning errors, causing timeouts in your application.

Certificate expiration: SSL certificates expire, breaking API calls. This happens embarrassingly often.

DNS issues: DNS configuration changes or failures cause connection problems that look like application bugs.

Cloud service problems: AWS, Azure, GCP have outages affecting specific regions or services. Check status pages.

Database replica lag: If you read from replicas, lag can cause recently written data to appear missing, breaking workflows that expect immediate consistency.

Check external dependency health before assuming the problem is in your code.
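Certificate expiry, at least, is easy to check programmatically. A sketch using the standard `ssl` module; in production you would get the `notAfter` string from `ssl.SSLSocket.getpeercert()`:

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """How many days until a certificate expires.

    not_after is the 'notAfter' string from getpeercert(),
    e.g. 'Jan  5 09:34:43 2030 GMT'.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400  # seconds per day
```

A negative result means the certificate already expired, which is exactly the kind of embarrassingly common failure mentioned above.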

The Temporary Workaround

While diagnosing and fixing root causes, implement temporary workarounds to restore service:

Increased timeouts: If the issue is timing-related, temporarily longer timeouts might reduce user impact while you debug.

Retry logic: If the failures are transient, adding retry logic might paper over the problem temporarily.

Feature flags: Disable the broken feature for all or most users while you fix it. Degraded service beats broken service.

Database hotfixes: If you’ve identified problematic records, manually fixing them might restore service immediately while you fix code.

Traffic routing: Route requests to different servers, regions, or versions that aren’t exhibiting the problem.

Workarounds aren’t fixes. Document them clearly and ensure you actually address root causes after restoring service.
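The retry workaround, for example, can be a few lines wrapped around the flaky call. A sketch with exponential backoff and jitter; the attempt count and delays are assumptions to tune:

```python
import random
import time

def with_retries(func, attempts=3, base_delay=0.5):
    """Temporary workaround for transient failures: retry with exponential
    backoff plus jitter. Not a fix; remove once the root cause is addressed."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; let the original error surface
            # Back off 0.5s, 1s, 2s, ... with jitter to avoid thundering herds.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Usage is `with_retries(lambda: call_flaky_service())`; the final failure still propagates, so the problem stays visible.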

The Post-Mortem Logging Plan

After fixing the immediate problem, address the logging deficiency that made debugging hard:

Comprehensive logging strategy: Define what should be logged, at what levels, in which parts of the system.

Structured logging: JSON-formatted logs with consistent fields enable better analysis than freeform text.

Correlation IDs: Track requests across services, logs, and systems using correlation identifiers.

Performance monitoring: Add APM instrumentation to catch slow operations even when they don’t fail.

Alerting on patterns: Set up alerts for error patterns, anomalies, and concerning metrics before they become crises.

The best time to add logging was before the incident. The second best time is immediately after, before you forget the pain of debugging without it.
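Structured logging with correlation IDs is a small amount of code with the standard `logging` module. A sketch; the field names in the JSON payload are a convention to adapt to your log pipeline:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent fields."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate one correlation ID per request and attach it to every log line,
# so a single request can be traced across services and systems.
correlation_id = str(uuid.uuid4())
logger.info("payment started", extra={"correlation_id": correlation_id})
```

Consistent JSON fields mean your log aggregator can filter and group on `correlation_id` instead of grepping freeform text.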

Lessons for Next Time

Every production debugging nightmare teaches lessons:

Log proactively: Assume your code will break in mysterious ways and future you will need to debug it. Add logging while writing code.

Test with realistic data: Development data should include edge cases, realistic volumes, and problematic patterns found in production.

Staging should mirror production: Configuration, data volume, external dependencies—staging should be as production-like as feasible.

Monitor everything: Metrics, logs, traces, errors—comprehensive observability catches problems before users do.

Document system behavior: When you fix issues, document what went wrong and how you diagnosed it. Future incidents often have similarities.

Debugging production without logs is possible but painful. It requires systematic hypothesis testing, careful use of available information, and strategic addition of observability. The real lesson is avoiding this situation by investing in proper logging, monitoring, and testing before production issues force you to debug blind.

Next time you’re writing code and think “this is simple, it won’t need logging,” remember this post and add the damn logs anyway. Future you will thank current you.