Microservices Communication Patterns: AI's Distributed Systems Blind Spots

Explore why AI-generated microservices code lacks circuit breakers, distributed tracing, and proper saga patterns. Learn to implement resilience with Resilience4j, OpenTelemetry, and service mesh architectures.

Introduction: The Distributed Monolith Problem

In 2025, microservices have become the dominant architecture pattern, with 61% of enterprises already using them. But there's a hidden danger: AI-generated microservices code often creates what architects call a "distributed monolith"—services that are technically separate but tightly coupled, lacking the resilience patterns that make distributed systems actually work.

According to a Camunda survey, 62% of organizations report that managing inter-service dependencies is a significant challenge. When AI generates microservices code, it typically produces naive implementations that work in development but fail catastrophically in production when network partitions occur, services become overloaded, or distributed transactions need coordination.

Key Statistics

  • 62% of organizations struggle with inter-service dependencies
  • 70% of companies run a service mesh
  • 79% use or consider OpenTelemetry
  • 41.3% projected CAGR for the service mesh market

Why AI Struggles with Microservices

Monolithic Thinking Persists

AI models are trained predominantly on monolithic application code. When asked to generate microservices, they apply monolithic patterns:

  • Synchronous everything: AI defaults to HTTP request/response, missing when async messaging is appropriate
  • No failure handling: Generated code assumes services are always available
  • Missing timeouts: Network calls without timeouts lead to thread exhaustion
  • No circuit breakers: A failing downstream service takes down the entire system
  • Tight coupling: Services directly call each other instead of communicating through events

Common AI Microservices Mistakes

Here's what AI typically generates versus what production-ready code looks like:

  • Service Calls: AI uses direct HTTP without retry; production code uses circuit breaker + retry + timeout
  • Distributed Transactions: AI ignores or uses two-phase commit; production uses Saga pattern with compensation
  • Observability: AI provides basic logging; production needs distributed tracing + metrics
  • Service Discovery: AI hardcodes URLs; production uses service registry or service mesh
  • Data Consistency: AI assumes strong consistency; production embraces eventual consistency patterns

// AI-Generated: Naive Service Call
// No timeout, no retry, no circuit breaker
async function getOrderWithUser(orderId: string) {
  const order = await fetch(`http://order-service/orders/${orderId}`)
    .then(r => r.json());

  // If user-service is down, entire request fails
  const user = await fetch(`http://user-service/users/${order.userId}`)
    .then(r => r.json());

  return { ...order, user };
}

// What happens when user-service is overloaded?
// - Thread blocks waiting for response
// - More requests pile up
// - Order service runs out of threads
// - Cascading failure across entire system

// Production-Ready: Resilient Service Call
import { CircuitBreaker, retry, timeout } from './resilience';

const userServiceBreaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000,
  fallback: () => ({ id: 'unknown', name: 'User Unavailable' })
});

async function getOrderWithUser(orderId: string) {
  const order = await retry(
    () => timeout(
      fetch(`http://order-service/orders/${orderId}`),
      5000 // 5 second timeout
    ),
    { maxAttempts: 3, backoff: 'exponential' }
  ).then(r => r.json());

  // Circuit breaker protects against cascading failure
  const user = await userServiceBreaker.execute(async () => {
    return timeout(
      fetch(`http://user-service/users/${order.userId}`),
      3000
    ).then(r => r.json());
  });

  return { ...order, user };
}
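The `timeout` and `retry` helpers imported from `./resilience` above are not a published library; here is a minimal sketch of what they might look like, with signatures matching the usage in the example:

```typescript
// resilience.ts (sketch) - helper combinators for the resilient client.

// Reject if the wrapped promise does not settle within `ms` milliseconds.
export function timeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

interface RetryOptions {
  maxAttempts: number;
  backoff: 'exponential' | 'fixed';
  baseDelayMs?: number;
}

// Takes a factory rather than a bare promise so each attempt
// issues a fresh request.
export async function retry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const base = opts.baseDelayMs ?? 100;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts) {
        const delay =
          opts.backoff === 'exponential' ? base * 2 ** (attempt - 1) : base;
        await new Promise((r) => setTimeout(r, delay));
      }
    }
  }
  throw lastError;
}
```

Production code would typically reach for a maintained library such as cockatiel or p-retry rather than hand-rolling these.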

Circuit Breaker Pattern with Resilience4j

The circuit breaker is the most critical pattern for microservices resilience. It prevents a single service failure from cascading through the entire system by "tripping" when failures exceed a threshold.

Circuit Breaker States

  • Closed: Requests flow normally. Failures are counted against a threshold.
  • Open: After threshold exceeded, requests immediately fail without calling the downstream service.
  • Half-Open: After a timeout, limited test requests are allowed to check if service recovered.
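The three states above can be sketched as a small state machine. This is a simplified model for illustration: real implementations such as Resilience4j use sliding-window failure rates rather than a bare counter, and the injectable clock exists only to make the sketch testable.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker<T> {
  private state: CircuitState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now // injectable clock
  ) {}

  get currentState(): CircuitState {
    return this.state;
  }

  async execute(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = 'HALF_OPEN'; // timeout elapsed: allow a trial request
      } else {
        throw new Error('CircuitOpenError: failing fast'); // protect downstream
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED'; // trial (or normal) call succeeded
  }

  private onFailure(): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'; // trip: fail fast until resetTimeoutMs passes
      this.openedAt = this.now();
    }
  }
}
```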

Resilience4j Implementation (Java/Spring Boot)

# application.yml
resilience4j:
  circuitbreaker:
    instances:
      userService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 30s
        failureRateThreshold: 50
        slowCallRateThreshold: 100
        slowCallDurationThreshold: 2s

  retry:
    instances:
      userService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException

  bulkhead:
    instances:
      userService:
        maxConcurrentCalls: 25
        maxWaitDuration: 0

// UserServiceClient.java
@Slf4j
@Service
public class UserServiceClient {

    private final WebClient webClient;

    public UserServiceClient(WebClient.Builder builder) {
        this.webClient = builder
            .baseUrl("http://user-service")
            .build();
    }

    @CircuitBreaker(name = "userService", fallbackMethod = "getUserFallback")
    @Retry(name = "userService")
    @Bulkhead(name = "userService")
    @TimeLimiter(name = "userService")
    public CompletableFuture<User> getUser(String userId) {
        return webClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .bodyToMono(User.class)
            .toFuture();
    }

    // Fallback when circuit is open or all retries exhausted
    private CompletableFuture<User> getUserFallback(String userId, Exception ex) {
        log.warn("Fallback for user {}: {}", userId, ex.getMessage());
        return CompletableFuture.completedFuture(
            User.builder()
                .id(userId)
                .name("Service Unavailable")
                .cached(true)
                .build()
        );
    }
}

Node.js Implementation with Opossum

// circuitBreaker.ts
import CircuitBreaker from 'opossum';

interface CircuitBreakerOptions {
  timeout: number;
  errorThresholdPercentage: number;
  resetTimeout: number;
}

function createServiceClient<T>(
  name: string,
  fn: (...args: any[]) => Promise<T>,
  fallback: (...args: any[]) => T,
  options: Partial<CircuitBreakerOptions> = {}
) {
  const breaker = new CircuitBreaker(fn, {
    timeout: options.timeout ?? 3000,
    errorThresholdPercentage: options.errorThresholdPercentage ?? 50,
    resetTimeout: options.resetTimeout ?? 30000,
    volumeThreshold: 5,
  });

  // Fallback when circuit opens
  breaker.fallback(fallback);

  // Monitoring events
  breaker.on('success', (result) => {
    metrics.increment(`${name}.success`);
  });

  breaker.on('failure', (error) => {
    metrics.increment(`${name}.failure`);
    logger.error(`Circuit ${name} failure:`, error);
  });

  breaker.on('open', () => {
    metrics.increment(`${name}.circuit_open`);
    logger.warn(`Circuit ${name} opened`);
  });

  return breaker;
}

// Usage
const userServiceBreaker = createServiceClient(
  'user-service',
  async (userId: string) => {
    const response = await fetch(`http://user-service/users/${userId}`);
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  },
  (userId: string) => ({ id: userId, name: 'Unavailable', cached: true })
);

// Call with circuit breaker protection
const user = await userServiceBreaker.fire(userId);

Saga Pattern for Distributed Transactions

Traditional ACID transactions don't work across microservices because you can't hold locks across network boundaries. The Saga pattern manages distributed transactions through a sequence of local transactions with compensating actions for rollback.

Choreography vs Orchestration

  • Choreography: Services react to events independently; loose coupling but hard to trace flow
  • Orchestration: Central orchestrator controls flow; better visibility but single point of failure
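To make the contrast concrete, here is a minimal in-process sketch of the choreography style, with a toy event bus standing in for a real broker such as Kafka; the service reactions are illustrative:

```typescript
type EventPayload = Record<string, unknown>;
type Handler = (payload: EventPayload) => void;

// Toy bus: in production this role is played by a message broker.
class EventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  publish(event: string, payload: EventPayload): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

const bus = new EventBus();
const audit: string[] = [];

// Inventory reacts to order creation - the order service never calls it directly.
bus.subscribe('OrderCreated', ({ orderId }) => {
  audit.push(`inventory reserved for ${orderId}`);
  bus.publish('InventoryReserved', { orderId });
});

// Payment reacts to the reservation, continuing the flow.
bus.subscribe('InventoryReserved', ({ orderId }) => {
  audit.push(`payment charged for ${orderId}`);
});

bus.publish('OrderCreated', { orderId: 'o-1' });
```

Note that no service knows about the others; the trade-off is that the overall flow exists only implicitly in the subscriptions, which is why distributed tracing matters so much in choreographed systems.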

Orchestration-Based Saga Example

// orderSaga.ts - Orchestration pattern
interface SagaStep<T> {
  name: string;
  execute: (context: T) => Promise<T>;
  compensate: (context: T) => Promise<void>;
}

class SagaOrchestrator<T> {
  private steps: SagaStep<T>[] = [];

  addStep(step: SagaStep<T>): this {
    this.steps.push(step);
    return this;
  }

  async execute(initialContext: T): Promise<T> {
    // Track completed steps per execution so a shared orchestrator
    // instance is safe under concurrent sagas.
    const completedSteps: SagaStep<T>[] = [];
    let context = initialContext;

    for (const step of this.steps) {
      try {
        logger.info(`Executing saga step: ${step.name}`);
        context = await step.execute(context);
        completedSteps.push(step);
      } catch (error) {
        logger.error(`Saga step ${step.name} failed:`, error);
        await this.rollback(completedSteps, context);
        throw new SagaFailedError(step.name, error); // app-defined error type
      }
    }

    return context;
  }

  private async rollback(completedSteps: SagaStep<T>[], context: T): Promise<void> {
    logger.warn('Starting saga compensation...');

    // Compensate in reverse order
    for (const step of [...completedSteps].reverse()) {
      try {
        logger.info(`Compensating: ${step.name}`);
        await step.compensate(context);
      } catch (error) {
        // Log but continue - compensation must be best-effort
        logger.error(`Compensation failed for ${step.name}:`, error);
      }
    }
  }
}

// Order creation saga
const createOrderSaga = new SagaOrchestrator<OrderContext>()
  .addStep({
    name: 'createOrder',
    execute: async (ctx) => {
      const order = await orderService.create({
        userId: ctx.userId,
        items: ctx.items,
        status: 'PENDING'
      });
      return { ...ctx, orderId: order.id };
    },
    compensate: async (ctx) => {
      await orderService.cancel(ctx.orderId);
    }
  })
  .addStep({
    name: 'reserveInventory',
    execute: async (ctx) => {
      await inventoryService.reserve(ctx.orderId, ctx.items);
      return { ...ctx, inventoryReserved: true };
    },
    compensate: async (ctx) => {
      if (ctx.inventoryReserved) {
        await inventoryService.release(ctx.orderId);
      }
    }
  })
  .addStep({
    name: 'processPayment',
    execute: async (ctx) => {
      const payment = await paymentService.charge({
        orderId: ctx.orderId,
        userId: ctx.userId,
        amount: ctx.totalAmount
      });
      return { ...ctx, paymentId: payment.id };
    },
    compensate: async (ctx) => {
      if (ctx.paymentId) {
        await paymentService.refund(ctx.paymentId);
      }
    }
  });
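A condensed, self-contained run shows the compensation path end-to-end; the step bodies are stubs, and the failing payment simulates a declined card:

```typescript
interface Step {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Minimal version of the orchestrator's core loop: run steps in order,
// and on failure compensate the completed ones in reverse.
async function runSaga(steps: Step[], trace: string[]): Promise<boolean> {
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      trace.push(`executed:${step.name}`);
      completed.push(step);
    } catch {
      for (const done of [...completed].reverse()) {
        await done.compensate();
        trace.push(`compensated:${done.name}`);
      }
      return false; // saga failed, but the system is back in a consistent state
    }
  }
  return true;
}

const trace: string[] = [];
const steps: Step[] = [
  { name: 'createOrder', execute: async () => {}, compensate: async () => {} },
  { name: 'reserveInventory', execute: async () => {}, compensate: async () => {} },
  {
    name: 'processPayment',
    execute: async () => { throw new Error('card declined'); },
    compensate: async () => {},
  },
];
```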

Service Mesh: Istio vs Linkerd

A service mesh handles microservices communication at the infrastructure level, implementing patterns like circuit breaking, mutual TLS, and observability without changing application code. In 2025, 70% of companies run a service mesh.

Istio vs Linkerd Comparison

  • Performance: In benchmarks, Linkerd has shown up to 163ms less added latency than Istio at the 99th percentile
  • Complexity: Istio has more features but higher complexity; Linkerd is simpler to operate
  • Traffic Management: Istio offers advanced fine-grained control; Linkerd provides basic but sufficient options
  • Best For: Istio for complex multi-cluster; Linkerd for simplicity-focused teams

Istio Circuit Breaker Configuration

# destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      # Circuit breaker settings
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
    loadBalancer:
      simple: ROUND_ROBIN

Distributed Tracing with OpenTelemetry

OpenTelemetry has become the industry standard for observability, with 79% of organizations either using it or considering it. It provides vendor-neutral distributed tracing, letting you follow a request across all services.

OpenTelemetry Setup (Node.js)

// tracing.ts - Initialize before any other imports
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT
} from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'order-service',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/metrics',
    }),
    exportIntervalMillis: 60000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

Custom Spans and Context Propagation

import { trace, context, SpanStatusCode, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function createOrder(orderData: OrderInput): Promise<Order> {
  // Create a span for this operation
  return tracer.startActiveSpan('createOrder', {
    kind: SpanKind.INTERNAL,
    attributes: {
      'order.user_id': orderData.userId,
      'order.items_count': orderData.items.length,
    }
  }, async (span) => {
    try {
      span.addEvent('Validating order');
      await validateOrder(orderData);

      span.addEvent('Creating order in database');
      const order = await orderRepository.create(orderData);
      span.setAttribute('order.id', order.id);

      // Call inventory service (trace propagates automatically)
      span.addEvent('Reserving inventory');
      await inventoryClient.reserve(order.id, orderData.items);

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
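Under the hood, the context that "propagates automatically" over HTTP travels as the W3C traceparent header: a version, a 16-byte trace id, an 8-byte parent span id, and trace flags. A sketch of building and parsing it, for illustration only since OpenTelemetry's propagators handle this for you:

```typescript
interface TraceContext {
  traceId: string; // 32 hex chars (16 bytes)
  spanId: string;  // 16 hex chars (8 bytes)
  sampled: boolean;
}

// Format: 00-<trace-id>-<parent-span-id>-<trace-flags>
function buildTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  return {
    traceId: match[1],
    spanId: match[2],
    sampled: (parseInt(match[3], 16) & 0x01) === 1, // bit 0 = sampled flag
  };
}
```

When a request crosses a boundary the auto-instrumentation does not cover (a custom queue, a batch job), you inject and extract this context yourself via `propagation.inject` and `propagation.extract` from `@opentelemetry/api`.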

Event-Driven Architecture

Event-driven architecture decouples services through asynchronous messaging, allowing them to communicate without direct dependencies. In 2025, the choice between message brokers often comes down to Kafka for high-throughput streaming and RabbitMQ for reliable message queuing.

Event-Driven Microservices with Kafka

// Producer - Order Service
import { Kafka, Partitioners } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: process.env.KAFKA_BROKERS?.split(',') || ['localhost:9092'],
});

const producer = kafka.producer({
  createPartitioner: Partitioners.DefaultPartitioner,
  idempotent: true, // Idempotent producer: prevents duplicate writes on retry (full exactly-once also requires transactions)
});

interface OrderEvent {
  eventType: 'ORDER_CREATED' | 'ORDER_UPDATED' | 'ORDER_CANCELLED';
  orderId: string;
  timestamp: string;
  payload: Record<string, any>;
}

async function publishOrderEvent(event: OrderEvent): Promise<void> {
  await producer.send({
    topic: 'orders',
    messages: [{
      key: event.orderId, // Ensures ordering per order
      value: JSON.stringify(event),
      headers: {
        'event-type': event.eventType,
        // CORRELATION_ID_KEY is an app-defined context key; Kafka header values must be strings
        'correlation-id': String(context.active().getValue(CORRELATION_ID_KEY) ?? ''),
      }
    }]
  });
}
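On the consuming side, Kafka delivers at-least-once by default, so the same event can be redelivered after a crash or rebalance and handlers must be idempotent. A broker-agnostic sketch of deduplication follows: a kafkajs `eachMessage` callback would delegate to something like this, and in production the seen-set would live in a durable store rather than memory. Deriving the id from event fields is an assumption; a dedicated producer-set `eventId` (e.g. a UUID) is more robust.

```typescript
interface OrderEvent {
  eventType: 'ORDER_CREATED' | 'ORDER_UPDATED' | 'ORDER_CANCELLED';
  orderId: string;
  timestamp: string;
  payload: Record<string, unknown>;
}

class IdempotentOrderHandler {
  // In production: Redis or a DB table keyed by event id, with a TTL.
  private readonly seen = new Set<string>();

  // Returns true if the event was applied, false if it was a duplicate.
  handle(event: OrderEvent, apply: (e: OrderEvent) => void): boolean {
    const eventId = `${event.eventType}:${event.orderId}:${event.timestamp}`;
    if (this.seen.has(eventId)) return false; // redelivery: skip side effects
    apply(event);
    this.seen.add(eventId); // mark only after a successful apply
    return true;
  }
}
```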

Handling Network Partitions

The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance. Since network partitions are inevitable, partition tolerance is not optional; the real choice is how the system behaves during a partition: favor consistency (CP) or availability (AP).

Graceful Degradation Pattern

// Graceful degradation during network issues
class ResilientProductService {
  constructor(
    private readonly productApi: ProductApi,
    private readonly cache: Cache,
    private readonly circuitBreaker: CircuitBreaker
  ) {}

  async getProduct(productId: string): Promise<Product> {
    // Try primary source with circuit breaker
    try {
      const product = await this.circuitBreaker.execute(async () => {
        return this.productApi.getProduct(productId);
      });

      // Update cache on success
      await this.cache.set(`product:${productId}`, product, { ttl: 3600 });
      return product;
    } catch (error) {
      // Fallback to cache (potentially stale)
      const cached = await this.cache.get(`product:${productId}`);
      if (cached) {
        logger.warn(`Returning cached product ${productId} due to service error`);
        return { ...cached, _stale: true };
      }

      // Fallback to default product info
      logger.error(`No cached data for product ${productId}, returning minimal info`);
      return {
        id: productId,
        name: 'Product information temporarily unavailable',
        _unavailable: true
      };
    }
  }
}

Key Takeaways

Remember These Points

  • Never trust AI microservices code: It typically lacks circuit breakers, timeouts, and proper failure handling
  • Always use circuit breakers: Combine with retry and bulkhead patterns using Resilience4j or Opossum
  • Use Saga pattern for transactions: Choose orchestration for visibility, choreography for loose coupling
  • Implement distributed tracing: OpenTelemetry is the vendor-neutral standard (79% adoption)
  • Consider service mesh for cross-cutting concerns: Istio for advanced features, Linkerd for simplicity
  • Choose the right messaging: Kafka for streaming, RabbitMQ for reliable queuing
  • Design for failure: Network partitions are inevitable; embrace eventual consistency where appropriate
  • Implement graceful degradation: Return cached/default data rather than failing completely

Conclusion

Microservices architecture promises scalability and team autonomy, but AI-generated code often delivers the opposite—tightly coupled services that fail catastrophically when the network misbehaves. The patterns we've explored in this guide are not optional extras; they're fundamental requirements for any distributed system that needs to run in production.

The circuit breaker pattern prevents cascading failures. The Saga pattern maintains data consistency across services. Distributed tracing with OpenTelemetry gives you visibility into requests that span multiple services. Event-driven architecture decouples services properly. And service meshes handle cross-cutting concerns at the infrastructure level.

When using AI to generate microservices code, always verify that it includes proper timeout handling, circuit breakers, and failure recovery mechanisms. Treat AI-generated microservices code as a starting point that needs significant hardening before it's production-ready.

In our next article, we'll explore Localization and Internationalization Mistakes: AI's Cultural Blind Spots, examining why AI tools struggle with multi-language support, RTL layouts, and cultural formatting differences.