$ ls /presentations/

Modern C++

// Performance without compromise, Safety without overhead

▸ Developer-friendly. Safe by design.
▸ Zero-cost abstractions. Full control.
▸ The language that evolved while keeping its promise:
"Don't pay for what you don't use"
$ whoami
Dzmitry Martavoi - Co-CTO, IDT

Today's Journey

Part 0: Performance at Scale

Why 1% matters: Real server counts, language performance comparison

Part I: The Hidden Costs

GC overhead, compaction, metadata, write barriers, cache misses

Part II: Modern C++ Advantage

Smart pointers, coroutines, ranges, zero-cost abstractions

Performance at Scale: Why 1% Matters

Server Infrastructure at Scale

Major Tech Companies (2024):
  • Microsoft: 4+ million servers
  • Google: ~2.5 million servers
  • Meta: 1-2 million servers
  • AWS: Millions (proprietary)

Impact of 1% Performance Improvement

Example: 1 million server infrastructure
1% improvement = 10,000 servers eliminated
Impact: Reduced hardware, power, cooling, and data center footprint

Language Performance Comparison

Computer Language Benchmarks Game (relative runtime, lower is better)
CPU-intensive workloads (I/O amortizes differences)

C         1.0x
C++       1.0x
Rust      1.1x
C#        1.6x
Java      2.0x
Go        3.3x
Node.js   38x
Python    100x

Part I: Understanding Cost

The hidden price of automatic memory management:
What developers pay for safety and convenience in C#, Go, and Java

Performance Costs Breakdown

  • 1. Compaction: C# moves objects → must update all references (10-100ms+)
  • 2. Span Allocation: Go's TCMalloc approach → internal fragmentation tradeoff
  • 3. Type Metadata: Memory/CPU overhead for tracking object types
  • 4. Synchronization: Write barriers and memory fences in concurrent GC
  • 5. Tradeoffs: Latency vs throughput - can't optimize both
  • 6. Cache Locality: C# arrays of references → scattered memory, cache misses
  • 7. Virtual Dispatch: Go/C# interfaces → runtime vtable lookups, can't inline

GC Prerequisite: Type Metadata

The Problem

  • GC must know which fields are pointers to traverse object graph
  • Requires metadata storage (memory cost) or runtime lookups (CPU cost)
  • C++: No runtime cost - compiler knows types at compile time

Memory Layout Comparison

C# (Per-Object Headers)
ListNode (32 bytes)
MethodTable* (8 bytes) ← GC
SyncBlock Index (4 bytes) ← lock/hash
int data (4 bytes)
next* (8 bytes)
prev* (8 bytes)
GC uses MethodTable* to find pointer fields
SyncBlock for lock/GetHashCode (not GC)

✓ Fast GC scan
✗ 12 bytes overhead

Go (Span Metadata)
Span header (~48 bytes):
type*, size, count, bitmap...
Shared by all objects in span
ListNode (20 bytes)
int data (4 bytes)
next* (8 bytes)
prev* (8 bytes)
GC lookup:
Addr 0x1234 → span → type info
CPU cost per object scan

✓ Lower per-object overhead
✗ Span metadata + CPU lookup
✗ Still overhead vs C++

C++ (No Metadata)


// C++: No runtime type info
class ListNode {
    int data;
    ListNode* next;
    ListNode* prev;
};  // 20 bytes of fields (24 with padding) - no metadata

// C#: Must track types at runtime
class ListNode {
    public int data;
    public ListNode next;
    public ListNode prev;
}  // 32 bytes - 12 bytes overhead!
							
C++ ListNode (20 bytes)
int data (4 bytes)
next* (8 bytes)
prev* (8 bytes)
Comparison:
C#: +12 bytes/object (60% overhead for small objects)
Go: +~48 bytes/span + CPU lookup (amortized across span)
C++: Zero runtime overhead (compiler knows all types)
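
Because the compiler knows every type's layout, you can even verify it at compile time. A minimal sketch (sizes assume a typical 64-bit target, where alignment pads the 4-byte int to 8):

#include <cstddef>

struct ListNode {
    int data;        // 4 bytes
    ListNode* next;  // 8 bytes
    ListNode* prev;  // 8 bytes
};

// Layout is fully known at compile time - no runtime metadata needed.
// Alignment pads the int, so sizeof is 24 rather than 20;
// there is still zero *runtime* overhead.
static_assert(sizeof(ListNode) == 24);
static_assert(offsetof(ListNode, next) == 8);  // field offsets are constants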

Tracing GC: Mark & Sweep

Algorithm

  • C#, Go, Java all use Tracing GC with Mark & Sweep
  • 1. Mark: Start from roots → 2. Trace: Follow pointers, mark reachable → 3. Sweep: Free unmarked

Implementation Variations

C# GC: Generational Mark-Sweep-Compact
  • Generational: Gen 0 (new) / Gen 1 / Gen 2 (old)
  • Most GC cycles scan only Gen 0 (~1ms) - young objects die quickly
  • Full Gen 2 scan (rare): ~10-100ms+
  • Compact: Moves objects to eliminate fragmentation
Go GC: Non-Generational Concurrent
  • NOT generational: Scans entire heap every cycle (~1-10ms)
  • Simpler, more predictable latency
  • Concurrent: runs alongside app threads

The Cost

  • Cache-unfriendly pointer chasing + metadata reads per object
  • Even optimized, must traverse object graph (C++ avoids entirely)
Object graph (diagram):
  Roots (green):     Stack Var, Global Var
  Reachable (blue):  Stack Var → A, B;  Global Var → C;  A → D;  B → D
  Garbage (red):     E → F → G (unreachable from any root)
Traversal order: 1→2→3→4→5

Key Insight: GC's generational optimization helps, but fundamental cost remains: must traverse object graph to find garbage.
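
For intuition, here is what mark & sweep looks like in code — a deliberately minimal C++ sketch, not any production collector. Note that mark() is exactly the cache-unfriendly pointer chase described above:

#include <vector>

// Deliberately minimal mark & sweep (illustrative only - not a real collector).
struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;  // outgoing pointers (what type metadata exists to describe)
};

void mark(Obj* o) {
    if (!o || o->marked) return;
    o->marked = true;                // Mark: flag reachable object
    for (Obj* r : o->refs) mark(r);  // Trace: the cache-unfriendly pointer chase
}

void collect(std::vector<Obj*>& heap, const std::vector<Obj*>& roots) {
    for (Obj* r : roots) mark(r);    // 1-2. Mark from roots
    for (Obj*& o : heap) {           // 3. Sweep: free everything unmarked
        if (!o->marked) { delete o; o = nullptr; }
        else o->marked = false;      // reset flag for the next cycle
    }
    std::erase(heap, nullptr);       // drop freed slots (C++20)
}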

C# GC: Compaction is Expensive

The Fragmentation Problem

  • C# allocates sequentially → holes appear
  • Solution: Compact = move objects
Before:
[A][B][░░][C][░░░][D]
Can't fit!
After:
[A][B][C][D][░░░░░]

C#'s Design Choice

Language spec: references are opaque GC handles
Implementation: they behave like raw addresses (no pointer math allowed!)
The alternative design:
• Indirection table (old Java GCs)
• Cheap compaction, but slower access
C# chose: fast access over cheap compaction

Compaction Cost

When the GC moves an object from 0x1000 → 0x2000:
  • Scan: Find all refs
  • Update: Change addresses
  • Repeat for all moved
Gen 2: 10-100ms+

Compaction Process Visualization

Step 1: Mark Phase (traverse from roots)
[A✓][B✓][Dead][C✓][Dead][D✓]
Step 2: Move Objects
[A][B][C→][Dead][Dead][D→]
[A][B][C][D][Free...]
Step 3: Update Refs (RE-TRAVERSE!)
Re-traverse from roots again: Find every pointer, update addresses
if (ptr == old_addr) ptr = new_addr;
Why Gen 2 is SO expensive:
Must re-traverse & update ALL references:
• Gen 0: Few objects → fast (~1ms)
• Gen 2: Millions → slow! (10-100ms+)
• LOH (>85KB): Not compacted by default (too costly)
Why C# Accepts This Cost:
  • ✓ Fast dereference (direct lookup)
  • ✓ Fast allocation (bump pointer)
  • ✓ Cache-friendly, no waste
  • ✗ Expensive compaction (traverse all)
  • ✗ LOH not compacted by default (too costly)

Go GC: Size Classes Avoid Compaction

The Design: TCMalloc-Inspired

TCMalloc = Thread-Caching Malloc
• Created by Google (2005) for high-performance servers
• Used in: Chrome, Google Search, YouTube, many C++ servers
• Go's allocator inspired by TCMalloc design
github.com/google/tcmalloc
  • Key insight: Don't allocate sequentially - use size classes
  • Memory divided into spans (e.g., 8KB pages)
  • Each span dedicated to ONE size class (8B, 16B, 32B, etc.)
  • Objects allocated into appropriately-sized slots

What's a Span?

Span = contiguous memory pages dedicated to one size class
  • Example: 8KB span for 16-byte objects → 512 slots
  • Each slot holds exactly one object
  • Freed slot? Reuse for next object of same size
  • All slots freed? Return entire span to OS

Why No Fragmentation?

External fragmentation = eliminated!
  • Freed 16-byte slot → perfectly fits next 16-byte object
  • No "holes too small" problem
  • Each size class manages its own free list
  • No need to move objects or update pointers!
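
A toy sketch of the free-list idea (illustrative only; real TCMalloc/Go spans add per-thread caches, bitmaps, and much more). Freed slots are threaded into a singly-linked list and handed straight back out for the next same-size allocation:

#include <cstddef>
#include <cstdlib>

// Toy size-class allocator. Assumes slot_size >= sizeof(void*) so a freed
// slot can hold the free-list link.
struct SizeClass {
    std::size_t slot_size;
    void* free_list = nullptr;  // singly-linked list threaded through freed slots

    void* allocate() {
        if (free_list) {                        // reuse a freed slot: perfect fit, no holes
            void* slot = free_list;
            free_list = *static_cast<void**>(slot);
            return slot;
        }
        return std::malloc(slot_size);          // stand-in for carving a new slot from a span
    }
    void deallocate(void* slot) {
        *static_cast<void**>(slot) = free_list; // push slot onto the free list
        free_list = slot;
    }
};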

Size Classes Visualization

Span 1: 8-byte size class
[8B][8B][8B][8B][8B][8B]
Span 2: 16-byte size class
[16B][16B][16B]
Span 3: 32-byte size class
[32B][32B][32B]
In use   Allocated   Freed

The Tradeoff

Internal Fragmentation:
  • 12B object → 16B slot (4B wasted)
  • 20B object → 32B slot (12B wasted)
  • Cost: ~10-20% memory waste
C# Compaction

✓ No wasted memory
✓ Cache-friendly
✗ Expensive compact
✗ Must pause/barrier

Go Size Classes

✓ No compaction
✓ Fast allocation
✗ ~10-20% waste
✗ Less cache-friendly

Concurrent GC: Write Barriers

Problem: Lost Object

State: A → B → D,  C (separate, unmarked)

1. GC marks A ✓, B ✓
2. GC pauses to scan elsewhere...
3. App writes: B.ref = C  (new reference!)
4. GC resumes, finishes B (doesn't see C!)

Result: C never marked → COLLECTED! 💥
						

Solution: Write Barrier


// Your code:
B.ref = C;

// Runtime automatically inserts barrier:
B.ref = C;                           // 1. Do write
if (atomic_load(gc_state) == MARKING) {  // 2. Check GC
    atomic_thread_fence(release);    // 3. Memory fence
    mark_object(C);                  // 4. Tell GC about C
}
					
✓ Now C gets marked → NOT collected
Barrier notifies GC about new references during concurrent scan

Performance Cost

Barrier breakdown (when GC active):
1. Check gc_state         ~2-5 cycles
   (atomic read from shared variable)

2. Memory fence           ~20-200 cycles ⚠️
   (flush CPU caches, ensure visibility)

3. Mark object            ~10-50 cycles
   (add to GC worklist, update metadata)

Total: 30-250 cycles per write
vs C++ without GC: ~1 cycle
						

When Barriers Run

C#:
Always on (for old→young writes)
~5-10 cycles overhead
Go:
Only during mark (~1-2% time)
~10-50 cycles when active
Key Insight:
Write barriers = correctness cost of concurrent GC
Every pointer write pays synchronization overhead
C++: No GC = no overhead (1 cycle)

Array of References: The Cache Killer

Why Cache Matters

CPU Cache Hierarchy:
  • L1: 32KB, ~4 cycles (fastest)
  • L2: 256KB, ~12 cycles
  • L3: 8MB, ~40 cycles (shared)
  • RAM: GBs, ~200 cycles (slow!)
Cache Lines (64 bytes):
CPU loads 64 bytes at once. Sequential data = next item already cached!
Prefetcher: Detects patterns, loads ahead

C# Example (Slow)


// Point is a class → array holds references
Point[] points = new Point[1000];
for (int i = 0; i < points.Length; i++)
    points[i] = new Point();  // objects scattered across the heap

for (int i = 0; i < points.Length; i++) {
    points[i].X += 1; // Cache miss!
}
							

C++ Example (Fast)


// Contiguous values
std::vector<Point> points(1000);

for (auto& p : points) {
    p.x += 1; // Cache hit!
}
							

Memory Layout

C#: Scattered (Cache Miss Hell)
Array: [ref][ref][ref][ref]
        ↓     ↓     ↓     ↓
Heap: [O]....[O]...[O].[O]
❌ Each access = RAM lookup (~200 cycles)
C++/Go: Contiguous (Cache Win)
Array: [Obj][Obj][Obj][Obj]
✓ All in cache, ~4 cycles per access
Real Numbers:
  • C# scattered: 1000 items × 200 cycles = 200K cycles
  • C++ contiguous: 1000 items × 4 cycles = 4K cycles
  • 50x faster! Just from memory layout
Go advantage: Slices store values (like C++), not references!
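
You can reproduce the layout effect in C++ itself by comparing a contiguous vector<Point> with a vector of individually heap-allocated Points (a rough stand-in for a C# reference array). A minimal benchmark sketch — absolute numbers vary by machine, and the gap widens once the heap fragments:

#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

struct Point { int x, y; };

int main() {
    constexpr int N = 1'000'000;
    std::vector<Point> values(N);              // contiguous: one cache line holds 8 Points
    std::vector<std::unique_ptr<Point>> refs;  // scattered: mimics an array of references
    refs.reserve(N);
    for (int i = 0; i < N; ++i) refs.push_back(std::make_unique<Point>());

    auto time_ms = [](auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        body();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    };

    double contiguous = time_ms([&] { for (auto& p : values) p.x += 1; });
    double scattered  = time_ms([&] { for (auto& p : refs)   p->x += 1; });
    std::printf("contiguous: %.2f ms, scattered: %.2f ms\n", contiguous, scattered);
}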

Virtual Dispatch: Runtime vs Compile-Time

Go/C#: Always Runtime Dispatch


// Go interface = vtable lookup
type Serializer interface {
    Serialize(data []byte) string
}

func processMany(items []Serializer, data []byte) {
    for _, item := range items {
        item.Serialize(data)  // Vtable lookup each time
    }
}
// Cost: ~5-10 cycles per call (vtable + indirect jump)
// ✗ Can't inline, CPU can't predict virtual calls
						
Runtime dispatch cost:
• Vtable pointer dereference (~4 cycles)
• Indirect jump (~5-10 cycles, branch misprediction)
• Can't inline (loses 10-100x optimization)
• C# same cost for interface calls

C++: Compile-Time Dispatch (Templates)


// Template: Type known at compile time
template<typename T>
void processMany(std::vector<T>& items, std::string_view data) {
    for (auto& item : items) {
        item.serialize(data);  // No vtable lookup!
    }
}

// Compiler generates specialized code for each type:
std::vector<JsonSerializer> items;
processMany(items, data);
					

What the Compiler Does

1. Inlining: Copies function body into call site
item.serialize() → { /* actual serialize code here */ }
No function call overhead (~5-10 cycles saved)
2. Loop Unrolling: Processes multiple items per iteration
for(4x) → process item[0], item[1], item[2], item[3]
Reduces loop overhead, better CPU pipelining
3. Vectorization (SIMD): Process 4-16 items at once
One CPU instruction → operates on 4+ items in parallel
2-8x speedup on data-parallel operations
4. Direct Calls: No vtable indirection
call 0x12345 vs load vtable → load function ptr → call
Predictable for CPU branch predictor
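
To be fair, C++ has virtual dispatch too — the difference is that templates let you opt out. A minimal sketch showing both forms side by side (ISerializer/JsonSerializer are placeholder names):

#include <string>
#include <vector>

// Runtime dispatch, when you ask for it:
struct ISerializer {
    virtual std::string serialize() const = 0;  // vtable lookup, rarely inlined
    virtual ~ISerializer() = default;
};

void process_virtual(const std::vector<ISerializer*>& items) {
    for (auto* item : items) item->serialize();  // indirect call through vtable
}

// Compile-time dispatch, the default with templates:
template <typename T>
void process_template(std::vector<T>& items) {
    for (auto& item : items) item.serialize();   // direct call: inlinable, vectorizable
}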
$ cd /modern_cpp/features/

Part II: Modern C++ Features

Smart Pointers: Memory Safety Without GC

Old C++ (Pre-C++11)


// Manual memory management
void process_request(const string& json_data) {
    // Allocate large buffer
    char* buffer = new char[1024 * 1024];
    
    // Parse and process
    if (!parse_json(json_data, buffer)) {
        delete[] buffer;  // Don't forget!
        return;
    }
    
    // Validate
    if (!validate(buffer)) {
        delete[] buffer;  // Don't forget again!
        return;
    }
    
    // Send response
    send_response(buffer);
    
    delete[] buffer;  // Easy to forget!
    // What if exception thrown? → LEAK!
}
							

Modern C++ (C++11+)


// Automatic memory management
void process_request(string_view json_data) {
    // Allocate large buffer (unique ownership)
    auto buffer = make_unique<char[]>(1024 * 1024);
    // Alt: unique_ptr<char[]> buf(new char[1024*1024]);
    // For shared: make_shared<char[]>(1024 * 1024);
    
    // Parse and process
    if (!parse_json(json_data, buffer.get())) {
        return;  // Auto-deleted!
    }
    
    // Validate
    if (!validate(buffer.get())) {
        return;  // Auto-deleted!
    }
    
    // Send response
    send_response(buffer.get());
    
    // Auto-deleted! Even with exceptions!
}
							
Modern C++: No manual delete needed, exception-safe, zero runtime overhead – without GC!
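
The same RAII pattern extends to C resources via a custom deleter — a minimal sketch wrapping FILE* (error handling elided):

#include <cstdio>
#include <memory>

void log_line(const char* path, const char* msg) {
    auto closer = [](FILE* f) { std::fclose(f); };  // deleter runs only on non-null handles
    std::unique_ptr<FILE, decltype(closer)> file(std::fopen(path, "a"), closer);
    if (!file) return;                        // open failed - nothing to clean up
    std::fprintf(file.get(), "%s\n", msg);    // use like a raw FILE*
}                                             // fclose runs automatically, even on exceptions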

std::shared_ptr: When Multiple Owners Need Same Object

Problem: Who Closes the Logger?


// Old C++: Manual tracking is error-prone
class Worker {
    Logger* logger;  // Shared by all workers
public:
    Worker(Logger* log) : logger(log) {}
    
    void do_work() {
        logger->write("Working...");
    }
};

void run_tasks() {
    auto log = new Logger("app.log");
    
    auto w1 = Worker(log);
    auto w2 = Worker(log);
    auto w3 = Worker(log);
    
    w1.do_work();
    w2.do_work();
    
    delete log;  // Safe? What if w3 still running?
}
							

Solution: Automatic Ref Counting


// Modern C++: Automatic shared ownership
class Worker {
    shared_ptr<Logger> logger;
public:
    Worker(auto log) : logger(std::move(log)) {}
    
    void do_work() {
        logger->write("Working...");
    }
};

void run_tasks() {
    auto log = make_shared<Logger>("app.log");
    
    auto w1 = Worker(log);
    auto w2 = Worker(log);
    auto w3 = Worker(log);
    
    w1.do_work();
    w2.do_work();
    
    // Logger closed automatically when all workers done
}
							
shared_ptr: Automatic reference counting – object deleted when last owner done. Small atomic overhead vs GC scanning all objects.
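
One caveat worth knowing: shared_ptr cycles never reach a count of zero. std::weak_ptr breaks the cycle — a minimal sketch with hypothetical Parent/Child types:

#include <memory>

struct Child;
struct Parent {
    std::shared_ptr<Child> child;   // owning edge
};
struct Child {
    std::weak_ptr<Parent> parent;   // non-owning back-edge: doesn't keep Parent alive
};

void link() {
    auto p = std::make_shared<Parent>();
    auto c = std::make_shared<Child>();
    p->child = c;
    c->parent = p;                  // no ownership cycle → both freed at scope exit
    if (auto locked = c->parent.lock()) {
        // lock() yields a shared_ptr only while Parent is still alive
    }
}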

Modern C++ Containers: Safe Alternatives to C-Style

Old C-Style (Unsafe)


// C-style array - fixed size, no bounds checking
int scores[5] = {90, 85, 92, 88, 95};
scores[10] = 100;  // Buffer overflow! UB!

// C-style string - manual memory, null terminator
char* name = new char[100];
strcpy(name, "John");  // Unsafe!
delete[] name;  // Easy to forget!

// Dynamic array - manual memory
int* data = new int[count];
// ... use it ...
delete[] data;  // Easy to forget!

// String view - pointer + length (error-prone)
void process(const char* str, size_t len) {
    // Manual bounds checking needed
}

// Array view - pointer + size (error-prone)
void process(int* arr, size_t size) {
    arr[size + 1] = 42;  // Buffer overflow! UB!
}
							

Modern C++ (Safe & Easy)


// std::array - fixed size, bounds checking
std::array<int, 5> scores = {90, 85, 92, 88, 95};
scores.at(10);  // Throws exception! Safe!

// std::string - automatic memory management
std::string name = "John";
name += " Doe";  // Safe concatenation
// Automatic cleanup, no delete needed!

// std::vector - dynamic array, automatic memory
std::vector<int> data(count);
data.push_back(42);  // Grows automatically
// Automatic cleanup, no delete needed!

// std::string_view - efficient, safe view
void process(std::string_view str) {
    // Bounds-checked, no copy, safe
}

// std::span - safe array view (C++20)
void process(std::span<int> arr) {
    if (!arr.empty()) arr[0] = 42;  // Size travels with the view!
    // Works with array, vector, C-array!
}
							
Modern C++ Types: Memory-safe, automatic cleanup, bounds-checked, STL-compatible – without sacrificing performance!

Modern C++ Containers: Features & Benefits

Type              Use Case              Key Benefits                                     vs C-Style
std::array<T,N>   Fixed-size arrays     Bounds checking, size tracking, zero overhead    vs T arr[N]       ✓ Safe
std::vector<T>    Dynamic arrays        Auto memory, grows, cache-friendly, fast         vs T* + new[]     ✓ No leaks
std::string       Text data             Auto memory, SSO optimization, safe ops          vs char*          ✓ No overflow
std::string_view  Read-only text        Zero-copy view, no allocation, efficient         vs const char*    ✓ Knows size
std::span<T>      Array view (C++20)    View of any contiguous array, size included      vs T* + size      ✓ Unified API
Why Use Modern Containers?
  • ✓ Memory safe - automatic cleanup
  • ✓ Bounds checking - catches errors
  • ✓ STL compatible - works with algorithms
  • ✓ Zero-cost abstractions - same performance
Compatible & Modern:
  • ✓ Works with C APIs (.data())
  • ✓ Range-based for loops
  • ✓ Standard algorithms (sort, find, etc.)
  • ✓ Move semantics - efficient transfers
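
A small illustration of the "unified API" point: one std::span-taking function accepts a C array, std::array, and std::vector without copies (assumes a C++20 compiler):

#include <array>
#include <span>
#include <vector>

int sum(std::span<const int> values) {
    int total = 0;
    for (int v : values) total += v;   // size travels with the view
    return total;
}

int main() {
    int c_array[] = {1, 2, 3};
    std::array<int, 3> std_array = {4, 5, 6};
    std::vector<int> vec = {7, 8, 9};
    return sum(c_array) + sum(std_array) + sum(vec);  // all three convert implicitly
}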

Lambdas: From Verbose to Elegant

Old C++ (Pre-C++11)


// Option 1: Function pointer (limited)
bool compare(int a, int b) {
    return a < b;
}

std::sort(data.begin(), data.end(), compare);

// Option 2: Functor (verbose!)
class Comparator {
    bool reverse;
public:
    Comparator(bool rev) : reverse(rev) {}
    
    bool operator()(int a, int b) const {
        return reverse ? a > b : a < b;
    }
};

Comparator comp(false);  // Normal sort
std::sort(data.begin(), data.end(), comp);

// Can't capture local variables easily!
// Verbose, requires separate class definition
								

Modern C++ (C++11+)


// Simple lambda - inline!
std::sort(data.begin(), data.end(), 
    [](int a, int b) { return a < b; });

// Lambda with capture - easy!
bool reverse = false;
std::sort(data.begin(), data.end(),
    [reverse](int a, int b) {
        return reverse ? a > b : a < b;
    });

// Real-world: Filter and transform
auto filtered = data 
    | std::views::filter([](int x) { return x > 0; })
    | std::views::transform([](int x) { return x * 2; });

// HTTP handler with lambda
server.route("/api/users", [&db](auto req) {
    auto users = db.query("SELECT * FROM users");
    return json_response(users);
});
								
Lambdas: Inline, concise, can capture local variables – perfect for callbacks, algorithms, and functional programming!
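
One more capture trick worth knowing: C++14 init-capture lets a lambda take ownership of expensive state by move instead of copying. A sketch against a hypothetical server.on_request API — note the closure becomes move-only, so the API must accept move-only callables (e.g. std::move_only_function in C++23):

#include <memory>
#include <string>
#include <utility>

void example(auto& server) {
    auto session = std::make_unique<std::string>("large session state");
    server.on_request([state = std::move(session)](auto& req) {
        // the lambda now uniquely owns `state`: no copy, no shared_ptr needed
        req.reply(*state);
    });
}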

C++20 Ranges: From Verbose to Elegant

Old C++ (Pre-C++20)


// Filter + Transform: Multiple passes, temporaries
std::vector<int> nums = {1, -2, 3, -4, 5};

// Step 1: Filter positives (allocates temp vector)
std::vector<int> positives;
std::copy_if(nums.begin(), nums.end(),
    std::back_inserter(positives),
    [](int n) { return n > 0; });

// Step 2: Double them (allocates another vector)
std::vector<int> result;
std::transform(positives.begin(), positives.end(),
    std::back_inserter(result),
    [](int n) { return n * 2; });

// Problems:
// - Verbose (iterators everywhere)
// - Multiple allocations (2 temp vectors)
// - Eager evaluation (processes all elements)
// - Can't easily compose operations
							
✗ Old way: Verbose, multiple allocations, eager

Modern C++ (C++20 Ranges)


// Same logic: One line, zero allocations!
std::vector<int> nums = {1, -2, 3, -4, 5};

auto result = nums 
    | std::views::filter([](int n) { return n > 0; })
    | std::views::transform([](int n) { return n * 2; });
// Result: lazy view, no allocation yet!

// Only allocate when needed:
std::vector<int> vec(result.begin(), result.end());

// Benefits:
// - Composable pipelines (Unix pipe style)
// - Zero intermediate allocations
// - Lazy evaluation (process on demand)
// - Works with any range (vector, list, istream, etc.)

// More examples:
auto first3 = nums | std::views::take(3);
auto skip2 = nums | std::views::drop(2);
auto reversed = nums | std::views::reverse;
							
✓ Ranges: Composable, lazy, zero-cost abstractions
C++20 Ranges: Functional programming style with zero overhead – lazy evaluation means no work until you iterate!
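
If your toolchain has C++23, std::ranges::to materializes a pipeline in one step instead of the iterator-pair constructor above — a minimal sketch:

#include <ranges>
#include <vector>

std::vector<int> doubled_positives(const std::vector<int>& nums) {
    return nums
        | std::views::filter([](int n) { return n > 0; })
        | std::views::transform([](int n) { return n * 2; })
        | std::ranges::to<std::vector>();  // single allocation at the end
}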

Structured Bindings: Elegant Unpacking (C++17)

Old C++ (Pre-C++17)


// Unpacking pairs - verbose
std::map<string, int> users;
auto it = users.find("john");
if (it != users.end()) {
    string key = it->first;
    int value = it->second;
    // Use key and value...
}

// Unpacking tuples - ugly!
std::tuple<int, string, bool> result = query();
int status = std::get<0>(result);
string message = std::get<1>(result);
bool success = std::get<2>(result);

// Iterating map - awkward
for (auto it = users.begin(); it != users.end(); ++it) {
    cout << it->first << ": " << it->second;
}

// Struct unpacking - manual
struct Point { int x, y; };
Point p = get_point();
int x = p.x;
int y = p.y;
								

Modern C++ (C++17+)


// Unpacking pairs - clean!
std::map<string, int> users;
if (auto [it, inserted] = users.insert({"john", 42}); inserted) {
    auto [key, value] = *it;
    // Use key and value directly!
}

// Unpacking tuples - readable!
auto [status, message, success] = query();
// Use status, message, success directly!

// Iterating map - elegant
for (auto& [key, value] : users) {
    cout << key << ": " << value;
}

// Struct unpacking - automatic
struct Point { int x, y; };
auto [x, y] = get_point();
// x and y are ready to use!

// Real-world: HTTP parsing
auto [method, path, headers] = parse_request(req);
if (method == "POST" && path == "/api/users") {
    auto [auth, content_type] = extract_headers(headers);
}
								
Structured Bindings: Unpack tuples, pairs, structs, and arrays in one line – makes code cleaner and more readable!

Compile-Time Programming: Zero Runtime Cost

constexpr: Compute at Compile Time


// Factorial at compile time
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Computed at compile time!
constexpr int f10 = factorial(10);  // 3628800
// No runtime cost - value baked into binary

// String processing at compile time (C++20)
constexpr auto compile_time_parse(const char* str) {
    // Parse, validate, transform at compile time
    return /* result */;
}

// Complex example: Compile-time regex
constexpr auto pattern = ctre::match<"[0-9]+">;
// Regex compiled at compile time, not runtime!

// Benefits:
// - Zero runtime overhead
// - Errors caught at compile time
// - Impossible in Go/C# (no compile-time execution)
							
constexpr: Run code at compile time, zero runtime cost
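
A classic application: baking a lookup table into the binary. Minimal sketch — the loop runs inside the compiler, never at runtime:

#include <array>

constexpr std::array<int, 256> make_squares() {
    std::array<int, 256> table{};
    for (int i = 0; i < 256; ++i) table[i] = i * i;  // loops allowed in constexpr since C++14
    return table;
}

constexpr auto squares = make_squares();
static_assert(squares[12] == 144);  // verified by the compiler, zero runtime work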

C++20 Concepts: Type Constraints


// Old way: Template errors are cryptic
template<typename T>
void process(T value) {
    value.serialize();  // Compile error if no serialize()
}
// Error: 50 lines of template gibberish!

// C++20: Concepts make requirements explicit
template<typename T>
concept Serializable = requires(T t) {
    { t.serialize() } -> std::same_as<std::string>;
};

// Clear constraint, clear errors
std::string process(Serializable auto value) {
    return value.serialize();
}
// Error: "T does not satisfy Serializable" ✓

// Standard concepts:
void sort_items(std::ranges::random_access_range auto& r) {
    std::ranges::sort(r);
}
// Only accepts vectors, arrays, etc. - not lists!
							
Concepts: Better template errors, self-documenting code
Unique to C++: Move computation from runtime to compile time – impossible in Go/C#. Pay once at compile time, benefit forever at runtime!

Coroutines: async/await in C++20

C#: async/await


// Accept connections
while (true) {
  var socket = await listener.AcceptSocketAsync();
  
  // Handle each connection
  _ = HandleRequest(socket);
}

// Handle request
async Task HandleRequest(Socket socket) {
  using var stream = new NetworkStream(socket);
  
  while (true) {
    // Read request (async)
    var request = await ReadHttpRequest(stream);
    
    if (request == null) break;
    
    // Process request
    var response = MakeResponse(request);
    
    // Write response (async)
    await WriteHttpResponse(stream, response);
  }
}
								
C#: GC allocates state machines – runtime overhead

C++20: co_await


// Accept connections
for (;;) {
  auto [ec, socket] = co_await acceptor.async_accept(
      asio::as_tuple(asio::use_awaitable));
  
  if (!ec) {
    // Spawn handler for each connection
    asio::co_spawn(executor, 
        handle_request(std::move(socket)), 
        asio::detached);
  }
}

// Handle request
asio::awaitable<void> handle_request(tcp::socket socket) {
  beast::tcp_stream stream(std::move(socket));
  
  for (;;) {
    // Read request (async, non-blocking!)
    auto [ec, bytes] = co_await beast::http::async_read(
        stream, buffer, req, asio::as_tuple(asio::use_awaitable));
    
    if (ec) break;
    
    // Process request
    auto response = make_response(req);
    
    // Write response (async!)
    co_await beast::http::async_write(
        stream, response, asio::use_awaitable);
  }
}
							
C++: Stack-allocated, zero-cost abstractions!
Coroutines: Same ergonomics as C# async/await, but zero-cost abstractions – no GC overhead!
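
Coroutines are not just for I/O: co_yield gives lazy sequences too. A minimal sketch assuming C++23's std::generator (older toolchains need a hand-rolled generator type):

#include <generator>  // C++23
#include <print>
#include <utility>

std::generator<int> fibonacci() {
    int a = 0, b = 1;
    for (;;) {
        co_yield a;                  // suspend here, resume on the next pull
        a = std::exchange(b, a + b);
    }
}

int main() {
    for (int f : fibonacci()) {      // values produced on demand
        if (f > 100) break;
        std::println("{}", f);
    }
}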

Coroutines: State Machines + C++ Optimizations

Both compile to state machines, but C++ can optimize in specific cases

C# State Machine


// Your code:
async Task<int> Fetch(string url) {
    var resp = await client.GetAsync(url);
    var data = await resp.Content.ReadAsStringAsync();
    return data.Length;
}

// Generated state machine:
class FetchStateMachine : IAsyncStateMachine {
    int state;
    string url;
    HttpResponseMessage resp;
    string data;
    TaskAwaiter awaiter1, awaiter2;
    
    void MoveNext() {
        switch (state) {
            case 0:
                awaiter1 = client.GetAsync(url).GetAwaiter();
                state = 1;
                awaiter1.OnCompleted(MoveNext);
                break;
            case 1:
                resp = awaiter1.GetResult();
                awaiter2 = resp.Content.ReadAsStringAsync()
                    .GetAwaiter();
                state = 2;
                awaiter2.OnCompleted(MoveNext);
                break;
            case 2:
                data = awaiter2.GetResult();
                result.SetResult(data.Length);
                break;
        }
    }
}
								
Always: Heap + GC (~50-100 bytes/call)

C++ State Machine


// Your code:
task<int> fetch(string url) {
    auto resp = co_await client.get(url);
    auto data = co_await resp.read_body();
    co_return data.size();
}

// Generated coroutine frame:
struct FetchFrame {
    int state = 0;
    string url;
    Response resp;
    string data;
    awaiter_t awaiter1, awaiter2;
    
    void resume() {
        switch (state) {
            case 0:
                awaiter1 = client.get(url).operator co_await();
                state = 1;
                awaiter1.await_suspend(handle);
                break;
            case 1:
                resp = awaiter1.await_resume();
                awaiter2 = resp.read_body().operator co_await();
                state = 2;
                awaiter2.await_suspend(handle);
                break;
            case 2:
                data = awaiter2.await_resume();
                promise.return_value(data.size());
                break;
        }
    }
};
								
Optimizable: 0 bytes (elided) to ~30 bytes (no GC!)
Key: Similar state machines, but C++ optimizer can: inline away (if never suspends), stack-allocate (HALO), or use custom allocators

Goroutines vs C++ Coroutines: Scheduling Models

Understanding the fundamental difference: Preemptive vs Cooperative scheduling

Go: Preemptive Scheduling


// Go runtime can interrupt goroutines anywhere
func worker() {
    for i := 0; i < 1000000; i++ {
        // No explicit yield needed!
        // Runtime can preempt here
        doWork(i)
    }
}

// ⚠️ Anti-pattern example (but won't deadlock!)
// Even infinite loop won't hang runtime
func badWorker() {
    for {
        _ = computeSomething()  // Runtime preempts here
    }
}  // Other goroutines still run!

// GOMAXPROCS: M:N scheduling
// 1 million goroutines → 4-8 OS threads
runtime.GOMAXPROCS(8)  // Max 8 parallel threads
go worker()  // Scheduled by runtime
go worker()  // Runtime handles fairness
go worker()  // Can't monopolize CPU
								
✓ Pros:
• Can't hang runtime (auto preempts)
• Automatic fairness between goroutines
• Works with blocking code automatically
✗ Cons:
• GC overhead for goroutine stacks
• Less predictable timing
• Context switches at runtime's discretion

C++: Cooperative Scheduling


// C++ coroutines: MUST explicitly yield
asio::awaitable<void> worker() {
    for (int i = 0; i < 1000000; i++) {
        // Must co_await to yield!
        co_await asio::post(asio::use_awaitable);
        doWork(i);
    }
}

// ⚠️ DANGER: Anti-pattern example!
// Infinite loop WILL hang everything!
asio::awaitable<void> badWorker() {
    for (;;) {
        int x = computeSomething();
        // Never yields → blocks entire thread!
    }
}  // All other coroutines blocked!

// Thread control: Explicit
asio::io_context io;  // Single event loop
asio::co_spawn(io, worker(), asio::detached);
asio::co_spawn(io, worker(), asio::detached);
io.run();  // Run event loop (single thread)
								
✓ Pros:
• Zero GC overhead
• Predictable, deterministic timing
• Full control over when to yield
✗ Cons:
• Can block thread if forget co_await
• Manual yield points required
• Blocking code needs thread pool
Key Insight: Go = safer defaults (a busy loop can't starve the scheduler), C++ = more control (zero GC overhead). For Go devs moving to C++: remember to co_await in long loops or your coroutine will block the thread — see the sketch below.
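
A defensive idiom for CPU-heavy coroutines: yield back to the event loop every N iterations. Minimal Asio sketch:

#include <boost/asio.hpp>
namespace asio = boost::asio;

asio::awaitable<long> sum_range(long n) {
    long total = 0;
    for (long i = 0; i < n; ++i) {
        total += i;
        if (i % 10'000 == 0) {
            // Hand the thread back to the io_context, then resume.
            co_await asio::post(co_await asio::this_coro::executor,
                                asio::use_awaitable);
        }
    }
    co_return total;
}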

Goroutines in C++? Boost.Fiber Has You Covered

"But Go makes concurrency so easy with goroutines and channels!" — C++ can do that too.

Go: Goroutines & Channels


// Buffered channel
ch := make(chan int, 10)

// Producer goroutine
go func() {
    for i := 0; i < 100; i++ {
        ch <- i  // Send to channel
    }
    close(ch)
}()

// Consumer goroutine
go func() {
    for val := range ch {
        fmt.Println(val)
    }
}()

// Wait for completion (demo only - prefer sync.WaitGroup)
time.Sleep(1 * time.Second)

// Lightweight: millions of goroutines possible
								
Go: Built-in, elegant syntax

C++: Boost.Fiber (User-Space)


// Buffered channel (capacity must be a power of two)
boost::fibers::buffered_channel<int> ch(16);

// Producer fiber
boost::fibers::fiber producer([&ch]() {
    for (int i = 0; i < 100; i++) {
        ch.push(i);  // Send to channel
    }
    ch.close();
});

// Consumer fiber
boost::fibers::fiber consumer([&ch]() {
    int val;
    while (ch.pop(val) == 
           boost::fibers::channel_op_status::success) {
        std::cout << val << '\n';
    }
});

// Wait for completion
producer.join();
consumer.join();

// Lightweight: cooperative scheduling, no GC!
								
C++: Library, similar pattern, no GC overhead
Bonus: C++ also has thread pools, executors (C++23), and async/await (C++20) – pick the right tool for the job!

Error Handling: From Exceptions to std::expected

Old Way: Exceptions


// Traditional exception handling
User parseUser(const std::string& input) {
    try {
        auto data = json::parse(input);
        return User{data["name"], data["age"]};
    } catch (const json::exception& e) {
        throw std::runtime_error("Parse failed");
    }
}

// Usage: Hidden control flow
try {
    auto user = parseUser(input);
    process(user);
} catch (const std::exception& e) {
    log_error(e.what());
}

// Problems:
// - Hidden control flow (where can exceptions come from?)
// - Performance cost (stack unwinding)
// - Not clear from signature that it can fail
// - Exception safety is hard to get right
							
Exceptions: Hidden control flow, performance cost

Modern: std::expected (C++23)


// Explicit error handling (like Rust's Result)
std::expected<User, ParseError> 
parseUser(const std::string& input) {
    auto data = json::parse(input);
    if (!data.contains("name")) {
        return std::unexpected(ParseError::MissingField);
    }
    return User{data["name"], data["age"]};
}

// Usage: Explicit error handling
auto result = parseUser(input);
if (result) {
    process(*result);  // Success
} else {
    log_error(result.error());  // Explicit handling
}

// Benefits:
// - Clear from signature: can succeed or fail
// - Zero-cost (no stack unwinding)
// - Explicit control flow
// - Composable with and_then, or_else, transform
							
std::expected: Explicit, zero-cost, composable
Also: std::optional (C++17)
std::optional<User> findUser(int id);
Returns value or nothing (no exceptions)
Async errors:
awaitable<expected<Data, Error>>
Combine with coroutines for clean async errors
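
The monadic operations mentioned above chain fallible steps without if-ladders. A minimal C++23 sketch with hypothetical parse_age/validate helpers:

#include <expected>
#include <string>

enum class ParseError { MissingField, BadValue };

std::expected<int, ParseError> parse_age(const std::string& raw) {
    try { return std::stoi(raw); }
    catch (...) { return std::unexpected(ParseError::BadValue); }
}

std::expected<int, ParseError> validate(int age) {
    if (age < 0 || age > 150) return std::unexpected(ParseError::BadValue);
    return age;
}

std::string describe(const std::string& raw) {
    return parse_age(raw)
        .and_then(validate)                                      // runs only on success
        .transform([](int age) { return std::to_string(age); })  // map the success value
        .value_or("invalid age");                                // single fallback at the end
}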

JSON: Manual vs Automatic Deserialization

C++: Manual (nlohmann/json)


#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Parse JSON
auto data = json::parse(request_body);

// Manual extraction - explicit
auto name = data["name"].get<string>();
auto age = data["age"].get<int>();
auto items = data["items"].get<vector<string>>();

// Build response
auto response = json{
  {"status", "success"},
  {"user", {{"id", 123}, {"name", name}}}
};
return response.dump();
							
✓ Pros: Full control, no reflection overhead, zero-cost
✗ Cons: More code, manual field extraction

C#: Automatic (Reflection)


using System.Text.Json;

// Define model
class User {
  public string Name { get; set; }
  public int Age { get; set; }
  public List<string> Items { get; set; }
}

// Automatic via reflection
var user = JsonSerializer.Deserialize<User>(body);

// Use it
Console.WriteLine(user.Name);
							
✓ Pros: Less code, automatic mapping
✗ Cons: Reflection overhead, runtime type checks
C++ can do automatic too! Using compile-time reflection (next slide) →

Automatic Deserialization: C++, C#, Go

C++: Compile-Time Reflection


#include <boost/describe.hpp>
#include <boost/json.hpp>

struct User {
    std::string name;
    int age;
    std::vector<std::string> items;
};

// Compile-time metadata
BOOST_DESCRIBE_STRUCT(User, (), (name, age, items))

// Automatic deserialization
User user = boost::json::value_to<User>(
    boost::json::parse(request_body)
);
							
Compile-Time: Macro generates code at compile time
✓ Zero runtime overhead • ✓ Type-safe

C#: Runtime Reflection


using System.Text.Json;

class User {
  public string Name { get; set; }
  public int Age { get; set; }
  public List<string> Items { get; set; }
}

// Runtime reflection
var user = JsonSerializer.Deserialize<User>(body);
							
Runtime: Inspects types at runtime
✗ Reflection overhead • ✗ Runtime type checks

Go: Struct Tags


import "encoding/json"

type User struct {
    Name  string   `json:"name"`
    Age   int      `json:"age"`
    Items []string `json:"items"`
}

// Runtime reflection via tags
var user User
json.Unmarshal([]byte(body), &user)
						
Runtime: Uses struct tags + reflection
✗ Runtime overhead • ✓ Built-in, no macros
C++ Advantage: Same convenience as C#/Go, but compile-time = zero runtime cost!
~/presentation/modern-cpp $ questions
$ cat questions.txt
Questions?
# Feel free to ask about anything
$ echo $CONTACT_INFO
role: Co-CTO @ IDT
name: Dzmitry Martavoi
$
Modern C++ = Performance + Safety + Developer Happiness