VictoriaTraces Jaeger API Bug: Incomplete Traces

by Admin

Hey everyone! Today, we're diving deep into a tricky bug we encountered with the VictoriaTraces Jaeger API. This issue caused incomplete traces to be returned, which can be a real headache when trying to monitor and debug applications. Let's break down what happened, how we reproduced it, and what the impact is.

The Problem: Incomplete Traces

So, what's the deal? We discovered that the Jaeger API endpoint /select/jaeger/api/traces/{traceID} in VictoriaTraces was returning incomplete traces. Specifically, only spans that ended within roughly 30 to 60 seconds of the trace's start time seemed to be included in the results. Anything beyond that timeframe was MIA. This is a critical issue because it means we're not getting the full picture of what's happening in our applications, especially for long-running operations.

To put it simply, imagine you're trying to watch a movie, but you only get to see the first 30 minutes. You'd miss the climax, the resolution, and all the good stuff in between! Similarly, with incomplete traces, we're missing crucial parts of the application's execution, making it difficult to pinpoint bottlenecks or errors. This makes understanding the entire flow of requests and operations a significant challenge.

The inconsistency between what we expected and what we received was quite alarming. We knew something was amiss when we saw discrepancies in the number of spans returned by different queries. This led us to investigate further and eventually uncover the time-window limitation within the Jaeger API. We need complete traces to effectively monitor and troubleshoot our systems, and this bug was preventing us from doing so.

Evidence: Seeing is Believing

We gathered some solid evidence to back up our claim. Take a look at these examples:

  • Production Trace 4e4a1b65edb8143ce2fa90ee1df9601b:
    • LogsQL (the VictoriaLogs query language, which VictoriaTraces also exposes) showed 10 spans.
    • The Jaeger API, however, only returned 3 spans.
  • Test Trace (using the attached script):
    • LogsQL: 5 spans
    • Jaeger API: 3 spans

These discrepancies clearly indicated that something was wrong. The Jaeger API was consistently missing spans, especially those occurring later in the trace. This was a major red flag, prompting us to dig deeper and create a reproducible test case.

The fact that the LogsQL query returned the correct number of spans while the Jaeger API did not points to a problem in how the Jaeger API handles trace data, rather than in ingestion or storage. This discrepancy was not just a minor inconvenience; it represented a significant gap in our ability to observe and understand the behavior of our applications. It was essential to reproduce this issue reliably to ensure a proper fix could be implemented.
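
The comparison behind these numbers can be sketched in a few lines. This is a minimal sketch, not the exact code we ran: the helper names (logsql_span_names, jaeger_span_names, missing_from_jaeger) and the sample payloads are made up for illustration. The field names (span_id, name, operationName) are the ones the reproduction script reads from each response.

```python
import json


def logsql_span_names(body: str) -> set:
    """Collect span names from a LogsQL response body (one JSON object per line)."""
    names = set()
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("span_id"):
            names.add(entry.get("name", "unknown"))
    return names


def jaeger_span_names(payload: dict) -> set:
    """Collect operation names from a Jaeger-style trace response."""
    traces = payload.get("data") or []
    if not traces:
        return set()
    return {span["operationName"] for span in traces[0].get("spans", [])}


def missing_from_jaeger(logsql_body: str, jaeger_payload: dict) -> set:
    """Span names that LogsQL reports but the Jaeger response lacks."""
    return logsql_span_names(logsql_body) - jaeger_span_names(jaeger_payload)


# Hypothetical sample data shaped like the responses the script parses.
sample_logsql = "\n".join(
    json.dumps({"span_id": f"{i:016x}", "name": name})
    for i, name in enumerate(["parent_span", "child_span_1", "child_span_2"], start=1)
)
sample_jaeger = {"data": [{"spans": [{"operationName": "child_span_1"}]}]}

print(sorted(missing_from_jaeger(sample_logsql, sample_jaeger)))
```

Comparing sets of names rather than raw counts tells you which spans went missing, not just how many.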

Reproducing the Bug: Let's Get Our Hands Dirty

To make sure we could consistently reproduce the issue, we created a Python script (reproduce_victoria_traces_bug.py). This script simulates a trace with five spans, spanning over 90 seconds. Here's how you can run it:

pip install requests
python3 reproduce_victoria_traces_bug.py

This script does the following:

  1. Generates a trace with a parent span lasting 90 seconds and four child spans at 20-second intervals.
  2. Sends these spans to VictoriaTraces using the OpenTelemetry (OTLP) protocol.
  3. Queries VictoriaTraces using both LogsQL and the Jaeger API.
  4. Compares the results.

The expected outcome is that both LogsQL and the Jaeger API should return all five spans. However, the actual result showed that LogsQL returned all five spans, while the Jaeger API only returned three. This perfectly demonstrates the bug: the Jaeger API is missing spans from longer traces.
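
To see why a time window would drop spans from this trace, it helps to lay out when each span ends. The sketch below only does the arithmetic. The window_s cutoff is our hypothesis about the bug, not a documented VictoriaTraces setting, and end_offsets and spans_within_window are names invented here.

```python
# Offsets in seconds from the trace start, matching the script's layout:
# four 100 ms children starting at 0, 20, 40 and 60 s, plus a 90 s parent.
SPANS = {f"child_span_{i + 1}": (i * 20.0, 0.1) for i in range(4)}
SPANS["parent_span"] = (0.0, 90.0)


def end_offsets(spans: dict) -> dict:
    """Map each span name to its end time, in seconds after the trace start."""
    return {name: start + duration for name, (start, duration) in spans.items()}


def spans_within_window(spans: dict, window_s: float) -> list:
    """Names of spans ending within window_s of the trace start (hypothetical cutoff)."""
    return sorted(name for name, end in end_offsets(spans).items() if end <= window_s)


for window in (30, 45, 60):
    kept = spans_within_window(SPANS, window)
    print(f"window={window}s keeps {len(kept)} spans: {kept}")
```

Under this model, a 30-second cutoff keeps two spans, while any cutoff between about 40 and 60 seconds keeps three, which matches the three spans we got back from the Jaeger API.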

Creating a reproducible test case like this is crucial for bug reporting. It allows developers to quickly understand the issue, replicate it in their environment, and verify that their fix is effective. Without a solid way to reproduce the bug, it can be challenging to diagnose and resolve the root cause.

The Impact: Why This Matters

This bug has a significant impact on our ability to monitor and troubleshoot applications. Any trace that lasts longer than one minute appears incomplete in Grafana, which is our primary monitoring tool. This affects all long-running operations, such as complex database queries, multi-stage data processing pipelines, or any workflow that takes more than a minute to complete. Essentially, we're flying blind for a significant portion of our application's activity.

Imagine trying to debug a slow API request that takes two minutes to complete. If the Jaeger API only returns spans for the first minute, we're missing half the story! This makes it incredibly difficult to identify the root cause of the slowness. This limitation can lead to delayed problem resolution, increased debugging time, and ultimately, a degraded user experience.

Environment: Where Did This Happen?

We encountered this bug in VictoriaTraces version v0.4.0. This information is essential for anyone else experiencing the same issue, as it helps narrow down the potential causes and ensures that the fix is applied to the correct version. Providing specific version details is a crucial step in any bug report, as it facilitates targeted debugging and prevents wasted effort on issues that may have already been addressed in newer releases.

The Request: Our Plea for Help

Our request is simple: Please fix the Jaeger API to return all spans for a trace ID, not just those ending within the first 60 seconds. We need a complete view of our traces to effectively monitor and debug our applications. This is not just a minor inconvenience; it's a critical requirement for maintaining the reliability and performance of our systems.

Diving Deeper: The Reproduction Script

Let's take a closer look at the Python script we used to reproduce the bug. This script is designed to be self-contained and easy to run, making it a valuable tool for anyone wanting to investigate this issue further.

#!/usr/bin/env python3
"""
VictoriaTraces Jaeger API Bug - Standalone Reproduction

Bug: Jaeger API only returns spans ending within ~30-60 seconds of trace start.
Runtime: ~65 seconds (three 20 s sleeps plus a 5 s ingestion wait)
Dependencies: requests (pip install requests)

Usage:
    python3 reproduce_victoria_traces_bug.py

Expected: LogsQL and Jaeger API both return all 5 spans
Actual:   LogsQL returns 5 spans, Jaeger API returns only 3 spans (in our run)
"""

import json
import random
import time
import requests


# Configuration - CHANGE THIS to your VictoriaTraces endpoint
VICTORIA_TRACES_URL = "https://traces-dev.my-company.com"  # Replace with your actual URL
VERIFY_SSL = False  # Set to True if you have valid SSL certs


def generate_id(length=16):
    """Generate random hex ID"""
    return ''.join(random.choice('0123456789abcdef') for _ in range(length))


def send_span_otlp(trace_id, span_id, parent_span_id, name, start_time_ns, duration_ns):
    """Send a single span to VictoriaTraces via OTLP"""
    end_time_ns = start_time_ns + duration_ns

    span = {
        "traceId": trace_id,
        "spanId": span_id,
        "name": name,
        "startTimeUnixNano": str(start_time_ns),
        "endTimeUnixNano": str(end_time_ns),
        "kind": 1,
        "attributes": [
            {"key": "service.name", "value": {"stringValue": "bug-repro"}}
        ],
        "status": {"code": 0}
    }

    if parent_span_id:
        span["parentSpanId"] = parent_span_id

    payload = {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": "bug-repro"}}
                ]
            },
            "scopeSpans": [{
                "spans": [span]
            }]
        }]
    }

    try:
        response = requests.post(
            f"{VICTORIA_TRACES_URL}/insert/opentelemetry/v1/traces",
            json=payload,
            verify=VERIFY_SSL,
            timeout=5
        )
        response.raise_for_status()
        return True
    except Exception as e:
        print(f"  ERROR sending span {name}: {e}")
        return False


def main():
    print("=" * 80)
    print("VictoriaTraces Jaeger API Bug Reproduction")
    print("=" * 80)
    print()

    # Generate trace and span IDs
    trace_id = generate_id(32)
    parent_id = generate_id(16)
    child_ids = [generate_id(16) for _ in range(4)]

    print(f"Trace ID: {trace_id}")
    print()
    print("Creating trace with 5 spans over 90 seconds...")
    print("  - 1 parent span (0s to 90s)")
    print("  - 4 child spans at 20s intervals")
    print()

    base_time = int(time.time() * 1_000_000_000)  # Current time in nanoseconds

    # Send child spans at 20-second intervals
    for i in range(4):
        elapsed = i * 20
        print(f"[{elapsed:3d}s] Sending child_span_{i+1}...")

        start_time = base_time + (i * 20_000_000_000)
        send_span_otlp(
            trace_id, child_ids[i], parent_id,
            f"child_span_{i+1}",
            start_time,
            100_000_000  # 100ms duration
        )

        if i < 3:  # Don't wait after last child
            time.sleep(20)

    # Send parent span (covers entire trace). It is sent ~60s after base_time,
    # so its end timestamp (base_time + 90s) lies ~30s in the future.
    print("[ 60s] Sending parent_span (covers full 90s)...")
    send_span_otlp(
        trace_id, parent_id, None,
        "parent_span",
        base_time,
        90_000_000_000  # 90 seconds
    )

    print()
    print("✓ All 5 spans sent to VictoriaTraces")
    print()

    # Wait for ingestion
    print("Waiting 5 seconds for data ingestion...")
    time.sleep(5)
    print()

    # Query LogsQL
    print("=" * 80)
    print("Querying VictoriaTraces...")
    print("=" * 80)
    print()

    print("[1] LogsQL Query (storage backend):")
    try:
        response = requests.get(
            f"{VICTORIA_TRACES_URL}/select/logsql/query",
            params={"query": f'"trace_id":"{trace_id}"', "limit": 100},
            verify=VERIFY_SSL,
            timeout=10
        )

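        response.raise_for_status()

        # The response body is JSON lines: one log entry (span) per line
        lines = response.text.strip().split('\n')
        span_names = set()
        for line in lines:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or non-JSON lines
            if entry.get('span_id'):
                span_names.add(entry.get('name', 'unknown'))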

        print(f"    Returned: {len(span_names)} spans")
        for name in sorted(span_names):
            print(f"      - {name}")
    except Exception as e:
        print(f"    ERROR: {e}")
        span_names = set()

    print()

    # Query Jaeger API
    print("[2] Jaeger API Query:")
    try:
        response = requests.get(
            f"{VICTORIA_TRACES_URL}/select/jaeger/api/traces/{trace_id}",
            verify=VERIFY_SSL,
            timeout=10
        )

        response.raise_for_status()

        data = response.json()
        traces = data.get('data') or [{}]  # tolerate a missing or empty trace list
        jaeger_spans = traces[0].get('spans', [])
        jaeger_names = [s['operationName'] for s in jaeger_spans]

        print(f"    Returned: {len(jaeger_spans)} spans")
        for name in sorted(jaeger_names):
            print(f"      - {name}")
    except Exception as e:
        print(f"    ERROR: {e}")
        jaeger_names = []

    print()

    # Results
    print("=" * 80)
    print("RESULTS")
    print("=" * 80)
    print()

    logsql_count = len(span_names)
    jaeger_count = len(jaeger_names)

    print("Expected spans:  5")
    print(f"LogsQL returned: {logsql_count} spans")
    print(f"Jaeger API returned: {jaeger_count} spans")
    print()

    if logsql_count == jaeger_count == 5:
        print("✓ SUCCESS: Both APIs returned all spans - no bug detected")
        print()
        print("This could mean:")
        print("  - The bug has been fixed")
        print("  - Test duration too short (try longer trace)")
        print("  - Different VictoriaTraces version")

    elif logsql_count == 5 and jaeger_count < 5:
        print(f"✗ BUG CONFIRMED: Jaeger API missing {5 - jaeger_count} spans!")
        print()

        missing = set(span_names) - set(jaeger_names)
        print("Missing from Jaeger API:")
        for name in sorted(missing):
            print(f"  - {name}")

        print()
        print("DIAGNOSIS:")
        print("  VictoriaTraces Jaeger API appears to apply a ~30-60 second time window")
        print("  Spans ending later than that after trace start are filtered out")
        print()
        print("Likely missing: child_span_4 and parent_span (and possibly child_span_3)")
        print("  (They end at about 60s, 90s, and 40s after trace start, respectively)")

    else:
        print(f"⚠ UNEXPECTED: LogsQL={logsql_count}, Jaeger={jaeger_count}")
        print("  Data may not have been ingested properly")

    print()
    print("View in VictoriaTraces UI:")
    print(f"  {VICTORIA_TRACES_URL}/select/vmui/?#/?query=%22trace_id%22%3A%22{trace_id}%22")
    print()


if __name__ == "__main__":
    # Disable SSL warnings
    import urllib3
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    main()

Key Components

  • generate_id(length=16): Generates a random hexadecimal ID for spans and traces. This is essential for creating unique identifiers.
  • send_span_otlp(trace_id, span_id, parent_span_id, name, start_time_ns, duration_ns): Sends a single span to VictoriaTraces via the OTLP endpoint. This function constructs the OTLP payload and uses the requests library to send the data.
  • main(): The main function orchestrates the entire process. It generates trace and span IDs, sends the spans, queries VictoriaTraces using LogsQL and the Jaeger API, and then compares the results.
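
As a side note on the ID helper: OpenTelemetry trace IDs are 16 bytes and span IDs are 8 bytes, which is why the script asks for 32 and 16 hex characters. A minimal alternative sketch, assuming you would rather generate whole random bytes with Python's standard secrets module; the function names below are ours, not part of the original script.

```python
import secrets
import string


def new_trace_id() -> str:
    """Return 32 hex characters (16 random bytes)."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return 16 hex characters (8 random bytes)."""
    return secrets.token_hex(8)


def is_hex_id(value: str, length: int) -> bool:
    """Check that value is exactly `length` hexadecimal characters."""
    return len(value) == length and all(c in string.hexdigits for c in value)


trace_id = new_trace_id()
span_id = new_span_id()
print(trace_id, is_hex_id(trace_id, 32))
print(span_id, is_hex_id(span_id, 16))
```

Either approach works for a reproduction; what matters is that the IDs have the right length and are unique per run.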

The script sends the four child spans first, one every 20 seconds, and then sends the parent span, whose timestamps cover the full 90 seconds. This creates a trace with spans that end at different times, allowing us to observe the time-window limitation of the Jaeger API. The use of time.sleep() is crucial to simulate realistic trace durations and ensure that spans fall outside the suspected 30-60 second window.

Configuration

Before running the script, you'll need to modify the VICTORIA_TRACES_URL variable to point to your VictoriaTraces endpoint. You may also need to adjust the VERIFY_SSL variable depending on your SSL certificate configuration. This flexibility in configuration allows the script to be used in various environments without requiring significant modifications.
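
If you'd rather not edit the file each time, the two settings could be read from the environment instead. This is a sketch under our own assumptions: VT_URL and VT_VERIFY_SSL are hypothetical variable names, load_config is not part of the original script, and the example URL is a placeholder.

```python
import os


def load_config(env=None) -> dict:
    """Read the endpoint and TLS setting from environment variables.

    VT_URL and VT_VERIFY_SSL are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    # Strip a trailing slash so paths can be appended with a leading "/"
    url = env.get("VT_URL", "https://traces-dev.my-company.com").rstrip("/")
    verify = env.get("VT_VERIFY_SSL", "false").strip().lower() in ("1", "true", "yes")
    return {"url": url, "verify_ssl": verify}


# Passing a dict makes the function easy to try without touching the real environment.
config = load_config({"VT_URL": "http://traces.example.internal/", "VT_VERIFY_SSL": "true"})
print(config)
```

The defaults mirror the values hard-coded in the script, so an unset environment behaves the same as before.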

Results and Diagnosis

The script's output clearly shows the discrepancy between LogsQL and the Jaeger API. It reports the number of spans returned by each query and lists the spans that are missing from the Jaeger response. The diagnosis section of the output attributes this to a time window in the Jaeger API that filters out spans ending too long after the trace start. The script's clear and concise output makes it easy to understand the issue and verify the bug.
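
The returned and missing spans can also be used to narrow down where the cutoff sits. The sketch below assumes the filter is based on span end time, which is our hypothesis rather than something we confirmed in the VictoriaTraces source. window_bounds is a name invented here, and the set of returned spans is an assumed example (the three that end earliest).

```python
def window_bounds(end_offsets: dict, returned: set) -> tuple:
    """Bound a hypothetical end-time cutoff from one observation.

    The cutoff is at least the latest end among returned spans and
    below the earliest end among missing spans.
    """
    kept = [end for name, end in end_offsets.items() if name in returned]
    dropped = [end for name, end in end_offsets.items() if name not in returned]
    lower = max(kept) if kept else None
    upper = min(dropped) if dropped else None
    return lower, upper


# End times in seconds after trace start, as laid out by the reproduction script.
ends = {
    "child_span_1": 0.1,
    "child_span_2": 20.1,
    "child_span_3": 40.1,
    "child_span_4": 60.1,
    "parent_span": 90.0,
}

# Assumption: the three spans returned by the Jaeger API are the earliest-ending ones.
lower, upper = window_bounds(ends, {"child_span_1", "child_span_2", "child_span_3"})
print(f"cutoff is somewhere in [{lower}s, {upper}s)")
```

Repeating the run with child spans spaced more closely would tighten these bounds.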

Version Information

We were running VictoriaTraces version victoria-traces-20251014-032256-tags-v0.4.0-0-g8a5f1b618 when we encountered this bug. This specific version information helps in tracking down the exact source of the issue and ensures that any fixes are targeted appropriately.

Logs and Screenshots

Unfortunately, we don't have access to the server logs directly. However, the provided Python script should allow you to reproduce the issue in your own environment and examine the logs there. Screenshots are not applicable in this case, as the bug is best demonstrated through the script's output and the comparison of span counts.

Command-Line Flags

We didn't use any special command-line flags when running VictoriaTraces. The default configuration should be sufficient to reproduce the bug.

Additional Information

We've provided all the necessary information to reproduce and understand this bug. We hope this helps in resolving the issue quickly.

Conclusion: The Path to Complete Traces

In summary, the VictoriaTraces Jaeger API bug, where only spans ending within a short timeframe are returned, poses a significant challenge to effective application monitoring and debugging. By providing a clear problem description, reproducible test case, and detailed version information, we aim to facilitate a swift resolution. Complete traces are essential for gaining comprehensive insights into application behavior, and we look forward to a fix that addresses this limitation.

We hope this detailed explanation has been helpful! Let us know if you have any questions or if you encounter similar issues. Let's work together to make our tracing systems more reliable and insightful.