Help & Metric Reference

About This System

This is a diagnostic monitoring service designed to pinpoint the cause of intermittent HTTPS availability failures on a self-hosted GitLab instance.

The problem: GitLab periodically becomes unreachable over HTTPS (port 443), while ICMP, SSH, and HTTP/80 continue to work. This pattern suggests the failure occurs within a specific layer of the HTTPS stack — but without continuous measurement data captured during an outage, it's impossible to determine which layer is responsible.

This service solves that by continuously probing the HTTPS endpoint every 15 seconds, breaking down each connection into individual phases (DNS, TCP, TLS, server processing), and recording the timing data. When an outage occurs, the recorded data immediately reveals whether the failure is at:

The web dashboards provide real-time visibility during outages and historical analysis for identifying patterns like gradual degradation leading up to failures.

Connection Lifecycle

Each probe establishes a fresh HTTPS connection and measures how long each phase takes:

Client Server
  │── DNS ──────────────────────────────│ Resolve hostname to IP
  │── Connect ──────────────────────────│ TCP 3-way handshake
  │── TLS ──────────────────────────────│ TLS handshake (certificates, keys)
  │── Server Processing ─────────────────│ Request sent → first byte received
  │──────────────────────────────────────│
  └── TTFB = DNS + Connect + TLS + Server (cumulative)

Metrics

MetricWhat it measuresHealthy range
DNSTime to resolve the hostname to an IP address< 50ms
ConnectTime for the TCP handshake (SYN → SYN-ACK → ACK)< 50ms (same region)
TLSTime for the TLS handshake (ClientHello through Finished)< 100ms
Server ProcessingTime between TLS completion and receiving the first response byte< 200ms
TTFBTotal time from start to first byte (cumulative: DNS+Connect+TLS+Server)< 300ms
TotalComplete request duration including response body< 500ms
TCP:80TCP connect time to port 80 (HTTP) — no TLS, just the handshake< 50ms
H2 TTFBFull TTFB using HTTP/2 instead of HTTP/1.1< 300ms

Comparative Probes

Each probe cycle runs three tests in parallel to help isolate failure layers:

TCP:80 — Port 80 Control Probe

A bare TCP connection to port 80 (no TLS, no HTTP request). This answers: "Can the host accept connections at all?"

TCP:80 ResultPort 443 ResultInterpretation
FastTimeoutProblem is specific to HTTPS — Apache's port 443 listener or SSL workers are exhausted, but the host itself is fine
TimeoutTimeoutProblem is system-wide — host overloaded, network issue, or Apache entirely unresponsive
FastFastEverything healthy

H2 — HTTP/2 Comparative Probe

A full HTTPS request using HTTP/2 (with ALPN h2 negotiation). This answers: "Is HTTP/2 behaving differently than HTTP/1.1?"

HTTP/1.1 ResultH2 ResultInterpretation
OKFailingHTTP/2-specific issue — likely mod_http2 bug or resource leak
FailingOKUnusual — possibly HTTP/1.1 worker exhaustion while H2 multiplexes on existing connections
FailingFailingGeneral HTTPS failure (both protocols affected)
OK, H2 slowerSlowH2 degradation in progress — potential early warning of mod_http2 resource leak

The H2 probe records full timing breakdown (DNS, Connect, TLS, Server, TTFB) so you can see exactly where the HTTP/2 path diverges from HTTP/1.1.

Additional Metadata

FieldPurpose
TLS VersionWhich TLS version was negotiated (TLS1.2 vs TLS1.3) — detects fallbacks under load
TLS CipherNegotiated cipher suite — detects unexpected changes
Cert ExpiryCertificate expiration timestamp — catch renewal failures before they cause outages
Response SizeDetect truncated responses or error pages being served instead of normal content
Server HeaderDetect if traffic is being routed to a different backend (CDN, failover)
Consecutive FailuresNumber of sequential failed probes — distinguishes one-off blips from sustained outages
Outage StartTimestamp when the current failure streak began

Phase Detection

When a probe fails, the system identifies which phase failed:

PhaseMeaningLikely causes
TCPCould not establish a TCP connectionApache listener saturated, accept queue full, firewall, host down
TLSTCP connected but TLS handshake failed/timed outmod_ssl issues, certificate problems, HTTP/2 bugs, SSL worker exhaustion
HTTPTLS completed but no HTTP response receivedWorkhorse/Puma overloaded, backend timeout, application crash
ERROther error (could not classify)DNS failure, unexpected network error
OKRequest completed successfully

Interpreting Outages

TCP Connect Times Out

The most critical failure pattern. If Connect is at the timeout value (5s) and no further phases complete:

TLS Handshake Stalls

If Connect is fast but TLS is very high or times out:

Server Processing Is Slow

If DNS, Connect, and TLS are all fast but Server Processing is high:

Gradual Degradation

If metrics slowly increase over hours before a full outage:

Color Coding (Events Table)

ColorMeaningThresholds
YellowWarning — slower than normalDNS >0.5s, Connect >1s, TLS >1s, Server >0.5s, TTFB >1s, Total >2s
RedCritical — likely failureDNS >2s, Connect >3s, TLS >3s, Server >3s, TTFB >5s, Total >10s

Chart Modes

Line: Shows each metric as a separate line. Best for identifying which specific phase is degrading.

Stacked Area: Stacks all phase durations. The top of the stack represents the approximate TTFB. Best for seeing the overall time budget and which phase is consuming the most.

Tips