Help & Metric Reference

About This System

This is a diagnostic monitoring service designed to pinpoint the cause of intermittent HTTPS availability failures on a self-hosted GitLab instance.

The problem: GitLab periodically becomes unreachable over HTTPS (port 443), while ICMP, SSH, and HTTP/80 continue to work. This pattern suggests the failure occurs within a specific layer of the HTTPS stack — but without continuous measurement data captured during an outage, it's impossible to determine which layer is responsible.

This service solves that by probing the HTTPS endpoint every 15 seconds, breaking each connection down into individual phases (DNS, TCP, TLS, server processing), and recording the timing of each phase. When an outage occurs, the recorded data shows at which level the failure sits:

TCP level — Apache can't accept connections (worker/backlog exhaustion)
TLS level — SSL handshake stalls (mod_ssl/HTTP2 issues)
Application level — Backend is slow (Workhorse/Puma/database)

The web dashboards provide real-time visibility during outages and historical analysis for identifying patterns like gradual degradation leading up to failures.

Connection Lifecycle

Each probe establishes a fresh HTTPS connection and measures how long each phase takes:

Client                                 Server
  │── DNS ───────────────────────────────│ Resolve hostname to IP
  │── Connect ───────────────────────────│ TCP 3-way handshake
  │── TLS ───────────────────────────────│ TLS handshake (certificates, keys)
  │── Server Processing ─────────────────│ Request sent → first byte received
  │──────────────────────────────────────│
  └── TTFB = DNS + Connect + TLS + Server (cumulative)
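
The phase breakdown above can be approximated with the Python standard library. This is a minimal sketch, not the service's actual implementation: the names `PhaseTimings` and `probe` are hypothetical, and the 5-second timeout is borrowed from the value mentioned under "TCP Connect Times Out".

```python
import socket
import ssl
import time
from dataclasses import dataclass


@dataclass
class PhaseTimings:
    """Per-phase durations in seconds (hypothetical container)."""
    dns: float
    connect: float
    tls: float
    server: float

    @property
    def ttfb(self) -> float:
        # TTFB is cumulative: the sum of every phase up to the first byte.
        return self.dns + self.connect + self.tls + self.server


def probe(host: str, port: int = 443, timeout: float = 5.0) -> PhaseTimings:
    """Open one fresh HTTPS connection and time each phase separately."""
    t0 = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()  # DNS resolved

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
        t2 = time.perf_counter()  # TCP handshake done

        ctx = ssl.create_default_context()
        ctx.set_alpn_protocols(["http/1.1"])  # offer HTTP/1.1 only
        sock = ctx.wrap_socket(sock, server_hostname=host)
        t3 = time.perf_counter()  # TLS handshake done

        request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # block until the first response byte arrives
        t4 = time.perf_counter()
    finally:
        sock.close()
    return PhaseTimings(t1 - t0, t2 - t1, t3 - t2, t4 - t3)
```

Calling `probe("gitlab.example.com")` (a placeholder hostname) returns one set of timings. An exception raised partway through indicates which phase did not complete.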

Metrics

Metric	What it measures	Healthy range
DNS	Time to resolve the hostname to an IP address	< 50ms
Connect	Time for the TCP handshake (SYN → SYN-ACK → ACK)	< 50ms (same region)
TLS	Time for the TLS handshake (ClientHello through Finished)	< 100ms
Server Processing	Time between TLS completion and receiving the first response byte	< 200ms
TTFB	Total time from start to first byte (cumulative: DNS+Connect+TLS+Server)	< 300ms
Total	Complete request duration including response body	< 500ms
TCP:80	TCP connect time to port 80 (HTTP) — no TLS, just the handshake	< 50ms
H2 TTFB	Full TTFB using HTTP/2 instead of HTTP/1.1	< 300ms

Comparative Probes

Each probe cycle runs three tests in parallel: the main HTTPS probe over HTTP/1.1 and the two comparative probes described below, which help isolate the failing layer.

TCP:80 — Port 80 Control Probe

A bare TCP connection to port 80 (no TLS, no HTTP request). This answers: "Can the host accept connections at all?"

TCP:80 Result	Port 443 Result	Interpretation
Fast	Timeout	Problem is specific to HTTPS — Apache's port 443 listener or SSL workers are exhausted, but the host itself is fine
Timeout	Timeout	Problem is system-wide — host overloaded, network issue, or Apache entirely unresponsive
Fast	Fast	Everything healthy
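
The control probe and the table above can be sketched as follows. The function names are hypothetical, and the fourth combination (port 80 failing while port 443 works) is not covered by the table, so the sketch only flags it.

```python
import socket
import time
from typing import Optional


def tcp_connect_time(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the connect time in seconds, or None on timeout or refusal.

    Note: if `host` is a name rather than an IP address, the measured time
    also includes DNS resolution.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:  # covers timeouts, refusals, and resolution errors
        return None


def interpret_control(tcp80_ok: bool, https_ok: bool) -> str:
    """Map the two probe outcomes to the interpretations in the table."""
    if tcp80_ok and https_ok:
        return "healthy"
    if tcp80_ok and not https_ok:
        return "HTTPS-specific"
    if not tcp80_ok and not https_ok:
        return "system-wide"
    return "not covered by the table"
```

For example, `interpret_control(tcp_connect_time(host, 80) is not None, https_ok)` combines the control result with the outcome of the main probe.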

H2 — HTTP/2 Comparative Probe

A full HTTPS request using HTTP/2 (with ALPN h2 negotiation). This answers: "Is HTTP/2 behaving differently than HTTP/1.1?"

HTTP/1.1 Result	H2 Result	Interpretation
OK	Failing	HTTP/2-specific issue — likely `mod_http2` bug or resource leak
Failing	OK	Unusual — possibly HTTP/1.1 worker exhaustion while H2 multiplexes on existing connections
Failing	Failing	General HTTPS failure (both protocols affected)
OK	Slow	H2 is slower than HTTP/1.1: H2 degradation in progress, a potential early warning of a `mod_http2` resource leak

The H2 probe records full timing breakdown (DNS, Connect, TLS, Server, TTFB) so you can see exactly where the HTTP/2 path diverges from HTTP/1.1.
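
To illustrate the ALPN step only, the sketch below offers a protocol list during the TLS handshake and reports what the server selected. It does not send an HTTP/2 request, which would need an HTTP/2 client library. The function names are hypothetical.

```python
import socket
import ssl
from typing import List, Optional


def make_alpn_context(protocols: List[str]) -> ssl.SSLContext:
    """Build a verifying TLS context that offers the given ALPN protocols."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(protocols)
    return ctx


def negotiated_protocol(
    host: str, offer: List[str], port: int = 443, timeout: float = 5.0
) -> Optional[str]:
    """Return the ALPN protocol the server picked, or None if it picked none."""
    ctx = make_alpn_context(offer)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

If `negotiated_protocol(host, ["h2"])` returns `"h2"`, the server accepted HTTP/2 on that connection. A result of `None` means no protocol was negotiated.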

Additional Metadata

Field	Purpose
TLS Version	Which TLS version was negotiated (TLS1.2 vs TLS1.3) — detects fallbacks under load
TLS Cipher	Negotiated cipher suite — detects unexpected changes
Cert Expiry	Certificate expiration timestamp — catch renewal failures before they cause outages
Response Size	Detect truncated responses or error pages being served instead of normal content
Server Header	Detect if traffic is being routed to a different backend (CDN, failover)
Consecutive Failures	Number of sequential failed probes — distinguishes one-off blips from sustained outages
Outage Start	Timestamp when the current failure streak began
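
The last two fields can be derived from the stream of probe results. A minimal sketch, assuming each probe reports only success or failure plus a timestamp (`StreakTracker` is a hypothetical name):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StreakTracker:
    consecutive_failures: int = 0
    outage_start: Optional[datetime] = None

    def record(self, ok: bool, when: datetime) -> None:
        """Update the streak with the outcome of one probe."""
        if ok:
            # Any success ends the streak and clears the outage marker.
            self.consecutive_failures = 0
            self.outage_start = None
            return
        if self.consecutive_failures == 0:
            # First failure after a success: the outage starts here.
            self.outage_start = when
        self.consecutive_failures += 1
```

A single failed probe leaves the counter at 1, while a sustained outage keeps incrementing it and leaves `outage_start` fixed at the first failure.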

Phase Detection

When a probe fails, the system identifies which phase failed:

Phase	Meaning	Likely causes
TCP	Could not establish a TCP connection	Apache listener saturated, accept queue full, firewall, host down
TLS	TCP connected but TLS handshake failed/timed out	mod_ssl issues, certificate problems, HTTP/2 bugs, SSL worker exhaustion
HTTP	TLS completed but no HTTP response received	Workhorse/Puma overloaded, backend timeout, application crash
ERR	Other error (could not classify)	DNS failure, unexpected network error
OK	Request completed successfully	—
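
One way to express the table above in code is to record the stage at which the probe stopped and map it to a phase label. The stage names (`"dns"`, `"tcp"`, `"tls"`, `"http"`) and the function name are assumptions made for illustration.

```python
from typing import Optional

# Stages with a dedicated phase label; anything else is reported as ERR.
_PHASE_BY_STAGE = {"tcp": "TCP", "tls": "TLS", "http": "HTTP"}


def classify_phase(failed_stage: Optional[str]) -> str:
    """Return the phase label for a probe result.

    `failed_stage` is None when the request completed, otherwise the name
    of the stage that raised an error or timed out.
    """
    if failed_stage is None:
        return "OK"
    return _PHASE_BY_STAGE.get(failed_stage, "ERR")
```

A DNS failure has no phase of its own in the table, so it falls through to `ERR` along with any unclassified error.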

Interpreting Outages

TCP Connect Times Out

The most critical failure pattern. If Connect is at the timeout value (5s) and no further phases complete:

Apache cannot accept new connections on port 443
The kernel listen backlog is full
All Apache workers are busy (MaxRequestWorkers exhausted)
Check: `ss -ltnp '( sport = :443 )'` and `apachectl fullstatus`

TLS Handshake Stalls

If Connect is fast but TLS is very high or times out:

mod_ssl workers are exhausted
HTTP/2 (mod_http2) resource leak
Certificate chain or OCSP issues
Check: Apache error logs for SSL/HTTP2 errors

Server Processing Is Slow

If DNS, Connect, and TLS are all fast but Server Processing is high:

GitLab Workhorse is overloaded or unresponsive
Puma workers are exhausted
Backend dependencies (PostgreSQL, Redis) are slow
Check: `gitlab-ctl status` and Puma/Workhorse logs

Gradual Degradation

If metrics slowly increase over hours before a full outage:

Resource leak (likely HTTP/2 or KeepAlive related)
Worker pool slowly filling up
Use the Summary view with a 3d–7d window to spot trends
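
A slow upward trend can also be quantified from exported data. The sketch below fits a least-squares line to (timestamp, value) pairs and reports the slope per hour. The input format and function name are assumptions, not part of the service.

```python
from typing import List, Tuple


def slope_per_hour(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of a metric, in seconds of latency per hour.

    `samples` holds (unix_timestamp_seconds, metric_value_seconds) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den * 3600.0
```

A slope that stays positive across a multi-day window is consistent with a leak or a slowly filling worker pool. A slope near zero suggests the metric is stable.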

Color Coding (Events Table)

Color	Meaning	Thresholds
Yellow	Warning — slower than normal	DNS >0.5s, Connect >1s, TLS >1s, Server >0.5s, TTFB >1s, Total >2s
Red	Critical — likely failure	DNS >2s, Connect >3s, TLS >3s, Server >3s, TTFB >5s, Total >10s
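
The thresholds above translate directly into a lookup. This sketch assumes values are in seconds and that a value exactly at a threshold is not flagged, since the table uses ">". The metric keys and function name are hypothetical.

```python
from typing import Optional

# Thresholds in seconds, copied from the table above.
WARN = {"dns": 0.5, "connect": 1.0, "tls": 1.0, "server": 0.5, "ttfb": 1.0, "total": 2.0}
CRIT = {"dns": 2.0, "connect": 3.0, "tls": 3.0, "server": 3.0, "ttfb": 5.0, "total": 10.0}


def color(metric: str, seconds: float) -> Optional[str]:
    """Return "red", "yellow", or None (no highlight) for one measurement."""
    if seconds > CRIT[metric]:
        return "red"
    if seconds > WARN[metric]:
        return "yellow"
    return None
```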

Chart Modes

Line: Shows each metric as a separate line. Best for identifying which specific phase is degrading.

Stacked Area: Stacks all phase durations. The top of the stack represents the approximate TTFB. Best for seeing the overall time budget and which phase is consuming the most.

Tips

Keep the Live view open during known-unstable periods to catch failures in real-time
Use the Events table sorted by Connect or TLS (descending) to find the worst probes
Compare data from multiple probe locations to distinguish server-side from network issues
The main probe forces HTTP/1.1 and disables connection reuse, so each measurement is an independent connection