This is a diagnostic monitoring service designed to pinpoint the cause of intermittent HTTPS availability failures on a self-hosted GitLab instance.
The problem: GitLab periodically becomes unreachable over HTTPS (port 443), while ICMP, SSH, and HTTP/80 continue to work. This pattern suggests the failure occurs within a specific layer of the HTTPS stack — but without continuous measurement data captured during an outage, it's impossible to determine which layer is responsible.
This service solves that by continuously probing the HTTPS endpoint every 15 seconds, breaking down each connection into individual phases (DNS, TCP, TLS, server processing), and recording the timing data. When an outage occurs, the recorded data immediately reveals whether the failure is at:
The web dashboards provide real-time visibility during outages and historical analysis for identifying patterns like gradual degradation leading up to failures.
Each probe establishes a fresh HTTPS connection and measures how long each phase takes:
| Metric | What it measures | Healthy range |
|---|---|---|
| DNS | Time to resolve the hostname to an IP address | < 50ms |
| Connect | Time for the TCP handshake (SYN → SYN-ACK → ACK) | < 50ms (same region) |
| TLS | Time for the TLS handshake (ClientHello through Finished) | < 100ms |
| Server Processing | Time between TLS completion and receiving the first response byte | < 200ms |
| TTFB | Total time from start to first byte (cumulative: DNS+Connect+TLS+Server) | < 300ms |
| Total | Complete request duration including response body | < 500ms |
| TCP:80 | TCP connect time to port 80 (HTTP) — no TLS, just the handshake | < 50ms |
| H2 TTFB | Full TTFB using HTTP/2 instead of HTTP/1.1 | < 300ms |
Each probe cycle runs three tests in parallel to help isolate failure layers:
A bare TCP connection to port 80 (no TLS, no HTTP request). This answers: "Can the host accept connections at all?"
| TCP:80 Result | Port 443 Result | Interpretation |
|---|---|---|
| Fast | Timeout | Problem is specific to HTTPS — Apache's port 443 listener or SSL workers are exhausted, but the host itself is fine |
| Timeout | Timeout | Problem is system-wide — host overloaded, network issue, or Apache entirely unresponsive |
| Fast | Fast | Everything healthy |
A full HTTPS request using HTTP/2 (with ALPN h2 negotiation). This answers: "Is HTTP/2 behaving differently than HTTP/1.1?"
| HTTP/1.1 Result | H2 Result | Interpretation |
|---|---|---|
| OK | Failing | HTTP/2-specific issue — likely mod_http2 bug or resource leak |
| Failing | OK | Unusual — possibly HTTP/1.1 worker exhaustion while H2 multiplexes on existing connections |
| Failing | Failing | General HTTPS failure (both protocols affected) |
| OK, H2 slower | Slow | H2 degradation in progress — potential early warning of mod_http2 resource leak |
The H2 probe records full timing breakdown (DNS, Connect, TLS, Server, TTFB) so you can see exactly where the HTTP/2 path diverges from HTTP/1.1.
| Field | Purpose |
|---|---|
| TLS Version | Which TLS version was negotiated (TLS1.2 vs TLS1.3) — detects fallbacks under load |
| TLS Cipher | Negotiated cipher suite — detects unexpected changes |
| Cert Expiry | Certificate expiration timestamp — catch renewal failures before they cause outages |
| Response Size | Detect truncated responses or error pages being served instead of normal content |
| Server Header | Detect if traffic is being routed to a different backend (CDN, failover) |
| Consecutive Failures | Number of sequential failed probes — distinguishes one-off blips from sustained outages |
| Outage Start | Timestamp when the current failure streak began |
When a probe fails, the system identifies which phase failed:
| Phase | Meaning | Likely causes |
|---|---|---|
| TCP | Could not establish a TCP connection | Apache listener saturated, accept queue full, firewall, host down |
| TLS | TCP connected but TLS handshake failed/timed out | mod_ssl issues, certificate problems, HTTP/2 bugs, SSL worker exhaustion |
| HTTP | TLS completed but no HTTP response received | Workhorse/Puma overloaded, backend timeout, application crash |
| ERR | Other error (could not classify) | DNS failure, unexpected network error |
| OK | Request completed successfully | — |
The most critical failure pattern. If Connect is at the timeout value (5s) and no further phases complete:
ss -ltnp '( sport = :443 )' and apachectl fullstatusIf Connect is fast but TLS is very high or times out:
If DNS, Connect, and TLS are all fast but Server Processing is high:
gitlab-ctl status and Puma/Workhorse logsIf metrics slowly increase over hours before a full outage:
| Color | Meaning | Thresholds |
|---|---|---|
| Yellow | Warning — slower than normal | DNS >0.5s, Connect >1s, TLS >1s, Server >0.5s, TTFB >1s, Total >2s |
| Red | Critical — likely failure | DNS >2s, Connect >3s, TLS >3s, Server >3s, TTFB >5s, Total >10s |
Line: Shows each metric as a separate line. Best for identifying which specific phase is degrading.
Stacked Area: Stacks all phase durations. The top of the stack represents the approximate TTFB. Best for seeing the overall time budget and which phase is consuming the most.