
Validating the Platform Under Change

BLUF

The previous phases of this series focused on designing the architecture and assembling the initial infrastructure for the lab.

This phase focused on something different:

validating whether the architecture actually behaves correctly under disruption, rebuild, and operational lifecycle conditions.

The turning point came unexpectedly through hardware failure. A motherboard replacement forced the system through its first meaningful disruption while scaling toward 256GB of RAM. Rather than simply restoring the machine, I used the rebuild to validate whether the system architecture could survive change without requiring reconstruction.

That process exposed hidden configuration drift, validated the durability boundaries between storage and compute, and ultimately led to a broader operational maturity effort across the stack.

The work evolved through three repositories with intentionally separated responsibilities:

  • infra-validate: validates Linux host health, ZFS storage integrity, Kubernetes readiness, and stateful workload behavior
  • data-platform: deploys and verifies MinIO and Spark on the existing K3s substrate through repeatable deployment, teardown, rebuild, and verification workflows
  • data-lab: consumes the platform runtime for interactive notebooks, Spark experimentation, and workload validation without mutating infrastructure behavior

The goal throughout this phase was consistent:

reduce operational ambiguity before increasing architectural complexity.

Each layer was only expanded after the previous layer became operationally verifiable.

This is intentionally still a constrained single-node environment. The goal is not high availability or cloud-scale throughput. The goal is developing operational understanding through explicit contracts, repeatable validation, and deterministic behavior.


Hardware Failure as Architectural Validation

BLUF

The motherboard replacement unintentionally became the first real resilience test of the architecture.

The important outcome was not that the machine came back online. The important outcome was that durable state survived independently of runtime infrastructure and compute services.

That behavior validated one of the core architectural goals established earlier in the series:

storage should remain durable while compute and orchestration remain replaceable and recoverable.

Identifying the Failure

The original issue surfaced while attempting to scale the system beyond 128GB of RAM.

Symptoms included:

  • inconsistent DIMM behavior
  • unstable memory channel population
  • intermittent recognition failures

After I isolated the DIMMs and validated CPU behavior, the issue traced back to motherboard instability rather than memory or processor failure.

The replacement board immediately stabilized memory behavior.

From there, validation proceeded incrementally:

  • 128GB
  • 192GB
  • progressing toward 256GB

Rather than assuming success because the machine booted, I validated the system under sustained load using:

  • incremental DIMM population
  • stress-ng
  • dmesg inspection
  • channel population verification
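
The dmesg inspection step can be partly automated. The following is a minimal sketch, not the tooling used in this lab: the regular expressions are assumptions about what memory-related kernel messages tend to look like, and the exact wording varies by kernel version and platform. The function names are hypothetical.

```python
import re
import subprocess

# Assumed patterns for memory-related kernel messages; adjust for your platform.
MEMORY_ERROR_PATTERNS = [
    re.compile(r"\bEDAC\b.*\b(CE|UE|error)\b", re.IGNORECASE),
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"hardware error", re.IGNORECASE),
]


def find_memory_errors(log_text: str) -> list[str]:
    """Return kernel log lines that match any of the assumed patterns."""
    hits = []
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in MEMORY_ERROR_PATTERNS):
            hits.append(line.strip())
    return hits


def scan_dmesg() -> list[str]:
    """Read the kernel ring buffer and filter it (may require root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return find_memory_errors(result.stdout)
```

Running scan_dmesg() before and after a stress-ng pass, and comparing the two results, gives a simple signal for whether sustained load surfaced new errors.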

One important lesson became immediately clear:

High-capacity memory systems are validated under load—not at boot.


Discovering Configuration Drift

BLUF

The most valuable problem uncovered during the hardware swap was not hardware-related.

It was operational drift that had accumulated silently over time.

During post-swap validation, running:

mount -a

produced:

Structure needs cleaning on /mnt/data

At first glance this appeared to be a storage failure.

It was not.

The actual durable storage layer (tank) remained healthy and fully intact.

The issue was a stale XFS mount entry in /etc/fstab that still referenced a previously removed 1TB drive.

The architecture itself had survived correctly.

The configuration surrounding it had not.

Resolution involved:

  • removing orphaned mount entries
  • validating clean mount state
  • confirming ZFS as the single durable source of truth
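
One way to catch this class of drift before the next hardware change is to check that every device referenced in /etc/fstab still exists. The sketch below is an illustration under assumptions, not the check used in this lab: the function names are hypothetical, and it only detects entries whose device is missing. It would not flag a device that is present but fails to mount.

```python
import os
from typing import Callable, NamedTuple, Optional


class FstabEntry(NamedTuple):
    spec: str        # device path, UUID=..., or LABEL=...
    mountpoint: str
    fstype: str


def parse_fstab(text: str) -> list[FstabEntry]:
    """Parse fstab text, skipping blank lines and comment lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3:
            entries.append(FstabEntry(fields[0], fields[1], fields[2]))
    return entries


def resolve_spec(spec: str) -> Optional[str]:
    """Map an fstab spec to a path under /dev, or None for non-device specs."""
    if spec.startswith("UUID="):
        return "/dev/disk/by-uuid/" + spec[5:].strip('"')
    if spec.startswith("LABEL="):
        return "/dev/disk/by-label/" + spec[6:].strip('"')
    if spec.startswith("/dev/"):
        return spec
    return None  # tmpfs, proc, network mounts, and similar


def find_stale_entries(
    text: str, exists: Callable[[str], bool] = os.path.exists
) -> list[FstabEntry]:
    """Return entries whose backing device path does not exist."""
    stale = []
    for entry in parse_fstab(text):
        path = resolve_spec(entry.spec)
        if path is not None and not exists(path):
            stale.append(entry)
    return stale
```

Calling find_stale_entries(open("/etc/fstab").read()) on the host returns the orphaned entries. The exists parameter is injectable so the logic can be exercised without touching real devices.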

This reinforced an important operational reality:

Hardware changes expose assumptions and configuration drift that otherwise remain hidden indefinitely.

More importantly, it validated the architectural separation established earlier in the project:

Layer               Responsibility
ZFS (tank)          Durable state
NVMe (/fast)        Performance / staging
K3s + containers    Recoverable runtime compute
OS install          Replaceable orchestration substrate

The system behaved according to design intent:

  • data survived
  • compute was recoverable
  • runtime services were reproducible
  • recovery remained deterministic

That was the real milestone of this phase.


infra-validate: Proving Stateful Readiness

BLUF

Once the hardware stabilized, the focus shifted from:

“the system runs”

to:

“the system is operationally ready to safely support stateful services.”

This led to the continued expansion of infra-validate.

infra-validate does not provision infrastructure or install Kubernetes.

Its purpose is validating whether the Linux, ZFS, and K3s substrate is behaving according to expected operational assumptions before platform services are deployed.

Moving Beyond “Healthy”

A major realization during this phase was that Kubernetes status alone is not a sufficient readiness signal.

A cluster can report:

  • healthy nodes
  • bound PVCs
  • mounted storage

…and still fail under real lifecycle conditions.

The validation layer therefore expanded into capability-based checks rather than configuration checks.

The primary readiness gate became:

python -m infra_validate run --config config/lab.yaml

Validation coverage now includes:

Host / System

  • hostname
  • uptime
  • memory thresholds
  • disk free thresholds
  • required systemd services

ZFS / Storage

  • pool health
  • dataset existence
  • required mounts
  • filesystem type validation
  • expected storage paths

Kubernetes Readiness

  • cluster reachability
  • node readiness
  • namespace validation
  • workload readiness
  • warning-event hygiene
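
To make the shape of these checks concrete, here is a minimal sketch of two of them. It is not the infra_validate implementation: CheckResult and both function names are hypothetical. The pool check assumes its input is the output of `zpool list -H -o name,health`, which prints one pool per line without headers.

```python
import shutil
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def check_zpool_health(zpool_output: str, pool: str) -> CheckResult:
    """Evaluate `zpool list -H -o name,health` output for a single pool."""
    for line in zpool_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == pool:
            # Only ONLINE counts as healthy; DEGRADED, FAULTED, etc. fail.
            return CheckResult(f"zpool:{pool}", fields[1] == "ONLINE", fields[1])
    return CheckResult(f"zpool:{pool}", False, "pool not found")


def check_disk_free(path: str, min_free_fraction: float) -> CheckResult:
    """Fail when the free fraction of the filesystem at `path` is too low."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return CheckResult(
        f"disk:{path}",
        free_fraction >= min_free_fraction,
        f"{free_fraction:.1%} free",
    )
```

Returning a structured result instead of raising lets a runner collect every failure in one pass rather than stopping at the first.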

Stateful Workload Validation

The most important addition was persistence validation.

A dedicated durable storage smoke workflow validates:

  • PVC provisioning
  • pod mount behavior
  • read/write behavior
  • persistence across restart

./scripts/durable_smoke_run.sh

This intentionally validates real workload lifecycle behavior rather than merely checking resource existence.
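
The core of a persistence check like this is small: write a unique token, restart the workload, and confirm the same token is still readable. The sketch below illustrates that logic only. It is not the contents of durable_smoke_run.sh, and the function names are hypothetical. The three callables are placeholders for whatever actually talks to the cluster, and the demo uses a temporary directory as a stand-in for a mounted PVC.

```python
import pathlib
import tempfile
import uuid
from typing import Callable


def persistence_smoke(
    write_token: Callable[[str], object],
    restart_workload: Callable[[], object],
    read_token: Callable[[], str],
) -> bool:
    """Return True when a freshly written token survives a workload restart."""
    token = uuid.uuid4().hex  # unique per run, so stale data cannot pass
    write_token(token)
    restart_workload()
    return read_token() == token


def demo_with_local_directory() -> bool:
    """Stand-in run: a temp directory plays the role of the mounted volume."""
    with tempfile.TemporaryDirectory() as volume:
        marker = pathlib.Path(volume) / "smoke-token.txt"
        return persistence_smoke(
            write_token=lambda token: marker.write_text(token),
            # A real run would delete and recreate the pod here.
            restart_workload=lambda: None,
            read_token=lambda: marker.read_text(),
        )
```

Generating a new token on every run matters: a fixed marker value could pass against data left behind by an earlier run.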

The key shift in thinking was:

readiness should be proven behaviorally—not assumed from configuration state.


data-platform: Deterministic Stateful Services

BLUF

Once the substrate became operationally verifiable, the next step was validating whether stateful platform services could be deployed, rebuilt, and verified repeatably on top of it.

This work lives inside data-platform.

The repository does not create the K3s substrate itself.

Instead, it assumes a validated substrate exists and focuses on deterministic service lifecycle management.

The operational progression became:

deploy
→ validate
→ teardown
→ rebuild
→ verify

Establishing Stateful Service Patterns with MinIO

MinIO became the first fully validated stateful platform service.

More importantly, it established the initial operational patterns later services will reuse.

The deployment flow intentionally separates:

  • operators
  • platform workloads

using layered Helmfile composition.

Deployment sequence:

helmfile -f releases/helmfile.yaml -l layer=operators apply
services/minio/scripts/sync-root-secret.sh
helmfile -f releases/helmfile.yaml -l layer=platform-core apply

This phase also introduced:

  • SOPS-managed secrets
  • runtime Kubernetes secret materialization
  • durable storage contracts
  • repeatable service verification

The durable storage boundary was formalized through:

storageClassName: durable

This directly aligns MinIO persistence with the validated ZFS durability layer established earlier.

Verification Over Assumption

A major operational principle emerged here:

deployment success is not service acceptance.

MinIO verification explicitly validates:

  • tenant readiness
  • bucket existence
  • object CRUD behavior
  • credential export paths
  • optional persistence validation
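
The object CRUD portion of that verification can be sketched as a round trip against any store exposing put, get, delete, and exists. This is an illustration under assumptions rather than the repository's verification script: ObjectStore, verify_object_crud, and InMemoryStore are hypothetical names, and in practice the store would wrap a real S3 client pointed at MinIO.

```python
from typing import Protocol


class ObjectStore(Protocol):
    def put(self, bucket: str, key: str, data: bytes) -> None: ...
    def get(self, bucket: str, key: str) -> bytes: ...
    def delete(self, bucket: str, key: str) -> None: ...
    def exists(self, bucket: str, key: str) -> bool: ...


def verify_object_crud(
    store: ObjectStore, bucket: str, key: str = "verify/crud-probe"
) -> list[str]:
    """Run create, read, update, delete and return a list of failures."""
    failures = []
    store.put(bucket, key, b"crud-probe-v1")
    if store.get(bucket, key) != b"crud-probe-v1":
        failures.append("read-after-write mismatch")
    store.put(bucket, key, b"crud-probe-v2")
    if store.get(bucket, key) != b"crud-probe-v2":
        failures.append("overwrite not visible")
    store.delete(bucket, key)
    if store.exists(bucket, key):
        failures.append("object still present after delete")
    return failures


class InMemoryStore:
    """Dictionary-backed stand-in used to exercise the check locally."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        self._objects[(bucket, key)] = data

    def get(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]

    def delete(self, bucket: str, key: str) -> None:
        self._objects.pop((bucket, key), None)

    def exists(self, bucket: str, key: str) -> bool:
        return (bucket, key) in self._objects
```

An empty result from verify_object_crud(InMemoryStore(), "platform-validation") means every step behaved as expected.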

Baseline buckets are also now established and verified automatically:

  • platform-validation
  • lake-bronze
  • lake-silver
  • lake-gold

This created the first durable storage boundaries for the platform.


Spark: Eliminating Runtime Ambiguity

BLUF

With stateful object storage validated, the next step was establishing deterministic distributed compute behavior through Spark running on Kubernetes against MinIO.

The goal was not simply “getting Spark to run.”

The goal was reducing ambiguity in:

  • runtime selection
  • image versioning
  • TLS trust
  • storage boundaries
  • workload verification
  • rebuild behavior

Why Spark?

Spark exists in the architecture as the bridge between:

  • interactive experimentation
  • distributed compute validation
  • future orchestration workflows

It provides the first scalable compute layer capable of validating:

  • distributed dataframe operations
  • numerical workloads
  • object storage integration
  • future promotion workflows

Deterministic Runtime Behavior

Several subtle issues surfaced during implementation:

  • stale image reuse
  • S3A configuration mismatches
  • TLS trust failures
  • implicit bucket assumptions
  • brittle verification logic

None were catastrophic individually.

Together, they exposed a broader problem:

distributed systems fail through ambiguity far more often than through catastrophic failure.

To reduce drift, Spark workloads now rely on:

  • timestamped versioned images
  • deterministic image selection
  • explicit truststore synchronization
  • platform-owned runtime configuration

Example build flow:

services/spark/scripts/build-versioned-image.sh

Truststore synchronization:

services/spark/scripts/sync-minio-truststore.sh
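
The idea behind timestamped tags and deterministic selection can be sketched in a few lines. This is not what build-versioned-image.sh does internally. The tag layout (base version, a hyphen, then a UTC timestamp) and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumed tag layout: "<base-version>-<UTC timestamp>", e.g. "1.0-20240501T120000Z".
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def make_image_tag(base_version: str, built_at: datetime) -> str:
    """Build a tag from a version and a timezone-aware build time."""
    stamp = built_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{base_version}-{stamp}"


def select_latest(tags: list[str]) -> str:
    """Pick the most recently built tag by parsing its timestamp suffix.

    Raises ValueError on an empty list or a tag without a valid timestamp,
    which is preferable to silently falling back to a floating tag.
    """
    def built_at(tag: str) -> datetime:
        return datetime.strptime(tag.rsplit("-", 1)[1], TIMESTAMP_FORMAT)

    return max(tags, key=built_at)
```

Because every tag names an exact build, a stale cached image cannot be picked up by accident the way it can with a reused tag such as latest.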

Validation Boundaries

Validation outputs are intentionally isolated from future medallion data layers.

Spark validation workloads write only to:

s3a://platform-validation/spark/...

This separation ensures:

  • validation workloads remain disposable
  • medallion layers remain protected
  • operational boundaries stay explicit
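
A boundary like this is easiest to keep when it is enforced in code before a job writes anything. The following is a minimal sketch of such a guard, assuming the bucket and prefix shown above. The function name and constants are hypothetical and not taken from the repository.

```python
import posixpath
from urllib.parse import urlparse

VALIDATION_BUCKET = "platform-validation"
VALIDATION_PREFIX = "spark/"


def is_validation_path(uri: str) -> bool:
    """Return True only for s3a URIs under the validation bucket and prefix."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3a" or parsed.netloc != VALIDATION_BUCKET:
        return False
    # Normalize first so "spark/../other" cannot slip past the prefix check.
    normalized = posixpath.normpath(parsed.path).lstrip("/")
    return normalized.startswith(VALIDATION_PREFIX)
```

A validation workload can call this on its output path and refuse to run when it returns False, which keeps the medallion buckets out of reach by construction.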

The platform also moved beyond smoke tests into deterministic numerical validation workloads involving:

  • matrix operations
  • joins
  • pivots
  • aggregate validation
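
What makes a numerical workload deterministic is that its expected result is known in advance. The sketch below shows the pattern in plain Python, without Spark: build a matrix with no randomness, compute an aggregate, and compare it with a closed-form value. The specific matrix and aggregate are chosen for illustration and are not the workloads used in this lab. In the platform, the computed side would come from a Spark job.

```python
def build_matrix(n: int) -> list[list[int]]:
    """Deterministic n x n matrix with A[i][j] = i * n + j (no randomness)."""
    return [[i * n + j for j in range(n)] for i in range(n)]


def trace_of_gram(matrix: list[list[int]]) -> int:
    """Compute trace(A @ A.T): each diagonal entry is a row dotted with itself."""
    return sum(sum(a * b for a, b in zip(row, row)) for row in matrix)


def expected_trace(n: int) -> int:
    """Closed form: the entries are 0..m-1 with m = n*n, so the trace is
    the sum of squares 0^2 + ... + (m-1)^2 = (m-1) * m * (2m-1) / 6."""
    m = n * n
    return (m - 1) * m * (2 * m - 1) // 6


def validate(n: int) -> bool:
    """True when the computed aggregate matches the known answer."""
    return trace_of_gram(build_matrix(n)) == expected_trace(n)
```

Integer arithmetic keeps the comparison exact. A floating-point workload would need a tolerance instead of strict equality.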

At this point the question began shifting from:

“Does distributed compute execute?”

to:

“Does distributed compute produce correct and repeatable results?”


data-lab: Runtime Contract Convergence

BLUF

Once the infrastructure and platform layers became operationally repeatable, the next challenge was interactive development.

The goal was enabling notebook-driven experimentation without accidentally creating a second platform with duplicated runtime behavior.

This work lives in data-lab.

Interactive Workflows Without Platform Drift

A major design decision during this phase was intentionally not deploying a full notebook platform like JupyterHub.

This remains a constrained single-node environment optimized for operational clarity and deterministic validation—not multi-tenant orchestration complexity.

Instead, the workflow intentionally stays lightweight:

VS Code
  → Remote SSH
    → Ivaldi (Spark driver)
      → K3s executors
        → MinIO

The Spark driver remains colocated with the cluster to avoid:

  • callback routing complexity
  • VPN/firewall fragility
  • TLS inconsistency
  • networking ambiguity

Shared Runtime Contracts

The most important architectural decision in this layer was eliminating runtime duplication.

data-lab consumes runtime truth directly from data-platform, including:

  • Spark shared configuration
  • Spark mode configuration
  • MinIO tenant configuration
  • MinIO NodePort configuration

This creates a single runtime contract shared between:

  • interactive notebook workflows
  • operator-driven Spark workloads

That convergence became one of the strongest architectural outcomes of the phase.

Interactive and operator execution now differ primarily in topology—not runtime truth.
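
A shared contract is only useful if drift from it is detectable. As a rough sketch, assuming both sides can render their effective Spark settings as flat key/value mappings, a comparison like the following would surface any divergence. The function name is hypothetical, and the values in the usage example are invented.

```python
from typing import Optional


def diff_contract(
    platform: dict[str, str], lab: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return keys whose values differ, as (platform_value, lab_value) pairs.

    A key present on only one side is reported with None for the other.
    """
    differences = {}
    for key in sorted(platform.keys() | lab.keys()):
        left, right = platform.get(key), lab.get(key)
        if left != right:
            differences[key] = (left, right)
    return differences
```

For example, with hypothetical endpoint values:

```python
platform_conf = {
    "spark.hadoop.fs.s3a.endpoint": "https://minio.example.internal:9000",
    "spark.hadoop.fs.s3a.path.style.access": "true",
}
lab_conf = {
    "spark.hadoop.fs.s3a.endpoint": "https://localhost:9000",
    "spark.hadoop.fs.s3a.path.style.access": "true",
}
drift = diff_contract(platform_conf, lab_conf)
# drift contains only the endpoint key, with both values side by side
```

An empty result means the notebook session and the operator-driven workload are reading the same runtime truth.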

Notebook Validation Workflows

data-lab also introduced repeatable notebook validation workflows.

Fast verification:

scripts/verify-repo.sh
scripts/verify-notebook-integration.sh
scripts/verify-interactive-spark.sh

Full notebook acceptance validation:

scripts/test-platform-integration-notebooks.sh

These workflows validate:

  • SparkApplication submission
  • interactive Spark behavior
  • MinIO integration
  • notebook execution
  • platform contract alignment

The notebooks themselves intentionally model the promotion path expected later in the platform lifecycle:

interactive exploration
→ deterministic notebook validation
→ promoted Spark workload
→ integrated platform verification
→ future orchestration

The important realization here was:

interactive tooling should accelerate validation—not become a parallel platform.


Next Steps

BLUF

At this point, the platform foundation is intentionally stable enough to stop adding infrastructure complexity and begin exercising the system through a small real-world workflow.

The next phase will focus on implementing a constrained data refinement experiment designed to expose the actual operational experience of working inside the platform before introducing additional architectural layers.

Rather than continuing to expand the stack prematurely, the goal is now to let future platform adjustments emerge from real workload pressure and operational friction.

The focus shifts from:

“Can the platform support distributed systems concepts?”

to:

“What actually becomes painful, ambiguous, or limiting during real usage?”

The planned workflow will intentionally remain small in scope while exercising the existing system end-to-end:

raw data ingestion
→ refinement / transformation
→ validation
→ persisted outputs
→ interactive analysis
→ promoted distributed execution

This phase is expected to surface:

  • workflow friction
  • runtime assumptions
  • storage boundary issues
  • orchestration gaps
  • metadata and lineage needs
  • reproducibility concerns
  • promotion workflow weaknesses

Most importantly, it will allow future architectural decisions to be driven by demonstrated need rather than theoretical completeness.

At this stage, resisting unnecessary complexity is more valuable than adding services.


Closing Thoughts

This phase marked the transition from:

designed architecture

to:

operationally validated architecture.

The most important outcome was not deploying additional services.

It was reducing ambiguity across the system before increasing complexity.

Each layer now has explicit operational boundaries:

Repository        Responsibility
infra-validate    substrate readiness validation
data-platform     deterministic service lifecycle
data-lab          interactive experimentation and workload validation

The system is still intentionally constrained:

  • single-node
  • local-storage-backed
  • operationally simple by design

The goal is not cloud-scale throughput or production HA.

The goal is understanding and validating the operational behavior of modern data platform patterns through explicit contracts, repeatable rebuilds, and deterministic workflows.

At this point, the platform is no longer merely assembled.

It is operationally verifiable.

And that changes the nature of every future layer built on top of it.


Series: Building a Personal Data Lab

→ Next: Data Lab Test Dive (coming next)