TensorFlow Extended (TFX) for MLOps
Why productionizing models demands more than a notebook

If you’ve ever trained a promising model locally and felt a sinking feeling when you tried to move it into production, you’re not alone. Moving from a Jupyter notebook to a live, reliable service involves a lot more than packaging weights. You need data validation, preprocessing that matches training, reproducible pipelines, model versioning, and monitoring for drift. TensorFlow Extended (TFX) is Google’s answer to these challenges, built around a pipeline-first mindset that treats the full lifecycle—from data ingestion to serving—as a connected, automated system.
In this article, I’ll walk through how TFX fits into MLOps, what problems it solves, and how to structure real projects with it. I’ll include code examples and folder structures you can adapt, and I’ll share lessons I’ve learned when using TFX on production pipelines. We’ll also evaluate when TFX is a good fit and when simpler tooling may be better. If you’re a developer building machine learning systems—not just models—this is the scaffolding you need.
Where TFX fits in the modern MLOps landscape
TFX is an end-to-end platform for building and running ML pipelines in production. It’s best viewed as a toolkit for orchestrating the steps involved in training, validating, and deploying models consistently. While many teams start with ad hoc scripts, TFX encourages declarative pipelines where each component has a clear contract: inputs, outputs, and execution semantics.
You’ll typically see TFX used by:
- Teams already invested in TensorFlow that want standardized pipelines and tooling.
- Organizations needing reproducibility across environments, from development to staging and production.
- Data engineers and ML engineers collaborating on automated workflows that require validation and governance.
At a high level, TFX competes with broader workflow orchestrators (like Kubeflow Pipelines or Airflow) and other MLOps frameworks (like MLflow). Where TFX stands out is its tight integration with TensorFlow components and its opinionated building blocks for data validation, schema inference, and transformation. However, TFX also works with non-TensorFlow models via custom components and Apache Beam for scalable data processing, so you don’t have to be 100% TensorFlow to use it.
The closest alternatives:
- Kubeflow Pipelines (KFP): Great for orchestration, especially in Kubernetes environments. More flexible but less opinionated about ML-specific steps.
- MLflow: Strong in experiment tracking, model registry, and serving. Often paired with orchestration tools for full pipelines.
- Airflow/Dagster/Prefect: General-purpose orchestration; excellent for scheduling but require more custom wiring for ML-specific tasks.
TFX’s advantage is an integrated stack: ExampleGen for data ingestion, StatisticsGen and SchemaGen for data understanding, Transform for feature engineering, Trainer for model training, and Evaluator for threshold-based model validation. These components share common interfaces (e.g., artifacts and contexts) that make it easier to automate the entire workflow.
Core concepts and practical patterns
TFX pipelines are built around components that produce and consume artifacts (data, models, metrics). The framework supports different orchestrators, including Apache Beam (for portable, scalable execution) and Kubernetes-based orchestrators like Kubeflow Pipelines.
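Framework details aside, the artifact-passing contract can be sketched in a few lines of plain Python. This is a toy mental model — the class and function names here are illustrative, not real TFX APIs:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for TFX concepts -- not real TFX classes.
@dataclass
class Artifact:
    """A typed, addressable output (examples, schema, model, metrics)."""
    type_name: str
    uri: str
    properties: dict = field(default_factory=dict)

def example_gen(data_root: str) -> Artifact:
    # Ingest raw data and record where the split examples live
    return Artifact(
        "Examples",
        uri=f"{data_root}/examples",
        properties={"splits": ["train", "eval"]},
    )

def statistics_gen(examples: Artifact) -> Artifact:
    # Consumes the upstream artifact by URI, produces a new one
    return Artifact("ExampleStatistics", uri=f"{examples.uri}/stats")

# Each step consumes upstream artifacts and produces new ones; the
# orchestrator wires components by these dependencies, not by call order.
examples = example_gen("/data/churn")
stats = statistics_gen(examples)
print(stats.type_name, stats.uri)
```

The real system adds metadata tracking and caching on top, but the shape is the same: typed artifacts flowing between components.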
A minimal pipeline structure
Conceptually, a TFX pipeline looks like this:
- ExampleGen: Ingests data (CSV, TFRecords, BigQuery, etc.) and splits it (train/eval).
- StatisticsGen: Computes dataset statistics.
- SchemaGen: Infers a schema from the data.
- ExampleValidator: Validates data against the schema (catches anomalies).
- Transform: Performs feature engineering using TensorFlow Transform.
- Trainer: Trains the model using TensorFlow (or custom code).
- Evaluator: Computes model metrics and slices performance (can block deployment if below threshold).
- Pusher: Pushes the model to a serving destination (TF Serving, etc.).
In practice, you’ll also add:
- Tuner for automated hyperparameter tuning.
- InfraValidator to test the model in a sandbox before full deployment.
- BulkInferrer for batch predictions.
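To build intuition for what ExampleValidator does, here is a framework-free sketch of schema-based anomaly detection. It is a toy model of the idea, not the TFDV API — the schema dict and message formats are invented for illustration:

```python
# Toy schema: expected dtype and allowed domain per feature.
SCHEMA = {
    "tenure": {"dtype": int, "min": 0},
    "subscription_type": {"dtype": str, "domain": {"basic", "premium"}},
}

def validate(rows):
    """Return human-readable anomalies, like ExampleValidator's output."""
    anomalies = []
    for i, row in enumerate(rows):
        for name, spec in SCHEMA.items():
            if name not in row:
                anomalies.append(f"row {i}: missing feature '{name}'")
                continue
            value = row[name]
            if not isinstance(value, spec["dtype"]):
                anomalies.append(
                    f"row {i}: '{name}' has wrong type {type(value).__name__}"
                )
                continue
            if "domain" in spec and value not in spec["domain"]:
                anomalies.append(f"row {i}: unseen category '{value}' for '{name}'")
            if "min" in spec and value < spec["min"]:
                anomalies.append(f"row {i}: '{name}' below minimum")
    return anomalies

rows = [
    {"tenure": 12, "subscription_type": "basic"},
    {"tenure": -3, "subscription_type": "enterprise"},  # two anomalies
]
print(validate(rows))
```

The real component works from computed statistics rather than row-by-row checks, but the outcome is the same: anomalies surface before training, not after.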
Example project structure
Here’s a typical folder layout for a TFX project. It separates pipeline definitions, component code, and environment configuration.
tfx_project/
├── pipeline/
│ ├── __init__.py
│ ├── pipeline.py # pipeline definition and orchestrator config
│ ├── configs.py # environment-specific config (GCP, local Beam)
│ └── pipeline_root.py # centralized path management
├── components/
│ ├── __init__.py
│ ├── trainer.py # Trainer component logic
│ ├── transform.py # Transform component logic
│ └── custom_components.py # any custom TFX components
├── data/
│ └── schema/ # baseline schema (optional, often auto-generated)
├── schemas/
│ └── default_schema.pbtxt # published schema for validation
├── models/
│ └── serving/ # SavedModel artifacts (auto-managed by Pusher)
├── scripts/
│ └── run_pipeline.py # CLI entry point to run pipelines
├── tests/
│ └── test_trainer.py # unit tests for component logic
├── pyproject.toml # project dependencies
└── README.md # project overview
Configuration and runtime setup
TFX pipelines can run locally with Apache Beam or in orchestrated environments like GCP Vertex AI or on-prem Kubernetes. The configuration often controls which orchestrator, data sources, and serving targets are used.
Below is a pyproject.toml showing typical dependencies. It includes TFX, Apache Beam (for local portability), and optional cloud integrations.
[project]
name = "tfx_mlops_project"
version = "0.1.0"
description = "TFX pipeline for production MLOps"
requires-python = ">=3.9"
dependencies = [
    "tensorflow>=2.12",
    "tfx>=1.14",
    "apache-beam[gcp]>=2.48",         # Beam for portable pipeline execution
    "tensorflow-transform>=1.14",
    "tensorflow-serving-api>=2.12",
    "google-cloud-aiplatform>=1.36",  # Vertex AI integration (optional)
    "pyarrow>=11.0",                  # for Parquet/Arrow data formats
    "pandas>=2.0",                    # data exploration and debugging
]
For pipeline execution, you’ll often have a script that builds and runs the pipeline. The following run_pipeline.py shows a local Beam runner and a minimal config.
# scripts/run_pipeline.py
import os

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

from pipeline.configs import PIPELINE_NAME, DATA_ROOT, PIPELINE_ROOT
from pipeline.pipeline import create_pipeline


def run_local():
    # Local pipeline root: where artifacts and metadata are stored
    os.makedirs(PIPELINE_ROOT, exist_ok=True)
    pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        data_root=DATA_ROOT,
        pipeline_root=PIPELINE_ROOT,
    )
    # Execute using Apache Beam (local)
    BeamDagRunner().run(pipeline)
    print(f"Pipeline finished. Artifact root: {PIPELINE_ROOT}")


if __name__ == "__main__":
    run_local()
Defining the pipeline (with Transform and Trainer)
The pipeline definition wires components together via artifacts. Transform uses TensorFlow Transform to create features consistent between training and serving. Trainer uses the transformed examples to train a model.
Below is a simplified pipeline.py showing the flow and a configuration file configs.py with key paths.
# pipeline/configs.py
import os
PIPELINE_NAME = "churn_prediction"
DATA_ROOT = os.path.join(os.path.dirname(__file__), "..", "data")
PIPELINE_ROOT = os.path.join(os.path.dirname(__file__), "..", "pipeline_root")
METADATA_PATH = os.path.join(PIPELINE_ROOT, "metadata.db")
ENABLE_CACHE = True
# pipeline/pipeline.py
import os

import tensorflow_model_analysis as tfma
from tfx import v1 as tfx

from pipeline.configs import ENABLE_CACHE


def create_pipeline(pipeline_name, data_root, pipeline_root):
    # 1. Ingest examples and split train/eval
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # 2. Generate statistics and schema
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"]
    )
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"],
        infer_feature_shape=False,
    )

    # 3. Validate examples against schema
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs["statistics"],
        schema=schema_gen.outputs["schema"],
    )

    # 4. Transform features consistently for training/serving.
    # Transform takes a module file path; that file must define `preprocessing_fn`.
    transform = tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        module_file=os.path.abspath("components/transform.py"),
    )

    # 5. Train model. The module file must define a `run_fn` entry point.
    trainer = tfx.components.Trainer(
        examples=transform.outputs["transformed_examples"],
        transform_graph=transform.outputs["transform_graph"],
        schema=schema_gen.outputs["schema"],
        train_args=tfx.proto.TrainArgs(num_steps=2000),
        eval_args=tfx.proto.EvalArgs(num_steps=500),
        module_file=os.path.abspath("components/trainer.py"),
    )

    # 6. Evaluate model quality. Eval configs come from
    # TensorFlow Model Analysis (tfma), not tfx.proto.
    evaluator = tfx.components.Evaluator(
        examples=example_gen.outputs["examples"],
        model=trainer.outputs["model"],
        eval_config=tfma.EvalConfig(
            model_specs=[tfma.ModelSpec(label_key="churn")],
            slicing_specs=[
                tfma.SlicingSpec(),  # overall
                tfma.SlicingSpec(feature_keys=["subscription_type"]),
                tfma.SlicingSpec(feature_keys=["region"]),
            ],
            metrics_specs=[
                tfma.MetricsSpec(
                    metrics=[
                        # A value threshold makes the "blessing" meaningful:
                        # without one, the model is always blessed.
                        tfma.MetricConfig(
                            class_name="AUC",
                            threshold=tfma.MetricThreshold(
                                value_threshold=tfma.GenericValueThreshold(
                                    lower_bound={"value": 0.7}
                                )
                            ),
                        ),
                        tfma.MetricConfig(class_name="ExampleCount"),
                    ]
                )
            ],
        ),
    )

    # 7. Push model only if the Evaluator blesses it
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        model_blessing=evaluator.outputs["blessing"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=os.path.join(pipeline_root, "serving_models")
            )
        ),
    )

    components = [
        example_gen,
        statistics_gen,
        schema_gen,
        example_validator,
        transform,
        trainer,
        evaluator,
        pusher,
    ]

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
            os.path.join(pipeline_root, "metadata.db")
        ),
    )
Feature engineering with TensorFlow Transform
The key advantage of Transform is that it encodes feature logic once and executes it during training and serving. The preprocessing_fn runs in two contexts: when materializing transformed training data and when serving via TF Serving (via the transform graph).
# components/transform.py
import tensorflow as tf
import tensorflow_transform as tft


# Example feature engineering for tabular churn data
def preprocessing_fn(inputs):
    """Preprocessing function used by TFX Transform."""
    outputs = {}

    # Example: normalize numeric columns to zero mean / unit variance.
    # tft.scale_to_z_score computes mean and variance over the full
    # dataset in the analyze phase, then applies them in the transform phase.
    for col in ["tenure", "monthly_spend"]:
        if col in inputs:
            x = tf.cast(inputs[col], tf.float32)
            outputs[f"{col}_norm"] = tft.scale_to_z_score(x)

    # Example: bucketize numeric into categorical with fixed boundaries
    if "tenure" in inputs:
        tenure = tf.cast(inputs["tenure"], tf.float32)
        # Buckets: 0-6, 6-12, 12-24, 24+
        outputs["tenure_bucket"] = tft.apply_buckets(
            tenure, bucket_boundaries=tf.constant([[6.0, 12.0, 24.0]])
        )

    # Example: map a string categorical to integer vocabulary IDs
    if "subscription_type" in inputs:
        sub = tf.strings.strip(inputs["subscription_type"])
        # tft.compute_and_apply_vocabulary returns integer IDs
        outputs["sub_type_id"] = tft.compute_and_apply_vocabulary(sub)

    # Labels and pass-through columns
    if "churn" in inputs:
        outputs["churn"] = tf.cast(inputs["churn"], tf.int64)

    return outputs
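The core idea behind analyzers like compute_and_apply_vocabulary is a full-pass "analyze" phase whose results are frozen into the transform graph, then reused verbatim at serving. A framework-free sketch of that two-phase contract (illustrative only, not the TFT implementation):

```python
def analyze_vocabulary(values):
    """Analyze phase: one full pass over training data assigns stable IDs."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab

def apply_vocabulary(vocab, value, oov_id=-1):
    """Transform phase: reuse the frozen vocab; unseen values map to OOV."""
    return vocab.get(value, oov_id)

train = ["basic", "premium", "basic", "family"]
vocab = analyze_vocabulary(train)  # computed once, at training time

# Serving reuses the exact same mapping, so IDs cannot drift between phases.
assert apply_vocabulary(vocab, "premium") == 1
assert apply_vocabulary(vocab, "enterprise") == -1  # unseen -> OOV
```

Because the vocabulary (or mean, or bucket boundaries) is an artifact of the analyze phase, training and serving cannot disagree about it — which is exactly the skew class Transform eliminates.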
Training logic with error handling and metrics
Trainer references a module_file that defines a run_fn entry point. TFX calls run_fn with an FnArgs object carrying the training and eval data locations, the Transform output, and the serving model directory. Below is a simple but realistic training module with error handling, a baseline Keras model, and training callbacks.
# components/trainer.py
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow import keras
from tfx import v1 as tfx
from tfx_bsl.public import tfxio

_LABEL_KEY = "churn"
_BATCH_SIZE = 64


def _input_fn(file_pattern, data_accessor, tf_transform_output, batch_size):
    """Builds a batched dataset of (transformed features, label)."""
    return data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size=batch_size, label_key=_LABEL_KEY),
        tf_transform_output.transformed_metadata.schema,
    ).repeat()


def _build_model(tf_transform_output):
    # Simple MLP suitable for tabular data
    feature_spec = tf_transform_output.transformed_feature_spec().copy()
    feature_spec.pop(_LABEL_KEY)
    inputs = {
        name: keras.Input(shape=(1,), name=name, dtype=spec.dtype)
        for name, spec in feature_spec.items()
    }
    # This is a simplified demo; in practice, embed integer IDs instead of
    # casting them, and handle sparse features explicitly.
    x = keras.layers.Concatenate()(
        [tf.cast(t, tf.float32) for t in inputs.values()]
    )
    x = keras.layers.Dense(64, activation="relu")(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation="relu")(x)
    output = keras.layers.Dense(1, activation="sigmoid", name="churn")(x)
    model = keras.Model(inputs=inputs, outputs=output)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss=keras.losses.BinaryCrossentropy(),
        metrics=[
            keras.metrics.AUC(name="auc"),
            keras.metrics.BinaryAccuracy(name="accuracy"),
        ],
    )
    return model


def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Builds a serving function that accepts serialized tf.Examples."""
    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        feature_spec = tf_transform_output.raw_feature_spec()
        feature_spec.pop(_LABEL_KEY, None)  # the label is absent at serving time
        parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
        # Apply the same Transform graph used at training time
        transformed_features = model.tft_layer(parsed_features)
        return model(transformed_features)

    return serve_tf_examples_fn


def run_fn(fn_args: tfx.components.FnArgs):
    """Entry point called by the TFX Trainer component."""
    try:
        tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)

        train_dataset = _input_fn(
            fn_args.train_files, fn_args.data_accessor, tf_transform_output, _BATCH_SIZE
        )
        eval_dataset = _input_fn(
            fn_args.eval_files, fn_args.data_accessor, tf_transform_output, _BATCH_SIZE
        )

        model = _build_model(tf_transform_output)

        # Add callbacks for robust training
        callbacks = [
            keras.callbacks.EarlyStopping(
                monitor="val_auc",
                mode="max",
                patience=5,
                restore_best_weights=True,
            ),
            keras.callbacks.ReduceLROnPlateau(
                monitor="val_auc",
                factor=0.5,
                patience=3,
                min_lr=1e-6,
            ),
        ]

        model.fit(
            train_dataset,
            steps_per_epoch=fn_args.train_steps,
            validation_data=eval_dataset,
            validation_steps=fn_args.eval_steps,
            epochs=10,
            callbacks=callbacks,
        )

        # Export SavedModel with the transform graph for consistent serving
        signatures = {
            "serving_default": _get_serve_tf_examples_fn(
                model, tf_transform_output
            ).get_concrete_function(
                tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")
            )
        }
        model.save(fn_args.serving_model_dir, save_format="tf", signatures=signatures)
    except Exception as e:
        # In production, consider structured logging and alerting
        raise RuntimeError(f"Trainer failed with error: {e}") from e
Evaluator for canary and multi-model comparisons
Evaluator can compare two models, or compare slices of data. It produces a “blessing” artifact that Pusher uses to decide whether to deploy. This allows rule-based gating (e.g., AUC > 0.85 on the “subscription_type” slice) or canary comparisons.
# Example: model comparison (champion vs. challenger).
# Assumes `import tensorflow_model_analysis as tfma` at the top of pipeline.py.
# A Resolver node supplies the latest blessed model as the baseline.
model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(type=tfx.types.standard_artifacts.ModelBlessing),
).with_id("latest_blessed_model_resolver")

evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],                  # challenger
    baseline_model=model_resolver.outputs["model"],  # champion from a prior run
    eval_config=tfma.EvalConfig(
        model_specs=[tfma.ModelSpec(label_key="churn")],
        metrics_specs=[
            tfma.MetricsSpec(
                metrics=[
                    tfma.MetricConfig(
                        class_name="AUC",
                        # Bless only if the challenger does not regress
                        # against the champion (small tolerance).
                        threshold=tfma.MetricThreshold(
                            change_threshold=tfma.GenericChangeThreshold(
                                direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                                absolute={"value": -1e-3},
                            )
                        ),
                    ),
                ]
            )
        ],
        slicing_specs=[tfma.SlicingSpec()],
    ),
)
Honest evaluation: strengths, weaknesses, and tradeoffs
Where TFX shines
- Reproducibility: TFX components codify data splits, preprocessing, and training in a pipeline. You can rerun the same pipeline and get identical artifacts if caching is enabled and inputs are deterministic.
- Data quality: Schema generation and validation catch dataset anomalies early. In practice, this prevents “silent” training on corrupted data.
- Consistent serving: Transform ensures feature logic matches training and serving, reducing skew bugs.
- Production readiness: Components like Evaluator and InfraValidator give you gates before deployment, enabling policy-based releases.
Where it may not be the best fit
- High overhead for small projects: If you’re building one-off models, a full TFX pipeline can feel heavy. MLflow or a simple script with DVC may be enough.
- Learning curve: Understanding artifacts, metadata, and orchestrator configuration takes time, especially if you’re new to pipelines.
- TensorFlow-first bias: While custom components exist and TFX can work with other frameworks, the path of least resistance is TensorFlow/Keras. For PyTorch-heavy teams, KFP or plain orchestration with custom containers may be simpler.
Tradeoffs to consider
- Orchestrator choice: Beam is portable (local, Dataflow, Flink), but might be overkill for small datasets. Kubeflow Pipelines on Kubernetes scales better for larger teams but adds infra complexity.
- Schema management: Auto-inferred schemas are convenient but may miss domain-specific rules. Maintain baseline schemas and update them carefully.
- Artifact storage: For cloud deployments, consider GCS or S3 for artifact roots; local filesystems don’t scale in multi-user environments.
Personal experience: what actually trips people up
I’ve built TFX pipelines where the first run felt like a triumph, and the second run surfaced hidden assumptions. The most common pitfalls I’ve seen:
- Feature skew: Teams tweak preprocessing in training but forget to update Transform. TFX helps, but only if you keep the pipeline canonical. If you bypass Transform and hand-engineer features in Trainer, you lose consistency.
- Schema drift: A new categorical value appears in production data. ExampleValidator catches it, but only if the schema is up to date. Keep schema change management deliberate.
- Caching surprises: Caching speeds up reruns, but it can mask issues when upstream data changes. Disable caching intentionally when you expect a different dataset.
- Orchestrator differences: Local Beam runs fast, but cloud runs behave differently (permissions, network, data paths). Test pipeline staging on the target orchestrator early.
- SavedModel signatures: Serving fails if the model’s signature doesn’t match expectations. Always validate serving signatures with a small sample before deployment.
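The feature-skew pitfall is easy to demonstrate: if serving recomputes normalization statistics instead of reusing the training-time values, identical raw inputs produce different features. A toy illustration in plain Python:

```python
def zscore(x, mean, std):
    return (x - mean) / std

train = [10.0, 20.0, 30.0, 40.0]
train_mean = sum(train) / len(train)
train_std = (sum((v - train_mean) ** 2 for v in train) / len(train)) ** 0.5

# Correct: serving reuses the frozen training statistics (what Transform does)
consistent = zscore(30.0, train_mean, train_std)

# Skewed: serving recomputes stats over a different (production) window
serve_window = [100.0, 110.0, 120.0]
serve_mean = sum(serve_window) / len(serve_window)
serve_std = (sum((v - serve_mean) ** 2 for v in serve_window) / len(serve_window)) ** 0.5
skewed = zscore(30.0, serve_mean, serve_std)

# Same raw value, wildly different model inputs
print(round(consistent, 3), round(skewed, 3))
assert consistent != skewed
```

This is exactly the bug class that hand-engineering features inside Trainer reintroduces, and that keeping all feature logic in Transform prevents.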
The moment TFX proved especially valuable was during a churn model migration. A subtle bug in one-hot encoding produced inconsistent IDs between training and serving. With Transform in place, the bug was obvious: the vocabulary generation in training matched serving, and the schema flagged a new category in staging. Evaluator blocked deployment until we updated the schema and retrained. That single control saved us from a noisy incident.
Getting started: workflow and mental models
Start with a mental model: pipeline as code. Each component is a well-defined step that reads inputs, produces outputs, and logs metadata. Your job is to design artifacts, keep transformations declarative, and set guardrails (validation, evaluation, blessed deployments).
1. Project workflow
- Define your data source and split strategy (ExampleGen).
- Infer and publish a baseline schema (SchemaGen + ExampleValidator).
- Centralize feature logic in Transform (preprocessing_fn).
- Write Trainer that consumes transformed examples and exports SavedModel with signatures.
- Gate deployments with Evaluator thresholds.
- Push to a serving destination only if blessed.
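The gating in the last two steps boils down to a simple decision rule, which Evaluator encodes via value and change thresholds. A toy restatement in plain Python (the function name and default thresholds are invented for illustration, not TFMA code):

```python
def should_push(candidate_auc, baseline_auc=None, min_auc=0.85, max_regression=0.001):
    """Gate a deployment the way Evaluator + Pusher do together."""
    if candidate_auc < min_auc:
        return False  # fails the absolute (value) threshold
    if baseline_auc is not None and candidate_auc < baseline_auc - max_regression:
        return False  # regresses against the blessed champion (change threshold)
    return True       # blessed: Pusher may deploy

assert should_push(0.90) is True
assert should_push(0.80) is False                      # below the floor
assert should_push(0.90, baseline_auc=0.95) is False   # worse than champion
```

In TFX the decision is recorded as a "blessing" artifact, so the deploy gate is auditable rather than buried in a script.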
2. Folder structure recap
tfx_project/
├── pipeline/
│ ├── pipeline.py # DSL wiring
│ ├── configs.py # paths, orchestrator flags
├── components/
│ ├── transform.py # feature logic (TFT)
│ ├── trainer.py # model training
├── schemas/
│ └── default_schema.pbtxt # baseline schema
├── scripts/
│ └── run_pipeline.py # runner entry point
├── pyproject.toml # dependencies
└── README.md # setup and run instructions
3. Running the pipeline (mental model)
- Local Beam: Great for dev and testing; uses local metadata and artifact store.
- Cloud orchestrator: For production; requires artifact storage (GCS/S3), metadata store (e.g., Cloud SQL), and possibly secret management.
- Hybrid: Develop locally with small data; run full-scale pipelines on the target orchestrator.
4. Example: minimal command flow (conceptual)
- Install dependencies listed in pyproject.toml.
- Prepare a data folder with CSV files (with a “churn” label column).
- Run python scripts/run_pipeline.py for a local Beam run.
- Inspect artifacts under pipeline_root (statistics, schema, models).
- If Evaluator blesses, the Pusher component writes the model to pipeline_root/serving_models.
Why TFX stands out: distinguishing features
- First-class feature engineering: TensorFlow Transform ensures that feature logic is identical at training and serving time. This is a major reliability win.
- Data validation built-in: Schema inference and example validation prevent entire classes of bugs.
- Reusable components: TFX’s standardized interfaces make it easier to collaborate across data, ML, and platform teams.
- Strong cloud integration: GCP Vertex AI and Dataflow integration streamlines scaling; Kubeflow Pipelines on Kubernetes offers flexibility for hybrid stacks.
- Artifact-centric workflow: The artifact/metadata model encourages reproducibility and traceability, which pays dividends during audits and postmortems.
Developer experience is solid once you climb the learning curve. The pipeline-first approach nudges you toward clean boundaries: data prep in Transform, training in Trainer, and gating in Evaluator. This separation tends to improve maintainability compared to monoliths where everything happens in a single script.
Free learning resources
- TFX official guide: https://www.tensorflow.org/tfx - Start here for component overviews and architecture. It includes concise explanations and examples.
- TFX tutorials (TensorFlow Examples): https://github.com/tensorflow/tfx - Official repository with sample pipelines and notebooks. Useful for seeing real configurations and wiring.
- Apache Beam documentation: https://beam.apache.org/documentation/ - Essential for understanding how TFX pipelines execute on Beam, including runners and windowing concepts.
- Kubeflow Pipelines docs: https://www.kubeflow.org/docs/components/pipelines/ - If you plan to deploy TFX on Kubernetes/KFP, this provides context on containerized runs.
- Google Cloud Vertex AI Pipelines: https://cloud.google.com/vertex-ai/docs/pipelines - Helpful for teams targeting GCP with managed orchestration and artifact storage.
These resources are practical and trustworthy. The TFX repo, in particular, offers real pipeline examples you can adapt to your data and domain.
Summary: who should use TFX, and who might skip it
Use TFX if:
- You’re building production ML systems that require consistent preprocessing, validation, and deployment gates.
- Your team uses TensorFlow or is comfortable adopting it for the sake of a standardized pipeline framework.
- You need reproducible workflows with clear lineage from data to model to serving.
- You plan to scale from local experimentation to orchestrated runs (Beam, KFP, or cloud services).
Consider skipping or delaying TFX if:
- Your workload is a small, one-off model where the overhead outweighs the benefits.
- You’re primarily PyTorch-centric and prefer orchestration tools with fewer TensorFlow dependencies (KFP or MLflow may be simpler).
- Your organization doesn’t have the infra to maintain metadata stores and artifact storage for multi-user pipelines.
The core takeaway is pragmatic: TFX is best when your bottleneck is reliability, reproducibility, and deployment safety, not when your bottleneck is “get a model running once.” It codifies the right habits—validate your data, centralize your feature logic, and gate your releases—and provides the plumbing to automate them. That’s what turns promising notebooks into dependable services.