
AWS Glue ETL Tutorial: Serverless Spark for Data Pipelines

Build serverless ETL jobs with AWS Glue. Learn the Data Catalog, crawlers, Spark and Python shell jobs, partitioning, bookmarks, and how to avoid surprise DPU bills.

By Codeloom · Intermediate

What you'll learn

  • What Glue is, and how the Data Catalog ties everything together
  • How crawlers infer schema and partitions from S3
  • How to author a Spark ETL job that reads, transforms, and writes Parquet
  • How job bookmarks make incremental loads safe
  • Cost pitfalls and production patterns

Prerequisites

  • Comfort with SQL and basic Python; familiarity with S3 and Athena helps

AWS Glue is the serverless data-integration backbone of AWS. It bundles a metadata catalog, schema-inference crawlers, and a managed Spark runtime into one service. You pay per DPU-hour (a DPU, or data processing unit, is 4 vCPUs and 16 GB of memory) only while jobs run, so there are no clusters to keep warm, and the Data Catalog acts as a Hive-compatible metastore for Athena, Redshift Spectrum, and EMR.

What and Why

Three pieces matter.

  • Data Catalog: a central metastore of databases, tables, columns, and partitions. Most AWS analytics services read it.
  • Crawlers: scheduled jobs that scan S3 (or JDBC sources), infer schema, and write tables into the Catalog.
  • Jobs: serverless Spark or Python shell scripts that read sources, transform data, and write outputs.

Why pick Glue? You get Spark without operating it, integrated lineage and bookmarks for incremental processing, and Catalog-driven discovery so a downstream Athena query “just works” after a crawler runs. Why avoid it? Cold-start times of a minute or more, opinionated DynamicFrame APIs that differ from plain Spark, and DPU costs that surprise teams who treat it like Lambda.

Mental Model

Imagine a three-stage assembly line.

  1. Raw files land in S3 under s3://lake/raw/<source>/dt=YYYY-MM-DD/.
  2. A crawler runs once a day, registering new partitions in the Catalog.
  3. A Glue job reads those new partitions (using a bookmark), transforms them, and writes Parquet to s3://lake/curated/<table>/.
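
The dt=YYYY-MM-DD segment in step 1 follows the Hive-style key=value convention, which is what lets a crawler register each day as a partition. As a rough illustration of that convention (not the crawler's actual code), the hypothetical helper parse_partitions below pulls key=value segments out of an object key:

```python
def parse_partitions(key: str) -> dict[str, str]:
    """Extract Hive-style key=value path segments from an S3 object key."""
    partitions = {}
    # The last segment is the file name, so only directory segments are checked.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


example = parse_partitions("raw/events/dt=2026-06-28/part-0001.json")
print(example)  # {'dt': '2026-06-28'}
```

A key with no key=value directories yields an empty dict, which corresponds to an unpartitioned layout.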

The bookmark is what makes step 3 incremental: it records which input has already been processed, so a rerun picks up only new files instead of reprocessing the whole table.
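
To make the idea concrete, here is a toy model of a bookmark in plain Python. It is only a sketch of the concept: the class ToyBookmark and its methods are invented for illustration, and Glue's real bookmark state is stored by the service and is more involved than a set of keys.

```python
class ToyBookmark:
    """Toy stand-in for a job bookmark: remembers which object keys were committed."""

    def __init__(self) -> None:
        self._committed: set[str] = set()
        self._pending: set[str] = set()

    def new_keys(self, listed_keys: list[str]) -> list[str]:
        """Return keys that have not been committed yet and mark them as pending."""
        fresh = sorted(k for k in listed_keys if k not in self._committed)
        self._pending = set(fresh)
        return fresh

    def commit(self) -> None:
        """Persist progress; similar in spirit to job.commit() in a Glue script."""
        self._committed |= self._pending
        self._pending = set()


bookmark = ToyBookmark()
day1 = ["raw/events/dt=2026-06-27/a.json"]
day2 = day1 + ["raw/events/dt=2026-06-28/b.json"]

first_run = bookmark.new_keys(day1)
bookmark.commit()
second_run = bookmark.new_keys(day2)  # only the file that arrived since the last commit
bookmark.commit()
rerun = bookmark.new_keys(day2)       # nothing left to process
print(first_run, second_run, rerun)
```

Note that progress is recorded only on commit: a run that fails before committing sees the same files again on the next attempt, which is the property that makes reruns safe.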

Hands-on Example

A minimal PySpark Glue job that reads JSON through the Catalog, drops fields that are null in every record, and writes partitioned Parquet:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# JOB_NAME is passed in by Glue at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue = GlueContext(sc)
job = Job(glue)
job.init(args["JOB_NAME"], args)  # loads bookmark state for this job

# Read through the Catalog; transformation_ctx is the key the bookmark is stored under.
src = glue.create_dynamic_frame.from_catalog(
    database="lake_raw", table_name="events",
    transformation_ctx="src")

# Drop fields whose value is null in every record.
cleaned = DropNullFields.apply(frame=src)

# Write Parquet, one S3 prefix per dt value.
glue.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://lake/curated/events/",
                        "partitionKeys": ["dt"]},
    format="parquet",
    transformation_ctx="dst")

# Save bookmark state; without this call the next run starts from scratch.
job.commit()

Create the job with the job parameter --job-bookmark-option set to job-bookmark-enable, and schedule it via a Glue trigger or an EventBridge cron rule.
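
As a sketch of what that job definition might look like through boto3, the helper below only builds the keyword arguments for create_job. The job name, role ARN, script path, Glue version, and worker settings are placeholder assumptions, and the API call itself is left as a comment so nothing runs against AWS.

```python
def build_job_definition(name: str, role_arn: str, script_s3_path: str) -> dict:
    """Build keyword arguments for boto3's glue create_job (values are placeholders)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Turns bookmarks on for every run unless overridden at start time.
            "--job-bookmark-option": "job-bookmark-enable",
        },
        "GlueVersion": "4.0",    # assumption: pick the version you have tested
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,    # assumption: small starting point
    }


job_definition = build_job_definition(
    name="events-curation",                                 # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical
    script_s3_path="s3://lake/scripts/events_job.py",       # hypothetical
)

# With AWS credentials configured, the definition could be submitted like this:
#   import boto3
#   boto3.client("glue").create_job(**job_definition)
print(job_definition["DefaultArguments"])
```

Keeping the definition in a plain dict makes it easy to review in code and to reuse across environments by swapping the placeholder values.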

S3 raw/events/dt=2026-06-28/*.json
     |
     v
Glue Crawler  --->  Data Catalog (lake_raw.events)
                          |
                          v
                   Glue ETL Job (Spark)
                     reads via bookmark
                     drops null fields
                          |
                          v
S3 curated/events/dt=2026-06-28/*.parquet
     |
     v
Athena / Redshift Spectrum query directly

Figure: Glue pipeline from raw S3 to curated Parquet

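After the job runs, register the curated output in the Catalog (for example with a second crawler over s3://lake/curated/events/ that writes to a lake_curated database), then query it from Athena: SELECT count(*) FROM lake_curated.events WHERE dt='2026-06-28'. No servers are provisioned at any step.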

Common Pitfalls

  • Crawlers running every hour on huge buckets. Crawler runtime scales with file count. Use exclusion patterns to skip paths the crawler does not need, and for high-frequency partitions use partition projection in Athena instead of crawling.
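  • Bookmarks not taking effect, jobs reprocessing everything. Enabling the job option is not enough: a source read without transformation_ctx is not tracked, and bookmark state is saved only when job.commit() runs. Set transformation_ctx on every source (and on sinks, as in the example above), and keep the job.init() and job.commit() calls.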
  • Tiny output files. Spark writes at least one file per task for each output partition that task holds rows for, so a heavily parallel job can produce thousands of small files. Coalesce or repartition before the write to spare Athena a death by a thousand small files.
  • Mixing DynamicFrames and DataFrames carelessly. The two infer and represent schemas differently. Convert explicitly with toDF() and DynamicFrame.fromDF() when you need exact control over types.
  • Default 10 DPUs. A small job often runs fine on 2 DPUs. Ten DPUs cost five times as much per hour of runtime and may not finish any faster when the data is small.
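
For the small-files pitfall, one simple way to choose a repartition count is to divide the data volume of a partition by a target file size. The sketch below assumes a target of about 128 MiB per file, which is an illustrative default rather than a Glue setting, and target_file_count is a hypothetical helper name.

```python
import math


def target_file_count(partition_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Number of output files that keeps each file near the target size."""
    if partition_bytes <= 0:
        return 1
    return max(1, math.ceil(partition_bytes / target_file_bytes))


# A 3 GiB daily partition at roughly 128 MiB per file:
n = target_file_count(3 * 1024**3)
print(n)  # 24
```

In a Glue script, the result could feed a DataFrame repartition call before the write, for example cleaned.toDF().repartition(n).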

Production Tips

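Pick the right worker type: G.1X (1 DPU per worker) for general ETL, G.2X (2 DPUs per worker) for memory-hungry joins, and Python shell for tiny orchestration scripts that just call APIs (it’s cheap and starts fast). If startup latency is critical, measure it on a current Glue version first, since startup times improved substantially from Glue 2.0 onward, and consider EMR Serverless with pre-initialized capacity if it is still too slow.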
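
A back-of-the-envelope cost estimate helps when choosing worker type and count. The sketch below assumes G.1X is 1 DPU and G.2X is 2 DPUs per worker, a rate of 0.44 USD per DPU-hour, and a 60-second billing minimum. The rate and minimum are assumptions for illustration; check the pricing page for your region and Glue version.

```python
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def estimate_run_cost(
    worker_type: str,
    workers: int,
    runtime_seconds: float,
    rate_per_dpu_hour: float = 0.44,  # assumed list price, varies by region
    minimum_seconds: float = 60,      # assumed billing minimum
) -> float:
    """Estimate the cost in USD of one job run from its DPU-hours."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = DPUS_PER_WORKER[worker_type] * workers * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


small = estimate_run_cost("G.1X", workers=2, runtime_seconds=600)     # 2 DPUs, 10 minutes
default = estimate_run_cost("G.1X", workers=10, runtime_seconds=600)  # 10 DPUs, 10 minutes
print(f"2 DPUs: ${small:.2f}  10 DPUs: ${default:.2f}")
```

If the ten-DPU run does not finish meaningfully faster than the two-DPU run, the extra workers are pure cost.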

Always write Parquet with Snappy compression and a sane partition strategy (date or tenant). Add data-quality checks with Glue Data Quality rules so bad inputs fail the job rather than silently corrupting downstream tables.

Tag jobs with a cost-center tag and set CloudWatch alarms on job duration. A job that suddenly doubles in runtime usually means someone uploaded a massive dump or a join key has become skewed.
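
The "doubled runtime" signal can be expressed as a comparison against a recent baseline. The sketch below shows that comparison in plain Python; it is not a CloudWatch API, the function name runtime_doubled is hypothetical, and the durations are made up for illustration.

```python
from statistics import median


def runtime_doubled(recent_seconds: list[float], latest_seconds: float, factor: float = 2.0) -> bool:
    """True when the latest run took at least `factor` times the recent median."""
    if not recent_seconds:
        return False  # no baseline yet, so nothing to compare against
    return latest_seconds >= factor * median(recent_seconds)


history = [410, 395, 430, 405, 420]   # made-up durations in seconds
print(runtime_doubled(history, 415))  # False
print(runtime_doubled(history, 900))  # True
```

Using the median rather than the mean keeps one earlier outlier from inflating the baseline.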

Wrap-up

Glue gives you Spark, a metastore, and bookmarks without managing clusters. Land raw data in partitioned S3, register it with a crawler, run a bookmarked Glue job to produce curated Parquet, and query with Athena. Mind your DPU choice and file sizes, and you’ll have a tidy serverless lakehouse pipeline.