Sophie Nguyen

Bioinformatics Specialist & Aspiring Data Engineer

Citi Bike NYC - A Data Engineering Approach

September 27, 2025

Key objectives of the project include:

  1. Ingest historical weather data from the Open-Meteo API
  2. Ingest Citi Bike data from NYC Trip Data
    • Supports both full load and monthly incremental load
  3. Upload data to S3 raw zone
  4. Run PySpark data transformations via AWS Glue:
    • Convert datetime columns to timestamp
    • Filter out trips shorter than 5 minutes
    • Remove trips starting and ending at the same station
  5. Save cleaned data to S3 clean zone as Parquet
  6. Populate AWS Glue Data Catalog using crawlers
  7. Run dbt to build the citibike_facts table by joining trip data with weather
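
The three cleaning rules in step 4 can be sketched in a few lines. This is only an illustration in plain Python, not the project's Glue job, which runs the equivalent logic in PySpark. The column names (started_at, ended_at, start_station_id, end_station_id) and the timestamp format are assumptions modeled on the public Citi Bike CSV files, and clean_trips is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Trips shorter than this are dropped (rule from step 4).
MIN_DURATION = timedelta(minutes=5)


def clean_trips(rows):
    """Apply the three cleaning rules to an iterable of dict rows.

    Column names are assumed, not taken from the project's code.
    """
    cleaned = []
    for row in rows:
        # Rule 1: convert datetime strings to real timestamps.
        started = datetime.fromisoformat(row["started_at"])
        ended = datetime.fromisoformat(row["ended_at"])
        # Rule 2: drop trips shorter than 5 minutes.
        if ended - started < MIN_DURATION:
            continue
        # Rule 3: drop trips that start and end at the same station.
        if row["start_station_id"] == row["end_station_id"]:
            continue
        cleaned.append({**row, "started_at": started, "ended_at": ended})
    return cleaned


# Made-up sample rows, one per rule outcome.
sample = [
    {"ride_id": "a", "started_at": "2025-01-01 08:00:00",
     "ended_at": "2025-01-01 08:12:00",
     "start_station_id": "S1", "end_station_id": "S2"},
    {"ride_id": "b", "started_at": "2025-01-01 09:00:00",
     "ended_at": "2025-01-01 09:03:00",
     "start_station_id": "S1", "end_station_id": "S3"},
    {"ride_id": "c", "started_at": "2025-01-01 10:00:00",
     "ended_at": "2025-01-01 10:30:00",
     "start_station_id": "S4", "end_station_id": "S4"},
]

kept = clean_trips(sample)
print([trip["ride_id"] for trip in kept])
```

Only ride "a" passes all three rules: ride "b" lasts 3 minutes and ride "c" returns to its starting station. In a PySpark job the same conditions would typically be written as column expressions inside a filter, so Spark can apply them across partitions.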

Analysis & Visualization

Installation & Deployment

  1. git clone https://github.com/sophie3101/data_projects.git

  2. cd data_projects/03_nyc_citi_bike

  3. Provision infrastructure:
    • cd terraform
    • terraform init
    • terraform apply -var-file="secret.tfvars"
  4. Start the Airflow scheduler:
    • cd .. # back to the project root directory
    • astro dev init
    • astro dev start
    • Then trigger the pipeline DAG
  5. Generate citibike_facts with dbt:
    • cd dbt_athena
    • dbt init
    • dbt run

📍 More details about the project can be found on GitHub: github_link