# Citi Bike NYC - A Data Engineering Approach
Key objectives of the project include:
- Ingest historical weather data from the Open-Meteo API
- Ingest Citi Bike data from NYC Trip Data
- Support both full loads and monthly incremental loads
- Upload the raw data to the S3 raw zone
- Run PySpark data transformations via AWS Glue:
  - Convert datetime columns to timestamp
  - Filter out trips shorter than 5 minutes
  - Remove trips that start and end at the same station
- Save the cleaned data to the S3 clean zone as Parquet
- Populate AWS Glue Data Catalog using crawlers
- Run dbt to build the `citibike_facts` table by joining trip data with weather data
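The monthly incremental load above implies deriving a month-partitioned S3 raw-zone location per run. A minimal sketch of what such a key builder could look like (the bucket name, prefix layout, and function name are illustrative assumptions, not the project's actual code):

```python
from datetime import date

def raw_zone_key(dataset: str, month: date, bucket: str = "citibike-raw") -> str:
    """Build a hypothetical S3 raw-zone prefix for one month of data.

    The bucket name and year=/month= partition layout are assumptions
    for illustration; the real project may lay out its raw zone differently.
    """
    return f"s3://{bucket}/{dataset}/year={month.year}/month={month.month:02d}/"

# Example: raw-zone prefix for Citi Bike trips ingested for March 2024
print(raw_zone_key("citibike_trips", date(2024, 3, 1)))
# s3://citibike-raw/citibike_trips/year=2024/month=03/
```

A full load would simply iterate this over every month in the historical range, while an incremental run targets only the current month's prefix.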
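The three Glue cleaning rules listed above can be sketched in plain Python for clarity (the real job runs in PySpark; the field names `started_at`, `ended_at`, `start_station_id`, `end_station_id` follow the public Citi Bike CSV schema but are assumptions here):

```python
from datetime import datetime

def clean_trips(trips):
    """Apply the pipeline's three cleaning rules: parse datetime strings
    to timestamps, drop trips shorter than 5 minutes, and drop trips
    that start and end at the same station."""
    cleaned = []
    for t in trips:
        started = datetime.fromisoformat(t["started_at"])
        ended = datetime.fromisoformat(t["ended_at"])
        duration_min = (ended - started).total_seconds() / 60
        if duration_min < 5:
            continue  # filter out trips < 5 minutes
        if t["start_station_id"] == t["end_station_id"]:
            continue  # remove same-station round trips
        cleaned.append({**t, "started_at": started, "ended_at": ended})
    return cleaned

trips = [
    {"started_at": "2024-03-01T08:00:00", "ended_at": "2024-03-01T08:03:00",
     "start_station_id": "A", "end_station_id": "B"},  # too short: dropped
    {"started_at": "2024-03-01T09:00:00", "ended_at": "2024-03-01T09:30:00",
     "start_station_id": "A", "end_station_id": "A"},  # same station: dropped
    {"started_at": "2024-03-01T10:00:00", "ended_at": "2024-03-01T10:20:00",
     "start_station_id": "A", "end_station_id": "C"},  # kept
]
print(len(clean_trips(trips)))  # 1
```

In the actual Glue job the same logic maps onto DataFrame operations (`to_timestamp` plus two `filter` calls) rather than a Python loop.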
## Analysis & Visualization
- Use Athena or Redshift Spectrum to query the curated data
- Visualize insights via Tableau
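An Athena query against the fact table might look like the following sketch. The table name comes from the dbt model above, but the column names (`started_at`, `temperature`) and database name are assumptions about the dbt output, not confirmed by the project:

```python
def monthly_ridership_sql(table: str = "citibike_facts") -> str:
    """Build an example Athena (Presto SQL) query: trip counts per month
    alongside average temperature. Column names are assumptions."""
    return (
        f"SELECT date_trunc('month', started_at) AS ride_month, "
        f"count(*) AS trips, avg(temperature) AS avg_temp "
        f"FROM {table} GROUP BY 1 ORDER BY 1"
    )

print(monthly_ridership_sql())

# Executing it would go through boto3's Athena client, roughly:
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=monthly_ridership_sql(),
#     QueryExecutionContext={"Database": "citibike_db"},        # assumed name
#     ResultConfiguration={"OutputLocation": "s3://<bucket>/athena-results/"},
# )
```

The query results (or a live Athena connection) can then feed the Tableau dashboards mentioned above.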
## Installation & Deployment
- Clone the repository:
  - `git clone https://github.com/sophie3101/data_projects.git`
  - `cd data_projects/03_nyc_citi_bike`
- Provision infrastructure:
  - `cd terraform`
  - `terraform init`
  - `terraform apply -var-file="secret.tfvars"`
- Start the Airflow scheduler:
  - `cd ..`  # back to the project root
  - `astro dev init`
  - `astro dev start`
- Then trigger the pipeline DAG
- To generate `citibike_facts`, run dbt:
  - `cd dbt_athena`
  - `dbt init`
  - `dbt run`
📍 More details of the project can be found at: github_link