Skip to content

ETL (Transform)

Extract Transform Load is the process whereby some data is obtained, (extracted) cleaned, wrangled (transformed), and placed into a user-friendly data structure like a data frame (loaded).

Transforming is


Pandas

JSON

Documentation

Often JSON files directly translate to pd.DataFrame but nested JSON usually leave the JSON array in a single column

import json 
import pandas as pd 
from pandas.io.json import json_normalize #package for flattening json in pd

df = pd.DataFrame(json_file)
works_data = json_normalize(data = df['programs'],
                            record_path ='works', 
                            meta =['id', 'orchestra', 'programID', 'season'])

# record_path is the column name you want to "flatten"
# by that it makes JSON keys as a new
# column name and places the value as the row value.
# Also pass the parent metadata we wanted to append

Get only numeric columns from dataframe

df = pd.read_csv("data.csv")
numeric_columns = df.select_dtypes(include=['number'])

JSONL

Documentation

Pandas can work with JSONL as well

import json 
import pandas as pd 

df = pd.read_json(jsonl_file, lines=True)

# using lines=True lets Pandas read each line as a valid JSON file
Additional Info


Pyspark

Add new/rename columns

import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql.dataframe
import datetime

df = df.withColumn('Todays_Date', F.lit(datetime.datetime.now()))

## Lowercase column names
for col in df.columns:
    df = df.withColumnRenamed(col, col.lower())