ETL (Load)
Flat Files
Python Print Statement
Give print a file keyword argument whose value is a file stream; a file stream can be created with the open function.
# use file=open() to direct the output to a file
# mode "a" appends to the file, creating it if it doesn't exist
print("Hello World!", file=open("output.txt", "a"))

some_text = "This is a string of something I want to save"
print(some_text, file=open("output.txt", "a"))
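Passing open() directly to print leaves closing the file handle to the garbage collector. A minimal sketch of the same append using a with block, so the file is closed deterministically (same output.txt path as above):

```python
# open the file once; it is closed automatically when the block exits
with open("output.txt", "a") as f:
    print("Hello World!", file=f)
    print("This is a string of something I want to save", file=f)
```

This also avoids reopening the file on every print call when writing many lines.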
Opening the file in "a" (append) mode means each print call adds its output on a new line, since print ends with a line break by default.
This is exactly how JSONL documents are formatted: each line is a valid JSON document. So running the following will build a valid JSONL file.
import json

some_dict = {'Name': "Alice"}
print(json.dumps(some_dict), file=open("logs.jsonl", "a"))
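Reading a JSONL file back is the mirror image: parse each line with json.loads. A small self-contained sketch, assuming the same logs.jsonl file name (the second record is added just for illustration):

```python
import json

# append two records, one JSON document per line
with open("logs.jsonl", "a") as f:
    print(json.dumps({"Name": "Alice"}), file=f)
    print(json.dumps({"Name": "Bob"}), file=f)

# read the file back: each line parses independently
with open("logs.jsonl") as f:
    records = [json.loads(line) for line in f]
```

Because every line stands alone, a truncated or partially written file still yields all complete records up to the damage point.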
PySpark
Write to a Delta Table
import pyspark.sql.functions as F
import pyspark.sql.types as T

# read table
df = spark.read.table("database.table_name")

## Do transformations

# write it back
(
    df.write
    .format('delta')
    .mode('append')  # or 'overwrite'
    .option('overwriteSchema', 'true')  # only takes effect in overwrite mode
    .saveAsTable('database.table_name')
)
# Create the table if it doesn't exist and point it at the Delta files (Parquet)
spark.sql("CREATE TABLE IF NOT EXISTS database.table_name USING DELTA LOCATION 'path/to/delta/tables/'")