David Parnas, On the criteria to be used in decomposing systems into modules, 1971 https://prl.khoury.northeastern.edu/img/p-tr-1971.pdf
Peter Naur, Programming as Theory Building, 1985 https://pablo.rauzy.name/dev/naur1985programming.pdf
| Layer | Order | Description |
|---|---|---|
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be untyped in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer never to work with the original data directly! |
| intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet (see the sketch after this table). Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted. |
| primary | Sequential | |
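A minimal sketch of the raw → intermediate step described above, assuming a hypothetical `orders.csv` under `data/01_raw/` and made-up column names; the raw CSV is mirrored into a typed Parquet file without changing its structure:

```python
import pandas as pd

# Hypothetical raw-layer file -- the untouched source of truth.
raw = pd.read_csv("data/01_raw/orders.csv", dtype=str)

# Simple clean-up only: normalise field names, no structural transformation.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Apply explicit types (column names and types are assumptions for illustration).
typed = raw.astype({"order_id": "int64", "amount": "float64"})
typed["order_date"] = pd.to_datetime(typed["order_date"])

# Mirror the raw layer as typed Parquet in the intermediate layer.
typed.to_parquet("data/02_intermediate/orders.parquet", index=False)
```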
```bash
# setup docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# setup airflow 1.10.14
git clone https://github.com/xnuinside/airflow_in_docker_compose
cd airflow_in_docker_compose
docker-compose -f docker-compose-with-celery-executor.yml up --build
```
| """ | |
| Example of using sub-parser, sub-commands and sub-sub-commands :-) | |
| """ | |
| import argparse | |
| def main(args): | |
| """ | |
| Just do something |
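Assuming the completed sketch above (the `run` sub-command and script name are placeholders), it could be invoked as `python example.py run`; nested sub-sub-commands would be registered by calling `add_subparsers()` again on a sub-command's parser.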
```python
import pandas as pd
import numpy as np

def generate_random_dates(num_dates: int) -> np.ndarray:
    """Generate a 1D array of `num_dates` random dates."""
    start_date = "2020-01-01"
    # Generate all days for 2020
    available_dates = [np.datetime64(start_date) + days for days in range(365)]
    # Get `num_dates` random dates from 2020
    return np.random.choice(available_dates, size=num_dates)
```
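A quick, hypothetical usage example for the helper above, building a small DataFrame from the sampled dates:

```python
dates = generate_random_dates(10)
df = pd.DataFrame({"date": dates})
print(df.head())
```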
```python
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile
import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
with fs.open(f"{bucket}/{key}", "rb") as remote_file:
    archive = io.BytesIO(remote_file.read())  # buffer the whole archive in memory
with tarfile.open(fileobj=archive, mode="r:gz") as tar:
    dfs = [pd.read_csv(tar.extractfile(m)) for m in tar.getmembers() if m.isfile()]
```
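Buffering the archive with `io.BytesIO` in the sketch above reads the whole `.tar.gz` into memory before extraction, which keeps the tarfile handling simple; for very large archives it may be preferable to stream directly from the s3fs file object instead.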