David Parnas, On the criteria to be used in decomposing systems into modules, 1971 https://prl.khoury.northeastern.edu/img/p-tr-1971.pdf
Peter Naur, Programming as Theory Building, 1985 https://pablo.rauzy.name/dev/naur1985programming.pdf
| Layer | Order | Description |
|---|---|---|
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be untyped in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer never to work with the original data directly! |
| intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet (see the sketch after this table). Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted. |
| primary | Sequential | |
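A minimal sketch of the raw → intermediate step described above, assuming a hypothetical `orders.csv` under `data/01_raw/` and made-up column names; the raw CSV is mirrored into a typed Parquet file without changing its structure:

```python
import pandas as pd

# Hypothetical raw-layer file -- the untouched source of truth.
raw = pd.read_csv("data/01_raw/orders.csv", dtype=str)

# Simple clean-up only: normalise field names, no structural transformation.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Apply explicit types (column names and types are assumptions for illustration).
typed = raw.astype({"order_id": "int64", "amount": "float64"})
typed["order_date"] = pd.to_datetime(typed["order_date"])

# Mirror the raw layer as typed Parquet in the intermediate layer.
typed.to_parquet("data/02_intermediate/orders.parquet", index=False)
```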
```bash
# setup docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# setup airflow 1.10.14
git clone https://github.com/xnuinside/airflow_in_docker_compose
cd airflow_in_docker_compose
docker-compose -f docker-compose-with-celery-executor.yml up --build
```
| """ | |
| Example of using sub-parser, sub-commands and sub-sub-commands :-) | |
| """ | |
| import argparse | |
| def main(args): | |
| """ | |
| Just do something |
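Assuming the completed sketch above (the `run` sub-command and script name are placeholders), it could be invoked as `python example.py run`; nested sub-sub-commands would be registered by calling `add_subparsers()` again on a sub-command's parser.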
```python
import pandas as pd
import numpy as np

def generate_random_dates(num_dates: int) -> np.ndarray:
    """Generate a 1D array of `num_dates` random dates."""
    start_date = "2020-01-01"
    # Generate all days for 2020
    available_dates = [np.datetime64(start_date) + days for days in range(365)]
    # Get `num_dates` random dates from 2020
    return np.random.choice(available_dates, size=num_dates)
```
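A quick, hypothetical usage example for the helper above, building a small DataFrame from the sampled dates:

```python
dates = generate_random_dates(10)
df = pd.DataFrame({"date": dates})
print(df.head())
```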
```python
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile
import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
with fs.open(f"{bucket}/{key}", "rb") as remote_file:
    archive = io.BytesIO(remote_file.read())  # buffer the whole archive in memory
with tarfile.open(fileobj=archive, mode="r:gz") as tar:
    dfs = [pd.read_csv(tar.extractfile(m)) for m in tar.getmembers() if m.isfile()]
```
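Buffering the archive with `io.BytesIO` in the sketch above reads the whole `.tar.gz` into memory before extraction, which keeps the tarfile handling simple; for very large archives it may be preferable to stream directly from the s3fs file object instead.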