Skip to content

ETL Process

Here is the descrition of extraction, transforming and load the data.

1. APAC Quimioterapia 2022

The ETL of APAC Quimioterapy code is available in:

notebooks/001-rb-ETL.ipynb

1.1 Extract

From pysus I imported files from all states in the year of 2022.

sia.get_files("AQ", year=2022)

1.2 Transform

Converted all Parquet files to a Pandas DataFrame, which allowed me to perform joins with other datasets.

all_data_df_list = [ x.to_dataframe() for x in sia.download(files, local_dir='../data/raw/') ] 
combined_df = pd.concat(all_data_df_list, ignore_index=True)

Rename columns for enhanced clarity using data/external/better_names_apac_quimio.csv (referenced in the Supplementary Tables section).

Subsequently, the DataFrame was merged with external datasets:

  • CID10 - Categorias
  • CID10 - Simplificado (Available in the Supplementary Tables section)
  • Estados
  • Motivo Saída

Additionally, a cns_coded column was added to address non-UTF-8 characters that were not readable by Tableau.

1.3 Load

Exported data to a .csv file.

2. APAC Medicamentos 2022

The ETL of APAC Quimioterapy code is available in:

notebooks/003-rb-APACdrugs-ETL.ipynb

1.1 Extract

From pysus I imported files from all states in the year of 2022.

sia.get_files("AM", year=2022)

1.2 Transform

Converted all Parquet files to a Pandas DataFrame, which allowed me to perform joins with other datasets.

all_data_df_list = [ x.to_dataframe() for x in sia.download(files, local_dir='../data/raw/') ] 
combined_df = pd.concat(all_data_df_list, ignore_index=True)

Rename columns for enhanced clarity using data/external/better_names_apac_am.csv (referenced in the Supplementary Tables section).

Subsequently, the DataFrame was merged with external datasets:

  • CID10 - Categorias
  • Estados
  • Motivo Saída
  • Procedimentos SIGTAP

Additionally, a cns_coded column was added to address non-UTF-8 characters that were not readable by Tableau.

1.3 Load

Exported data to a .csv file.

The resulting dataset for APAC Drugs in 2022 comprised approximately 7 gigabytes of data. In order to accommodate the limitations of the free tier Tableau software (Tableau Public), which has a restriction on the number of rows, records with zero Valor Total APAC were removed.