Conventions and developement guidelines

Committing

sqlfmt

We use sqlfmt as an SQL formatter. Wheither we love it or not, we have to use it, so everybody code will look the same, making collaboration easier. Some linting rules might be controversial, but at least they are consistent and explicit.

Before committing, make sure to run the following command

sqlfmt .

... You might wan't to rerun a dbt build after formatting everything. sqlfmt schould'nt break anything, but we are better safe than sorry.

Commit message

Commit messages use the Commitizen convention.

Please, make sure all of your commit messages start with a type. The following types are available :

  • feat : A new feature
  • fix : A bug fix
  • docs : Documentation only changes
  • style : Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
  • refactor : A code change that neither fixes a bug nor adds a feature
  • perf : A code change that improves performance
  • test : Adding missing or correcting existing tests
  • chore : Changes to the build process or auxiliary tools and libraries such as documentation generation
  • revert : Reverts a previous commit

Folders structure and convention

  • All the SQL / Python code live in the models folder.
  • All dashboards and reports live in the reporting folder.

About the models folder

The models folder is organized as follow :

.
โ””โ”€โ”€ models/
    โ”œโ”€โ”€ interfaces/
    โ”‚   โ””โ”€โ”€ database_spame/
    โ”‚       โ””โ”€โ”€ mart_foobar
    โ”œโ”€โ”€ marts/
    โ”‚   โ””โ”€โ”€ mart_foobar
    โ””โ”€โ”€ dashboards/
        โ”œโ”€โ”€ spam/
        โ”‚   โ”œโ”€โ”€ features/
        โ”‚   โ”‚   โ””โ”€โ”€ fact_absences.sql
        โ”‚   โ””โ”€โ”€ pbi_tables/
        โ”‚      โ””โ”€โ”€ fact_absences.sql
        โ””โ”€โ”€ egg/
            ...

Where :

  • dashboards Each subfolder of models/dashboards should be named after the corresponding dashboard it's belongs to (one dashboard, one folder containing it's SQL code).
  • interfaces contains mapping to the interfaces tables. Each interface table can be overrided to add custom connection logic to the undelrying database..
  • marts contains the ... marts. A mart is a collection of tables reused/shared accross between dashboards.

Integration test and the nightly build

The nightly build is an automated check on the repo happening at the end (the night ^^) of each day

When introducting a new NON-OPTIONAL seed in the core.data.store repo, you must add it into core.data.store/nightly/dbt/seeds folder, so the next Nightly build won't fail because of a missing seed. The seed must be populated with data from CSSVDC, as the CSSVDC is used as a target database for the integration tests. You can also run the integration suit locally. Please, refer to the core.data.store/nightly/README.md file for more details.

Marts

educ_serv

This mart gather all the data related to the education service.

Populations

Populations are sets of students used as a filter by various dashboards. you can refer to analyses/marts/educ_serv/staging/population folder and use the population template to build/define your populations.

The following populations are mandatory (cf adatpers) and schould be defined :

  • stg_ele_prescolaire
  • stg_ele_primaire_reg
  • stg_ele_primaire_adapt
  • stg_ele_secondaire_reg
  • stg_ele_secondaire_adapt

The integrator can add new populations by overrding the custom_fgj_populations.sql model. To do so :

  1. Create a new file in cssXX.data.store/models/marts/educ_serv/staging/populations named custom_fgj_populations.sql
  2. Your custom_fgj_populations model schould be implemeted as a union of your own custom populations.
  3. Disable the core's placeholder in the cssXX.data.store:
# cssXX.data.store/dbt_project.yml
models:
  core_dashboards_store:
    marts:
      educ_serv:
        staging:
          populations:
            custom_fgj_populations:
              +enabled: false

Developers : when creating a new dashboard using the population mechanism, you must register it's tag in the marts/educ_serv/adapters.yml file, for it trigger the population computation.

human resources

This mart gather all the data related to the human resources departement

Populating the marts seed

This dashboard requiers the specification of the seeds in the human_resources mart.

The seed must be populated in cssXX.data.store/seeds/marts/human_resources/ and as per the definition of the core.data.store/seeds/marts/human_resources/schema.yml mart.

Please refer to the core.data.store/seeds/marts/human_resources/schema.yml mart documentation to get the concrete implementation.

Do not forget to refresh your seeds with the dbt seeds --select tag:human_resources --full-refresh command.

Exposing the freshness of the data into the dashboard

The core provide a mechanism to expose the freshness of the data into the dashboard. This mechanism is call the stamper and can be enabled and used through macros.

Enabling the stamper

Must be done ONCE in your cssXX.data.store/dbt_project.yaml.

The stamper is a table collecting metadata about your ETL's run. To enable the data collection, you first enable it in your dbt_project.yml by adding the two following hooks :

# cssXX.data.store/dbt_project.yml
# Hooks
on-run-start:
  - "{{ core_dashboards_store.init_metadata_table() }}"
on-run-end:
  - "{{ core_dashboards_store.purge_metadata_table() }}"

stamping my new dashboard

A good practice is to only stamp the reporting tables.

Only successfull run will be stamped. Meaning that taking the MIN(run_ended_at) will give you the last time your ETL run has been sucessfull. This is the worstcase scenario freshness

To add a stamp to your dashboard you can either :

  • Add the following post_hook into your model :
# model.sql
{{ config(
    post_hook='{{ core_dashboards_store.stamp_model("my_dashboard") }}',
) }}
  • Stamp multiples models at once by adding the hook directrly into the core/dbt_project
models:
  core_dashboards_store:
    dashboards:
      my_dashboard:
        +tags: ["my_dashboard"]
        +schema: dashboard_my_dashboard
        pbi_tables:
          +post_hook: ["{{ core_dashboards_store.stamp_model('my_dashboard') }}"]

The second option is to be prefered if all of your report models are under a common folder

Using the stamper in your dashboard

In PowerBi, you can easily fetch the last run of your ETL by filtering on the argument provided to the stamp_model macro.

Variable conventions

  • Don't use spaces in your variables names
  • Please, stick to a snake_case naming convention
    • Use the interface tables to set the right names for the variables.
  • Reserved keywords schould be written in caps.

dbt_project.yaml conventions

The dbt_project.yaml has to be updated every time a new dashboard is added to the core_dashboards_store.

  • Each dashboard (ie, each subfolder of the models folder) schould be given a tag and a schema. The following exemple shows the minimal lines to add to the dbt_project.yaml file to add dummy_dashboard to the core_dashboards_store. The rational behind this convention is to ease filtering information in either the database or the documentation.
models: # Already here, for reference only
  core_dashboards_store: # Already here, for reference only
    +enabled: False # Already here, for reference only
    dummy:
      +tags: ["dummy"]
      +schema: "dummy"

Naming conventions

  • use snake_case to names your variables.

Table conventions

  • Use snake_case naming conventions.
  • Use a prefix to indicates the high-level objective of the table. The following table prefixing conventions schould be used.
Table typeDescriptionPrefixExemple
factContains tables of factsfact_fact_eleve
dimensionA map between an arbitrary ID and a friendly namedim_dim_subject_category
bridgeA mapping of between-systems-primarish-keybridge_NA
baseA base table is a skeleton table used to build fact tablesbase_NA
stagingA staging table is a by product of the construction of fact table.
The table kind of acts as a fact table, but is not be queried by itself.
Staging tables are generally combined together or joined on a base table to create a fact table
stg_stg_droppers_raw
interfaceThe all mighty. Interfaces are tables mapping to the raw data from the operational system. It's basically a select clause followed by a list of the fields used in the downstream tasks. Those tables can be overriden in the inherited package to map the CSS requirements. Please, only add the columns you need.target_target_perseverance
reportingReporting tabel. Used as an eaye-catcher to easily detect the tables we need to plug the dashboard on.rprt_rprt_emp65_ann_bdgt

How to

Between projects conflicting tables names

Some tables names are quite generic (spine, dim_school) and can be used in various contexts without refering to the same underlying table. To avoid confusion, please, use the following pattern to disambiguate the tables.

Exemple

Let's consider the following two dashboards : employees_absences and dummy.

.
โ””โ”€โ”€ core.store/
    โ”œโ”€โ”€ reporting/
    โ”‚   โ””โ”€โ”€ employees_absences.pbit
    โ””โ”€โ”€ models/
        โ”œโ”€โ”€ employees_absences/
        โ”‚   โ””โ”€โ”€ fact_absences.sql
        โ””โ”€โ”€ dummy/
            โ””โ”€โ”€ fact_absences.sql
models:
  +employees_absences:
    +schema: "employees_absences"
    +tags: ["employees_absences"]
  +dummy:
    +schema: "dummy"
    +tags: ["dummy"]

The two dashboards are using a table named fact_absences but the SQL code is not the same, so I do need thoose two tables. Unfortunetely, DBT will raise an error as each name has to be unique.

An easy fix is to rename one of the fact_absences tables. Let's say we rename modesl/dummy/fact_absences to dummy_fact_absences. DBT will now be able to compile the code and any downstream task of the dummy project could refers to the table using {{ ref('dummy_fact_absence') }}

This issue with this fix is that DBT will output a dummy_fact_absences table in the dummy schema. This is redundant, as the conflict happens between schemas, and not within the dummy schema. Fortunately, we can override the outputed name in the SQL code by adding the following line into dummy_fact_absences.sql.

{{ config(alias='fact_absences') }}

DBT error message

In such case, DBT would output an error message similar to the following one.

Compilation Error
  dbt found two models with the name "<foobar>".

  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("foobar").

  To fix this, change the name of one of these resources:
  - model.core_dashboards_store.removeme (models/prospectif_cdp/features/foobar.sql)
  - model.core_dashboards_store.removeme (models/emp_conge/feature/foobar.sql)
Pattern

The generic pattern to solve the between-project-conflicting-tables-names issue is the following :

  1. Prefix your table with the (unique) friendly name of your dashboard.
    • The friendly name schould be short. Maybe 3-to-10 letters. dummy could become dmy
  2. Add a DBT directive into your table code to output the table under it's original name by setting the alias property