Getting Started
Conventions and development guidelines
Committing
sqlfmt
We use
sqlfmtas an SQL formatter. Whether we love it or not, we have to use it, so everybody's code will look the same, making collaboration easier. Some linting rules might be controversial, but at least they are consistent and explicit.
Before committing, make sure to run the following command
sqlfmt .
... You might want to rerun a dbt build after formatting everything. sqlfmt shouldn't break anything, but we are better safe than sorry.
Commit message
Commit messages use the Commitizen convention.
Please make sure all of your commit messages start with a type. The following types are available:
feat: A new featurefix: A bug fixdocs: Documentation only changesstyle: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)refactor: A code change that neither fixes a bug nor adds a featureperf: A code change that improves performancetest: Adding missing or correcting existing testschore: Changes to the build process or auxiliary tools and libraries such as documentation generationrevert: Reverts a previous commit
Folder structure and convention
- All the SQL / Python code lives in the
modelsfolder. - All dashboards and reports live in the
reportingfolder.
About the models folder
The models folder is organized as follows:
.
└── models/
├── interfaces/
│ └── database_spame/
│ └── mart_foobar
├── marts/
│ └── mart_foobar
└── dashboards/
├── spam/
│ ├── features/
│ │ └── fact_absences.sql
│ └── pbi_tables/
│ └── fact_absences.sql
└── egg/
...
Where :
dashboards: each subfolder ofmodels/dashboardsshould be named after the corresponding dashboard it belongs to (one dashboard, one folder containing its SQL code).interfaces: contains mappings to the interface tables. Each interface table can be overridden to add custom connection logic to the underlying database.marts: contains the marts. Amartis a collection of tables reused/shared across dashboards.
Integration test and the nightly build
The nightly build is an automated check on the repo happening at the end of each day.
When introducing a new NON-OPTIONAL seed in the core.dashboards_store repo, you must add it into core.dashboards_store/nightly/dbt/seeds folder, so the next Nightly build won't fail because of a missing seed. The seed must be populated with data from CSSVDC, as the CSSVDC is used as a target database for the integration tests.
You can also run the integration suite locally. Please, refer to the core.dashboards_store/nightly/README.md file for more details.
Marts
educ_serv
This mart gathers all the data related to the education service.
Populations
Populations are sets of students used as a filter by various dashboards. You can refer to the analyses/marts/educ_serv/staging/populations folder and use the population template to build/define your populations.
The following populations are mandatory (cf adapters) and should be defined :
stg_ele_prescolairestg_ele_primaire_regstg_ele_primaire_adaptstg_ele_secondaire_regstg_ele_secondaire_adapt
The integrator can add new populations by overriding the custom_fgj_populations.sql model. To do so:
- Create a new file in
cssXX.dashboards_store/models/marts/educ_serv/staging/populationsnamedcustom_fgj_populations.sql - Your
custom_fgj_populationsmodel should be implemented as a union of your own custom populations. - Disable the core's placeholder in the
cssXX.dashboards_store:
# cssXX.dashboards_store/dbt_project.yml
models:
core_dashboards_store:
marts:
educ_serv:
staging:
populations:
custom_fgj_populations:
+enabled: false
Developers: when creating a new dashboard using the population mechanism, you must register its tag in the marts/educ_serv/adapters.yml file so it triggers the population computation.
Human resources
This mart gathers all the data related to the human resources department.
Populating the marts seed
This dashboard requires the specification of the seeds in the
human_resourcesmart.
The seed must be populated in cssXX.dashboards_store/seeds/marts/human_resources/ and as per the definition of the core.dashboards_store/seeds/marts/human_resources/schema.yml mart.
Please refer to the core.dashboards_store/seeds/marts/human_resources/schema.yml mart documentation to get the concrete implementation.
Do not forget to refresh your seeds with the dbt seed --select tag:human_resources --full-refresh command.
Exposing the freshness of the data into the dashboard
The
coreprovides a mechanism to expose the freshness of the data in the dashboard. This mechanism is calledthe stamperand can be enabled and used through macros.
Enabling the stamper
Must be done ONCE in your
cssXX.dashboards_store/dbt_project.yml.
The stamper is a table collecting metadata about your ETL runs. To enable data collection, you first enable it in your dbt_project.yml by adding the following two hooks:
# cssXX.dashboards_store/dbt_project.yml
# Hooks
on-run-start:
- "{{ core_dashboards_store.init_metadata_table() }}"
on-run-end:
- "{{ core_dashboards_store.purge_metadata_table() }}"
stamping my new dashboard
A good practice is to only stamp the reporting tables.
Only successful runs will be stamped. This means that taking the MIN(run_ended_at) will give you the last time your ETL run was successful. This is the worst-case freshness scenario.
To add a stamp to your dashboard, you can either:
- Add the following
post_hookinto your model :
# model.sql
{{ config(
post_hook='{{ core_dashboards_store.stamp_model("my_dashboard") }}',
) }}
- Stamp multiple models at once by adding the hook directly into the
core/dbt_project
models:
core_dashboards_store:
dashboards:
my_dashboard:
+tags: ["my_dashboard"]
+schema: dashboard_my_dashboard
pbi_tables:
+post_hook: ["{{ core_dashboards_store.stamp_model('my_dashboard') }}"]
The second option is preferred if all of your report models are under a common folder.
Using the stamper in your dashboard
In Power BI, you can easily fetch the last run of your ETL by filtering on the argument provided to the stamp_model macro.
Variable conventions
- Don't use spaces in your variable names
- Please, stick to a snake_case naming convention
- Use the interface tables to set the right names for the variables.
- Reserved keywords should be written in caps.
dbt_project.yml conventions
The
dbt_project.ymlhas to be updated every time a new dashboard is added tocore.dashboards_store.
- Each dashboard (i.e. each subfolder of the
modelsfolder) should be given a tag and a schema. The following example shows the minimal lines to add to thedbt_project.ymlfile to adddummy_dashboardtocore.dashboards_store. The rationale behind this convention is to ease filtering information in either the database or the documentation.
models: # Already here, for reference only
core_dashboards_store: # Already here, for reference only
+enabled: False # Already here, for reference only
dummy:
+tags: ["dummy"]
+schema: "dummy"
Naming conventions
- Use snake_case to name your variables.
Table conventions
- Use snake_case naming conventions.
- Use a prefix to indicate the high-level objective of the table. The following table prefixing conventions should be used.
| Table type | Description | Prefix | Example |
|---|---|---|---|
| fact | Contains tables of facts | fact_ | fact_eleve |
| dimension | A map between an arbitrary ID and a friendly name | dim_ | dim_subject_category |
| bridge | A mapping of primary-ish keys between systems | bridge_ | NA |
| base | A base table is a skeleton table used to build fact tables | base_ | NA |
| staging | A staging table is a by-product of the construction of a fact table. The table kind of acts as a fact table, but is not queried by itself. Staging tables are generally combined together or joined on a base table to create a fact table | stg_ | stg_droppers_raw |
| interface | The all mighty. Interfaces are tables mapping to the raw data from the operational system. It's basically a select clause followed by a list of the fields used in the downstream tasks. Those tables can be overridden in the inherited package to map the CSS requirements. Please, only add the columns you need. | target_ | target_perseverance |
| reporting | Reporting table. Used as an eye-catcher to easily detect the tables we need to plug the dashboard on. | rprt_ | rprt_emp65_ann_bdgt |
How to
Between projects conflicting tables names
Some table names are quite generic (spine, dim_school) and can be used in various contexts without referring to the same underlying table. To avoid confusion, please use the following pattern to disambiguate the tables.
Example
Let's consider the following two dashboards: employees_absences and dummy.
.
└── core.store/
├── reporting/
│ └── employees_absences.pbit
└── models/
├── employees_absences/
│ └── fact_absences.sql
└── dummy/
└── fact_absences.sql
models:
+employees_absences:
+schema: "employees_absences"
+tags: ["employees_absences"]
+dummy:
+schema: "dummy"
+tags: ["dummy"]
The two dashboards are using a table named fact_absences but the SQL code is not the same, so I do need those two tables. Unfortunately, dbt will raise an error as each name has to be unique.
An easy fix is to rename one of the fact_absences tables. Let's say we rename models/dummy/fact_absences to dummy_fact_absences. dbt will now be able to compile the code and any downstream task of the dummy project could refer to the table using {{ ref('dummy_fact_absence') }}.
The issue with this fix is that dbt will output a dummy_fact_absences table in the dummy schema. This is redundant, as the conflict happens between schemas, and not within the dummy schema. Fortunately, we can override the output name in the SQL code by adding the following line into dummy_fact_absences.sql.
{{ config(alias='fact_absences') }}
dbt error message
In such case, dbt would output an error message similar to the following one.
Compilation Error
dbt found two models with the name "<foobar>".
Since these resources have the same name, dbt will be unable to find the correct resource
when looking for ref("foobar").
To fix this, change the name of one of these resources:
- model.core_dashboards_store.removeme (models/prospectif_cdp/features/foobar.sql)
- model.core_dashboards_store.removeme (models/emp_conge/feature/foobar.sql)
Pattern
The generic pattern to solve the between-project-conflicting-tables-names issue is the following:
- Prefix your table with the (unique) friendly name of your dashboard.
- The friendly name should be short. Maybe 3-to-10 letters.
dummycould becomedmy
- The friendly name should be short. Maybe 3-to-10 letters.
- Add a dbt directive into your table code to output the table under its original name by setting the alias property
