Conventions and developement guidelines
Committing
sqlfmt
We use
sqlfmt
as an SQL formatter. Wheither we love it or not, we have to use it, so everybody code will look the same, making collaboration easier. Some linting rules might be controversial, but at least they are consistent and explicit.
Before committing, make sure to run the following command
sqlfmt .
... You might wan't to rerun a dbt build
after formatting everything. sqlfmt
schould'nt break anything, but we are better safe than sorry.
Commit message
Commit messages use the Commitizen convention.
Please, make sure all of your commit messages start with a type. The following types are available :
feat
: A new featurefix
: A bug fixdocs
: Documentation only changesstyle
: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)refactor
: A code change that neither fixes a bug nor adds a featureperf
: A code change that improves performancetest
: Adding missing or correcting existing testschore
: Changes to the build process or auxiliary tools and libraries such as documentation generationrevert
: Reverts a previous commit
Folders structure and convention
- All the SQL / Python code live in the
models
folder. - All dashboards and reports live in the
reporting
folder.
About the models
folder
The models
folder is organized as follow :
.
โโโ models/
โโโ interfaces/
โ โโโ database_spame/
โ โโโ mart_foobar
โโโ marts/
โ โโโ mart_foobar
โโโ dashboards/
โโโ spam/
โ โโโ features/
โ โ โโโ fact_absences.sql
โ โโโ pbi_tables/
โ โโโ fact_absences.sql
โโโ egg/
...
Where :
dashboards
Each subfolder ofmodels/dashboards
should be named after the corresponding dashboard it's belongs to (one dashboard, one folder containing it's SQL code).interfaces
contains mapping to the interfaces tables. Each interface table can be overrided to add custom connection logic to the undelrying database..marts
contains the ... marts. Amart
is a collection of tables reused/shared accross between dashboards.
Integration test and the nightly build
The nightly build is an automated check on the repo happening at the end (the night ^^) of each day
When introducting a new NON-OPTIONAL seed in the core.data.store
repo, you must add it into core.data.store/nightly/dbt/seeds
folder, so the next Nightly build won't fail because of a missing seed. The seed must be populated with data from CSSVDC, as the CSSVDC is used as a target database for the integration tests.
You can also run the integration suit locally. Please, refer to the core.data.store/nightly/README.md
file for more details.
Marts
educ_serv
This mart gather all the data related to the education service.
Populations
Populations
are sets of students used as a filter by various dashboards. you can refer to analyses/marts/educ_serv/staging/population folder and use the population template to build/define your populations.
The following populations are mandatory (cf adatpers
) and schould be defined :
stg_ele_prescolaire
stg_ele_primaire_reg
stg_ele_primaire_adapt
stg_ele_secondaire_reg
stg_ele_secondaire_adapt
The integrator can add new populations by overrding the custom_fgj_populations.sql
model. To do so :
- Create a new file in
cssXX.data.store/models/marts/educ_serv/staging/populations
namedcustom_fgj_populations.sql
- Your
custom_fgj_populations
model schould be implemeted as a union of your own custom populations. - Disable the core's placeholder in the
cssXX.data.store
:
# cssXX.data.store/dbt_project.yml
models:
core_dashboards_store:
marts:
educ_serv:
staging:
populations:
custom_fgj_populations:
+enabled: false
Developers : when creating a new dashboard using the population mechanism, you must register it's tag in the marts/educ_serv/adapters.yml
file, for it trigger the population computation.
human resources
This mart gather all the data related to the human resources departement
Populating the marts seed
This dashboard requiers the specification of the seeds in the
human_resources
mart.
The seed must be populated in cssXX.data.store/seeds/marts/human_resources/
and as per the definition of the core.data.store/seeds/marts/human_resources/schema.yml
mart.
Please refer to the core.data.store/seeds/marts/human_resources/schema.yml
mart documentation to get the concrete implementation.
Do not forget to refresh your seeds with the dbt seeds --select tag:human_resources --full-refresh
command.
Exposing the freshness of the data into the dashboard
The
core
provide a mechanism to expose the freshness of the data into the dashboard. This mechanism is callthe stamper
and can be enabled and used through macros.
Enabling the stamper
Must be done ONCE in your
cssXX.data.store/dbt_project.yaml
.
The stamper is a table collecting metadata about your ETL's run. To enable the data collection, you first enable it in your dbt_project.yml
by adding the two following hooks :
# cssXX.data.store/dbt_project.yml
# Hooks
on-run-start:
- "{{ core_dashboards_store.init_metadata_table() }}"
on-run-end:
- "{{ core_dashboards_store.purge_metadata_table() }}"
stamping
my new dashboard
A good practice is to only stamp the reporting tables.
Only successfull run will be stamped. Meaning that taking the MIN(run_ended_at)
will give you the last time your ETL run has been sucessfull. This is the worstcase scenario freshness
To add a stamp to your dashboard you can either :
- Add the following
post_hook
into your model :
# model.sql
{{ config(
post_hook='{{ core_dashboards_store.stamp_model("my_dashboard") }}',
) }}
- Stamp multiples models at once by adding the hook directrly into the
core/dbt_project
models:
core_dashboards_store:
dashboards:
my_dashboard:
+tags: ["my_dashboard"]
+schema: dashboard_my_dashboard
pbi_tables:
+post_hook: ["{{ core_dashboards_store.stamp_model('my_dashboard') }}"]
The second option is to be prefered if all of your report models are under a common folder
Using the stamper
in your dashboard
In PowerBi, you can easily fetch the last run of your ETL by filtering on the argument provided to the stamp_model
macro.
Variable conventions
- Don't use spaces in your variables names
- Please, stick to a snake_case naming convention
- Use the interface tables to set the right names for the variables.
- Reserved keywords schould be written in caps.
dbt_project.yaml conventions
The
dbt_project.yaml
has to be updated every time a new dashboard is added to the core_dashboards_store.
- Each dashboard (ie, each subfolder of the
models
folder) schould be given a tag and a schema. The following exemple shows the minimal lines to add to thedbt_project.yaml
file to adddummy_dashboard
to the core_dashboards_store. The rational behind this convention is to ease filtering information in either the database or the documentation.
models: # Already here, for reference only
core_dashboards_store: # Already here, for reference only
+enabled: False # Already here, for reference only
dummy:
+tags: ["dummy"]
+schema: "dummy"
Naming conventions
- use snake_case to names your variables.
Table conventions
- Use snake_case naming conventions.
- Use a prefix to indicates the high-level objective of the table. The following table prefixing conventions schould be used.
Table type | Description | Prefix | Exemple |
---|---|---|---|
fact | Contains tables of facts | fact_ | fact_eleve |
dimension | A map between an arbitrary ID and a friendly name | dim_ | dim_subject_category |
bridge | A mapping of between-systems-primarish-key | bridge_ | NA |
base | A base table is a skeleton table used to build fact tables | base_ | NA |
staging | A staging table is a by product of the construction of fact table. The table kind of acts as a fact table, but is not be queried by itself. Staging tables are generally combined together or joined on a base table to create a fact table | stg_ | stg_droppers_raw |
interface | The all mighty. Interfaces are tables mapping to the raw data from the operational system. It's basically a select clause followed by a list of the fields used in the downstream tasks. Those tables can be overriden in the inherited package to map the CSS requirements. Please, only add the columns you need. | target_ | target_perseverance |
reporting | Reporting tabel. Used as an eaye-catcher to easily detect the tables we need to plug the dashboard on. | rprt_ | rprt_emp65_ann_bdgt |
How to
Between projects conflicting tables names
Some tables names are quite generic (spine, dim_school) and can be used in various contexts without refering to the same underlying table. To avoid confusion, please, use the following pattern to disambiguate the tables.
Exemple
Let's consider the following two dashboards : employees_absences
and dummy
.
.
โโโ core.store/
โโโ reporting/
โ โโโ employees_absences.pbit
โโโ models/
โโโ employees_absences/
โ โโโ fact_absences.sql
โโโ dummy/
โโโ fact_absences.sql
models:
+employees_absences:
+schema: "employees_absences"
+tags: ["employees_absences"]
+dummy:
+schema: "dummy"
+tags: ["dummy"]
The two dashboards are using a table named fact_absences
but the SQL code is not the same, so I do need thoose two tables. Unfortunetely, DBT will raise an error as each name has to be unique.
An easy fix is to rename one of the fact_absences
tables. Let's say we rename modesl/dummy/fact_absences
to dummy_fact_absences
. DBT will now be able to compile the code and any downstream task of the dummy project could refers to the table using {{ ref('dummy_fact_absence') }}
This issue with this fix is that DBT will output a dummy_fact_absences
table in the dummy
schema. This is redundant, as the conflict happens between schemas, and not within the dummy
schema. Fortunately, we can override the outputed name in the SQL code by adding the following line into dummy_fact_absences.sql
.
{{ config(alias='fact_absences') }}
DBT error message
In such case, DBT would output an error message similar to the following one.
Compilation Error
dbt found two models with the name "<foobar>".
Since these resources have the same name, dbt will be unable to find the correct resource
when looking for ref("foobar").
To fix this, change the name of one of these resources:
- model.core_dashboards_store.removeme (models/prospectif_cdp/features/foobar.sql)
- model.core_dashboards_store.removeme (models/emp_conge/feature/foobar.sql)
Pattern
The generic pattern to solve the between-project-conflicting-tables-names issue is the following :
- Prefix your table with the (unique) friendly name of your dashboard.
- The friendly name schould be short. Maybe 3-to-10 letters.
dummy
could becomedmy
- The friendly name schould be short. Maybe 3-to-10 letters.
- Add a DBT directive into your table code to output the table under it's original name by setting the alias property