1.Project_Introduction.pdf

Project Introduction

Idea: describe the project business requirements and overall set-up
Level: Project
Star Schema, Integration Job
Reading Order: 1/7
Project Date: August 10, 2022

Contents
Requirements
  Use case
  File description
    1. (General) Ledger
    2. Account
    3. Entity
  Project overview
  Data updates
  Rejections
  Reloads
  Audit
Talend schemas
  Generic schemas
  DB schemas
Talend context variables
  Generic context variables
    Fixed values
    Dynamic values
  Master execution context variables
Talend DB connections

Requirements

Use case

The client is a retail company looking to build reports and dashboards derived from its accounting results.

Each month, the client receives from the accounting team a General Ledger summarizing all the transactions that occurred during the month.

The client also has two lookup tables used to relate the General Ledger information to:
- Account information: relates each General Ledger record to an accounting budget item and budget family.
- Entity information: relates each General Ledger record to an entity owned by the client.

File description

As described above, the client works with 3 different files:

1. (General) Ledger

[ACCOUNT]: account key to which the ledger record item relates
[JOURNAL]: journal key to which the ledger record item relates
[DATE]: accounting date at which the records are evaluated
[REFERENCE]: technical ledger item reference used to uniquely identify transactions
[DEBIT]: debit amount for the item
[CREDIT]: credit amount for the item
[BALANCE]: calculated balance as [CREDIT] - [DEBIT] for the item

[Figure: 2021/01 general ledger table sample]

The primary key of the General Ledger is defined by: [REFERENCE]
An alternate key is: [ACCOUNT] x [JOURNAL] x [DATE]

2. Account

[ACCOUNT]: account key to which the ledger record item relates
[FINANCIAL_TYPE]: type of financial accounting to which the account key relates
[CATEGORY]: category within the financial accounting group to which the account key relates

[Figure: account table sample]

The primary key of the Account table is defined by: [ACCOUNT]

3. Entity

[JOURNAL]: journal key to which the ledger record item relates
[ENTITY]: reference of the client’s entity to which the journal belongs
[ENTITY_TYPE]: type of entity as defined by the client’s internal organization

[Figure: entity table sample]

The primary key of the Entity table is defined by: [JOURNAL]

Project overview

The project is therefore relatively straightforward:
1. The .csv source files are made available on an FTP server on a regular basis.
2. Those files are loaded, untransformed, into an ODS.
3. The data is then loaded into a DWH with a few transformations and lookups.

[Figure: simplified project architecture]

Data updates

The client sends the .csv files on a regular basis. Those flat files are then loaded on an FTP server for ETL integration.

Account and Entity information rarely changes, but the client should be able to update it easily and to dynamically re-assign historical values to the updated information. Previous Account and Entity value assignments are irrelevant for the client.

The Ledger information is received by email and must be inserted/updated on a monthly basis. Ledger .csv files follow the naming convention “GENERAL_LEDGER_YYYYMM”. If the Ledger extracts of multiple months are made available at once, only the latest extract should be loaded (highest date, derived from the file naming convention).

Rejections

Sometimes, some Ledger key information will not be recorded in the Account and Entity tables. In that case, the client asks that the corresponding Ledger information be recorded in the DWH nevertheless. In other words, the client does not want to kill the entire ETL process if a couple of records do not match the dimension tables. The non-matching data should be recorded in a separate table for a rejection analysis which will, most of the time, lead to a Ledger, Entity and/or Account reload.

Reloads

The client will ask for file reloads under two scenarios:
1. The rejections may reveal that the Entity/Account tables are incomplete and should be re-loaded.
2. The client may realize that some records are missing from the Ledger information and should be loaded.

Full data loads on the staging area are sufficient, as historical staging information is irrelevant for the client.

Audit

Each ETL load/reload batch should be traceable through a uniquely generated UID.
All DML changes should be tracked in a metadata table. This metadata table should contain the number of rows deleted, inserted, updated and rejected, as well as the batch UID.

Talend schemas

Generic schemas

The project is built around several generic schemas to store the raw information extracted from the client’s data:
- a generic schema for the Ledger information
- a generic schema for the Account information
- a generic schema for the Entity information
- a metadata schema, used to record the ETL operations

DB schemas

After the data is ingested in its raw form in the ODS, the DWH hosts the transformed data.
To do so, several database schemas are defined:
- a DB schema for the Ledger fact table
- a DB schema for the Account dimension table
- a DB schema for the Entity dimension table
- a DB schema for the Ledger rejections

Talend context variables

Generic context variables

Several context variables were created for this project, some with fixed values, some with dynamic values. All those variables are stored under the context group “JobExecution” in Talend.

Fixed values

- Folder path where all the .csv files will be dropped. For simplicity, we assume a local file path.
- DB table names at the ODS and DWH levels.

Dynamic values

Dynamic values are blank by default and are defined at runtime:
- CSV Ledger file name (used to capture the latest Ledger file available in the folder path).
- UID (used to identify job execution batches).

Master execution context variables

To manage the reloads described earlier, it is not always necessary to reload the complete project. Instead, only subsets of the project can be executed to satisfy the reload requirements.

Talend DB connections

We create two schemas in MySQL to manage the ODS and DWH layers. In MySQL, those are treated as two separate DB objects:
- the ODS schema
- the DWH schema
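
Illustration: resolving the dynamic context values

The two dynamic values described under "Dynamic values" (the latest Ledger file name and the batch UID) could be derived at runtime roughly as sketched below. This is a minimal Python sketch, not the Talend job itself. The function names, the dictionary keys and the ".csv" extension check are assumptions made for illustration; the source only specifies the "GENERAL_LEDGER_YYYYMM" naming convention and the need for a unique batch UID.

```python
import re
import uuid
from typing import Dict, Iterable, Optional

# Assumption: files are named GENERAL_LEDGER_YYYYMM.csv. The source gives the
# naming convention and says the files are .csv; the exact pattern is ours.
LEDGER_PATTERN = re.compile(r"^GENERAL_LEDGER_(\d{4})(0[1-9]|1[0-2])\.csv$")


def latest_ledger_file(file_names: Iterable[str]) -> Optional[str]:
    """Return the Ledger file with the highest YYYYMM period, or None."""
    candidates = []
    for name in file_names:
        match = LEDGER_PATTERN.match(name)
        if match:
            # Turn YYYY and MM into a sortable integer such as 202203.
            period = int(match.group(1)) * 100 + int(match.group(2))
            candidates.append((period, name))
    if not candidates:
        return None
    return max(candidates)[1]


def new_batch_uid() -> str:
    """Generate a unique identifier for one ETL load/reload batch."""
    return str(uuid.uuid4())


def resolve_dynamic_context(file_names: Iterable[str]) -> Dict[str, str]:
    """Build the dynamic values (hypothetical key names).

    The Ledger file name stays blank when no file matches, mirroring the
    blank defaults described for dynamic values.
    """
    return {
        "ledger_file_name": latest_ledger_file(file_names) or "",
        "batch_uid": new_batch_uid(),
    }
```

In practice the file names would come from a listing of the folder path stored in the fixed context values, for example os.listdir(folder_path). Files that do not follow the naming convention, such as the Account and Entity extracts, are simply ignored by the pattern.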