9-week build · P1 close-out + P3 design lock
The replication pipeline keeps running every night. P1 (bronze staging) is closing out: bootstrap PR merged Monday, Phase B seeds + macros landed, and all 8 BU contacts staging models materialize in DEV against real data. Rich's reconciliation review closed Friday morning — the canonical model field list is now locked at 85 fields, and #175a (P3a canonical) is ready to start once the 3 staging PRs merge. Splink IR (P2) build window is the next workstream, in parallel to the canonical build.
Receiving team unchanged: Mike Grabbe + Adrian LeDoux (ET Data Engineering). Rich transitioning out — weekly standups continue; this report keeps you in the loop.
W4 cleared the critical-path bottleneck
Two long-running blockers cleared this week — Rich's reconciliation review came back, and Adrien approved the first two of three staging PRs. Plus the structural verification of the CH ↔ US replication confirmed the pipeline is healthy across all 49 source tables (rows + columns + bytes identical). Net: canonical build (#175a) can start, blocked only on the third PR review.
Friday morning Rich returned the DC-direct verification XLS with his final decisions on the 17 ambiguous fields:
- 6 fields Drop (gone from canonical + staging)
- 1 field Keep in canonical (
Contact_Restriction_Code· ET+EFSA scope) - 2 fields Keep on Source only (per-BU staging, not promoted to unified):
Departure_Year(HSEY),Is_Primary_Contact(CCAP+Academy) - 1 field Use for CI:
Is_Currently_Hosting(cross-BU CCAP+HSEY) - 7 ghost fields Derive in DBT, keep on source only — Rich chose to derive them INTO the canonical from CCAP source (the only BU with the data) rather than deploy as DMO columns. No DC metadata changes needed; passthrough survivorship since CCAP is the only source.
Canonical model #175a active UI fields: 89 → 85 (77 implicit Keep + 1 explicit Keep + 7 ghosts derived from CCAP). Design-locked.
Friday morning Adrien approved both foundational PRs:
- #3989 · sources refactor — splits monolithic
sources.ymlinto per-product files insidemodels/staging/<product>/ - #4027 · CCAP Phase C pilot — first real
stg_ccap__contacts(878,903 rows, lead-source normalized, survivorship clock wired)
The 7-BU extension PR (#4059) is still pending Adrien's review — gentle follow-up sent. Likely waits until #4027 lands so review happens against the merged base.
Comprehensive sweep across every registered source table for the dbt project — verified row count, column count, and storage bytes are identical between US consolidation (EF_DATA_HUB_RAW) and CH replica:
- 49 tables across 9 schemas (SF orgs, shared DBs, Juno Kafka, NeverBounce)
- 285 M+ rows total · all match exactly
- Original Fivetran sources sit ~0.01-0.02% above CH (nightly task lag — expected)
Replication group is healthy. No structural drift, no missing tables.
What landed in W4
-
1
Bootstrap PR
eftours/de-dbt#3868merged Monday (W4 D1). The vanilla-audit restructured project (8 → 4 vanilla schemas, UPPERCASE prefixed, sources outsidemodels/) is now onmain. Pass 3c DDL executed on HH82036 — dropped the 8 obsolete unprefixed schemas across DEV/QA/PROD (24 schemas total). The cross-repo project surface is now clean and Phase B work moves forward on top of the merged base. -
2
Phase C pilot —
stg_ccap__contactsported end-to-end (W4 D3). First real per-BU staging model. UNION of Contact (host-family record type, non-au-pair, US-only) + Lead (US, non-deleted, not-yet-converted), LEFT JOIN to Account foris_primary_contact, fullutm_sourcenormalization (~20 mappings preserved from legacy),lead_source_normalizedvia seed JOIN,survivorship_tsmacro wired. Materializes 878,903 rows in DEV (346,778 Contact + 532,125 Lead) with 81.2% seed coverage onlead_source_normalized. -
3
7 BU staging contacts extended (W4 D4) — same pattern applied to the other 7 BUs. Real ports for academy (PersonAccount + Contact-Guardian UNION, 2.77 M rows), hsey (Contact + Lead UNION, 1.59 M), language (Juno Kafka source, 33.4 M), student_tours (data_hub_share, 5.79 M), study_abroad (PLB share, 16.1 k), gap_year (PLB share, 132 k), world_journeys (multi-tenant via BUSINESS_CODE, 8.69 M). Each materializes ≤ 2.4 s in DEV. Row counts reconcile to filter logic to the exact row — verified independently via raw-source count queries.
-
4
Sources refactor PR (#3989) Adrien-approved. Splits the monolithic
sources/sources.yml(~181 lines, 9 sources in one file) into per-product_<product>__sources.ymlfiles inside eachmodels/staging/<product>/folder, aligning with the dbt Labs 1.12 best practice Adrien flagged on May 13. Bonus fix on a missing Fivetranfreshnessconfig for salesforce_ccap. No new logic — just organization. -
5
Rich's reconciliation review closed Friday morning — canonical at 78 fields. Rich returned the DC-direct verification XLS with his decisions on all 17 ambiguous fields: 6 Drop · 1 Keep · 2 Keep-on-Source-only · 1 Use-for-CI · 7 ghost fields Derive-in-DBT (rather than deploy as DMO columns). No DC metadata changes needed. Canonical model #175a active field count: 89 → 78. Design-locked.
-
6
DC-direct verification + 49-table CH ↔ US audit. Earlier in the week (W4 D3-D4), did a full DC-direct verification of the 17 ambiguous fields against
UnifiedIndividual__dlmindatahub_prod(15.09 M rows) — including the surprise finding that the 7 "ghost" fields don't actually exist on the DMO (PROPOSED indataCloudMappings/+ recon rules but never deployed). Then verified the CH-side replica matches US consolidation 100% on row count, column count, AND storage bytes across all 49 source tables (285 M+ rows total). Original Fivetran sources sit ~0.01-0.02% above CH due to nightly task lag — expected.
What's running every night + what's now built on top
EF_DATA_HUB_RAW · CH = US 100% match (rows + cols + bytes)Nightly schedule (UTC)
| Time | Step | Owner |
|---|---|---|
00:00 | Root task COPY_ROOT fires; 48 children copy from source DBs into EF_DATA_HUB_RAW on US | Snowflake task graph (autonomous) |
~00:30 | All 48 children complete · row counts logged | Snowflake |
01:00 | Replication group EF_DATA_HUB_RAW_TO_CH_RG refresh · pulls deltas to CH | Snowflake replication |
~01:30+ | Available on HH82036 · ready for dbt build | — |
Schemas + sizes (CH side)
| Schema | Tables | Rows | Size | Source |
|---|---|---|---|---|
data_hub_share | 7 | 45.6 M | 8.07 GB | ET / Student Tours share |
language_kafka | 9 | 209.6 M | 35.20 GB | EU Language (Juno + Poseidon Kafka) |
wojo | 3 | 11.5 M | 0.65 GB | World Journeys share |
salesforce_ccap | 7 | 8.6 M | 1.59 GB | CCAP Fivetran landing |
salesforce_academy | 8 | 4.6 M | 0.50 GB | Academy Fivetran landing |
salesforce_hsey | 6 | 2.9 M | 0.36 GB | HSEY Fivetran landing |
ef_us_uploads | 1 | 0.75 M | 0.01 GB | NeverBounce upload (audit pending) |
higher_ed_gy | 4 | 0.16 M | 0.02 GB | Higher Ed share — Gap Year |
higher_ed_sa | 4 | 0.18 M | 0.02 GB | Higher Ed share — Study Abroad |
EVALUATION_QUESTION_RESPONSES_V from W2 has been on CH for a week now — included in the 49 source count above.
Schema surface on HH82036 (post Pass 3c · obsolete schemas dropped W4 D3)
| Database | Schema | Layer | Current content |
|---|---|---|---|
EF_DBT_DEV | EF_DATA_HUB_CH_STAGING | bronze | 8 real stg_<bu>__contacts models · 49 placeholders for other entities (opportunities, sales_orders, contact_point_*) · all CI-validated |
EF_DATA_HUB_CH_INTERMEDIATE | silver | Empty · canonical model (#175a · 85 UI fields locked) + Splink IR (#161) land here | |
EF_DATA_HUB_CH_CORE | silver | Empty · ~37 CIs (#166) + segments (#142) | |
EF_DATA_HUB_CH_MARTS | gold | Empty · Cortex agent feeds · cross-regional sharing · BI consumers (#224) | |
Same 4-schema set mirrored in EF_DBT_QA + EF_DBT_PROD (12 total). The 8 obsolete unprefixed schemas were dropped W4 D3. | |||
Where each in-flight ticket landed
Closed this week
| Ticket | Title | Why closed |
|---|---|---|
| #143 | CH dbt home — Pass 3 final | Pass 3c executed W4 D3 — 8 obsolete schemas dropped across DEV/QA/PROD (24 total) |
| #155 | Bootstrap single dbt project ef_data_hub_ch | PR #3868 merged W4 D1 via merge queue |
| #225 | Pre-spec audit · reconciliation rules + IR + field normalization | Rich's final decisions on 17 ? fields received Friday morning · canonical design-locked at 85 UI fields |
| #158 Phase A | Staging placeholders for 8 products | Superseded by Phase C — 8 BU staging contacts ports landed real (in PRs #4027 + #4059) |
Now ready for the build · pending PR merges only
PR #4027 (CCAP exemplar · 878 k rows) and PR #3989 (sources refactor) both Adrien-approved Friday morning. PR #4059 (the 7 other BUs · 52 M rows) ready for review since Thursday EOD; promotion from DRAFT happened after Adrien's first review pass, so the next pass will pick it up. Together they close P1 bronze.
Field list locked at 85 UI fields after Rich's Friday decisions: 77 implicit Keep + 1 explicit Keep (Contact_Restriction_Code) + 7 ghosts derived from CCAP source INTO canonical (Rich's option (b) — derive in dbt rather than deploy DMO column). The 6 Drops, 2 Keep-on-Source-only, and 1 Use-for-CI are out. Spec refreshed today. Build commits can start against DEV stg views (no longer gated on the staging PRs for design — just for materialization in PROD).
Still pending or stakeholder-blocked
- PR
#4059— Adrien review (was draft when he reviewed the other two on Friday AM; gentle re-nudge sent, expecting Monday) - #142 — Segments reframed: MC primary / Snowflake Plan B; Vivek call gates the methodology
- #161 + #162 — Splink IR + parity QA (incorporate Amon's modified 10-rule design from the parallel CH workstream's SF PS session); next workstream after canonical build kickoff
- #224 — Marts sublayer (Cortex / cross-regional / BI consumers); spec write pending after #158 Phase C fully merges
- #136-#140 — RLT-* normalize tickets: tracked as follow-up enrichments on the per-BU staging models once #4059 merges (nationality maps, host family aggregation, contact_origin_bu split, etc.)
- RLT-3991 — 5 HSEY closed-lost codes (CN/CY/EF/APP/INV → "Other"): Rich to confirm
$25 k NTE · spent vs remaining
NTE consumption
54.6% consumed end of W4. Trajectory still healthy: P0 + P1 done in 4 active weeks with ~45% of the cap reserved for P2 (Splink IR) through P6 (handoff). EF-engagement work only — the parallel ohanacloud-CH workstream (IR design with SF PS) is tracked separately and not billed against this NTE.
How this is calculated
Three rate tiers per SOW Amendment §4 (depending on day intensity):
| Day intensity | Rate | Applied to |
|---|---|---|
| Full-time (> 4 h/day) | $140 / h | Most active days · ~83 h × $140 ≈ $11,620 |
| Part-time (2.1–4 h/day) | $160 / h | 3 days · ~10 h × $160 ≈ $1,600 |
| Ad-hoc (≤ 2 h/day) | $180 / h | 3 days · ~2-4 h × $180 = $360-720 |
Total: 95 h · $13,640. Detailed daily breakdown lives in sprint-log.md.
Phase status
| Phase | SOW low (h) | Actual h | Actual $ | Status |
|---|---|---|---|---|
| P0 · M1 Foundation & access (W1) | 16 | 14.0 | $2,040 | Done |
| P1 · dbt #1 normalize + replication (W2-4) | 25 | 78.0 | $11,120 | Closing · 312% h · 278% $ |
| P2 · Splink IR (W5-6) | 37 | 0 | $0 | Upcoming |
| P3 · Canonical + CIs (W5-6) | 37 | 3.0 | $480 | Design locked · build pending PR merges |
| P4 · SFMC + Marts (W6-7) | 56 | 0 | $0 | Upcoming |
| P5 · Cutover (W8) | 14 | 0 | $0 | Upcoming |
| P6 · Handoff + Hypercare (W9 + Jul) | 18 | 0 | $0 | Upcoming |
P1 closed at 78 h vs 25-h SOW estimate (over-run absorbed). Phase scope materially expanded over W2-W4: #143 + #159 + #155 bootstrap + vanilla-audit restructure + CI debug + Pass 3 DDL + Phase B seeds + Phase C real ports for all 8 BUs + CH↔US 49-table audit. Cumulative NTE at 54.6% leaves ~$11.4 k for P2 → P6. Pacing P5/P6 conservatively; no escalation needed.
W5 (26 May – 1 Jun) · canonical build kickoff · IR design intake
P3a · canonical_individual model build
With the design locked at 78 fields and 8 BU staging contacts views materialized in DEV, the canonical model build is the natural next bite. canonical_individual sits in EF_DATA_HUB_CH_INTERMEDIATE and references the 8 per-BU staging views through the survivorship pattern. Splink IR (#161) work runs in parallel — the rule design from the parallel CH workstream's SF PS sessions can start integrating here.
- Spec #175a refresh · 78-field canonical list locked with Rich's 4 decision categories documented
- Canonical scaffold ·
canonical_individual.sqlinmodels/intermediate/canonical/reading fromref('stg_<bu>__contacts')× 8 - Survivorship CTE pattern · 6-rule-type framework wired (RT1 source-priority → RT6 aggregate) classified across the 78 fields
- 3 BU contacts enrichments · biggest-impact TODOs landed: HSEY nationality→ISO2 + host-family aggregation · Academy nationality + guardian dedup · Student Tours contact_origin_bu split
- 3 Google_Analytics_* cols · added to
stg_ccap__contacts(Rich's "derive in DBT" decision) - PR
#4059merged (assumes Adrien reviews early-week)
- Adrien — review + merge PR
#4059(was draft when he checked the other two on Friday AM; re-nudge sent) - Rich — confirm 5 HSEY closed-lost codes for RLT-3991 (CN/CY/EF/APP/INV → "Other") to close that ticket
- Vivek (MC) — still gating #142 segments methodology call
- Mike / Adrian — NeverBounce source status confirmation (5+ months stale)
- Mike — sanity check on first 4 weeks of replication credits to size the cost line
Looking past W5 — phase trajectory
| Window | Phase | What gets built |
|---|---|---|
| W5-6 (26 May - 8 Jun) | P2 + P3a | Canonical model (78-field canonical_individual with 6-RT survivorship CTE pattern) · Splink IR pipeline kickoff (blocking, fuzzy, graph resolution) · staging BU enrichments (nationality maps, host-family aggregation, contact_origin_bu split) |
| W6-7 (3-16 Jun) | P3b + P4 | ~37 CIs (including new ci__is_currently_hosting from Rich's decision) · IR parity QA vs DC's UnifiedIndividual · marts pre-aggregated for Cortex agent · Segments (MC primary or Snowflake fallback) · SFMC engagement via S3 |
| W7-8 (10-23 Jun) | P4 + P5 | Marketing Cloud integration · CloudPage UI · cutover runbook · parallel-run reconciliation report · Prefect orchestration · DC schedule disable plan |
| W9 + Jul | P6 | Handoff to Mike Grabbe + Adrian LeDoux · operational runbook · KT sessions · 2-cycle hypercare · formal sign-off checklist |
Critical path now: PR #4059 merge → #175a canonical_individual → #161 Splink IR (parallel) → #166 CIs → #142 segments → cutover. The 3 open PRs are the last bronze-layer gate.
Where to dig deeper
- Migration plan deck + Architecture options + Structure proposal + Vanilla audit deck
- SOW Amendment #2 (signed 29 Apr 2026) — NTE $25 k, T&M capped, Net 30
- Sprint log — weekly hours and budget tracker
- docs/sf-analysis/unifiedindividual-fillrate-rich-final-decisions-2026-05-22.csv — Rich's reply with the 17 ? field decisions
- docs/sf-analysis/unifiedindividual-rich-final-decisions-summary-2026-05-22.md — per-category breakdown + downstream actions
- docs/sf-analysis/unifiedindividual-fillrate-for-rich-2026-05-21-dc-verified.xlsx — the DC-direct verification XLS shared with Rich on Wed
- Sprint log W4 days 1-5 with per-day breakdown
- specs/build/158-dbt1-normalize-staging-models.md — Phase A/B/C plan (now closing P1)
- specs/build/175a-canonical-survivorship.md — canonical model (78-field design lock pending today's refresh)
- specs/setup/143-ch-dbt2-home.md — CH dbt home (closed; Pass 3c done)
- specs/setup/159-us-ch-replication-group.md — replication group US→CH (closed; 49 tables verified)
- eftours/de-dbt#3868 — bootstrap restructure (MERGED W4 D1)
- eftours/de-dbt#3989 — sources refactor (Adrien-approved · pending merge)
- eftours/de-dbt#4027 — CCAP Phase C pilot real model (Adrien-approved · pending merge)
- eftours/de-dbt#4059 — 7-BU staging contacts extension (Ready for Review · awaiting Adrien)
- Repository CH = US 100% match audit · 49 source tables · 285M+ rows