EF Migration · W4 Status Report

Plan

9-week build · P1 close-out + P3 design lock

The replication pipeline keeps running every night. P1 (bronze staging) is closing out: bootstrap PR merged Monday, Phase B seeds + macros landed, and all 8 BU contacts staging models materialize in DEV against real data. Rich's reconciliation review closed Friday morning — the canonical model field list is now locked at 85 fields, and #175a (P3a canonical) is ready to start once the 3 staging PRs merge. Splink IR (P2) build window is the next workstream, in parallel to the canonical build.

P0 · W1 Foundation

P1 · W2-4 Normalize + Replication

P2 · W5-6 Splink IR

P3 · W5-6 Canonical + CIs

P4 · W6-7 SFMC + Marts

P5 · W8 Cutover

P6 · W9 + Jul Handoff

Receiving team unchanged: Mike Grabbe + Adrian LeDoux (ET Data Engineering). Rich transitioning out — weekly standups continue; this report keeps you in the loop.

Three milestones this week

W4 cleared the critical-path bottleneck

Two long-running blockers cleared this week — Rich's reconciliation review came back, and Adrien approved the first two of three staging PRs. Plus the structural verification of the CH ↔ US replication confirmed the pipeline is healthy across all 49 source tables (rows + columns + bytes identical). Net: canonical build (#175a) can start, blocked only on the third PR review.

Milestone 1 · field decisions

Rich locked canonical at 85 fields

Friday morning Rich returned the DC-direct verification XLS with his final decisions on the 17 ambiguous fields:

6 fields Drop (gone from canonical + staging)
1 field Keep in canonical (Contact_Restriction_Code · ET+EFSA scope)
2 fields Keep on Source only (per-BU staging, not promoted to unified): Departure_Year (HSEY), Is_Primary_Contact (CCAP+Academy)
1 field Use for CI: Is_Currently_Hosting (cross-BU CCAP+HSEY)
7 ghost fields Derive in DBT, keep on source only — Rich chose to derive them INTO the canonical from CCAP source (the only BU with the data) rather than deploy as DMO columns. No DC metadata changes needed; passthrough survivorship since CCAP is the only source.

Canonical model #175a active UI fields: 89 → 85 (77 implicit Keep + 1 explicit Keep + 7 ghosts derived from CCAP). Design-locked.

Milestone 2 · code review

Adrien approved 2 of 3 staging PRs

Friday morning Adrien approved both foundational PRs:

#3989 · sources refactor — splits monolithic sources.yml into per-product files inside models/staging/<product>/
#4027 · CCAP Phase C pilot — first real stg_ccap__contacts (878,903 rows, lead-source normalized, survivorship clock wired)

The 7-BU extension PR (#4059) is still pending Adrien's review — gentle follow-up sent. Likely waits until #4027 lands so review happens against the merged base.

Milestone 3 · pipeline audit

CH = US 100% across 49 tables

Comprehensive sweep across every registered source table for the dbt project — verified row count, column count, and storage bytes are identical between US consolidation (EF_DATA_HUB_RAW) and CH replica:

49 tables across 9 schemas (SF orgs, shared DBs, Juno Kafka, NeverBounce)
285 M+ rows total · all match exactly
Original Fivetran sources sit ~0.01-0.02% above CH (nightly task lag — expected)

Replication group is healthy. No structural drift, no missing tables.

Done this week

What landed in W4

1
Bootstrap PR eftours/de-dbt#3868 merged Monday (W4 D1). The vanilla-audit restructured project (8 → 4 vanilla schemas, UPPERCASE prefixed, sources outside models/) is now on main. Pass 3c DDL executed on HH82036 — dropped the 8 obsolete unprefixed schemas across DEV/QA/PROD (24 schemas total). The cross-repo project surface is now clean and Phase B work moves forward on top of the merged base.
Closes #143 + #155 + #158 Phase A · PR #3868 also brought Phase B seeds (16-cat lead_source_mappings · 82-row language_iso_mappings) and 8 normalize macros (lead_source, language_iso, gender, geo, age, contactable, etc.) live in PROD
2
Phase C pilot — stg_ccap__contacts ported end-to-end (W4 D3). First real per-BU staging model. UNION of Contact (host-family record type, non-au-pair, US-only) + Lead (US, non-deleted, not-yet-converted), LEFT JOIN to Account for is_primary_contact, full utm_source normalization (~20 mappings preserved from legacy), lead_source_normalized via seed JOIN, survivorship_ts macro wired. Materializes 878,903 rows in DEV (346,778 Contact + 532,125 Lead) with 81.2% seed coverage on lead_source_normalized.
PR eftours/de-dbt#4027 · CI green · Adrien-approved Friday
3
7 BU staging contacts extended (W4 D4) — same pattern applied to the other 7 BUs. Real ports for academy (PersonAccount + Contact-Guardian UNION, 2.77 M rows), hsey (Contact + Lead UNION, 1.59 M), language (Juno Kafka source, 33.4 M), student_tours (data_hub_share, 5.79 M), study_abroad (PLB share, 16.1 k), gap_year (PLB share, 132 k), world_journeys (multi-tenant via BUSINESS_CODE, 8.69 M). Each materializes ≤ 2.4 s in DEV. Row counts reconcile to filter logic to the exact row — verified independently via raw-source count queries.
PR eftours/de-dbt#4059 · Ready for Review since Thursday EOD · awaiting Adrien (was draft when he reviewed the others on Friday AM)
4
Sources refactor PR (#3989) Adrien-approved. Splits the monolithic sources/sources.yml (~181 lines, 9 sources in one file) into per-product _<product>__sources.yml files inside each models/staging/<product>/ folder, aligning with the dbt Labs 1.12 best practice Adrien flagged on May 13. Bonus fix on a missing Fivetran freshness config for salesforce_ccap. No new logic — just organization.
Approved Friday AM (post Friday morning re-nudge) · awaits merge
5
Rich's reconciliation review closed Friday morning — canonical at 78 fields. Rich returned the DC-direct verification XLS with his decisions on all 17 ambiguous fields: 6 Drop · 1 Keep · 2 Keep-on-Source-only · 1 Use-for-CI · 7 ghost fields Derive-in-DBT (rather than deploy as DMO columns). No DC metadata changes needed. Canonical model #175a active field count: 89 → 78. Design-locked.
Reply saved: docs/sf-analysis/unifiedindividual-fillrate-rich-final-decisions-2026-05-22.csv · Summary: docs/sf-analysis/unifiedindividual-rich-final-decisions-summary-2026-05-22.md
6
DC-direct verification + 49-table CH ↔ US audit. Earlier in the week (W4 D3-D4), did a full DC-direct verification of the 17 ambiguous fields against UnifiedIndividual__dlm in datahub_prod (15.09 M rows) — including the surprise finding that the 7 "ghost" fields don't actually exist on the DMO (PROPOSED in dataCloudMappings/ + recon rules but never deployed). Then verified the CH-side replica matches US consolidation 100% on row count, column count, AND storage bytes across all 49 source tables (285 M+ rows total). Original Fivetran sources sit ~0.01-0.02% above CH due to nightly task lag — expected.
PR #317 (Rich XLS) merged W4 D4 · RLT-3812 commented with the per-BU replication match table

Pipeline + bronze layer live

What's running every night + what's now built on top

Source tables on CH

across 9 schemas in EF_DATA_HUB_RAW · CH = US 100% match (rows + cols + bytes)

Staging contacts models · real

8 / 8

all 8 BUs ported · 52 M+ rows materialized in DEV (1 in PROD post #4027 merge, +7 post #4059)

Total dbt models in DAG

8 contacts (real) + ~49 placeholders for other entities · CI-validated

Nightly schedule (UTC)

Time	Step	Owner
`00:00`	Root task `COPY_ROOT` fires; 48 children copy from source DBs into `EF_DATA_HUB_RAW` on US	Snowflake task graph (autonomous)
`~00:30`	All 48 children complete · row counts logged	Snowflake
`01:00`	Replication group `EF_DATA_HUB_RAW_TO_CH_RG` refresh · pulls deltas to CH	Snowflake replication
`~01:30+`	Available on HH82036 · ready for dbt build	—

Schemas + sizes (CH side)

Schema	Tables	Rows	Size	Source
`data_hub_share`	7	45.6 M	8.07 GB	ET / Student Tours share
`language_kafka`	9	209.6 M	35.20 GB	EU Language (Juno + Poseidon Kafka)
`wojo`	3	11.5 M	0.65 GB	World Journeys share
`salesforce_ccap`	7	8.6 M	1.59 GB	CCAP Fivetran landing
`salesforce_academy`	8	4.6 M	0.50 GB	Academy Fivetran landing
`salesforce_hsey`	6	2.9 M	0.36 GB	HSEY Fivetran landing
`ef_us_uploads`	1	0.75 M	0.01 GB	NeverBounce upload (audit pending)
`higher_ed_gy`	4	0.16 M	0.02 GB	Higher Ed share — Gap Year
`higher_ed_sa`	4	0.18 M	0.02 GB	Higher Ed share — Study Abroad

EVALUATION_QUESTION_RESPONSES_V from W2 has been on CH for a week now — included in the 49 source count above.

Schema surface on HH82036 (post Pass 3c · obsolete schemas dropped W4 D3)

Database	Schema	Layer	Current content
`EF_DBT_DEV`	`EF_DATA_HUB_CH_STAGING`	bronze	8 real `stg_<bu>__contacts` models · 49 placeholders for other entities (opportunities, sales_orders, contact_point_*) · all CI-validated
	`EF_DATA_HUB_CH_INTERMEDIATE`	silver	Empty · canonical model (#175a · 85 UI fields locked) + Splink IR (#161) land here
	`EF_DATA_HUB_CH_CORE`	silver	Empty · ~37 CIs (#166) + segments (#142)
	`EF_DATA_HUB_CH_MARTS`	gold	Empty · Cortex agent feeds · cross-regional sharing · BI consumers (#224)
Same 4-schema set mirrored in `EF_DBT_QA` + `EF_DBT_PROD` (12 total). The 8 obsolete unprefixed schemas were dropped W4 D3.

Board state

Where each in-flight ticket landed

Closed this week

Ticket	Title	Why closed
#143	CH dbt home — Pass 3 final	Pass 3c executed W4 D3 — 8 obsolete schemas dropped across DEV/QA/PROD (24 total)
#155	Bootstrap single dbt project `ef_data_hub_ch`	PR #3868 merged W4 D1 via merge queue
#225	Pre-spec audit · reconciliation rules + IR + field normalization	Rich's final decisions on 17 ? fields received Friday morning · canonical design-locked at 85 UI fields
#158 Phase A	Staging placeholders for 8 products	Superseded by Phase C — 8 BU staging contacts ports landed real (in PRs #4027 + #4059)

Now ready for the build · pending PR merges only

Phase 1 · close-out · ready

#158 Phase C — 8 BU contacts ports + per-product sources refactor

PR #4027 (CCAP exemplar · 878 k rows) and PR #3989 (sources refactor) both Adrien-approved Friday morning. PR #4059 (the 7 other BUs · 52 M rows) ready for review since Thursday EOD; promotion from DRAFT happened after Adrien's first review pass, so the next pass will pick it up. Together they close P1 bronze.

Gates on: Adrien merging #3989 + #4027 (already approved) + reviewing/approving #4059

Phase 3 · build · canonical

#175a — canonical model (85 UI fields locked)

Field list locked at 85 UI fields after Rich's Friday decisions: 77 implicit Keep + 1 explicit Keep (Contact_Restriction_Code) + 7 ghosts derived from CCAP source INTO canonical (Rich's option (b) — derive in dbt rather than deploy DMO column). The 6 Drops, 2 Keep-on-Source-only, and 1 Use-for-CI are out. Spec refreshed today. Build commits can start against DEV stg views (no longer gated on the staging PRs for design — just for materialization in PROD).

Gates on: spec #175a refresh · staging PR merges for PROD ref()

Still pending or stakeholder-blocked

PR #4059 — Adrien review (was draft when he reviewed the other two on Friday AM; gentle re-nudge sent, expecting Monday)
#142 — Segments reframed: MC primary / Snowflake Plan B; Vivek call gates the methodology
#161 + #162 — Splink IR + parity QA (incorporate Amon's modified 10-rule design from the parallel CH workstream's SF PS session); next workstream after canonical build kickoff
#224 — Marts sublayer (Cortex / cross-regional / BI consumers); spec write pending after #158 Phase C fully merges
#136-#140 — RLT-* normalize tickets: tracked as follow-up enrichments on the per-BU staging models once #4059 merges (nationality maps, host family aggregation, contact_origin_bu split, etc.)
RLT-3991 — 5 HSEY closed-lost codes (CN/CY/EF/APP/INV → "Other"): Rich to confirm

Budget snapshot

$25 k NTE · spent vs remaining

NTE cap

$25,000

SOW Amendment #2 (29 Apr)

Spent to date

$13,640

95.0 hours · 54.6% of NTE

Remaining

$11,360

≈ 81 h at full-time rate ($140 / h)

W4 burn

18 h

$2,560 · P1 close-out · Phase C ports · CH↔US audit

NTE consumption

54.6% consumed end of W4. Trajectory still healthy: P0 + P1 done in 4 active weeks with ~45% of the cap reserved for P2 (Splink IR) through P6 (handoff). EF-engagement work only — the parallel ohanacloud-CH workstream (IR design with SF PS) is tracked separately and not billed against this NTE.

How this is calculated

Three rate tiers per SOW Amendment §4 (depending on day intensity):

Day intensity	Rate	Applied to
Full-time (> 4 h/day)	$140 / h	Most active days · ~83 h × $140 ≈ $11,620
Part-time (2.1–4 h/day)	$160 / h	3 days · ~10 h × $160 ≈ $1,600
Ad-hoc (≤ 2 h/day)	$180 / h	3 days · ~2-4 h × $180 = $360-720

Total: 95 h · $13,640. Detailed daily breakdown lives in sprint-log.md.

Phase status

Phase	SOW low (h)	Actual h	Actual $	Status
P0 · M1 Foundation & access (W1)	16	14.0	$2,040	Done
P1 · dbt #1 normalize + replication (W2-4)	25	78.0	$11,120	Closing · 312% h · 278% $
P2 · Splink IR (W5-6)	37	0	$0	Upcoming
P3 · Canonical + CIs (W5-6)	37	3.0	$480	Design locked · build pending PR merges
P4 · SFMC + Marts (W6-7)	56	0	$0	Upcoming
P5 · Cutover (W8)	14	0	$0	Upcoming
P6 · Handoff + Hypercare (W9 + Jul)	18	0	$0	Upcoming

P1 closed at 78 h vs 25-h SOW estimate (over-run absorbed). Phase scope materially expanded over W2-W4: #143 + #159 + #155 bootstrap + vanilla-audit restructure + CI debug + Pass 3 DDL + Phase B seeds + Phase C real ports for all 8 BUs + CH↔US 49-table audit. Cumulative NTE at 54.6% leaves ~$11.4 k for P2 → P6. Pacing P5/P6 conservatively; no escalation needed.

What's next

W5 (26 May – 1 Jun) · canonical build kickoff · IR design intake

P3a · canonical_individual model build

With the design locked at 78 fields and 8 BU staging contacts views materialized in DEV, the canonical model build is the natural next bite. canonical_individual sits in EF_DATA_HUB_CH_INTERMEDIATE and references the 8 per-BU staging views through the survivorship pattern. Splink IR (#161) work runs in parallel — the rule design from the parallel CH workstream's SF PS sessions can start integrating here.

Build deliverables · W5

What lands by Friday 29 May

Spec #175a refresh · 78-field canonical list locked with Rich's 4 decision categories documented
Canonical scaffold · canonical_individual.sql in models/intermediate/canonical/ reading from ref('stg_<bu>__contacts') × 8
Survivorship CTE pattern · 6-rule-type framework wired (RT1 source-priority → RT6 aggregate) classified across the 78 fields
3 BU contacts enrichments · biggest-impact TODOs landed: HSEY nationality→ISO2 + host-family aggregation · Academy nationality + guardian dedup · Student Tours contact_origin_bu split
3 Google_Analytics_* cols · added to stg_ccap__contacts (Rich's "derive in DBT" decision)
PR #4059 merged (assumes Adrien reviews early-week)

Stakeholder asks · W5

What we need from EF / ET

Adrien — review + merge PR #4059 (was draft when he checked the other two on Friday AM; re-nudge sent)
Rich — confirm 5 HSEY closed-lost codes for RLT-3991 (CN/CY/EF/APP/INV → "Other") to close that ticket
Vivek (MC) — still gating #142 segments methodology call
Mike / Adrian — NeverBounce source status confirmation (5+ months stale)
Mike — sanity check on first 4 weeks of replication credits to size the cost line

Looking past W5 — phase trajectory

Window	Phase	What gets built
W5-6 (26 May - 8 Jun)	P2 + P3a	Canonical model (78-field `canonical_individual` with 6-RT survivorship CTE pattern) · Splink IR pipeline kickoff (blocking, fuzzy, graph resolution) · staging BU enrichments (nationality maps, host-family aggregation, contact_origin_bu split)
W6-7 (3-16 Jun)	P3b + P4	~37 CIs (including new `ci__is_currently_hosting` from Rich's decision) · IR parity QA vs DC's UnifiedIndividual · marts pre-aggregated for Cortex agent · Segments (MC primary or Snowflake fallback) · SFMC engagement via S3
W7-8 (10-23 Jun)	P4 + P5	Marketing Cloud integration · CloudPage UI · cutover runbook · parallel-run reconciliation report · Prefect orchestration · DC schedule disable plan
W9 + Jul	P6	Handoff to Mike Grabbe + Adrian LeDoux · operational runbook · KT sessions · 2-cycle hypercare · formal sign-off checklist

Critical path now: PR #4059 merge → #175a canonical_individual → #161 Splink IR (parallel) → #166 CIs → #142 segments → cutover. The 3 open PRs are the last bronze-layer gate.

Reference material

Where to dig deeper

Engagement docs

Plan + commercial baseline

Migration plan deck + Architecture options + Structure proposal + Vanilla audit deck
SOW Amendment #2 (signed 29 Apr 2026) — NTE $25 k, T&M capped, Net 30
Sprint log — weekly hours and budget tracker

This week's deliverables

Audits + decisions captured

docs/sf-analysis/unifiedindividual-fillrate-rich-final-decisions-2026-05-22.csv — Rich's reply with the 17 ? field decisions
docs/sf-analysis/unifiedindividual-rich-final-decisions-summary-2026-05-22.md — per-category breakdown + downstream actions
docs/sf-analysis/unifiedindividual-fillrate-for-rich-2026-05-21-dc-verified.xlsx — the DC-direct verification XLS shared with Rich on Wed
Sprint log W4 days 1-5 with per-day breakdown

Specs landed / updated

Build-ready specs

specs/build/158-dbt1-normalize-staging-models.md — Phase A/B/C plan (now closing P1)
specs/build/175a-canonical-survivorship.md — canonical model (78-field design lock pending today's refresh)
specs/setup/143-ch-dbt2-home.md — CH dbt home (closed; Pass 3c done)
specs/setup/159-us-ch-replication-group.md — replication group US→CH (closed; 49 tables verified)

Cross-repo + tooling

Outside this repo

eftours/de-dbt#3868 — bootstrap restructure (MERGED W4 D1)
eftours/de-dbt#3989 — sources refactor (Adrien-approved · pending merge)
eftours/de-dbt#4027 — CCAP Phase C pilot real model (Adrien-approved · pending merge)
eftours/de-dbt#4059 — 7-BU staging contacts extension (Ready for Review · awaiting Adrien)
Repository CH = US 100% match audit · 49 source tables · 285M+ rows

EF Migration · Status

From placeholders to real per-BU staging models · canonical design locked