Data sources
One card per source. Add the card BEFORE writing the ingester — it forces clarity on schema and known issues. Update Status as it lands.
Card conventions
- Cadence: how often the source publishes
- Auth: none / API key / scraping / manual
- Format: JSON / CSV / Excel / PDF / HTML
- Lag: typical delay between event and publication
- Bronze schema: shape we land in bronze/
- Known issues: gotchas that broke things or might
- Status: Implemented / In Progress / Not Started / Tracker only
NASA POWER (hourly weather)
- URL: https://power.larc.nasa.gov/api/temporal/hourly/point
- Cadence: continuous, ~1 day lag
- Auth: none
- Format: JSON
- Coverage: 1981-01-01 to ~yesterday
- Limits: 366 days per hourly request; recommended < 1 req/sec
- Variables: ALLSKY_SFC_SW_DWN (GHI), ALLSKY_SFC_SW_DNI (DNI), ALLSKY_SFC_SW_DIFF (DHI), CLRSKY_SFC_SW_DWN, T2M, RH2M, PS, WS10M, WS50M
- Spatial: the underlying MERRA-2 grid is ~0.5°×0.625°, so querying at finer spacing returns interpolated values. We use a 0.25° JB grid (9 points: lat ∈ {1.25, 1.5, 1.75} × lon ∈ {103.5, 103.75, 104.0})
- Bronze schema (wide): timestamp_utc, lat, lon, ghi, dni, dhi, ghi_clr, t2m, rh2m, ps, ws10m, ws50m, _ingested_at, _source_url, _source_hash
- Partition: yyyymm=YYYYMM/run_id=YYYYMMDDTHHMMSSZ/data.parquet
- Known issues: requesting > 366 days returns HTTP 400; community="RE" gives slightly different naming than community="AG"; missing values come back as -999 (we coerce to null); temperature is sometimes missing for very recent days (use ERA5 to fill)
- Status: Implemented
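The 366-day limit and the -999 sentinel on this card can be handled with two small helpers. This is a minimal sketch rather than the actual ingester: the names split_window, coerce_fill and GRID are hypothetical, and it assumes the limit counts both endpoint days of an inclusive window.

```python
from datetime import date, timedelta
from itertools import product

MAX_DAYS = 366        # per-request limit for hourly data, from the card
FILL_VALUE = -999.0   # POWER's missing-value sentinel, from the card

# 0.25° JB grid from the card: 3 latitudes × 3 longitudes = 9 points
GRID = list(product([1.25, 1.5, 1.75], [103.5, 103.75, 104.0]))


def split_window(start: date, end: date, max_days: int = MAX_DAYS):
    """Yield inclusive (chunk_start, chunk_end) pairs of at most max_days days."""
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, chunk_end
        cursor = chunk_end + timedelta(days=1)


def coerce_fill(value):
    """Map the -999 sentinel to None; pass real readings through unchanged."""
    return None if value == FILL_VALUE else value
```

A two-year window such as 2020-01-01 to 2021-12-31 then becomes two requests per grid point, one per calendar year.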
ERA5 (Copernicus CDS reanalysis)
- URL: https://cds.climate.copernicus.eu/api/v2 via cdsapi
- Cadence: ~5 day lag
- Auth: free signup at CDS; ~/.cdsapirc with UID:KEY
- Format: NetCDF (we'll request NetCDF rather than GRIB)
- Coverage: 1940–present
- Resolution: 0.25° native
- Variables: ssrd (surface solar radiation downwards), 2t, 10u, 10v, sp
- Bronze schema: same wide shape as NASA POWER, but with an _era5 suffix to disambiguate when joined
- Known issues: the queue can take hours; downloads are async; large windows return GB-scale files. Plan: split per month, at most 3 parallel jobs.
- Status: Not Started
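The per-month split with at most 3 parallel jobs could look roughly like the sketch below. Nothing here exists yet (the card is Not Started): month_starts, fetch_month and fetch_all are hypothetical names, and fetch_month is a placeholder that only returns a file name instead of calling cdsapi.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

MAX_PARALLEL = 3  # cap taken from the card's plan


def month_starts(start: date, end: date):
    """Return the first day of every month touched by [start, end]."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def fetch_month(first_day: date) -> str:
    """Placeholder: a real version would submit one CDS request for this
    month and return the path of the downloaded NetCDF file."""
    return f"era5_{first_day:%Y%m}.nc"


def fetch_all(start: date, end: date):
    """Run one job per month, never more than MAX_PARALLEL at a time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # pool.map keeps results in the same order as the input months
        return list(pool.map(fetch_month, month_starts(start, end)))
```

Threads suit this workload because each job mostly waits on the CDS queue and the download, not on local CPU.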
TNB ETOU (time-of-use schedule)
- URL: https://www.tnb.com.my/commercial-industrial/pricing-tariffs1
- Cadence: rare changes; latest revision documented as PDF
- Auth: none (manual download)
- Format: PDF
- Bronze schema: effective_date, day_type, hour_start, hour_end, rate_sen_per_kwh, customer_class
- Known issues: peak/off-peak hours differ by day type (weekday/Saturday/Sunday/public holiday); maximum demand (MD) charges are separate from energy charges; tariff codes change over time
- Status: Not Started
TNB ICPT (imbalance cost surcharge)
- URL: ST publishes the surcharge/rebate (https://www.st.gov.my/en/general/electricity-tariff)
- Cadence: every 6 months
- Auth: none (manual download)
- Format: PDF / press release
- Coverage: 2014–present
- Bronze schema: period_start, period_end, customer_class, surcharge_sen_per_kwh
- Known issues: structure changed in 2022 (subsidy eligibility split); industrial vs domestic split
- Status: Not Started
EMC USEP (Singapore wholesale price)
- URL: https://www.emcsg.com/marketdata/priceinformation
- Cadence: 30-minute publication, near real-time
- Auth: none
- Format: CSV
- Coverage: 2003–present
- Bronze schema: timestamp_sgt, period (1–48), usep_sgd_per_mwh, eep, lcp, ...
- Why we need it: ENEGEM-side reference (the export-leg revenue cap).
- Known issues: SGT = MYT (no offset); period numbering 1–48 (half-hours); occasional intra-day revisions
- Status: Not Started
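Turning the 1–48 period number into a timestamp is simple arithmetic. The sketch below assumes period 1 covers 00:00–00:30 SGT and that timestamp_sgt marks the start of the half-hour; both should be checked against the EMC file before use. period_start is a hypothetical helper name.

```python
from datetime import date, datetime, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")  # same UTC+8 offset as MYT


def period_start(trading_date: date, period: int) -> datetime:
    """Return the SGT start time of half-hour period 1..48 on trading_date."""
    if not 1 <= period <= 48:
        raise ValueError(f"period must be in 1..48, got {period}")
    midnight = datetime(
        trading_date.year, trading_date.month, trading_date.day, tzinfo=SGT
    )
    return midnight + timedelta(minutes=30 * (period - 1))
```

Calling .astimezone(timezone.utc) on the result gives a UTC timestamp that can be joined against the timestamp_utc columns of the weather sources.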
ENEGEM (Energy Exchange Malaysia)
- URL: https://www.enegem.com.my (announcements only; no time series)
- Cadence: irregular auction announcements
- Auth: none for announcements; participation requires registration
- Format: PDF / HTML announcements
- Coverage: 2024–present (new market)
- Bronze schema: announcement_date, auction_window_start, auction_window_end, capacity_mw, counterparty, status
- Known issues: no public clearing price feed, so we can only build a structural model, not a backtest. Announcements are kept as a manually curated yaml/parquet log.
- Status: Tracker only (planned)
SEDA NEM (net energy metering)
- URL: https://www.seda.gov.my/reportal/nem/
- Cadence: quota windows announced periodically
- Auth: none
- Format: HTML tables / PDFs
- Coverage: NEM 1.0 / 2.0 / 3.0 historical + current
- Bronze schema: window_id, scheme_version, state, customer_class, quota_mw, allocated_mw, applicants
- Why we need it: BTM PV penetration ceiling for Johor
- Status: Not Started
DOSM / OpenDOSM (api.data.gov.my)
- URL: https://api.data.gov.my/data-catalogue?id=<dataset_id>
- Auth: none. Follows redirects (HTTP 301 → trailing-slash variant).
- Format: JSON list
- Datasets pulled:
  - electricity_consumption: national monthly by sector (2018-01 to 2024-06, 78 months × 6 sectors = 468 rows). Sectors: total, local, local_commercial, local_domestic, exports, losses. No state breakdown (peninsular national only).
  - electricity_supply: same shape as consumption, supply side.
  - gdp_state_real_supply: annual GDP by state × sector × series (2015–2023, 16 states × 7 sectors × 2 series). Used for Johor allocation share. Empirical 2023: Johor = 11.1% of peninsular GDP.
  - population_state: annual by state × age × sex × ethnicity (large; filter to overall_* in silver).
- Bronze schema: typed columns mirror API JSON; date parsed to Date; value cols cast to Float64; standard _ingested_at, _source_url, _source_hash provenance.
- Partition: bronze/macro/dosm/<dataset>/run_id=YYYYMMDDTHHMMSSZ/data.parquet (no time partition; small enough for a single parquet per run)
- Known issues: NO state-level electricity dataset exists in DOSM's catalogue (verified 2026-05-07). For Johor electricity we either (a) compute national × GDP share, or (b) parse ST annual report PDFs (Stage 2). Industrial vs commercial is not separated in electricity_consumption; both are inside local_commercial.
- Status: Implemented (4 datasets)
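Option (a), national × GDP share, amounts to one multiplication per row. The sketch below uses the 11.1% figure from this card as its default; allocate_by_share is a hypothetical name, and the assumption that electricity use tracks GDP share is a modelling choice that the DOSM data does not confirm.

```python
JOHOR_GDP_SHARE_2023 = 0.111  # from the card: Johor = 11.1% of peninsular GDP


def allocate_by_share(national_rows, share=JOHOR_GDP_SHARE_2023):
    """Scale each (month, gwh) row by a fixed GDP share.

    national_rows is any iterable of (month, gwh) pairs; the result keeps
    the same months with the scaled GWh values.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError(f"share must be within [0, 1], got {share}")
    return [(month, gwh * share) for month, gwh in national_rows]
```

A single annual share applied to monthly rows ignores seasonality differences between Johor and the rest of the peninsula, so the output is best treated as a first-order estimate.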
ST Statistics on Energy Sector
- URL: https://www.st.gov.my/sites/default/files/2026-02/Malaysia_Energy_Statistics_Handbook_2023.pdf (8 MB, 104 pages)
- Cadence: annual
- Auth: none (public PDF)
- Format: PDF, English; pdfplumber works for table extraction
- Coverage in 2023 edition: 2015–2021 historical
- Stored at: data/bronze/macro/st-pdf/MESH_2023.pdf
- Extracted to: reference/electricity/peninsular_annual.yaml (annual peninsular GWh, 2015–2021)
- What's NOT in it (verified by full-PDF scan, 2026-05-07): state-level electricity (Johor or any other state) is not published. MESH only goes to Peninsular / Sabah / Sarawak granularity. SAIDI/SAIFI/CAIDI ARE published by state, but consumption is not.
- Implication: Johor electricity must be modelled (see reference/electricity/johor_share.yaml). The PSI report (Performance-and-Statistical-Information-{YEAR}.pdf) is only publicly accessible up to 2014 and was not pursued.
- Status: PDF stored, manual transcription of peninsular data done. Full table-parser not built (single annual update doesn't justify the effort).
Ember (Malaysia monthly generation)
- URL: https://ember-energy.org (monthly electricity data API)
- Cadence: monthly, ~2 week lag
- Auth: none
- Format: CSV
- Bronze schema: month, country=MY, generation_source, gwh
- Status: Not Started
Senai METAR (WMKJ)
- URL: https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py
- Cadence: hourly observations + irregular SPECI for weather changes (~24–28 rows/day typical)
- Auth: none
- Format: CSV (format=onlycomma)
- Variables: tmpc, dwpc, relh, sknt, drct, vsby, gust, alti, mslp, p01i
- Bronze schema: timestamp_utc, station, tmpc, dwpc, relh, sknt, drct, vsby, gust, alti, mslp, p01i, _ingested_at, _source_url, _source_hash
- Partition: yyyymm=YYYYMM/run_id=YYYYMMDDTHHMMSSZ/data.parquet
- Why we need it: ground truth to validate NASA POWER + ERA5 over JB. Empirical: POWER is ~1.4 ± 1.0 °C warmer than METAR at WMKJ (Jan 2024 sample), within typical reanalysis-vs-observation bias for coastal tropics.
- Known issues: day2/month2/year2 is exclusive (the API treats it as the end-of-window midnight). The ingester adds 1 day to end before sending so our inclusive Window contract holds. Missing values come back as null (or sometimes M); both are coerced to null. vsby=9.99 means "10+ miles unlimited"; bronze keeps as-is; silver should clip.
- Status: Implemented
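The exclusive-end adjustment and the missing-value coercion described in the known issues can be sketched as follows. This is not the ingester's actual code: window_params and parse_value are hypothetical names, and the start-side parameters year1/month1/day1 are assumed by symmetry with the year2/month2/day2 names on this card.

```python
from datetime import date, timedelta

MISSING_TOKENS = {"", "M", "null"}  # markers that should land as null


def window_params(start: date, end: date) -> dict:
    """Build query parameters for an inclusive [start, end] window.

    The end date is shifted by one day because the API treats
    year2/month2/day2 as an exclusive end-of-window midnight.
    """
    if end < start:
        raise ValueError("end must not be before start")
    exclusive_end = end + timedelta(days=1)
    return {
        "year1": start.year, "month1": start.month, "day1": start.day,
        "year2": exclusive_end.year,
        "month2": exclusive_end.month,
        "day2": exclusive_end.day,
    }


def parse_value(raw):
    """Return None for missing markers, otherwise the value as a float."""
    if raw is None or raw.strip() in MISSING_TOKENS:
        return None
    return float(raw)
```

Doing the shift in one place keeps callers on the inclusive convention, including across month and year boundaries.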
Adding a new source
- Add the card here first (PR-able artifact even before code)
- Add an ingester module under src/jb_vpp/ingest/<source>.py exposing run(...)
- Wire a CLI subcommand in src/jb_vpp/cli.py
- Add unit tests under tests/test_<source>.py with respx for HTTP mocking
- Update Status here when green
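Whatever signature run(...) takes, the implemented cards above all land the same three provenance columns and the same run_id partition layout, so a new ingester will probably need helpers along these lines. This is a sketch under assumptions: the function names are hypothetical, SHA-256 is a guess at the hash behind _source_hash, and the ISO 8601 format for _ingested_at is likewise assumed.

```python
import hashlib
from datetime import datetime, timezone


def make_run_id(now: datetime) -> str:
    """Format a timezone-aware datetime as a run_id such as 20240115T030405Z."""
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def partition_path(timestamp_utc: datetime, run_id: str) -> str:
    """Relative path following the yyyymm=/run_id= layout used by the cards."""
    return f"yyyymm={timestamp_utc:%Y%m}/run_id={run_id}/data.parquet"


def add_provenance(rows, source_url: str, payload: bytes, now: datetime):
    """Return copies of the row dicts with the three provenance columns set."""
    stamp = {
        "_ingested_at": now.astimezone(timezone.utc).isoformat(),
        "_source_url": source_url,
        # SHA-256 of the raw response body; the algorithm is an assumption
        "_source_hash": hashlib.sha256(payload).hexdigest(),
    }
    return [{**row, **stamp} for row in rows]
```

Hashing the raw payload rather than the parsed rows means two runs can be compared for upstream changes without re-parsing.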