Data sources

One card per source. Add the card BEFORE writing the ingester — it forces clarity on schema and known issues. Update Status as it lands.

  • Cadence: how often the source publishes
  • Auth: none / API key / scraping / manual
  • Format: JSON / CSV / Excel / PDF / HTML
  • Lag: typical delay between event and publication
  • Bronze schema: shape we land in bronze/
  • Known issues: gotchas that broke things or might
  • Status: Implemented / In Progress / Not Started / Tracker only

  • URL: https://power.larc.nasa.gov/api/temporal/hourly/point
  • Cadence: continuous, ~1 day lag
  • Auth: none
  • Format: JSON
  • Coverage: 1981-01-01 to ~yesterday
  • Limits: 366 days per hourly request; recommended < 1 req/sec
  • Variables: ALLSKY_SFC_SW_DWN (GHI), ALLSKY_SFC_SW_DNI (DNI), ALLSKY_SFC_SW_DIFF (DHI), CLRSKY_SFC_SW_DWN, T2M, RH2M, PS, WS10M, WS50M
  • Spatial: the underlying MERRA-2 grid is ~0.5°×0.625°, so querying at a finer spacing only returns interpolated values. We use a 0.25° JB grid (9 points: lat ∈ {1.25, 1.5, 1.75} × lon ∈ {103.5, 103.75, 104.0})
  • Bronze schema (wide): timestamp_utc, lat, lon, ghi, dni, dhi, ghi_clr, t2m, rh2m, ps, ws10m, ws50m, _ingested_at, _source_url, _source_hash
  • Partition: yyyymm=YYYYMM/run_id=YYYYMMDDTHHMMSSZ/data.parquet
  • Known issues: requesting > 366 days returns 400; community="RE" gives slightly different naming than community="AG"; missing values come back as -999 (we coerce to null); temperature sometimes missing for very recent days (use ERA5 to fill)
  • Status: Implemented
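The request limit, the 9-point grid, and the -999 sentinel from this card can be sketched as below. This is a minimal illustration, not the ingester itself: the helper names `split_window` and `coerce_missing` are hypothetical, and it assumes the 366-day limit counts both the first and last day of the window.

```python
from datetime import date, timedelta
from itertools import product

MAX_DAYS = 366      # per-request limit for hourly data (from the card)
MISSING = -999.0    # sentinel the API returns for missing values

# 0.25° JB grid from the card: 3 latitudes × 3 longitudes = 9 points.
GRID = list(product((1.25, 1.5, 1.75), (103.5, 103.75, 104.0)))


def split_window(start: date, end: date, max_days: int = MAX_DAYS):
    """Yield inclusive (chunk_start, chunk_end) pairs of at most max_days days."""
    cur = start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=max_days - 1), end)
        yield cur, chunk_end
        cur = chunk_end + timedelta(days=1)


def coerce_missing(value):
    """Map the -999 sentinel to None; pass every other value through."""
    return None if value == MISSING else value


# A two-year window is split into two requests per grid point.
chunks = list(split_window(date(2022, 1, 1), date(2023, 12, 31)))
```

Each chunk would then be requested once per grid point, with a pause between calls to stay under the recommended 1 req/sec.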
  • URL: https://cds.climate.copernicus.eu/api/v2 via cdsapi
  • Cadence: ~5 day lag
  • Auth: free signup at CDS; ~/.cdsapirc with UID:KEY
  • Format: NetCDF (we’ll request NetCDF over GRIB)
  • Coverage: 1940–present
  • Resolution: 0.25° native
  • Variables: ssrd (surface solar rad downwards), 2t, 10u, 10v, sp
  • Bronze schema: same wide shape as NASA POWER but suffix _era5 to disambiguate when joined
  • Known issues: queue wait can be hours; download is async; large windows return GB-scale files. Plan: split per-month, parallel ≤ 3 jobs.
  • Status: Not Started
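The "split per-month, parallel ≤ 3 jobs" plan could look roughly like the sketch below. Since this ingester is not started, everything here is an assumption: `fetch_month` is a hypothetical stand-in that only builds a filename, where a real version would submit the request through cdsapi and wait for the download.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

MAX_PARALLEL = 3  # the card's plan: at most 3 concurrent jobs


def month_starts(start: date, end: date):
    """Return the first day of every month touched by [start, end]."""
    out = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        out.append(date(y, m, 1))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out


def fetch_month(first_day: date) -> str:
    """Hypothetical stand-in for one CDS request; returns a target filename."""
    # A real ingester would submit and download the request here.
    return f"era5_{first_day:%Y%m}.nc"


def fetch_all(start: date, end: date):
    months = month_starts(start, end)
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # pool.map keeps results in input order even though jobs overlap.
        return list(pool.map(fetch_month, months))


files = fetch_all(date(2023, 11, 15), date(2024, 2, 3))
```

A thread pool fits here because each job mostly waits on the remote queue rather than using CPU.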
  • URL: https://www.tnb.com.my/commercial-industrial/pricing-tariffs1
  • Cadence: rare changes; latest revision documented as PDF
  • Auth: none (manual download)
  • Format: PDF
  • Bronze schema: effective_date, day_type, hour_start, hour_end, rate_sen_per_kwh, customer_class
  • Known issues: peak/off-peak hours differ by day type (weekday/Saturday/Sunday/PH); MD charges are separate from energy charges; tariff codes change over time
  • Status: Not Started
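Because rates depend on day type, hour band, and revision date, a lookup against the bronze shape needs all three. The sketch below is one possible approach under stated assumptions: the rows, rates, hour bands, and the customer class "X" are placeholders and NOT real tariffs, and the function names are hypothetical.

```python
from datetime import date, datetime

# Placeholder rows in the bronze shape. Rates and hour bands are NOT real.
ROWS = [
    {"effective_date": date(2000, 1, 1), "day_type": "weekday", "hour_start": 8,
     "hour_end": 22, "rate_sen_per_kwh": 2.0, "customer_class": "X"},
    {"effective_date": date(2000, 1, 1), "day_type": "weekday", "hour_start": 22,
     "hour_end": 8, "rate_sen_per_kwh": 1.0, "customer_class": "X"},
] + [
    {"effective_date": date(2000, 1, 1), "day_type": dt, "hour_start": 0,
     "hour_end": 24, "rate_sen_per_kwh": 1.0, "customer_class": "X"}
    for dt in ("saturday", "sunday", "PH")
]


def day_type(d: date, public_holidays=frozenset()) -> str:
    """Classify a date; public holidays take priority over the weekday."""
    if d in public_holidays:
        return "PH"
    return {5: "saturday", 6: "sunday"}.get(d.weekday(), "weekday")


def in_band(hour: int, start: int, end: int) -> bool:
    """True if hour falls in [start, end), allowing bands that wrap midnight."""
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end


def lookup_rate(ts: datetime, customer_class: str, rows=ROWS,
                public_holidays=frozenset()) -> float:
    """Return the rate from the latest revision in force at ts."""
    dt = day_type(ts.date(), public_holidays)
    matches = [
        r for r in rows
        if r["customer_class"] == customer_class
        and r["day_type"] == dt
        and r["effective_date"] <= ts.date()
        and in_band(ts.hour, r["hour_start"], r["hour_end"])
    ]
    if not matches:
        raise LookupError(f"no tariff row for {ts} / {customer_class}")
    return max(matches, key=lambda r: r["effective_date"])["rate_sen_per_kwh"]
```

MD charges are separate from energy charges and are deliberately left out of this lookup.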
  • URL: https://www.st.gov.my/en/general/electricity-tariff (ST publishes the surcharge/rebate here)
  • Cadence: every 6 months
  • Auth: none (manual download)
  • Format: PDF / press release
  • Coverage: 2014–present
  • Bronze schema: period_start, period_end, customer_class, surcharge_sen_per_kwh
  • Known issues: structure changed in 2022 (subsidy eligibility split); industrial vs domestic split
  • Status: Not Started
  • URL: https://www.emcsg.com/marketdata/priceinformation
  • Cadence: 30-minute publication, near real-time
  • Auth: none
  • Format: CSV
  • Coverage: 2003–present
  • Bronze schema: timestamp_sgt, period (1–48), usep_sgd_per_mwh, eep, lcp, ...
  • Why we need it: ENEGEM-side reference — the export-leg revenue cap.
  • Known issues: SGT = MYT (no offset); period numbering 1–48 (half-hours); occasional intra-day revisions
  • Status: Not Started
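The 1–48 period numbering can be turned into timestamps as sketched below. This assumes period 1 starts at 00:00 SGT on the trading day, which the card does not state explicitly; `period_start` is a hypothetical helper name.

```python
from datetime import date, datetime, timedelta, timezone

SGT = timezone(timedelta(hours=8))  # Singapore time: fixed UTC+8, no DST


def period_start(trading_day: date, period: int) -> datetime:
    """Return the SGT start of half-hour period 1..48 on trading_day."""
    if not 1 <= period <= 48:
        raise ValueError(f"period must be 1..48, got {period}")
    midnight = datetime(trading_day.year, trading_day.month,
                        trading_day.day, tzinfo=SGT)
    return midnight + timedelta(minutes=30 * (period - 1))


# Period 17 starts 16 half-hours after midnight, i.e. 08:00 SGT.
start = period_start(date(2024, 1, 1), 17)
start_utc = start.astimezone(timezone.utc)
```

Since SGT and MYT share the same offset, the same timestamps line up with the Malaysian-side series without any shift.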
  • URL: https://www.enegem.com.my (announcements only — no time series)
  • Cadence: irregular auction announcements
  • Auth: none for announcements; participation requires registration
  • Format: PDF / HTML announcements
  • Coverage: 2024–present (new market)
  • Bronze schema: announcement_date, auction_window_start, auction_window_end, capacity_mw, counterparty, status
  • Known issues: no public clearing price feed, so we can only build a structural model, not a backtest. The tracker is a manually curated yaml/parquet log of announcements.
  • Status: Tracker only (planned)
  • URL: https://www.seda.gov.my/reportal/nem/
  • Cadence: quota windows announced periodically
  • Auth: none
  • Format: HTML tables / PDFs
  • Coverage: NEM 1.0 / 2.0 / 3.0 historical + current
  • Bronze schema: window_id, scheme_version, state, customer_class, quota_mw, allocated_mw, applicants
  • Why we need it: BTM PV penetration ceiling for Johor
  • Status: Not Started
  • URL: https://api.data.gov.my/data-catalogue?id=<dataset_id>
  • Auth: none. Follows redirects (HTTP 301 → trailing-slash variant).
  • Format: JSON list
  • Datasets pulled:
    • electricity_consumption — national monthly by sector (2018-01 to 2024-06, 78 months × 6 sectors = 468 rows). Sectors: total, local, local_commercial, local_domestic, exports, losses. No state breakdown — peninsular national only.
    • electricity_supply — same shape as consumption, supply side.
    • gdp_state_real_supply — annual GDP by state × sector × series (2015–2023, 16 states × 7 sectors × 2 series). Used for Johor allocation share. Empirical 2023: Johor = 11.1% of peninsular GDP.
    • population_state — annual by state × age × sex × ethnicity (large; filter to overall_* in silver).
  • Bronze schema: typed columns mirror API JSON; date parsed to Date; value cols cast to Float64; standard _ingested_at, _source_url, _source_hash provenance.
  • Partition: bronze/macro/dosm/<dataset>/run_id=YYYYMMDDTHHMMSSZ/data.parquet (no time partition — small enough for a single parquet per run)
  • Known issues: NO state-level electricity dataset exists in DOSM’s catalogue (verified 2026-05-07). For Johor electricity we either (a) compute national × GDP share, or (b) parse ST annual report PDFs (Stage 2). Industrial vs commercial is not separated in electricity_consumption — both are inside local_commercial.
  • Status: Implemented (4 datasets)
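Option (a) above, national × GDP share, is a single multiplication. The sketch below uses the 11.1% figure from this card; the function name and the 1000 GWh input are hypothetical and only illustrate the arithmetic.

```python
JOHOR_GDP_SHARE = 0.111  # empirical 2023 share of peninsular GDP (from the card)


def johor_estimate(national_gwh: float, share: float = JOHOR_GDP_SHARE) -> float:
    """Allocate a national consumption figure to Johor by GDP share."""
    if not 0.0 <= share <= 1.0:
        raise ValueError(f"share must be within [0, 1], got {share}")
    if national_gwh < 0:
        raise ValueError("consumption cannot be negative")
    return national_gwh * share


# Illustrative input only: 1000 GWh nationally gives about 111 GWh for Johor.
estimate = johor_estimate(1000.0)
```

This treats electricity intensity per unit of GDP as identical across states, which is the main weakness of option (a) relative to parsing the ST reports.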
  • URL: https://www.st.gov.my/sites/default/files/2026-02/Malaysia_Energy_Statistics_Handbook_2023.pdf (8 MB, 104 pages)
  • Cadence: annual
  • Auth: none (public PDF)
  • Format: PDF, English; pdfplumber works for table extraction
  • Coverage in 2023 edition: 2015–2021 historical
  • Stored at: data/bronze/macro/st-pdf/MESH_2023.pdf
  • Extracted to: reference/electricity/peninsular_annual.yaml (annual peninsular GWh, 2015–2021)
  • What’s NOT in it (verified by full-PDF scan, 2026-05-07): state-level electricity (Johor or any other state) is not published. MESH only goes to Peninsular / Sabah / Sarawak granularity. SAIDI/SAIFI/CAIDI ARE published by state, but consumption is not.
  • Implication: Johor electricity must be modelled (see reference/electricity/johor_share.yaml). The PSI report (Performance-and-Statistical-Information-{YEAR}.pdf) is only publicly accessible up to 2014 and was not pursued.
  • Status: PDF stored, manual transcription of peninsular data done. Full table-parser not built (single annual update doesn’t justify the effort).
  • URL: https://ember-energy.org (monthly electricity data API)
  • Cadence: monthly, ~2 week lag
  • Auth: none
  • Format: CSV
  • Bronze schema: month, country=MY, generation_source, gwh
  • Status: Not Started
  • URL: https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py
  • Cadence: hourly observations + irregular SPECI for weather changes (~24–28 rows/day typical)
  • Auth: none
  • Format: CSV (format=onlycomma)
  • Variables: tmpc, dwpc, relh, sknt, drct, vsby, gust, alti, mslp, p01i
  • Bronze schema: timestamp_utc, station, tmpc, dwpc, relh, sknt, drct, vsby, gust, alti, mslp, p01i, _ingested_at, _source_url, _source_hash
  • Partition: yyyymm=YYYYMM/run_id=YYYYMMDDTHHMMSSZ/data.parquet
  • Why we need it: ground truth to validate NASA POWER + ERA5 over JB. Empirical: POWER is ~1.4 ± 1.0 °C warmer than METAR at WMKJ (Jan 2024 sample) — within typical reanalysis-vs-observation bias for coastal tropics.
  • Known issues: day2/month2/year2 are exclusive (the API treats them as the end-of-window midnight), so the ingester adds 1 day to end before sending to keep our inclusive Window contract. Missing values come back as null (or sometimes M); both are coerced to null. vsby=9.99 means “10+ miles unlimited”; bronze keeps it as-is and silver should clip.
  • Status: Implemented
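The exclusive end date and the two missing-value spellings can be handled as sketched below. This is an illustration rather than the actual ingester: the function names are hypothetical, and the year1/month1/day1 and station parameter names are assumed to mirror the day2/month2/year2 naming mentioned in the card.

```python
from datetime import date, timedelta


def request_params(station: str, start: date, end: date) -> dict:
    """Build query params for an inclusive [start, end] window."""
    # The API's end date is exclusive, so shift it forward by one day.
    end_exclusive = end + timedelta(days=1)
    return {
        "station": station,
        "year1": start.year, "month1": start.month, "day1": start.day,
        "year2": end_exclusive.year, "month2": end_exclusive.month,
        "day2": end_exclusive.day,
        "format": "onlycomma",
    }


def coerce_value(raw):
    """Turn a raw CSV cell into a float, mapping 'M', 'null', and '' to None."""
    if raw is None or raw.strip() in ("", "M", "null"):
        return None
    return float(raw)


# An inclusive January window ends on the exclusive date 2024-02-01.
params = request_params("WMKJ", date(2024, 1, 1), date(2024, 1, 31))
```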

  1. Add the card here first (PR-able artifact even before code)
  2. Add an ingester module under src/jb_vpp/ingest/<source>.py exposing run(...)
  3. Wire a CLI subcommand in src/jb_vpp/cli.py
  4. Add unit tests under tests/test_<source>.py with respx for HTTP mocking
  5. Update Status here when green
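For step 2, an ingester's run(...) might be shaped like the sketch below. The signature, the injected `fetch` callable, and the choice of SHA-256 for _source_hash are all assumptions made for illustration; the real modules under src/jb_vpp/ingest/ may differ. Injecting `fetch` keeps the function testable without network access.

```python
import hashlib
from datetime import datetime, timezone
from typing import Callable, Optional


def make_run_id(now: datetime) -> str:
    """Format a timestamp as the run_id used in partition paths."""
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def run(url: str, fetch: Callable[[str], bytes],
        now: Optional[datetime] = None) -> dict:
    """Fetch url and return the raw payload plus provenance fields."""
    now = now or datetime.now(timezone.utc)
    payload = fetch(url)
    return {
        "payload": payload,
        "run_id": make_run_id(now),
        "_ingested_at": now.isoformat(),
        "_source_url": url,
        # Hash algorithm is an assumption; any stable digest would do.
        "_source_hash": hashlib.sha256(payload).hexdigest(),
    }


# Usage with a fake fetcher, as a unit test might do.
result = run(
    "https://example.invalid/data",
    fetch=lambda _url: b"{}",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```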