kacho.io

I recorded Polymarket's 5-minute crypto markets for two months. Here's the dataset.

A free, open dataset: about 89,000 of Polymarket's 5-minute crypto up/down markets (BTC, ETH, SOL, XRP, DOGE, HYPE, BNB), captured second by second. That's about 26.8 million top-of-book order-book observations from March to May 2026. I collected it to backtest a bot. You can have it.

  • polymarket
  • dataset
  • crypto
  • build-in-public

You're most likely familiar with Polymarket's 5-minute crypto markets. If not, here's the gist — Polymarket runs a market on whether Bitcoin will be higher or lower five minutes from now. Then another. Then another. 24/7, one every five minutes, for seven different coins. As far as I know, there's no freely available history of those markets anywhere. Polymarket will hand you the live order book, but the moment a window closes it's gone — and you can't backtest a bot on data that doesn't exist.

So I recorded my own. For about seven weeks I captured the order book of every one of these markets, once per second, for BTC, ETH, SOL, XRP, DOGE, HYPE and BNB. This post is me giving that away — the data, the schema, how it was collected, and exactly where it's thin. It's free, and you can do whatever you like with it.

Fair warning: most of this post from here on is AI-generated — written from a description and analysis of the data I fed it. It's still an accurate, thorough account of what's in the dataset. And one caveat on the data itself: it's only two months, and not as granular as you'd probably need to build a genuinely competitive bot for these markets. But it's something to start with, and it's yours — no strings attached.

My two cents on the data: it's good enough to backtest any bot you build, but don't expect the live results to match. My own BTC bot backtested at a respectable 3–5% ROI after fees, and then running it live cost me roughly $600. I'd made some early mistakes that I later fixed, but even then the fees quietly ate whatever ROI I was close to getting. That's a longer story, and a full write-up for another day.

Btw, if you can't be bothered with the whole write-up and you just want the data, you can jump straight to it.

The headline numbers

  • 7 coins — BTC, ETH, SOL, XRP, DOGE, HYPE, BNB.
  • ~89,000 markets — each a single 5-minute up/down window that opened, traded and resolved.
  • ~26.8 million per-second observations — every market sampled once a second for its full five-minute life (≈300 ticks each).
  • Span: BTC from 24 Mar 2026, the other six from 5 Apr 2026, all running to 18 May 2026 — all timestamps UTC.
  • Coverage 99.8%+ with no duplicates. It's a fixed historical window, not a live feed.

What these markets actually are

Skip this part if you already know the 5/15/60-minute BTC (or ETH, SOL, XRP, BNB, DOGE, HYPE) markets.

Each market asks one question: will this coin's price be up or down at the end of a fixed 5-minute window? Two outcomes, "Up" and "Down", each trading as its own token priced between 0 and 1 in USDC. That price is the market's implied probability — an Up token at 0.62 means the market thinks there's a 62% chance the coin closes the window higher.
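To make the "price is a probability" reading concrete, here is a minimal sketch of the expected profit per share. It assumes a winning token redeems for 1 USDC and a losing one for 0, and it ignores fees entirely. The function name and the 0.70 belief are made up for illustration.

```python
def expected_profit(price: float, p_win: float) -> float:
    """Expected USDC profit per share bought at `price`, ignoring fees.

    Assumes a winning token pays 1 USDC and a losing token pays 0.
    """
    return p_win * (1 - price) - (1 - p_win) * price

# If your own estimate equals the market price, the bet is break-even.
print(expected_profit(0.62, 0.62))
# If you think Up is more likely than the market does, buying Up has positive expectation.
print(expected_profit(0.62, 0.70))
```

The expression simplifies to p_win - price, so the edge is just the difference between your probability and the market's.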

Because Up and Down are two separate order books, each with its own spread, the two best bids won't add up to exactly 1. That gap comes from the two spreads, and any persistent drift away from 1 is itself worth looking at.
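As a rough illustration of that gap, the sketch below computes how far the two best bids fall short of 1. The column names bu and bd are the best-bid columns from the ticks table described in the data dictionary; the values are invented for the example, not taken from the dataset.

```python
import pandas as pd

# Made-up top-of-book samples: bu / bd are the best bids for Up / Down.
ticks = pd.DataFrame({
    "bu": [0.61, 0.62, 0.60],
    "bd": [0.37, 0.37, 0.38],
})

# How far the two best bids fall short of 1 at each sample.
ticks["bid_gap"] = 1 - (ticks["bu"] + ticks["bd"])
print(ticks)
```

On the real data, a rolling mean of bid_gap per market would be one way to look for the persistent drift mentioned above.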

Data dictionary

The data comes as two tables per coin, joined on condition_id: a markets table (one row per resolved 5-minute market) and a ticks table (one row per second — the order book). So btc_markets.parquet has ~15,700 rows; btc_ticks.parquet has ~4.7 million.

markets — one row per 5-minute market

| Column | Type | Unit | Meaning |
|---|---|---|---|
| condition_id | text | | Polymarket's on-chain condition ID. Unique; the join key to ticks. |
| event_id | text | | Polymarket Gamma event ID. |
| slug | text | | Human-readable market slug, e.g. btc-updown-5m-1774745100. |
| market_start | timestamptz | UTC | When the 5-minute window opens. Always aligned to a 5-min boundary (:00, :05, :10…). |
| market_end | timestamptz | UTC | When the market resolves: market_start + 5 min. |
| recorded_at | timestamptz | UTC | When the row was written, just after market_end. |
| token_up | text | | ERC-1155 token ID for the "Up" outcome. |
| token_down | text | | ERC-1155 token ID for the "Down" outcome. |
| volume | numeric | USDC | Market volume from the Gamma API at discovery time. |
| liquidity | numeric | USDC | Market liquidity from the Gamma API at discovery time. |
| outcome | text | | 'Up', 'Down', or NULL. Inferred from the final tick (winning side's bid → ~0.99), not read from on-chain resolution. |
| n_ticks | int | | Number of per-second rows for this market in ticks (≈300). |
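The example slug ends in what looks like a unix timestamp. The sketch below parses it on the assumption that the trailing number is epoch seconds in UTC. Whether it corresponds to market_start or some other instant isn't stated here, so treat that as a guess and check it against the market_start column. The helper name slug_epoch is made up.

```python
from datetime import datetime, timezone

def slug_epoch(slug: str) -> datetime:
    """Parse the trailing integer of a slug as unix epoch seconds (UTC).

    Assumption: the suffix is an epoch timestamp; this is not confirmed.
    """
    return datetime.fromtimestamp(int(slug.rsplit("-", 1)[1]), tz=timezone.utc)

ts = slug_epoch("btc-updown-5m-1774745100")
print(ts.isoformat())

# The parsed time falls on a 5-minute boundary, consistent with market_start.
print(ts.minute % 5 == 0 and ts.second == 0)
```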

ticks — one row per second

Each row is one 1-second sample of the live book, joined to markets by condition_id. Up and Down are separate books, so every tick captures both sides.

| Column | Type | Unit | Meaning |
|---|---|---|---|
| condition_id | text | | Join key back to markets. |
| t | bigint | seconds | Sample time, unix epoch seconds (UTC). |
| ts_utc | timestamptz | UTC | Same instant as t, as an ISO timestamp. |
| bu / au | numeric | USDC (0–1) | Best bid / ask, Up token. |
| bd / ad | numeric | USDC (0–1) | Best bid / ask, Down token. |
| su / sd | numeric | shares | Size resting at best bid, Up / Down. NaN if that side was empty. |
| sau / sad | numeric | shares | Size resting at best ask, Up / Down. NaN if that side was empty. |
| du / dd | numeric | USDC | Depth: Σ(size × price) for all bids within 5¢ of best bid, Up / Down. |

A natural point estimate for the implied probability of "Up" is the mid, (bu + au) / 2. Within a market the t values run from ≈market_start to ≈market_start + 299s — a clean 300-second sweep, one tick per second. The only NaNs in the whole dataset are in the resting-size columns, where a side of the book happened to be empty.
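A quick way to sanity-check both claims on a loaded ticks table is sketched below. It uses a synthetic 300-row market so it runs without the files; the condition_id value and the flat bid/ask prices are placeholders, and is_contiguous is a made-up helper name.

```python
import numpy as np
import pandas as pd

# Synthetic single market: 300 one-second samples with flat prices.
start = 1774745100
ticks = pd.DataFrame({
    "condition_id": "0xabc",               # placeholder ID
    "t": np.arange(start, start + 300),    # market_start .. market_start + 299
    "bu": 0.61,
    "au": 0.63,
})

# Mid price as a point estimate of the implied probability of "Up".
ticks["mid_up"] = (ticks["bu"] + ticks["au"]) / 2

def is_contiguous(t: pd.Series) -> bool:
    """True if the sorted sample times advance by exactly one second."""
    steps = t.sort_values().diff().dropna()
    return bool((steps == 1).all())

contiguous = ticks.groupby("condition_id")["t"].apply(is_contiguous)
print(contiguous)
```

Run against a real ticks file, any market where contiguous is False would be one with a skipped second.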

Coverage and gaps

I'd rather you know where this is thin before you build on it.

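| Coin | Markets | Span (UTC) | 5-min coverage | Missing windows |
|---|---|---|---|---|
| BTC | 15,682 | 24 Mar → 18 May | 99.88% | 19 |
| ETH | 12,258 | 05 Apr → 18 May | 99.84% | 20 |
| SOL | 12,259 | 05 Apr → 18 May | 99.85% | 19 |
| XRP | 12,258 | 05 Apr → 18 May | 99.84% | 20 |
| DOGE | 12,259 | 05 Apr → 18 May | 99.85% | 19 |
| HYPE | 12,258 | 05 Apr → 18 May | 99.84% | 20 |
| BNB | 12,259 | 05 Apr → 18 May | 99.85% | 19 |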

The missing windows aren't random per-coin loss — they line up on the same wall-clock times across all seven coins, which means they're brief outages of my collector, not anything wrong with a specific market. The entire gap inventory over 7.5 weeks:

  • 18 Apr, 10:35 → 11:55 UTC — ~15 windows (~1.2h), the big one.
  • 15 Apr, 12:00 → 12:15 UTC — 2 windows.
  • 16 Apr, ~22:50 and ~23:30 UTC — 1–2 windows each.

Inside the markets that are present, the per-second data is excellent: ~99.97% of markets carry the full ~300 ticks (the exact count is in each market's n_ticks), and no market has internal gaps — its seconds are contiguous start to finish. A handful of short markets exist per coin (the odd ~230- or ~293-tick one, again from the same shared hiccups). No empty markets anywhere, and condition_id is unique, so every market appears exactly once — no duplicates.
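The coverage figures in this section can be recomputed from market_start alone, since every window sits on a 5-minute boundary. The sketch below shows one way to do it on a tiny synthetic series with a single missing window. It is an illustration of the idea, not necessarily how the numbers above were produced.

```python
import pandas as pd

# Synthetic market_start values with one 5-minute window missing (00:10).
market_start = pd.to_datetime(
    ["2026-04-18 00:00", "2026-04-18 00:05", "2026-04-18 00:15", "2026-04-18 00:20"],
    utc=True,
)

# Every 5-minute boundary between the first and last recorded window.
expected = pd.date_range(market_start.min(), market_start.max(), freq="5min")

# Boundaries with no recorded market are the missing windows.
missing = expected.difference(market_start)
coverage = len(market_start) / len(expected)

print(missing)
print(f"coverage: {coverage:.2%}")
```

With the real files, market_start would come from the markets table of one coin, and comparing the missing sets across coins shows whether the gaps line up.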

Known limitations

  • It's a fixed window, not live. Collection ran to 18 May 2026 and stops there. Treat it as history.
  • BTC has ~12 more days of history than the other six; it took me a while to realise the other coins were worth recording too.
  • outcome is inferred, not read from chain. It comes from the final tick's bid and can be NULL near edge cases — treat it as best-effort.
  • volume / liquidity are point-in-time values from discovery, not end-of-window figures.
  • Ask-side depth isn't recorded — only best-ask price and size. The 5¢ depth aggregate (du/dd) is bid-side only.
  • bu + bd ≠ 1 in general — two independent books, two spreads.
  • Sampling is best-effort 1 Hz from a cache; a tick is written only when at least one side had book data, which is why a few markets fall short of 300 ticks and why the resting-size columns are NaN where a side was empty.
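Since outcome is inferred from the final tick, it can be useful to re-derive it yourself with your own cut-off. The sketch below is one plausible version of that rule. The 0.95 threshold and the infer_outcome name are assumptions for illustration, not the recorder's actual logic, and the ticks are made up.

```python
import pandas as pd

def infer_outcome(last_tick: pd.Series, threshold: float = 0.95):
    """Return 'Up', 'Down', or None from a market's final tick.

    threshold is a hypothetical cut-off, not the recorder's actual value.
    """
    if last_tick["bu"] >= threshold:
        return "Up"
    if last_tick["bd"] >= threshold:
        return "Down"
    return None  # ambiguous: neither side's best bid is near 1

# Made-up ticks for three markets; only the last row per market matters.
ticks = pd.DataFrame({
    "condition_id": ["a", "a", "b", "b", "c", "c"],
    "t": [1, 2, 1, 2, 1, 2],
    "bu": [0.70, 0.99, 0.30, 0.01, 0.55, 0.52],
    "bd": [0.28, 0.01, 0.68, 0.99, 0.43, 0.46],
})

last = ticks.sort_values("t").groupby("condition_id").tail(1).set_index("condition_id")
outcomes = last.apply(infer_outcome, axis=1)
print(outcomes)
```

Market "c" comes back empty because neither bid reaches the threshold, which is the kind of edge case where the shipped outcome column can be NULL.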

How it was collected

A custom recorder subscribes to Polymarket's public CLOB WebSocket order-book feed and keeps an in-memory book for every active market's Up and Down tokens. It discovers new markets via the Gamma API every 30 seconds. Once per second it reads the cached top-of-book for each active market and appends a sample — zero network calls per sample, just a read of the live cache — and when a market's window closes, all of its per-second ticks are written out. Everything is UTC.
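The cache-then-sample pattern described above can be sketched without any networking. Everything in this block is hypothetical: the class and method names are invented, it tracks a single book per market rather than separate Up and Down books, and the WebSocket and Gamma API parts are left out. It is not the actual recorder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Tick = Tuple[int, Optional[float], Optional[float]]  # (t, best bid, best ask)

@dataclass
class Recorder:
    """Hypothetical sketch: feed updates go to a cache, a 1 Hz loop samples it."""
    cache: Dict[str, Tuple[Optional[float], Optional[float]]] = field(default_factory=dict)
    samples: Dict[str, List[Tick]] = field(default_factory=dict)

    def on_book_update(self, market: str, bid: Optional[float], ask: Optional[float]) -> None:
        # Called whenever the feed pushes a change; only touches memory.
        self.cache[market] = (bid, ask)

    def sample(self, t: int) -> None:
        # Called once per second; reads the cache and makes no network calls.
        for market, (bid, ask) in self.cache.items():
            if bid is None and ask is None:
                continue  # no book data on either side: skip this tick
            self.samples.setdefault(market, []).append((t, bid, ask))

    def close(self, market: str) -> List[Tick]:
        # When the window closes, hand back all buffered ticks for writing out.
        self.cache.pop(market, None)
        return self.samples.pop(market, [])

rec = Recorder()
rec.on_book_update("btc-updown", 0.61, 0.63)
rec.sample(t=100)
rec.sample(t=101)  # no update in between, so the cached values repeat
rec.on_book_update("btc-updown", 0.64, 0.66)
rec.sample(t=102)
rows = rec.close("btc-updown")
print(rows)
```

Sampling a cache rather than polling means each tick is whatever the book looked like at that second, which is why a quiet book produces repeated rows and an empty book produces none.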

Get the data

Each coin ships as two files — <coin>_markets.parquet and <coin>_ticks.parquet — joined on condition_id. Parquet is the primary download (the whole set is ~725MB). Quick start in pandas:

```python
import pandas as pd

markets = pd.read_parquet("btc_markets.parquet")   # one row per market
ticks   = pd.read_parquet("btc_ticks.parquet")     # one row per second

ticks["mid_up"] = (ticks["bu"] + ticks["au"]) / 2  # implied P(Up)

# the per-second history of a single market
first = markets.iloc[0]["condition_id"]
one = ticks[ticks["condition_id"] == first].sort_values("t")
print(one[["ts_utc", "bu", "au", "mid_up", "du"]].head())
```
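For backtesting you will usually want market-level fields attached to every tick. A minimal sketch of that join follows, using small synthetic frames in place of the Parquet files so it runs on its own. The column names match the data dictionary; elapsed_s is a made-up name for seconds since the window opened.

```python
import pandas as pd

# Stand-ins for <coin>_markets.parquet and <coin>_ticks.parquet.
markets = pd.DataFrame({
    "condition_id": ["a", "b"],
    "market_start": pd.to_datetime(["2026-04-05 00:00", "2026-04-05 00:05"], utc=True),
    "outcome": ["Up", "Down"],
})
ticks = pd.DataFrame({
    "condition_id": ["a", "a", "b"],
    "ts_utc": pd.to_datetime(
        ["2026-04-05 00:00:00", "2026-04-05 00:00:01", "2026-04-05 00:05:30"], utc=True
    ),
    "bu": [0.50, 0.52, 0.31],
    "au": [0.52, 0.54, 0.33],
})

# Many ticks per market; validate= raises if condition_id is not unique in markets.
joined = ticks.merge(markets, on="condition_id", how="left", validate="many_to_one")

# Seconds since the window opened, handy for "what did the book look like at t+60s".
joined["elapsed_s"] = (joined["ts_utc"] - joined["market_start"]).dt.total_seconds()
print(joined[["condition_id", "elapsed_s", "bu", "au", "outcome"]])
```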

Licence and terms

Released under CC0 1.0 — public domain. No attribution required, no strings. Use it for anything. (A link back to my website or a share on social media is appreciated, never required.)

This dataset is derived from public market data on Polymarket — specifically the order books of its 5-minute crypto up/down markets — captured via Polymarket's public CLOB WebSocket and Gamma API. It's an independent, transformed recording (per-second top-of-book aggregates), not affiliated with or endorsed by Polymarket, and provided "as is" for research and educational use, with no warranty.


If you do something interesting with it, I'd genuinely like to see it. And if you want the next drop — I'm restarting the recorder, and the bot that this was all for gets its own write-up — subscribe and I'll send it over.