hckrnws
Show HN: PgDog – Scale Postgres without changing the app
by levkk
Hey HN! Lev and Justin here, authors of PgDog (https://pgdog.dev/), a connection pooler, load balancer and database sharder for PostgreSQL. If you build apps with a lot of traffic, you know the first thing to break is the database. We are solving this with a network proxy that works without requiring application code changes or database migrations.
Our post from last year: https://news.ycombinator.com/item?id=44099187
The most important update: we are in production. Sharding is used a lot, with direct-to-shard queries (one shard per query) working pretty much all the time. Cross-shard (or multi-database) queries are still a work in progress, but we are making headway.
Aggregate functions like count(), min(), max(), avg(), stddev() and variance() are working, without refactoring the app. PgDog calculates the aggregate in-transit, while transparently rewriting queries to fetch any missing info. For example, multi-database average calculation requires a total count of rows to calculate the original sum. PgDog will add count() to the query, if it’s not there already, and remove it from the rows sent to the app.
Sorting and grouping works, including DISTINCT, if the columns(s) are referenced in the result. Over 10 data types are supported, like, timestamp(tz), all integers, varchar, etc.
Cross-shard writes, including schema changes (CREATE/DROP/ALTER), are now atomic and synchronized between all shards with two-phase commit. PgDog keeps track of the transaction state internally and will rollback the transaction if the first phase fails. You don’t need to monkeypatch your ORM to use this: PgDog will intercept the COMMIT statement and execute PREPARE TRANSACTION and COMMIT PREPARED instead.
Omnisharded tables, a.k.a replicated or mirrored (identical on all shards), support atomic reads and writes. That’s important because most databases can’t be completely sharded and will have some common data on all databases that has to be kept in-sync.
Multi-tuple inserts, e.g., INSERT INTO table_x VALUES ($1, $2), ($3, $4), are split by our query rewriter and distributed to their respective shards automatically. They are used by ORMs like Prisma, Sequelize, and others, so those now work without code changes too.
Sharding keys can be mutated. PgDog will intercept and rewrite the update statement into 3 queries, SELECT, INSERT, and DELETE, moving the row between shards. If you’re using Citus (for everyone else, Citus is a Postgres extension for sharding databases), this might be worth a look.
If you’re like us and prefer integers to UUIDs for your primary keys, we built a cross-shard unique sequence, directly inside PgDog. It uses the system clock (and a couple other inputs), can be called like a Postgres function, and will automatically inject values into queries, so ORMs like ActiveRecord will continue to work out of the box. It’s monotonically increasing, just like a real Postgres sequence, and can generate up to 4 million numbers per second with a range of 69.73 years, so no need to migrate to UUIDv7 just yet.
INSERT INTO my_table (id, created_at) VALUES (pgdog.unique_id(), now());
Resharding is now built-in. We can move gigabytes of tables per second, by parallelizing logical replication streams across replicas. This is really cool! Last time we tried this at Instacart, it took over two weeks to move 10 TB between two machines. Now, we can do this in just a few hours, in big part thanks to the work of the core team that added support for logical replication slots to streaming replicas in Postgres 16.Sharding hardly works without a good load balancer. PgDog can monitor replicas and move write traffic to a promoted primary during a failover. This works with managed Postgres, like RDS (incl. Aurora), Azure Pg, GCP Cloud SQL, etc., because it just polls each instance with “SELECT pg_is_in_recovery()”. Primary election is not supported yet, so if you’re self-hosting with Patroni, you should keep it around for now, but you don’t need to run HAProxy in front of the DBs anymore.
The load balancer is getting pretty smart and can handle edge cases like SELECT FOR UPDATE and CTEs with INSERT/UPDATE statements, but if you still prefer to handle your read/write separation in code, you can do that too with manual routing. This works by giving PgDog a hint at runtime: a connection parameter (-c pgdog.role=primary), SET statement, or a query comment. If you have multiple connection pools in your app, you can replace them with just one connection to PgDog instead. For multi-threaded Python/Ruby/Go apps, this helps by reducing memory usage, I/O and context switching overhead.
Speaking of connection pooling, PgDog can automatically rollback unfinished transactions and drain and re-sync partially sent queries, all in an effort to preserve connections to the database. If you’ve seen Postgres go to 100% CPU because of a connection storm caused by an application crash, this might be for you. Draining connections works by receiving and discarding rows from abandoned queries and sending the Sync message via the Postgres wire protocol, which clears the query context and returns the connection to a normal state.
PgDog is open source and welcomes contributions and feedback in any form. As always, all features are configurable and can be turned off/on, so should you choose to give it a try, you can do so at your own pace. Our docs (https://docs.pgdog.dev) should help too.
Thanks for reading and happy hacking!
As someone who has worked on many-TB-sized "custom" sharded systems with 30-150 shards at multiple (ok, 2) employers, a key challenge to the overall sharding landscape is unsharding all the data back at the analytics layer.
This at a minimum often involved adding back a shard key to the physical data, or partitioning, and/or physical data sorting easily in the "OLAP" layer. And a surprising number of CDC and ETL toolkits don't make it easy to parameterize a single code/configuration base, nor handle situations like shards being down at different times for maintenance or fetching data from each shard at a time of day specified by its end-of-day or handling retransmissions or reconciliation or gaps or data quality of a single shard when back in an unsharded landscape. SQL UNION ALL to reunite shards works, until it doesn't.
YMMV but would be curious if you have a story/solution/thoughts along these lines. It's easier if you shard with unified analytics/reporting in mind on day one of a sharded system design, but in the worlds I've lived in, nobody ever does. But maybe you could.
A couple options come to mind:
1. Replicate shards into one beefy database and use that. Replication is cheaper than individual statements, so this can work for a while. The sink can be Postgres or another database like Clickhouse. At Instacart, we used Snowflake, with an in-house CDC pipeline. It worked well, but Snowflake was only usable for offline analytics, like BI / batch ML, and quite expensive. We'll add support for this eventually; we're getting pretty good at managing logical replication, including DDL changes.
2. Use the shards themselves and build a decent query engine on top. This is the Citus way and we know it's possible. Some queries could be expensive, but that's expected and can be solved with more compute.
In our architecture, shards going down for maintenance is an incident-level event, so we expect those to be up at all times, and failover to a standby if there is an issue. These days, most maintenance tasks can be done online in-place, or with blue/green, which we'll support as well. Zero downtime is the name of the game.
Great progress, guys! It’s impressive to see all the enhancements - more types, more aggregate functions, cross-node DML, resharding, and reliability-focused connection pooling and more. Very cool! These were really hard problems and took multiple years to build at Citus. Kudos to the shipping velocity.
Happy pgdog user here, I can recommend it from a user perspective as a connection pooler to anyone checking this out (we're also running tests and positive about sharding, but haven't run it in prod yet, so I can't 100% vouch for it on that, but that's where we're headed.)
@Lev, how is the 2pc coming along? I think it was pretty new when I last checked, and I haven't looked into it much since then. Is it feeling pretty solid now?
It feels better now, but we still need to add crash protection - in case PgDog itself crashes, we need to restore in-progress 2pc transaction records from a durable medium. We will add this very soon.
Stupid question but does this shard the database as well or do we shard manually and then setup the configuration accordingly ?
It shards it as well. We handle schema sync, moving table data (in parallel), setting up logical replication, and application traffic cutover. The zero-downtime resharding is currently WIP, working on the PR as we speak: https://github.com/pgdogdev/pgdog/pull/784.
Incredible. I am really interested in trying pgdog for our B2B SAAS product. Will do some testing.
Interesting, I'm a bit new to this... could you explain how PgDog works high-level? It's both a connection pooler and a sharding engine? Is there a kind of tutorial to "play" with those features and get a sense of how they work/what complexity they add?
Some HTTP proxies can do retries -- if a connection to one backend fails, it is retried on a different backend. Can PgDog (or PgBouncer, or any other tool) do something similar -- if there's a "database server shutting down" error or a connection reset, retry it on another backend?
Not currently, but we can add this. One thing we have to be careful of is to not retry requests that are executing inside transactions, but otherwise this would be a great feature.
> If you build apps with a lot of traffic, you know the first thing to break is the database.
Just out of curiosity, what kinds of high-traffic apps have been most interested in using PgDog? I see you guys have Coinbase and Ramp logos on your homepage -- seems like fintech is a fit?
We have all kinds, it's not specific to any particular sector. That's kind of the beauty for building for Postgres - everyone uses it in some capacity!
My general advice is, once you see more than 100 connections on your database, you should consider adding a connection pooler. If your primary load exceeds 30% (CPU util), consider adding read replicas. This also applies if you want some kind of workload isolation between databases, e.g. slow/expensive analytics queries can be pushed to a replica. Vertically scaling primaries is also a fine choice, just keep that vertical limit in mind.
Once you're a couple instance types away from the largest machine your cloud provider has, start thinking about sharding.
> If your primary load exceeds 30% (CPU util), consider adding read replicas.
I'm not an expert, but isn't this excessive? In theory you could triple the load and still have slack. I'd actually try to scale down, not up.
If most of your users are concentrated in the same (or nearby) time zones, your traffic can easily vary by 5–10x over a 24-hour period. In that case, 30% average CPU utilization doesn't mean you have 70% headroom at peak... it may already imply you're close to saturation during busy hours.
For example, if 30% is your daily average and your peak-to-average ratio is ~5x, you're effectively hitting 150% of capacity at peak. Obviously the system can't sustain that, so you'll see queueing, latency spikes, or throttling.
The 30% guideline makes sense if you care about strict SLAs and predictable latency under peak load. If you're more tolerant of temporary slowdowns, you could probably run closer to 60–70% average utilization, but you're explicitly trading off peak performance and tail latency to do so.
Load is highly bursty. You can autoscale application services quickly, but scaling your database up is a slower thing.
Comment was deleted :(
I see the word 'replication' mentioned quite a few times. Is this managed by pgdog? Would I be able to replace other logical replication setups with pgdog to create a High Availability cluster?
Do you have any write up on how to do this?
I'll need a bit more info about your use case to answer. We use logical replication to move data between shards, with the intention of creating new shards.
This is managed by PgDog. We are building a lot of tooling here, and a lot of it is configurable and can be used separately. For example, we have a CLI and admin database commands to setup replication streams between databases, irrespective of their sharded status, so it can be used for other purposes as well, like moving tables or entire databases to new hardware. If you keep the stream(s) running, you can effectively keep up-to-date logical replicas.
We don't currently manage DDL replication (CREATE/ALTER/DROP) for logically replicated databases - this is a known limitation that we will address shortly. After all, we don't want users to pause schema migrations during resharding. I think once that piece is in, you'll be able to run pretty much any kind of long-lived logical replicas for any purpose, including HA.
Thanks for the explanation. My use case is replacing patroni with pgdog. DDL replication would be required for this.
Is there some way I can get updates about pgdog and especially when the replication you mentioned is there?
Absolutely. We do weekly releases here: https://github.com/pgdogdev/pgdog/releases. Each release includes a detailed changelog.
I believe you can use an RSS reader if those are still in vogue, e.g.: https://github.com/pgdogdev/pgdog/releases.atom.
Congrats guys! Curious how the read write splitting is reliable in practice due to replication lag. Do you need to run the underlying cluster with synchronous replication?
The way we solved it is by checking the lsn on the primary, and then waiting for the replica to catch up to that lsn before doing reads on the replica in various scenarios.
Not really, replication lag is generally an accepted trade-off. Sync replication is rarely worth it, since you take a 30% performance hit on commits and add more single points of failure.
We will add some replication lag-based routing soon. It will prioritize replicas with the lowest lag to maximize the chance of the query succeeding and remove replicas from the load balancer entirely if they have fallen far behind. Incidentally, removing query load helps them catch up, so this could be used as a "self-healing" mechanism.
It sounds like this is one of the few places that might be a leaky abstraction in that queries _might_ fail and the failure might effectively be silent?
It can be silent, but usually it's loud and confusing because people do something like this (Rails example):
user = User.create(email: "test@test.com")
SendWelcomeEmail.perform_later(user.id)
And the job code fetches the row like so: user = User.find(id)
This blows up because `find` throws an error if the record isn't there. Job queues typically use replicas for reads. This is a common gotcha: code that runs async expects the data to be there after creation.There can be others, of course, especially in fintech where you have an atomic ledger, but people are usually pretty conscious about this and send those type of queries to the primary.
In general though, I completely agree, this is leaky and an unsolved problem. You can have performance or accuracy, but not both, and most solutions skew towards performance and make applications handle the lack of accuracy.
Makes sense, appreciate it
This looks great! I have a couple of questions:
1) Is it possible to start off with plain Postgres and add pgdog without scheduled downtime down the road when scaling via sharding becomes necessary?
2) How are schema updates handled when using physical multi-tenancy? Does pgdog just loop over all the databases that it knows about and issues the update schema command to each?
1. Yup, we support online resharding, so you don't need to deploy this until you have to.
2. That's right, we broadcast the DDL to all shards in the configuration. If two-phase commit [1] is enabled, you have a strong guarantee that this operation will be atomic. The broadcast is done in parallel, so this is fast.
Can you elaborate a bit more on the challenges faced in making Postgres shard-able?
I remember that adding sharing to Postgres natively was an uphill battle. There were a few companies who has proprietary solutions for it. What you've been able to achieve is nothing less than a miracle.
So many, where to begin.
1. People don't design schemas to be sharded, although many gravitate towards a common key, e.g. user_id or country_id or tenant_it or customer_id. Once that happens, sharding becomes easier.
2. Postgres provides a lot of guarantees that are tricky to maintain when sharded: atomic changes, referential integrity, check constraints, unique indexes (and constraints), to name a few. Those have to be built separately by a sharding layer (like PgDog) and have trade-offs, usually around performance. It's a lot more expensive to check a globally enforced constraint than a local one (network hops aren't free).
3. Online migrations from unsharded to sharded can be tricky: you have to redistribute terabytes of data while the DB continues to serve writes. You can't lose a single row - Postgres is used as a store of record and this can be a serious issue with business impact.
We're taking increasingly bigger bites at this apple. We started with basic query routing and are now doing query rewrites as well. We didn't handle data movements previously and now have almost fully automatic resharding. It takes time, elbow grease and most importantly, willing and courageous early adopters to whom we owe a huge debt of gratitude.
That's was my second question, how on earth can you replicate real world Postgres workloads that benefit the most from sharing.
Are there some specific standard Postgres test suites you run PgDog through to ensure it's compliant with Postgres standards?
You've mentioned NoSQL quite a bit, what sort of techniques do shard-able NoSQL database employ which makes sharding inherently easier? Do you attempt to emulate some of those techniques in PGDog?
Lastly how do you solve the problem of Postgres constraints, from what I've understood PgDog runs standard Postgres instances as the shard, if let's say one table in shard 1 has a foreign key to a record in shard 2 how do you prevent Postgres from rejecting that record since it technically doesn't exist on it's current shard?
> Are there some specific standard Postgres test suites you run PgDog through to ensure it's compliant with Postgres standards?
That's right. We have many levels of testing: unit, integration, and acceptance, where we run the same query against an unsharded Postgres database and PgDog, and compare the result.
> what sort of techniques do shard-able NoSQL database employ which makes sharding inherently easier?
They remove features. For example, most of them don't support joins, so each table can be stored anywhere in the cluster with no data locality restrictions. There are no foreign key constraints either, or even transaction support. The list goes on. Ultimately, NoSQL databases are just K/V stores, with a fancy API. Scaling K/V is a solved problem.
> one table in shard 1 has a foreign key to a record in shard 2 how do you prevent Postgres from rejecting that
We don't, at least not yet. We can and will build a more sophisticated query engine that will validate constraints, but it may not always be completely atomic or performant. Cross-shard queries are expensive, because of the laws of physics. For example, if a query is executed outside of a transaction, validating the constraint could introduce a race condition, while in non-sharded Postgres, all queries run inside implicit transactions.
Congrats on the progress! What is the behavior of PgDoc if it receives some sort of query it can't currently handle properly? Is there a linter/static analysis tool I can use to evaluate if my query will work?
The current behavior unfortunately is to just let it through and return an incorrect result. We are adding more checks here and rely heavily on early adopters to have a decent test suite before launching their apps to prod.
That being said, we do have this [1]:
[general]
expanded_explain = true
This will modify the output of EXPLAIN queries to return routing decisions made by PgDog. If you see that your query is "direct-to-shard", i.e. goes to only one shard, you can be certain that it'll work as expected. These queries will talk to only one database and don't require us to manipulate the result or assemble results from multiple shards.For cross-shard queries, you'll need your own integration tests, for now. We'll add checks here shortly. We have a decent CI suite as well, but it doesn't cover everything. Every time we look at that part of the code, we just end up adding more features, like the recent support for LIMIT x OFFSET y (PgDog rewrites it to LIMIT x + y and applies the offset calculation in memory).
We'll get there.
How would this product compare to a PostgREST based approach (this is the cool tech behind the original supabase) with load balancing at the HTTP level?
PostgREST is a translation layer: you use HTTP methods, inputs and outputs, to interact with Postgres, the database. It's a replacement for SQL, the language, which happens to also have a load balancer.
Their load balancer is still at the Postgres layer though. You can think of it as just an application that happens to speak a specific API. Load balancing applications is a solved problem.
> It's a replacement for SQL, the language
PostgREST doesn't provide a replacement, rather a subset of the SQL language meant to be safe to expose to untrusted (frontend) clients.
Load balancing is not built-in currently, but it can be done at the proxy layer, taking the advantage that GET/HEAD requests are always executed on read only transactions, so they can be routed to read replicas. This is what Supabase does [1] for example.
[1]: https://supabase.com/docs/guides/platform/read-replicas#api-...
Does it support extensions like PostGIS?
Technically yes. We only support BIGINT (and all other integers), VARCHAR and UUID for sharding keys, but we'll happily pass through any other data. If we need to process it, we'll need to parse it. To be clear: you can include PostGIS data in all queries, as long as we don't need it for sharding.
It's not too difficult to add sharding on it if we wanted to. For example, we added support for pgvector a while back (L2/IVFlat-based sharding), so we can add any other data type, e.g., POLYGON for sharding on ST_Intersects, or for aggregates.
(apologies for new account - NDA applies to the specifics)
Nice surprise to see this here today. I was working on a deployment just last week.
Unfortunately for me, I found that it crashed when doing a very specific bulk load (COPY FORMAT BINARY with array columns inside a transaction). The process loads around 200MB of array columns (in the region of 10K rows) into a variety of tables. Very early in the COPY process PgDog crashes with :
"pgdog router error: failed to fill whole buffer"
So it appears something is not quite right for my specific use case (COPY with array columns). I'm not familiar enough with Rust but the failed to fill whole buffer seemed to come from Rust (rather than PgDog) based on what little I could find with searches.
I was very disappointed as it looked much simpler to get set up and running that PgPool-II (which I have had to revert to as my backup plan - I'm finding it more difficult to configured, but it does cope with the COPY command without issues).
I would have preferred to stick with PgDog.
I think we may have fixed this 3 weeks ago: https://github.com/pgdogdev/pgdog/pull/744
Might be worth another try. If not, a GitHub issue with more specifics would be great, and we'll take a look. Also, if binary encoding isn't working out, try using text - it's more compatible between Postgres versions:
[general]
resharding_copy_format = "text"I've been watching PgDog for a while now. Great progress!
I really hope to use the sharding feature one day.
Comment was deleted :(
How do you know when/if it's justified to add additional complexity like PgDog?
Is there a number of simultaneous connection / req per sec that's a good threshold?
Is it easy on my postgres instance to get the number of simulataneous connections, for instance if I simulate traffic, to know if I would gain anything from a connection pooler?
I would say, over 100 Postgres connections, consider getting a connection pooler. Requests per second is highly variable. Postgres can serve a lot of them, as long as you keep the number of server connections low - that's what the pooler is for.
You can use pgbench to benchmark this on local pretty easily. The TPS curve will be interesting. At first, the connection pooler will cause a decrease and as you add more and more clients (-c parameter), you should see increasing benefits.
Ultimately, you add connection poolers when you don't have any other option: you have hundreds of app containers with dozens of connections each and Postgres can't handle it anymore, so it's a necessity really.
Load balancing becomes useful when you start adding read replicas. Sharding is necessary when you're approaching the vertical limit of your cloud provider (on the biggest instance or close).
Okay, on my side I have a server for my API, using Drizzle, I guess it already uses some kind of pooling (or at least it asks me to instantiate a pg.Pool, not sure if that's a lightweight connection pooler on the server side), and only a couple of workers with a Drizzle pool each, so I guess I'm far enough from that limit
Do connection increase mostly as you increase microservices / workers, or is it something more related to how many endusers of your service (eg connections on your webserver) you have?
That's exactly right, it's both of those. More containers / services means more connections to the DB, which themselves need to be pooled. More requests to the app require more connections as well.
Looks great - I'd love to include it in https://postgresisenough.dev (just put in a PR: https://github.com/agoodway/postgresisenough?tab=readme-ov-f...)
How well does PG work with 10-20 million (financial) records per day? Basic stuff: a few writes per, some reads, generating some analytics, etc.
The entire point of just using Postgres went right over your head…
While the lift to add to your database is low, I don’t think you’re at a point you can outsource the work.
But all the better if they do!
.
Why don't you just do it yourself if you maintain a curated resource list?
Wanted to give them chance to write it up as they like
[dead]
The cross-shard aggregate rewriting is really nice. Transparently injecting count() for average calculations sounds straightforward but there are so many edge cases once you add GROUP BY, HAVING, subqueries, etc.
Curious about latency overhead for the common case. On a direct-to-shard read where no rewriting happens, what's the added latency from going through PgDog vs connecting to Postgres directly? Sub-millisecond?
Subms typically, yeah. We measured the average latency between nodes in the same AZ (e.g., AWS availability zone) to be less than one ms, so you need to account for one extra hop and processing time by PgDog, which is typically fast.
That being said if you don't currently use a connection pooler, you will notice some latency when adding one. It's usually table stakes for Postgres at scale since you need one anyway, but it can be surprising. This especially affects "chatty" apps - the ones that send 10+ queries to service one API request, and makes bugs like N+1s considerably worse.
TLDR: not a free lunch, but generally acceptable at scale.
[dead]
Crafted by Rajat
Source Code