title: Engineering for Slow Internet
url: https://brr.fyi/posts/engineering-for-slow-internet
hash_url: 9e9c6f97d7
archive_date: 2024-05-31
og_image: https://brr.fyi/media/engineering-for-slow-internet/engineering-for-slow-internet-icon.png
description: How to minimize user frustration in Antarctica.
favicon: https://brr.fyi/favicon-32x32.png
language: en_US
Hello everyone! I got partway through writing this post while I was still in Antarctica, but I departed before finishing it.
I’m going through my old draft posts, and I found that this one was nearly complete.
It’s a bit of a departure from the normal content you’d find on brr.fyi, but it reflects my software / IT engineering background.
I hope folks find this to be an interesting glimpse into the on-the-ground reality of using the Internet in bandwidth-constrained environments.
Please keep in mind that I wrote the majority of this post ~7 months ago, so it’s likely that the IT landscape has shifted since then.
Welcome back for a bonus post about Engineering for Slow Internet!
For a 14-month period, while working in Antarctica, I had access to the Internet only through an extremely limited series of satellite links provided by the United States Antarctic Program.
Before I go further, this post requires a special caveat, above and beyond my standard disclaimer:
Even though I was an IT worker within the United States Antarctic Program, everything I am going to discuss in this post is based on either publicly-available information, or based on my own observations as a regular participant living on ice.
I have not used any internal access or non-public information in writing this post.
As a condition of my employment, I agreed to a set of restrictions regarding public disclosure of non-public Information Technology material. I fully intend to honor these restrictions. These restrictions are ordinary and typical of US government contract work.
It is unlikely that I will be able to answer additional questions about matters I discuss in this post. I’ve taken great care to write as much as I am able to, without disclosing non-public information regarding government IT systems.
Good? Ok, here we go.
… actually wait, sorry, one more disclaimer.
This information reflects my own personal experience in Antarctica, from August 2022 through December 2022 at McMurdo, and then from December 2022 through November 2023 at the South Pole.
Technology moves quickly, and I make no claims that the circumstances of my own specific experience will hold up over time. In future years, once I’ve long-since forgotten about this post, please do not get mad at me when the on-the-ground IT experience in Antarctica evolves away from the snapshot presented here.
Ok, phew. Here we go for real.
It’s a non-trivial feat of engineering to get any Internet at the South Pole! If you’re bored, check out the South Pole Satellite Communications page on the public USAP.gov website, for an overview of the limited selection of satellites available for Polar use.
If you’re interested, perhaps also look into the 2021 Antarctic Subsea Cable Workshop for an overview of some hurdles associated with running traditional fiber to the continent.
I am absolutely not in a position of authority to speculate on the future of Antarctic connectivity! Seriously. I was a low-level, seasonal IT worker in a large, complex organization. Do not email me your ideas for improving Internet access in Antarctica – I am not in a position to do anything with them.
I do agree with the widespread consensus on the matter: There is tremendous interest in improving connectivity to US research stations in Antarctica. I would timidly conjecture that, at some point, there will be engineering solutions to these problems. Improved connectivity will eventually arrive in Antarctica, either through enhanced satellite technologies or through the arrival of fiber to the continent.
But – that world will only exist at some point in the future. Currently, Antarctic connectivity is extremely limited. What do I mean by that?
Until very recently, at McMurdo, nearly a thousand people, plus numerous scientific projects and operational workloads, all relied on a series of links that provided max, aggregate speeds of a few dozen megabits per second to the entire station. For comparison, that’s less total bandwidth, shared by everyone on station combined, than a single person can get on a typical 4G cellular network in an American suburb.
Things are looking up! The NSF recently announced some important developments regarding Starlink at McMurdo and Palmer.
I’m aware that the on-the-ground experience in McMurdo and Palmer is better now than it was even just a year ago.
But – as of October 2023, the situation was still pretty dire at the South Pole. As far as I’m aware, similar developments regarding Starlink have not yet been announced for South Pole Station.
As of October 2023, South Pole had the limitations described above, plus connectivity was available for only a few hours a day, when the satellites rose above the horizon and the station was authorized to use them. The satellite schedule generally shifts forward (earlier) by about 4 minutes per day, due to the difference between Sidereal time and Solar (Civil) time.
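That ~4-minute figure falls straight out of the difference between the two day lengths (a quick back-of-the-envelope, for the curious):

```latex
\underbrace{24^{\mathrm{h}}\,00^{\mathrm{m}}\,00^{\mathrm{s}}}_{\text{solar day}}
\;-\;
\underbrace{23^{\mathrm{h}}\,56^{\mathrm{m}}\,04^{\mathrm{s}}}_{\text{sidereal day}}
\;\approx\; 3^{\mathrm{m}}\,56^{\mathrm{s}}\ \text{per day}
```

The satellites’ passes repeat on a sidereal cycle rather than a solar one, so each window arrives roughly four minutes earlier on the clock each day.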
The current satellite schedule can be found online, on the South Pole Satellite Communications page of the public USAP.gov website. Here’s an example of the schedule from October 2023:
South Pole satellite schedule, for two weeks in October 2023.
These small intermittent links to the outside world are shared by everyone at Pole, for operational, science, and community / morale usage.
Complicating matters further is the unavoidable physics of this connectivity. These satellites are in a high orbit, thousands of miles up. This means high latency. If you’ve used a consumer satellite product such as HughesNet or ViaSat, you’ll understand.
From my berthing room at the South Pole, it was about 750 milliseconds, round trip, for a packet to get to and from a terrestrial US destination. This is about ten times the latency of a round trip between the US East and West coasts (up to 75 ms). And it’s about thirty times the expected latency of a healthy connection from your home, on a terrestrial cable or fiber connection, to most major content delivery networks (up to 25 ms).
Seriously, I can’t emphasize enough how jarring this is. At my apartment back home, on GPON fiber, it’s about 3 ms roundtrip to Fastly, Cloudflare, CloudFront, Akamai, and Google. At the South Pole, the latency was over two hundred and fifty times greater.
I can’t go into more depth about how USAP does prioritization, shaping, etc, because I’m not authorized to share these details. Suffice to say, if you’re an enterprise network engineer used to working in a bandwidth-constrained environment, you’ll feel right at home with the equipment, tools, and techniques used to manage Antarctic connectivity.
Any individual trying to use the Internet for community use at the South Pole, as of October 2023, likely faced:

- Round-trip latency of roughly 750 ms at best, stretching to 1,000 ms or more under congestion, with jitter measured in seconds.
- Available bandwidth, per user, measured in the tens of kbps.
- Packet loss reaching 10% when the link was saturated.
- Complete connectivity dropouts: brief ones every few minutes under bad conditions, plus the scheduled loss of all connectivity for the majority of each day.
These constraints drastically impact the modern web experience! Some of it is unavoidable. The link characteristics described above are truly bleak. But – a lot of the end-user impact is caused by web and app engineering which fails to take slow/intermittent links into consideration.
If you’re an app developer reading this, can you tell me, off the top of your head, how your app behaves on a link with 40 kbps available bandwidth, 1,000 ms latency, occasional jitter of up to 2,000 ms, packet loss of 10%, and a complete 15-second connectivity dropout every few minutes?
It’s probably not great! And yet – these are real-world performance parameters that I encountered, under certain conditions, at the South Pole. It’s normally better than this, but this does occur, and it occurs often enough that it’s worth taking seriously.
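If you want to find out, you don’t need a satellite. On a Linux test box, the kernel’s netem queueing discipline can approximate exactly these numbers. Here’s a rough sketch – the interface name, the dropout cadence, and the exact parameter choices are my own placeholders, not anything USAP uses:

```python
# Sketch: approximate a South Pole-grade link on a Linux test machine
# using tc/netem. Linux-only; run as root. "eth0" is a placeholder.
import subprocess
import time

IFACE = "eth0"

DEGRADED = [
    "delay", "1000ms", "1000ms",  # ~1,000 ms latency, jitter up to ~2,000 ms
    "loss", "10%",                # 10% packet loss
    "rate", "40kbit",             # ~40 kbps of available bandwidth
]

def netem(verb, *params):
    # Note: netem shapes egress only; for a full simulation, apply it
    # on both ends of the test link.
    subprocess.run(
        ["tc", "qdisc", verb, "dev", IFACE, "root", "netem", *params],
        check=True,
    )

netem("add", *DEGRADED)
try:
    while True:
        time.sleep(180)                    # every few minutes...
        netem("change", "loss", "100%")    # ...a complete dropout...
        time.sleep(15)                     # ...lasting 15 seconds
        netem("change", *DEGRADED)
except KeyboardInterrupt:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```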
This is what happens when you have a tiny pipe to share among high-priority operational needs, plus dozens of community users. Operational needs are aggressively prioritized, and the community soaks up whatever is left.
I’m not expecting miracles here! Obviously no amount of client engineering can make, say, real-time video conferencing work under these conditions. But – getting a few bytes of text in and out should still be possible! I know it is possible, because some apps are still able to do it. Others are not.
One day at the South Pole, I was trying to load the website of <$enterprise_collaboration_platform> in my browser. It’s huge! It needed to load nearly 20 MB of Javascript, just to render the main screen! And of course, the app had been updated since last time I loaded it, so all of my browser’s cached assets were stale and had to be re-downloaded.
Fine! It’s slow, but at least it will work… eventually, right? Browsers do a decent job of handling slow Internet, and under the hood, the underlying protocols manage congestion control reasonably well. I should get a steady trickle of data, subject to the negotiated send and receive windows between client and server, which are based on the current level of congestion on the link, and which are further influenced by any shaping done by middleware along the way.
It’s a complex webapp, so the app developer would also need to implement some of their own retry logic. This allows for recovery in the event that individual assets fail, especially for those long, multi-second total connectivity dropouts. But eventually, given enough time, the transfers should complete.
Unfortunately, this is where things broke down and got really annoying. The developers implemented a global failure trigger somewhere in the app. If the app didn’t fully load within the parameters specified by the developer (time? number of retries? I’m not sure.), then the app stopped, gave up, redirected you to an error page, dropped all the loading progress you’d made, and implemented aggressive cache-busting countermeasures for next time you retried.
I cannot tell you how frustrating this was! Connectivity at the South Pole was never going to meet the performance expectations set by engineers using a robust terrestrial Internet connection. It’s not a good idea to hardcode a single, static, global expectation for how long 20 MB of Javascript should take to download. Why not let me load it at my own pace? I’ll get there when I get there. As long as data is still moving, however slow, just let it run.
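In code, the difference is between a deadline on total elapsed time and a deadline on progress. Here’s a minimal sketch of the latter in Python – the 60-second stall threshold is my own placeholder, and requests’ read timeout conveniently bounds the gap between bytes, not the whole transfer:

```python
import requests

def patient_fetch(url, stall_timeout=60.0):
    # The (connect, read) timeout bounds each individual socket read,
    # not the transfer as a whole. 20 MB trickling in at 40 kbps will
    # eventually finish; only a genuine stall aborts the attempt.
    resp = requests.get(url, stream=True, timeout=(stall_timeout, stall_timeout))
    resp.raise_for_status()
    chunks = []
    for chunk in resp.iter_content(chunk_size=16 * 1024):
        chunks.append(chunk)  # each arriving chunk resets the read timer
    return b"".join(chunks)
```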
But – the developers decided that if the app didn’t load within the parameters they set, I couldn’t use it at all. And to be clear – this was primarily a messaging app. The actual content payload here, when the app is running and I’m chatting with my friends, is measured in bytes.
As it turns out, our Internet performance at the South Pole was right on the edge of what the app developers considered “acceptable”. So, if I kept reloading the page, and if I kept letting it re-download the same 20 MB of Javascript, and if I kept putting up with the developer’s cache-busting shenanigans, eventually it finished before the artificial failure criteria kicked in.
What this means is that I wasted extra bandwidth doing all these useless reloads, and it sometimes took hours before I was able to use the app. All of this hassle, even though, if left alone, I could have completed the necessary data transfer in 15 minutes. Several hours (and a shameful amount of retried Javascript) later, I was finally able to send a short, text-based message to my friends.
...all so that I could send a 1.8 KB HTTPS POST...
...containing a 6-byte message.
Does this webapp really need to be 20 MB? What all is being loaded that could be deferred until it is needed, or included in an “optional” add-on bundle? Is there a possibility of a “lite” version, for bandwidth-constrained users?
In my 14 months in Antarctica, I collected dozens of examples of apps like this, with artificial constraints built in that rendered them unusable or borderline-unusable.
For the rest of this post, I’ll outline some of my major frustrations, and what I would have liked to see instead that would mitigate the issues.
I understand that not every app is in a position to implement all of these! If you’re a tiny app, just getting off the ground, I don’t expect you to spend all of your development time optimizing for weirdos in Antarctica.
Yes, Antarctica is an edge case! Yes, 750 ms / 10% packet loss / 40 kbps is rather extreme. But the South Pole was not uniquely bad. There are entire commercial marine vessels that rely on older Inmarsat solutions for a few hundred precious kbps of data while at sea. There’s someone at a remote research site deep in the mountains right now, trying to load your app on a Thales MissionLink using the Iridium Certus network at a few dozen kbps. There are folks behind misconfigured routers, folks with flaky wifi, folks stuck with fly-by-night WISPs delivering sub-par service. Folks who still use dial-up Internet connections over degraded copper phone lines.
These folks are worthy of your consideration. At the very least, you should make an effort to avoid actively interfering with their ability to use your products.
So, without further ado, here are some examples of development patterns that routinely caused me grief at the South Pole.
As per the above example, do not hardcode your assumptions about how long a given payload will take to transfer, or how much you can transfer in a single request.
A popular desktop application tries to download some configuration information from the vendor’s website at startup. There is a hardcoded timeout for the HTTPS call. If it fails, the app will not load. It’ll just keep retrying the same call, with the same parameters, forever. It’ll sit on the loading page, without telling you what’s wrong. I’ve confirmed this is what’s happening by reading the logs.
Luckily, if you kept trying, the call would eventually make it through under the network conditions I experienced at the South Pole.
It’s frustrating that just a single hardcoded timeout value, in an otherwise perfectly-functional and enterprise-grade application, can render it almost unusable. The developers could have:

- Scaled the timeout dynamically, based on the network conditions the app actually observes.
- Retried with a sensible backoff (as sketched below), instead of repeating the exact same call, with the exact same parameters, forever.
- Let the app start in a degraded state while the config call kept retrying in the background.
- At the very least, surfaced an error in the UI, so the user knows what’s wrong.
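As a sketch of the middle two points – the names and constants here are mine, illustrative rather than any particular vendor’s code – widen the request’s time budget as failures accumulate, and back off with jitter between attempts, so a slow-but-working link eventually gets a request it can satisfy:

```python
import random
import time
import requests

def fetch_config(url):
    timeout, backoff = 10.0, 1.0
    while True:
        try:
            return requests.get(url, timeout=timeout).json()
        except requests.RequestException:
            # A slow link isn't a dead link: give the next attempt a
            # bigger budget, and wait (with jitter) before retrying.
            timeout = min(timeout * 2, 300.0)
            time.sleep(backoff + random.uniform(0, backoff))
            backoff = min(backoff * 2, 60.0)
```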
A popular chat app (“app #1”) maintains a websocket for sending and receiving data. The initialization process for that websocket uses a hardcoded 10-second timeout. Upon cold boot, when network conditions are especially congested, that websocket setup can sometimes take more than 10 seconds! We have to do a full TCP handshake (one round trip), then set up a TLS session (one or two more), then set up the websocket (another), then do initial signaling over the websocket (more still). Remember – under some conditions, each individual roundtrip at the South Pole took multiple seconds! Four or five round trips at a couple of seconds each, and the entire 10-second budget is gone before the app has exchanged a single byte of real content.
If the 10-second timeout elapses, the app simply does not work. It enters a very long backoff state before retrying. The UX does not clearly show what is happening.
On the other hand, a competitor’s chat app (“app #2”) does very well in extremely degraded network conditions! It has multiple strategies for sending network requests, for resilience against certain types of degradation. It aggressively re-uses open connections. It dynamically adjusts timeouts. In the event of a failure, it intelligently chooses a retry cadence. And, throughout all of this, it has clear UX explaining current network state.
The end result is that I could often use app #2 in network conditions when I could not use app #1. Both of them were just transmitting plain text! Just a few actual bytes of content! And even when I could not use app #2, it was at least telling me what it was trying to do. App #1 is written naively, with baked-in assumptions about connectivity that simply did not hold true at the South Pole. App #2 is written well, and it responds gracefully to the conditions it encounters in the wild.
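I don’t know what app #2 actually does internally, but a dynamic-timeout loop along these lines would produce the behavior I observed. This is a hedged reconstruction using the Python websockets library, with constants that are my own guesses:

```python
import asyncio
import time
import websockets  # pip install websockets

async def resilient_connect(uri):
    open_timeout = 10.0  # app #1's hardcoded value, as a starting point
    while True:
        started = time.monotonic()
        try:
            # open_timeout covers TCP + TLS + the websocket upgrade.
            return await websockets.connect(uri, open_timeout=open_timeout)
        except (OSError, asyncio.TimeoutError, websockets.InvalidHandshake):
            elapsed = time.monotonic() - started
            # Scale the next attempt's budget to what the network is
            # actually doing, instead of entering a long blind backoff.
            open_timeout = min(max(open_timeout, elapsed) * 2, 120.0)
            await asyncio.sleep(2.0)
```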
A chance to talk about my own blog publishing toolchain!
The site you’re reading right now is a static Jekyll blog. Assets are stored on S3 and served through CloudFront. I build the static files locally here on my laptop, and I upload them directly to S3. Nothing fancy. No servers, no QA environment, no build system, no automated hooks, nothing dynamic.
Given the extreme connectivity constraints at the South Pole, I wrote a Python script for publishing to S3 that worked well in the challenging environment. It uses the S3 API to upload assets in small chunks. It detects and resumes failed uploads without losing progress. It waits until everything is safely uploaded before publishing the new version.
If I can do it, unpaid, working alone, for my silly little hobby blog, in 200 lines of Python… surely your team of engineers can do so for your flagship webapp.
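For the curious, here’s the skeleton of that approach, using boto3’s multipart-upload API. This is a simplified sketch: the bucket name is a placeholder, my real script carries more state and error handling, and note that S3 requires parts of at least 5 MB (except the last). The key property is that S3 will tell you which parts already arrived, so a failed run resumes instead of restarting:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-blog-bucket"   # placeholder
PART_SIZE = 5 * 1024 * 1024      # S3's minimum part size (last part may be smaller)

def resumable_upload(path, key, upload_id=None):
    if upload_id is None:
        upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=key)["UploadId"]
        print(f"upload id: {upload_id} (save this to resume later)")
    # Ask S3 which parts already made it, so no progress is ever lost.
    uploaded = {p["PartNumber"]: p["ETag"]
                for p in s3.list_parts(Bucket=BUCKET, Key=key,
                                       UploadId=upload_id).get("Parts", [])}
    with open(path, "rb") as f:
        part = 1
        while chunk := f.read(PART_SIZE):
            if part not in uploaded:  # skip parts that already arrived
                resp = s3.upload_part(Bucket=BUCKET, Key=key, UploadId=upload_id,
                                      PartNumber=part, Body=chunk)
                uploaded[part] = resp["ETag"]
            part += 1
    # Nothing goes live until every part is safely in place.
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": [{"PartNumber": n, "ETag": e}
                                   for n, e in sorted(uploaded.items())]})
```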
It’s amazing the usability improvements that come along with some proactive engineering. I had friends at Pole with blogs on commercial platforms, and who routinely shared large files to social media sites. They had to carefully time their day to maximize the likelihood of a successful “one-shot” upload, during a satellite window, using their platform’s poorly-engineered publishing tools. Often it took several retries, and it wasn’t always clear what was happening at every step of the process (Is the content live? Did the upload finish? Is it safe / should I hit “Post” again?).
Meanwhile, I was able to harvest whatever connectivity I could find. I got a few kilobytes uploaded here and there, whenever it was convenient. If a particular chunked POST failed, no worries! I could retry or resume, with minimal lost progress, at a later time. Once it was all done and staged, I could safely publish the new version.
My custom publishing script for this blog, to handle intermittent and unreliable Internet.
If you’re going to build a downloader into your app, you have a high bar for quality that you have to meet. Otherwise, it’s going to fail in profoundly annoying or catastrophic ways.
If I had to give one piece of advice: Let the user break out of your in-app downloader and use their own, if at all possible.
Provide a manual download link, ideally one that leads to whatever differential patch file the app was going to download. Don’t punish the user by making them download the full installer, just because your in-app patch downloader doesn’t meet their needs.
This has the following benefits:
Why is this all so important when considering users at the South Pole?
Downloads are possible at the South Pole, but they are subject to unique constraints. The biggest constraint is the lack of 24x7 Internet. While I was there, I knew we would lose Internet access at a certain time!
It’s a frustrating reality: with most apps that do their own downloads, we were powerless to do anything about this known break in connectivity. We just had to sit there and watch it fail, and often watch all our progress be lost.
Let’s say I had a 4-hour window, every day, during which I could do (very slow!!) downloads. If the total amount of data I could download in those 4 hours was less than the total size of the payload I was downloading, then there is no way I could complete the download in one shot! I’d have to split it over multiple Internet windows. Often the app wouldn’t let me do so.
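To put numbers on it, take the worst-case 40 kbps figure from earlier, sustained perfectly for an entire 4-hour window:

```latex
40\ \text{kbps} \times 4\ \text{h}
= 4\times10^{4}\ \tfrac{\text{bits}}{\text{s}} \times 14{,}400\ \text{s}
\approx 5.8\times10^{8}\ \text{bits}
\approx 72\ \text{MB}
```

At that rate, a 1 GB OS patch spans roughly two weeks of daily windows – so a downloader that can’t survive a break in connectivity can never finish it.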
And that’s not even considering the fact that access might be unreliable during that time! What if the underlying connection dropped and I had to resume the download? What if my plans changed and I needed to pause? I didn’t want to waste whatever precious little progress I’d made so far.
A lot of modern apps include their own homegrown, artisanal, in-app downloaders for large payloads. By “in-app downloader”, I’m referring to the system that obtains the content for automatic updates, patches, content database updates, etc. The common theme here is that the app transparently downloads content for you, without you being exposed to the underlying details such as the URL or raw file. This includes UI patterns such as Check for updates, Click here to download new version, etc.
Unfortunately, most of these in-app downloaders are woefully ill-equipped for the task! Many of them lack pause/resume functionality, state notifications, retry logic, and progress tracking. Many of them have frustrating restrictions, such as time limits for downloading the payload. While most of these issues are mere annoyances in the land of fast Internet, at the South Pole, they can make or break the app entirely.
It was always frustrating to face down one of these interfaces, because I knew how much time, and data transfer, was going to be wasted.
Every app that includes an in-app downloader has to compete with an extraordinarily high bar for usability: web browsers.
Think about it! Every modern web browser includes a download manager that contains Abort, Pause, and Resume functionality. It allows you to retry failed downloads (assuming the content URL doesn’t include an expiring token). It clearly shows you current status, download speed, and estimated time remaining. It allows you to choose where you save the underlying file, so you can copy it around if needed. And – it doesn’t include arbitrary performance cutoffs! If you really want to download a multi-gigabyte file at 60 kbps, go for it!
A partially-downloaded file, for the above-mentioned download.
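There’s nothing magical about how browsers pull this off, either. Pause, resume, and retry are mostly just the HTTP Range header, which nearly every static-content server already supports. A minimal sketch – the URL and path would be placeholders, and a real tool would also verify checksums:

```python
import os
import requests

def resume_download(url, path):
    # Pick up wherever the last attempt left off.
    pos = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={pos}-"} if pos else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        if pos and resp.status_code != 206:  # 206 = Partial Content
            raise RuntimeError("server ignored Range; cannot resume safely")
        resp.raise_for_status()
        with open(path, "ab") as f:
            for chunk in resp.iter_content(chunk_size=16 * 1024):
                f.write(chunk)  # every chunk written is progress kept
```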
Here are a few more examples of where in-app downloaders caused us grief.
It’s no secret that macOS updates are huge. This is sometimes even annoying back home, and it was much worse at the South Pole.
The patch size for minor OS updates is usually between 0.5 and 1.5 gigabytes. Major OS upgrade patches are sometimes 6+ gigabytes. Additional tools, such as Xcode, are often multiple gigabytes.
Sigh, yet another 1 GB patch for my personal macOS device at the South Pole.
If every single macOS device at the South Pole downloaded these updates, directly from Apple, we would have wasted a tremendous amount of bandwidth. And the built-in macOS downloader certainly wanted us to do this! Look at this interface – few controls, no way to break out and easily get the underlying patch files. If I canceled the download, or if it failed for some reason, it didn’t always intelligently resume. Sometimes, I lost all my progress.
Now – Apple does have a caching server feature built into macOS. In theory, this should alleviate some of the burden! We should be able to leverage this feature to ensure each patch is only downloaded to the South Pole one time, and then client Macs will hit the cache.
I experimented with this feature in my spare time, with a handful of my own Apple devices. In practice, this feature still required each client MacBook to make a successful HTTPS call directly to Apple, to negotiate cache parameters. If this call failed, which it often did (because of hardcoded short timeouts!!!), the client Mac simply fetched the patch from the public Apple servers instead. No retry, no recourse: the client Mac made a unilateral decision to bypass the cache, and the user was never told. In practice, this initial cache negotiation call failed often enough at the South Pole that the caching feature wasn’t useful.
What we could do was to fetch the full installer (12 gigabytes!) from Apple. Links to the full installer packages are conveniently aggregated on the Mr. Macintosh blog. We could pull the full installer down to the South Pole slowly and conscientiously: throttled, at low, background priority, using robust, interrupt-tolerant tooling, with support for caching and resumption of paused or failed transfers. Once we had the file, we could distribute it on station. This process could take several days, but it was reliable.
The macOS full installer, painstakingly and conscientiously downloaded to the South Pole.
But even this didn’t solve the problem! If the client Mac was Apple Silicon, it still insisted on downloading additional content directly from Apple, even if you ran the update using the full, 12 GB installer. There was no way to bypass or cache this. If the OS update required certain types of firmware updates or Rosetta updates, then every Apple Silicon client Mac would still download 1-2 GB of data directly from Apple during the install process.
Even worse, the download process was sometimes farmed out to a separate component in macOS, which didn’t even report progress to the installer! Installing a macOS update at the South Pole meant staring at a window that said “installing, 32 minutes remaining”, for several hours, while a subcomponent of macOS downloaded a gigabyte of un-cacheable data in the background.
Apple naively assumed that the 1 GB download would be so fast that they didn’t bother incorporating download speed feedback into the updater’s time estimate. They did not anticipate people installing macOS updates from a location where a gigabyte of downloads can take several hours, if not days.
You can’t cache it, and you can’t download it directly using a web browser (or other mechanism). You have to let Apple’s downloader do it directly. And, of course, there’s no pause button. It is a major inconvenience to users, and a major waste of bandwidth, for each individual client Mac to download 1-2 GB of data in a single, uninterrupted shot.
Ways that Apple could make this significantly better for users with slow or otherwise-weird Internet:

- Publish the supplemental payloads (firmware, Rosetta, and the like) at stable URLs, so they can be fetched out-of-band and distributed locally, the same way the full installers can.
- Make the caching-server negotiation resilient: longer timeouts, retries, and a visible warning before silently bypassing the cache.
- Add pause and resume to the built-in downloader, and preserve partial progress across failures.
- Feed actual download speed into the installer’s time estimate, instead of assuming the download is nearly instantaneous.
As it stands, it was a huge hassle for me to help people with macOS updates at the South Pole.
My Samsung Android phone receives periodic OS updates. These updates include relevant Android patches, as well as updates to the Samsung UI and other OS components.
The updater is a particularly bad example of an app that fails to consider slow / intermittent Internet use cases.
Downloading an OS update for my Samsung Android phone at the South Pole.
First, the basics. There is no speed indicator, no numeric progress indicator (good luck counting pixels on the moving bar), no pause button, no cancel button, no indicator of the file size, and no way to get at the underlying file to download separately.
Second – if the download fails, it cannot be resumed. It will restart from the beginning.
In practice, at the South Pole, the phone could not download an entire OS update on a single satellite pass. So – it inevitably failed as soon as connectivity dropped, and I had to restart it from the beginning.
The only way I was able to get this done was by turning off the phone entirely, right before Internet access dropped, and then turning it back on when Internet access resumed at the next satellite pass. This tricked the phone into not giving up on the download, because it was totally off during the period without Internet. It never had a chance to fail. By doing this, I was able to spread the download across multiple satellite passes and eventually complete it.
This is an absurd workaround! I should not have had to do this.
My US carrier (Verizon) does offer a downloadable application for macOS and Windows which should, in theory, allow me to flash the OS updates from my computer, instead of relying on the phone to download the patches. In practice, the Verizon app was even worse: buggy, unreliable, and it also insisted on using its own in-app downloader to fetch the update files (sigh…).
I’m sure there’s a way I could have gotten the images and flashed them manually. This is not an invitation for a bunch of Android enthusiasts to email me and explain bootloaders and APKs and ROMs and sideloading and whatever else is involved here. That’s not the point. The point is – the mainstream tools that vendors ship are hopelessly deficient for users on slow Internet, and that’s a bummer.
A small desktop app has an in-app downloader for updates. Can you spot the issues?
Let’s count them:
This is actually one of my favorite desktop apps! It’s a shame to call them out like this. A quick, easy way to make this MUCH better for users at the South Pole would be to provide a manual download link. Then, the developers wouldn’t need to reimplement all the nice download features that my browser provides. I could just use my browser.
Here’s another one!
Downloading another in-app update.
Let’s count the issues:
It does have a few things going for it:
But – overall, still a frustrating user experience for users with slow or intermittent Internet access.
Credit where credit is due – Microsoft has a GREAT auto-updater built into Office for Mac! Check it out:
Downloading Office for macOS updates at the South Pole.
Look at all these nice features!
The only thing that could have made this better is a link to get at the underlying URL, so I could use my own downloader. But, given how good this interface is, I didn’t mind using it, even at the South Pole.
I hope the examples I’ve shown in this post have been a helpful illustration of how minor oversights or under-developed features back home can become major issues in a place with slow Internet.
Again, I’m not asking that every app developer spend a huge amount of time optimizing for edge cases like the South Pole.
And I’m also definitely not asking for people to work miracles. Internet access at the South Pole, as of October 2023, was slow. I don’t expect immersive interactive streaming media to work under the conditions I described here, but it would be nice if apps were resilient enough to get a few bytes of text up or down. Unfortunately, what often ended up happening is that apps got stuck in a loop because of an ill-advised hardcoded timeout.
I hope everyone found this helpful, or at least interesting.
And thank you again to everyone who followed along with me on my Antarctic journey! I’ve been off-ice for about six months now, and going through my old posts here has brought back fond memories.
I hope the current winter-over crew is doing well, and that everyone is enjoying the Polar Night. If the egg supply and consumption rate is the same as it was during Winter 2023, they should soon be finishing up The Last Egg.
I won’t promise any more content, but I do have a handful of other half-finished posts sitting in my drafts. We’ll see!