Tuan Anh

container nerd. k8s || GTFO

The story behind my talk: Cloud Cost Optimization at Scale: How we use Kubernetes and spot instances to reduce EC2 billing up to 80%

This is the story behind my talk: “Cloud Cost Optimization at Scale: How we use Kubernetes and spot instances to reduce EC2 billing up to 80%”.

Now, before I tell this story, I'll admit up front that the actual number was lower than 80%.

2015

The story began in mid-2015 when I was working for one of my ex-employers. It was a .NET Framework shop that, at the time, struggled to scale in both performance and cost. I was hired as a developer to work on API integration, but I couldn't help noticing how much money was being sunk into the AWS EC2 bill. Bear in mind I'm not an ops guy by any means, but you know how startups are: one usually has to wear many hats.

At first, when the AWS credits were still plentiful, we didn't have to worry much about it. But when they ran low, it clearly became one of the biggest pain points for our startup.

The situation at the time was like this:

  • There were 2 teams: the core team using the .NET Framework and the API team using Node.js.
  • The core team mostly used Windows-based instances and the API team used Linux-based ones.
  • The core team used a lot more instances than the API team.
  • Most EC2 instances were Windows-based. All were on-demand instances. No reserved instances whatsoever 😨.
  • A few were Linux-based instances where we installed other Linux-based applications, but there weren't many of them.
  • On-demand Windows-based instances cost about 30% more than Linux-based ones.
  • We used RDS for the database.
  • We didn't have a real ops guy in the sense you'd expect these days. Whenever we needed something set up, we had to page someone from the India team to create the instances for us, then proceeded to set them up ourselves.

Now, the biggest costs were obviously RDS and EC2. If I had been assigned to optimize this, I would definitely have looked at those two first. But I wasn't working on it at that time; I was hired to do other things.

At that time, I used Deis, a container management solution (later acquired by Microsoft), for my projects. I experimented briefly with Flynn but ended up not using it.

2016

Spotinst

In 2016, I heard of a startup called Spotinst. I found several useful posts on their blog about EC2 cost optimization and found their whole startup idea very fascinating. For those of you not working with infrastructure: the whole idea of Spotinst is to use spot instances to reduce your infrastructure cost, and they take a cut of the savings.

Spotinst automates cloud infrastructure to improve performance, reduce complexity and optimize costs.

Spot instances are a very cheap EC2 offering from AWS (think 70-90% cheaper than on-demand) but come with a small problem: they can go away at any time with just a 2-minute notice.

I thought that if we could design our workloads to be fault tolerant and shut down gracefully, spot instances would make perfect sense. Anything like a queue-and-worker workload would fit naturally as well. Web apps, on the other hand, would be a little more difficult, but totally doable.
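In practice, "graceful shutdown" for a Node.js app mostly means handling SIGTERM: stop accepting new connections, let in-flight requests drain, then exit. A minimal sketch of that shape (not production code, just the idea):

// Minimal graceful-shutdown sketch for a Node.js HTTP server
const http = require('http')

const server = http.createServer((req, res) => {
  res.end('ok')
})
server.listen(3000)

// Docker/Kubernetes sends SIGTERM before killing the container.
process.on('SIGTERM', () => {
  // Stop accepting new connections; exit once in-flight requests finish.
  server.close(() => process.exit(0))
  // Safety net: force-exit if draining outlives the grace period.
  setTimeout(() => process.exit(1), 25 * 1000).unref()
})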

Kubernetes

During 2016, I also learnt about this super duper cool project called Kubernetes. I believe it was at version 1.2 at the time.

Kubernetes comes with the promise of many awesome features, but what caught my eye was the “self-healing” feature. A perfect complement to spot instances, I thought.

And so I dug a little deeper to see if I could set up a cluster on spot instances, and it turned out they do support it. Awesome!! 🥰

Now, the only problem left was that our core team still needed Windows, and Kubernetes didn't support Windows at the time. So my whole infrastructure revamp idea was useless, or so I thought.

.NET Core

In mid-2016, I learnt about the .NET Core project. It was around the 1.0 release at the time. One of its headline features is being cross-platform. I thought to myself: I can still salvage this.

Now, please note that I'm a Node.js guy and I don't know much about .NET aside from my university thesis. So I asked the lead from the core team to look into it, and while there were many quirks, it was actually not very difficult to migrate our core to .NET Core. It would be time consuming, but very much doable. I knew .NET Core was going to be the future, so we would eventually need to migrate to it anyway.

Tests + Migration

While the core team did that, I set up a test cluster with spot instances and learnt Kubernetes. I optimized the cluster setup a little and migrated all my projects over by the end of 2016. The whole process was quite fast because all my apps (Node.js) were already Dockerized and had graceful shutdown implemented. I just needed to learn the ins and outs of Kubernetes.

I started with managed GKE at first, using their free $300 credit, to learn the basics of Kubernetes; later on I used kops to set up a production cluster on AWS.

Some of the changes I made for the production cluster were:

  • Set up an instance-termination daemon to notify all the containers, plus graceful shutdown for all the apps (see the sketch after this list).
  • Set up multiple instance groups of various sizes and availability zones, mixing spot instances with reserved instances. This protects against a price spike in any single spot instance group, and minimizes the chance of all spot instances going down at the same time.
  • Calculated and provisioned a slightly bigger fleet than we actually needed, so that when instances were shut off there wouldn't be any service degradation. Because spot instances are so cheap, we could do this without worrying much about the cost.
  • Watched for scheduling failures in order to scale the reserved groups.
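For the curious, the termination daemon doesn't need to be fancy: AWS exposes the spot termination notice via the instance metadata endpoint, so a small poller does the job. A rough sketch of the polling half (the notify-and-drain logic is assumed to live elsewhere):

// Poll the EC2 metadata endpoint for the spot termination notice.
// The endpoint returns 404 until a termination is actually scheduled.
const http = require('http')

function check() {
  http
    .get('http://169.254.169.254/latest/meta-data/spot/termination-time', res => {
      if (res.statusCode === 200) {
        // ~2 minutes left: drain this node so Kubernetes reschedules its pods.
        console.log('Spot termination notice received, draining node...')
      }
      res.resume() // discard the response body
    })
    .on('error', () => {}) // transient metadata-endpoint errors are ignorable
}

setInterval(check, 5 * 1000) // the notice window is 2 minutes, so poll every 5s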

2017

At this point, our API apps' EC2 cost was already very manageable. We were waiting for the core team to migrate over, and we did that in 2017. The overall EC2 cost saving was around 60-70%, because we needed to mix reserved instances in and provision a little more than we actually needed. We were very happy with the result.

What we did back then is essentially what Spotinst does, but at a much smaller scale. And it's quite doable for a smaller startup with only one ops guy.

And that is my story behind the talk: “Cloud Cost Optimization at Scale: How we use Kubernetes and spot instances to reduce EC2 billing up to 80%”.

Thoughts on Workers KV

Infrequent write / frequent read

I tried to build a todobackend.com implementation with Cloudflare Workers and Workers KV. However, the specs runner kept failing, inconsistently.

Meaning the specs would pass one run and fail the next. Manual tests usually didn't have this problem. This told me that Workers KV writes are not synchronous, or that the data replication is slow.

Turns out, it's mentioned right there in the Workers KV docs; emphasis is mine.

Workers KV is generally good for use-cases where you need to write relatively infrequently, but read quickly and frequently. It is optimized for these high-read applications, only reaching its full performance when data is being frequently read. Very infrequently read values are stored centrally, while more popular values are maintained in all of our data centers around the world.

KV achieves this performance by being eventually-consistent. New key-value pairs are immediately available everywhere, but value changes may take up to 60 seconds to propagate. Workers KV isn’t ideal for situations where you need support for atomic operations or where values must be read and written in a single transaction.

With this, the todobackend.com specs runner would never pass.

The Workers KV API is quite simple for now, and I do hope they keep it that way, or maybe have it resemble the Redis API with some more data types like lists / sorted sets. That would be lovely.
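For reference, the surface is more or less just this (a sketch assuming a KV namespace bound as MY_KV, inside a request handler):

// Workers KV basics, given a namespace binding named MY_KV
await MY_KV.put('todo:1', JSON.stringify({ title: 'hello' }))
const value = await MY_KV.get('todo:1') // reads elsewhere may be stale for up to ~60s
await MY_KV.delete('todo:1')
const { keys } = await MY_KV.list({ prefix: 'todo:' })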

Batch load is not yet supported

It's on the roadmap but not yet available. So for now, the result of .list() has to be mapped with a Promise.all, like this:

const { keys } = await myKvStore.list()
const values = await Promise.all(keys.map(key => myKvStore.get(key.name)))

With these limitations, Workers KV is more suitable for keeping build assets or anything else that doesn't need close-to-real-time propagation. Keep that in mind when you want to build something with Workers KV; also watch this space, as Cloudflare is moving pretty fast.

Some other limits can be found in the Cloudflare Workers docs.

reader

Another experiment with Cloudflare Workers. I haven't used Workers KV here, though.

reader is a service that mimics the browser's reader mode and lets users share the reader-mode view on the web. It's still super buggy right now because the lib I use is quite abandoned at the moment. I just wanted to whip out something that works first.


Some things I learnt from reading the Cloudflare Workers docs while doing this:

  • HTMLRewriter is delightful, even though I didn't get to use it (much) in this small project (see the sketch right after this list).
  • Workers KV is another nice bit from them. With it, you could probably build a complete web app.
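For a taste, HTMLRewriter transforms HTML as it streams through the Worker, with CSS-selector-based handlers. A tiny sketch (a made-up transform, not what reader does):

// Stream-rewrite proxied HTML: mark every image as lazy-loading
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  const response = await fetch(request)
  return new HTMLRewriter()
    .on('img', {
      element(el) {
        el.setAttribute('loading', 'lazy')
      }
    })
    .transform(response)
}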

Next, I'm going to look at Workers KV and HTMLRewriter more, in an attempt to build something that uses both of those features.

link to the original article

Experiment with Cloudflare Workers

I've been meaning to try Cloudflare Workers with my blog. Given that it's a static website, it should be straightforward to do.

They (Cloudflare) make it incredibly easy to migrate. Their tutorial works just fine, with one minor exception regarding the DNS setup. The whole process took like 5 minutes overall.

I just had to do one additional step: set up an A record pointing my domain at 192.0.2.1 so that it resolves to Cloudflare Workers.

The site is now up at https://tuananh.net. Eventually, if all goes well, I think I'm gonna migrate my stuff currently on RamNode over to them.

link to the original article

Brag document

There’s this idea that, if you do great work at your job, people will (or should!) automatically recognize that work and reward you for it with promotions / increased pay. In practice, it’s often more complicated than that – some kinds of important work are more visible/memorable than others. It’s frustrating to have done something really important and later realize that you didn’t get rewarded for it just because the people making the decision didn’t understand or remember what you did. So I want to talk about a tactic that I and lots of people I work with have used!

I’ve been doing this for years and it really works. Highly recommend you give this post a read.

link to the original article

Debugging with git bisect

Suppose I have this project with 5 commits. You can clone it from here.


Say there's a regression bug in the master branch, but a lot has been added to master since the feature was first introduced. How would I go about debugging this? Which commit broke it?

Usually, we would go through the commits manually to see which one could possibly have done this, but if the project is large and active, it's quite a troublesome process.

Luckily, we have git bisect for that.

  • Go to the project and issue git bisect start.
  • Mark the bad commit with git bisect bad <commit-id>. You can omit the commit id if you're already on it.
  • Mark the good commit with git bisect good <commit-id>.
  • Add a test for the regression bug. In this case, it's the add() function.

I'm gonna go ahead and add a failing test case for the regression bug I'm having. You may ask: why not add tests in the first place? Well, this is just an example, so I have zero tests for it.

In a real scenario, you can have an extensive test suite and still miss an edge case. In that scenario, this is where you add the failing test for that edge case.
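For this walkthrough, imagine add.js regressed into something like this (hypothetical; the real code is in the repo):

// add.js - the regressed version that git bisect will pin down
module.exports = function add(a, b) {
  return a - b // regression: should be a + b
}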

// test.js
const assert = require('assert')
const add = require('./add')

assert(add(1,2) === 3, 'one plus two should equal to three')
  • Do git bisect run <test-command>. In this case, it would be git bisect run node test.js.

  • Do git bisect log and see the result. It would look like this.

# bad: [addb180af061bbfbad298cd6a9ad2110df0f873e] feat: add multiply
git bisect bad addb180af061bbfbad298cd6a9ad2110df0f873e
# good: [7688391b1a9b133bef92198e376c9f5979260ade] feat: add add() function
git bisect good 7688391b1a9b133bef92198e376c9f5979260ade
# bad: [d504f94f1d71c93deb9d9bbdf87bfe333bbecff6] chore: add readme
git bisect bad d504f94f1d71c93deb9d9bbdf87bfe333bbecff6
# bad: [d516aaf29331953382a8558f013b683427d7a390] feat: add subtract() function
git bisect bad d516aaf29331953382a8558f013b683427d7a390
# first bad commit: [d516aaf29331953382a8558f013b683427d7a390] feat: add subtract() function

There you can see that the first commit that makes the test fail is [d516aaf29331953382a8558f013b683427d7a390] feat: add subtract() function.

  • Do git bisect reset when you’re done.

Happy coding!

The state of tiling window managers on Windows 10

The state

1. Workspacer

  • Open source.
  • Best in terms of feature set.
  • Closest thing to an actual tiling window manager on Windows 10.
  • Has some weird bugs that are quite annoying.
  • Development velocity is slow.

2. Bug.n

  • Open source.
  • Written in a scripting language, AutoHotkey. Requires AutoHotkey to be installed.

3. PowerToys

  • Open-sourced by Microsoft.
  • Not really tiling; it's still a manual process.
  • Very stable for daily use.
  • Not really focused on window management; it's just one of the features.
  • Development velocity is very fast.

Conclusion

I settled on PowerToys for now but am keeping an eye on workspacer. I love workspacer's feature set; it actually has everything I need but lacks stability.

A beginner's guide to CPU air cooling

I was researching CPU air coolers, looking for the best one, and learnt a lot of stuff. I thought this might be useful to some of you. Most of the stuff below is copied and pasted from various sources. This serves as a basic guide for absolute beginners, like me, when it comes to CPU air cooling.

TL;DR: The Noctua NH-D15 is the best CPU air cooler out there.

Reading fan specs

CFM, m³/h, RPM, dB(A) and mm H2O

Every fan features a cubic feet per minute (CFM) rating, which measures the volume of air it moves in a minute. The greater the CFM, the more air a fan moves. To properly air-cool your computer, you need to have enough case fans to push or pull air into and out of the case. More case fans means higher total CFM and more air being moved through your computer.

Some manufacturers use m³/h as the unit. One m³/h is ~0.589 CFM and one CFM is ~1.699 m³/h. You can use a website like ConvertUnits.com to convert them.
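Or, since it's just a multiplication, a quick sketch in Node.js:

// CFM <-> m³/h conversion (1 CFM ≈ 1.699 m³/h)
const cfmToM3h = cfm => cfm * 1.699
const m3hToCfm = m3h => m3h / 1.699

console.log(cfmToM3h(60))  // ≈ 101.9 m³/h
console.log(m3hToCfm(100)) // ≈ 58.9 CFM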

The airflow always moves from the front of the fan to the back.

Other specs, like RPM (revolutions per minute) and dB(A), the noise level, are kind of self-explanatory. Higher RPM means more cooling but also more noise. Lower dB(A) is better.

You may also notice mm H2O, which stands for millimeters of water: a measurement of “static pressure”, the amount of negative pressure it takes to make a fan come to a complete stop at a given RPM. As you can see, it's highly dependent on RPM.

Fan connector pins

A three-pin connector is basically power (5/12 volts), ground, and signal. The signal wire measures how fast the fan is spinning, without any control over the fan's speed. With this type, fan speed is typically controlled by increasing or decreasing the voltage on the power wire.

A four-pin connector is a little different from the three-pin connector: it has an extra (fourth) wire used for controlling and sending signals to the fan, which likely has a chip on it that tells it to slow down or speed up (in addition to the other wires the three-pin connector has). You probably don't need to know more than that.

Positive / Neutral / Negative air pressure

Explained by Linus Tech Tips

Or, in short: positive pressure means the intake fans push more air into the case than the exhaust fans pull out, negative pressure is the opposite, and neutral pressure means intake and exhaust are roughly balanced.

Selecting a CPU air cooler

Once you've found an air cooler that you like, make sure it can fit inside your case. Read your case specs and look for the CPU cooler clearance.

To be sure, you can also go to a site like PCPartPicker and find complete builds featuring the case and cooler you intend to use.


Advanced filtering and sorting with redis (part 2)

With the recent introduction of Redis modules (since Redis v4), Redis is now a lot more flexible than the old Redis.

Previously, in part 1, if you wanted to mimic sorting and filtering behavior, you had to use sets/sorted sets and do the intersections/unions yourself.

Not anymore.

Meet RediSQL

RediSQL is an in-memory SQL engine, built on top of Redis as a Redis module.

It's pretty much SQL under the hood now. No more smart tricks to mimic the behavior.

The downside is that there aren't many Redis clients that support browsing data for these modules, aside from the newly released RedisInsight, which currently only supports RedisGraph, RediSearch and RedisTimeSeries. This makes debugging really troublesome, which is a big showstopper for me. Just something for you to keep in mind.

RedisGraph

RedisGraph is a graph database module for Redis. It's specifically built for graph workloads but can be used for filtering as well. It's kind of using the wrong tool for the purpose, though: RedisGraph is a lot more powerful than just filtering and sorting.

Example of doing filtering in RedisGraph

Loading data

Each hotel is created in a single query so that all relationships attach to the same Property node (separate CREATE statements would create duplicate Property nodes).

GRAPH.QUERY TestGraph "CREATE (p:Property {id: '1', name: 'hotel 1'}), (p)-[:hasFacility]->(:Facility {id: '1', name: 'Swimming pool'}), (p)-[:inCity]->(:City {id: '1', name: 'Hanoi'}), (p)-[:hasStarRating]->(:Rating {id: '4', name: '4 star'})"

GRAPH.QUERY TestGraph "CREATE (p:Property {id: '2', name: 'hotel 2'}), (p)-[:hasFacility]->(:Facility {id: '2', name: 'Spa'}), (p)-[:inCity]->(:City {id: '1', name: 'Hanoi'}), (p)-[:hasStarRating]->(:Rating {id: '3', name: '3 star'})"

Filter all 3 star hotels in Hanoi

GRAPH.QUERY TestGraph "MATCH (h:Property)-[:inCity]->(c:City), (h)-[:hasStarRating]->(r:Rating) WHERE c.name = 'Hanoi' AND r.name = '3 star' RETURN h.id, h.name"

Pi-hole

I've heard a lot of praise for the Pi-hole project but hadn't gotten around to actually trying it until recently.

Pi-hole is a network-wide ad-blocking solution via local DNS. You set it up as a local DNS server, and it blocks all the ads that match its rules at the DNS level. This way, you don't have to set up an ad blocker on each and every device you have, especially tablets and mobiles.

People usually use it with a low-powered device like a Raspberry Pi (hence the name Pi-hole), but in my case, I already have an Intel NUC around as a Plex server (running Windows 10). I could just use that instead of setting up something new.

The easiest way to install Pi-hole is with Docker. The process is as easy as:

  • Install Docker Desktop.
  • Create a few folders for the Pi-hole config. Let's do it in the Documents folder and mount it into the container. If you change it to something else, make sure to update the following commands. Create the folders with the following structure:
pi-hole-config/
├── dnsmasq.d/
└── pihole/
  • Download and run the Pi-hole Docker container with the following command:
docker run -d --name pihole \
    -p 53:53/tcp \
    -p 53:53/udp \
    -p 80:80 \
    -p 443:443 \
    -v "/c/Users/<USERNAME>/Documents/pi-hole-config/pihole/:/etc/pihole/" \
    -v "/c/Users/<USERNAME>/Documents/pi-hole-config/dnsmasq.d/:/etc/dnsmasq.d/" \
    -e ServerIP="<YOUR_HOST_IP>" \
    --dns=127.0.0.1 \
    --dns=1.1.1.1 \
    -e WEBPASSWORD=<PASSWD> \
    --restart=unless-stopped pihole/pihole:latest
  • I needed to disable Windows Firewall for the local network (which I think is safe to do at home) in order to access the container from other machines.

And that's it. Now you can head over to HOST_IP/admin to log in and configure Pi-hole. Once that's done, you can configure your router to use HOST_IP as the default DNS server, with maybe Cloudflare's or Google's DNS as backup. Ah, and make sure to set a static IP for your Pi-hole host machine so you don't have to update the router's settings if the IP changes.

Keep in mind that Pi-hole is not a complete replacement for browser extensions like uBlock Origin. You probably still need those, because DNS-based ad blocking is quite limited. It's mostly useful for mobile browsers, where ad blocking is almost non-existent or not good enough.