Latacora

A Child’s Garden of Inter-Service Authentication Schemes

Modern applications tend to be composed from relationships between smaller applications. Secure modern applications thus need a way to express and enforce security policies that span multiple services. This is the “server-to-server” (S2S) authentication and authorization problem (for simplicity, I’ll mash both concepts into the term “auth” for most of this post).

Designers today have a lot of options for S2S auth, but there isn’t much clarity about what the options are or why you’d select any of them. Bad decisions sometimes result. What follows is a stab at clearing the question up.

Cast Of Characters

Alice and Bob are services on a production VPC. Alice wants to make a request of Bob. How can we design a system that allows this to happen?

Here’s, I think, a pretty comprehensive overview of available S2S schemes. I’ve done my best to describe the “whats” and minimize the “whys”, beyond just explaining the motivation for each scheme. Importantly, these are all things that reasonable teams use for S2S auth.

Nothing At All

Far and away the most popular S2S scheme is “no auth at all”. Internet users can’t reach internal services. There’s little perceived need to protect a service whose only clients are already trusted.

Bearer Token

Bearer tokens rule everything around us. Give Alice a small blob of data, such that when Bob sees that data presented, he assumes he’s talking to Alice. Cookies are bearer tokens. Most API keys are bearer tokens. OAuth is an elaborate scheme for generating and relaying bearer tokens. SAML assertions are delivered in bearer tokens.

The canonical bearer token is a random string, generated from a secure RNG, that is at least 16 characters long (that is: we generally consider 128 bits a reasonable common security denominator). But part of the point of a bearer token is that the holder doesn’t care what it is, so Alice’s bearer token could also encode data that Bob could recover. This is common in client-server designs and less common in S2S designs.
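To make this concrete, here’s a minimal sketch in Python (the function names are ours, not any particular library’s): mint from the OS CSPRNG, verify with a constant-time compare.

    import secrets

    # 16 bytes from the OS CSPRNG: the 128-bit common denominator.
    def mint_token():
        return secrets.token_hex(16)

    # Bob looks up the token he has on file for Alice and compares in
    # constant time, so timing doesn't leak how much of a guess matched.
    def check_token(presented, on_file):
        return secrets.compare_digest(presented, on_file)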

A few words about passwords

S2S passwords are disappointingly common. You see them in a lot of over-the-Internet APIs (ie, for S2S relationships that span companies). A password is basically a bearer token that you can memorize and quickly type. Computers are, in 2018, actually pretty good at memorizing and typing, and so you should use real secrets, rather than passwords, in S2S applications.

HMAC(timestamp)

The problem with bearer tokens is that anybody who has them can use them. And they’re routinely transmitted. They could get captured off the wire, or logged by a proxy. This keeps smart ops people up at night, and motivates a lot of “innovation”.

You can keep the simplicity of bearer tokens while avoiding the capture-in-flight problem by exchanging the tokens for shared secrets, and using those secrets to authenticate a timestamp rather than sending a credential over the wire. A valid HMAC proves ownership of the shared secret without revealing it. You’d then proceed as with bearer tokens.
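Here’s a rough sketch of the idea in Python, assuming a pre-shared per-client secret and a freshness window you’d tune for clock skew:

    import hashlib
    import hmac
    import time

    SHARED_SECRET = b"provisioned-out-of-band"   # one secret per client pair
    WINDOW = 300                                 # seconds of skew we'll tolerate

    def make_credential():
        ts = str(int(time.time()))
        tag = hmac.new(SHARED_SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return ts, tag   # Alice sends both; the secret itself never travels

    def verify_credential(ts, tag):
        expected = hmac.new(SHARED_SECRET, ts.encode(), hashlib.sha256).hexdigest()
        fresh = abs(time.time() - int(ts)) <= WINDOW
        return fresh and hmac.compare_digest(tag, expected)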

A few words about TOTP

TOTP is basically HMAC(timestamp) stripped down to make it easy for humans to briefly memorize and type. As with passwords, you shouldn’t see TOTP in S2S applications.

A few words about PAKEs

PAKEs are a sort of inexplicably popular cryptographic construction for securely proving knowledge of a password and, from that proof, deriving an ephemeral shared secret. SRP is a PAKE. People go out of their way to find applications for PAKEs. The thing to understand about them is that they’re fundamentally a way to extract cryptographic strength from passwords. Since this isn’t a problem computers have, PAKEs don’t make sense for S2S auth.

Encrypted Tokens

HMAC(timestamp) is stateful; it works because there’s pairwise knowledge of secrets and the metadata associated with them. Usually, this is fine. But sometimes it’s hard to get all the parties to share metadata.

Instead of making that metadata implicit to the protocol, you can store it directly in the credential: include it alongside the timestamp and HMAC or encrypt it. This is how Rails cookie storage works; it’s also the dominant use case for JWTs. AWS-style request “signing” is another example (using HMAC and forgoing encryption).
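For flavor, here’s one way an encrypted token with embedded metadata might look, using PyNaCl’s SecretBox (the claim layout is made up for illustration):

    import json
    import time
    from nacl.secret import SecretBox
    from nacl.utils import random

    key = random(SecretBox.KEY_SIZE)   # held by whoever mints and reads tokens

    def mint_token(claims):
        claims["iat"] = int(time.time())   # metadata lives in the credential
        return SecretBox(key).encrypt(json.dumps(claims).encode())

    def read_token(token):
        # decrypt() authenticates before decrypting; tampering raises CryptoError
        return json.loads(SecretBox(key).decrypt(token))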

By themselves, encrypted tokens make more sense in client-server settings than they do for S2S. Unlike client-server, where a server can just use the same secret for all the clients, S2S tokens still require some kind of pairwise state-keeping.

Macaroons

You can’t easily design a system where Alice takes her encrypted token, reduces its security scope (for instance, from read-write to read-only), and then passes it to Dave to use on her behalf. No matter how “sophisticated” we make the encoding and transmission mechanisms, encrypted tokens still basically express bearer logic.

Macaroons are an interesting (and criminally underused) construction that directly provides both delegation and attenuation. They’re a kind of token from which you can derive more restricted tokens (that’s the “attenuation”), and, if you want, pass that token to someone else to use without them being able to exceed the authorization you gave them. Macaroons accomplish this by chaining HMAC; the HMAC of a macaroon is the HMAC secret for its derived attenuated macaroons.
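The core trick fits in a few lines. This sketch leaves out real Macaroons’ identifiers and third-party caveats, but shows why attenuation works: each signature is the HMAC key for the next caveat, so a holder can add restrictions but never remove them.

    import hashlib
    import hmac

    def chain(key, caveat):
        return hmac.new(key, caveat, hashlib.sha256).digest()

    root_key = b"issuer-only-secret"
    caveats = [b"service = reports", b"expires = 2018-06-01"]
    sig = root_key
    for c in caveats:
        sig = chain(sig, c)

    # Alice attenuates without the root key: the current signature *is*
    # the HMAC key for the caveat she appends.
    caveats.append(b"scope = read-only")
    sig = chain(sig, caveats[-1])

    # Bob, who shares the root key with the issuer, replays the chain.
    check = root_key
    for c in caveats:
        check = chain(check, c)
    assert hmac.compare_digest(check, sig)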

By adding encryption along with HMAC, Macaroons also express “third-party” conditions. Alice can get Charles to attest that Alice is a member of the super-awesome-best-friends-club, and include that in the Macaroon she delivers to Bob. If Bob also trusts Charles, Bob can safely learn whether Alice is in the club. Macaroons can flexibly express whole trees of these kinds of relationships, capturing identity, revocation, and… actually, revocation and identity are the only two big wins I can think of for this feature.

Asymmetric Tokens

You can swap the symmetric constructions used in tokens for asymmetric constructions and get some additional properties.

Using signatures instead of HMACs, you get non-repudiability: Bob can verify Alice’s token, but can’t necessarily mint a new Alice token himself.

More importantly, you can eliminate pairwise configuration. Bob and Alice can trust Charles, who doesn’t even need to be online all the time, and from that trust derive mutual authentication.

The trade-offs for these capabilities are speed and complexity. Asymmetric cryptography is much slower and much more error-prone than symmetric cryptography.

Mutual TLS

Rather than designing a new asymmetric token format, every service can have a certificate. When Alice connects to Bob, Bob can check a whitelist of valid certificate fingerprints, and whether Alice’s name on her client certificate is allowed. Or, you could set up a simple CA, and Bob could trust any certificate signed by the CA. Things can get more complex; you might take advantage of X.509 and directly encode claims in certs (beyond just names).
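Here’s roughly what the fingerprint-whitelist flavor looks like with Python’s ssl module (file names and the whitelist entries are placeholders):

    import hashlib
    import ssl

    ALLOWED = {"<hex sha-256 of alice's client cert>"}

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("bob-cert.pem", "bob-key.pem")
    ctx.verify_mode = ssl.CERT_REQUIRED           # no client cert, no handshake
    ctx.load_verify_locations("internal-ca.pem")  # the "simple CA" variant

    def peer_is_allowed(conn):
        der = conn.getpeercert(binary_form=True)
        return hashlib.sha256(der).hexdigest() in ALLOWED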

A few words about SPIFFE

If you’re a Kubernetes person, this scheme is also sometimes called SPIFFE.

A few words about Tokbind

If you’re a participant in the IETF TLS Working Group, you can combine bearer tokens and MTLS using tokbind. Think of tokbind as a sort of “TLS cookie”. It’s derived from the client and server certificate and survives multiple TLS connections. You can use a tokbind secret to sign a bearer token, resulting in a bearer token that is confined to a particular MTLS relationship that can’t be used in any other context.

Magic Headers

Instead of building an explicit application-layer S2S scheme, you can punt the problem to your infrastructure. Ensure all requests are routed through one or more trusted, stateful proxies. Have the proxies set headers on the forwarded requests. Have the services trust the headers.
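The whole trick, as a WSGI middleware sketch (the header name and proxy address are hypothetical). The part that matters is stripping the identity header from anything that didn’t arrive via a trusted proxy:

    TRUSTED_PROXIES = {"10.0.0.2"}   # the stateful proxies you control

    def trust_magic_headers(app):
        def middleware(environ, start_response):
            if environ.get("REMOTE_ADDR") not in TRUSTED_PROXIES:
                # Never trust an identity header a client could have set itself.
                environ.pop("HTTP_X_INTERNAL_IDENTITY", None)
            return app(environ, start_response)
        return middleware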

This accomplishes the same things a complicated Mutual TLS scheme does without requiring slow, error-prone public-key encryption. The trade-off is that your policy is directly coupled to your network infrastructure.

Kerberos

You can try to get the benefits of magic headers and encrypted tokens at the same time using something like Kerberos, where there’s a magic server trusted by all parties, but bound by cryptography rather than network configuration. Services need to be introduced to the Kerberos server, but not to each other; mutual trust of the Kerberos server, and authorization logic that lives on that Kerberos server, resolves all auth questions. Notably, no asymmetric cryptography is needed to make this work.

Themes

What are the things we might want to achieve from an S2S scheme? Here’s a list. It’s incomplete. Understand that it’s probably not reasonable to expect all of these things from a single scheme.

Minimalism

This goal is less obvious than it seems. People adopt complicated auth schemes without clear rationales. It’s easy to lose security by doing this; every feature you add to an application – especially security features – adds attack surface. From an application security perspective, “do the simplest thing you can get away with” has a lot of merit. If you understand and keep careful track of your threat model, “nothing at all” can be a security-maximizing option. Certainly, minimalism motivates a lot of bearer token deployments.

The opposite of minimalism is complexity. A reasonable way to think about the tradeoffs in S2S design is to think of complexity as a currency you have to spend. If you introduce new complexity, what are you getting for it?

Claims

Authentication and authorization are two different things: who are you, and what are you allowed to do? Of the two problems, authorization is the harder one. An auth scheme can handle authorization, or assist authorization, or punt on it altogether.

Opaque bearer token schemes usually just convey identity. An encrypted token, on the other hand, might bind claims: statements that limit the scope of what the token enables, or metadata about the identity of the requestor.

Schemes that don’t bind claims can make sense if authorization logic between services is straightforward, or if there’s already a trusted system (for instance, a service discovery layer) that expresses authorization. Schemes that do bind claims can be problematic if the claims carried in a credential can be abused, or targeted by application flaws. On the other hand, an S2S scheme that supports claims can do useful things like propagating on-behalf-of requestor identities or supporting distributed tracing.

Confinement

The big problem with HTTP cookies is that once an attacker has captured one, they can abuse it however they see fit. You can do better than that by adding mitigations or caveats to credentials. They might be valid only for a short period of time, or valid only for a specific IP address (especially powerful when combined with short expiry), or, as in the case of Tokbind, valid only on a particular MTLS relationship.

Statelessness

Statelessness means Bob doesn’t have to remember much (or, ideally, anything) about Alice. This is an immensely popular motivator for some S2S schemes. It’s perceived as eliminating a potential performance bottleneck, and as simplifying deployment.

The tricky thing about statelessness is that it often doesn’t make sense to minimize state, only to eliminate it. If pairwise statefulness creeps back into the application for some other reason (for instance, Bob has to remember anything at all about Alice), stateless S2S auth can spend a lot of complexity for no real gain.

Pairwise Configuration

Pairwise configuration is the bête noire of S2S operational requirements. An application secret that has to be generated once for each of several peers and that anybody might ever store in code is part of a scheme in which secrets are never, ever rotated. In a relatively common set of circumstances, pairwise config means that new services can only be introduced during maintenance windows.

Still, if you have a relatively small and stable set of services (or if all instances of a particular service might simply share a credential), it can make sense to move complexity out of the application design and into the operational requirements. Also it makes sense if you have an ops team and you never have to drink with them.

I kid, really, because if you can get away with it, not spending complexity to eliminate pairwise configuration can make sense. Also, many of the ways S2S schemes manage to eliminate pairwise configurations involve introducing yet another service, which has a sort of constant factor cost that can swamp the variable cost.

Delegation and Attenuation

People deploy a lot of pointless delegation. Application providers might use OAuth for their client-server login, for instance, even though no third-party applications exist. The flip side of this is that if you actually need delegation, you really want to have it expressed carefully in your protocol. The thing you don’t want to do is ever share a bearer token.

Delegation can show up in internal S2S designs as a building block. For instance, a Macaroon design might have a central identity issuance server that grants all-powerful tokens to systems that in turn filter them for specific requestors.

Some delegation schemes have implied or out-of-band attenuation. For instance, you might not be able to look at an OAuth token and know what its restrictions are. These systems are rough in practice; from an operational security perspective, your starting point probably needs to be that any lost token is game-over for its owner.

A problem with writing about attenuation is that Macaroons express it so well that it’s hard to write about its value without lapsing into the case for Macaroons.

Flexibility

If you use JSON as your credential format, and you later build a feature that allows a credential to express not just Alice’s name but also whether she’s an admin, you can add that feature without changing the credential format. Later, attackers can add the feature where they turn any user into an admin, and you can then add the feature that breaks that attack. JSON is just features all the way down.

I’m only mostly serious. If you’re doing something more complicated than a bearer token, you’re going to choose an extensible mechanism. If not, I already made the case for minimalism.

Coupling

All things being equal, coupling is bad. If your S2S scheme is expressed by network controls and unprotected headers, it’s tightly coupled to the network deployment, which can’t change without updating the security scheme. But if your network configuration doesn’t change often, that limitation might save you a lot of complexity.

Revocation

People talk about this problem a lot. Stateless schemes have revocation problems: the whole point of a stateless scheme is for Bob not to have to remember anything about Alice (other than perhaps some configuration that says Alice is allowed to make requests, but not Dave, and this gets complicated really quickly and can quickly call into question the value of statelessness but let’s not go there). At any rate: a stateless bearer token will eventually be compromised, and you can’t just let it get used over and over again to steal data.

The two mainstream answers to this problem are short expiry and revocation lists.

Short expiry addresses revocation if: (a) you have a dedicated auth server, and the channel to that server is somehow more secure than the channel between Alice and Bob; (b) the auth server relies on a long-lived secret that never appears on the less-secure channel; and (c) the auth server issues an access secret that is transmitted on the less-secure channel, but lives only for a few minutes. These schemes are called “refresh token” schemes. Refresh tends to find its way into a lot of designs where this fact pattern doesn’t hold. Security design is full of wooden headphones and coconut phones.

Revocation lists (and, usually, some attendant revocation service) are a sort of all-purpose solution to this problem; you just blacklist revoked tokens, for at least as long as the lifetime of the token. This obviously introduces state, but it’s a specific kind of state that doesn’t (you hope) grow as quickly as your service does. If it’s the only state you have to keep, it’s nice to have the flexibility of putting it wherever you want.

Rigidity

It is hard to screw up a random bearer token. Alice stores the token and supplies it on requests. Bob uses the token to look up an entry in a database. There aren’t a lot of questions.

It is extraordinarily easy to screw up JWT. JWT is a JSON format where you have to parse and interpret a JSON document to figure out how to decrypt and authenticate a JSON document. It has revived bugs we thought long dead, like “repurposing asymmetric public keys as symmetric private keys”.

Problems with rigidity creep up a lot in distributed security. The first draft of this post said that MTLS was rigid; you’re either speaking TLS with a client cert or you’re not. But that ignores how hard X.509 validation is. If you’re not careful, an attacker can just ask Comodo for a free email certificate and use it to access your services. Worse still, MTLS can “fail open” in a way that TLS sort of doesn’t: if a service forgets to check for client certificates, TLS will still get negotiated, and you might not notice until an attacker does.

Long story short: bearer tokens are rigid. JWT is a kind of evil pudding. Don’t use JWT.

Universality

A nice attribute of widely deployed MTLS is that it can mitigate SSRF bugs (the very bad bug where an attacker coerces one of your services to make an arbitrary HTTP request, probably targeting your internal services, on their behalf). If the normal HTTP-request-generating code doesn’t add a client certificate, and every internal service needs to see one to honor a request, you’ve limited the SSRF attacker’s options a lot.

On the other hand, we forget that a lot of our internal services consist of code that we didn’t write. The best example of this is Redis, which for years proudly waved the banner of “if you can talk to it, you already own the whole application”.

It’s helpful if we can reasonably expect an auth control to span all the systems we use, from Postgres to our custom revocation server. That might be a realistic goal with Kerberos, or with network controls and magic headers; with tunnels or proxies, it’s even something you can do with MTLS – this is a reason MTLS is such a big deal for Kubernetes, where it’s reasonable for the infrastructure to provide every container with an MTLS-enabled Envoy proxy. On the other hand it’s unlikely to be something you can achieve with Macaroons or evil puddings.

Performance and Complexity

If you want performance and simplicity, you probably avoid asymmetric crypto, unless your request frequency is (and will remain) quite low. Similarly, you’d probably want to avoid dedicated auth servers, especially if Bob needs to be in constant contact with them for Alice to make requests to him; this is a reason people tend to migrate away from Kerberos.

Our Thoughts

Do the simplest thing that makes sense for your application right now. A true fact we can relate from something like a decade of consulting work on these problems: intricate S2S auth schemes are not the norm; if there’s a norm, it’s “nothing at all except for ELBs”. If you need something, but you have to ask whether that something oughtn’t just be bearer tokens, then just use bearer tokens.

Unfortunately, if there’s a second norm, it’s adopting complicated auth mechanisms independently or, worse, in combination, and then succumbing to vulnerabilities.

Macaroons are inexplicably underused. They’re the Velvet Underground of authentication mechanisms, hugely influential but with little radio airplay. Unlike the Velvets, Macaroons aren’t overrated. They work well for client-server auth and for S2S auth. They’re very flexible but have reassuring format rigidity, and they elegantly take advantage of just a couple of simple crypto operations. There are libraries for all the mainstream languages. You will have a hard time coming up with a scenario where we’d try to talk you out of using them.

JWT is a standard that tries to do too much and ends up doing everything haphazardly. Our loathing of JWT motivated this post, but this post isn’t about JWT; we’ll write more about it in the future.

If your inter-service auth problem really decomposes to inter-container (or, without containers, inter-instance) auth, MTLS starts to make sense. The container-container MTLS story usually involves containers including a proxy, like Envoy, that mediates access. If you’re not connecting containers, or have ad-hoc components, MTLS can really start to take on a CORBA feel: random sidecar processes (here stunnel, there Envoy, and this one app that tries to do everything itself). It can be a pain to configure properly, and this is a place you need to get configurations right.

If you can do MTLS in such a way that there is exactly one way all your applications use it (probably: a single proxy that all your applications install), consider MTLS. Otherwise, be cautious about it.

Beyond that, we don’t want to be too much more prescriptive. Rather, we’d just urge you to think about what you’re actually getting from an S2S auth scheme before adopting it.

(But really, you should just use Macaroons.)

Gripes with Google Groups

If you’re like me, you think of Google Groups as the Usenet client turned mailing list manager. If you’re a GCP user, or maybe one of a handful of SAML users, you probably know Google Groups as an access control mechanism. The bad news is we’re both right.

This can blow up if permissions on those groups aren’t set right. Your groups were probably originally created by a sleep-deprived founder way before anyone was worried about access control. They’ve been lovingly handcrafted and never audited since. Let’s say their configuration is, uh, “inconsistent”. If an administrator adds people to the right groups as part of their on-boarding, it’s not obvious when group membership is secretly self-service. Even if someone can’t join a group, they might still be able to read it.

You don’t even need something using group membership as access control for this to go south. The simplest way is a password reset email. (Having a list of all of your vendors feels like a dorky compliance requirement, but it’s underrated. Being able to audit which ones have multi-factor authentication is awesome.)

Some example scenarios:

Scenario 1 You get your first few customers and start seeing fraud. You create a mailing list with the few folks who want to talk about that topic. Nobody imagined that dinky mailing list would grow into a full-fledged team, let alone one with permissions to a third party analytics suite that has access to all your raw data.

Scenario 2 Engineering team treats their mailing list as open access for the entire company. Ops deals with ongoing incidents candidly and has had bad experiences with nosy managers looking for scapegoats. That’s great until someone in ops extends an access control check in some custom software that gates on ops@ to also include engineering@.

Scenario 3 board@ gets a new investor who insists on using their existing email address. An administrator confuses the Google Groups setting for allowing out-of-domain addresses with allowing out-of-domain registration. Everyone on the Internet can read the cap table for your next funding round.

This is a mess. It bites teams that otherwise have their ducks in a row. Cleaning it up gets way worse down the line. Get in front of it now and you probably won’t have to worry about it until someone makes you audit it, which is probably 2-3 years from now.

Google Groups has some default configurations for new groups these days:

  • Public (Anyone in ${DOMAIN} can join, post messages, view the members list, and read the archives.)
  • Team (Only managers can invite new members, but anyone in ${DOMAIN} can post messages, view the members list, and read the archives.)
  • Announcement-only (Only managers can post messages and view the members list, but anyone in ${DOMAIN} can join and read the archives.)
  • Restricted (Only managers can invite new members. Only members can post messages, view the members list, and read the archives. Messages to the group do not appear in search results.)

This is good but doesn’t mean you’re out of the woods:

  • These are just defaults for access control settings. Once a group is created, you get to deal with the combinatorial explosion of options. Most of them don’t really make sense. You probably don’t know when someone messes with the group, though.
  • People rarely document intent in the group description (or anywhere for that matter). When a group deviates, you have no idea if it was supposed to.
  • “Team” lets anyone in the domain read. That doesn’t cover “nosy manager” or “password reset” scenarios.

Auditing this is kind of a pain. The UI is slow and relevant controls are spread across multiple pages. Even smallish companies end up with dozens of groups. The only way we’ve found to make this not suck is by using the GSuite Admin SDK and that’s a liberal definition of “not suck”.

You should have a few archetypes of groups. Put the archetype in the group name, because that way the expected audience and access control are obvious to users and auditors alike. Here are some archetypes we’ve found:

  • Team mailing lists should be called xyzzy-team@${DOMAIN}. Only has team members, no external members, no self-service membership.
  • Internal-facing mailing lists, should be called xyzzy-corp@${DOMAIN}. Public self-serve access for employees, no external members, limit posting to domain members or mailing list members. These are often associated with a team, but unlike -team mailing lists anyone can join them.
  • External-facing lists. Example: contracts-inbound@${DOMAIN}. No self-serve access, no external members, but anyone can post.
  • External member lists (e.g. boards, investors): board-ext@${DOMAIN}. No self-serve access, external members allowed, and either members or anyone at the domain can post.

PS: Groups can let some users post as the group. I haven’t run a phishing exercise that way, but I’m guessing an email appearing to legitimately come from board@company.com is going to be pretty effective.

There Will Be WireGuard

Amidst the hubbub of the Efail PGP/SMIME debacle yesterday, the WireGuard project made a pretty momentous announcement: a macOS command line version of the WireGuard VPN is now available for testing, and should stabilize in the coming few months. I’m prepared to be wrong, but I think that for a lot of young tech companies, this might be the biggest thing to happen to remote access in decades.

WireGuard is a modern, streamlined VPN protocol that Jason Donenfeld developed based on Trevor Perrin’s Noise protocol framework. Imagine a VPN with the cryptographic sophistication of Signal Protocol and you’re not far off. Here are the important details:

WireGuard is orders of magnitude smaller than the IPSEC or OpenVPN stacks. On Linux, the codebase is something like 4500 lines. It’s designed to be simple and easy to audit. Simplicity and concision are goals of the whole system, from protocol to implementation. The protocol was carefully designed to make it straightforward to implement without dynamic memory allocation, eliminating whole classes of memory lifecycle vulnerabilities. The crypto underpinning WireGuard is non-negotiably DJB’s ChaPoly stack, eliminating handshake and negotiation vulnerabilities.

WireGuard is fast; faster than strongSwan or OpenVPN.

WireGuard is extremely simple to configure. In fact, it may be pretty close to the platonic ideal of configurability: you number both ends of the VPN, generate keypairs, and point the client at the server, and you’re done.

Linux people have had WireGuard for many months now (WireGuard is so good that team members here at Latacora used to run Linux VMs to get it). But the most important use case for VPNs for startups is to get developers access to cloud deployment environments, and developers use macOS, which made it hard to recommend.

Not for much longer.

It’s a little hard to overstate how big a deal this is. strongSwan and OpenVPN are two of the scariest bits of infrastructure startups operate for themselves. Nobody trusts either codebase, or, for that matter, either crypto protocol. Both are nightmares to configure and manage. As a result, fewer people set up VPNs than should; a basic building block of secure access management is hidden away.

We’re enthusiastic about WireGuard and think startups should look into adopting it as soon as is practicable. It’s simple enough to set up that you can just run it alongside your existing VPN infrastructure until you’re comfortable with it.

Death to SSH over the public Internet. Death to OpenVPN. Death to IPSEC. Long live WireGuard!

Dumb Security Questionnaires

It’s weird to say this but a significant part of the value we provide clients is filling out Dumb Security Questionnaires (hereafter DSQs, since the only thing more irritating than a questionnaire is spelling “questionnaire”).

Daniel Miessler complains about DSQs, arguing that self-assessment is an intrinsically flawed concept.

Meh. I have bigger problems with them.

First, most DSQs are terrible. We get on calls with prospective clients and tell them, “these DSQs were all first written in the early 1990s and lovingly handed down from generation to generation of midwestern IT secops staff.” Oh, how clients laugh and laugh. But we’re not joking. That’s really how those DSQs got written.

You can tell, because they ask insane questions. “Document your intrusion detection deployment.” It’s 2018. Nobody is deploying RealSecure sensors anymore. I don’t need the most recent signatures for CVE-2017-8682. My threat actors aren’t exploiting Windows font library bugs on AWS VPCs.

This seems super obvious but: we meet companies all the time that got a DSQ and then went and deployed a bunch of IDS sensors, or set up automatic web security scanners, or, God help us, tried to get a WAF up and running.

So that is a reason DSQs are bad.

Another reason is that they’re mostly performative. Here’s a timeless DSQ question: “provide detailed network maps of all your environments”. Ok, network maps can be useful. But what is the DSQ owner doing with that information? You could draw pretty much anything. Connect your VPCs in ways that make them spell bad words. Nobody is carefully following sources and sinks and looking for vulnerabilities. The point of that question is: “are you sophisticated enough to draw a network map?”

The Vendor Security Alliance is an armada of high-status tech companies — Atlassian, Square, Dropbox, for some reason GoDaddy, a bunch of other companies, and Twitter. The purpose of the VSA was to build the ultimate DSQ. A lot of the VSA companies have excellent security teams. How I know that is, the VSA’s DSQ is full of performative questions I suspect were bikeshedded in by, for example, the one engineer at Docker who one time had to write a document called “Docker Encryption Standard” and now every vendor being hazed by the VSA has to provide their own Encryption Standard.

I do not believe the VSA is qualified to evaluate Encryption Standards! Well: Square, maybe. But I don’t think Square is the reason that question is there.

This brings me to my third and final problem with DSQs, which is that they’re based on a broken premise. That premise is: “most vendors have security teams that look something like the security team at Square”.

How I know that is, the VSA wants you to explain how you: segmented your network for least-privilege, enrolled all your application secrets in a secret-sharing scheme, run a bug bounty program, have data loss prevention running on all your endpoints, actively prevent OWASP-style web vulnerabilities, have a vulnerability management program tied into patch triage, run security awareness training, have a full complement of ISO-aspirational security policy documents, have a SAML SSO system, use 2FA, have complete classification of all the data you handle, an IR policy, and, of course, “threat modeling” as part of your SDLC.

Except for threat modeling, these are all good things. But come on.

I spent 10 years as an app pentester. I’ve killed women and children. I’ve killed just about everything that walks or crawled at one time or another. And I’m here to tell you: most banks would have a hard time checking all the boxes in the VSA. And they spend 8-9 digits annually on infosec programs. (I’m just guessing about that).

Technology vendors — young SaaS companies — are not banks, nor are they Square. A lot of the VSA DSQ items would be silly for them to attempt to do seriously.

What are the 5 most likely ways a SaaS company is going to get owned up?

  1. A developer is going to leave an AWS credential somewhere an attacker can find it.
  2. An employee password is going to get credential-stuffed into an admin interface.
  3. A developer is going to forget how to parameterize an ORDER BY clause and introduce an SQLI vulnerability.
  4. A developer is going to set up a wiki or a Jenkins server on an EC2 instance with a routable IP and an open security group.
  5. I’m sure there’s a 5th way but I’m drawing a blank.

Someone — and I am not volunteering — should write the DSQ that just nails these basic things. 10 questions, no diagrams.

Cryptographic Right Answers

We’re less interested in empowering developers and a lot more pessimistic about the prospects of getting this stuff right.

There are, in the literature and in the most sophisticated modern systems, “better” answers for many of these items. If you’re building for low-footprint embedded systems, you can use STROBE and build a sound, modern, authenticated encryption stack entirely out of a single SHA-3-like sponge construction. You can use Noise to build a secure transport protocol with its own AKE. Speaking of AKEs, there are, like, 30 different password AKEs you could choose from.

But if you’re a developer and not a cryptography engineer, you shouldn’t do any of that. You should keep things simple and conventional and easy to analyze; “boring”, as the Google TLS people would say.

Cryptographic Right Answers

Encrypting Data

Percival, 2009: AES-CTR with HMAC.

Ptacek, 2015: (1) NaCl/libsodium’s default, (2) ChaCha20-Poly1305, or (3) AES-GCM.

Latacora, 2018: KMS or XSalsa20+Poly1305

You care about this if: you’re hiding information from users or the network.

If you are in a position to use KMS, Amazon’s (or Google’s) Hardware Security Module time share, use KMS. If you could use KMS but encrypting is just a fun weekend project and you might be able to save some money by minimizing your KMS usage, use KMS. If you’re just encrypting secrets like API tokens for your application at startup, use SSM Parameter Store, which is KMS. You don’t have to understand how KMS works.

Otherwise, what you want ideally is “AEAD”: authenticated encryption with additional data (the option for plaintext authenticated headers).

The mainstream way to get authenticated encryption is to use a stream cipher (usually: AES in CTR mode) composed with a polynomial MAC (a cryptographic CRC).

The problem you’ll run into with all those mainstream options is nonces: they want you to come up with a unique (usually random) number for each stream which can never be reused. It’s simplest to generate nonces from a secure random number generator, so you want a scheme that makes that easy.

Nonces are particularly important for AES-GCM, which is the most popular mode of encryption. Unfortunately, nonce handling is tricky with AES-GCM: it’s just-barely-but-maybe-not-quite on the border of safe to use random nonces.

So we recommend you use XSalsa20-Poly1305. This is a species of “ChaPoly” constructions, which, put together, are the most common encryption constructions outside of AES-GCM. Get XSalsa20-Poly1305 from libsodium or NaCl.

The advantage to XSalsa20 over ChaCha20 and Salsa20 is that XSalsa supports an extended nonce; it’s big enough that you can simply generate a big long random nonce for every stream and not worry about how many streams you’re encrypting.
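With PyNaCl this is a few lines, and the nonce problem is handled for you: encrypt() draws a fresh random 24-byte nonce per message and prepends it to the ciphertext.

    from nacl.secret import SecretBox
    from nacl.utils import random

    key = random(SecretBox.KEY_SIZE)      # 32 bytes from the OS CSPRNG
    box = SecretBox(key)                  # XSalsa20-Poly1305

    ct = box.encrypt(b"attack at dawn")   # random 192-bit nonce, prepended
    assert box.decrypt(ct) == b"attack at dawn"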

There are “NMR” or “MRAE” schemes in the pipeline that promise some degree of security even if nonces are mishandled; these include GCM-SIV (all the SIVs, really) and CAESAR-contest-finalist Deoxys-II. They’re interesting, but nobody really supports or uses them yet, and with an extended nonce, the security win is kind of marginal. They’re not boring. Stay boring for now.

Avoid: AES-CBC, AES-CTR by itself, block ciphers with 64-bit blocks — most especially Blowfish, which is inexplicably popular, OFB mode. Don’t ever use RC4, which is comically broken.

Symmetric key length

Percival, 2009: Use 256-bit keys.

Ptacek, 2015: Use 256-bit keys.

Latacora, 2018: Go ahead and use 256-bit keys.

You care about this if: you’re using cryptography.

But remember: your AES key is far less likely to be broken than your public key pair, so the latter key size should be larger if you’re going to obsess about this.

Avoid: constructions with huge keys, cipher “cascades”, key sizes under 128 bits.

Symmetric “Signatures”

Percival, 2009: Use HMAC.

Ptacek, 2015: Yep, use HMAC.

Latacora, 2018: Still HMAC.

You care about this if: you’re securing an API, encrypting session cookies, or are encrypting user data but, against medical advice, not using an AEAD construction.

If you’re authenticating but not encrypting, as with API requests, don’t do anything complicated. There is a class of crypto implementation bugs that arises from how you feed data to your MAC, so, if you’re designing a new system from scratch, Google “crypto canonicalization bugs”. Also, use a secure compare function.
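One simple way to dodge the canonicalization bug class is to length-prefix every field before MACing, so adjacent fields can’t bleed into each other. A sketch:

    import hashlib
    import hmac

    def mac_fields(key, *fields):
        # ("ab", "c") and ("a", "bc") now produce different MAC inputs.
        msg = b"".join(len(f).to_bytes(8, "big") + f for f in fields)
        return hmac.new(key, msg, hashlib.sha256).digest()

    tag = mac_fields(b"api-secret", b"GET", b"/v1/reports", b"1526400000")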

If you use HMAC, people will feel the need to point out that SHA3 (and the truncated SHA2 hashes) can do “KMAC”, which is to say you can just concatenate the key and data and hash them and be secure. This means that in theory HMAC is doing unnecessary extra work with SHA-3 or truncated SHA-2. But who cares? Think of HMAC as cheap insurance for your design, in case someone switches to non-truncated SHA-2.

Avoid: custom “keyed hash” constructions, HMAC-MD5, HMAC-SHA1, complex polynomial MACs, encrypted hashes, CRC.

Hashing algorithm

Percival, 2009: Use SHA256 (SHA-2).

Ptacek, 2015: Use SHA-2.

Latacora, 2018: Still SHA-2.

You care about this if: you always care about this.

If you can get away with it: use SHA-512/256, which truncates its output and sidesteps length extension attacks.

We still think it’s less likely that you’ll upgrade from SHA-2 to SHA-3 than it is that you’ll upgrade from SHA-2 to something faster than SHA-3, and SHA-2 still looks great, so get comfortable and cuddly with SHA-2.

Avoid: SHA-1, MD5, MD6.

Random IDs

Percival, 2009: Use 256-bit random numbers.

Ptacek, 2015: Use 256-bit random numbers.

Latacora, 2018: Use 256-bit random numbers.

You care about this if: you always care about this.

From /dev/urandom.
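In Python that’s one line; the secrets module reads from the OS CSPRNG:

    import secrets

    request_id = secrets.token_hex(32)   # 32 bytes = 256 bits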

Avoid: userspace random number generators, the OpenSSL RNG, haveged, prngd, egd, /dev/random.

Password handling

Percival, 2009: scrypt or PBKDF2.

Ptacek, 2015: In order of preference, use scrypt, bcrypt, and then if nothing else is available PBKDF2.

Latacora, 2018: In order of preference, use scrypt, argon2, bcrypt, and then if nothing else is available PBKDF2.

You care about this if: you accept passwords from users or, anywhere in your system, have human-intelligible secret keys.

But, seriously: you can throw a dart at a wall to pick one of these. Technically, argon2 and scrypt are materially better than bcrypt, which is much better than PBKDF2. In practice, it mostly matters that you use a real secure password hash, and not as much which one you use.
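For instance, with PyNaCl’s pwhash (which currently defaults to argon2id, and stores the salt and cost parameters inside the hash string itself):

    from nacl import pwhash
    from nacl.exceptions import InvalidkeyError

    stored = pwhash.str(b"correct horse battery staple")

    def check(password):
        try:
            return pwhash.verify(stored, password)   # True on success
        except InvalidkeyError:
            return False                             # wrong password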

Don’t build elaborate password-hash-agility schemes.

Avoid: SHA-3, naked SHA-2, SHA-1, MD5.

Asymmetric encryption

Percival, 2009: Use RSAES-OAEP with SHA256 and MGF1+SHA256 bzzrt pop ffssssssst exponent 65537.

Ptacek, 2015: Use NaCl/libsodium (box / crypto_box).

Latacora, 2018: Use NaCl/libsodium (box / crypto_box).

You care about this if: you need to encrypt the same kind of message to many different people, some of them strangers, and they need to be able to accept the message asynchronously, like it was store-and-forward email, and then decrypt it offline. It’s a pretty narrow use case.

Of all the cryptographic “right answers”, this is the one you’re least likely to get right on your own. Don’t freelance public key encryption, and don’t use a low-level crypto library like OpenSSL or BouncyCastle.

Here are several reasons you should stop using RSA and switch to elliptic curve:

  • RSA (and DH) drag you towards “backwards compatibility” (ie: downgrade-attack compatibility) with insecure systems.
  • RSA begs implementors to encrypt directly with its public key primitive, which is usually not what you want to do.
  • RSA has too many knobs. In modern curve systems, like Curve25519, everything is pre-set for security.

NaCl uses Curve25519 (the most popular modern curve, carefully designed to eliminate several classes of attacks against the NIST standard curves) in conjunction with a ChaPoly AEAD scheme. Your language will have bindings (or, in the case of Go, its own library implementation) to NaCl; use them. Don’t try to assemble this yourself.
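Here’s the whole crypto_box dance in PyNaCl; note the absence of knobs:

    from nacl.public import Box, PrivateKey

    alice_sk = PrivateKey.generate()
    bob_sk = PrivateKey.generate()

    # Curve25519 key agreement + XSalsa20-Poly1305; nonce generated for you.
    ct = Box(alice_sk, bob_sk.public_key).encrypt(b"meet me at the pier")
    pt = Box(bob_sk, alice_sk.public_key).decrypt(ct)
    assert pt == b"meet me at the pier"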

Don’t use RSA.

Avoid: Systems designed after 2015 that use RSA, RSA-PKCS1v15, RSA, ElGamal, I don’t know, Merkle-Hellman knapsacks? Just avoid RSA.

Asymmetric signatures

Percival, 2009: Use RSASSA-PSS with SHA256 then MGF1+SHA256 in tricolor systemic silicate orientation.

Ptacek, 2015: Use NaCl, Ed25519, or RFC6979.

Latacora, 2018: Use NaCl or Ed25519.

You care about this if: you’re designing a new cryptocurrency. Or, a system to sign Ruby Gems or Vagrant images, or a DRM scheme, where the authenticity of a series of files arriving at random times needs to be checked offline against the same secret key. Or, you’re designing an encrypted message transport.

The allegations from the previous answer are incorporated herein as if stated in full.

The two dominating use cases within the last 10 years for asymmetric signatures are cryptocurrencies and forward-secret key agreement, as with ECDHE-TLS. The dominating algorithms for these use cases are all elliptic-curve based. Be wary of new systems that use RSA signatures.

In the last few years there has been a major shift away from conventional DSA signatures and towards misuse-resistant “deterministic” signature schemes, of which EdDSA and RFC6979 are the best examples. You can think of these schemes as “user-proofed” responses to the PlayStation 3 ECDSA flaw, in which reuse of a random number leaked secret keys. Use deterministic signatures in preference to any other signature scheme.

Ed25519, the NaCl/libsodium default, is by far the most popular public key signature scheme outside of Bitcoin. It’s misuse-resistant and carefully designed in other ways as well. You shouldn’t freelance this either; get it from NaCl.
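The PyNaCl version, for flavor:

    from nacl.exceptions import BadSignatureError
    from nacl.signing import SigningKey

    sk = SigningKey.generate()                  # Ed25519 keypair
    signed = sk.sign(b"release-1.2.3.tar.gz")   # signature + message together

    try:
        sk.verify_key.verify(signed)            # returns the message if valid
    except BadSignatureError:
        pass                                    # tampered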

Avoid: RSA-PKCS1v15, RSA, ECDSA, DSA; really, especially avoid conventional DSA and ECDSA.

Diffie-Hellman

Percival, 2009: Operate over the 2048-bit Group #14 with a generator of 2.

Ptacek, 2015: Probably still DH-2048, or NaCl.

Latacora, 2018: Probably nothing. Or use Curve25519.

You care about this if: you’re designing an encrypted transport or messaging system that will be used someday by a stranger, and so static AES keys won’t work.

The 2015 version of this document confused the hell out of everyone.

Part of the problem is that our “Right Answers” are a response to Colin Percival’s “Right Answers”, and his included a “Diffie-Hellman” answer, as if “Diffie-Hellmanning” was a thing developers routinely do. In reality, developers simply shouldn’t freelance their own encrypted transports. To get a sense of the complexity of this issue, read the documentation for the Noise Protocol Framework. If you’re doing a key-exchange with DH, you probably want an authenticated key exchange (AKE) that resists key compromise impersonation (KCI), and so the primitive you use for DH is not the only important security concern.

But whatever.

It remains the case: if you can just use NaCl, use NaCl. You don’t even have to care what NaCl does. That’s the point of NaCl.

Otherwise: use Curve25519. There are libraries for virtually every language. In 2015, we were worried about encouraging people to write their own Curve25519 libraries, with visions of Javascript bignum implementations dancing in our heads. But really, part of the point of Curve25519 is that the entire curve was carefully chosen to minimize implementation errors. Don’t write your own! But really, just use Curve25519.
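If you’re in Python, the cryptography package exposes X25519 directly. Both sides land on the same 32-byte secret, which you’d feed through a KDF (say, HKDF) before using it as a key:

    from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

    alice = X25519PrivateKey.generate()
    bob = X25519PrivateKey.generate()

    shared_ab = alice.exchange(bob.public_key())
    shared_ba = bob.exchange(alice.public_key())
    assert shared_ab == shared_ba   # same secret on both ends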

Don’t do ECDH with the NIST curves, where you’ll have to carefully verify elliptic curve points before computing with them to avoid leaking secrets. That attack is very simple to implement, easier than a CBC padding oracle, and far more devastating.

The 2015 document included a clause about using DH-1024 in preference to sketchy curve libraries. You know what? That’s still a valid point. Valid and stupid. The way to solve the “DH-1024 vs. sketchy curve library” problem is the same as the way to solve the “should I use Blowfish or IDEA?” problem. Don’t have that problem. Use Curve25519.

Avoid: conventional DH, SRP, J-PAKE, handshakes and negotiation, elaborate key negotiation schemes that only use block ciphers, srand(time()).

Website security

Percival, 2009: Use OpenSSL.

Ptacek, 2015: Remains: OpenSSL, or BoringSSL if you can. Or just use AWS ELBs.

Latacora, 2018: Use AWS ALB/ELB or OpenSSL, with LetsEncrypt.

You care about this if: you have a website.

If you can pay AWS not to care about this problem, we recommend you do that.

Otherwise, there was a dark period between 2010 and 2016 where OpenSSL might not have been the right answer, but that time has passed. OpenSSL has gotten better, and, more importantly, OpenSSL is on-the-ball with vulnerability disclosure and response.

Using anything besides OpenSSL will drastically complicate your system for little, no, or even negative security benefit. So just keep it simple.

Speaking of simple: LetsEncrypt is free and automated. Set up a cron job to re-fetch certificates regularly, and test it.

Avoid: offbeat TLS libraries like PolarSSL, GnuTLS, and MatrixSSL.

Client-server application security

Percival, 2009: Distribute the server’s public RSA key with the client code, and do not use SSL.

Ptacek, 2015: Use OpenSSL, or BoringSSL if you can. Or just use AWS ELBs.

Latacora, 2018: Use AWS ALB/ELB or OpenSSL, with LetsEncrypt.

You care about this if: the previous recommendations about public-key crypto were relevant to you.

It seems a little crazy to recommend TLS given its recent history:

  • The Logjam DH negotiation attack
  • The FREAK export cipher attack
  • The POODLE CBC oracle attack
  • The RC4 fiasco
  • The CRIME compression attack
  • The Lucky13 CBC padding oracle timing attack
  • The BEAST CBC chained IV attack
  • Heartbleed
  • Renegotiation
  • Triple Handshakes
  • Compromised CAs
  • DROWN (though personally we’re warped and an opportunity to play with attacks like DROWN would be in our “pro” column)

Here’s why you should still use TLS for your custom transport problem:

  • In custom protocols, you don’t have to (and shouldn’t) depend on 3rd party CAs. You don’t even have to use CAs at all (though it’s not hard to set up your own); you can just use a whitelist of self-signed certificates — which is approximately what SSH does by default, and what you’d come up with on your own.
  • Since you’re doing a custom protocol, you can use the best possible TLS cipher suites: TLS 1.2+, Curve25519, and ChaPoly. That eliminates most attacks on TLS. The reason everyone doesn’t do this is that they need backwards-compatibility, but in custom protocols you don’t need that.
  • Many of these attacks only work against browsers, because they rely on the victim accepting and executing attacker-controlled Javascript in order to generate repeated known/chosen plaintexts.

Avoid: designing your own encrypted transport, which is a genuinely hard engineering problem; using TLS but in a default configuration, like, with “curl”; using “curl”, IPSEC.

Online backups

Percival, 2009: Use Tarsnap.

Ptacek, 2015: Use Tarsnap.

Latacora, 2018: Store PMAC-SIV-encrypted arc files to S3 and save fingerprints of your backups to an ERC20-compatible blockchain.

You care about this if: you bother backing things up.

Just kidding. You should still use Tarsnap.
