Scaling Instagram

  • The talk was broadly themed around the idea of incremental, just-enough improvements at the infra level, and realizing when to make the next one.
  • Instagram scaled to 30m MAUs from 2010-12, and to 300m MAUs by 2015 (when this talk was given).
    • 2012 Stack
    • 2015 Stack
  • Internal policy is to have all requests return within 3 seconds.
    • This seems almost too liberal; is this possibly still true?
  • Initially used Gearman as a task queue, which served them well for 18 months.
    • Once they outgrew this, they used a home-grown sharding scheme which took them a little further.
    • They eventually switched to Celery + RabbitMQ.
    • Gearman’s enqueue time was 60ms at p50 and 1 second at p95.
      • With RabbitMQ this dropped to 5ms at p50 and tens of ms at p95.
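The p50/p95 figures above are percentiles rather than averages, which matters for a queue with a slow tail. A minimal sketch of how such numbers can be computed from raw enqueue timings, using only the standard library. The sample values and the function name `latency_percentiles` are made up for illustration and are not from the talk:

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) for a list of enqueue latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Hypothetical samples: mostly fast enqueues with a slow tail.
samples = [5.0] * 90 + [40.0] * 10
p50, p95 = latency_percentiles(samples)
mean = statistics.mean(samples)
```

With these made-up samples the mean sits between the two percentiles and describes neither the typical enqueue nor the tail, which is presumably why the talk quoted p50 and p95 separately.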
  • The company culture favored getting existing infra to work over switching to new infra unless absolutely necessary.
    • I can see how this would be a double-edged sword, but this seems like sound advice for a small engineering team.
  • Deployments started with Fabric and git pull.
    • Eventually moved to uploading an artifact to S3, and having Fabric pull this artifact down to each machine and perform the switch via ln.
    • Once this grew cumbersome they built what seems to be a manual lock service, so engineers could ensure they weren’t stepping on each others’ toes.
    • This eventually led to a real CD pipeline.
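The "switch via ln" step above can be sketched as a symlink swap. This is an assumption about what the switch looked like: the notes only say ln was used, and the directory layout, the `current` link name, and the `switch_release` helper are hypothetical. The sketch creates the new link under a temporary name and renames it over the old one, so the link never points at a half-deployed release:

```python
import os
import tempfile

def switch_release(current_link, release_dir):
    """Point `current_link` at `release_dir` by swapping a symlink."""
    tmp_link = current_link + ".tmp"
    # Clean up a leftover temp link from an interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link replaces it in one step on POSIX systems.
    os.replace(tmp_link, current_link)

# Usage with throwaway directories standing in for unpacked artifacts.
root = tempfile.mkdtemp()
for name in ("release-1", "release-2"):
    os.mkdir(os.path.join(root, name))
current = os.path.join(root, "current")
switch_release(current, os.path.join(root, "release-1"))
switch_release(current, os.path.join(root, "release-2"))
```

Rolling back under this scheme is the same call with the previous release directory.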
  • “Don’t automate something until you understand it really well by doing it manually a lot”
  • Their search system started with Postgres LIKE queries, which is actually not that bad for prefix searches, but pretty much O(n) for anything else.
    • This gave way to a Solr box that allowed more expressive searches. Lack of clustering led them to outgrow this piece of infra.
    • Next up was Elasticsearch, which worked well, but was too expensive in terms of ops load for a team of their size.
      • They were confident that they could’ve made this work with a dedicated search team.
    • Eventually moved to Facebook’s internal graph database (Unicorn), which allowed for very expressive s-expression-based searches.
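To illustrate why prefix searches are cheaper than other LIKE patterns, here is a toy sketch in which a sorted list stands in for an ordered index. It is not how Postgres is implemented, and whether Postgres actually uses an index for a given LIKE depends on its configuration. The usernames and function names are invented for the example:

```python
import bisect

def prefix_search(sorted_names, prefix):
    """Binary-search the sorted list for the contiguous run matching `prefix`."""
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = lo
    # All matches sit next to each other, so stop at the first non-match.
    while hi < len(sorted_names) and sorted_names[hi].startswith(prefix):
        hi += 1
    return sorted_names[lo:hi]

def substring_search(names, needle):
    """Ordering does not help here, so every entry has to be inspected."""
    return [name for name in names if needle in name]

usernames = sorted(["alice", "alina", "bob", "carol", "malik"])
prefix_hits = prefix_search(usernames, "ali")
substring_hits = substring_search(usernames, "ali")
```

The prefix lookup touches only the matching run plus one neighbor, while the substring lookup walks the whole list, which is the O(n) behavior noted above.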
  • The Explore page was entirely driven by the expressiveness of the searches they could perform.
    • Started off with the top-liked images globally.
    • Moved to things liked by people you follow.
      • Realized this didn’t work because you don’t necessarily share the same taste as everyone you follow.
    • Moved to things liked by the people whose posts you liked, which ended up working really well.
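The last Explore heuristic can be sketched as a two-hop walk over likes: find the authors of posts you liked, then rank what those authors liked. This is a toy in-memory version under assumed data shapes (a `likes` map from user to liked post ids and an `authors` map from post id to author), not the query Instagram actually ran:

```python
from collections import Counter

def explore_candidates(user, likes, authors):
    """Rank posts liked by the authors of posts that `user` has liked."""
    seen = likes.get(user, set())
    # Authors whose posts the user liked, excluding the user.
    liked_authors = {authors[post] for post in seen} - {user}
    scores = Counter()
    for author in liked_authors:
        for post in likes.get(author, set()):
            # Skip posts the user already liked or wrote.
            if post not in seen and authors[post] != user:
                scores[post] += 1
    # Highest score first; ties broken by post id for a stable order.
    return [p for p, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical data: ann liked posts by bea and cal.
authors = {"p1": "bea", "p2": "cal", "p3": "dee", "p4": "eli"}
likes = {
    "ann": {"p1", "p2"},
    "bea": {"p3", "p4"},
    "cal": {"p3"},
}
ranked = explore_candidates("ann", likes, authors)
```

Here `p3` ranks first because both authors ann liked have liked it, while `p4` has a single vote.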

Follow-Ups
