Build Log: Google Photos Downloader in Rust

Introduction

  • All my photos are in Google Photos
    • Used to go with Dropbox + Google Photos, but this was impractical.
    • Want to back things up locally.
  • Takeout is impractical and very flaky
    • The export to Google Drive works ok but takes days
    • And I’m not convinced that the incremental nature of this export works as expected.
  • I don’t want to stop using Google Photos, and it has an API, so I can just download my media periodically instead.
  • No downloaders exist (that I could find).
  • I’ll build one instead.
  • I recently read the Rust book and wanted to try using it for a real project that isn’t too trivial, so this seems like a good fit.

Design

Google Photos API

  • The API allows downloading pages of metadata, containing information about my media.
    • Each page is capped at 100 media items, but can contain fewer (not sure why; distribution?)
    • Media information within a page contains things like a (forever) ID and a download URL that’s valid for 60 minutes.
  • There’s also a search API that allows you to query for all media created after a certain date
    • This can be ambiguous; EXIF date, upload date, etc.?
    • Primary goal is completeness of the backed up data, so I’m not considering this seriously.
  • Continuation tokens, not page numbers, so metadata fetches can’t be parallelized.
  • Download requests are capped at 75k/day, all other API calls are capped at 10k/day.
    • Quotas reset at midnight Pacific Time.
  • Design
    • Design 1
      • Download all pages from beginning to end, saving the ID for each media item in a local database.
      • Iterate over items in the local database, request a download URL for each one, and download it.
      • Not ideal because of rate limits; extra overhead of a local database.
    • Design 2
      • Download all pages from beginning to end
      • Each page comes with temporary download URLs for the media it represents
      • Download all media before moving to the next page
      • Use the filesystem itself to decide whether or not we’ve already downloaded a file (use the media’s ID parameter as the filename)
  • The “correct” thing to do is to dynamically detect the MIME type of downloaded files (can this be calculated while streaming?) and use that to set the extension
    • This sounds difficult, and I think most apps will “just cope” if I use .jpg and .mp4 as hardcoded extensions.
    • This seems to be the case so far, but this is something I want to fix.
  • Best way to detect quota overage? Attempt to track locally or just react to 429s?
    • Started with the former, but switched to the latter because it’s easier and more correct
    • No way to get the “live quota” at runtime via the API
  • Magnitude:
    • Google Drive says I have 94 gigs of photos
    • Google dashboard says I have 198k photos
  • Quirks
    • Download photos by appending =d, videos by appending =dv
    • “For Android motion photos, only a video file is returned. For iOS Live Photos, a ZIP file containing both the photo and video files is returned”
      • I played around with this a bit, and this appears to be untrue (for iOS)
      • iOS live photos show up as regular photos (no ZIP); they just respond to both =d and =dv
      • Worse, there’s no way to figure out if a given photo is a live photo or not based on the metadata; the only way is to try downloading with =dv
        • 200? Live photo. 404? Regular photo.
      • This has implications for “Design 2”; we can’t use just the filesystem to prevent redownloads anymore. Fix this by:
        • Every photo is potentially a live photo until proven otherwise
        • Try downloading every photo with both =d and =dv
        • If the latter returns a 404, write the ID to a non-motion-photos file (why not SQLite?)
        • At worst, this halves the effective download quota (75k → ~37.5k photos per day)
        • When deciding whether or not to download a photo:
          • If the photo exists in non-motion-photos, check the filesystem for a photo with the id
          • If the photo doesn’t exist in non-motion-photos, check the fs (independently) for both a photo and a video
          • IMAGE/DIAGRAM WILL HELP HERE
        • Load the file into a HashSet on boot.
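
The download decision described in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Plan`, `plan_downloads`, and the `exists` callback are hypothetical names, and the `.jpg`/`.mp4` filenames simply follow the hardcoded extensions mentioned earlier.

```rust
use std::collections::HashSet;

/// What still needs to be fetched for one photo.
#[derive(Debug, PartialEq)]
struct Plan {
    fetch_photo: bool, // would request `<download URL>=d`
    fetch_video: bool, // would request `<download URL>=dv`
}

/// `non_motion` holds IDs whose `=dv` request already returned a 404.
/// `exists` reports whether a file with the given name is already on disk
/// (hypothetical callback so the logic can be exercised without real files).
fn plan_downloads(
    id: &str,
    non_motion: &HashSet<String>,
    exists: impl Fn(&str) -> bool,
) -> Plan {
    let photo_name = format!("{}.jpg", id);
    let video_name = format!("{}.mp4", id);

    let fetch_photo = !exists(photo_name.as_str());
    // Known non-motion photos never need the video request; everything else
    // is treated as a potential live photo until a 404 proves otherwise.
    let fetch_video = !non_motion.contains(id) && !exists(video_name.as_str());

    Plan { fetch_photo, fetch_video }
}
```

Passing the filesystem check in as a closure keeps the decision logic separate from IO; in a real run it would presumably wrap something like `Path::exists`.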

Rust

  • Rust has async/await now; should I use this?
    • At least for an initial pass, async IO seems like the right way to go for this project.
    • It’s heavily IO-bound, so I suspect that manual thread management is going to be more of a fine optimization for the long tail than a big win.
  • I’ve heard of tokio being an event loop for Rust; is this a competitor to async/await or complementary?
    • The latter. The Rust stdlib contains the Future trait, and the language provides the async and await keywords.
      • In addition, an async fn returns a future, and an await sets up a state machine as expected.
    • However, Rust does not come with an async runtime.
      • In practice, this means that there’s no way to wait on the top-level future in main (you can only await inside an async fn)
      • The simplest way to get around this is to use the futures crate, which contains methods to synchronously wait on a future
        • It also contains executors that can execute futures in simple thread pools.
      • Tokio (and possibly async-std) is a more involved runtime that uses work-stealing thread pools (and provides a reimplementation of parts of the stdlib that works asynchronously)
        • Tokio also contains richer thread interop primitives (Rust has mpsc, but Tokio allows for unicast, multicast, fanout, etc.)
  • This is a bit overwhelming; what do I pick? Tokio, for two reasons:
    • Rust doesn’t natively allow reading/writing a file asynchronously, but Tokio does
    • reqwest/hyper is the most popular/mature HTTP client for Rust, and uses Tokio internally.
    • A bit confusingly, since the core Future trait is in standard Rust, there’s some level of interoperability possible here, using functions from the futures crate to operate on futures returned by tokio FS functions, for example.
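
To make the "no built-in runtime" point concrete, here is a minimal single-threaded executor built only from the standard library. It is a sketch for illustration (the same shape as the example in the `std::task::Wake` documentation), not what Tokio or the futures crate actually do internally; `block_on`, `page_count`, and `total_items` are hypothetical names.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drive a single future to completion on the current thread.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the future's waker unparks this thread.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for an async step such as fetching a page count.
async fn page_count() -> u32 {
    3
}

/// `.await` is only allowed inside an async fn or async block.
async fn total_items() -> u32 {
    page_count().await * 100
}
```

A synchronous `main` could then call `block_on(total_items())`. A real runtime adds the parts this sketch lacks: async IO, timers, and scheduling many tasks across threads.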

Other Dependencies

  • A Google API client exists for Rust, but I’d rather work at the HTTP level directly for two reasons:
    • I learn more about how Rust deals with network IO
    • More granular control
    • Ditto for OAuth

Implementation

  • Overall the project took about a weekend’s worth of free time to write.
  • Much less fighting with the borrow checker than I expected (none, in fact); reading the book first helped a lot here.
  • I could deal with corruption in-band (download to a temporary file and then atomically move it to its final location) or out-of-band (run an external program/script to detect corrupted media)
    • I’m going with the latter for simplicity for now, with the intention of moving over to the former.
  • Observability/metrics is hugely important, even in hobby projects like this.
  • Crate docs are very easily available + consistent + easy to switch to the right version.
  • Rust gotchas:
    • Rust analyzer is great when it works, but is very flaky!
    • Crates can have optional features that are enabled on-demand; this wasn’t super obvious to me and I spent a while wondering why tokio wouldn’t start until I read the docs more closely.
    • Some error messages are still a bit hard to parse; for example, the solution here is to end the main function with Ok(()) (or another Result value) rather than falling through with ().
      error[E0308]: mismatched types
        --> src/main.rs:29:1
         |
      29 | #[tokio::main]
         | ^^^^^^^^^^^^^^
         | |
         | expected enum `std::result::Result`, found `()`
         | help: try using a variant of the expected enum: `Ok(#[tokio::main])`
      30 | async fn main() -> Result<(), Box<dyn std::error::Error>> {
         |                    -------------------------------------- expected `std::result::Result<(), std::boxed::Box<(dyn std::error::Error + 'static)>>` because of return type
         |
         = note:   expected enum `std::result::Result<(), std::boxed::Box<(dyn std::error::Error + 'static)>>`
                 found unit type `()`
         = note: this error originates in an attribute macro (in Nightly builds, run with -Z macro-backtrace for more info)
      
      • Interestingly, this error makes a lot of intuitive sense to me now, after a few days of writing Rust, but this was incomprehensible when I’d just started.
    • Another example of a difficult error message; this one still doesn’t make any sense to me:
      error[E0277]: the trait bound `std::option::NoneError: std::error::Error` is not satisfied
        --> src/main.rs:43:51
         |
      43 |     let client_id = env.get("FERROTYPE_CLIENT_ID")?;
         |                                                   ^ the trait `std::error::Error` is not implemented for `std::option::NoneError`
         |
         = note: required because of the requirements on the impl of `std::convert::From<std::option::NoneError>` for `std::boxed::Box<dyn std::error::Error>`
         = note: required by `std::convert::From::from`
      
      error: aborting due to previous error
      
    • JSON deserialization wasn’t working for a while; this turned out to be an optional feature in reqwest that I hadn’t enabled.
      • On the whole deserializing incoming data directly into structs is really nice.
      • CODE_EXAMPLE
    • No date/time library built-in; Instant is useful for monotonic time comparisons but not much else.
      • chrono looks solid, but it’s a bit surprising that this isn’t built-in.
    • serde (the serialization framework that reqwest uses for JSON) has a number of directives to guide deserialization. Useful, but this indirection was not obvious.
      • Ditto for figuring out that you had to manually pull in serde with the derive feature enabled to apply this to a struct
    • async functions can’t be called recursively without some indirection
  • How do you await multiple futures with Tokio? join!
  • How do you await a vector of futures with Tokio?
    • join_all existed in Tokio 0.1 but was removed in 0.2
    • This one took me a while, but I eventually realised that you can use utility functions from the futures crate because async functions return a std::future::Future regardless of the runtime in use.
  • async closures exist but are “unstable” so I can’t use them on stable Rust.
    • async blocks appear to be fine, though
  • A reqwest Response can give you a byte array of the response data as a Bytes instance
    • It was not obvious to me that this could effectively be used as a &[u8] (Bytes dereferences to [u8]).
  • Deployment
    • Local Windows machine + WSL
    • Photo storage is on NTFS; not ideal, but changing this is non-trivial
    • Attempted to use Docker but this app needs terminal input to capture the initial OAuth token
    • Just using tmux instead; works ok for right now
  • Meant to be run in a cronjob-style setting, 1x-2x per day; your local dir will eventually converge with your data on Google Photos
    • Exception: if you have more than 10k pages’ worth of media (~1M media items), you will never progress past the 10000th page.
    • The only way to fix this is to have the app respond to rate limit violations by pausing until midnight PT without exiting and continuing where it left off.
  • I’ve only ever used OAuth libraries so far and never rolled my own, and it was never obvious to me that refresh tokens (at least the ones Google provides) basically live forever.
    • This marginally solves the Docker deployment issue.
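
On the `NoneError` message in the gotchas above: as far as I can tell, the `?` there is applied to an `Option` inside a function that returns a `Result`, and there is no conversion from "nothing" into a `Box<dyn Error>`. One way around it is to turn the `Option` into a `Result` first with `ok_or`. The sketch below assumes `env` is a `HashMap<String, String>`, which is a guess; `client_id` is a hypothetical helper name.

```rust
use std::collections::HashMap;
use std::error::Error;

/// Look up a required setting, turning a missing key into a real error
/// so that `?` can convert it into `Box<dyn Error>`.
fn client_id(env: &HashMap<String, String>) -> Result<String, Box<dyn Error>> {
    let value = env
        .get("FERROTYPE_CLIENT_ID")
        // Option<&String> -> Result<&String, &str>; `&str` converts into Box<dyn Error>.
        .ok_or("FERROTYPE_CLIENT_ID is not set")?;
    Ok(value.clone())
}
```

The error message string is now chosen at the call site, which also makes the failure easier to read than a panic on `unwrap`.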

Execution

  • First run: downloaded 13k files and failed with a deserialization error
    • Not used to deserializing directly to structs, and used a mis-named field
  • Observability
    • I started by emitting a log line when every metadata page was done, but this wasn’t enough
      • Especially when there are so many non-ideal conditions that don’t result in a panic/exit
      • The whole motion-photo messiness is another dimension here I want to track.
    • What I really need is metrics; what’s the best way to do this in Rust?
      • Passing an object/struct around sounds unappealing; I want to be able to tick a metric anywhere without worrying about architectural concerns
      • Does Rust support anything that looks like a singleton?
        • Not natively, but the lazy_static crate is just what I needed
        • CODE_SNIPPET
    • I set up metrics to track the number of cache hits on the fs (photos already downloaded) and the non-motion-photos HashSet (we know a media item is definitely not a motion-photo, so no need to waste a request to check if it is)
      • And to track the number of downloads and retries (and exhausted retries)
      • Having this information was invaluable, and is significantly more valuable than both tests and logs for a project of this sort.
  • This was working well so far, but I wasn’t saturating my network connection consistently IMAGE
    • A good way to fix this is to decouple the act of downloading a metadata page from the act of downloading the media it represents.
    • This almost certainly requires thread management + a queuing system, using backpressure to keep the two “processes” loosely coupled.
    • This is a lot more complex than the current architecture though, and I’m not convinced that this tradeoff makes sense.
    • As a more incremental improvement instead:
      • Download the next metadata page in parallel with the media items contained in the current metadata page
      • This minimizes the troughs in the network utilization with a relatively minor change.
      • IMAGE
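
For the "tick a metric anywhere" idea, here is a small sketch of a process-wide metrics object. Note that it swaps in plain `static` atomics from the standard library rather than the lazy_static crate mentioned above, so it is an alternative technique and not the project's code; the `Metrics` struct, its field names, and `tick` are all hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counters; the field names are illustrative only.
struct Metrics {
    fs_cache_hits: AtomicU64,
    downloads: AtomicU64,
    retries: AtomicU64,
}

/// A `static` works here because `AtomicU64::new` is a const fn,
/// so no lazy initialization is needed.
static METRICS: Metrics = Metrics {
    fs_cache_hits: AtomicU64::new(0),
    downloads: AtomicU64::new(0),
    retries: AtomicU64::new(0),
};

impl Metrics {
    /// Render all counters as one log-friendly line.
    fn summary(&self) -> String {
        format!(
            "fs_cache_hits={} downloads={} retries={}",
            self.fs_cache_hits.load(Ordering::Relaxed),
            self.downloads.load(Ordering::Relaxed),
            self.retries.load(Ordering::Relaxed),
        )
    }
}

/// Increment a counter from anywhere, without passing a struct around.
fn tick(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::Relaxed);
}
```

Any function can call `tick(&METRICS.downloads)`, and a periodic log line can print `METRICS.summary()`. Atomics only cover simple counters; anything richer (maps, histograms) would need a lock or a lazily initialized value.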

Conclusion

  • The app itself has worked great so far.
    • All my media was downloaded ~10 days after the first commit.
    • It turns out that I have 168k media items (counting motion photos twice) taking up ~208GB on disk, spanning ~ten years.
    • Many improvements are needed, but I could use this as is for months to come without issues.
    • Haven’t set this up to run periodically yet.
  • So far I really enjoy writing Rust; I haven’t really enjoyed writing in a language this much since I started using Clojure
    • New languages I’ve learned after Clojure: Scala, Objective-C/Swift, TypeScript, Go.
  • This isn’t news to anyone, but there’s a steep learning curve.
    • Reading the book (~700 pages) is almost the bare minimum required to get started with the language, which is not an easy ask.
  • Definitely going to be using it again, and I (more than ever) can’t wait for Advent of Code this year!
  • Tree data structures are super annoying though. :|

Original Notes

GP Design

  • Download the entire metadata blob on startup instead of relying on the search API to return just the “new” items
  • This ensures that we aren’t missing any intermediate data, at the cost of more work
  • Unfortunately, pagination uses continuations, so this can’t be parallelized :(
  • API quotas are 10k requests per day for metadata requests, and 75k requests per day for media downloads.
  • This creates a very natural transition point at which caching metadata between runs starts making sense (1M media items).
  • baseUrls are included with metadata requests, and stay valid for 60 minutes; interesting!
    • so the design of “download all metadata, then download all images” isn’t optimal
    • it should be more like “download a page of metadata, enqueue downloads for any images in it, then download the next page”
  • “Fewer media items might be returned than the specified number” 😡
  • “Daily quotas reset at midnight Pacific Time (PT).”
  • For Android motion photos, only a video file is returned. For iOS Live Photos, a ZIP file containing both the photo and video files is returned.
    • I can’t unzip these files for idempotency, so whatever thing I write to consume this has to peek inside the ZIP file.
    • I looked at the actual response, and motion photos show up just like regular photos :/
  • filename vs mime type
  • sqlite or file for a list of motion photos?
  • dynamic quota: not possible apparently; why not just wait until the first 429 and panic?
  • According to https://myaccount.google.com/dashboard, I have 198201 photos, but the older ones are on the free tier.
    • Assuming 2MB per photo full-quality and 200KB per photo on the free tier,
    • The entire library is between ~40GB and ~400GB.
    • Can’t get more specific until I download everything, but just going by the count, I’ve downloaded ~23% of all our photos.
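
The "download a page of metadata, enqueue downloads, then fetch the next page" idea can be sketched with a bounded channel from the standard library, which provides backpressure between the two stages. This is an illustration using plain threads rather than async tasks; `fetch_page` and `run_pipeline` are hypothetical names, and no network calls are made.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for one metadata page: the media IDs it lists.
fn fetch_page(page: u32) -> Vec<String> {
    (0..3).map(|i| format!("page{}-item{}", page, i)).collect()
}

/// Fetch `pages` metadata pages on one thread while the calling thread
/// "downloads" their items; returns the IDs in download order.
fn run_pipeline(pages: u32) -> Vec<String> {
    // Capacity 1: at most one fetched page waits in the buffer (backpressure).
    let (tx, rx) = sync_channel::<Vec<String>>(1);

    let producer = thread::spawn(move || {
        for page in 0..pages {
            // `send` blocks while the buffer is full.
            if tx.send(fetch_page(page)).is_err() {
                break; // the consumer went away
            }
        }
        // `tx` is dropped here, which closes the channel and ends the loop below.
    });

    let mut downloaded = Vec::new();
    for ids in rx {
        // A real implementation would download each item here.
        downloaded.extend(ids);
    }
    producer.join().expect("producer thread panicked");
    downloaded
}
```

Keeping the buffer small matters because the download URLs attached to a page expire after 60 minutes, so fetching metadata far ahead of the downloads would waste requests.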

Initial

Do I need a runtime or not?

rust has async/await now: https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html

Does async/await imply that things like tokio are defunct?

No. Rust now includes async functions that return futures, and you can await invocations of async functions; this suspends the current control flow without blocking the current thread.

But, there isn’t anything built-in that can execute futures. The futures crate provides a few basic executors: https://docs.rs/futures/0.3.5/futures/executor/index.html And things like tokio/async-std provide more involved executors.

So do I need a runtime?

For a purely CPU-bound workload, it looks like the futures-crate executors are sufficient. These executors don’t contain primitives for async IO, which I’m going to need, so I’m going to have to pick tokio or async-std. Other obvious differences with Tokio vs. futures:

  • The multithreaded executor uses a work-stealing approach (vs. a more vanilla fixed-size thread pool)
  • More flexible channels vs. just mpsc.
  • ???

Tokio or async-std? hyper (and reqwest) seems like the most mature HTTP library for Rust, and it uses tokio by default so tokio seems like the obvious choice; I don’t see a compelling reason that async-std is better, and I don’t want to waste time getting hyper working with it.

Do I need anything else to get started?

Probably not. Rough high-level design:

tokio + multithreaded + async IO

Dependencies

No real options for the google photos API; there’s an autogenerated API client, but I think it’s better to start manually because that’ll give me more control. There is a library for OAuth, but I just think I’ll learn a lot more if I implement this myself. I can fetch 100 metadata items at a time, so I have to make ~1000 requests per 100k media items.

Ongoing

  • The fact that crate docs are so easily available is great!
  • Rust analyzer is flaky, but is great when it works.
  • dead code warnings are very annoying when working on something
  • Tokio needs this incantation to be fully initialized:
    tokio = { version = "0.2.21", features = ["full"] }
    
  • Error messages are difficult to parse:
    error[E0308]: mismatched types
      --> src/main.rs:29:1
       |
    29 | #[tokio::main]
       | ^^^^^^^^^^^^^^
       | |
       | expected enum `std::result::Result`, found `()`
       | help: try using a variant of the expected enum: `Ok(#[tokio::main])`
    30 | async fn main() -> Result<(), Box<dyn std::error::Error>> {
       |                    -------------------------------------- expected `std::result::Result<(), std::boxed::Box<(dyn std::error::Error + 'static)>>` because of return type
       |
       = note:   expected enum `std::result::Result<(), std::boxed::Box<(dyn std::error::Error + 'static)>>`
               found unit type `()`
       = note: this error originates in an attribute macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    • Resolution here was to return Ok(()) in main
  • Typing use declarations by hand is annoying
  • This doesn’t work because the match arms are incompatible, even though the second arm causes an exit:
    match (client_id, secret) {
      (Some(client_id), Some(secret)) => authorize(client_id, secret).await.unwrap(),
      (_, _) => fail("Didn't get a FERROTYPE_CLIENT_ID and a FERROTYPE_SECRET")
    };
    
    • Fixing by using ? instead, but I’m not really sure how to resolve this if I wanted to do it this way.
  • Which also didn’t work:
    error[E0277]: the trait bound `std::option::NoneError: std::error::Error` is not satisfied
      --> src/main.rs:43:51
       |
    43 |     let client_id = env.get("FERROTYPE_CLIENT_ID")?;
       |                                                   ^ the trait `std::error::Error` is not implemented for `std::option::NoneError`
       |
       = note: required because of the requirements on the impl of `std::convert::From<std::option::NoneError>` for `std::boxed::Box<dyn std::error::Error>`
       = note: required by `std::convert::From::from`
    
    error: aborting due to previous error
    
    • 🤔
  • Going to just revert to panic for now
  • How do I accept CLI user input? There are crates for this (https://crates.io/crates/text_io and https://crates.io/crates/read_input), but I think I want to reimplement this stuff.
  • JSON deserialization wasn’t working for a while, until I saw “This requires the optional json feature enabled.” in the docs; this sort of thing needs to be better publicized.
  • Not really sure how to deserialize straight to a struct, though; the example doesn’t work directly. Big surprise, I hadn’t enabled the “derive” feature for serde. :|
  • Rust doesn’t appear to have a date-time library built in (really?), but std::time::Instant seems good enough for the moment.
  • serde(flatten)
  • How to enqueue a number of downloads and await them all at once? (join_all, looks like, but this isn’t on tokio 0.2?)
  • async move
  • https://github.com/tokio-rs/tokio/issues/1660
  • https://stackoverflow.com/questions/59156473/what-is-the-difference-between-async-move-and-async-move
  • async move is supported but async closures are not?
  • Can a Bytes instance be used instead of an &[u8]?
  • chrono for time
  • WSL: libssl-dev + pkg-config
  • docker deployment: input?
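
On the incompatible match arms noted above: if `fail` was declared as returning `()` (an assumption; its definition is not shown in these notes), the compiler cannot know that the second arm never produces a value. Declaring it with the never type `!` is one way to make the original `match` type-check. A sketch, with `credentials` as a hypothetical wrapper:

```rust
use std::process;

/// Print a message and exit; the `!` return type marks this as never returning.
fn fail(message: &str) -> ! {
    eprintln!("{}", message);
    process::exit(1);
}

/// Returns both credentials, or exits if either is missing.
fn credentials(client_id: Option<String>, secret: Option<String>) -> (String, String) {
    match (client_id, secret) {
        (Some(client_id), Some(secret)) => (client_id, secret),
        // `fail` has type `!`, which coerces to the type of the other arm.
        (_, _) => fail("Didn't get a FERROTYPE_CLIENT_ID and a FERROTYPE_SECRET"),
    }
}
```

This is the same mechanism that lets `panic!` or `return` appear in one arm of a `match` whose other arms yield values.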

Runtime

  • First run: downloaded 13k files and failed with:
    Token expiry is imminent (less than five minutes away); attempting to refresh token.
    Error: reqwest::Error { kind: Decode, source: Error("missing field `refresh_token`", line: 6, column: 1) }
    
    • Having timestamps on the logs will really help
  • Failed with a 503;
    Downloading page #56
    Current usage against quota: UsageAgainstQuota { metadata: 56, download: 9695 }
    Downloading page #57
    Current usage against quota: UsageAgainstQuota { metadata: 57, download: 9884 }
    Downloading page #58
    thread 'main' panicked at 'Failed to fetch metadata! Failed with status: 503 Service Unavailable', src/metadata.rs:131:13
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
  • Immutable singletons + the lazy_static crate
  • lazy_static mutability via Mutex + why isn’t get_mut possible here
  • sleep / delay
  • importance of metrics + observability over tests here
  • async recursive
  • improve throughput
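
On the "async recursive" note: an `async fn` cannot call itself directly because its future would have to contain itself, giving it an infinite size. The usual indirection is to return a boxed future. The sketch below is std-only and includes a tiny executor just so it can run on its own; `count_items` and `block_on` are hypothetical names, and a real project would use its runtime instead.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Hypothetical recursive walk over `pages_left` pages of 100 items each.
/// Boxing gives the recursive future a known, fixed size.
fn count_items(pages_left: u32) -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(async move {
        if pages_left == 0 {
            0
        } else {
            100 + count_items(pages_left - 1).await
        }
    })
}

/// Wakes the thread that is blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor so the example runs without an external runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}
```

Each level of recursion costs one heap allocation, which is negligible next to a network request.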