This document outlines a strategy for building a robust AI-powered solution for tracking and logging beverage data from user-submitted images. The system tracks the type of drink (water, coffee, soda), an estimated volume, and additional metadata such as the container type and whether the drink contains caffeine.
I will build this solution on top of Google Cloud Platform, since that is the cloud provider I am most familiar with. I will choose Google products, but since this is a high-level design, I will describe what those products are and what they do, focusing on their overall purpose in this design rather than their specific features.
The architecture of this system broadly relies on image processing, AI model selection, and storage options.
- Frontend
- The requirements of this project are to start on Mobile and be easily extended to Web at a later date. Therefore, the frontend should really only be responsible for capturing drink images and displaying the results. It can stay this thin because it calls a backend with a common API that we expect other clients (e.g. Web apps) to share.
- In order to write a cross-platform mobile codebase that plays well with web applications, I think React Native is likely the best choice given its extensive ecosystem and JavaScript familiarity. Since this solution is "Google heavy", it might be worth considering Flutter because it integrates well with Google services, but I have to admit I do not have experience with it, nor do I know many developers who do. I think building with more "web-native" solutions offers the most flexibility and extensibility.
- Backend
- The backend for this solution needs an API gateway, such as Cloud Endpoints, to securely manage requests. Since having users and managing usage is a requirement, Firebase Authentication can identify each user, which makes it easy to check requests against that user's history and enforce whatever limits fit the business model.
- Image Storage
- We can save images to Cloud Storage and provide the frontend with signed URLs so that clients upload directly to storage. This offloads bandwidth from our own backend and helps keep upload latency low.
- Pipeline
- The simplest way to set up an AI-powered workflow is to build an asynchronous pipeline. That is, when a user uploads an image, it goes into Cloud Storage, which immediately triggers an event for a cloud function to pre-process the image. If that succeeds, the pipeline continues by publishing a task to Pub/Sub so that the rest of the execution (determining drink type, generating metadata, etc.) is decoupled and scalable.
- Intensive tasks (like running an AI model) can be performed as they are triggered on services that can auto-scale their resources.
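The upload-then-enqueue flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory queue stands in for Pub/Sub, and the names `AnalysisTask`, `on_image_uploaded`, and `drain_tasks` are hypothetical, not part of any Google SDK.

```python
import queue
from dataclasses import dataclass


@dataclass
class AnalysisTask:
    user_id: str
    image_path: str  # hypothetical object path inside the storage bucket


# In-memory stand-in for a Pub/Sub topic (assumption for this sketch only).
task_queue: "queue.Queue[AnalysisTask]" = queue.Queue()


def preprocess(image_bytes: bytes) -> bool:
    """Placeholder pre-processing step: accept any non-empty upload."""
    return len(image_bytes) > 0


def on_image_uploaded(user_id: str, image_path: str, image_bytes: bytes) -> bool:
    """Simulates the storage-triggered function: validate, then enqueue."""
    if not preprocess(image_bytes):
        return False  # nothing is enqueued for images that fail pre-processing
    task_queue.put(AnalysisTask(user_id=user_id, image_path=image_path))
    return True


def drain_tasks() -> list:
    """Simulates a worker pulling every pending task off the queue."""
    tasks = []
    while not task_queue.empty():
        tasks.append(task_queue.get())
    return tasks
```

The point of the split is that `on_image_uploaded` returns quickly, while whatever consumes the queue can be scaled independently of upload traffic.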
- AI Hosting
- Google's Vertex AI offers a suite of over 200 models and tools that can do the image recognition. It is important in this design that the AI hosting is decoupled from the rest of the solution as much as possible because it is the "linchpin" of this product. We want to be able to swap out models with ease so that we can rapidly iterate our solution and have an upgrade path as models get better over time.
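One way to keep the model swappable is to have the pipeline depend on a small interface rather than on a specific model. The sketch below is an assumption about how that could look; `DrinkClassifier`, `Prediction`, and `StubClassifier` are hypothetical names, and a real adapter would call whichever hosted model is in use.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Prediction:
    label: str
    confidence: float


class DrinkClassifier(Protocol):
    """Anything with a matching classify() method can be plugged in."""

    def classify(self, image_bytes: bytes) -> Prediction: ...


class StubClassifier:
    """Hypothetical stand-in that always returns a fixed label."""

    def __init__(self, label: str) -> None:
        self.label = label

    def classify(self, image_bytes: bytes) -> Prediction:
        return Prediction(label=self.label, confidence=0.9)


def analyze(image_bytes: bytes, classifier: DrinkClassifier) -> Prediction:
    """The pipeline only sees the interface, never a concrete model."""
    return classifier.classify(image_bytes)
```

Swapping models then means passing a different object to `analyze`, with no changes to the surrounding pipeline code.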
- Database & Logging
- Firestore can persist things like analysis results, usage counts, and metadata.
- Many of Google's services integrate seamlessly with Cloud Logging and Cloud Monitoring to provide real-time feedback of the product's usage and health.
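As a rough idea of what one persisted analysis result might contain, here is a sketch of a record built from the fields named in the introduction (drink type, volume, container type, caffeine). The field names and the `to_document` helper are assumptions for illustration; the actual schema would be settled during implementation.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DrinkLogEntry:
    user_id: str
    drink_type: str      # e.g. "water", "coffee", "soda"
    volume_ml: float     # estimated volume
    container_type: str  # e.g. "mug", "can", "bottle"
    has_caffeine: bool
    confidence: float    # model confidence for the drink type
    image_path: str      # hypothetical path of the stored image
    logged_at: str       # ISO 8601 timestamp in UTC


def new_entry(user_id: str, drink_type: str, volume_ml: float,
              container_type: str, has_caffeine: bool,
              confidence: float, image_path: str) -> DrinkLogEntry:
    """Builds an entry and stamps it with the current UTC time."""
    return DrinkLogEntry(user_id, drink_type, volume_ml, container_type,
                         has_caffeine, confidence, image_path,
                         datetime.now(timezone.utc).isoformat())


def to_document(entry: DrinkLogEntry) -> dict:
    """Converts the entry to a plain dict, the shape a document store expects."""
    return asdict(entry)
```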
- Scalability
- Images are not cheap to process, so this solution has to consider what options we have for running code on the client (such as compressing images before sending them over the wire). I am personally not too familiar with the options available here, but I think it is safe to assume that the JavaScript ecosystem is mature enough to have many options like that available.
- Because we're building this on Google Cloud Platform and using asynchronous workflows, this design is able to scale automatically, potentially to millions of users, absorbing traffic spikes and releasing resources when they are not needed.
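To make the client-side compression idea concrete, the arithmetic for downscaling a photo before upload is small. The sketch below is written in Python only to show the calculation; the real version would live in the mobile client, and the 1024-pixel limit is an arbitrary assumption, not a requirement of any model.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple:
    """Returns new (width, height) whose longest side is at most max_side,
    preserving aspect ratio. Images already small enough are left unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, a 4032 by 3024 phone photo would be resized to 1024 by 768, which cuts the pixel count by roughly a factor of 15 before anything goes over the wire.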
- Model Selection
- The AI model we use for this product will depend on whether or not we already have good training data for the metadata generation. If not, we can use an off-the-shelf solution like Cloud Vision API for image labelling and object detection. If we do have training data, we could build a custom model with Vertex AI. In all likelihood we will want to build our own custom model so that we can build "competitive advantages" into what our AI solution is able to do over something generic.
- Handling Edge Cases
- The most common way to handle edge cases in a product like this is to implement "confidence thresholds". That is, if the model's confidence score for an image falls below a set threshold, the UI can ask the user to resubmit a picture. For the edge case of multiple drinks in one photo, the system could process only the drink with the highest confidence, or perhaps ask the user to select which drinks to process.
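A minimal sketch of that decision logic follows, assuming the model returns one detection per drink with a confidence between 0 and 1. The 0.75 threshold, the `Detection` type, and the action names are all hypothetical; this version takes the "ask the user to select" option when several drinks are confidently detected.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float


def decide(detections: list, accept_threshold: float = 0.75) -> tuple:
    """Returns (action, detections) where action is one of:
    "retake" - nothing was detected confidently enough, ask for a new photo
    "accept" - exactly one confident detection, log it
    "choose" - several confident detections, let the user pick
    """
    confident = [d for d in detections if d.confidence >= accept_threshold]
    if not confident:
        return ("retake", [])
    if len(confident) == 1:
        return ("accept", confident)
    # Present the options with the most confident detection first.
    confident.sort(key=lambda d: d.confidence, reverse=True)
    return ("choose", confident)
```

Processing only the single highest-confidence drink instead would amount to returning `confident[:1]` from the last branch.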
- Continuous Improvement
- The cool thing about a product like this is that as users are uploading their photos, that gives us data to further train our model on. Periodically retraining the AI can improve the accuracy and further develop the unique benefits of the product.
GCP doesn't have a native billing subscription service, but our API can integrate with the popular solution Stripe to handle payments. With Firebase Authentication already set up, it would be relatively straightforward to build a dedicated microservice that handles subscription events, manages upgrades and downgrades of users' benefits, and sends notifications.
We would also like to implement usage count checks on our own API layer so that the same business logic is applied to both Mobile and future Web applications.
We will also want to use Cloud Endpoints to enforce rate-limiting, ban lists, and prevent abuse from users who might try to use the service without paying.
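The usage check on the API layer could be as simple as comparing a per-user counter against a plan limit before a new analysis is accepted. The plan names and daily limits below are made-up placeholders, and in practice the count would be read from the database rather than passed in.

```python
# Hypothetical plans and limits; real values would come from the business model.
PLAN_DAILY_LIMITS = {"free": 5, "premium": 100}


def can_analyze(plan: str, used_today: int) -> bool:
    """Returns True if the user may submit another image today.
    Unknown plans get a limit of 0, so they are always rejected."""
    limit = PLAN_DAILY_LIMITS.get(plan, 0)
    return used_today < limit
```

Because the check lives behind the shared API, the Mobile and future Web clients would get the same behavior without duplicating the rule.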
- Error Handling
- We can use Cloud Logging to report any model failures or user-reported issues. It is also possible that we could store these issues and build an analytics dashboard to report recurring problems or trends.
- Testing
- In an AI-driven asynchronous system, it is really important that the frontend correctly gives the user any actionable feedback they might need, such as a prompt to retry under different conditions, e.g. "We couldn't process your image due to low lighting" or "There was an unforeseen error processing your image, please try again." This system will rely heavily on user feedback, so the frontend will need highly visible, easy-to-use forms for providing it. We will have to thoroughly test any fallback systems because our product is both automated and non-deterministic.
- The AI model itself will have to be continuously tested and retrained as we gain more data to train the model on. It will be important to train the model on "real-world" data such as images that have poor lighting, bad angles, too much noise, etc.
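One testable piece of the feedback path is the mapping from a pipeline failure to the message the user sees. The sketch below assumes the backend reports a short failure code; the codes themselves are hypothetical, and the generic fallback covers anything unrecognized.

```python
# Hypothetical failure codes mapped to actionable, user-facing messages.
USER_MESSAGES = {
    "low_light": "We couldn't process your image due to low lighting.",
    "no_drink_found": "We couldn't find a drink in your image, please retake the photo.",
}

GENERIC_MESSAGE = (
    "There was an unforeseen error processing your image, please try again."
)


def user_message(failure_code: str) -> str:
    """Returns the message for a known failure code, or a generic fallback."""
    return USER_MESSAGES.get(failure_code, GENERIC_MESSAGE)
```

Keeping this mapping in one place makes the fallback behavior easy to cover with unit tests, even though the model's own output is non-deterministic.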
This framework provides a solid technical foundation for our AI-driven product as well as room for iterative enhancements as user behaviours and the underlying models evolve.
Importantly, there are more considerations to think about, such as security and data privacy. GCP offers robust, enterprise-grade solutions in that regard, which can give us confidence that our design will be able to meet compliance and regulatory standards we likely wouldn't want to handle ourselves.
GCP also has flexible pricing that will allow us to explore this solution without having to spend a lot of money before we know whether it works, a key ingredient in building startups.
I hope you enjoyed this system design overview and I look forward to hearing your thoughts on it!
-Kevin