Here's a brief description of the design of the project

Disclaimer: I am not an expert in system design, and would welcome any feedback on the design/architecture.

Last updated: April 07, 2024

Data Sources and API Manager

The data sources are a mix of APIs and web scrapers, and the response data is generally JSON. Because I'm using web scrapers, I wanted a long enough cooldown between requests so that my IP wouldn't get blocked or banned; I know some of the websites I'm scraping use Cloudflare, so they likely have a bot-management service or something similar in place to prevent scraping.

It got to the point where I had multiple data sources with different request cooldown periods and different retrieval frequencies (some data is available EOD, other data at T+1). To handle this I created an API manager that takes care of pulling the data. I thought it would be fun to explore Python's asyncio, since at work we had started using threading to speed up a couple of our service request calls. Asyncio isn't the same as multi-threading: threading is preemptive, so the scheduler decides when execution switches, whereas asyncio is built around coroutines, where the developer decides when tasks switch execution by suspending some tasks and resuming others.

Why use coroutines? Partly because I thought it would be interesting, and partly because I tried to find existing designs that handle this problem and couldn't find one. The idea is that I have multiple APIs with different cooldown periods, so while API A is on cooldown, API B can run. At the same time, some APIs take longer to return data because of the size of their responses, and while some responses are being inserted into the DB, others are being written to a file store. Each of these waits is an opportunity to switch contexts.
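As a minimal sketch of the idea (the source names, cooldowns, and the fetch/store placeholders below are made up for illustration and aren't the project's actual code), each source can run as its own coroutine and sleep through its cooldown, leaving the event loop free to fetch or store for the other sources:

```python
import asyncio

# Hypothetical sources and cooldowns, purely for illustration.
SOURCES = {
    "api_a": 30,        # seconds between requests
    "api_b": 120,
    "scraper_c": 300,
}

async def fetch(name: str) -> dict:
    # Placeholder for the real API call or scraper request.
    await asyncio.sleep(1)
    return {"source": name}

async def store(payload: dict) -> None:
    # Placeholder for the DB insert or file-store write.
    await asyncio.sleep(0.1)

async def run_source(name: str, cooldown: float, pulls: int = 3) -> None:
    for _ in range(pulls):
        payload = await fetch(name)    # suspend while waiting on network I/O
        await store(payload)           # other sources keep running during this await
        await asyncio.sleep(cooldown)  # respect this source's cooldown

async def main() -> None:
    # All sources make progress concurrently on a single thread.
    await asyncio.gather(*(run_source(name, cd) for name, cd in SOURCES.items()))

if __name__ == "__main__":
    asyncio.run(main())
```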

Data Storage

There are three main data stores:

1) I'm using Cockroach Labs' serverless DB. I came across Cockroach Labs while searching for various DBs to use and thought it would be interesting to try. Their pitch is that the database stays distributed and online to prevent interruptions, and that it's well suited to transaction-heavy workloads.

2) I'm using a file store for .json files and certain .csv files. The reasoning is that this data is not read frequently, so there isn't a strong use case for putting it in the cloud DB and eating away at its storage resources. The files get downloaded and used for calculations, and those calculations are then stored in the DBs; the files remain as a source to cross-check or reference the calculations against.

3) With the improvements to my calculation runtimes, I started looking at MongoDB as cloud NoSQL storage. I'm still designing and implementing how it will be used, but the plan is to store some of the calculations there and query that DB for the reporting and dashboard layers. A rough sketch of how data might be routed across the three stores follows below.
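To make the split concrete, here is a rough, illustrative sketch (not the project's actual code; the table, collection, and path names are hypothetical, and it assumes the psycopg2 and pymongo client libraries): raw responses land in the file store, calculation results go to CockroachDB over its PostgreSQL-compatible interface, and documents destined for reporting go to MongoDB.

```python
import json
from pathlib import Path

import psycopg2               # CockroachDB speaks the PostgreSQL wire protocol
from pymongo import MongoClient

FILE_STORE = Path("data/raw")

def save_raw(source: str, payload: dict) -> Path:
    """File store: keep the raw JSON on disk as the cross-check/reference copy."""
    FILE_STORE.mkdir(parents=True, exist_ok=True)
    path = FILE_STORE / f"{source}.json"
    path.write_text(json.dumps(payload))
    return path

def save_calculation_row(conn, name: str, value: float) -> None:
    """Relational store: insert a calculation result into CockroachDB."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO calculations (name, value) VALUES (%s, %s)",
            (name, value),
        )
    conn.commit()

def save_calculation_doc(client: MongoClient, doc: dict) -> None:
    """Document store: insert a calculation document for the reporting/dashboard layers."""
    client["analytics"]["calculations"].insert_one(doc)
```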

Calculation Engine

My Calc Engine is the module that holds the models and logic used for the various calculations. Some of these calculations are pretty intensive and initially took a long time to run. I've gone from Pandas -> vectorization -> NumPy -> Numba. Each step gave a huge performance gain, but with Numba I noticed a reduction in accuracy in some of the calculations. That was to be expected, since Numba's njit requires the code to be basic Python without third-party libraries, so parts of the logic had to be reimplemented by hand. Calculation runtime went from ~15 minutes to ~86 seconds (measured on the largest data set). This was still painful because I wanted to run these calculations often and with different inputs.

Eventually I took a look at Polars and got a spectacular performance uplift. It took a while to rewrite some of the calculations; I tried to follow good Polars practice by keeping most of the code vectorized to take advantage of Polars' parallelization. Final result: ~4 seconds for the same large dataset. I haven't fully converted all my code to Polars, for the sole reason that some of its functions are marked as unstable (meaning their behaviour or definitions might change in the future), but I've really enjoyed the impact of Polars' performance improvements on the project.
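As a toy illustration of the expression-based style this rewrite meant adopting (the column names and calculations are invented for the example, not the project's actual models), a Polars lazy pipeline lets the engine plan and parallelize the whole computation instead of looping in Python:

```python
import numpy as np
import polars as pl

# Toy data standing in for the real dataset.
prices = np.random.default_rng(0).random(1_000_000)

# Express the calculations as Polars expressions; the lazy engine optimizes the
# plan and evaluates the columns in parallel when .collect() is called.
result = (
    pl.LazyFrame({"price": prices})
    .with_columns(
        pct_change=pl.col("price").pct_change(),
        rolling_mean=pl.col("price").rolling_mean(window_size=20),
    )
    .collect()
)
print(result.tail(3))
```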

The calculations are run as batch jobs: some daily, others weekly, and others monthly (after aggregating data).
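A minimal sketch of how the dispatch might work (the job names and the Monday/first-of-month rules are placeholders, not the project's actual schedule): tag each calculation with a cadence and run whatever is due on a given date.

```python
from datetime import date

# Hypothetical job registry: each calculation is tagged with its cadence.
JOBS = {
    "daily_metrics": "daily",
    "weekly_aggregates": "weekly",
    "monthly_rollup": "monthly",
}

def is_due(cadence: str, today: date) -> bool:
    if cadence == "daily":
        return True
    if cadence == "weekly":
        return today.weekday() == 0    # e.g. run weekly jobs on Mondays
    if cadence == "monthly":
        return today.day == 1          # e.g. run monthly jobs on the 1st
    return False

def run_due_jobs(today: date) -> list[str]:
    # In the real project each due job would call into the Calc Engine.
    return [name for name, cadence in JOBS.items() if is_due(cadence, today)]

print(run_due_jobs(date(2024, 4, 1)))  # a Monday and the 1st, so all three are due
```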

Analytical Layer of Database

This is where I have the data models: the relations and tables in which the calculations are stored in the DB. These feed into both the Reporting Layer and the Visualization Layer.
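For a sense of shape only, a table in this layer might look something like the following (the table and column names are invented for illustration, not the actual schema):

```python
# Hypothetical analytical-layer table: calculation results keyed by instrument,
# date, and metric, ready to be queried by the reporting and dashboard layers.
CREATE_CALC_RESULTS = """
CREATE TABLE IF NOT EXISTS calc_results (
    instrument_id STRING NOT NULL,
    as_of_date    DATE   NOT NULL,
    metric_name   STRING NOT NULL,
    metric_value  FLOAT  NOT NULL,
    PRIMARY KEY (instrument_id, as_of_date, metric_name)
);
"""
```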

Reporting Layer

Why have a separate Reporting Layer and Dashboard Layer? I wanted some logical processing to happen before the Dashboard: recommendations, insights, etc. I view the Dashboard as a separate entity that gives the user the ability to view those insights as well as the raw data. As the project grows, I think this will be the right approach: separating the general visualization from the logic and processing.
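A small sketch of what that separation might look like (the Insight type, the threshold rule, and the function names are all hypothetical): the reporting layer turns stored calculations into insight records, and the dashboard only renders what it is handed.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    title: str
    detail: str

def build_insights(calculations: dict[str, float]) -> list[Insight]:
    """Reporting layer: apply rules/thresholds to calculations and emit insights."""
    insights = []
    for name, value in calculations.items():
        if value > 1.0:  # placeholder rule standing in for real recommendation logic
            insights.append(Insight(title=f"{name} elevated", detail=f"{name} = {value:.2f}"))
    return insights

def render(insights: list[Insight]) -> None:
    """Dashboard layer: purely presentational, no business logic."""
    for insight in insights:
        print(f"* {insight.title}: {insight.detail}")

render(build_insights({"volatility": 1.4, "drawdown": 0.2}))
```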

Dashboard/Visualization Layer

Synthesizes information from the reporting layer and visualizes the data. Gives the user the ability to explore the data.


Ideation/Feature Process

When you have a blank-slate project with endless possibilities, I find it helpful to have some sort of process for deciding the order of features to tackle. When I first started, I focused on the big building blocks: Data + APIs (without the manager), Data Models + Data Storage, Calculation Engine, and Visualization (without the dashboard). Now that I have that core infrastructure built, I use the following process to decide what to work on next:

0) Check the backlog, which is organized by priority, and take the highest-priority item. But let's look at how an idea gets into the backlog and ranked (because that's probably more interesting).

1) Ideation: Find an idea or topic. This can come from a tweet, an article I read, a section of a book, a podcast or interview I listened to, etc...

2.1) Given that idea, I try to judge how important it would be. Is it something I want in the project immediately, or is it just nice to have? Will it be used frequently, or only sporadically?

2.2) Do I already have the data for that idea? How hard would it be to acquire? (How much work? Is it feasible? Do I have a source?) Do I have to review or learn any theory? (e.g., read a textbook chapter, a book, or a research paper, or review some math/stats?)

3) After assessing all of this, I try to size the work effort and guesstimate whether the reward is worth pursuing immediately, or whether I'm better off taking the top item from the backlog and placing this idea somewhere in the backlog.