In the world of data integration and APIs, seemingly simple tasks can sometimes lead to complicated issues. We recently solved a tricky problem while working with an external API that appeared straightforward at first but grew more complex as the investigation progressed.
The Issue: A Never-Ending Loop of Records
Despite a well-thought-out setup, our client, a trucking software company, noticed something strange happening with their process of pulling and clearing records from an external API: Truckstop’s RMIS (Registry Monitoring Insurance Services). Insurance records weren't being cleared from the Delta Queue as expected, so the same records were processed over and over. This redundancy consumed excess computing power and hit the API rate limit, which caused RMIS to automatically disable our client’s credentials until they could be manually reset, and in the meantime we stopped receiving updates altogether.
The RMIS system works like this:
- Pull the list of updated record IDs from the Delta Queue
- Request individual records for each ID in the list
- Confirm that the records have been processed by sending a “CLEAR” message to the API
One problematic detail of this process is that you must complete these three steps within 2 minutes, or the system assumes you had some issue processing the records and puts them back on the Delta Queue.
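A minimal sketch of that loop is below. The client object and its method names (`get_delta_queue`, `get_record`, `clear_records`) are hypothetical stand-ins, not the actual RMIS endpoints; the point is the three steps and the 2-minute deadline.

```python
import time

def process_delta_queue(rmis_client, batch_size=40, window_seconds=120):
    started = time.monotonic()

    # Step 1: pull the list of updated record IDs from the Delta Queue.
    record_ids = rmis_client.get_delta_queue(limit=batch_size)

    # Step 2: request the individual record for each ID.
    records = [rmis_client.get_record(record_id) for record_id in record_ids]

    # Step 3: confirm processing by sending a CLEAR for the batch.
    # If the 2-minute window has already closed, RMIS assumes the batch
    # failed and re-populates the Delta Queue with the same records.
    if time.monotonic() - started < window_seconds:
        rmis_client.clear_records(record_ids)

    return records
```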
The Investigation: The Cache Puzzle
At first, we assumed that we were attempting to process too many records and that the 2-minute window was closing before we could clear the queue. With that in mind, we reduced the number of requested records per cycle from 100 down to 40, but the problem persisted.
The processing system is spread across multiple small repos for modularity, which made tracking down the issue tricky. We added logging in several places along with additional DataDog traces, which let us follow a request as it traversed the deployed services. We also raised the level of several log messages that had been set to DEBUG but belonged at WARN or ERROR, and this turned out to be the key.
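The change itself was small. Here is an illustrative sketch using Python's standard logging module; the logger name, function, and message are hypothetical, but they show how promoting a DEBUG message to WARNING made the failure visible at the default log threshold.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rmis_sync")

def report_missing(record_id: str) -> None:
    # This message was originally emitted at DEBUG level, so it never
    # appeared at the default INFO threshold:
    #   logger.debug("record %s missing from expected list", record_id)
    # Promoting it to WARNING surfaces the failure in normal log output
    # and in the traces that span the deployed services.
    logger.warning("record %s missing from expected list", record_id)

report_missing("12345")
```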
We discovered that, internally, we were comparing a list of expected records against the list of records returned by RMIS, matching them by ID. If an incoming record was not in the list of expected records, we discarded it and did not add it to the list to clear; a previous process should already have caught records outside the list and sent them to be processed and cleared separately. We were seeing repeated failures where an expected record was not in the list, and in many cases the list was completely empty! How could that be?
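Roughly, the comparison step looked like the sketch below (the function and record shapes are illustrative, not our production code):

```python
def records_to_clear(expected_records, incoming_records):
    # Index the expected records by ID for quick membership checks.
    expected_ids = {rec["id"] for rec in expected_records}

    to_clear = []
    for record in incoming_records:
        # Records whose IDs are not in the expected list are discarded here;
        # a prior process is assumed to have handled them separately.
        if record["id"] in expected_ids:
            to_clear.append(record)
    return to_clear
```

With an empty expected list, `to_clear` stays empty, so no CLEAR is ever sent and the same IDs show up again on the next pull.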
We delved into the system to find the root of the issue and discovered that the cache management logic was creating namespace collisions. The heart of the problem was that the worker queues, which had been set up to handle these requests asynchronously and in order, were becoming overloaded: processing a given customer’s data began to overlap with the next cron cycle, which runs every 10 minutes. Because each cron cycle created a new cache and deleted the old one, the long processing times meant multiple runs were active at once and interfering with each other. As a result, the records were never marked as processed and removed from the Delta Queue, so they were constantly reprocessed.
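Here is a rough illustration of the failure mode, assuming a simple key/value cache; the cache client and key names are hypothetical:

```python
# A single key shared by every cron run.
CACHE_KEY = "rmis:expected-records"

def start_cron_cycle(cache, expected_records):
    # Each 10-minute cycle wiped the previous entry and wrote its own.
    cache.delete(CACHE_KEY)
    cache.set(CACHE_KEY, expected_records)

def finish_processing(cache):
    # If a slow run was still processing when the next cycle started,
    # this read returned the newer run's data -- or nothing at all --
    # so the ID comparison failed and no CLEAR was ever sent.
    return cache.get(CACHE_KEY) or []
```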
The Solution: Bringing in Unique Identifiers
We addressed this issue by generating a namespace key (also known as a Domain Key) for the cache from a value that is unique to each cron run. Every run therefore had its own cache, avoiding namespace collisions. We also moved the cache deletion to later in the same request, after the cache had been read, so that multiple runs could coexist without conflict (see the sketch after the list below).
Here’s how the new process worked:
- Unique Request Identifiers: Each cron job run generated a unique request identifier included in the cache domain key.
- Deferred Cache Deletion: The cache deletion was moved to after the data had been processed and compared, ensuring no premature deletions.
- Isolated Processing: With unique identifiers, each process operated independently, allowing the system to manage multiple requests without conflict.
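A minimal sketch of the fix, again assuming a hypothetical key/value cache client and illustrative key names:

```python
import uuid

def start_cron_cycle(cache, expected_records):
    # A per-run identifier becomes part of the cache domain key, so
    # overlapping cron cycles can no longer collide on the same entry.
    run_id = uuid.uuid4().hex
    cache.set(f"rmis:expected-records:{run_id}", expected_records)
    return run_id

def finish_processing(cache, run_id):
    key = f"rmis:expected-records:{run_id}"
    expected_records = cache.get(key) or []

    # ... compare incoming records against expected_records, send CLEAR ...

    # Deletion happens only after the data has been read and compared,
    # so a long-running cycle never loses its cache to a newer one.
    cache.delete(key)
    return expected_records
```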
The Outcome: A Smoother Integration Process
After implementing these changes, the recurring record issue was resolved! The system was now able to process records efficiently without unnecessary reprocessing, API rate limits were no longer being hit, and compute resources were no longer wasted on redundant work.
This case study underscores the significance of careful cache management and the use of unique identifiers in systems where processing needs to happen concurrently. By solving the mystery of the recurring RMIS records, our client was able to streamline their integration process, establishing a model for best practices in API handling and data processing.