In the ever-evolving world of Data Engineering, many tools and technologies can help you streamline your ETL pipelines. Choosing the right one for your team can be challenging with many options. In this post, we will explore some of the ETL options we evaluated for Database Replication in our Data Warehouse and how we chose Artie to streamline our ETL pipelines.
The Problem
The problem we faced is a common one. We needed to get data from a transactional MySQL database into our Snowflake Data Warehouse. We had a few options to choose from, and we evaluated each one based on the following criteria:
- Ease of use
- Scalability
- Cost
- Maintenance
- Flexibility
We could’ve written an Airflow DAG to do the job, but we wanted something more accessible and flexible to maintain. We didn’t want to have to keep a DAG for every table we wanted to move from MySQL to Snowflake, freeing our data team to focus on more critical tasks.
The Headache Solution
Ultimately, we bounced around between a couple of no-code ETL tools like Stitch & Airbyte. We initially landed on Airbyte because they had an existing Airflow integration and Source and Target connectors for MySQL and Snowflake. It was easy enough to set up. However, we found a few headaches with this approach:
- Airbyte wasn’t correctly picking up deletes from the MySQL database.
- Incremental loads were starting to take up more and more time.
- Data latency between the sync runs.
- If there was a data issue and we needed to reset the Airbyte Sync with a full load, it would take 3+ hours to complete, often outside our SLA.
- We had to enable new fields and columns in the Airbyte UI manually.
I want to iterate that Airbyte is a fantastic tool and that we still use it for some of our other ETL pipelines. It just wasn’t cutting it for our MySQL to Snowflake replication.
The Artie Solution
I was introduced to Artie by a former colleague. Artie is a real-time database replication tool that streams your transactional database into your Data Warehouse. Artie handles high throughput with low latency. Artie automatically handles schema evolution and uses atomic operations to guarantee data consistency. It’s straightforward to set up and maintain.
Some of the things that we noticed right away after adopting Artie:
- We got up and running with a whole load of our transactional data in less than an hour (a third of the time Airbyte took to do the full load).
- Deletes were adequately picked up from our MySQL database and replicated into Snowflake.
- We no longer needed to worry about incremental loads taking up more and more time (data is streaming in real-time).
- Their UI was straightforward to use
- Artie automatically picked up new fields and columns and replicated them into Snowflake.
Instead of trying to trigger the Airbyte Syncs from Airflow, we now knew that our data was always up-to-date in Snowflake. We no longer had to worry about data latency between sync runs. We could focus on more critical tasks, and Artie could handle the rest. This diagram is how Artie ended up fitting into our Data Stack.
When we set up our Account with Artie, we set up the replication with a couple of clicks. On a follow-up call with their team, they were ready to help us with setup, but we had already completed it because it was so easy to use.
Other Notable Features
Before we wrap up, I want to mention a few additional benefits that Artie provides that we’ve found to be practical/cool:
Analytics Dashboard: Artie provides an analytics dashboard that allows you to monitor the health of your replication jobs. You can see how many records have replicated. More specifically, you can see how many records have been inserted, updated, and deleted.
Schema Evolution Notifications: Artie will notify you of schema changes in your source database. These notifications are handy for maintaining data consistency.
Snowflake Eco Mode: Artie has a Snowflake Eco Mode that allows customers to minimize time and maximize resource utilization of their Snowflake virtual warehouse. This is a great way to save costs (we don’t care about replicating data in real-time during off-hours and would instead save money on our Snowflake bill).
Additional Sources & Destinations: Although we only utilize the MySQL source to Snowflake destination, Artie supports several other sources and destinations. This flexibility is excellent for future use cases. To see a list of supported sources and destinations, check out their documentation.
Conclusion
In conclusion, Artie has been a fantastic tool for us. It has helped us streamline our ETL pipelines and has allowed us to focus on more critical tasks. We no longer have to worry about maintaining Airbyte Syncs for every single table/field that we want to move from MySQL to Snowflake. We can trust that our data is always up-to-date in Snowflake and that Artie will handle the rest. If you’re looking for a real-time database replication tool, I highly recommend checking out Artie.