Decision Guide to Choosing a Data Integration Tool
Data integration is the process of taking data from many different sources in your organization and centralizing it to a data warehouse or data lake. Several data integration tools have recently arisen to make this a much simpler process than integrating data manually, but choosing between them can be a daunting task.
This guide covers important considerations when selecting a data integration tool and makes recommendations based on your market segment.
Having all of your data in one place is an essential step to building a single source of truth for your organization. With a single source of truth containing all of the data used in an organization, data teams can feel more confident that downstream consumers have a unified view of data for use cases such as business intelligence to support decision-making, automation of operational processes, machine learning based on customer data, or output to customers.
Before the Modern Data Stack era, data integration was a significant hurdle for data engineers. They would write complicated, custom scripts to extract and load data and schedule the pipelines with cron jobs or Airflow. These pipelines were often brittle, leading to downstream data quality issues. This EL (Extract and Load) step is vastly simplified with a data integration tool (also known as an ELT tool or ETL tool).
Instead of spending valuable engineering resources developing custom connectors across disparate sources (such as relational databases, Google Analytics, ad networks, and SaaS tools), anyone can set up robust pipelines for data ingestion using these data integration tools. The use of a data integration tool eliminates development time and minimizes operational efforts.
Types of data integration tools
Here are the main types of data integration tools:
- Open-source: These are suitable tools for avoiding a hefty price tag or having your data in another company’s cloud. The downside is that you have to deploy and maintain these solutions on your own. These rely on the open-source community, so sometimes connectors cannot be reliable and scalable. Tools in this category include Airbyte, Portable, Meltano, Singer, and PipelineWise.
- SaaS (software as a service): These are managed data integration tools which are frequently self-service as well, making it effortless for you to get started. Solutions in this category are also more reliable and scalable than using an open-source alternative. The most popular pricing option is paying for the amount of data integrated monthly. This can range from a hundred to thousands of dollars per month. Tools in this category include Fivetran, Stitch, Matillion, Alooma, Xplenty, and Hevo Data.
- Enterprise: These are data integration platforms that solve not just a data integration process but also various problems such as data transformation, data governance, data management, data ops, etc. that are necessary especially for use with enterprise business intelligence tools. Solutions in this category are expensive and catered to many needs of big enterprises. Tools in this category include solutions from IBM, Informatica, Microsoft, Oracle, and SAP.
What to look for when choosing a data integration tool
Choosing the right vendor can be challenging, though, as there are many factors to consider.
This article will discuss some core considerations when choosing a data integration tool and suggest the right one for different organizations.
Data source coverage
The first criterion for choosing a data integration tool is to ensure that all data sources are covered. In other words, how thoroughly the tool can integrate with the data producers in your company. Some typical data producers of a company might include the following: an order processing system, a CRM, web tracking, marketing tools, accounting software, data sets in spreadsheets, unstructured data in a data storage solution like S3 etc.
You can list out all of the source systems your company is using or plan to use in the future and check if the data integration tool covers everything. Having all data sources in one solution reduces the platform and operational costs. A safe bet is to pick the solution with the most integrations if all other criteria are equal.
Some tools have limitations in the number of sources that you can integrate, too, so make sure you check this before committing to a vendor.
Sometimes, your organization will have one or multiple outlier sources that no vendor supports. If this is the case, look for a vendor that allows the development of custom sources.
Some tools offer multiple ways to add a custom source, so look for the most familiar options here. For example, with Stitch, you can write a Singer tap, write a program to send data to an Import API, or use incoming webhooks. With Fivetran, you have similar options, but you can also develop a Cloud Function on AWS, GCP, or Azure that Fivetran will call.
If you are trying to build a single source of truth for your organization, you will most likely have one destination. Most data integration tools support the most common data stores as destinations, such as popular data warehouses (e.g. BigQuery, RedShift, Snowflake, and Synapse), data lakes (e.g. S3, Databricks), and relational databases (e.g. MySQL, PostgreSQL, and SQL Server).
However, if you want to load data into multiple destinations, be aware that some tools (like Stitch) do not support this. If you plan to separate raw data for security purposes, you might have multiple destinations or use a reverse ETL tool (to load transformed data back to an application).
Pricing is a significant factor that can solely dictate which data integration solution you choose. The three types of data integrations tools mentioned above all have different pricing structures.
- Open-source: you pay for hosting the solution, typically ranging from $0 - $50 ($0 if you can utilize another running server). But a substantial cost that often gets neglected is the engineering cost in time spent deploying and operating the solution.
- SaaS: The typical pricing model is to bill based on the data integrated after the initial load. You are looking at paying $100 to $5,000 a month depending on how much new data the company generates and whether your sources support incremental loads.
- Enterprise: With enterprise solutions, there are many factors in deciding how much you pay. But a reasonable estimate is north of $5,000 per month.
Security is another crucial consideration when choosing a data integration system. There are two main things to consider here: how the solution connects to your data source and where/how it stores your data.
If you are using a managed solution (SaaS or some Enterprise), the software that runs the extraction logic will have to connect to your data sources. For native cloud sources, this is not a problem since the tool will most likely call an API to get the data. However, for relational database sources, you have to consider different connection topologies.
The first and easiest option is to expose your database to the internet and use firewall rules to allow access only to the data integration tool. This is the most insecure option and will not be available for many organizations due to pre-existing security policies. The second option is to tunnel the connection through an SSH server. This keeps your database secured behind a layer of protection but will incur additional server hosting and networking costs. The last and most secure option is setting up a site-to-site VPN (a virtual private network) with the solution provider. Many enterprises choose to go with this approach, but it is also technically complicated and expensive.
Where and how the data integration tool stores your data is another consideration. In some regions or some industries, organizations have to store their data in the same physical area. Most providers provide options for you to either store and process your data in the US or EU.
If any of these two factors limit you, the safest option is to choose an open-source alternative. With an open-source solution, you can deploy it in a closed network with your data sources and control what data is stored where.
Control over how your data is replicated can be crucial to deciding on a tool. You can choose to have:
- Full data replication: does a full extract of the source every time the integration runs. This is suitable for small tables or tables without a primary key.
- Time-based/key-based incremental: queries the source and filter based on an incremental ID column or a created/modified time column. This is easy to set up and can work well in most situations.
- Change data capture (CDC): reads the database logs, determines and only extracts what changed since the last successful integration. This is more complex to set up and requires admin permission of your database. However, this often is the most efficient method and can also capture hard deletes.
For most of your sources, you would want to go with either time-based/key-based incremental or CDC. Choosing between the two boils down to whether or not you need to capture hard deletes, rows that are dropped entirely from the table instead of being marked as deleted (soft delete). Not all databases and vendors (like the free Matillion Data Loader) support CDC replication, so be sure to check the documentation before choosing.
Another factor to consider is whether or not you want to customize what tables and columns to replicate. This factor would not be an issue for smaller sources as most tools provide you with a way to do that via the UI. However, if you want to choose 50+ tables from a database with hundreds of tables, it can be tedious to click your way through the UI. Things will undoubtedly get out of hand if you have to go in and select the primary key and modified date for each of these tables, especially if business processes related to compliance or security require this customization. Here, look for a vendor that allows you to do this via API or command line.
Data transformation is the last step in the ELT process, with extract and load covered by the data integration tool. Many providers like Stitch or Matillion offer different transformation products. Many others like Fivetran or Airbyte offer integration with the popular open-source transformation tool dbt.
Which data integration solution to choose here depends on your team’s preference for which transformation tool to use. It is relatively easy to set up and maintain an end-to-end data pipeline if you choose solutions like Fivetran and dbt. A note here, though, if you want more customization on your workflows, such as different schedules for different sources or complex dependencies, you are better off using a separate orchestration tool like Airflow or Luigi for that.
Which data integration vendor is right for you?
Hopefully, reading through the earlier part, you understand that choosing a data integration tool is no simple task. In the following section, I will attempt to suggest what data integration tool is suitable for different types of companies.
If you are an early-stage startup, you will likely be constrained by money and engineering capacity.
Since your company is relatively new, chances are you will not have a lot of data (less than 5 million new rows per month) with little to no data infrastructure in place. The number of sources will be minimal, too, compared to companies in a later stage.
If you or your team have a technical background and some engineering resources, go with Airbyte. Airbyte is an open-source tool with 100+ connectors and new ones added monthly. You will need to set up a server to host the application, but everything is super easy to set up via the UI after that. It also integrates with dbt to trigger data transformation in one tool after loading the data. Using the recommended VM size, you are looking at paying ~$30 per month to host Airbyte (even less if you schedule the instance to turn off after usage).
If you or your team do not have a technical background and don’t have too many sources (less than ten), go with Stitch or Airbyte Cloud. Stitch is dead simple to get started. You just need to sign up for an account and configure sources & destinations. Stitch has connectors to 130+ sources and many popular data destinations (but you can only have one destination). You pay $100 per month for 5 million rows extracted, excluding the initial load. Airbyte has a new managed offering, Airbyte Cloud, which promises cheaper pricing by charging by compute time or rows changed, whichever is cheaper. Try their calculator for more details on pricing.
If you work for a growth-stage startup, you will likely have access to more funding and engineering time. The list of data sources will also be growing, with new ones added frequently. The amount of monthly new data that your company generates will range from 5 million to 100 million rows (or even more). You’ll need to keep your total data infrastructure costs in mind as usage of a data integration tool will increase your data warehousing costs and potentially downstream tool (e.g. business intelligence) costs as well.
I would recommend going with Fivetran. Fivetran has 150+ source connectors developed and tested in-house. Fivetran also has excellent customer support and system SLA (service level agreements) for some tiers. You also have the option to select the cloud provider that Fivetran will deploy to and set up VPN tunnels for advanced security. Fivetran has recently acquired HVR (stream capability coming?) and is actively working on an alternative to do CDC with read-only permissions. With this option, expect to pay between $1,000 to $5,000, or even more depending on your data throughput.
From Airbyte's roadmap, it looks like they are planning to invest in connector quality and SLA, so keep an eye on them in 2022.
Data Consultancy or Marketing agency
If you work for a data consultancy or a marketing agency, you will likely work with many customers. If your business model is to have the customer pay for their data integration tool, use the other sections here as a guide to choosing the right one. However, if your firm decides to do data integration for the customer, you can select Airbyte or Fivetran. Go with Airbyte when you want to help clients deploy on their infrastructure. Consider Fivetran if you desire a hands-off approach. You also have an option to purchase Fivetran credits in bulk at a discount and use it for all your customers.
For an enterprise, there is no easy general suggestion as the requirements are vastly different. One corporation can use a suite of solutions from SAP, while the next uses one from Oracle. Your best bet here is to go with your organization’s existing enterprise data infrastructure vendor solely if it can cover all your data needs. If not, you can use that tool in conjunction with a tool like Fivetran.
About the Author
Tuan Nguyen is the CTO of Joon Solutions, a Data as a Service agency helping companies build their first modern data stacks. Tuan's professional experiences revolve around building and managing data science teams, analytics infrastructure, and analytics use cases.