Identity & Entity Resolution: The Build vs. Buy Debate

How should organizations actually create Customer 360° views?

Last week, Hightouch announced a new feature that allows teams to define rules to perform identity and entity resolution directly in their data warehouse. At Brooklyn Data, we’ve helped companies find solutions for identity resolution for years, supporting a mix of custom solutions and purchased solutions, such as those provided by Customer Data Platforms, dedicated tools like Zingg, and now, Hightouch. As more organizations work to build 360° views of their customers and as the range of possible identity resolution solutions grows, we took a step back to reflect on the overall state of the market. We’ve shared our perspectives below, and we close with a framework to help you decide the best way to resolve identities and entities for your particular company.

What is Identity (and Entity) Resolution?

Identity resolution is the process of unifying different data sets to build a single view of a customer. Ultimately, this gives organizations a 360° view of that customer’s behavior and traits that can then power personalized marketing, advertising, and customer service.

Identity resolution is just one example of the broader concept of “entity resolution.” An entity can be anything: a user, a company, a pet, or a physical product. It is any person, place, or thing to which you want to roll up data from across sources and touchpoints. For example, a cable company might perform entity resolution to unify all records associated with a household, the unit that they sell to, rather than focusing primarily on unifying individual person records via identity resolution. Identity resolution is a type of entity resolution where the entity being consolidated is a user record.

Identity resolution is essential because businesses engage with their customers across multiple platforms, touchpoints, and devices. To understand the customer journey and provide relevant and accurate future experiences and touchpoints, you need a full view of the customer’s attributes and engagement. The same is also true of other entities the business cares about. Proper entity resolution will enable data scientists to build stronger predictive models for each entity.

The remainder of this article will focus on the key strategies to accomplish entity and identity resolution. Generally speaking, the approach is to tie all of your data from multiple sources together in a single view by joining on a unique identifier.
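As a minimal sketch of that idea, the Python below joins two hypothetical sources on a shared email address and rolls them up into one profile per customer. The record fields, the sample values, and the build_profiles helper are all assumptions made for illustration; real sources rarely share such a clean identifier, which is why the rest of this article exists.

```python
from collections import defaultdict

# Hypothetical records from two sources that both carry an email address.
crm_rows = [
    {"email": "ana@example.com", "name": "Ana Diaz"},
    {"email": "bo@example.com", "name": "Bo Chen"},
]
order_rows = [
    {"email": "ana@example.com", "order_total": 40.0},
    {"email": "ana@example.com", "order_total": 15.5},
]


def build_profiles(crm, orders):
    """Join both sources on email and roll them up into one profile per customer."""
    profiles = defaultdict(
        lambda: {"name": None, "order_count": 0, "lifetime_value": 0.0}
    )
    for row in crm:
        profiles[row["email"]]["name"] = row["name"]
    for row in orders:
        profile = profiles[row["email"]]
        profile["order_count"] += 1
        profile["lifetime_value"] += row["order_total"]
    return dict(profiles)


profiles = build_profiles(crm_rows, order_rows)
```

Here the email acts as the unique identifier. When no single identifier is shared across sources, the matching steps described later take its place.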

Determining the Type of Resolution Your Organization Needs

The essential question you need to answer to ground this exploration is: “What entities do I care about?” Understanding who your buyer is, and other essential “nouns” associated with them, will dictate the types of entities and identities that you need to resolve.

For example, most B2C companies care first and foremost about packaged identity resolution, as their primary buyers are individual people. These companies may care about additional entities, however. For example, a pet store may want to track separate entities for each pet that an individual person owns. This would allow the pet store to personalize offers to the pet owner based on discounts, birthdays, and more for each individual pet.

Most B2B companies sell their products at the company level, so the company is the most important entity. Other essential entities might include workspaces or sub-accounts within each company.

Another dimension to consider is the level of fidelity needed with these resolved entities. In cases where high fidelity is required, such as personalized lifecycle marketing, deterministic resolution (with a clear rules-based approach) is preferable. On the other hand, if an organization wants to consolidate as many entities as possible and can tolerate lower fidelity, such as in a paid-advertising audience scenario, probabilistic resolution can rely on black-box AI algorithms to consolidate entities.
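To make the contrast concrete, here is a small Python sketch. The deterministic rule only matches on an exact, trusted identifier, while the second function uses a plain string-similarity threshold as a stand-in for a probabilistic model (real probabilistic tools use trained models, not this heuristic). The function names, sample records, and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def deterministic_match(a, b):
    """Rules-based: match only when a trusted identifier is present and equal."""
    return a["email"] is not None and a["email"] == b["email"]


def similarity_match(a, b, threshold=0.85):
    """Score-based stand-in for a probabilistic model: match when names look alike."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return score >= threshold


# Hypothetical records for what may or may not be the same person.
web = {"email": "jon.smith@example.com", "name": "Jon Smith"}
store = {"email": None, "name": "John Smith"}

rule_result = deterministic_match(web, store)   # no shared email, so no match
score_result = similarity_match(web, store)     # names are similar, so a match
```

The rule refuses to merge the two records because it has no exact evidence, which protects fidelity. The score-based check merges them, which increases coverage but accepts the risk that Jon and John are different people.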

Building Entity and Identity Resolution Solutions

Organizations that want full control over their entity resolution processes (including identity resolution) can do this within their own data warehouses or graph databases with SQL, dbt, and/or machine learning algorithms. Brooklyn Data has helped companies of all sizes, at every stage of their data discovery journey, build, maintain, and optimize entity resolution processes. Here are the steps a company needs to take:

Pre-processing. Ensure that your data is in the same format. For example, you’ll want to strip phone numbers of non-numerical characters and standardize casing within names. If dealing with geospatial identifiers, geocoding and standardizing addresses can be of great impact down the line.
Indexing. Use algorithms to index records and block (reduce the problem set by only comparing records with similar indexes). This will simplify the data set for the next step, the comparison phase.
Comparing. Find matches between different data points that you’ll later use to unify disparate records. The comparison can be done across multiple semantic categories, such as name, email, phone, and more. You’ll choose semantic categories depending on data completeness and how you wish to balance the risk of false positives vs. false negatives. This step is highly configurable: you can prioritize certain factors over others with weights, use exact matches or similarity scores, and more.
Classifying. Determine which records you’ll want to unify by assigning a score depending on the semantic categories that matched. Matching on more than one category, for example, indicates a stronger match and would result in a higher matching score.
Merging. Unify records based on which profile they were classified to belong to.
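The five steps above can be sketched end to end in a few dozen lines of Python. This is an illustrative toy, not a production design: the record fields, the first-letter blocking key, the weights, and the 0.7 threshold are all assumptions, and in practice this logic would more likely live in SQL or dbt models inside the warehouse.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical input records; field names and values are assumptions.
records = [
    {"id": 1, "name": "Jane Doe", "email": "JANE@Example.com", "phone": "(555) 010-0000"},
    {"id": 2, "name": "jane  doe", "email": "jane@example.com", "phone": "555-010-0000"},
    {"id": 3, "name": "Janet Dole", "email": "janet@example.com", "phone": "555 010 9999"},
    {"id": 4, "name": "Sam Lee", "email": "sam@example.com", "phone": "555-010-1234"},
]

WEIGHTS = {"email": 0.5, "phone": 0.3, "name": 0.2}  # assumed weights
THRESHOLD = 0.7  # assumed cutoff for declaring a match


def preprocess(rec):
    """Pre-processing: standardize casing, whitespace, and phone formatting."""
    return {
        "id": rec["id"],
        "name": " ".join(rec["name"].lower().split()),
        "email": rec["email"].strip().lower(),
        "phone": re.sub(r"\D", "", rec["phone"]),
    }


def block_key(rec):
    """Indexing: a deliberately crude blocking key (first letter of the name)."""
    return rec["name"][:1]


def compare(a, b):
    """Comparing: exact match on email and phone, similarity score on name."""
    return {
        "email": 1.0 if a["email"] == b["email"] else 0.0,
        "phone": 1.0 if a["phone"] == b["phone"] else 0.0,
        "name": SequenceMatcher(None, a["name"], b["name"]).ratio(),
    }


def classify(scores):
    """Classifying: weighted score; more matching categories give a higher total."""
    return sum(WEIGHTS[field] * s for field, s in scores.items()) >= THRESHOLD


def merge(clean, matched_pairs):
    """Merging: group matched ids into profiles with a small union-find."""
    parent = {r["id"]: r["id"] for r in clean}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for r in clean:
        groups[find(r["id"])].append(r["id"])
    return sorted(groups.values())


def resolve(raw):
    clean = [preprocess(r) for r in raw]
    blocks = defaultdict(list)
    for r in clean:
        blocks[block_key(r)].append(r)
    matched = [
        (a["id"], b["id"])
        for block in blocks.values()
        for a, b in combinations(block, 2)  # compare only within a block
        if classify(compare(a, b))
    ]
    return merge(clean, matched)


profiles = resolve(records)  # lists of record ids, one list per resolved profile
```

With this sample data, records 1 and 2 collapse into one profile because their normalized email, phone, and name all agree, while record 3 stays separate because a similar name alone cannot reach the threshold. Raising or lowering the weights and threshold is how you trade false positives against false negatives.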

Buying Identity and Entity Resolution Solutions

Identity resolution solutions for sale can be broken into two fundamental categories, differentiated by their relationship to an organization’s data warehouse.

Packaged Customer Data Platforms (CDPs) such as Amperity, ActionIQ, and Treasure Data popularized the vision of a Customer 360° and initially made identity resolution accessible for the broader market a decade ago. Traditional CDPs work by collecting and storing data within their own platform, though some can now be hybridized to directly utilize data stored within the warehouse. They employ a structured and predefined method of identity resolution within that closed system. CDPs solve for identity resolution but may require significant implementation costs to support other forms of entity resolution. The closed nature of CDPs continues to be attractive for organizations that are in the early stages of handling data. Over time, if the organization does not grow its data practices, it may have trouble escaping vendor lock-in and solving for its data needs on its own.

Warehouse-native solutions such as Zingg, Truelty, and FullContact give organizations configurable tools to manage entity resolution within their own data warehouse. Hightouch’s new identity resolution feature falls into this second category, allowing organizations to configure their own identity and entity resolution use cases within the data warehouse. These solutions offer some of the ease and user-friendliness of CDP-based identity solutions, while allowing for some of the in-depth configurability of a truly home-built entity resolution system.

At Brooklyn Data, we believe that warehouse-native solutions outperform black-box solutions found in some traditional packaged CDPs. In particular, these solutions offer:

Quicker time to market. Packaged CDPs can take months to implement depending on organizational complexity, as they require companies to onboard data into their systems and store it in rigid data models. Warehouse-native identity resolution solutions work immediately from the data that organizations are already storing.
Flexibility. Resolution in some packaged CDPs is limited to identities and relies on black-box systems that are not easily configured to different organizations’ use cases. Some CDPs are beginning to offer hybrid solutions that utilize the warehouse and offer configurability, but this is not the norm across the industry. Warehouse-native solutions allow organizations to configure different sets of rules for multiple identity and entity resolution processes.
Cost Efficiency. Packaged CDPs store data within their black-box systems, which themselves rely on other cloud data warehouses. They pass this cost on to their users. Warehouse-native identity resolution solutions don’t require duplicative storage and compute costs. If a company already has a data team and is managing data in its warehouse, it is more cost efficient to utilize that existing set of warehouse data models.

Given the clear benefits of warehouse-native buyable entity resolution solutions, in the next section, we’ll narrow our focus to discuss the merits of building your own entity resolution logic vs. buying a warehouse-native solution.

Considerations for Building vs. Buying Identity Resolution Solutions

There are several key dimensions to consider when deciding between building and buying warehouse-native identity and entity resolution solutions. To oversimplify: building requires more technical expertise but can deliver more precise results, while buying allows faster time to market and accessibility at the expense of complete configurability.

Flexibility and Completeness

Unsurprisingly, custom-built solutions offer organizations unparalleled flexibility to determine exactly how resolution should proceed. Flexibility varies amongst the warehouse-native identity resolution platforms. Many of these, such as Zingg, rely on probabilistic AI models, which can be powerful but are not fine-tuned for every use case and may become a black box. These models may have a few built-in features to help tune them further, such as weights assigned to different identifiers and controls for fuzzy or exact matches. Others, such as Hightouch, allow for fully configurable deterministic rules that give organizations more flexibility in determining resolution outputs.

Speed to Market

By opting for a standalone solution, you can quickly implement the tool, saving time and reducing development and maintenance overhead. This is especially beneficial when extensive logic refinement is not necessary, and an out-of-the-box solution can meet your requirements.

Ease

Standalone solutions typically offer intuitive user interfaces and simplified workflows. They are specifically designed to handle entity resolution tasks and often come with documentation and support resources to assist users in understanding and utilizing the tool effectively. While some technical knowledge may be required, these tools generally provide a straightforward setup and usage experience.

The simplicity of an in-house built solution can vary based on the implementation approach. If the in-house solution is built with simplicity in mind, it can provide a streamlined and user-friendly experience tailored to the specific needs of the organization. However, building an in-house solution from scratch may require significant technical expertise and development resources, potentially making it more complex to implement and maintain.

Data Governance

As long as your organization performs resolution within the warehouse, you should retain full control over your data. Warehouse-native and homebuilt solutions for identity resolution will generally provide equal levels of control for your data privacy and security. Data security only becomes an issue when a solution, such as a traditional packaged CDP, requires you to store your data in a third-party environment.

Cost

In the short term, it may be cheaper to buy an identity resolution solution, as this can help your organization quickly solve for identity resolution without investing meaningful hours with your developers or with outside consultants to build custom solutions. Over the long term, after initial investments are complete, home-built solutions may be more cost-effective, as most purchasable identity resolution solutions have recurring licensing or subscription fees.

Another factor to consider is compute costs in the warehouse. Entity resolution can be a computationally intensive process (since it essentially consists of a Cartesian join between records). Standalone solutions have already dealt with this issue, but home-built solutions must be mindful of efficient practices. Depending on the size of the data, compute costs can climb quickly if you’re not attentive to performance.
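A quick back-of-the-envelope calculation shows why this matters. The Python sketch below counts the record pairs that would be compared with and without blocking; the record count and the assumption that records spread evenly across blocks are hypothetical, and real data is usually more skewed.

```python
from collections import Counter
from math import comb


def pairwise_comparisons(n):
    """Unique record pairs when every record is compared to every other record."""
    return comb(n, 2)


def blocked_comparisons(block_keys):
    """Unique record pairs when only records sharing a block key are compared."""
    return sum(comb(size, 2) for size in Counter(block_keys).values())


# Hypothetical example: 100,000 records spread evenly over 100 blocks.
n_records = 100_000
keys = [i % 100 for i in range(n_records)]

full = pairwise_comparisons(n_records)  # 4,999,950,000 pairs
blocked = blocked_comparisons(keys)     # 49,950,000 pairs
```

Under these assumptions, blocking cuts the work by roughly a factor of 100. The unblocked count grows quadratically with the number of records, so the gap widens as data volume increases.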

Closing Thoughts

As more data becomes easily available in a centralized warehouse, more companies are faced with the need to identify different representations of the same entity. Entity and identity resolution are complex problems that can be approached from multiple angles. Choosing the right tool for the job largely depends on factors such as budget, the degree of customization the business needs, and the size and composition of the team involved. Brooklyn Data provides a straightforward approach to assessing and implementing Composable CDPs. We'll evaluate your current data infrastructure, identify gaps, and guide you through the transition process. Our focus is on practical solutions that make your data work harder for you. With Brooklyn Data, you get clear, actionable strategies for modern data management.