We don’t want to pretend that Syndetic can help save lives. It can’t. We know that everyone is struggling right now, from the stress of losing loved ones or losing their livelihood, from being isolated, or from working from home with a family. During this difficult time, though, there may be space for companies to take stock of their business and see where a data strategy fits in. Data is often one of the most overlooked assets a company has. It sits there, not reflected on the balance sheet, waiting for its value to be unlocked. We propose that companies think through the following questions when evaluating whether they have a high-potential data business:
Is my data clean and easy to ingest?
Having a well-structured, clean dataset is table stakes for getting into the data business. If there are inconsistencies or messiness in the data, you are pushing the work of making the data useable onto your customers. And your first priority in making your data valuable is making it useable. How will your customers want to ingest the data? This may depend on your dataset and your target customer base. If your dataset is somewhat small (in the millions of rows or fewer) or if your customers will only be interested in a subset of the entire dataset, you will probably make it available via API. If your dataset is very large, or you are selling a dataset that is valuable only when bought in its entirety, you will probably be delivering your data as a flat file via FTP or S3. You may do a combination of the two.
Is my data hard to get?
There is a lot of data out there, and the internet is big. If you’ve built a valuable dataset in the course of building some other business, and that data is proprietary to you, it is obviously more valuable than if it can just be scraped off the internet. Data buyers are looking to give their business an edge. If you are selling them data that everyone else will have, the edge goes away.
Is my data comprehensive?
The definition of comprehensive here varies depending on the dataset, of course. Some datasets are a mile wide and an inch deep and others are an inch wide and a mile deep. If you are selling data about fluctuating airfares, for example, it may be very important to your customer that you cover every airline in every country. Or it may be very important to them that you cover every airfare across distribution channels (on airline websites, travel aggregators, through corporate travel agents, etc.). Comprehensive does not necessarily mean that the data is “big.” So long as it is complete (not a lot of missing values) and is sufficient to solve your customer’s particular need, the data is comprehensive.
Is my data easy to join against other datasets?
If a customer is interested in your data, in all likelihood it is because they want to join your data against another dataset they already have. This could be open data, proprietary data, or data they’ve bought from another external seller. Your customer will not tell you what other datasets they intend to join your data against; they will simply tell you whether they are interested in buying your data and how much they’re willing to pay for it. However, it doesn’t take much time to run a thought exercise: if you were a customer interested in your dataset, what data might you want to join it against? Perhaps you are a marketplace and you are thinking about selling some aggregated transaction data (buying trends by product category or geographic location, for example). One potential buyer for that data might be the brands of the products being sold. They might want to buy your data to understand which of their products are most popular for inventory planning, or which competitors’ products sell better in certain places for their product development team. You have the transactional information, but they know much more about their own products than you do. So make it easy for them: if you have their item number in addition to your product ID, include it in the dataset. If you break down your geographies into esoteric categories (e.g. “Southern region”), don’t forget to include zip code or city, which are more likely to exist in other datasets.
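To make the join point concrete, here is a minimal Python sketch of what a brand might do with marketplace data that includes its own item number. Everything in it is hypothetical and for illustration only: the field names (`product_id`, `brand_item_number`, `zip_code`, `units_sold`), the sample rows, and the `join_on_item_number` helper are assumptions, not a real schema.

```python
# Hypothetical marketplace rows: the seller's own product ID plus the brand's item number.
marketplace_sales = [
    {"product_id": "P-1", "brand_item_number": "SKU-100", "zip_code": "10001", "units_sold": 40},
    {"product_id": "P-2", "brand_item_number": "SKU-200", "zip_code": "94105", "units_sold": 15},
    {"product_id": "P-3", "brand_item_number": None, "zip_code": "60601", "units_sold": 7},
]

# Hypothetical brand-side catalog, keyed by the brand's own item number.
brand_catalog = {
    "SKU-100": {"name": "Trail Shoe", "unit_cost": 32.0},
    "SKU-200": {"name": "Rain Jacket", "unit_cost": 55.0},
}


def join_on_item_number(sales, catalog):
    """Inner-join sales rows to catalog entries on the shared item number."""
    joined = []
    for row in sales:
        match = catalog.get(row["brand_item_number"])
        if match is not None:
            # Merge the marketplace row with the brand's own product details.
            joined.append({**row, **match})
    return joined


joined_rows = join_on_item_number(marketplace_sales, brand_catalog)

# Share of marketplace rows the brand could actually match to its catalog.
match_rate = len(joined_rows) / len(marketplace_sales)
```

The row without a brand item number drops out of the join entirely. A low match rate like this is a rough signal that the dataset is missing the keys a buyer needs.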
Is my data complete?
If I agree to buy your data, and your dataset has one million rows, I should expect there to be one million values for every field in the dataset. Unfortunately, this is very often not the case. It may be that it is simply very difficult to get a value for all million rows across all fields (the data is locked up inside another source, or was not entered into the system for some reason, or used to exist and disappeared). However, the more difficult it is, the more valuable your dataset becomes simply by being complete.
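One rough way to see how far a dataset is from this ideal is to compute, for each field, the share of rows that actually have a value. The sketch below assumes rows are plain Python dictionaries and treats `None`, an empty string, or an absent key as missing. The `fill_rates` helper and the sample airfare rows are hypothetical.

```python
def fill_rates(rows, fields):
    """Return the share of rows with a non-missing value for each field."""
    total = len(rows)
    rates = {}
    for field in fields:
        # A value counts as missing if the key is absent, None, or an empty string.
        present = sum(1 for row in rows if row.get(field) not in (None, ""))
        rates[field] = present / total if total else 0.0
    return rates


# Hypothetical sample rows with a few gaps.
sample_rows = [
    {"airline": "AA", "fare": 320.0, "channel": "airline website"},
    {"airline": "BA", "fare": None, "channel": "travel aggregator"},
    {"airline": "DL", "fare": 410.0, "channel": ""},
    {"airline": "LH", "fare": 275.0},
]

rates = fill_rates(sample_rows, ["airline", "fare", "channel"])
```

A field with a rate of 1.0 has a value in every row; anything lower tells a buyer how many of the rows they are paying for will actually be usable for that field.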
Is my dataset easy to explain to another person?
Here at Syndetic, we are obsessed with two questions:
1. How do you convey the meaning of a dataset?
2. How do you convey the value of a dataset?
We felt from the start that data dictionaries are not sufficient to fully convey either. Twenty years ago, a database would spit out a list of table names, field names and types, and that was a data dictionary. But somewhere along the way the phrase “data dictionary” got co-opted to mean the artifact that is exchanged between companies and meant to comprehensively explain and market a dataset. This is somewhat absurd. No one would expect to sell a dress simply by providing its measurements. And unlike dresses, you can’t take great pictures of your data from all different angles.
So we are left with approximations: What does the dataset cover? How complete is it? Where did it come from? How many records does it contain? How far back in time does it go? What can I expect from it tomorrow? What should I use it for? Was it sourced legally? Ethically? How stable is it? How relevant is it to me and my other datasets?
Syndetic helps answer these questions by automatically summarizing the dataset using statistics (cardinality, coverage rates, min-max-average, and more) and doing so on a continuous basis. This produces a living document that, unlike a static data dictionary in the form of a spreadsheet or PDF, never gets out of date. For the more qualitative questions that can only be answered by a person, we give you the tools you need to annotate and describe your dataset however you’d like. In this way you can comprehensively convey the data’s meaning, and therefore its value.
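To give a feel for the kind of statistics mentioned above, here is a minimal Python sketch that summarizes a single field by coverage rate, cardinality, and (for numeric fields) min, max, and average. It is an illustration under stated assumptions, not Syndetic’s implementation: the `profile_field` helper, the dictionary-based rows, and the sample data are all hypothetical.

```python
from statistics import mean


def profile_field(rows, field):
    """Summarize one field: coverage, cardinality, and numeric min/max/average."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        # Share of rows that have a value for this field.
        "coverage": len(present) / len(values) if values else 0.0,
        # Number of distinct non-missing values.
        "cardinality": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Only report min/max/average when every present value is numeric.
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric), average=mean(numeric))
    return summary


# Hypothetical sample rows.
rows = [
    {"category": "shoes", "price": 40.0},
    {"category": "shoes", "price": 60.0},
    {"category": "jackets", "price": None},
    {"category": None, "price": 80.0},
]

price_profile = profile_field(rows, "price")
category_profile = profile_field(rows, "category")
```

Re-running a summary like this each time the data changes is what keeps the description current, in contrast to a spreadsheet or PDF that is written once and then drifts.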