Building a data business in uncertain times

We don’t want to pretend that Syndetic can help save lives. It can’t. We know that everyone is struggling right now, from the stress of losing loved ones or losing their livelihood, from being isolated, or from working from home with a family. During this difficult time though, there may be space for companies to take stock of their business and see where a data strategy fits in. Often, data is one of the most overlooked assets a company has. It sits there, not reflected on the balance sheet, waiting for its value to be unlocked. We propose that companies think through the following questions when evaluating whether or not they have a high potential data business:

Is my data clean and easy to ingest?

Having a well-structured, clean dataset is table stakes for getting into the data business. If there are inconsistencies or messiness in the data, you are only putting work on your customers in order to make the data usable. And your first priority in making your data valuable is making it usable. How will your customers want to ingest the data? This may depend on your dataset and your target customer base. If your dataset is somewhat small (in the millions of rows or fewer) or if your customers will only be interested in a subset of the entire dataset, you will probably make it available via API. If your dataset is very large, or you are selling a dataset that is valuable only if the customer buys the whole thing, you will probably be delivering your data as a flat file via FTP or S3. You may do a combination of the two.

Is my data hard to get?

There is a lot of data out there, and the internet is big. If you’ve built a valuable dataset in the course of building some other business, and that data is proprietary to you, it is obviously more valuable than if it can just be scraped off the internet. Data buyers are looking to give their business an edge. If you are selling them data that everyone else will have, the edge goes away.

Is my data comprehensive?

The definition of comprehensive here varies depending on the dataset, of course. Some datasets are a mile wide and an inch deep and others are an inch wide and a mile deep. If you are selling data about fluctuating airfares, for example, it may be very important to your customer that you cover every airline in every country. Or it may be very important to them that you cover every airfare across distribution channels (on airline websites, travel aggregators, through corporate travel agents, etc.). Comprehensive does not necessarily mean that the data be “big.” So long as it is complete (not a lot of missing values) and is sufficient to solve your customer’s particular need, the data is comprehensive.

Is my data easy to join against other datasets?

If a customer is interested in your data, in all likelihood it is because they want to join your data against another dataset they already have. This could be open data, proprietary data, or data they’ve bought from another external seller. Your customer will not tell you what other datasets they intend to join your data against; they will simply tell you whether they are interested in buying your data and how much they’re willing to pay for it. However, it doesn’t take a ton of time to conduct a thought exercise: if you were a customer interested in your dataset, what data might you want to join it against? Perhaps you are a marketplace and you are thinking about selling some aggregated transaction data (buying trends by product category or geographic location, for example). One potential buyer for that data might be the brands of the products being sold. They might want to buy your data to understand what their most popular products are for inventory planning, or which competitors’ products sell better in certain places for their product development team. You have the transactional information, but they know much more about their own products than you do. So make it easy for them: if you have their item number in addition to your product ID, include it in the dataset. If you break down your geographies into esoteric categories (e.g. “Southern region”), don’t forget to include zip code or city, which is more likely to exist in other datasets.
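To see why the shared key matters, here is a minimal sketch in Python of the join a brand might run on its side. Everything in it is hypothetical: the field names (brand_item_number, item_number, zip_code), the sample rows, and the join_on_item_number helper are made up for illustration and do not describe any real customer’s pipeline.

```python
# Hypothetical marketplace rows: our internal product_id plus the brand's own item number.
marketplace_sales = [
    {"product_id": "P-1", "brand_item_number": "SKU-100", "zip_code": "94107", "units_sold": 40},
    {"product_id": "P-2", "brand_item_number": "SKU-200", "zip_code": "10001", "units_sold": 15},
    {"product_id": "P-3", "brand_item_number": "SKU-999", "zip_code": "60601", "units_sold": 7},
]

# Hypothetical brand catalog, keyed by the brand's own item number.
brand_catalog = [
    {"item_number": "SKU-100", "product_name": "Trail Shoe", "units_in_stock": 120},
    {"item_number": "SKU-200", "product_name": "Rain Jacket", "units_in_stock": 30},
]


def join_on_item_number(sales, catalog):
    """Inner join of sales rows to catalog rows on the brand's item number."""
    catalog_by_item = {row["item_number"]: row for row in catalog}
    joined = []
    for sale in sales:
        match = catalog_by_item.get(sale["brand_item_number"])
        if match is not None:
            # Merge the two records into one combined row.
            joined.append({**sale, **match})
    return joined


joined_rows = join_on_item_number(marketplace_sales, brand_catalog)
```

Without the brand_item_number column, the brand would have no reliable way to line up the marketplace’s product IDs with its own catalog, and the third row shows what happens when a key has no match: it simply drops out of the join.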

Is my data complete?

If I agree to buy your data, and your dataset has one million rows, I should expect there to be one million values for every field in the dataset. Unfortunately, this is very often not the case. It may be that it is simply very difficult to get a value for all million rows across all fields (the data is locked up inside another source, or was not entered into the system for some reason, or used to exist and disappeared). However, the more difficult it is, the more valuable your dataset becomes simply by being complete.
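One rough way to check completeness is to compute, for each field, the share of rows that actually have a value. The sketch below assumes rows are dictionaries and treats None and the empty string as missing; the coverage_rates name and the sample airfare rows are hypothetical.

```python
def coverage_rates(rows, fields):
    """Return the share of rows with a non-empty value, per field."""
    total = len(rows)
    rates = {}
    for field in fields:
        # Assumption: None and "" both count as a missing value.
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        rates[field] = filled / total if total else 0.0
    return rates


rows = [
    {"airline": "AA", "fare": "320.00", "channel": "web"},
    {"airline": "BA", "fare": "", "channel": "agent"},
    {"airline": "LH", "fare": "410.50", "channel": None},
    {"airline": "QF", "fare": "295.00", "channel": "web"},
]
rates = coverage_rates(rows, ["airline", "fare", "channel"])
```

A field with a rate of 1.0 is complete; anything lower tells a buyer how many rows they should expect to be missing that value.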

Is my dataset easy to explain to another person?

Here at Syndetic we are obsessed with two questions:

1. How do you convey the meaning of a dataset?
2. How do you convey the value of a dataset?

We felt from the start that data dictionaries are not sufficient to fully convey either. Twenty years ago, a database would spit out a list of table names, field names and types, and that was a data dictionary. But somewhere along the way the phrase data dictionary got co-opted to mean the artifact that is exchanged between companies and meant to comprehensively explain and market a dataset. This is somewhat absurd. No one would expect to sell a dress simply by providing its measurements. And unlike dresses, you can’t take great pictures of your data, from all different angles.

So we are left with approximations: What does the dataset cover? How complete is it? Where did it come from? How many records does it contain? How far back in time does it go? What can I expect from it tomorrow? What should I use it for? Was it sourced legally? Ethically? How stable is it? How relevant is it to me and my other datasets?

Syndetic helps answer these questions by automatically summarizing the dataset using statistics (cardinality, coverage rates, min-max-average, and more) and doing so on a continuous basis. This produces a living document that, unlike a static data dictionary in the form of a spreadsheet or PDF, never gets out of date. For the more qualitative questions that can only be answered by a person, we give you the tools you need to annotate and describe your dataset however you’d like. In this way you can comprehensively convey the data’s meaning, and therefore its value.
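To make those statistics concrete, here is a minimal Python sketch of a per-field summary covering coverage, cardinality, and min-max-average. It is an illustration under our own assumptions, not Syndetic’s actual implementation: the summarize_field name is made up, missing values are assumed to be None or empty strings, and numeric statistics are only reported when every present value parses as a number.

```python
from statistics import mean


def summarize_field(values):
    """Summarize one column: coverage, cardinality, and numeric min/max/average."""
    present = [v for v in values if v not in (None, "")]
    summary = {
        "count": len(values),
        "coverage": len(present) / len(values) if values else 0.0,
        "cardinality": len(set(present)),  # number of distinct non-empty values
    }
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            # One non-numeric value means we treat the whole field as text.
            numeric = []
            break
    if numeric:
        summary.update(min=min(numeric), max=max(numeric), average=mean(numeric))
    return summary


fare_summary = summarize_field(["320.00", "", "410.50", "295.00", "320.00"])
```

Re-running a summary like this every time the data changes is what keeps the description from going stale the way a hand-edited spreadsheet does.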

Why does everyone hate their data dictionary?

As cofounder of Syndetic, I’ve talked to a lot of people about their data dictionaries. At this point, probably dozens of people, ranging from data governance managers at large enterprises to founders of early stage tech companies. And every single one of the lot hates their data dictionary. When I say hates, I mean that they say something like “Ugh. I won’t even show it to you. It’s an embarrassment.”

Why does everyone hate their data dictionary? A sort of meta-spreadsheet, a data dictionary on its face sounds like a relatively simple thing. It is a document describing the meaning of a dataset. Typically this includes field names and types (e.g. string, text, varchar) and maybe some annotations that describe the lineage of the data (where did it come from) and the business definition. But as with many workflows that are captured in spreadsheets, things can go awry very quickly.

  1. They are difficult to maintain.

The first person to create a data dictionary for their company usually has great intentions. They may be the first data scientist hired there, or the first data governance professional. They are diligent and organized, and dedicated to the mission of ferreting out every last bit of information about their information. They meticulously craft a spreadsheet (or Google Sheet) that contains the best information available to them at the time. They double and triple-check it for accuracy. But then, of course, things go off the rails.

Maintaining a data dictionary is not a full time job. And so, the person who created it cannot be expected to be thinking about this document at all times. They go back to doing their day job, and bit by bit, changes start to happen. Engineers change schemas without thinking to alert the person who created or keeps the data dictionary. Data salespeople call their prospects and walk them through the fields of the dataset, but realize that the annotations don’t quite make sense for the use case of their prospect. So they make a copy of the spreadsheet, change the annotation, and send it out. Product teams buy a system to capture data that used to be captured manually. And the dictionary very quickly gets out of date, often before anyone even realizes.
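A simple check can catch this kind of drift before a customer does. The sketch below, in Python, compares the field names listed in a dictionary against the header row of the latest CSV extract. The find_schema_drift name, the field names, and the sample extract are all hypothetical.

```python
import csv
import io


def find_schema_drift(documented_fields, csv_text):
    """Compare documented field names with the header row of a CSV extract."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # first row is assumed to hold the column names
    documented = set(documented_fields)
    actual = set(header)
    return {
        "missing_from_dictionary": sorted(actual - documented),
        "missing_from_data": sorted(documented - actual),
    }


documented_fields = ["customer_id", "signup_date", "region"]
extract = "customer_id,signup_date,zip_code\n1,2020-04-01,94107\n"
drift = find_schema_drift(documented_fields, extract)
```

Here an engineer has replaced region with zip_code, and the check reports both sides of the mismatch: a column the dictionary does not describe, and a documented field that no longer exists in the data.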

It is dangerous to have any company asset that is so dependent on one person in the organization, in this case the creator of the dictionary. If that person leaves the organization, all history of the document often goes with them, and a new person in that role may be tempted to just scrap it and start over. But then the problem repeats.

  2. They are necessary, but not sufficient, to fully explain the data.

Oftentimes, datasets are shared with business analysts or other non-technical people inside organizations who are tasked with assessing whether the data they are being provided is useful. For these people, receiving a data dictionary containing a bunch of field names and types is the standard. But it doesn’t really help them make the assessment they need to make: whether the data is of high quality, whether it is of better quality than can be purchased elsewhere from a different provider, and whether it has improved or decreased in quality over time. For these people, a data dictionary is often filed away and they go about other means to try to assess the data’s quality. Can you send me a sample of 100 rows? What are the coverage rates for each field? What are the most common values that I’m likely to see?

If it all looks good, and they start receiving the data, often they will move on to the next data provider and next assessment. Rarely do teams have the resources to monitor their incoming data files on an ongoing basis for anomalies, like a sudden increase in null values in a particular field. Even more rarely do teams conduct regular data assessments to ask for new sample sets or statistics on the data. They simply move on.
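Monitoring for a sudden jump in null values does not have to be elaborate. As a rough sketch, the Python below compares the null rate of each field between the previous delivery and the current one and flags any field whose rate rose by more than a threshold. The function names, the 10-point threshold, and the sample rows are assumptions chosen for illustration.

```python
def null_rate(rows, field):
    """Share of rows where the field is missing (None or empty string)."""
    if not rows:
        return 0.0
    empty = sum(1 for row in rows if row.get(field) in (None, ""))
    return empty / len(rows)


def flag_null_spikes(previous_rows, current_rows, fields, max_increase=0.10):
    """Return fields whose null rate rose by more than max_increase."""
    flagged = {}
    for field in fields:
        before = null_rate(previous_rows, field)
        after = null_rate(current_rows, field)
        if after - before > max_increase:
            flagged[field] = (before, after)
    return flagged


previous = [
    {"price": "10", "city": "Austin"},
    {"price": "12", "city": "Boston"},
    {"price": "11", "city": ""},
    {"price": "9", "city": "Denver"},
]
current = [
    {"price": "", "city": "Austin"},
    {"price": "", "city": "Boston"},
    {"price": "11", "city": ""},
    {"price": "9", "city": "Denver"},
]
spikes = flag_null_spikes(previous, current, ["price", "city"])
```

In this example the city field was already a quarter empty and stays that way, so it is not flagged, while price goes from fully populated to half empty and is.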

Data as a product is very different than an application or a service because its value is dependent on many other variables besides the data being good or not. For example, the usability of the data is extremely important. You can have the most complete dataset in the world on, say, university rankings, but if the data is not usable, it is worthless. By usable, I really mean that it can be easily joined with other datasets. And that’s because people in the market to buy data on university rankings aren’t just curious whether Stanford is ranked #1 again this year. They want to answer questions that require the data to be joined with data on, say, student populations, geography, or fundraising. Rarely can a dataset be so valuable in isolation. Data providers should understand this, and work as hard as possible to make their data easily combined with other datasets.

Another area where data as a product is special is that it is (usually) a collection of facts. However the data was collected, if a dataset contains information about property transactions, there is an objective truth to the amount of those transactions. A prospective buyer of that dataset is primarily concerned with whether the data is actually accurate or not. If it’s not, it’s not only worthless, but also potentially very damaging to that company’s business, as decisions will be made (such as pricing) in reliance on that data. It is in every data provider’s interest to invest as many resources as possible in the accuracy of its data, with rigorous testing and monitoring.

  3. They’re ugly.

The standard in data dictionaries is the good old Excel spreadsheet, closely followed by a Word document that has been saved as a PDF. It’s curious to me that for all of the time and money that companies spend on product marketing, they do not do a good job of marketing their actual product, which is the data itself. Software companies often pride themselves on design and on making their application as user-friendly and intuitive as possible. But when they receive an inquiry about their data, they send over a spreadsheet. Surely there is a better way.

  4. They cause confusion within the organization.

As with any workflow trapped in a spreadsheet, users of the spreadsheet often don’t know if they can trust it. Before sending out the dictionary to an important prospect, a salesperson may look it over and ask a few people in engineering or product if it’s still accurate. They are unlikely to know. If a current customer has a problem with the data, like if the file breaks, and they call up the support team at the data provider, the support team is going to check the actual file that was sent to the customer. They are not going to check the data dictionary. And so you have a reference document that is not really reliable, which sows confusion among many teams that need to work closely together to support their product. Confusion = time wasted that could be spent on other, more valuable things.

Hate your data dictionary? Drop us a line at inquiries@getsyndetic.com.

Getting started

After you create your Syndetic account, we want to make it as easy as possible for you to get started creating shareable data dictionaries. Once your account is created, you are brought to this screen:

Welcome screen

There are two workflows: one for users who already maintain a data dictionary and would like to import it into Syndetic, and one for users who are starting from scratch and want to get started sharing information about their datasets.

At Syndetic we think of everything in terms of datasets. A dataset is the slice of data that you want to explain to another person. These may be different bundles of data that you sell, or they may be different categories of data that you work with for internal purposes. For example, let’s say that your company sells data on financial institutions. You may have one dataset related to asset managers, another related to prime brokers, and a third related to stock exchanges. Each of these datasets needs to be individually packaged into a product, which means it needs to be explained and marketed. Collectively, all of the explanations of all of the datasets together comprise your data dictionary.

If you already have a data dictionary, simply click on the Upload your data dictionary button and send it to us. We’ll take your dictionary in whatever format it lives now – Excel spreadsheet, Google sheet, Word document, PDF – and break it down into its component datasets and load them into Syndetic for you. We turn around dictionaries in 1-2 business days. Once loaded, we’ll send you an email and explain how to manage and start sharing your datasets.

We do the work so you don’t have to!

If you don’t have a data dictionary, you hate the one you have (most people do!), or you want to start fresh, click Create a dataset.

So fresh and so clean

Now you’ll be prompted to give a name and description to your dataset. Remember, this is to identify a slice of data that you want to share with another person. Use a name and description that you think will be most helpful in explaining the dataset to someone who is not intimately familiar with your database. You can always change it later.

I love describing data over coffee.

Once you’ve described your dataset, you’ll be brought to the screen to upload your data extracts. A data extract is the data itself; it is required in order to get the automatically generated statistics (like coverage rates, top values, and character ranges) and automatic samples. You can use Syndetic without loading a data extract, but it is not nearly as valuable. As we like to say around here, statistics are worth 1,000 spreadsheets! We want to make it as easy as possible for the recipient of your dictionary to get a sense of the shape of the data you are sharing. These simple statistics (along with a small sample set) are the best way to convey the meaning and value of your data.
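For a sense of what top values and character ranges look like, here is a small Python sketch. It is not how Syndetic computes them. We assume here that a character range means the shortest and longest value length in a field, and the top_values and length_range names and the sample data are made up for the example.

```python
from collections import Counter


def top_values(values, n=3):
    """Most common non-empty values with their counts."""
    present = [v for v in values if v not in (None, "")]
    return Counter(present).most_common(n)


def length_range(values):
    """Shortest and longest string length among non-empty values."""
    lengths = [len(v) for v in values if v not in (None, "")]
    return (min(lengths), max(lengths)) if lengths else None


states = ["CA", "NY", "CA", "", "TX", "CA", "NY", "Texas"]
most_common_states = top_values(states, n=2)
state_lengths = length_range(states)
```

Even on this tiny sample, the statistics surface something a list of field names never would: a state field that is mostly two-letter codes but also contains a spelled-out “Texas”.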

So click on Upload data extract and you will be brought here.

The self-serve version of Syndetic accepts CSV-formatted files with headers in the first row, as well as files hosted on the internet.
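If you want to sanity-check a file before uploading it, a quick look at the first row goes a long way. The Python sketch below is one possible pre-flight check, not part of Syndetic: the check_header_row name and the specific rules (no blank column names, no duplicates) are our own assumptions about what a usable header row looks like.

```python
import csv
import io


def check_header_row(csv_text):
    """Return a list of problems found in the first row of a CSV file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if not header:
        return ["file is empty"]
    problems = []
    names = [name.strip() for name in header]
    if any(name == "" for name in names):
        problems.append("blank column name")
    if len(set(names)) != len(names):
        problems.append("duplicate column name")
    return problems


good = check_header_row("id,name,city\n1,Ann,Austin\n")
bad = check_header_row("id,,id\n1,2,3\n")
```

An empty list means the header row passed both checks; anything else is worth fixing in the source file first.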

If you have an Excel spreadsheet, you can click File > Save As and select CSV from the drop-down menu to re-save your file in CSV format. Note that a CSV file can hold only one sheet, so your spreadsheet will need to have only one tab.

Excel-lent

The Enterprise version of Syndetic allows you to hook up your database directly so that you can pull extracts as often as you’d like, and get automatically refreshed statistics and samples. Contact us for more info. We can connect to any database that speaks ODBC/JDBC, the standard for database connections. Even with the Standard version of Syndetic, every time you upload a new data extract, the stats and samples will automatically refresh.

Once you’ve loaded your dataset, you can get started on the fun part – annotating your fields. We think this part is also the most important, because it’s where you convey the meaning behind the data. Where did this data set come from? What does it mean to your customer? How do you want to market this dataset? Maybe that’s customer-specific, because different customers may use the same dataset for entirely different purposes. Knowing the intended use case of your customer and describing the dataset for that use case is key. To start annotating, click on Manage fields.

When you click on a field, you are shown the common values and statistics on that field, to help with your annotating. When you click Edit for that field, you are taken to this page.

Describe your field just how you’d like

Because these annotations will appear in your final, published dictionary, think about how you want to describe this field to your customer.

  • Display name is useful if your data contains hard-to-read or especially long field names that you can’t change, e.g. sales_id_quart_20170412
  • Description is meant for describing the field from a business perspective to your customer. Keep it relatively short so that it fits neatly in your dictionary summary table on the published page.
  • Lineage is meant for describing where the data comes from – is it system generated from your app? Was it entered manually by a field sales rep? Does it come from a database you integrated into your tech stack years ago from a company you merged with? Data lineage is important for understanding the history behind this data and why it is what it is.
  • Notes are meant for any extra information about this field that may be helpful to your customer in understanding this field in particular. It’s almost like an introduction to the field, and will show up in bright blue in the published page.
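If it helps to think of an annotation as a record, the four parts above map naturally onto a small data structure. The Python sketch below is only a way to picture it: the FieldAnnotation class, its label method, and the example description and lineage text are hypothetical, not Syndetic’s data model.

```python
from dataclasses import dataclass


@dataclass
class FieldAnnotation:
    """One annotated field, mirroring the four parts described above."""
    field_name: str          # the raw column name in the data
    display_name: str = ""   # friendlier name for hard-to-read columns
    description: str = ""    # short business meaning
    lineage: str = ""        # where the data comes from
    notes: str = ""          # any extra context for the reader

    def label(self):
        """Use the display name when one is set, else the raw field name."""
        return self.display_name or self.field_name


annotation = FieldAnnotation(
    field_name="sales_id_quart_20170412",
    display_name="Quarterly sales ID",
    description="Identifier for a sale within the quarter.",  # illustrative text
    lineage="System generated by the application.",           # illustrative text
)
```

Only the raw field name is required in this sketch; everything else is optional, which matches the idea that you can start with the data and fill in the annotations as you go.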

For every field you have the option to publish or hide the statistics and values for that field. You also have the option to hide that field from the sample sets that are auto-generated by Syndetic. This may be useful for fields that contain particularly sensitive information. If you choose to hide a field from the sample sets by unchecking that checkbox, on the published page it will look like this:

From the main datasets page, you can click on Preview at any time to view your dictionary live and see how it will appear to other people you share the link with. This should help with your annotations along the way.

Any changes you make to your dictionary using the management layer are immediately reflected on the published page. It’s as fast as just hitting refresh.

Your organization

We want annotating to be collaborative, which is one of the many benefits of using Syndetic over a spreadsheet. You may want to involve data engineers, customer-facing relationship managers, or salespeople in your annotation process. To add users to your organization, click on Add a User.

Teamwork makes the dream work

When you add a user, you set their email and password, and an email will be sent to their address notifying them that they’ve been added to your organization. You can set permissions for each user that determine their capabilities:

  • Allow this user to manage other users means this user can create other users for your organization
  • Allow this user to create and edit objects means this user can add datasets, annotate fields, and edit content.

By default anyone who signs up from the Syndetic home page creates their own organization, so if you want to sign up your whole team, only one person should create an account from the home page, and they should invite the other users from their team. For enterprise packages, contact us.

From your organization’s homepage, you can also Add a Logo which will be included on all of your published dictionaries, and manage your billing. Make sure to add a logo so you get a snappy published dictionary page like this!

A logo makes your page look nice

Introducing Syndetic

At Syndetic we make software for data companies. Syndetic literally means connective. It comes from the Greek word syndetikos, which means to bind together. We chose this name because data is connective. While it has become a cliché that data is the new oil, we actually see data as the connective tissue that binds companies to each other. They exchange it in the course of transacting, they ingest it to power their businesses, and they sell it (or give it away) to add value to the greater ecosystem. We’re starting today with tools for companies that sell data, which are often misunderstood. Data-as-a-service is relatively new, but more and more companies are offering a data product alongside their core business because they have, in the course of building that business, built a valuable dataset as well.

Companies that sell data need to do two things in addition to building their dataset:

  1. Convey the meaning of their data
  2. Convey its value

Achieving these two things is surprisingly difficult. Why? For one thing, there is a lack of tools in the market specifically designed for DaaS, which means there are a lot of hacked together solutions out there. For another, data is by its nature fluid. Building software for a thing that is constantly changing is difficult. Lastly, there is increased competition in the data marketplace, as more and more data companies are founded in every vertical, and more incumbents launch data products. 

So when a salesperson at a data company calls up a prospect and says “I have some really valuable data I’d like to sell you,” the first thing the prospect is likely to ask is “Okay, what kind of data is it? And how do I know that it’s valuable to me?” And then the salesperson will say, “Let me send you our data dictionary, which explains our data schema, and a sample of the data so you can see what you think.”

Today, data dictionaries are almost always spreadsheets. Some companies keep a spreadsheet in a folder in a shared drive, some use a Google sheet, and some use a Word document that consists of a list of field names, types, and their business meaning. Who manages that spreadsheet often depends on the size of the company and the structure of their data organization – at a tech company or a startup it might be a data scientist, but at a large enterprise it might be a data steward or person responsible for a data governance program. Some companies have lots of process around the spreadsheet – there is an owner who is in charge of the whole spreadsheet, or maybe certain tabs or fields. If someone else who uses the spreadsheet needs to make an update to it, they will send a request to the owner, or submit a ticket through a project management tool. There is some approval process to make the change. Only certain people are allowed to send the spreadsheet out to prospects or current customers. When a field name or definition changes, chaos reigns. Who is in charge of updating the spreadsheet? Who is responsible for letting the customer know so their data pipeline doesn’t break?

This is why we are introducing Syndetic: a web platform for conveying the meaning and value of your datasets. We are purpose-built for DaaS companies, so you don’t need to hack another tool to work for data. We know that it is fluid. We know that what engineering does affects the business side, and vice versa. We know that depending on the use case, different fields may mean different things to different people. 

Go to www.getsyndetic.com to get started – upload your current data dictionary, or create one from scratch.

Allison and Steve, Cofounders