Getting Started

1.  Getting data into Syndetic

The first step in building your data shop is to get the data to us. By connecting to your data, we automatically generate your shop and keep it up to date so your customers never get stale data. You have a few options:

  • S3
    • We will create an S3 bucket for you and send you credentials to push files to us. If you are creating multiple data products, you can create a subfolder for each product. As data updates, you should push revisions to us. It is important to send us complete updated revisions of any files rather than diff files.
  • Upload
    • If you prefer a more manual approach or don’t have that many files, you can simply upload files to Syndetic as you build out your shop.
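For the S3 path, pushing a complete revision can be sketched as below. This is illustrative only: the bucket name, the one-subfolder-per-product key layout, and the helper names are assumptions, not part of Syndetic's API.

```python
def revision_key(product, filename):
    # One subfolder per data product, as suggested above (assumed convention).
    return f"{product}/{filename}"

def push_revision(bucket, product, path):
    # Upload the complete updated file, never a diff, so the shop always
    # reflects a full, current revision of the data.
    import boto3  # uses the credentials Syndetic sends you
    s3 = boto3.client("s3")
    filename = path.rsplit("/", 1)[-1]
    s3.upload_file(path, bucket, revision_key(product, filename))
```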

2. Creating datasets

A dataset is equivalent to a data product. This can be a table in a database or a file that you wish to sell. Your customers will be able to drill down into the data in two different ways: by selecting different packages, which slice the file vertically (by the fields included), or by engaging different filters, which slice the data horizontally (by the rows included).

  • Creating packages

A package is a data offering that you create from a dataset. For example, if I have a dataset about companies that contains 100 fields covering all types of metadata on those companies – information about their financials, their employees, and their products – I may choose to make three packages out of a single dataset: one package that includes fields 1-20, which all relate to financials, one that includes fields 21-30, which cover employees, and one that includes fields 31-100, which relate to products. The important thing to note is that every package must rely on the same primary ID field (e.g., company ID) in order to create different packages from the same dataset.

Note: It is ok to have only one package, where you are effectively making the entire dataset available at once.

When you create a package, you will be asked to describe the package (e.g. “Financial information on companies, including market cap, revenue, and PE ratio”), set the ID field, and configure which fields you would like to set as filters. Filters allow your customers to slice the data so that they can purchase only data on companies with more than 1,000 employees, for example. Or only technology companies. You get the idea.
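As a toy illustration of the two kinds of slicing (all field names and values here are invented), a package is a vertical slice that always keeps the primary ID field, while a filter is a horizontal slice over rows:

```python
# Invented sample data keyed on a shared primary ID, "company_id".
rows = [
    {"company_id": 1, "market_cap": 50, "employees": 1200, "sector": "tech"},
    {"company_id": 2, "market_cap": 10, "employees": 300,  "sector": "retail"},
]

def package(rows, fields):
    # Vertical slice: keep only the chosen fields, always including the ID
    # so different packages from the same dataset stay joinable.
    return [{f: r[f] for f in fields} for r in rows]

def apply_filter(rows, predicate):
    # Horizontal slice: keep only the rows the customer wants to buy.
    return [r for r in rows if predicate(r)]

financials = package(rows, ["company_id", "market_cap"])
big_tech = apply_filter(rows, lambda r: r["employees"] > 1000 and r["sector"] == "tech")
```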

  • Pricing

Syndetic allows you to set a price per row for every package that you create. Think of your shop as an ecommerce site, and price your data as you would any other type of product. You can update pricing at any time, but you cannot offer different prices to different customers. You can instead offer customers a discount code, which gives them a discount off the list price on your shop. They can use this code at checkout.
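The pricing model above reduces to simple arithmetic. The function below is a sketch of that arithmetic, not Syndetic's actual checkout logic; the percentage-off discount mechanics are an assumption for illustration.

```python
def order_total(row_count, price_per_row, discount_pct=0):
    # List price is rows times the per-row price; a discount code entered
    # at checkout takes a percentage off that list price.
    list_price = row_count * price_per_row
    return list_price * (1 - discount_pct / 100)
```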

  • Blurring   

Blurring controls whether search results are blurred out for your customers instead of showing them actual results from your dataset. In general we recommend not blurring results, as this makes a customer less likely to purchase, except in a few cases:
 – If the dataset is very small (say under 100 rows), and search results are likely to return just a few rows
 – If your pricing per row is very high, i.e. each row of data is extremely valuable, you may not want customers to see even a few rows of data
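The two exceptions above amount to a simple heuristic. This sketch encodes it with illustrative thresholds; the 100-row cutoff comes from the text, while the per-row price threshold is an invented example, not a Syndetic setting.

```python
def should_blur(dataset_rows, price_per_row, high_value_threshold=1.00):
    # Blur when the dataset is tiny (a search could leak most of it) or
    # when each row is so valuable that even a few free rows are too many.
    small = dataset_rows < 100
    pricey = price_per_row >= high_value_threshold
    return small or pricey
```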

  • Shop settings

The shop settings page is where you control the look and feel of your shop. Add your company logo, set the color scheme, and add your company name.

The custom domain is where your shop lives. You can link to your data shop from your company’s website, share it with customers directly as part of your sales process, or use it yourself on sales calls by sharing your screen as you walk your customers through your data offerings. Write to if you are having trouble wiring up your data shop to your company’s website.

Note that in order to make your shop visible to others, it must be made Live. By default, when you are building out your shop it is in editing mode. When you are ready to make your shop live, click the toggle.

  • Home page

If you have multiple datasets, your home page is where you will drive your customers. You can add a title, subtitle, and hero graphic to your home page.

  • Managing your customers

To see a list of customers who have purchased data from your shop and learn more about their purchases, click on the Customers link on the left panel. Here, you can see customer names, contact information, their searches, and the amount sold.

If you have any questions, write to us at

Data vendor tear sheets and why they matter

In our last post, we explored how the meaning and importance of data dictionaries have changed over time. 20 years ago, a data dictionary referred to a list of field names and types spit out by a database. Now, a data dictionary serves as a sort of spec sheet for a dataset: it must be both accurate and attractive, putting the dataset in its best light so potential buyers understand what the data means and also why it’s valuable.

Let’s move one level up to the so-called vendor tear sheet. A tear sheet is a one-page summary document describing a data vendor and the dataset offering. It too must serve several functions – to describe the vendor at a high level, such as legal company name and contact information, but also to describe the data product as a whole – what type of data it is, whether it’s a time-series dataset, how often the data is updated, how the data is delivered, what the legal and contractual requirements are when buying the dataset, whether it contains any PII, and so on. At Syndetic we’ve read hundreds of vendor tear sheets, spoken with dozens of data buyers, and come to the conclusion that, just as with data dictionaries, it’s time for tear sheets to be reformed.

Recently a group of researchers in the machine learning community published a paper called Datasheets for Datasets, proposing a standardized process for documenting datasets. They cite inspiration from the electronics industry, in which electronic parts “no matter how simple or complex” are accompanied by a datasheet outlining their characteristics, recommended use cases, test results, and so on. The authors propose an analogous standard in machine learning, where every dataset is accompanied by a datasheet outlining its motivation, composition, collection process, and recommended uses.

We’d like to expand this recommendation beyond the machine learning community to argue that any dataset exchanged between two companies should be accompanied by a standard datasheet, or tear sheet. Both sides of the market benefit. For data providers, completing the tear sheet encourages reflection on how their data is sourced, delivered, maintained, and used. Often a tear sheet also acts as an SLA between buyer and seller, with the data provider promising to notify their customer if, for example, the data is updated less frequently than stated in the tear sheet. The burden to monitor conformance to tear sheets today still falls predominantly upon the buyer. For data buyers, a standard tear sheet makes it easier to compare data vendors and datasets with each other. It encourages accountability and transparency within the industry. It also gives the buyer something to refer to if the data doesn’t seem to conform to the specs outlined in the tear sheet after purchase. In this way, it acts as another contract between data buyer and seller – not legally binding like the data licensing agreement signed by the parties, but equally important to the success of the ongoing relationship.
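A buyer-side conformance check of the kind described above can be as simple as comparing delivery dates against the update frequency stated in the tear sheet. This is a minimal sketch; the function name, parameters, and slack allowance are invented for illustration.

```python
from datetime import date

def conforms(delivery_dates, stated_frequency_days, slack_days=1):
    # Compare the gap between successive deliveries against the update
    # frequency promised in the tear sheet, with a little slack.
    gaps = [(b - a).days for a, b in zip(delivery_dates, delivery_dates[1:])]
    return all(g <= stated_frequency_days + slack_days for g in gaps)
```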

Managing tear sheets is becoming increasingly complex for both data providers and data buyers. This makes the adoption of a standard all the more urgent. Account managers at data providers must manage tear sheets across all sales channels: direct to customer and indirectly through data aggregators, which have proliferated in the past 12 months. Just keeping track of which tear sheets have been sent to whom can be a nightmare. Similarly, data buyers must manage the tear sheets they receive from vendors and view in the aggregators. Important points about the dataset are often lost in SharePoint attachments or in the personal notes of an employee. Standards, and a centralized tear sheet management system, make it much more likely that these attributes don’t get lost.

Some industries are starting to adopt their own tear sheet standards, such as the one adopted by the Alternative Data Council within the financial services industry. We are a member of the council, and we built the recommended FISD standard into Syndetic so that our customers, data providers who often sell into financial services, can create and manage their tear sheets from one central system. You can create one for free here.

We expect that other industries will follow the lead of financial services and move to adopt their own best practices and standards for data tear sheets. We will be closely following the standards as they evolve. If you know of an industry working to adopt a standard data tear sheet, drop us a line at

Why does everyone hate their data dictionary?

As cofounder of Syndetic, I’ve talked to a lot of people about their data dictionaries. At this point, probably dozens of people, ranging from data governance managers at large enterprises to founders of early-stage tech companies. And every single one of them hates their data dictionary. When I say hates, I mean that they say something like “Ugh. I won’t even show it to you. It’s an embarrassment.”

Why does everyone hate their data dictionary? A sort of meta-spreadsheet, a data dictionary on its face sounds like a relatively simple thing. It is a document describing the meaning of a dataset. Typically this includes field names and types (e.g. string, text, varchar) and maybe some annotations that describe the lineage of the data (where did it come from) and the business definition. But as with many workflows that are captured in spreadsheets, things can go awry very quickly.

  1. They are difficult to maintain.

The first person to create a data dictionary for their company usually has great intentions. They may be the first data scientist hired there, or the first data governance professional. They are diligent and organized, and dedicated to the mission of ferreting out every last bit of information about their information. They meticulously craft a spreadsheet (or Google Sheet) that contains the best information available to them at the time. They double- and triple-check it for accuracy. But then, of course, things go off the rails.

Maintaining a data dictionary is not a full time job. And so, the person who created it cannot be expected to be thinking about this document at all times. They go back to doing their day job, and bit by bit, changes start to happen. Engineers change schemas without thinking to alert the person who created or keeps the data dictionary. Data salespeople call their prospects and walk them through the fields of the dataset, but realize that the annotations don’t quite make sense for the use case of their prospect. So they make a copy of the spreadsheet, change the annotation, and send it out. Product teams buy a system to capture data that used to be captured manually. And the dictionary very quickly gets out of date, often before anyone even realizes.

It is dangerous to have any company asset that is so dependent on one person in the organization, in this case the creator of the dictionary. If that person leaves the organization, all history of the document often goes with them, and a new person in that role may be tempted to just scrap it and start over. But then the problem repeats.

  2. They are necessary, but not sufficient, to fully explain the data.

Oftentimes, datasets are shared with business analysts or other non-technical people inside organizations who are tasked with assessing whether the data they are being provided is useful. For these people, receiving a data dictionary containing a bunch of field names and types is the standard. But it doesn’t really help them judge whether the data is of high quality; whether it is of better quality than what can be purchased from a different provider; or whether it has improved or declined in quality over time. For these people, a data dictionary is often filed away and they turn to other means to try to assess the data’s quality. Can you send me a sample of 100 rows? What are the coverage rates for each field? What are the most common values that I’m likely to see?

If it all looks good, and they start receiving the data, often they will move on to the next data provider and next assessment. Rarely do teams have the resources to monitor their incoming data files on an ongoing basis for anomalies, like a sudden increase in null values in a particular field. Even more rarely do teams conduct regular data assessments to ask for new sample sets or statistics on the data. They simply move on.
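The ongoing monitoring most teams skip can start very small: track the null rate per field in each delivery and flag sudden jumps. A minimal sketch, with invented function names and an illustrative threshold:

```python
def null_rates(rows, fields):
    # Fraction of null values per field in one delivery.
    n = len(rows)
    return {f: sum(1 for r in rows if r.get(f) is None) / n for f in fields}

def anomalies(previous, current, jump=0.10):
    # Fields whose null rate rose by more than `jump` since the last file.
    return [f for f in current if current[f] - previous.get(f, 0.0) > jump]
```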

Data as a product is very different from an application or a service because its value depends on many other variables besides whether the data is good or not. For example, the usability of the data is extremely important. You can have the most complete dataset in the world on, say, university rankings, but if the data is not usable, it is worthless. By usable, I really mean that it can be easily joined with other datasets. And that’s because people in the market to buy data on university rankings aren’t just curious whether Stanford is ranked #1 again this year. They want to answer questions that require the data to be joined with data on, say, student populations, geography, or fundraising. Rarely can a dataset be so valuable in isolation. Data providers should understand this, and work as hard as possible to make their data easily combined with other datasets.
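To make the joinability point concrete, here is a toy example (all names and figures invented): the rankings only start answering interesting questions once they share a key with another dataset.

```python
# Two invented datasets keyed on institution name.
rankings = {"Stanford": 1, "MIT": 2, "Oxford": 3}
enrollment = {"Stanford": 17000, "MIT": 11000}

# Joining on the shared key turns a list of ranks into something analyzable;
# rows without a matching key (here, "Oxford") drop out of the join.
joined = {u: {"rank": rankings[u], "students": enrollment[u]}
          for u in rankings.keys() & enrollment.keys()}
```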

Another area where data as a product is special is that it is (usually) a collection of facts. However the data was collected, if a dataset contains information about property transactions, there is an objective truth to the amount of those transactions. A prospective buyer of that dataset is primarily concerned with whether the data is actually accurate or not. If it’s not, it’s not only worthless, but also potentially very damaging to that company’s business, as decisions will be made (such as pricing) in reliance on that data. It is in every data provider’s interest to invest as many resources as possible in the accuracy of its data, with rigorous testing and monitoring.

  3. They’re ugly.

The standard in data dictionaries is the good old Excel spreadsheet, closely followed by a Word document that has been saved as a PDF. It’s curious to me that for all of the time and money that companies spend on product marketing, they do not do a good job of marketing their actual product, which is the data itself. Software companies often pride themselves on design and on making their application as user-friendly and intuitive as possible. But when they receive an inquiry about their data, they send over a spreadsheet. Surely there is a better way.

  4. They cause confusion within the organization.

As with any workflow trapped in a spreadsheet, users of the spreadsheet often don’t know if they can trust it. Before sending out the dictionary to an important prospect, a salesperson may look it over and ask a few people in engineering or product if it’s still accurate. They are unlikely to know. If a current customer has a problem with the data, say the file breaks, and they call up the support team at the data provider, the support team is going to check the actual file that was sent to the customer. They are not going to check the data dictionary. And so you have a reference document that is not really reliable, which sows confusion among the many teams that need to work closely together to support their product. Confusion means time wasted that could be spent on other, more valuable things.

Hate your data dictionary? Drop us a line at

Introducing Syndetic

At Syndetic we make software for data companies. Syndetic literally means connective. It comes from the Greek word syndetikos, which means to bind together. We chose this name because data is connective. While it has become a cliche that data is the new oil, we actually see data as the connective tissue that binds companies to each other. They exchange it in the course of transacting; they ingest it to power their businesses; and they sell it (or give it away) to add value to the greater ecosystem. We’re starting today with tools for companies that sell data, a business that is often misunderstood. Data-as-a-service is relatively new, but more and more companies are offering a data product alongside their core business because they have, in the course of building that business, built a valuable dataset as well.

Companies that sell data need to do two things in addition to building their dataset:

  1. Convey the meaning of their data
  2. Convey its value

Achieving these two things is surprisingly difficult. Why? For one thing, there is a lack of tools in the market specifically designed for DaaS, which means there are a lot of hacked together solutions out there. For another, data is by its nature fluid. Building software for a thing that is constantly changing is difficult. Lastly, there is increased competition in the data marketplace, as more and more data companies are founded in every vertical, and more incumbents launch data products. 

So when a salesperson at a data company calls up a prospect and says “I have some really valuable data I’d like to sell you,” the first thing the prospect is likely to ask is “Okay, what kind of data is it? And how do I know that it’s valuable to me?” And then the salesperson will say, “Let me send you our data dictionary, which explains our data schema, and a sample of the data so you can see what you think.”

Today, data dictionaries are almost always spreadsheets. Some companies keep a spreadsheet in a folder in a shared drive, some use a Google sheet, and some use a Word document that consists of a list of field names, types, and their business meaning. Who manages that spreadsheet often depends on the size of the company and the structure of their data organization – at a tech company or a startup it might be a data scientist, but at a large enterprise it might be a data steward or person responsible for a data governance program. Some companies have lots of process around the spreadsheet – there is an owner who is in charge of the whole spreadsheet, or maybe certain tabs or fields. If someone else who uses the spreadsheet needs to make an update to it, they will send a request to the owner, or submit a ticket through a project management tool. There is some approval process to make the change. Only certain people are allowed to send the spreadsheet out to prospects or current customers. When a field name or definition changes, chaos reigns. Who is in charge of updating the spreadsheet? Who is responsible for letting the customer know so their data pipeline doesn’t break?

This is why we are introducing Syndetic: a web platform for conveying the meaning and value of your datasets. We are purpose-built for DaaS companies, so you don’t need to hack another tool to work for data. We know that it is fluid. We know that what engineering does affects the business side, and vice versa. We know that depending on the use case, different fields may mean different things to different people. 

Go to to get started – upload your current data dictionary, or create one from scratch.

Allison and Steve, Cofounders