There’s no question that dealing with huge datasets is a challenging technical problem, whether it’s distributing search queries (e.g. Google), running algorithms over huge datasets (e.g. flight services), or generating predictions from many data points (e.g. weather or the stock market). Storing big data and computing actionable insights from it requires expertise and lots of powerful machines.

But without the data, you don’t have that analysis. And sometimes getting the data is the harder part – the part that prevents copycat businesses from popping up. Here’s why:

1) It takes a loooong time to get all that data.

If you asked people why Google is the better search engine, the answer would be obvious: they have better algorithms. That’s definitely the major reason, but another is that their index is much bigger than Yahoo’s. If you’re indexing twice as many pages, you have more high-quality pages to choose from. It’s common sense.

Building that huge index takes lots and lots of time. That’s the biggest reason you don’t see search engine startups these days. Google has an advantage because it started crawling the web back in 1999, before many blogs and websites even existed. Compare that to Bing, which only launched in 2009, a mere 3 years ago. Bing not only had to crawl every page created after its launch, but also ALL the content created before it.

2) The data lives in a walled garden

Crawling the entire world wide web is hard. But suppose you had the idea of building a search index back in 1990, giving you a 10-year head start. Aside from the complexity and expense of storing and normalizing all that data, crawling the web really isn’t that hard, at a high level.

You use curl, wget, or even a headless browser like PhantomJS, then distribute that crawler so it fetches as many pages as possible.
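To make that concrete, here’s a minimal sketch of the core crawl loop in Python (assuming the requests and BeautifulSoup libraries; the seed URL, page limit, and worker count are placeholders). A production crawler would add politeness delays, robots.txt handling, deduplication at scale, and durable storage, but the basic fetch-and-follow-links loop really is this simple:

```python
# Minimal breadth-first crawler sketch: fetch pages, extract links, and fan
# the work out across a small pool of workers. Seeds and limits are placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEED_URLS = ["https://example.com"]  # placeholder seed
MAX_PAGES = 100                      # stop early for the sketch

def fetch(url):
    """Download one page and return (url, html), or (url, None) on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return url, resp.text
    except requests.RequestException:
        return url, None

def extract_links(base_url, html):
    """Pull absolute links out of a page."""
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}

def crawl(seeds, max_pages=MAX_PAGES, workers=8):
    seen, frontier, index = set(seeds), list(seeds), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(index) < max_pages:
            batch, frontier = frontier[:workers], frontier[workers:]
            for url, html in pool.map(fetch, batch):
                if html is None:
                    continue
                index[url] = html  # in practice: normalize and persist
                for link in extract_links(url, html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return index

if __name__ == "__main__":
    pages = crawl(SEED_URLS)
    print(f"Crawled {len(pages)} pages")
```

The hard part isn’t this loop; it’s running it across enough machines, for long enough, to cover the whole web.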

But what if you wanted to crawl and index every tweet since the beginning of Twitter? The Twitter API doesn’t give you access to tweets past a certain point, and if you built a crawler instead, they’d ban you within minutes once they saw that many requests coming from the same IP addresses.
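Here’s a sketch of the wall you hit. The endpoint, parameters, and response shape below are hypothetical stand-ins, not Twitter’s actual API; the point is that when you page backwards through a timeline-style API, the history simply runs out at a provider-chosen cutoff, long before “the beginning of time”:

```python
# Hypothetical sketch: page backwards through a timeline API until it stops
# returning older items. The endpoint and parameters are stand-ins, not
# Twitter's real API.
import requests

API_URL = "https://api.example.com/timeline"  # placeholder endpoint
USER = "someuser"                             # placeholder account

def fetch_full_history(user):
    items, oldest_id = [], None
    while True:
        params = {"user": user, "count": 200}
        if oldest_id is not None:
            params["max_id"] = oldest_id - 1  # ask for items older than the last batch
        batch = requests.get(API_URL, params=params, timeout=10).json()
        if not batch:
            break                             # the API has nothing older to give you
        items.extend(batch)
        oldest_id = min(item["id"] for item in batch)
    return items                              # capped at whatever the provider allows

history = fetch_full_history(USER)
print(f"Got {len(history)} items -- everything older is out of reach")
```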

Which is why companies like DataSift have created a semi-barrier to entry through their partnership with Twitter. You can buy access to huge datasets of past tweets from them – we’re talking years’ worth of data from 50+ million Twitter users, data that can’t be found through regular crawling or through the public API.

Sometimes getting around a problem requires business development, not engineering prowess.

3) It relies on millions of users generating content

MyFitnessPal is one of my favorite products. It’s a calorie-counting app for the web, iPhone and Android. One of the features that makes it so popular is its enormous food database of millions of entries – American foods, Canadian foods, Asian foods, fast food, you name it.

Where did they get all this data? They probably seeded it from a few public sources in the beginning. But when users discovered a food that wasn’t in the database, they could enter it themselves, and that new food became available for anyone to search. After 7 years (the company was founded in 2005), the database grew to the size it is now.

Most people go for the calorie counter with the largest database – not the one with the prettiest UI or trendiest name. That gives MyFitnessPal a huge competitive advantage, because people will stop using an app if they can’t find what they’re searching for.
