How Data and Models Feed Computing
This post is the second in a three-part series on artificial intelligence by DigitalOcean’s Head of R&D, Alejandro (Alex) Jaimes. (Click here to read the first installment.)
Not every company or developer will have the resources or the time to collect vast amounts of data and build models from scratch. Fortunately, the same repetition that I described in my last post occurs within and across industries. Because of this, particularly with deep learning, we’ve seen two very important trends:
- (1) creation and sharing of public data to build models; and
- (2) sharing of the models themselves even when the data is not released.
While the companies that have the most data may never release it, such data is not a requirement for every problem. It’s clear, however, that teams that build on existing public models and combine public and proprietary datasets will have a competitive advantage. They must be “smart” about how they use the data they are able to collect, again guided by an AI mindset and strategy.
Supervised and Unsupervised Learning
The majority of successes in AI so far have been based on supervised learning, in which machine learning algorithms are trained on labeled data (samples that have been tagged with a meaningful label) rather than unlabeled data. Labeling data is expensive, time consuming, and difficult (e.g., maintaining the desired quality, dealing with subjectivity, etc.). For this reason, the ideal algorithms would be “unsupervised”; in other words, they would learn from unlabeled data. While promising, those algorithms have not yet shown the success levels needed to have the desired impact. For now, teams should therefore rely on creative strategies to leverage existing datasets, and combine supervised and unsupervised methods.
A number of companies offer labeling and data collection services. But there are also ways to use algorithms to simplify manual labeling. For example, a model trained on a “small” labeled dataset can label a much larger unlabeled dataset, so that humans only have to correct the model’s errors instead of labeling all of the data from scratch. Algorithms can also create synthetic datasets by generating “fake” data that looks like the original data. The bottom line is that no matter what size the project is, there are almost always alternatives for obtaining new data or augmenting existing datasets.
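As a rough illustration of the model-assisted labeling idea, here is a minimal Python sketch. Everything in it is an assumption made for the example, not a description of any particular product or library: the toy two-feature data, the nearest-centroid classifier standing in for a real model, the confidence score, the 0.5 threshold, and the function names (`train_centroids`, `prelabel`, `split_for_review`) are all hypothetical. It trains on a small labeled set, pre-labels a larger unlabeled set, and sends only the low-confidence items to human reviewers.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train_centroids(labeled):
    """Fit a toy nearest-centroid model from (features, label) pairs."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def prelabel(centroids, features):
    """Return (predicted_label, confidence in [0, 1]).

    Confidence is an illustrative heuristic: 1 - (nearest / second nearest
    distance). It assumes at least two classes.
    """
    dists = sorted((math.dist(features, c), label) for label, c in centroids.items())
    (d1, best), (d2, _) = dists[0], dists[1]
    confidence = 0.0 if d2 == 0 else 1.0 - d1 / d2
    return best, confidence

def split_for_review(centroids, unlabeled, threshold=0.5):
    """Auto-accept confident predictions; queue the rest for human correction."""
    auto, review = [], []
    for features in unlabeled:
        label, conf = prelabel(centroids, features)
        (auto if conf >= threshold else review).append((features, label))
    return auto, review

# Hypothetical "small" labeled seed set and a larger unlabeled pool.
SEED = [((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
        ((5.0, 5.0), "dog"), ((4.8, 5.2), "dog")]
UNLABELED = [(0.9, 1.1), (5.1, 4.9), (3.0, 3.0), (2.9, 3.2)]

centroids = train_centroids(SEED)
auto_labeled, needs_review = split_for_review(centroids, UNLABELED)

print("auto-labeled:", auto_labeled)
print("needs human review:", needs_review)
```

In this toy run, the two points that sit close to a class centroid are labeled automatically, while the two points roughly midway between the classes are queued for a person to check. In practice the classifier, the confidence measure, and the threshold would all need to be chosen and validated for the data at hand.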
AI as a Service
Developing models that perform tasks accurately and efficiently generally takes significant effort. For that reason, many companies and teams focus on specific verticals, building functionalities that are limited but that work well in practice (versus the ideal of building a “human-like” AI capable of doing many things at once).
In some cases, those functionalities can be applied across domains. Developing a speech recognition system from scratch, for example, is a major effort, and most companies and teams that need one would be better off using an existing service than building their own.
As the AI industry advances, we can expect to see more and more of those functionalities coming from specific vendors and open source initiatives, similar to the way software is built today: combinations of libraries, APIs, and open source and commercial components, coupled with custom software for specific applications.
In addition, given the nature of AI, building an infrastructure that quickly scales as needs shift is a major challenge. This implies that AI will mostly run in the cloud. Note that in the new AI computing paradigm, growing datasets, experimentation, and constant “tweaking” of models are critical components.
Therefore, AI will be used as a cloud-based service for many applications. That’s a natural progression, and in many ways it leads to the commoditization of AI, which will bring greater efficiency, opportunities, innovation, and positive economic impact. In our next installment, we’ll explore what all of this means for today’s developers.
In line with the trends we’re seeing in research and industry, we’re releasing a powerful set of tools that allow developers to re-use existing models, work with large quantities of data, and scale easily in the cloud. We encourage you to take a look at our machine learning one-click. What other tools or functionalities would you be interested in having us provide? Feel free to leave feedback in the comments section below.
Alejandro (Alex) Jaimes is Head of R&D at DigitalOcean. Alex enjoys scuba diving and started coding in Assembly when he was 12. In spite of his fear of heights, he's climbed a peak or two, gone paragliding, and ridden a bull in a rodeo. He's been a startup CTO and advisor, and has held leadership positions at Yahoo, Telefonica, IDIAP, FujiXerox, and IBM TJ Watson, among others. He holds a Ph.D. from Columbia University.
Learn more by visiting his personal website or LinkedIn profile. Find him on Twitter: @tinybigdata.