Machine Learning Yearning

Machine Learning Yearning is a book on technical strategy for AI engineers in the era of deep learning.
Andrew Ng distills his experience building industrial machine learning products into this book, /Machine Learning Yearning/. AI engineers and AI PMs can use it to diagnose issues in real-world ML products and to deliver great AI products.
Moreover, the book is written in an engaging and straightforward way, with lots of examples. I really enjoyed reading it and summarized it in the outline below. Enjoy. 😊

Why ML Strategy

For example, suppose you are building a cat picture startup that uses deep learning to detect cats in pictures. But tragically, your learning algorithm's accuracy is not yet good enough. You are under tremendous pressure to improve your cat detector. What do you do?

  • Get more data
  • Collect a more diverse training set
  • Train the algorithm longer, by running more gradient descent iterations.
  • Try a bigger neural network, with more layers/hidden units/parameters.
  • Try a smaller neural network.
  • Try adding regularization (such as L2 regularization).
  • Change the neural network architecture (activation function, number of hidden units, etc.)

If you choose well among these possible directions, you’ll build the leading cat picture platform, and lead your company to success. If you choose poorly, you might waste months.

How do you proceed? This book tells you how.

One of the biggest drivers of recent progress is scale: in general, you obtain the best performance when you (i) train a very large neural network and (ii) have a huge amount of data.
 
Many other details such as neural network architecture are also important, and there has been much innovation here. But one of the more reliable ways to improve an algorithm’s performance today is still to (i) train a bigger network and (ii) get more data.

Setting up development and test sets

  1. Don’t assume your training distribution is the same as your test distribution. Try to pick test examples that reflect what you ultimately want to perform well on, rather than whatever data you happen to have for training.
  2. Your dev and test sets should come from the same distribution
    • E.g. suppose your team develops a system that works well on the dev set but not
      the test set. If your dev and test sets had come from the same distribution, then you would
      have a very clear diagnosis of what went wrong: You have overfit the dev set. The obvious
      cure is to get more dev set data.

    • But if the dev and test sets come from different distributions, then your options are less clear: either you have overfit the dev set, or the test set is harder than, or simply different from, the dev set.
  3. How large should the dev/test sets be?
    • The dev set should be large enough to detect the accuracy improvements you care about. Dev sets of 1,000 to 10,000 examples are common; with 10,000 examples you have a good chance of detecting an improvement of 0.1%. If small performance improvements matter a great deal to the company (as in ads, web search, or product recommendations), it is worth using a very large dev set.
    • A common heuristic was to use 30% of your data for the test set. But with very large datasets (e.g., more than a billion examples), there is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms.
  4. Establish a single-number evaluation metric for your team to optimize
    • Classification accuracy is an example of a single-number evaluation metric.
    • In contrast, precision and recall together do not form a single-number evaluation metric (though they can be combined into one, such as the F1 score).
    • Having a single-number evaluation metric speeds up your ability to make a decision when
      you are selecting among a large number of classifiers.

  5. Optimizing and satisficing metrics are another way to combine multiple evaluation metrics.

| Classifier | Accuracy | Running time |
| ---------- | -------- | ------------ |
| A          | 90%      | 80ms         |
| B          | 92%      | 95ms         |
| C          | 95%      | 1,500ms      |

  • E.g., in the table above, we may define running time as the satisficing metric (anything below 100ms is acceptable) and accuracy as the optimizing metric (the higher, the better, among classifiers that stay within the satisficing bound); see the model-selection sketch after this list.
  6. When to change dev/test sets and metrics
    • Iterate fast: It is better to come up with something imperfect and get going quickly, rather than overthink this.
    • It is quite common to change dev/test sets or evaluation metrics during a project. It’s not a big deal! Just change them and make sure your team knows about the new direction.
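
Referring back to the optimizing/satisficing example above, here is a minimal model-selection sketch in plain Python. The classifier results mirror the accuracy/running-time table; the 100ms threshold is the satisficing constraint and accuracy is the optimizing metric.

```python
# Minimal sketch: pick the most accurate classifier whose running time
# satisfies the 100ms constraint. The numbers mirror the table above.
candidates = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

# Satisficing metric: running time must be below 100ms.
acceptable = [c for c in candidates if c["running_time_ms"] < 100]

# Optimizing metric: among acceptable classifiers, maximize accuracy.
best = max(acceptable, key=lambda c: c["accuracy"])
print(best["name"])  # prints "B"
```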

Basic Error Analysis

  1. When you start a new project, especially if it is in an area in which you are not an expert, it is hard to correctly guess the most promising directions.
  2. So don’t start off trying to design and build the perfect system. Instead build and train a basic system as quickly as possible—perhaps in a few days. Then use error analysis to help you identify the most promising directions and iteratively improve your algorithm from there.
  3. Carry out error analysis by manually examining ~100 dev set examples the algorithm misclassifies and counting the major categories of errors (see the tally sketch after this list). Use this information to prioritize what types of errors to work on fixing.
  4. Consider splitting the dev set into an Eyeball dev set, which you will manually examine, and a Blackbox dev set, which you will not manually examine. If performance on the Eyeball dev set is much better than the Blackbox dev set, you have overfit the Eyeball dev set and should consider acquiring more data for it.
  5. The Eyeball dev set should be big enough so that your algorithm misclassifies enough examples for you to analyze. A Blackbox dev set of 1,000-10,000 examples is sufficient for many applications.
  6. If your dev set is not big enough to split this way, just use the entire dev set as an Eyeball dev set for manual error analysis, model selection, and hyperparameter tuning.
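
Here is the tally sketch referenced in the error-analysis step above. It assumes you have already hand-labeled each misclassified Eyeball dev set example with one or more error categories; the category names and entries below are made up for illustration.

```python
from collections import Counter

# Hypothetical category tags assigned while manually inspecting
# misclassified Eyeball dev set examples (one list of tags per example).
error_tags = [
    ["dog_mistaken_for_cat"],
    ["blurry"],
    ["great_cat"],                        # lions, panthers, etc.
    ["dog_mistaken_for_cat", "blurry"],
    # ... one entry per misclassified example, ~100 in total
]

counts = Counter(tag for tags in error_tags for tag in tags)
total = len(error_tags)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({100 * n / total:.0f}% of errors)")
```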

Bias and Variance

There are two major sources of error in machine learning: bias and variance.

Roughly:

  • The bias is the error rate of your algorithm on your training set when you have a very large training set.
  • The variance is how much worse you do on the test set compared to the training set in this setting.

With these concepts in mind, let's look at several examples:

| Training Error | Dev Error |
| -------------- | --------- |
| 1%             | 11%       |

We estimate the bias as 1% and the variance as 10%. Thus, it has high variance. The classifier has very low training error, but it is failing to generalize to the dev set. This is also called overfitting.

| Training Error | Dev Error |
| -------------- | --------- |
| 15%            | 16%       |

We estimate the bias as 15%, and variance as 1%. This classifier is fitting the training set poorly with 15% error, but its error on the dev set is barely higher than the training error. This classifier therefore has high bias, but low variance. We say that this algorithm is underfitting.

| Training Error | Dev Error |
| -------------- | --------- |
| 15%            | 30%       |

We estimate the bias as 15%, and variance as 15%. This classifier has high bias and high variance: It is doing poorly on the training set, and therefore has high bias, and its performance on the dev set is even worse, so it also has high variance. The overfitting/underfitting terminology is hard to apply here since the classifier is simultaneously overfitting and underfitting.

| Training Error | Dev Error |
| -------------- | --------- |
| 0.5%           | 1%        |

This classifier is doing well, as it has low bias and low variance. Congratulations on achieving this great performance!

| Training Error | Dev Error |
| -------------- | --------- |
| 15%            | 30%       |

Now consider the same numbers with one added condition: the classification task is hard (e.g., speech recognition with lots of background noise), so that even humans achieve only 14% error. The bias and variance in this case can be broken down as follows (see the sketch after this list):

  • Optimal error rate ("unavoidable bias", "Bayes error rate", "Bayes rate"): 14%. Suppose we decide that, even with the best possible speech system in the world, we would still suffer 14% error. We can think of this as the "unavoidable" part of a learning algorithm’s bias.
  • Avoidable bias: 1%. This is calculated as the difference between the training error and the optimal error rate.
  • Variance: 15%. The difference between the dev error and the training error.
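
A minimal sketch of this decomposition, with errors expressed in percentage points (the function name is mine, not the book's):

```python
def decompose_error(train_error, dev_error, optimal_error=0.0):
    """Split error (in percentage points) into unavoidable bias, avoidable
    bias, and variance, given an estimate of the optimal (Bayes) error rate,
    e.g. human-level error on a task that humans do well."""
    return {
        "unavoidable_bias": optimal_error,
        "avoidable_bias": train_error - optimal_error,
        "variance": dev_error - train_error,
    }

print(decompose_error(train_error=15, dev_error=30, optimal_error=14))
# {'unavoidable_bias': 14, 'avoidable_bias': 1, 'variance': 15}
```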

Techniques for reducing avoidable bias

  • Increase the model size (such as number of neurons/layers): This technique reduces bias, since it should allow you to fit the training set better. If you find that this increases variance, then use regularization, which will usually eliminate the increase in variance.
  • Modify input features based on insights from error analysis: Say your error analysis inspires you to create additional features that help the algorithm eliminate a particular category of errors. (We discuss this further in the next chapter.) These new features could help with both bias and variance. In theory, adding more features could increase the variance; but if you find this to be the case, then use regularization, which will usually eliminate the increase in variance.
  • Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout): This will reduce avoidable bias, but increase variance.
  • Modify model architecture (such as neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.

One method that is not helpful:

  • Add more training data: This technique helps with variance problems, but it usually has no significant effect on bias.

Techniques for reducing variance

  • Add more training data: This is the simplest and most reliable way to address variance, so long as you have access to significantly more data and enough computational power to process the data.
  • Add regularization (L2 regularization, L1 regularization, dropout): This technique reduces variance but increases bias.
  • Add early stopping (i.e., stop gradient descent early, based on dev set error): This technique reduces variance but increases bias. Early stopping behaves a lot like regularization methods, and some authors call it a regularization technique. (A sketch combining L2 regularization and early stopping appears after this list.)
  • Feature selection to decrease number/type of input features: This technique might help with variance problems, but it might also increase bias. Reducing the number of features slightly (say going from 1,000 features to 900) is unlikely to have a huge effect on bias. Reducing it significantly (say going from 1,000 features to 100—a 10x reduction) is more likely to have a significant effect, so long as you are not excluding too many useful features. In modern deep learning, when data is plentiful, there has been a shift away from feature selection, and we are now more likely to give all the features we have to the algorithm and let the algorithm sort out which ones to use based on the data. But when your training set is small, feature selection can be very useful.
  • Decrease the model size (such as number of neurons/layers): Use with caution. This technique could decrease variance, while possibly increasing bias. However, I don’t recommend this technique for addressing variance. Adding regularization usually gives better classification performance. The advantage of reducing the model size is reducing your computational cost and thus speeding up how quickly you can train models. If speeding up model training is useful, then by all means consider decreasing the model size. But if your goal is to reduce variance, and you are not concerned about the computational cost, consider adding regularization instead.
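
As a rough sketch of two of the techniques above, here is what L2 regularization (expressed as weight decay) and early stopping on dev set loss might look like in PyTorch. The `model`, `train_loader`, `dev_loader`, and `loss_fn` objects are assumed to exist and are not shown; the hyperparameter values are placeholders.

```python
import torch

# Assumes `model`, `train_loader`, `dev_loader`, and `loss_fn` already exist.
# L2 regularization, expressed as weight decay on the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

best_dev_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Early stopping: monitor dev set loss and stop once it stops improving.
    model.eval()
    with torch.no_grad():
        dev_loss = sum(loss_fn(model(x), y).item() for x, y in dev_loader)
    if dev_loss < best_dev_loss:
        best_dev_loss, bad_epochs = dev_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```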

Two additional tactics, repeated from the bias-reduction techniques above:

  • Modify input features based on insights from error analysis: Say your error analysis inspires you to create additional features that help the algorithm to eliminate a particular category of errors. These new features could help with both bias and variance. In theory, adding more features could increase the variance; but if you find this to be the case, then use regularization, which will usually eliminate the increase in variance.
  • Modify model architecture (such as neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.

Comparing to human-level performance

Why compare to human-level performance?

  1. Ease of obtaining data from human labelers. For example, since people recognize cat images well, it is straightforward for people to provide high accuracy labels for your learning algorithm.
  2. Error analysis can draw on human intuition. Suppose a speech recognition algorithm is doing worse than human-level recognition. Say it incorrectly transcribes an audio clip as “This recipe calls for a pear of apples,” mistaking “pair” for “pear.” You can draw on human intuition and try to understand what information a person uses to get the correct transcription, and use this knowledge to modify the learning algorithm.
  3. Use human-level performance to estimate the optimal error rate and also set a “desired error rate.”

What about surpassing human-level performance?

In general, so long as there are dev set examples where humans are right and your algorithm is wrong, then many of the techniques described earlier will apply. This is true even if, averaged over the entire dev/test set, your performance is already surpassing human-level performance.

Training and testing on different distributions

When should you train and test on different distributions?

Most of the academic literature on machine learning assumes that the training set, dev set and test set all come from the same distribution.

In the early days of machine learning, data was scarce. We usually only had one dataset drawn from some probability distribution.

But in the era of big data, we now have access to huge training sets, such as cat images downloaded from the internet. Even if the training set comes from a different distribution than the dev/test set, we still want to use it for learning, since it can provide a lot of information.

Golden rule:

Choose dev and test sets to reflect data you expect to get in the future and want to do well on.

For example, suppose we have a cat detector mobile app with 10,000 user-uploaded images and 200,000 cat images downloaded from the internet. Instead of putting all 10,000 user-uploaded images into the dev/test sets, we might instead put 5,000 into the dev/test sets and the remaining 5,000 user-uploaded examples into the training set. This way, your training set of 205,000 examples contains some data that comes from your dev/test distribution along with the 200,000 internet images.
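
A minimal sketch of that split, assuming `internet_examples` (200,000 items) and `mobile_examples` (10,000 user-uploaded items) are lists of labeled examples; splitting the 5,000 dev/test examples evenly is my own choice for illustration.

```python
import random

random.shuffle(mobile_examples)             # 10,000 user-uploaded examples
dev_set   = mobile_examples[:2_500]         # dev and test sets reflect the mobile
test_set  = mobile_examples[2_500:5_000]    # app distribution you care about
train_set = internet_examples + mobile_examples[5_000:]   # 200,000 + 5,000 examples
```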

Why might adding the internet images help (or hurt)?

  1. Your neural network can apply some of the knowledge acquired from internet images to mobile app images.
  2. It forces the neural network to expend some of its capacity to learn about properties that are specific to internet images. If these properties differ greatly from mobile app images, it will “use up” some of the representational capacity of the neural network. Thus there is less capacity for recognizing data drawn from the distribution of mobile app images, which is what you really care about. Theoretically, this could hurt your algorithms’ performance.

For the second effect:

  • Fortunately, if you have the computational capacity needed to build a big enough neural network, then this is not a serious concern. You have enough capacity to learn from both the internet images and the mobile app images.
  • But if you do not have a big enough neural network (or another highly flexible learning algorithm), then you should pay more attention to your training data matching your dev/test set distribution.

Another way to mitigate the impact of the internet pictures is to weight the data: give each internet example a weight 𝛽 < 1 in the training objective. If you set 𝛽 = 1/40, the algorithm gives equal total weight to the 5,000 mobile images and the 200,000 internet images. You can also set the parameter 𝛽 to other values, perhaps by tuning it on the dev set. (A sketch follows the list below.)

This type of re-weighting is needed only when:

  1. you suspect the additional data (Internet Images) has a very different distribution than the dev/test set,

  2. the additional data is much larger than the data that came from the same distribution as the dev/test set (the mobile images).
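
A minimal sketch of this re-weighting, written as a weighted sum of per-example losses (the function and argument names are placeholders):

```python
def weighted_objective(mobile_losses, internet_losses, beta=1 / 40):
    """Training objective in which each internet example counts beta times as
    much as a mobile app example (beta < 1 down-weights the internet data)."""
    return sum(mobile_losses) + beta * sum(internet_losses)

# With beta = 1/40, the 200,000 internet examples carry the same total weight
# as the 5,000 mobile app examples (200,000 / 40 = 5,000).
```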

Data Synthesis

E.g. Your speech system needs more data that sounds as if it were taken from within a car. Rather than collecting a lot of data while driving around, there might be an easier way to get this data: By artificially synthesizing it.

Keep in mind that artificial data synthesis has its challenges: it is sometimes easier to create synthetic data that appears realistic to a person than to create data that appears realistic to a computer. For example, suppose you have 1,000 hours of speech training data but only 1 hour of car noise. If you repeatedly combine the same 1 hour of car noise with different portions of the original 1,000 hours of training data, you will end up with a synthetic dataset where the same car noise is repeated over and over. A person listening would have a hard time noticing this, but a learning algorithm could easily overfit to that 1 hour of car noise.
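
A minimal sketch of the synthesis step, assuming the clips are 1-D NumPy arrays sampled at the same rate. Choosing a random offset and gain for each synthetic example is one simple way to reduce, though not eliminate, the repetition problem described above.

```python
import numpy as np

def mix_with_car_noise(clean_speech, car_noise, rng=None):
    """Overlay a randomly chosen, randomly scaled slice of car noise onto a
    clean speech clip. Assumes the noise recording is longer than the clip."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(car_noise) - len(clean_speech))
    noise = car_noise[start:start + len(clean_speech)]
    gain = rng.uniform(0.1, 0.5)            # vary the noise level per example
    return clean_speech + gain * noise
```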

End-to-end deep learning


When should you, and shouldn't you, use end-to-end systems? Here is an example.

Suppose you want to build a speech recognition system. A pipeline system might have three components: compute MFCC features from the input audio, use those features to recognize phonemes, and then generate the transcript from the phonemes.

MFCCs and phonemes are both hand-engineered representations. These hand-engineered components limit the potential performance of the speech system. However, allowing hand-engineered components also has some advantages:

  • The MFCC features are robust to some properties of speech that do not affect the content, such as speaker pitch. Thus, they help simplify the problem for the learning algorithm.
  • To the extent that phonemes are a reasonable representation of speech, they can also help the learning algorithm understand basic sound components and therefore improve its performance.

Now, consider the end-to-end system, which maps the input audio directly to the output transcript with a single learning algorithm.

This system lacks the hand-engineered knowledge. Thus, when the training set is small, it might do worse than the hand-engineered pipeline. If the learning algorithm is a large-enough neural network and if it is trained with enough training data, it has the potential to do very well.

Andrew Ng is skeptical about end-to-end learning for autonomous driving.

For example:

You can use machine learning to detect cars and pedestrians. But if we want to train an end-to-end steering-direction system, we would need a large dataset of (image, steering direction) pairs. It is very time-consuming and expensive to have people drive cars around and record their steering directions to collect such data.

Directly learning rich outputs:

One of the most exciting developments in end-to-end deep learning is that it lets us directly learn outputs y that are much more complex than a single number.

This is an accelerating trend in deep learning: When you have the right (input,output) labeled pairs, you can sometimes learn end-to-end even when the output is a sentence, an image, audio, or other outputs that are richer than a single number.
