Machine Learning Jargon

Essential terminology for newcomers

Machine learning as a whole can be quite overwhelming when you are getting started, so I’ve compiled a list of core concepts and terminology I consider necessary for understanding and starting with machine learning.

Activation Function

An activation, or activation function, for a neural network is defined as the mapping of the input to the output via a non-linear transform function at each “node”, which is simply a locus of computation within the net. Each layer in a neural net consists of many nodes, and the number of nodes in a layer is known as its width.

Artificial Neural Network

An artificial neuron network (ANN) is a computational model based on the structure and functions of biological neural networks. Information that flows through the network affects the structure of the ANN because a neural network changes — or learns, in a sense — based on that input and output.

Back propagation

Backpropagation is an algorithm to efficiently calculate the gradients in a Neural Network, or more generally, a feedforward computational graph. It boils down to applying the chain rule of differentiation starting from the network output and propagating the gradients backward.

Bayesian Optimization

Bayesian Optimization is an optimization algorithm that is used when three conditions are met:

  1. you do not have a black box (or any box) that evaluates the gradient of your objective function
  2. your objective function is expensive to evaluate.


Classification is concerned with building models that separate data into distinct classes. Well-known classification schemes include decision trees and support vector machines. As this type of algorithm requires explicit class labelling, classification is a form of supervised learning.


Clustering is used for analyzing data which does not include pre-labeled classes, or even a class attribute at all. Data instances are grouped together using the concept of “maximizing the intraclass similarity and minimizing the interclass similarity,” as concisely described by Han, Kamber & Pei. k-means clustering is perhaps the most well-known example of a clustering algorithm. As clustering does not require the pre-labeling of instance classes, it is a form of unsupervised learning, meaning that it learns by observation as opposed to learning by example.


For mathematical purposes, a convolution is the integral measuring how much two functions overlap as one passes over the other. Think of a convolution as a way of mixing two functions by multiplying them: a fancy form of multiplication. From the Latin convolvere, “to convolve” means to roll together.

Convolutional neural network

Convolutional networks are a deep neural network that is currently the state-of-the-art in image processing. They are setting new records in accuracy every year on widely accepted benchmark contests like ImageNet.

Cross Validation

Cross-validation is a deterministic method for model building, achieved by leaving out one of k segments, or folds, of a dataset, training on all k-1 segments, and using the remaining kth segment for testing; this process is then repeated k times, with the individual prediction error results being combined and averaged in a single, integrated model. This provides variability, with the goal of producing the most accurate predictive models possible.

Feature Engineering

Feature engineering is the art of extracting useful patterns from data that will make it easier for Machine Learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification.

Gradient Descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Deep Belief Network

A deep-belief network is a stack of restricted Boltzmann machines, which are themselves a feed-forward autoencoder that learns to reconstruct input layer by layer, greedily. Pioneered by Geoff Hinton and crew. Because a DBN is deep, it learns a hierarchical representation of input. Because DBNs learn to reconstruct that data, they can be useful in unsupervised learning.

Decision Trees

Decision trees are top-down, recursive, divide-and-conquer classifiers. Decision trees are generally composed of 2 main tasks:

  • Tree pruning is the process of removing the unnecessary structure from a decision tree in order to make it more efficient, more easily-readable for humans, and more accurate as well. This increased accuracy is due to pruning’s ability to reduce overfitting.


Inference refers inferring unknown properties of the real world. This can be confusing both because learning and inference refer to figuring out unknown things. An example from human vision: when you see something, your brain receives a bunch of pixels and you need to infer what it is that you are seeing (e.g., a tree).


A layer is the highest-level building block in a (Deep) Neural Network. A layer is a container that usually receives weighted input, transforms it and returns the result as output to the next layer. A layer usually contains one type of function like ReLU, pooling, convolution etc. so that it can be easily compared to other parts of the network. The first and last layers in a network are called input and output layers, respectively, and all layers in between are called hidden layers.


Learning refers to learning unknown parameters in your model. Your model might be wrong, so these parameters might have nothing to do with the real world — but they can still be useful in modeling your system

LSTM Long Short-Term Memory

Long Short-Term Memory networks were invented to prevent the vanishing gradient problem in Recurrent Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the hidden state in an RNN we help to the network to efficiently propagate gradients and learn long-range dependencies.


This is when you learn something too specific to your data set and therefore fail to generalize well to unseen data. This causes it to do badly for unseen data. Another example is if you are a student and you memorize the answers to all 1000 questions in your textbook without understanding the material.


Pooling, max pooling and average pooling are terms that refer to downsampling or subsampling within a convolutional network. Downsampling is a way of reducing the amount of data flowing through the network, and therefore decreasing the computational cost of the network. Average pooling takes the average of several values. Max pooling takes the greatest of several values. Max pooling is currently the preferred type of downsampling layer in convolutional networks.


Normalization refers to rescaling real valued numeric attributes into the range 0 and 1.


Regression is very closely related to classification. While classification is concerned with the prediction of discrete classes, regression is applied when the “class” to be predicted is made up of continuous numerical values. Linear regression is an example of a regression technique.

Recurrent Neural Network

A RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. At each time step, an RNN calculates a new hidden state (“memory”) based on the current input and the previous hidden state. The “recurrent” stems from the facts that at each step the same parameters are used and the network performs the same calculations based on different inputs.

Recursive Neural Network

Recursive neural networks learn data with structural hierarchies, such as text arranged grammatically, much like recurrent neural networks learn data structured by its occurrence in time. Their chief use is in natural-language processing, and they are associated with Richard Socher of Stanford’s NLP lab.


Tomas Mikolov’s neural networks, known as Word2vec, have become widely used because they help produce state-of-the-art word embeddings. Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus.


A vector is a data structure with at least two components, as opposed to a scalar, which has just one. For example, a vector can represent velocity, an idea that combines speed and direction: wind velocity = (50mph, 35 degrees North East). A scalar, on the other hand, can represent something with one value like temperature or height: 50 degrees Celsius, 180 centimeters.

Senior Engineering Lead @Humi, Regularly writing about software engineering and technology

Senior Engineering Lead @Humi, Regularly writing about software engineering and technology