How is bias identified in learning algorithms?

What is algorithmic bias?

Algorithmic bias arises from flawed data and/or flawed processing of that data. It can cause intelligent systems to discriminate against certain groups of people or minorities. One example is discrimination against female applicants in an automated selection process. But how does flawed data come about?

Today, machine learning and algorithms are the basis for decisions that affect individual fates or entire population groups. Intelligent assistants calculate the suitability of applicants, analyze the most efficient route or obstacles for self-driving cars and identify cancer on X-rays. Data is the blood in the veins of such machines: It is the basis for self-learning systems and the ultimate template for all subsequent calculations and recommendations.

What is machine learning?

Modern learning algorithms use predefined collections of information (e.g. texts or images) to recognize patterns or logical connections and to reveal regularities on which later decisions can be based. They learn by example. An algorithm is therefore only as good as the information it is based on. It is precisely this fact that becomes a challenge as machine learning spreads.

"However, an algorithm is only as good as the data with which it works."

(from English, Barocas / Selbst)


Discrimination through intelligent systems

Data is generated and processed by people and, like its creators, is not perfect. For example, it can reflect widespread prejudices or capture only certain groups of people. If an intelligent system works on the basis of such a data set, the result is often discrimination.

"Algorithmic bias occurs when a computer system reflects the implicit values of the people involved in coding, collecting, selecting, or using data to train the algorithm."

(from English, Wikipedia)


Numerous examples illustrate the problem, and almost all of the major tech companies that work with AI have already encountered it. In 2015, for example, a Google algorithm labeled people with dark skin as gorillas. In October 2018, Amazon made headlines when it became known that an intelligent system had filtered out applications containing terms such as "women's" or "women's college".

Computer scientist Joy Buolamwini demonstrates the serious consequences algorithmic bias can have, e.g. in image recognition.


One can only imagine what would happen if a self-driving car used similar software to identify obstacles.


How algorithmic prejudices arise

But how do these algorithmic prejudices arise? Solon Barocas from Cornell University and Andrew D. Selbst from Yale Law School define five technical mechanisms in the processing of data that can influence its informative value.

If you want to explore the technical mechanisms in detail, we recommend the 56-page English-language PDF "Big Data's Disparate Impact" by Solon Barocas and Andrew D. Selbst. For the sake of completeness, note that the two authors refer to data mining, a close relative of machine learning, in their explanations. The goal of both methods is to identify patterns in data: data mining is about finding new patterns, machine learning about recognizing known ones.

"By definition, data mining is always a form of statistical (and therefore apparently reasonable) discrimination. The purpose of data mining is to create a rational basis on which one can distinguish between individuals (...)."

(from English, Barocas / Selbst)


A simplified version of the main sources of error is as follows:

1. The subjective definition of target variables

A target variable translates a problem into a question. It defines what a data scientist wants to find out. Even when an expert defines the target variable, doing so is a challenging, subjective process that can lead to discrimination, even unintentionally. It is not for nothing that it is called the "art of data mining". For example, suppose the target variable is the best employee in the company. To identify this person, the word "best" must first be translated into measurable values. This translation can be shaped by the data scientist's individual perspective and thus lead to discrimination.
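
The effect of this choice can be sketched in a few lines of Python. The following is only an illustration under invented assumptions: the employee records, the three scoring rules and the helper `best_employee` are all hypothetical and do not come from Barocas and Selbst. The sketch shows that the same data yields a different "best" employee depending on how the target variable is defined.

```python
# Hypothetical employee records; every name and number here is made up.
employees = [
    {"name": "A", "revenue": 120, "hours_worked": 60, "peer_rating": 3.1},
    {"name": "B", "revenue": 100, "hours_worked": 40, "peer_rating": 4.6},
    {"name": "C", "revenue": 90, "hours_worked": 30, "peer_rating": 4.0},
]

# Three plausible but different ways to make "best" measurable (assumptions).
definitions = {
    "total revenue": lambda e: e["revenue"],
    "revenue per hour": lambda e: e["revenue"] / e["hours_worked"],
    "peer rating": lambda e: e["peer_rating"],
}


def best_employee(records, score):
    """Return the name of the record with the highest score."""
    return max(records, key=score)["name"]


# Apply every definition of "best" to the same records.
winners = {
    label: best_employee(employees, score)
    for label, score in definitions.items()
}

for label, name in winners.items():
    print(f"best by {label}: {name}")
```

With these made-up numbers, each definition selects a different person. A definition based on total revenue, for instance, would tend to rank part-time staff lower regardless of how productive they are per hour, which is one way a seemingly neutral target variable can disadvantage a group.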

2. The wrong handling of training data

Modern algorithms based on machine learning require training data (for training the algorithms) and test data (for testing functionality).

  • Incorrect labeling: In supervised learning, training data is labeled by humans. They decide in advance which picture shows a dog and which shows a cat. If this assignment is incorrect, it directly affects the learning outcome.
  • Sampling bias: One part of the population (e.g. fair-skinned people) is over-represented in the training data set, while another part (e.g. dark-skinned people) is under-represented. The system then produces better results on average for fair-skinned people.
  • Historical bias: The algorithm is trained on an old data set that reflects the values and morals of the past (such as the role of women).
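
A simple first check for sampling bias is to compare how often each group appears in the data and how well the system performs for each group. The sketch below assumes a small, invented list of evaluation records; the group names, labels and the helpers `group_shares` and `group_accuracy` are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical evaluation records: (group, true_label, predicted_label).
# All values are invented for illustration.
records = [
    ("fair-skinned", 1, 1), ("fair-skinned", 0, 0),
    ("fair-skinned", 1, 1), ("fair-skinned", 0, 0),
    ("fair-skinned", 1, 1), ("fair-skinned", 0, 0),
    ("fair-skinned", 1, 0), ("fair-skinned", 0, 0),
    ("dark-skinned", 1, 0), ("dark-skinned", 0, 0),
]


def group_shares(rows):
    """Fraction of all records that belongs to each group."""
    counts = Counter(group for group, _, _ in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def group_accuracy(rows):
    """Fraction of correct predictions, computed separately per group."""
    seen = Counter()
    correct = Counter()
    for group, truth, predicted in rows:
        seen[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / seen[group] for group in seen}


shares = group_shares(records)
accuracy = group_accuracy(records)

for group in shares:
    print(f"{group}: share={shares[group]:.2f}, accuracy={accuracy[group]:.3f}")
```

A large gap between the groups in either number does not prove discrimination, but it is a signal that the data set or the model deserves a closer look. With only two records for one group, as in this toy data, the accuracy figure for that group is also very unreliable, which is itself a symptom of under-representation.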

3. Inaccurate feature selection

Feature selection is the decision about which attributes are taken into account and incorporated into the analysis of the data. It is considered impossible to capture all attributes of a subject or to consider all environmental factors in a model. As a result, relevant details may receive too little attention and the resulting recommendations may be imprecise. Suppose we want to find the most suitable candidate for an open position, and graduation from an elite university is defined as a qualifying criterion, while neither the final grade nor the length of study is taken into account. Ignoring these features can mean that the best candidate is not identified. It is therefore crucial to consider the context and to find the right balance between the number of features and the size of the data set.

4. Masking / hidden discrimination

Masking describes deliberate, concealed discrimination by prejudiced decision-makers, e.g. when a programmer deliberately distorts a data collection.

"A biased programmer could intentionally implement discrimination, for example by including discriminatory features in the definition of the target variable."

(from English, Barocas / Selbst)

Combating algorithmic bias

To make matters worse, collecting and generating large amounts of data takes a lot of time (and money). Many data scientists therefore fall back on existing data collections downloaded from the Internet. As a result, biased data sets spread rapidly and affect many different systems around the world. For example, more than 15 million engineers imported Word2Vec, a set of word embeddings provided by Google that is known to contain all sorts of historical prejudices.
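
How such prejudices can be made visible in word embeddings is sketched below, loosely in the spirit of the "gender direction" idea discussed by Bolukbasi et al. (see the sources at the end). This is a simplified illustration, not their method: the three-dimensional vectors are invented toy values, not real Word2Vec embeddings, and the helpers `subtract` and `cosine` are hypothetical names.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT real Word2Vec values; real embeddings have hundreds of dimensions.
vectors = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "programmer": [0.6, 0.8, 0.1],
    "homemaker": [-0.7, 0.7, 0.1],
    "teacher": [0.0, 1.0, 0.1],
}


def subtract(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A crude "gender direction": the difference between "he" and "she".
gender_direction = subtract(vectors["he"], vectors["she"])

# Positive values lean towards "he", negative values towards "she".
lean = {
    word: cosine(vectors[word], gender_direction)
    for word in ("programmer", "homemaker", "teacher")
}

for word, value in lean.items():
    print(f"{word}: {value:+.2f}")
```

In this toy data, "programmer" leans towards "he", "homemaker" towards "she", and "teacher" is neutral. With real embeddings, the same kind of measurement is one way to check whether occupation words carry an association that reflects historical stereotypes rather than meaning.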

The high costs reduce the motivation of those responsible to rebuild flawed data sets. Since the algorithms are often closely guarded company secrets, it is also difficult for victims of discrimination to produce legally valid evidence or to gain access to the data or the way it is processed.

This fact, and the human factor behind algorithmic bias, are currently the subject of heated debate among scientists, specialists, politicians and journalists. Organizations such as the Algorithmic Justice League or AI Now are actively committed to combating algorithmic bias. Some early proposals call for diversifying the industry, which to this day mainly employs white, male specialists. Other experts propose comprehensive legal measures, for example forcing companies to make their algorithms transparent and to publish them.

Conclusion: Artificial intelligence and machine learning are only as good as the people who design them. With the growing popularity of new technologies, data scientists and programmers are under more public scrutiny than ever. However, placing these professionals under general suspicion or bringing more diversity to the industry will not solve the problem of algorithmic bias on its own. Diversity and empowerment are important, but everyone, regardless of origin or gender, can be influenced by conscious or unconscious bias. Above all, therefore, the technical process of data processing must be questioned and, where possible, improved. A legal requirement for transparency can also motivate companies to prioritize and improve the quality of their data and data processing.


  • Image: Photo by Bekir Mülazımoğlu, EyeEm
  • Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai
  • Teaching Fairness to Artificial Intelligence: Existing and Novel Strategies against Algorithmic Discrimination under EU Law, Dr. Philipp Hacker, LL.M. (Yale)
  • Big Data's Disparate Impact, Solon Barocas & Andrew D. Selbst