We believe in meritocracy as one of the cornerstones of Western civilization. This idea is so embedded in our culture that we seldom question it. We praise those driven individuals who combine effort and talent to make it to the top. They smile in the cover of the magazines, the dreams of children and the nightmares of adults. Sports might be the activity where the blood, sweat and tears feel most real (maybe I should say relatable) and therefore are most notoriously broadcasted to the world.

The beautiful game is no exception. But what if we were wrong about the meritocracy assumption? What if there were random events that sneak in the player selection process? What if such events were not negligible, but heavily influenced what players that make it to the top?


I write fiction and non-fiction.
I write open-source software.
I create generative art.

All of these are available for free in different media. If you like what I do, and want me to keep creating, you can contribute using the links below.

Patreon Become a Patron!
Paypal


One trimester

Three or four months of difference at 25 year old is nothing1. Moreover, for most purposes in adult life, such a small difference in age is considered zero. But elite athletes are not ordinary adults and we don’t care about most of their purposes in life. We want them to do one thing extraordinarily well.

What could a trimester difference do to a 25 year old, prime time, football player? You might still find yourself asking this question and this question might sound reasonable. But the data looks quite different. Let’s take a look.

The database

I compiled a medium size database of football players from www.soccerwiki.com. This data base is not perfectly complete, but contains a good enough number of players to provide support to the general idea. Here’s a table of the number of players by country:

country ARG BRA COL ENG ESP FRA GER ITA URU
n 2995 4606 1716 3967 3322 2453 2038 3297 1316

Here’s a month by month representation of the number of players born in each country.

This looks quite strange, there’s a clear trend, with most players being born in the first months of the year. Maybe it’s a seasonal thing?

Well…It’s quite difficult to support the idea that players born on any season are better. For example, January in Brazil is blazing hot summer while it’s freezing cold winter in Germany. Still, both countries seem to produce the most number of players in the first month of the year. In fact, the data show that being born on the first trimester after the cutoff date (e.g, January to March if cutoff is January 1st) increases the chances of being a top player by a huge margin.

Analysis by position

There might be some logic behind this phenomenon. Maybe the birth effect is big for positions most strongly associated with bare physical strength. Thus, we could expect to find a stronger effect for those in defensive positions (i.e, Goalkeeper up to Defensive Midfielders) and find a broader distribution of birth dates for those positions that are reserved for the creatives and magicians of the ball, the offensive players.

Again, the evidence for a strong bias favoring players born on the first trimester is compelling. What’s going on here?

The one percent difference

This is not a post about how the top athletes distinguish themselves from the rest because of their talent and effort. This post is about something completely out of their control: their birth dates. Yes, it sounds crazy, but the month of birth plays a huge role in the success of a football player.

Football schools around the world are normally organized around the calendar year. It makes sense, if you are going to make a kid tournament, instead of having kids of different ages competing together, you grab all the kids born on a same year and make them play against each other (5 year-old kids vs other 5 year-old kids). For example, when I was playing these tournaments, I played in the 1991 category.

Scouts and coaches might decide to promote a strong/talented kid. In my case, I could play for the 1990 or 1989 categories. But we all agree that it would be unfair to downgrade me to play for the 1995 category. I would have a huge developmental advantage. So far so good, but there’s something that escapes from everyone: one year is a lot of time. The developmental difference between a 5 year old kid born at the beginning of the year (e.g, January 1st) vs one at the very end (e.g, December 31st) is huge.

Let’s take a less extreme case, kids born with a 180 days difference. That amount of days corresponds to an extra fraction of life lived (aka experience, aka time kicking the ball) of 0.1. It might sound like very little but that 10% advantage will definitely compound over the competitive career.

Logic behind

It turns out to be that scouts and coaches think they are selecting the best, but they are selecting the oldest. The January 1st cutoff is arbitrary and favors those born close to the date, who will be slightly stronger/faster/better only because they have had more time to play around in the world and, basically, grow. If you look closely, the only country that does not follow the pattern is the UK. But, yes, you guessed it, the cutoff date for the UK is August 31st.

Players that are selected as the best normally start and play more minutes each game [Experience based, Citation needed]. They get better training and more opportunities to play, either because they play more or because coaches pay more attention to them. Each extra minute reinforces the small initial difference, creating a real one. Of course, talent and effort play a role. But it consistently turns out to be that players born earlier on the calendar year end up being better than the other players. This phenomenon is known as relative age effect 2.

Whenever we see skewed age distributions we should suspect it is likely due to arbitrary cuts. The magic ingredients go like this:

  1. Make an arbitrary cut.
  2. Select those that are a tad ahead, those closest to the cut (older people relative to cut).
  3. Differentiate training experience by giving more opportunities to those already ahead.
  4. Harvest the best players (older people relative to cut).

This is difficult to detect because it’s a self-fulfilled profecy for scouts (after all, they want their picks to end up being good). Basketball doesn’t have this kind of problem, mostly because availability of opportunities to train and the absence of strict cutoff dates. But in other sports, this effect is even more marked. In ice hockey, you are mostly constrained to the winter season to play and you need a rink, which puts more barriers on training3.

Market Value

I couldn’t scrape enough to make a case about market value. But a quick Google search lead me to a paper suggesting that, although month of birth has a big effect in the frequency of players, once the pros made it, their salaries are not related to such factor 4.

The bad news

This effect transcends beyond sports. Turns out that arbitrary cutoffs during development affect academic performance. Kids born near the cutoff perform better at school and go to better universities. I cannot stop thinking that I was somewhat fortunate in this account. Overall, school never felt like too difficult. But maybe it’s because I had had enough days on earth and was mature enough to grasp concepts. Maybe it’s not only that “Math and Science came easy to me and I liked them”. Maybe I was born at the correct moment and I wouldn’t be pursuing academic endeavors if I hadn’t been born around the cutoff date in my country (July 1st).

Dividing school kids by age is retrograde. I would like to see a continuous progress with no boundaries on dates. I would like to see no kid be made to wait because he isn’t old enough yet and, more importantly, I would like to see no hurrying kids through content because their age says so. I would like to see people moving through school at their own pace5. Maybe we should start looking at performance adjusted by age, it might be the antidote we are looking for.

The silverlining

We can account for this effect and attempt to correct it! Some countries have started to make adjustments to the birth effect in football. An easy way of doing it is using two cutoff dates and training coaches (and scouts) against their bias.

Solving this issue might require clubs to have two teams per year. This measure doesn’t have to go all the way until they are pro, but at least until kids are old enough, so that we keep below the 6-8 month difference. This will possibly increase the costs for everyone and makes logistics more difficult. However, we would see a huge increase in the production of players (and/or their quality) if we just change this arbitrary dates.

Wait a second

There’s a possible hypothesis I’ve been evading. The reasoning will be like this:

  1. The frequency of Football players is a proportion of the people born in a country.
  2. The frequency of births in the whole country is higher in the months that the Football Associations of each country use as cutoffs.

If both things were true, then that explains why we see a peak of players in January (or whatever the specific month of cutoff date is). I will use data from my country to show that births do not follow Football.

Argentina

My country is well known for the production and exportation of football players. I had access to data on births in Buenos Aires. I later gained access to the data for the whole country, which showed mostly the same trend (you can see it at the very bottom of the article). The number of births in Argentina are somewhat constant. If anything, March seems to be the month with most consistent high amount of births.


Code

For those of you wondering about the analysis, here’s the R code used to produce this article.

The database was grown using custom functions and compiled into a football_database.csv of selected countries. I didn’t have fast and easy way to access www.soccerwiki.com so getting the selected countries was on itself a challenge (yes, scraping is the name of the game and it took a long time (~1-4 sec per player)).

Some packages we will use:

library(dplyr)
library(tidyr)
library(ggplot2)
# Read the data
df <- read.csv("data_for_posts/football_database.csv", stringsAsFactors = FALSE)

head(df)
##   country   id  first_name last_name      birth year month day position
## 1     ARG  584      Julián   SPERONI 1979-05-18 1979     5  18       GK
## 2     ARG 1353 Juan Carlos    RAPONI 1980-05-07 1980     5   7     M(L)
## 3     ARG 1752       Julio      ARCA 1981-01-31 1981     1  31    M(LC)
## 4     ARG 2324       Lucas   RIMOLDI 1980-08-07 1980     8   7  DM,M(C)
## 5     ARG 2341     Nicolás  BURDISSO 1981-04-12 1981     4  12    D(RC)
## 6     ARG 2395    Leonardo TALAMONTI 1981-11-12 1981    11  12     D(C)

We need to fix the column position. For this analysis, we don’t really care about left, center or right positioning of each player.

# Fix the position column
df <- df %>% mutate(place1 = str_extract(string = position, pattern = "\\([A-Z]+\\)"),
              position = gsub(df$position, pattern = "\\([A-Z]+\\)", replacement = ""),
              place2 = str_extract(string = position, pattern = "\\([A-Z]+\\)"),
              position = gsub(df$position, pattern = "\\([A-Z]+\\)", replacement = "")
              ) %>% 
        separate(position, into=c("pos1", "pos2"), by=",", remove = FALSE)

We will only use pos1 as the position for the rest of the analysis.

# Barplot not shown
ggplot(df, aes(month)) +
  geom_bar()+
  facet_wrap(~pos1, scales="free")

# lollipop plot country
country_lollipop <- df %>%
                    group_by(country,month) %>%
                    count() %>%
                    ggplot(aes(month, n)) +
                    geom_point() +
                    facet_wrap(~country, scales="free_y") +
                    geom_linerange(aes(ymin=0, ymax=n))+
                    theme_bw()+ scale_x_continuous(breaks=1:12,
                                                 labels=month.abb)+
  ylab("Number of Players") +
  xlab("")+ theme(axis.text.x = element_text(angle = 45, hjust=1))+
  ggtitle("Player Distribution by month of birth")


# lollipop plot position
pos_lollipop <- df %>%
                    group_by(month, pos1) %>%
                    count() %>% ungroup() %>%
                    mutate(pos1 = factor(pos1,
                                         levels=c("GK", "D", "DM", "M", "AM", "F"),
                                         labels=c("GoalKeeper",
                                                  "Defender",
                                                  "Defensive-Midfielder",
                                                  "Midfielder",
                                                  "Attacking-Midfielder",
                                                  "Forward"))) %>%
                    ggplot(aes(month, n)) +
                    geom_point() +
                    facet_wrap(~pos1, scales="free_y") +
                    geom_linerange(aes(ymin=0, ymax=n))+
                    theme_bw()+ scale_x_continuous(breaks=1:12,
                                                 labels=month.abb)+
  ylab("Number of Players") +
  xlab("")+ theme(axis.text.x = element_text(angle = 45, hjust=1))+
  ggtitle("Player Distribution by month of birth")

The birth date difference plot and other analysis for Argentina were done with the following code.

# We asume 365 days per year

percentage_days <- tibble(years = rep(c(1:100),4), 
                          year_days = years * 365,
                          days_difference = sort(rep(c(90,270,180, 360), 100))
                          ) 


main.plot <- ggplot(percentage_days,
         aes(years, days_difference/year_days,
             color=factor(days_difference)))+
  geom_line(lwd=1)+
  theme_bw()+
  theme(legend.position = "bottom")+
  ylab("Fraction of life")+
  ggtitle("Competitive advantage")+
  scale_color_grey()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(color='Birthdate difference (days)') 

  
inset.plot <- ggplot(filter(percentage_days, between(years, 5, 10)),
                     aes(years, days_difference/year_days,
             color=factor(days_difference)))+
  geom_line(lwd=1)+
  theme_bw()+ ylab("")+
  theme(legend.position = "none")+
  scale_color_grey()

plot.with.inset <-
  cowplot::ggdraw() +
  cowplot::draw_plot(main.plot) +
  cowplot::draw_plot(inset.plot, x = 0.3, 0.4,
                     width = 0.6, height = 0.5)

Ciudad de Buenos Aires births data.

nacimientos <- read.csv("datos_nacimientos.csv",
                        stringsAsFactors = FALSE)

# We need this to translate months to english
meses <- c("Enero", "Febrero", "Marzo", "Abril", 
           "Mayo", "Junio", "Julio", "Agosto",
           "Septiembre", "Octubre", "Noviembre", "Diciembre")

english_month <- month.abb[1:12]

translate_df <- data.frame(mes = meses,
                           english_month = english_month)

nacimientos <- nacimientos %>%
  left_join(translate_df, by="mes") %>%
  mutate(english_month = factor(english_month, levels = month.abb)) %>%
  arrange(year)
# Bar plot, not shown
ggplot(nacimientos,
       aes(english_month, nacimientos)) +
  geom_col() +
  facet_wrap(~year)

# Boxplot 
caba_boxplot <- ggplot(nacimientos,
       aes(english_month, nacimientos)) +
  geom_boxplot(fatten=2.5, fill="orange", alpha=0.8)+
  theme_bw()+
  geom_point(alpha=0.8)+
  ggtitle("Births by month", subtitle = "Years 2005 to 2018")+
  ylab("Number of birhts")+ xlab("")+
  labs(caption="Source: Gobierno de la Ciudad de Buenos Aires")

date <- seq(as.Date("2005-01-01"), by="1 month", length.out= nrow(nacimientos))

# lineplot not shown
ggplot(nacimientos, aes(x=date, y=nacimientos)) +
  geom_hline(yintercept = mean(nacimientos$nacimientos), lty=2, color="gray50")+
  geom_line()+
  theme_bw()+
  xlab("Año")+
  ggtitle("Nacimientos en Ciudad de Buenos Aires") +
  labs(caption="Fuente: Gobierno de la Ciudad")

# Tabla summary
sum_nacimientos <- nacimientos %>%
                      group_by(english_month) %>%
                      summarise(mean_nac = mean(nacimientos))

Births in Argentina

df <- readxl::read_excel("data_for_posts/nacidos_vivos.xlsx")

df <- mutate(df, MES = janitor::excel_numeric_to_date(as.numeric(MES)),
             year = lubridate::year(MES),
             month = lubridate::month(MES)) %>%
  filter( year > 2005) %>%
  group_by(MES) %>%
  mutate(births = sum(CANTIDAD)) %>%
  arrange(MES)
  
# TilePlot
ggplot(filter(df, !(year==2017 & month == 12)),
              aes(x=month, y = year, fill=log10(births)))+
  geom_tile(color="white")+
  scale_fill_viridis_c()+
  scale_x_continuous(breaks=1:12, expand=c(0.01,0.01))+
  scale_y_continuous(breaks=2006:2017, expand=c(0.01, 0.01))+
  theme(panel.background = element_rect(fill=NA),
        legend.position = "bottom")+
    ggtitle("Births in Argentina")+
  labs(caption="Source: Ministerio de Salud")


  1. It’s actually 1.32 % of life.

  2. Great general info here https://en.wikipedia.org/wiki/Relative_age_effect

  3. I cannot help but relate this with Science. Limited resources, too many hands trying to pull a piece to themselves… add on some age effect, differential training/opportunities (quite often bought with money and good connections). We are probably loosing so many great minds.

  4. You can find the paper here

  5. Salman Kahn expresses it way better than I could ever do in his TED talk.