I have seen a couple of presentations that use plots of PubMed query results. I have even seen some of them in papers. I just think it’s really cool, so I wanted to play with something that could provide the data.

A couple of google searches lead me into two nice options to do this in R.

  1. RISmed package, see CRAN or Dave Tang’s Blog
  2. A custom approach from Kristoffer Magnusson

I went with the custom approach, decided to borrow heavily from Kristoffer’s repo, and did a few modifications here and there. Mainly, I updated libraries, included some dplyr output to make it cleaner, and separated functions into several files.

You can find the updated code in the following repo: https://github.com/matiasandina/pubmed_query

The logic of the code is to loop over the search terms and the years, performing queries to PubMed each time. To make things more friendly we wrap everything into a main function that performs some checks and handles the multiple calls to the working functions. This main function, query_pubmed(), expects a query (character vector), and 2 years for the time interval (yrStart and yrMax).

The function is somewhat self contained, if it can’t find things on the local computer it will source from GitHub1. Here’s a little demo of the main function query_pubmed(). Since we are using internet to get the data, I assume the user will be able to source from GitHub (these calls are often performed via devtools::source_url).

Little demo

Let’s look for the term hiv in publications from the 1970 until today. PubMed requests us to limit the traffic to ~3 queries per second. Thus, queries will take a while because the function has Sys.sleep(0.5) in between iterations. You will see a progress bar for each term (not shown here for simplicity).

hiv <- query_pubmed("hiv", 1970, 2018)
PubMed publications containing the term HIV relative to the total number of publications.

Figure 1: PubMed publications containing the term HIV relative to the total number of publications.

I chose to keep the graphic output as simple as possible (aka use ggplot2 defaults) and return a data.frame that can be fed into a custom ggplot2 call later, if the users feel like it. Here’s a glimpse of the returned object.

##   query_term year count total_count         freq
## 1        hiv 1970     1      219356 0.0004558799
## 2        hiv 1971     0      223605 0.0000000000
## 3        hiv 1972     0      227803 0.0000000000
## 4        hiv 1973     0      231079 0.0000000000
## 5        hiv 1974     0      235052 0.0000000000
## 6        hiv 1975     0      249157 0.0000000000

Making things faster

Total publication numbers should not change2. Thus, if we don’t want to waste time grabbing the total number of publications over and over, we can either:

  1. Use get_totals()
  2. Get it from GitHub

I will do my best, but I can’t be certain I will keep running the function and pushing once a year to GitHub (as in forever)3. I don’t feel like waiting, I already have a recent version in the repo.

# Option one: run the function
# total_table_updated <- get_totals(1947,2018)

# Option two
total_table_updated <- read.csv('https://raw.githubusercontent.com/matiasandina/pubmed_query/master/data/total_table_updated.csv')

Having this object around will speed the main function (it will not query PubMed every year for the totals). Here’s a graph of the number of publications by year:

Total publications in PubMed by year. Science is growing :)

Figure 2: Total publications in PubMed by year. Science is growing :)

Multiple terms

We can use multiple terms to query, just make a character vector. For example, let’s add aids and hepatitis b:

aids_hepB <- query_pubmed(c('aids', 'hepatitis b'), 1970, 2018)
## Searching for: aids
## Searching for: hepatitis b
## All done!

Because we saved the previous object in the environment, we don’t have to query again, we can merge the data and plot all together.

bind_rows(hiv, aids_hepB) %>%
  ggplot(aes(year, freq, fill=query_term)) +
  geom_area(position = "identity",
  geom_line(lwd=0.4, color="black")+
  scale_fill_manual(values = c("#EBCF02",
  theme(legend.position = "bottom",
        legend.title = element_blank())+
  ylab("Relative frequency (%)")
Frequency of query terms. Relative to total number of PubMed entries per year.

Figure 3: Frequency of query terms. Relative to total number of PubMed entries per year.

We see that the term aids came first in the literature, before the virus was identified, in the early 1980s. Although strongly correlated with aids, hiv is a term with higher frequency. Research for hepatitis b seems to have kept a constant relative level, growing as much as the total body of research.

If you are really dying to know about my session info, be my guest:

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## Matrix products: default
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] bindrcpp_0.2.2 dplyr_0.7.6    httr_1.3.1     XML_3.98-1.16 
## [5] ggplot2_3.0.0 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.19     highr_0.7        pillar_1.3.0     compiler_3.5.1  
##  [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.1      digest_0.6.17   
##  [9] evaluate_0.12    memoise_1.1.0    tibble_1.4.2     gtable_0.2.0    
## [13] pkgconfig_2.0.2  rlang_0.2.2      curl_3.2         yaml_2.2.0      
## [17] blogdown_0.8     xfun_0.3         withr_2.1.2      stringr_1.3.1   
## [21] knitr_1.20       devtools_1.13.6  rprojroot_1.3-2  grid_3.5.1      
## [25] tidyselect_0.2.4 glue_1.3.0       R6_2.3.0         rmarkdown_1.10  
## [29] bookdown_0.7     purrr_0.2.5      magrittr_1.5     backports_1.1.2 
## [33] scales_1.0.0     htmltools_0.3.6  assertthat_0.2.0 colorspace_1.3-2
## [37] labeling_0.3     stringi_1.1.7    lazyeval_0.2.1   munsell_0.5.0   
## [41] crayon_1.3.4

  1. Granted, several packages are needed to run the code. I assume users will know how to install.packages(...) if in need.

  2. There are some variations in the recent years as the data base updates, but shouldn’t be significant for these purposes.

  3. Yes, I guess I could automate it but reaching diminishing returns here…