Abstract

A lot of relational datasets are processed by SQL queries to reduce coding steps and execution time. Since SAS has been the preferred programming language in pharma for decades, PROC SQL has been the most common way to use SQL in the clinical context. But as more and more users tend to open-source programming, R packages which facilitate work on clinical trial datasets are rapidly gaining importance. The R package SQLDF offers a simple way to use SQL queries. Its coding style shows a big similarity to PROC SQL in SAS and enables traditional SAS users to easily use SQL in R. This paper is an addition to the poster which demonstrates the similarity of the PROC SQL procedure in SAS and the SQLDF package in R and can be used as handy cheat sheet for SAS users to step into R programming.The paper will show how to reproduce the examples in R.

SQLDF

citation("sqldf")
## 
## To cite package 'sqldf' in publications use:
## 
##   Grothendieck G (2017). _sqldf: Manipulate R Data Frames Using SQL_. R
##   package version 0.4-11, <https://CRAN.R-project.org/package=sqldf>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {sqldf: Manipulate R Data Frames Using SQL},
##     author = {G. Grothendieck},
##     year = {2017},
##     note = {R package version 0.4-11},
##     url = {https://CRAN.R-project.org/package=sqldf},
##   }
## 
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

Per default, PROC SQL does not connect to any external database management system (DBMS), but operates directly on SAS datasets. SQLDF on the other hand needs an external DBMS. When executing an SQLDF query on a R dataframe, the database is automatically built and after the execution deleted again. It is basically not visible for the user. By default, when executing an SQLDF query an SQLite database is build in the backend. Other possible DBMS are PostgreSQL, H2 and MySQL. Depending on which backend database SQLDF is working, there can be some minor differences in the behavior of some queries. Also PROC SQL can be connected to any external DB. In this case, PROC SQL supports every SQL query element, the DBMS supports. If PROC SQL and SQLDF work on the same DBMS, there will not be any different behavior. Differences shown on this paper refer to the default settings PROC SQL operating on SAS datasets and SQLDF connecting to an SQLite database.

First of all the package sqldf must be installed and loaded to R-Studio.

install.packages("sqldf")
library(sqldf)

Similarities between PROC SQLand SQLDF

In this chapter we are going to see some similarities between SQLDFand PROC SQL.

select, where, order by

The basic functions select, where and order by work the same ether in SQLDF as in PROC SQL. To show the functions in SQLDFwe create an example dataset as following. The command seed(2) assigns a number on which the random number generator starts to generate pseudo random numbers. We should always use it, to reproduce the same results despite using random numbers.

set.seed(2)
randomdf <- data.frame(fruit = sample(c('apple','orange','banana'), 10, replace = TRUE, prob = c(0.3,0.3,0.4)), number = round(runif(10, min=1, max=6)), name = sample(c('Harry','Ron','Hermione', 'Dumbledore', 'Voldemort'), 10, replace = TRUE, prob = c(0.2,0.2,0.2,0.2,0.2)), years = round(runif(10, min=1, max=90)))
randomdf
##     fruit number       name years
## 1  banana      4  Voldemort     2
## 2   apple      2   Hermione    16
## 3  orange      5      Harry    73
## 4  banana      2        Ron    78
## 5   apple      3   Hermione    47
## 6   apple      5 Dumbledore    57
## 7  banana      6        Ron    76
## 8   apple      2   Hermione    26
## 9  orange      3      Harry    60
## 10 orange      1        Ron    14

select indicates the columns to choose, where indicates the rows which are going to be displayed in the new dataset and order by describes on which column the data should be sorted. On default the sorting occurs ascending unless we add the parameter desc for “descending” after the column name as we can see in the following example.

sqldf("select fruit, number, years
      from randomdf
      where number < 5
      order by fruit, years desc")
##    fruit number years
## 1  apple      3    47
## 2  apple      2    26
## 3  apple      2    16
## 4 banana      2    78
## 5 banana      4     2
## 6 orange      3    60
## 7 orange      1    14

Text Operators

To show the function of Text operators we use the same dataframe as in the last chapter. We can see that unlike PROC SQL SQLDFis not case sensitive. The values “banana” appear on the output dataset even though we are looking for values with a big “B”. Using the same query in PROC SQL will lead to an output which contains only the value “apple”.

sqldf("select fruit
      from randomdf
      where fruit like 'B%' 
      or fruit like '%le'")
##    fruit
## 1 banana
## 2  apple
## 3 banana
## 4  apple
## 5  apple
## 6 banana
## 7  apple

Join two datasets

Joining two dataframes is the same in SQLDF as in PROC SQL. To show the function in SQLDF we create two dataframes as follows:

countries <- data.frame(ID = c(1,2,3),country_name= c('Germany', 'Italy', 'France'))
cities <- data.frame(zip = c(10713, 60329, 75001, 1030), city_name =c('Berlin', 'Frankfurt', 'Paris', 'Vienna'), country_id = c(1,1,3,4))
countries
##   ID country_name
## 1  1      Germany
## 2  2        Italy
## 3  3       France
cities
##     zip city_name country_id
## 1 10713    Berlin          1
## 2 60329 Frankfurt          1
## 3 75001     Paris          3
## 4  1030    Vienna          4

(Inner) Join

Using the command join causes an inner join. Instead of join we could also use inner join and it leads to the same output. An inner join only displays the observations with matching values on both tables.

sqldf("select a.ID, a.country_name, b.city_name, b.zip
      from countries as a
      join cities as b
      on a.ID = b.country_id")
##   ID country_name city_name   zip
## 1  1      Germany    Berlin 10713
## 2  1      Germany Frankfurt 60329
## 3  3       France     Paris 75001

Left Join

A left join displays all observations from the left table, but only the ones with matching values from the right table. The missing values are filled with “NA”.

sqldf("select a.ID, a.country_name, b.city_name, b.zip
      from countries as a
      left join cities as b
      on a.ID = b.country_id")
##   ID country_name city_name   zip
## 1  1      Germany    Berlin 10713
## 2  1      Germany Frankfurt 60329
## 3  2        Italy      <NA>    NA
## 4  3       France     Paris 75001

Right Join

Similar to the left join, the right join displays all observations from the right table and only the ones with matching values from the left table.

sqldf("select a.ID, a.country_name, b.city_name, b.zip
      from countries as a
      right join cities as b
      on a.ID = b.country_id")
##   ID country_name city_name   zip
## 1  1      Germany    Berlin 10713
## 2  1      Germany Frankfurt 60329
## 3  3       France     Paris 75001
## 4 NA         <NA>    Vienna  1030

Full (Outer) Join

Instead of full join we can also use full outer join. The full join leads to an output with all observations from both tables, connecting the ones with matching values and filling the missing values with “NA”. We save this output data frame, beacause we will use it in the next chapter to show how to handle missing values.

city_country <- 
  
sqldf("select a.ID, a.country_name, b.city_name, b.zip
      from countries as a
      full join cities as b
      on a.ID = b.country_id")

city_country
##   ID country_name city_name   zip
## 1  1      Germany    Berlin 10713
## 2  1      Germany Frankfurt 60329
## 3  2        Italy      <NA>    NA
## 4  3       France     Paris 75001
## 5 NA         <NA>    Vienna  1030

Missing Values

In SAS missing values are displayed as blanks for character variables or as . for numeric variables. On the other side in R missing values are displayed as for character variables or as NA for numeric variables. Nevertheless the functions is not null or is null work the same way in both languages.

sqldf("select country_name, city_name
        from city_country
        where zip is not null and country_name is not null;")
##   country_name city_name
## 1      Germany    Berlin
## 2      Germany Frankfurt
## 3       France     Paris

Differences between PROC SQLand SQLDF

In this chapter we will show some differences between PROC SQL and SQLDF

Aggregate Functions

The aggregate functions are used to execute some calculations on a set of values. There are two major differences between PROC SQL and SQLDF. The first one is the function which calculates the standard deviation of a set of values. The call of the function is stdev(values) in SQLDF but to use the same query in PROC SQL we need to use std(values) instead. Both calls will cause an error in the other language. The second difference is the displaying of the new build dataset. In SQLDF the columns will be named with the aggregate function used to build the column. But in PROC SQL the columns will be named with “_TEMG001”, “_TEMG002”…ect.

sqldf("select fruit, avg(years), min(years), max(years), count(fruit), stdev(years), sum(years)
            from randomdf
            group by fruit")
##    fruit avg(years) min(years) max(years) count(fruit) stdev(years) sum(years)
## 1  apple       36.5         16         57            4     18.80603        146
## 2 banana       52.0          2         78            3     43.31282        156
## 3 orange       49.0         14         73            3     31.00000        147

To avoid uninformative columnheaders, we should name the column by adding as <columnname> after the aggregate function. As in the following example. This works for both languages. Note that if the column name has two parts (like in ‘standart deviation’) it mus be in parenthesis otherwise it will cause an error.

sqldf("select fruit, avg(years) as average, min(years) as minimum, max(years) as maximum, count(fruit) as count, stdev(years) as 'standard deviation', sum(years) as sum
            from randomdf
            group by fruit")
##    fruit average minimum maximum count standard deviation sum
## 1  apple    36.5      16      57     4           18.80603 146
## 2 banana    52.0       2      78     3           43.31282 156
## 3 orange    49.0      14      73     3           31.00000 147

Easy Column Renaming

In PROC SQL the naming of the column only works on aggregate functions. If we only want to rename a variable which already has a name, it won’t change the variable name, but the variable lable. If we want to display the lable instead of the name we can change displaying options with options nolabel; in SAS.

DF = data.frame(column_A = c(1,2,3,4,1,2,3,4))
DF
##   column_A
## 1        1
## 2        2
## 3        3
## 4        4
## 5        1
## 6        2
## 7        3
## 8        4

In R there are no column lables. Therefor adding as <columnname will simply rename the columns name.

sqldf("select column_a as a from DF")  
##   a
## 1 1
## 2 2
## 3 3
## 4 4
## 5 1
## 6 2
## 7 3
## 8 4

In SQLDF it is also possible to leave as away and just add the desired name. This will cause an error in PROC SQL.

sqldf("select column_a a from DF")  
##   a
## 1 1
## 2 2
## 3 3
## 4 4
## 5 1
## 6 2
## 7 3
## 8 4

Integer division leads to integer value in R

intquot <- data.frame(a = 1:2, b = 2:1)
intquot
##   a b
## 1 1 2
## 2 2 1

A simple division of two integers will lead to an integer value in SQLDF. If the true result of the division is a float the numeric character after the dot simply get cut.

sqldf("select a/b as quotient from intquot")
##   quotient
## 1        0
## 2        2

If we want to force a real division, we have to convert at least one of the numbers to a float. This could simply be done by adding a +0.0 to one variable.

sqldf("select (a+0.0)/b as quotient from intquot")
##   quotient
## 1      0.5
## 2      2.0

In PROC SQL it is not neccesary to convert the number. PROC SQL will always perform a real division.

Missing grouping

To show the next differnce we first build a dataset with two different group variables. The Observations can be split ether in 0 and 1 or in A,B and C.

set.seed(1)
exampledata <- data.frame(value = rnorm(10), group1 = rbinom(10,1,0.5), group2 = sample(c('A','B','C'), 10, replace = TRUE, prob = c(0.3,0.3,0.4))) 
exampledata
##         value group1 group2
## 1  -0.6264538      1      B
## 2   0.1836433      0      B
## 3  -0.8356286      1      B
## 4   1.5952808      0      C
## 5   0.3295078      0      A
## 6  -0.8204684      0      B
## 7   0.4874291      0      A
## 8   0.7383247      0      C
## 9   0.5757814      1      A
## 10 -0.3053884      0      B

If we want to find the maximum numeric value of each combined group, we have to mention both group variables in the group by command.

sqldf("select max(value) as max_value,
                 group1, group2
                 from exampledata
                 group by group1, group2")
##    max_value group1 group2
## 1  0.4874291      0      A
## 2  0.1836433      0      B
## 3  1.5952808      0      C
## 4  0.5757814      1      A
## 5 -0.6264538      1      B

If we want to find only the maximum value of each group in the first group variable, but still display the value of the second group variable in SQLDF only the observation which contains the maximum value will be displayed as the following example shows:

sqldf("select max(value) as max_value,
                 group1, group2
                 from exampledata
                 group by group1")
##   max_value group1 group2
## 1 1.5952808      0      C
## 2 0.5757814      1      A

If we run the same code in PROC SQL every observation will be displayed, mapping the maximum value of each group of the first group variable to the corresponding group.

Nevertheless using the SQL-query above is not a good programming practice. As we can see in the example of other aggregate functions like the average, the value of the second group variable is simply the first value which appears together with the corresponding value of the first group variable.

sqldf("select avg(value) as max_value,
                 group1, group2
                 from exampledata
                 group by group1")
##    max_value group1 group2
## 1  0.3154756      0      B
## 2 -0.2954337      1      B