A lot of relational datasets are processed by SQL queries to reduce coding steps and execution time. Since SAS has been the preferred programming language in pharma for decades, PROC SQL has been the most common way to use SQL in the clinical context. But as more and more users tend to open-source programming, R packages which facilitate work on clinical trial datasets are rapidly gaining importance. The R package SQLDF offers a simple way to use SQL queries. Its coding style shows a big similarity to PROC SQL in SAS and enables traditional SAS users to easily use SQL in R. This paper is an addition to the poster which demonstrates the similarity of the PROC SQL procedure in SAS and the SQLDF package in R and can be used as handy cheat sheet for SAS users to step into R programming.The paper will show how to reproduce the examples in R.
citation("sqldf")
##
## To cite package 'sqldf' in publications use:
##
## Grothendieck G (2017). _sqldf: Manipulate R Data Frames Using SQL_. R
## package version 0.4-11, <https://CRAN.R-project.org/package=sqldf>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {sqldf: Manipulate R Data Frames Using SQL},
## author = {G. Grothendieck},
## year = {2017},
## note = {R package version 0.4-11},
## url = {https://CRAN.R-project.org/package=sqldf},
## }
##
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
Per default, PROC SQL does not connect to any external database management system (DBMS), but operates directly on SAS datasets. SQLDF on the other hand needs an external DBMS. When executing an SQLDF query on a R dataframe, the database is automatically built and after the execution deleted again. It is basically not visible for the user. By default, when executing an SQLDF query an SQLite database is build in the backend. Other possible DBMS are PostgreSQL, H2 and MySQL. Depending on which backend database SQLDF is working, there can be some minor differences in the behavior of some queries. Also PROC SQL can be connected to any external DB. In this case, PROC SQL supports every SQL query element, the DBMS supports. If PROC SQL and SQLDF work on the same DBMS, there will not be any different behavior. Differences shown on this paper refer to the default settings PROC SQL operating on SAS datasets and SQLDF connecting to an SQLite database.
First of all the package sqldf must be installed and
loaded to R-Studio.
install.packages("sqldf")
library(sqldf)
PROC SQLand
SQLDFIn this chapter we are going to see some similarities between
SQLDFand PROC SQL.
The basic functions select, where and
order by work the same ether in SQLDF as in
PROC SQL. To show the functions in SQLDFwe
create an example dataset as following. The command seed(2)
assigns a number on which the random number generator starts to generate
pseudo random numbers. We should always use it, to reproduce the same
results despite using random numbers.
set.seed(2)
randomdf <- data.frame(fruit = sample(c('apple','orange','banana'), 10, replace = TRUE, prob = c(0.3,0.3,0.4)), number = round(runif(10, min=1, max=6)), name = sample(c('Harry','Ron','Hermione', 'Dumbledore', 'Voldemort'), 10, replace = TRUE, prob = c(0.2,0.2,0.2,0.2,0.2)), years = round(runif(10, min=1, max=90)))
randomdf
## fruit number name years
## 1 banana 4 Voldemort 2
## 2 apple 2 Hermione 16
## 3 orange 5 Harry 73
## 4 banana 2 Ron 78
## 5 apple 3 Hermione 47
## 6 apple 5 Dumbledore 57
## 7 banana 6 Ron 76
## 8 apple 2 Hermione 26
## 9 orange 3 Harry 60
## 10 orange 1 Ron 14
select indicates the columns to choose,
where indicates the rows which are going to be displayed in
the new dataset and order by describes on which column the
data should be sorted. On default the sorting occurs ascending unless we
add the parameter desc for “descending” after the column
name as we can see in the following example.
sqldf("select fruit, number, years
from randomdf
where number < 5
order by fruit, years desc")
## fruit number years
## 1 apple 3 47
## 2 apple 2 26
## 3 apple 2 16
## 4 banana 2 78
## 5 banana 4 2
## 6 orange 3 60
## 7 orange 1 14
To show the function of Text operators we use the same dataframe as
in the last chapter. We can see that unlike PROC SQL
SQLDFis not case sensitive. The values “banana” appear on
the output dataset even though we are looking for values with a big “B”.
Using the same query in PROC SQL will lead to an output
which contains only the value “apple”.
sqldf("select fruit
from randomdf
where fruit like 'B%'
or fruit like '%le'")
## fruit
## 1 banana
## 2 apple
## 3 banana
## 4 apple
## 5 apple
## 6 banana
## 7 apple
Joining two dataframes is the same in SQLDF as in
PROC SQL. To show the function in SQLDF we
create two dataframes as follows:
countries <- data.frame(ID = c(1,2,3),country_name= c('Germany', 'Italy', 'France'))
cities <- data.frame(zip = c(10713, 60329, 75001, 1030), city_name =c('Berlin', 'Frankfurt', 'Paris', 'Vienna'), country_id = c(1,1,3,4))
countries
## ID country_name
## 1 1 Germany
## 2 2 Italy
## 3 3 France
cities
## zip city_name country_id
## 1 10713 Berlin 1
## 2 60329 Frankfurt 1
## 3 75001 Paris 3
## 4 1030 Vienna 4
Using the command join causes an inner join. Instead of
join we could also use inner join and it leads
to the same output. An inner join only displays the observations with
matching values on both tables.
sqldf("select a.ID, a.country_name, b.city_name, b.zip
from countries as a
join cities as b
on a.ID = b.country_id")
## ID country_name city_name zip
## 1 1 Germany Berlin 10713
## 2 1 Germany Frankfurt 60329
## 3 3 France Paris 75001
A left join displays all observations from the left table, but only the ones with matching values from the right table. The missing values are filled with “NA”.
sqldf("select a.ID, a.country_name, b.city_name, b.zip
from countries as a
left join cities as b
on a.ID = b.country_id")
## ID country_name city_name zip
## 1 1 Germany Berlin 10713
## 2 1 Germany Frankfurt 60329
## 3 2 Italy <NA> NA
## 4 3 France Paris 75001
Similar to the left join, the right join displays all observations from the right table and only the ones with matching values from the left table.
sqldf("select a.ID, a.country_name, b.city_name, b.zip
from countries as a
right join cities as b
on a.ID = b.country_id")
## ID country_name city_name zip
## 1 1 Germany Berlin 10713
## 2 1 Germany Frankfurt 60329
## 3 3 France Paris 75001
## 4 NA <NA> Vienna 1030
Instead of full join we can also use
full outer join. The full join leads to an output with all
observations from both tables, connecting the ones with matching values
and filling the missing values with “NA”. We save this output data
frame, beacause we will use it in the next chapter to show how to handle
missing values.
city_country <-
sqldf("select a.ID, a.country_name, b.city_name, b.zip
from countries as a
full join cities as b
on a.ID = b.country_id")
city_country
## ID country_name city_name zip
## 1 1 Germany Berlin 10713
## 2 1 Germany Frankfurt 60329
## 3 2 Italy <NA> NA
## 4 3 France Paris 75001
## 5 NA <NA> Vienna 1030
In SAS missing values are displayed as blanks for
character variables or as . for numeric variables. On the other side in
R missing values are displayed as is not null or is null work the same way in
both languages.
sqldf("select country_name, city_name
from city_country
where zip is not null and country_name is not null;")
## country_name city_name
## 1 Germany Berlin
## 2 Germany Frankfurt
## 3 France Paris
PROC SQLand SQLDFIn this chapter we will show some differences between
PROC SQL and SQLDF
The aggregate functions are used to execute some calculations on a
set of values. There are two major differences between
PROC SQL and SQLDF. The first one is the
function which calculates the standard deviation of a set of values. The
call of the function is stdev(values) in SQLDF
but to use the same query in PROC SQL we need to use
std(values) instead. Both calls will cause an error in the
other language. The second difference is the displaying of the new build
dataset. In SQLDF the columns will be named with the
aggregate function used to build the column. But in
PROC SQL the columns will be named with “_TEMG001”,
“_TEMG002”…ect.
sqldf("select fruit, avg(years), min(years), max(years), count(fruit), stdev(years), sum(years)
from randomdf
group by fruit")
## fruit avg(years) min(years) max(years) count(fruit) stdev(years) sum(years)
## 1 apple 36.5 16 57 4 18.80603 146
## 2 banana 52.0 2 78 3 43.31282 156
## 3 orange 49.0 14 73 3 31.00000 147
To avoid uninformative columnheaders, we should name the column by
adding as <columnname> after the aggregate function.
As in the following example. This works for both languages. Note that if
the column name has two parts (like in ‘standart deviation’) it mus be
in parenthesis otherwise it will cause an error.
sqldf("select fruit, avg(years) as average, min(years) as minimum, max(years) as maximum, count(fruit) as count, stdev(years) as 'standard deviation', sum(years) as sum
from randomdf
group by fruit")
## fruit average minimum maximum count standard deviation sum
## 1 apple 36.5 16 57 4 18.80603 146
## 2 banana 52.0 2 78 3 43.31282 156
## 3 orange 49.0 14 73 3 31.00000 147
In PROC SQL the naming of the column only works on
aggregate functions. If we only want to rename a variable which already
has a name, it won’t change the variable name, but the variable lable.
If we want to display the lable instead of the name we can change
displaying options with options nolabel; in
SAS.
DF = data.frame(column_A = c(1,2,3,4,1,2,3,4))
DF
## column_A
## 1 1
## 2 2
## 3 3
## 4 4
## 5 1
## 6 2
## 7 3
## 8 4
In R there are no column lables. Therefor adding
as <columnname will simply rename the columns name.
sqldf("select column_a as a from DF")
## a
## 1 1
## 2 2
## 3 3
## 4 4
## 5 1
## 6 2
## 7 3
## 8 4
In SQLDF it is also possible to leave as
away and just add the desired name. This will cause an error in
PROC SQL.
sqldf("select column_a a from DF")
## a
## 1 1
## 2 2
## 3 3
## 4 4
## 5 1
## 6 2
## 7 3
## 8 4
intquot <- data.frame(a = 1:2, b = 2:1)
intquot
## a b
## 1 1 2
## 2 2 1
A simple division of two integers will lead to an integer value in
SQLDF. If the true result of the division is a float the
numeric character after the dot simply get cut.
sqldf("select a/b as quotient from intquot")
## quotient
## 1 0
## 2 2
If we want to force a real division, we have to convert at least one
of the numbers to a float. This could simply be done by adding a
+0.0 to one variable.
sqldf("select (a+0.0)/b as quotient from intquot")
## quotient
## 1 0.5
## 2 2.0
In PROC SQL it is not neccesary to convert the number.
PROC SQL will always perform a real division.
To show the next differnce we first build a dataset with two different group variables. The Observations can be split ether in 0 and 1 or in A,B and C.
set.seed(1)
exampledata <- data.frame(value = rnorm(10), group1 = rbinom(10,1,0.5), group2 = sample(c('A','B','C'), 10, replace = TRUE, prob = c(0.3,0.3,0.4)))
exampledata
## value group1 group2
## 1 -0.6264538 1 B
## 2 0.1836433 0 B
## 3 -0.8356286 1 B
## 4 1.5952808 0 C
## 5 0.3295078 0 A
## 6 -0.8204684 0 B
## 7 0.4874291 0 A
## 8 0.7383247 0 C
## 9 0.5757814 1 A
## 10 -0.3053884 0 B
If we want to find the maximum numeric value of each combined group,
we have to mention both group variables in the group by
command.
sqldf("select max(value) as max_value,
group1, group2
from exampledata
group by group1, group2")
## max_value group1 group2
## 1 0.4874291 0 A
## 2 0.1836433 0 B
## 3 1.5952808 0 C
## 4 0.5757814 1 A
## 5 -0.6264538 1 B
If we want to find only the maximum value of each group in the first
group variable, but still display the value of the second group variable
in SQLDF only the observation which contains the maximum
value will be displayed as the following example shows:
sqldf("select max(value) as max_value,
group1, group2
from exampledata
group by group1")
## max_value group1 group2
## 1 1.5952808 0 C
## 2 0.5757814 1 A
If we run the same code in PROC SQL every observation
will be displayed, mapping the maximum value of each group of the first
group variable to the corresponding group.
Nevertheless using the SQL-query above is not a good programming practice. As we can see in the example of other aggregate functions like the average, the value of the second group variable is simply the first value which appears together with the corresponding value of the first group variable.
sqldf("select avg(value) as max_value,
group1, group2
from exampledata
group by group1")
## max_value group1 group2
## 1 0.3154756 0 B
## 2 -0.2954337 1 B