Using the CEOdata package
CEOdata is a package that facilitates bringing the microdata (individual
responses) of public opinion polls in Catalonia, as carried out by the “Centre
d’Estudis d’Opinió” (CEO, Opinion Studies Center), into R. It has two main
functions, each with a separate purpose:
CEOdata(): provides the data of the surveys directly into R.
CEOmeta(): allows the user to inspect the details of the available surveys (metadata) and to search for specific topics and get the survey details.
CEOdata(): Get the survey data
The most comprehensive kind of data on Catalan public opinion is the
“Barometer”, which is what the main function CEOdata() retrieves by default.
library(CEOdata)
d <- CEOdata()
This provides a cleaned and merged version of all the available Barometers, giving easy access to the following number of responses and variables:
dim(d)
## [1] 33838 814
d
## # A tibble: 33,838 × 814
## PONDERA ORDRE_REVISADA REO BOP_NUM ANY MES DIA HOR_INI HOR_FIN DURADA
## <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.86 1 746 Març 1… 2014 3 NA NA NA 1631
## 2 1.86 2 746 Març 1… 2014 3 NA NA NA 1972
## 3 1.86 3 746 Març 1… 2014 3 NA NA NA 1478
## 4 1.86 4 746 Març 1… 2014 3 NA NA NA 872
## 5 1.86 5 746 Març 1… 2014 3 NA NA NA 1356
## 6 1.86 6 746 Març 1… 2014 3 NA NA NA 1205
## 7 1.86 7 746 Març 1… 2014 3 NA NA NA 1279
## 8 1.86 8 746 Març 1… 2014 3 NA NA NA 995
## 9 1.86 9 746 Març 1… 2014 3 NA NA NA 1497
## 10 1.86 10 746 Març 1… 2014 3 NA NA NA 1221
## # … with 33,828 more rows, and 804 more variables: ENQUESTADOR_CODI <dbl>,
## # ENQUESTADOR_SEXE <fct>, ENQUESTADOR_ESTUDIS <fct>,
## # ENQUESTADOR_NACIONALITAT <fct>, PROVINCIA <fct>, HABITAT <fct>,
## # MUNICIPI <fct>, COMARCA <fct>, ID_RUTA <dbl>, SECCIO_TEORICA <dbl>,
## # CONF_SECC <fct>, SECCIO_REAL <dbl>, DOMICILI_PARTICULAR <fct>,
## # LLENGUA_ENQUESTA <fct>, PADRO <fct>, CIUTADANIA <fct>, SEXE <fct>,
## # EDAT <dbl>, EDAT_GR <fct>, EDAT_CEO <fct>, LLOC_NAIX <fct>, …
names(d)[1:50]
## [1] "PONDERA" "ORDRE_REVISADA"
## [3] "REO" "BOP_NUM"
## [5] "ANY" "MES"
## [7] "DIA" "HOR_INI"
## [9] "HOR_FIN" "DURADA"
## [11] "ENQUESTADOR_CODI" "ENQUESTADOR_SEXE"
## [13] "ENQUESTADOR_ESTUDIS" "ENQUESTADOR_NACIONALITAT"
## [15] "PROVINCIA" "HABITAT"
## [17] "MUNICIPI" "COMARCA"
## [19] "ID_RUTA" "SECCIO_TEORICA"
## [21] "CONF_SECC" "SECCIO_REAL"
## [23] "DOMICILI_PARTICULAR" "LLENGUA_ENQUESTA"
## [25] "PADRO" "CIUTADANIA"
## [27] "SEXE" "EDAT"
## [29] "EDAT_GR" "EDAT_CEO"
## [31] "LLOC_NAIX" "HORA_PRIMERA_PREGUNTA"
## [33] "GRAVACIO" "TIPUS_GRAV"
## [35] "PRE_PROBLEMES" "PROBLEMES_LITERALS"
## [37] "PROBLEMES_R_1" "PROBLEMES_R_2"
## [39] "PROBLEMES_R_3" "PROBLEMES_R_4"
## [41] "PROBLEMES_R_5" "PROBLEMES_R_6"
## [43] "PROBLEMES_R_7" "PROBLEMES_R_8"
## [45] "PROBLEMES_R_9" "PROBLEMES_R_10"
## [47] "PROBLEMES_R_11" "PROBLEMES_R_12"
## [49] "PROBLEMES_E_1" "PROBLEMES_E_2"
Specific barometers or time frames
CEOdata() allows you to select a specific Barometer or study by providing its internal register in the reo argument.
The reo is the internal name that the CEO uses. It stands for “Registre
d’Estudis d’Opinió” (register of opinion studies) and is the main identifier
of the survey, also present in the metadata table. Although many REOs are
plain numbers, some consist of a number, a slash and another number, so the
value must be passed as a character vector. Only a single REO can be passed,
because different data matrices are not guaranteed to share any column and may
refer to very different topics.
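The need for a character value can be seen in a small sketch. The identifier "12/3" and the helper is_single_reo() are made up for illustration and are not part of CEOdata; the check only mirrors the two shapes described above.

```r
# Made-up REO identifiers, for illustration only.
reo_plain <- "746"
reo_slash <- "12/3"  # hypothetical value with the number/slash/number shape

# Without quotes, R evaluates the slash form as a division,
# so the identifier would be lost.
unquoted <- 12/3
print(unquoted)

# Hypothetical helper (not part of CEOdata): is x one REO given as character?
is_single_reo <- function(x) {
  is.character(x) && length(x) == 1 && grepl("^[0-9]+(/[0-9]+)?$", x)
}

print(is_single_reo(reo_plain))
print(is_single_reo(reo_slash))
print(is_single_reo(c("746", "747")))  # more than one REO is rejected
```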
For instance, to get only the data of the study with register “746” (corresponding to March 2014, as the ANY and MES columns of the merged data above show):
d746 <- CEOdata(reo = "746")
d746
## # A tibble: 2,000 × 475
## NUM NUM_CINE BOP_NUM MES ANY DIA CODI_ENQ HOR_INI HOR_FIN DURADA
## <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl+lb> <time> <time> <dbl>
## 1 NA 1 32 [Març. 1… 4 14 NA 243 [G3] NA NA 1203
## 2 NA 2 32 [Març. 1… 4 14 NA 254 [T4] NA NA 997
## 3 NA 3 32 [Març. 1… 4 14 NA 243 [G3] NA NA 1127
## 4 NA 4 32 [Març. 1… 4 14 NA 235 [B5] NA NA 1792
## 5 NA 5 32 [Març. 1… 4 14 NA 236 [B6] NA NA 1418
## 6 NA 6 32 [Març. 1… 3 14 NA 246 [L1] NA NA 1964
## 7 NA 7 32 [Març. 1… 4 14 NA 244 [G4] NA NA 2283
## 8 NA 8 32 [Març. 1… 4 14 NA 233 [B3] NA NA 1322
## 9 NA 9 32 [Març. 1… 3 14 NA 232 [B2] NA NA 1778
## 10 NA 10 32 [Març. 1… 3 14 NA 254 [T4] NA NA 1119
## # … with 1,990 more rows, and 465 more variables: num_quest <dbl>,
## # sexe_enq <dbl+lbl>, edat_enq <dbl>, estudis_enq <dbl+lbl>,
## # nacionalitat_enq <dbl+lbl>, PROVI <dbl+lbl>, HABITAT <dbl+lbl>,
## # MUN <dbl+lbl>, ADRECA <chr>, COMARCA <dbl+lbl>, SEXE <dbl+lbl>, EDAT <dbl>,
## # SE <chr>, GR_EDAT <dbl+lbl>, Edat_CEO <dbl+lbl>, F01 <dbl+lbl>,
## # F02 <dbl+lbl>, F03 <dbl+lbl>, F04 <dbl+lbl>, PRE_P1 <dbl+lbl>,
## # P1_LITERALS <chr>, P1_100_R <dbl>, P1_200_R <dbl>, P1_300_R <dbl>, …
The function CEOdata() also lets you restrict the whole set of Barometers to a specific time frame, using the arguments date_start and date_end with dates in YYYY-MM-DD format.
b2019 <- CEOdata(date_start = "2019-01-01", date_end = "2019-12-31")
b2019
## # A tibble: 0 × 814
## # … with 814 variables: PONDERA <dbl>, ORDRE_REVISADA <dbl>, REO <dbl>,
## # BOP_NUM <fct>, ANY <dbl>, MES <dbl>, DIA <dbl>, HOR_INI <dbl>,
## # HOR_FIN <dbl>, DURADA <dbl>, ENQUESTADOR_CODI <dbl>,
## # ENQUESTADOR_SEXE <fct>, ENQUESTADOR_ESTUDIS <fct>,
## # ENQUESTADOR_NACIONALITAT <fct>, PROVINCIA <fct>, HABITAT <fct>,
## # MUNICIPI <fct>, COMARCA <fct>, ID_RUTA <dbl>, SECCIO_TEORICA <dbl>,
## # CONF_SECC <fct>, SECCIO_REAL <dbl>, DOMICILI_PARTICULAR <fct>, …
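Conceptually, a time-frame restriction is a filter on a date column. The following is a minimal sketch on made-up data, assuming inclusive bounds; it does not reproduce how CEOdata() applies date_start and date_end internally.

```r
# Toy data (made-up values): one row per survey, with a date column.
surveys <- data.frame(
  REO = c("A", "B", "C"),
  Data = as.Date(c("2018-11-28", "2019-03-28", "2020-07-28")),
  stringsAsFactors = FALSE
)

# Time frame in YYYY-MM-DD format, parsed into Date objects.
date_start <- as.Date("2019-01-01")
date_end <- as.Date("2019-12-31")

# Keep the rows whose date falls inside the frame (bounds included here).
in_range <- surveys$Data >= date_start & surveys$Data <= date_end
print(surveys[in_range, ])
```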
Convenience variables
By default, CEOdata() adds new variables to the original matrix. These are created for convenience, such as the date of the survey.
The CEO data do not always provide a day of the month; in that case, 28 is used.
These variables appear at the end of the dataset and can be distinguished from the original CEO variables because only their first letter is capitalized.
tail(names(d))
## [1] "CONSEQ_ECON_COVID19" "ACTITUD_PERSONAL_COVD19"
## [3] "CIRCUIT_985_1" "CIRCUIT_985_2"
## [5] "CIRCUIT_985_3" "Data"
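The rule for the survey date can be sketched in base R. This is an illustration on made-up rows, assuming the date is assembled from the ANY, MES and DIA columns; the package's own code may differ.

```r
# Toy survey dates (made-up values); DIA is missing for some rows.
toy <- data.frame(
  ANY = c(2014, 2014, 2019),
  MES = c(3, 3, 11),
  DIA = c(NA, 15, NA)
)

# Use 28 whenever the day of the month is missing.
day <- ifelse(is.na(toy$DIA), 28, toy$DIA)

# Build a Date column whose name has only the first letter capitalized.
toy$Data <- as.Date(sprintf(
  "%04d-%02d-%02d",
  as.integer(toy$ANY), as.integer(toy$MES), as.integer(day)
))

# Names that are not fully upper case are the convenience variables.
convenience <- names(toy)[names(toy) != toupper(names(toy))]
print(toy)
print(convenience)
```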
CEOmeta(): Access to the metadata of studies and surveys
The function CEOmeta() makes it easy to retrieve, search and restrict by time
the list of all the surveys produced by the CEO, which amounted to more than a
thousand as of the end of 2021.
When called without arguments, the function transparently downloads the latest
version of the metadata published by the center and caches its content, so that
subsequent calls in the same R session do not need to download it again.
CEOmeta()
## # A tibble: 1,151 × 27
## REO `Titol enquesta` `Titol estudi` `Metodologia en… `Metode de reco…
## <fct> <chr> <chr> <fct> <fct>
## 1 1006 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 2 1005 Enquesta sobre el … Enquesta sobre e… quantitativa telèfon
## 3 1004 Baròmetre sanitari… Baròmetre sanita… quantitativa telèfon
## 4 1003 L'ètica pública a … L'ètica pública … quantitativa autoadministrada
## 5 1002 Enquesta de satisf… Enquesta de sati… quantitativa autoadministrada
## 6 1001 Índex de satisfacc… Índex de satisfa… quantitativa internet;telèfon
## 7 1000 Enquesta panel sob… Enquesta panel s… quantitativa internet
## 8 999 Enquesta d'inserci… Enquesta d'inser… quantitativa telèfon
## 9 998 Hàbits de consum d… Hàbits de consum… quantitativa telèfon
## 10 997 Hàbits de consum d… Hàbits de consum… quantitativa telèfon
## # … with 1,141 more rows, and 22 more variables: Objectius <chr>,
## # Ambit territorial <fct>, Cost <dbl>, Promotors enquesta <chr>,
## # Executors enquesta <chr>, Promotors estudi <chr>, Executors estudi <chr>,
## # Data de treball de camp <chr>, Dia inici treball de camp <date>,
## # Dia final treball de camp <date>, Univers <chr>, Mostra <chr>,
## # Mostra estudis quantitatius <dbl>, Mostra estudis qualitatius <chr>,
## # Error mostral <chr>, Error mostral (numèric) <chr>, Resum <chr>, …
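Session-level caching of this kind can be sketched with an environment. The functions fetch_metadata() and get_metadata() are hypothetical stand-ins, and the sketch makes no claim about how CEOmeta() stores its cache.

```r
# Environment used as a per-session cache.
.cache <- new.env()

# Counter showing how many times the (simulated) download runs.
download_count <- 0

# Hypothetical stand-in for the real download; returns a tiny made-up table.
fetch_metadata <- function() {
  download_count <<- download_count + 1
  data.frame(REO = c("1006", "1005"), stringsAsFactors = FALSE)
}

# Download on the first call only; later calls reuse the cached copy.
get_metadata <- function() {
  if (is.null(.cache$meta)) {
    .cache$meta <- fetch_metadata()
  }
  .cache$meta
}

m1 <- get_metadata()
m2 <- get_metadata()
print(download_count)
```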
Search for specific topics through keywords
The first relevant argument for CEOmeta() is search, a simple built-in search
engine that goes through the columns of the metadata containing potentially
descriptive information and returns the studies that contain the keyword.
CEOmeta(search = "Medi ambient")
## Looking for entries with: medi ambient
## # A tibble: 44 × 27
## REO `Titol enquesta` `Titol estudi` `Metodologia en… `Metode de reco…
## <fct> <chr> <chr> <fct> <fct>
## 1 1006 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 2 973 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 3 941 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 4 876 Baròmetre de la bi… Baròmetre de la … quantitativa internet;telèfon
## 5 875 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 6 865 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 7 819 Baròmetre 2015 de … Baròmetre 2015 d… quantitativa telèfon
## 8 807 Seguiment d'Indica… Seguiment d'indi… quantitativa telèfon
## 9 805 Baròmetre de la bi… Baròmetre de la … quantitativa internet;telèfon
## 10 803 Post test campanya… Post test campan… quantitativa telèfon
## # … with 34 more rows, and 22 more variables: Objectius <chr>,
## # Ambit territorial <fct>, Cost <dbl>, Promotors enquesta <chr>,
## # Executors enquesta <chr>, Promotors estudi <chr>, Executors estudi <chr>,
## # Data de treball de camp <chr>, Dia inici treball de camp <date>,
## # Dia final treball de camp <date>, Univers <chr>, Mostra <chr>,
## # Mostra estudis quantitatius <dbl>, Mostra estudis qualitatius <chr>,
## # Error mostral <chr>, Error mostral (numèric) <chr>, Resum <chr>, …
It is also possible to pass more than one value to search, in which case a
study is returned if it contains any of them (logical OR).
CEOmeta(search = c("Medi ambient", "Municipi"))
## Looking for entries with: medi ambient OR municipi
## # A tibble: 48 × 27
## REO `Titol enquesta` `Titol estudi` `Metodologia en… `Metode de reco…
## <fct> <chr> <chr> <fct> <fct>
## 1 1006 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 2 973 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 3 941 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 4 876 Baròmetre de la bi… Baròmetre de la … quantitativa internet;telèfon
## 5 875 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 6 865 Seguiment d'indica… Seguiment d'indi… quantitativa telèfon
## 7 819 Baròmetre 2015 de … Baròmetre 2015 d… quantitativa telèfon
## 8 807 Seguiment d'Indica… Seguiment d'indi… quantitativa telèfon
## 9 805 Baròmetre de la bi… Baròmetre de la … quantitativa internet;telèfon
## 10 803 Post test campanya… Post test campan… quantitativa telèfon
## # … with 38 more rows, and 22 more variables: Objectius <chr>,
## # Ambit territorial <fct>, Cost <dbl>, Promotors enquesta <chr>,
## # Executors enquesta <chr>, Promotors estudi <chr>, Executors estudi <chr>,
## # Data de treball de camp <chr>, Dia inici treball de camp <date>,
## # Dia final treball de camp <date>, Univers <chr>, Mostra <chr>,
## # Mostra estudis quantitatius <dbl>, Mostra estudis qualitatius <chr>,
## # Error mostral <chr>, Error mostral (numèric) <chr>, Resum <chr>, …
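The OR logic of a keyword search can be illustrated with a few lines of base R. This is a sketch on made-up metadata, assuming a case-insensitive literal match over two text columns; search_meta() and the column names Titol and Resum are hypothetical and do not describe the package's actual implementation.

```r
# Made-up metadata with two descriptive text columns.
meta <- data.frame(
  REO = c("1", "2", "3"),
  Titol = c("Medi ambient i energia", "Salut", "Serveis publics"),
  Resum = c("Opinions on the environment",
            "Health services survey",
            "Satisfaction at the municipi level"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where any keyword appears in any column.
search_meta <- function(meta, keywords, columns = c("Titol", "Resum")) {
  # One lower-case string per row, joining the searched columns.
  text <- tolower(do.call(paste, unname(as.list(meta[columns]))))
  # One logical vector per keyword, combined with OR.
  hits <- Reduce(`|`, lapply(tolower(keywords), function(k) {
    grepl(k, text, fixed = TRUE)
  }))
  meta[hits, , drop = FALSE]
}

one <- search_meta(meta, "Medi ambient")
two <- search_meta(meta, c("Medi ambient", "Municipi"))
print(one$REO)
print(two$REO)
```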
Restrict by time
Metadata can be retrieved for a specific period of time by using the arguments
date_start and date_end, again in YYYY-MM-DD format. In this case, the dates
taken into account are those on which the study entered the records, not the
fieldwork dates.
CEOmeta(date_start = "2019-01-01", date_end = "2019-12-31")
## # A tibble: 0 × 27
## # … with 27 variables: REO <fct>, Titol enquesta <chr>, Titol estudi <chr>,
## # Metodologia enquesta <fct>, Metode de recollida de dades <fct>,
## # Objectius <chr>, Ambit territorial <fct>, Cost <dbl>,
## # Promotors enquesta <chr>, Executors enquesta <chr>, Promotors estudi <chr>,
## # Executors estudi <chr>, Data de treball de camp <chr>,
## # Dia inici treball de camp <date>, Dia final treball de camp <date>,
## # Univers <chr>, Mostra <chr>, Mostra estudis quantitatius <dbl>, …
Browse the CEO site
In addition to the search engine and the restriction by time, CEOmeta() can
also automatically open the relevant URLs at the CEO domain that contain the
details of the studies gathered with the function. This is done by setting the
browse argument to TRUE. However, there is a soft limit of 10 URLs to be
opened, unless the user forces the function to open all of them (proceed with
caution, as this may open many tabs in your browser and leave your computer out
of RAM with memory-hungry browsers such as Chrome).
CEOmeta(search = "Medi ambient a", browse = TRUE)
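One way such a soft limit could behave is sketched below. The helper cap_urls(), its arguments and the example.org addresses are all hypothetical; whether CEOmeta() truncates the list or stops with a message is not stated here, so this is only an illustration of the idea.

```r
# Hypothetical helper, not part of CEOdata: cap the number of URLs to open.
cap_urls <- function(urls, max_open = 10, force = FALSE) {
  if (!force && length(urls) > max_open) {
    message("Only the first ", max_open, " of ", length(urls), " URLs are kept.")
    urls <- head(urls, max_open)
  }
  urls
}

# Made-up URLs standing in for study pages.
urls <- sprintf("https://example.org/study/%d", 1:44)

print(length(cap_urls(urls)))                # capped
print(length(cap_urls(urls, force = TRUE)))  # user forces all of them
```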
It is also possible to specify an alternative language, so that the default Catalan pages are replaced by the automatic translations provided by Apertium (for Occitan/Aranese) or Google Translate.
CEOmeta(search = "Medi ambient a", browse = TRUE, browse_translate = "de")
Extensions
Once you have retrieved the data of the surveys, it is trivial to work with them. For instance, to get the overall number of males and females surveyed:
library(dplyr)
library(tidyr)
library(ggplot2)
d |>
  count(SEXE)
## # A tibble: 2 × 2
## SEXE n
## <fct> <int>
## 1 Home 16307
## 2 Dona 17531
Or to trace the proportion of females surveyed over time, across barometers:
d |>
  group_by(BOP_NUM) |>
  summarize(propFemales = length(which(SEXE == "Dona")) / n()) |>
  ggplot(aes(x = BOP_NUM, y = propFemales, group = 1)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  expand_limits(y = c(0, 1))
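The per-barometer proportion is simply the share of rows with SEXE equal to "Dona" within each group. A base R sketch on made-up data shows the same computation without dplyr; when SEXE has no missing values, the mean of the logical comparison gives the same result as counting matches and dividing by the group size.

```r
# Toy responses (made-up values): two barometers, six respondents.
toy <- data.frame(
  BOP_NUM = c("A", "A", "A", "A", "B", "B"),
  SEXE = c("Dona", "Home", "Dona", "Dona", "Home", "Dona"),
  stringsAsFactors = FALSE
)

# Proportion of females per barometer: mean of a logical vector by group.
prop_females <- tapply(toy$SEXE == "Dona", toy$BOP_NUM, mean)
print(prop_females)
```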
Alternatively, the metadata can also be explored through the topics (tags, called “Descriptors”) that each study covers, as reported by the CEO.
# One row per study and tag: split the semicolon-separated Descriptors
tags <- CEOmeta() |>
  separate_rows(Descriptors, sep = ";") |>
  mutate(tag = factor(stringr::str_trim(Descriptors))) |>
  select(REO, tag)
# Plot the tags that appear in more than five studies
tags |>
  group_by(tag) |>
  count() |>
  filter(n > 5) |>
  ggplot(aes(x = n, y = reorder(tag, n))) +
  geom_point() +
  ylab("Topic")
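The tag-splitting step can also be sketched in base R, which makes the reshaping explicit. The Descriptors values below are made up; only the semicolon separator and the trimming of surrounding spaces are taken from the code above.

```r
# Made-up metadata: several tags per study, separated by semicolons.
meta <- data.frame(
  REO = c("1", "2", "3"),
  Descriptors = c("Medi ambient; Energia", "Energia;Salut", "Salut; Energia "),
  stringsAsFactors = FALSE
)

# Split each Descriptors string into its tags.
parts <- strsplit(meta$Descriptors, ";", fixed = TRUE)

# One row per (study, tag) pair, with surrounding spaces removed.
tags <- data.frame(
  REO = rep(meta$REO, lengths(parts)),
  tag = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Number of studies per tag, most frequent first.
counts <- sort(table(tags$tag), decreasing = TRUE)
print(counts)
```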
Or by examining the time periods in which there has been fieldwork in quantitative studies since 2018.
CEOmeta() |>
  filter(`Dia inici treball de camp` > "2018-01-01") |>
  ggplot(aes(xmin = `Dia inici treball de camp`,
             xmax = `Dia final treball de camp`,
             y = reorder(REO, `Dia final treball de camp`))) +
  geom_linerange() +
  xlab("Date") + ylab("Surveys with fieldwork") +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
Development and acknowledgement
The development of CEOdata (track changes, propose improvements, report bugs) can be followed on GitHub.
If you use the data and the package, please cite and properly acknowledge the CEO and the package, respectively.