SOEP-IS in 6 steps
This section is a brief guide to the basics of working with SOEP-IS. Data users who have never worked with the SOEP before can follow these steps to gain initial experience with SOEP-IS. The best way to use SOEP-IS data depends on the research project and on individual approaches to working with the data, so this guide only offers basic tools for getting started quickly and easily. The statistical software used is Stata, but the principles can be applied in any common data processing software.
Note: This is not a tutorial for statistical software. Basic knowledge of Stata is assumed.
1. Data delivery and datasets
Below we provide a screenshot of the basic data delivery of SOEP-IS (version 2022). The standard data formats are Stata (.dta) and SPSS (.sav), and the datasets are stored with English and German labels. All datasets of a given format are located directly in the respective folder, without subdirectories.
Not all datasets will be relevant for your research. For a quick overview, we have listed a short description and some key information for each dataset:
Dataset Name | Explanation | Years | Unique identifiers | Note |
---|---|---|---|---|
bioage | Variables from the Modules of Questions on Children | 2003 – current | pid pidresp syear | |
biobirth | Birth Biography of Female and Male Respondents | N/A | pid | |
biol | Variables from the Biography Questionnaire | 1999 – current | pid syear | |
bioparen | Biography Information for Respondents’ Parents | N/A | pid | last update in 2020 |
cognit | Data on cognitive potential | 2014, 2018 | pid syear | Data from specific innovative modules |
fgz | FGZ exclusive Variables (Social Cohesion Panel) | 2020 | pid | |
hbrutto | Household-related Gross File | 1998 – current | hid syear | |
hgen | Household-related Status and Generated Variables | 1998 – current | hid syear | |
hl | Variables from the Household Questionnaire | 1998 – current | hid syear | |
hpath | Household-related Meta-dataset | N/A | hid | only cid, hid and sample1 included |
hpathl | Household-related Meta-dataset (long) | 1998 – current | hid syear | |
ibip_parent | Bonn Intervention Panel (Parents) | 2014 – 2020 | pid pidresp syear | BIP specific dataset |
ibip_pupil | Bonn Intervention Panel (Students) | 2014 – 2020 | pid syear | BIP specific dataset |
idrm | Data from Innovative DRM Module | 2012 – 2015 | pid syear aktnr | Data from specific innovative modules |
idrm_esm | Data from Innovative DRM/ESM Module | 2014 | N/A | Data from specific innovative modules |
iesm | Data from Innovative IESM Module | 2014 | pid esm_z | Data from specific innovative modules |
ilanguage | Innovative Language Modules | 2012, 2013, 2016, 2017, 2019 | pid syear fspr_id sk_id langatt_id | Data from specific innovative modules |
ilottery | Innovative Lottery Experiment | 2016 | pid1 pid2 | Data from specific innovative modules |
inno | Variables from the Innovation Modules | 2011 – last | pid syear | Data from most innovative modules |
inno_h | Variables from the Innovation Modules on Household Level | 2014, 2015, 2017, 2018 | hid syear | Data from innovative modules that were surveyed at household level and are thus not included in “inno” |
intv | Information on Interviewers | 2011 – current | intid syear | |
irisk | Person-related Data from Innovative RISK Module | 2014 | N/A | Data from specific innovative modules |
kidlong | Variables on Children in the Household | 1998 – current | pid syear | |
pbrutto | Person-related Gross File | 1998 – current | pid syear | |
pgen | Person-related Status and Generated Variables | 1998 – current | pid syear | |
pl | Variables from the Individual Question Module | 1998 – current | pid syear | |
ppath | Person-related Meta-dataset | N/A | pid | |
ppathl | Person-related Meta-dataset (long) | 1998 – current | pid syear | |

Please note that there was no SOEP-IS survey in 2021. For simplification, this is not considered in the “Years” column.
2. Finding what you need
You now have an overview of the various datasets and can search for the variables you need. SOEP-IS contains many variables, and similar information is often stored in more than one dataset. For example, income-related data can be found in both pl and pgen. Which variables should you use? This depends on the individual research project, but there are several ways to find out. It can be useful to combine different approaches when working with the data. Some possible ways of finding variables, or information about them, are described below. Not all of them are always necessary, and each has its own advantages and disadvantages.
1. Search in the datasets
Probably the most intuitive way to find variables is to open a dataset and search for keywords in the variable labels. However, this approach can be difficult because you may not know which datasets you need or which words the labels contain, and because similar information may exist in several datasets.
Advantages: quick, easy, intuitive
Disadvantages: requires knowledge of datasets and labels, provides no information on the related survey question
2. Search in the questionnaires
A complete overview of the SOEP-IS questionnaires is available at paneldata.org/soep-is/instruments/. There you can access the English and German metadata-based questionnaires (see column “Attachments” for PDFs), which contain the survey questions, the related variables, and the datasets they are stored in. This offers the highest degree of transparency regarding what was actually asked. Since 2022, these questionnaires have been used as a direct programming template and thus represent the questionnaire in the best possible way. For the same reason, they are rather technical and not always easy to understand. For example, long filter conditions sometimes extend beyond the end of the page. Another problem is that dataset and variable names have changed over the years, so older questionnaires may contain outdated names. In most cases, however, the renaming was minimal and can be traced relatively easily: letters were added (l0011 -> lb0011) or suffixes were added (l0011 -> lb0011_v1). Despite these drawbacks, the questionnaires make it possible to find related information and variables relatively quickly by searching the documents for keywords.
Advantages: detailed information on survey questions over the years, related variables are easy to find
Disadvantages: not always up to date in older years, very technical (e.g. filters)
3. Search in the codebooks
For each dataset there is a codebook with a short overview of each variable. This lets you get a first impression of the data without having to access it. Where possible, the codebooks also include the latest version of the question related to a variable. As some of the values are displayed, you can also search for value labels in the codebooks. For variables with a large number of values, however, only some are displayed.
Advantages: overview of variables without access to data, limited search for value labels
Disadvantages: large PDF documents, overall limited information
4. Search in SOEP-IS-Companion
In the chapter Topics we provide a selection of the main topics of SOEP-IS and the respective variables. You can use it to find some of the most relevant variables in SOEP-IS. However, it is not a comprehensive list, and SOEP-IS offers many additional variables. It is therefore best used as a starting point or for a quick search for general information.
Advantages: topic related, easy
Disadvantages: just a selection of content and variables
5. Search in paneldata.org
You can use paneldata.org to look for all kinds of information. Paneldata allows you to search for variables and to find more information about generated variables. It offers comprehensive frequency counts, chronologies of variables, cross-study variable linkage via concepts, a syntax generator, and a topic list for content search in the SOEP.
Advantages: easy and comprehensive search, cross-study
Disadvantages: partially missing information/links, not all features are available for all data
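The first approach above, searching the variable labels of a dataset for keywords, can be sketched in a few lines. Stata users would typically use the `lookfor` command for this; because the principles carry over to other software, the following Python sketch shows the same idea. The variable names `item_income` and `item_health` and all labels are made up for illustration. With real data, the labels could be read from the .dta file instead, for example with pandas' `StataReader.variable_labels()`.

```python
def find_variables(labels, keyword):
    """Return {variable: label} for all labels containing keyword (case-insensitive)."""
    needle = keyword.lower()
    return {name: label for name, label in labels.items() if needle in label.lower()}


# Hypothetical excerpt of variable labels; not taken from the actual SOEP-IS data.
example_labels = {
    "pid": "Never Changing Person ID",
    "syear": "Survey Year",
    "item_income": "Gross income last month",  # made-up variable
    "item_health": "Current health status",    # made-up variable
}

income_hits = find_variables(example_labels, "income")
print(income_hits)
```

As noted in the list above, a label search only finds what the labels mention, so it is worth trying several keywords and checking more than one dataset.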
3. Merging datasets
Once you know which variables you want to use, you may find that they are stored in different datasets. You can merge most SOEP-IS data using the identifiers pid, hid, and syear. The following is a brief overview of these key identifiers:
pid: Never Changing Person ID - Each individual in SOEP has an unambiguous and never changing pid. It is constant over the years and can be used across datasets and studies.
hid: Current Wave HH Number - The hid identifies the household of the respondents in a wave. The hid can change from year to year, e.g. when people leave or switch households. However, each person only has one hid per year. The hid can be used across studies.
syear: The syear variable can be found in every dataset in the long format and indicates the survey year.
SOEP-Core already has an extensive guide on the identifiers in SOEP data
General guidelines
Merges with long datasets should usually include the syear variable.
Data on household level should usually be merged using hid.
Data on individual level should usually be merged using pid.
SOEP-Core already has an extensive guide on how to merge SOEP data with Stata
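The general guidelines above can be illustrated with a small sketch. In Stata this would be done with the `merge` command (for example `merge 1:1 pid syear using ...`); the Python/pandas version below follows the same logic on toy data. The data frames only borrow the names ppathl, pgen, and hgen; all values and the columns `item_pgen` and `item_hgen` are made up, and the sketch assumes that the person-level file carries the hid of each person-year.

```python
import pandas as pd

# Toy stand-ins for long datasets; values and item_* columns are invented.
ppathl = pd.DataFrame({
    "pid": [101, 101, 102, 103],
    "syear": [2019, 2020, 2020, 2020],
    "hid": [11, 11, 12, 11],  # assumption: hid is available per person-year
})
pgen = pd.DataFrame({
    "pid": [101, 101, 102, 103],
    "syear": [2019, 2020, 2020, 2020],
    "item_pgen": [1500, 1600, 2100, 900],
})
hgen = pd.DataFrame({
    "hid": [11, 11, 12],
    "syear": [2019, 2020, 2020],
    "item_hgen": [2, 2, 1],
})

# Individual level: pid + syear identify one row in each long dataset (1:1).
persons = ppathl.merge(pgen, on=["pid", "syear"], how="left", validate="1:1")

# Household level: several persons can share a household, hence m:1 on hid + syear.
merged = persons.merge(
    hgen, on=["hid", "syear"], how="left", validate="m:1", indicator=True
)
print(merged)
```

The `validate` argument makes pandas raise an error if the keys are not unique in the expected way, and the `_merge` column created by `indicator=True` shows which rows found a match, similar in spirit to the `_merge` variable Stata creates.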
4. Understanding the Data
Once you have all the variables you need, it is sometimes necessary to understand the origin and characteristics of the variables or distributions. Here are some tips that may help you understand the data:
The options described in 2. Finding what you need can be used to find more information
In the FAQ chapter What do the suffixes “_v1”, “_v2”, or “_h” in variable names mean? you can find information about the meaning of variable names
SOEP-Core provides a description of the conventions for missing values
It might be useful to merge the sample1 variable and check whether specific items were asked only in certain samples
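The last two tips can be combined in a quick check, sketched here in Python/pandas on toy data. The item `item_x`, its values, and the sample labels `S1` and `S2` are made up. The sketch also assumes that negative codes mark missing values; please verify this against the SOEP-Core description of the missing conventions before relying on it.

```python
import pandas as pd

# Toy data: item_x is a hypothetical survey item, sample1 the sample membership.
df = pd.DataFrame({
    "pid": [1, 2, 3, 4, 5],
    "sample1": ["S1", "S1", "S2", "S2", "S2"],  # made-up sample labels
    "item_x": [3, -1, -5, -5, -5],
})

# Assumption: negative codes are missing codes, so recode them to NaN.
df["item_x_valid"] = df["item_x"].where(df["item_x"] >= 0)

# Share of valid answers per sample; a share of 0 suggests that the item
# was not asked in that sample at all.
valid_share = df.groupby("sample1")["item_x_valid"].apply(lambda s: s.notna().mean())
print(valid_share)
```

In this toy example half of the answers in `S1` are valid and none in `S2`, which is the pattern one would expect if the item had been part of the questionnaire for only one sample.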
5. Weighting and imputation
Even though the regular SOEP-IS samples are random probability samples, weights are needed to draw conclusions about the total population. This is because not everyone selected in the sampling process actually takes part in the survey; the weights compensate for the bias that this nonresponse introduces into the sample. SOEP-IS provides weights at the individual and household level. The variables phrf and hhrf can be found in ppathl and hpathl, respectively.
SOEP-IS also provides imputations of household income and individual gross and net income. The different imputed variables are stored in hgen and pgen.
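To show what applying the individual weight means in practice, here is a minimal sketch of a weighted mean in Python/pandas. The toy values and the variable `item_y` are made up; only the name phrf comes from the text above. The helper `weighted_mean` is a hypothetical function written for this example, and it covers the point estimate only, not standard errors.

```python
import pandas as pd


def weighted_mean(values, weights):
    """Weighted mean over rows with a non-missing value and a positive weight."""
    keep = values.notna() & weights.notna() & (weights > 0)
    return (values[keep] * weights[keep]).sum() / weights[keep].sum()


# Toy data: item_y is a hypothetical variable; phrf values are invented.
df = pd.DataFrame({
    "pid": [1, 2, 3],
    "phrf": [1000.0, 3000.0, 0.0],
    "item_y": [10.0, 20.0, 50.0],
})

unweighted = df["item_y"].mean()
weighted = weighted_mean(df["item_y"], df["phrf"])
print(unweighted, weighted)
```

In Stata, the corresponding step is to pass the weight variable to an estimation command through its weight option rather than computing the mean by hand.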
6. Record Linkage and Specific Data Requests
SOEP-IS offers the possibility to merge the data with other data sources. Specifically, SOEP-IS data can be linked to administrative data of the Institute for Employment Research (IAB) and to administrative data of the German Pension Insurance. However, in order for the linkage to be possible, the respondents have to consent to it. This consent has so far only been surveyed in the SOEP-IS in 2019 and 2024 (in 2020 only a small subsample received the consent question).
Some variables are not included in the normal Data Distribution File. If you find items in the questionnaires or the documentation sites that are not part of the official data, you can send us an e-mail and we may be able to make the data available individually. However, this depends on the respective data and the circumstances.
Our Community Management is also available to answer any other questions you may have. SOEP-IS specific questions may be forwarded to the responsible members of the team.