SOEP-IS in 6 steps

This section provides a brief guide to basic work with SOEP-IS. Data users who have never worked with the SOEP before can follow these steps to gain initial experience with SOEP-IS. Of course, the best way to use SOEP-IS data depends on the research project and on individual approaches to working with the data. The following guide only offers basic tools for using the data quickly and easily. The statistical software used is Stata, but the principles can be applied in any common data processing software.

Note: This is not a tutorial for statistical software. Basic knowledge of Stata is assumed.

1. Data delivery and datasets

Below is a screenshot of the basic data delivery of SOEP-IS (version 2022). The standard data formats are Stata (.dta) and SPSS (.sav), and the datasets are stored with English and German labels. All datasets of a given format are located in the respective folder, without subdirectories:

[Screenshot: SOEP-IS data delivery folder (SUF_data.png)]

Probably not all datasets are relevant for your research. For a quick overview, we have listed a short description and brief information on each dataset:

| Dataset Name | Explanation | Years | Unique identifiers | Note |
| --- | --- | --- | --- | --- |
| bioage | Variables from the Modules of Questions on Children | 2003 – current | pid pidresp syear | |
| biobirth | Birth Biography of Female and Male Respondents | N/A | pid | |
| biol | Variables from the Biography Questionnaire | 1999 – current | pid syear | |
| bioparen | Biography Information for Respondents' Parents | N/A | pid | last update in 2020 |
| cognit | Data on cognitive potential | 2014, 2018 | pid syear | Data from specific innovative modules |
| fgz | FGZ exclusive Variables (Social Cohesion Panel) | 2020 | pid | |
| hbrutto | Household-related Gross File | 1998 – current | hid syear | |
| hgen | Household-related Status and Generated Variables | 1998 – current | hid syear | |
| hl | Variables from the Household Questionnaire | 1998 – current | hid syear | |
| hpath | Household-related Meta-dataset | N/A | hid | only cid, hid and sample1 included |
| hpathl | Household-related Meta-dataset (long) | 1998 – current | hid syear | |
| ibip_parent | Bonn Intervention Panel (Parents) | 2014 – 2020 | pid pidresp syear | BIP specific dataset |
| ibip_pupil | Bonn Intervention Panel (Students) | 2014 – 2020 | pid syear | BIP specific dataset |
| idrm | Data from Innovative DRM Module | 2012 – 2015 | pid syear aktnr | Data from specific innovative modules |
| idrm_esm | Data from Innovative DRM/ESM Module | 2014 | N/A | Data from specific innovative modules |
| iesm | Data from Innovative IESM Module | 2014 | pid esm_z | Data from specific innovative modules |
| ilanguage | Innovative Language Modules | 2012, 2013, 2016, 2017, 2019 | pid syear fspr_id sk_id langatt_id | Data from specific innovative modules |
| ilottery | Innovative Lottery Experiment | 2016 | pid1 pid2 | Data from specific innovative modules |
| inno | Variables from the Innovation Modules | 2011 – last | pid syear | Data from most innovative modules |
| inno_h | Variables from the Innovation Modules on Household Level | 2014, 2015, 2017, 2018 | hid syear | Data from innovative modules that were surveyed at household level and are thus not included in “inno” |
| intv | Information on Interviewers | 2011 – current | intid syear | |
| irisk | Person-related Data from Innovative RISK Module | 2014 | N/A | Data from specific innovative modules |
| kidlong | Variables on Children in the Household | 1998 – current | pid syear | |
| pbrutto | Person-related Gross File | 1998 – current | pid syear | |
| pgen | Person-related Status and Generated Variables | 1998 – current | pid syear | |
| pl | Variables from the Individual Question Module | 1998 – current | pid syear | |
| ppath | Person-related Meta-dataset | N/A | pid | |
| ppathl | Person-related Meta-dataset (long) | 1998 – current | pid syear | |

Please note that there was no SOEP-IS survey in 2021. For simplicity, this gap is not reflected in the “Years” column.
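The "Unique identifiers" column can be checked directly once a dataset is loaded. The guide itself works in Stata; because the principle is the same in any software, the following is a minimal sketch in plain Python. The rows in `toy_pl` are invented stand-ins for a long-format dataset such as pl, and `duplicate_keys` is a hypothetical helper, not part of any SOEP tooling.

```python
from collections import Counter

def duplicate_keys(rows, key_vars):
    """Return the key combinations that occur more than once in rows."""
    counts = Counter(tuple(row[k] for k in key_vars) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Invented toy rows standing in for a long-format dataset such as pl.
toy_pl = [
    {"pid": 101, "syear": 2018},
    {"pid": 101, "syear": 2019},
    {"pid": 102, "syear": 2018},
]

# (pid, syear) identifies every row, so no duplicates are reported.
print(duplicate_keys(toy_pl, ["pid", "syear"]))  # []

# pid alone is not unique in long format: person 101 appears in two years.
print(duplicate_keys(toy_pl, ["pid"]))  # [(101,)]
```

An empty result means the listed identifiers are sufficient to address a single row, which is the precondition for the merges described later in this guide.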

2. Finding what you need

You now have an overview of the various datasets and can search for the variables you need. SOEP-IS contains a large number of variables, and similar information is often stored in different datasets. For example, income-related data can be found in pl or in pgen. Which variables should be used? While this depends on the individual research project, there are several ways to find out. It can be useful to combine different approaches when working with the data. Some possible ways of finding variables, or information on them, are described below. Not all of them are always necessary, and each has its own advantages and disadvantages.

1. Search in the datasets

Probably the most intuitive way to find variables is to open a dataset and search for keywords in the variable labels. However, this approach can be difficult because it is unclear which datasets you need and which words the labels contain, and because similar information may exist in several datasets.

Advantages: quick, easy, intuitive

Disadvantages: requires knowledge of datasets and labels, provides no information on the related survey question
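A keyword search over variable labels can be sketched as follows. This is an illustration in plain Python rather than Stata, under the assumption that the labels have already been read into a dictionary per dataset. The variable names and labels in `toy_labels` are invented for the example and do not correspond to real SOEP-IS variables.

```python
def search_labels(labels_by_dataset, keyword):
    """Return (dataset, variable, label) for every label containing keyword."""
    keyword = keyword.lower()
    hits = []
    for dataset, labels in labels_by_dataset.items():
        for variable, label in labels.items():
            if keyword in label.lower():  # case-insensitive substring match
                hits.append((dataset, variable, label))
    return hits

# Invented labels; real datasets contain far more variables.
toy_labels = {
    "pl": {"x_income": "Gross income last month", "x_health": "Current health"},
    "pgen": {"y_income": "Generated net income", "y_educ": "Years of education"},
}

for dataset, variable, label in search_labels(toy_labels, "income"):
    print(dataset, variable, label)
```

Searching all datasets at once makes it visible when similar information is stored in more than one place, as in the pl and pgen case mentioned above.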

2. Search in the questionnaires

There is a complete overview of the SOEP-IS questionnaires at paneldata.org/soep-is/instruments/. You can access the English and German metadata-based questionnaires (see column “Attachments” for PDFs), which contain the survey questions, the related variables, and the datasets in which they are stored. This offers the highest degree of transparency regarding what was actually asked. Since 2022, these questionnaires have been used as a direct programming template and thus represent the fielded questionnaire as closely as possible. For the same reason, they are rather technical and not always easy to understand. For example, long filter conditions sometimes extend beyond the end of the page. Another problem is that dataset and variable names have changed over the years, so older questionnaires may contain outdated names. In most cases, however, the renaming was minimal and can be traced relatively easily: letters were added (l0011 -> lb0011) or suffixes were appended (l0011 -> lb0011_v1). Despite these drawbacks, the questionnaires still make it possible to find related information and variables relatively quickly by searching the documents for keywords.

Advantages: detailed information on survey questions over the years, related variables are easy to find

Disadvantages: not always up to date in older years, very technical (e.g. filters)
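Tracing a renamed variable can be partly automated. The sketch below, in plain Python, is only a heuristic built on the two renaming patterns mentioned above (added letters, added `_v` suffix); it is an assumption that other renamings follow the same pattern, and any candidate should be verified against the documentation. Apart from l0011, lb0011 and lb0011_v1, the names in `current` are invented.

```python
import re

def rename_candidates(old_name, current_names):
    """Suggest current names that may correspond to an outdated variable name.

    Heuristic: keep the old letter prefix and the digits, allow extra letters
    in between and an optional version suffix such as _v1.
    """
    match = re.fullmatch(r"([a-z]+)(\d+)", old_name)
    if match is None:
        return []
    prefix, digits = match.groups()
    pattern = re.compile(rf"{re.escape(prefix)}[a-z]*{digits}(_v\d+)?")
    return [name for name in current_names if pattern.fullmatch(name)]

# lb0012 and hlb0011 are invented names that should NOT be suggested.
current = ["lb0011", "lb0011_v1", "lb0012", "hlb0011"]
print(rename_candidates("l0011", current))  # ['lb0011', 'lb0011_v1']
```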

3. Search in the codebooks

For each dataset you can find a codebook with a short overview of each variable. This makes it possible to get a first impression of the data without having to access it. Where available, the codebooks also include the latest version of the question a variable is based on. As some of the values are displayed, you can also search for value labels in the codebooks. For variables with a large number of values, however, only some are displayed.

Advantages: overview of variables without access to data, limited search for value labels

Disadvantages: large PDF documents, overall limited information

4. Search in SOEP-IS-Companion

In the chapter Topics we provide a selection of the main topics of SOEP-IS and the respective variables. You can use it to find some of the most relevant variables in SOEP-IS. However, the list is not comprehensive, and SOEP-IS offers many additional variables. It should therefore serve as a starting point or for a quick search for general information.

Advantages: topic related, easy

Disadvantages: just a selection of content and variables

5. Search in paneldata.org

You can use paneldata.org to look for all kinds of information. Paneldata allows you to search for variables and to find more information about generated variables. It offers comprehensive frequency counts, chronologies of variables, cross-study variable linkage via concepts, a syntax generator, and a topic list for content search in the SOEP.

Advantages: easy and comprehensive search, cross-study

Disadvantages: partially missing information/links, not all features are available for all data

3. Merging datasets

The variables you want to use may be stored in different datasets. You can merge most SOEP-IS data using the identifiers pid, hid, and syear. A brief overview of these key identifiers follows:

pid: Never Changing Person ID - Each individual in SOEP has an unambiguous and never changing pid. It is constant over the years and can be used across datasets and studies.

hid: Current Wave HH Number - The hid identifies the household of the respondents in a wave. The hid can change from year to year, e.g. when people leave or switch households. However, each person only has one hid per year. The hid can be used across studies.

syear: The syear variable can be found in every dataset in the long format and indicates the survey year.

SOEP-Core already has an extensive guide on the identifiers in SOEP data.

General guidelines

  • Merges with long datasets should usually include the syear variable

  • Data at the household level should usually be merged using hid

  • Data at the individual level should usually be merged using pid

SOEP-Core already has an extensive guide on how to merge SOEP data with Stata.
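The guidelines above can be sketched with toy data. The example is written in plain Python rather than Stata and only illustrates the logic of key-based merging. It assumes that the person-level meta-dataset carries each person's hid per year; all rows, and the columns `answer` and `hh_item`, are invented, and `merge_on` is a hypothetical helper.

```python
def merge_on(left, right, keys):
    """Keep every row of left and add the matching row of right (many-to-one).

    Raises ValueError if keys do not uniquely identify the rows of right.
    """
    index = {}
    for row in right:
        key = tuple(row[k] for k in keys)
        if key in index:
            raise ValueError(f"{keys} is not a unique key in right: {key}")
        index[key] = row
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged.append({**match, **row})  # values from left win on name clashes
    return merged

# Invented toy data: person 101 moves from household 11 to household 12.
toy_ppathl = [
    {"pid": 101, "syear": 2018, "hid": 11},
    {"pid": 101, "syear": 2019, "hid": 12},
    {"pid": 102, "syear": 2018, "hid": 11},
]
toy_pl = [
    {"pid": 101, "syear": 2018, "answer": 3},
    {"pid": 101, "syear": 2019, "answer": 4},
    {"pid": 102, "syear": 2018, "answer": 5},
]
toy_hl = [
    {"hid": 11, "syear": 2018, "hh_item": 1},
    {"hid": 12, "syear": 2019, "hh_item": 2},
]

# Individual-level data: merge on pid and syear.
persons = merge_on(toy_ppathl, toy_pl, ["pid", "syear"])
# Household-level data: merge on hid and syear.
full = merge_on(persons, toy_hl, ["hid", "syear"])

for row in full:
    print(row["pid"], row["syear"], row["hid"], row["answer"], row["hh_item"])
```

Including syear in both merges matters here: person 101 belongs to a different household in 2019, so a merge on hid alone would attach household information from the wrong year or fail the uniqueness check.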

4. Understanding the Data

Once you have all the variables you need, it is sometimes necessary to understand the origin and characteristics of the variables or distributions. Here are some tips that may help you understand the data:

5. Weighting and imputation

Even though the regular SOEP-IS samples are random probability samples, weights are needed to draw conclusions about the total population. This is because not everyone selected in the sampling process actually takes part in the survey. The weights compensate for this nonresponse, which would otherwise bias the sample. SOEP-IS provides weights at the individual and household level: the variables phrf and hhrf can be found in ppathl and hpathl, respectively.
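The effect of weighting can be illustrated with a weighted mean, computed as the sum of weight times value divided by the sum of weights. The following plain-Python sketch uses invented numbers; `life_sat` is a hypothetical analysis variable, and only the name phrf is taken from the data. It covers the point estimate only, not variance estimation.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Invented toy rows: the third respondent represents twice as many people.
rows = [
    {"pid": 101, "life_sat": 10, "phrf": 1.0},
    {"pid": 102, "life_sat": 20, "phrf": 1.0},
    {"pid": 103, "life_sat": 40, "phrf": 2.0},
]

values = [row["life_sat"] for row in rows]
weights = [row["phrf"] for row in rows]

print(sum(values) / len(values))       # unweighted mean of the sample
print(weighted_mean(values, weights))  # 27.5
```

The weighted result is pulled toward the respondent with the larger weight, which is how the weights correct for groups that are underrepresented among participants.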

SOEP-IS also provides imputations of household income and individual gross and net income. The different imputed variables are stored in hgen and pgen.

6. Record Linkage and Specific Data Requests

SOEP-IS offers the possibility to merge the data with other data sources. Specifically, SOEP-IS data can be linked to administrative data of the Institute for Employment Research (IAB) and to administrative data of the German Pension Insurance. However, in order for the linkage to be possible, the respondents have to consent to it. This consent has so far only been surveyed in the SOEP-IS in 2019 and 2024 (in 2020 only a small subsample received the consent question).

Some variables are not included in the normal Data Distribution File. If you find items in the questionnaires or the documentation sites that are not part of the official data, you can send us an e-mail and we may be able to make the data available individually. However, this depends on the respective data and the circumstances.

Our Community Management is also available to answer any other questions you may have. SOEP-IS specific questions may be forwarded to the responsible members of the team.

Contact Information