SOEP-IS in 6 steps

This section is a brief guide to the basics of working with SOEP-IS. Data users who have never worked with the SOEP before can follow these steps to gain initial experience with SOEP-IS. The best way to use SOEP-IS data depends, of course, on the research project and on individual approaches to working with the data; this guide only offers basic tools for getting started quickly and easily. The statistical software used is Stata, but the principles can be applied in any common software for data processing.

Note: This is not a tutorial for statistical software. Basic knowledge of Stata is assumed.

1. Data delivery and datasets

The standard data formats in which SOEP-IS data are delivered are Stata (.dta) and SPSS (.sav). The datasets are stored with English and German labels. All datasets of a given format are located in the respective folder, without subdirectories. Not all datasets will be relevant for your research. For a quick overview, the table below gives a short description and brief information on each dataset included in the data delivery (version 2024):

Dataset Name	Explanation	Years	Unique identifiers	Note
bioagel	Variables from the Modules of Questions on Children	2003 - current	pid pidresp syear
biobirth	Birth Biography of Female and Male Respondents	N/A	pid
biol	Variables from the Biography Questionnaire	1999 - current	pid syear
bioparen	Biography Information for Respondents’ Parents	N/A	pid
calvi	Information on CALVI Interviews	2024	pid syear	Additional information on interviews in CALVI mode
cognit	Data on cognitive potential	2014, 2018	pid syear	Data from specific innovative modules
dips3_calls	DIPS Call Data	2022	pid n_days n_call	Data from specific innovative modules
dips3_daily	DIPS Daily-Level Data	2022	pid n_days	Data from specific innovative modules
dips3_hourly	DIPS Hourly-Level Data	2022	pid n_days n_hour	Data from specific innovative modules
fgz	FGZ exclusive Variables (Social Cohesion Panel)	2020	pid
hbrutto	Household-related Gross File	1998 - current	hid syear
hgen	Household-related Status and Generated Variables	1998 - current	hid syear
hl	Variables from the Household Questionnaire	1998 - current	hid syear
hpath	Household-related Meta-dataset	N/A	hid	only cid, hid and sample1 included
hpathl	Household-related Meta-dataset (long)	1998 - current	hid syear
ibip_parent	Bonn Intervention Panel (Parents)	2014 - 2020	pid pidresp syear	BIP specific dataset
ibip_pupil	Bonn Intervention Panel (Students)	2014 - 2020	pid syear	BIP specific dataset
idrm	Data from Innovative DRM Module	2012 - 2015	pid syear aktnr	Data from specific innovative modules
idrm_esm	Data from Innovative DRM/ESM Module	2014	N/A	Data from specific innovative modules
iesm	Data from Innovative IESM Module	2014	pid esm_z	Data from specific innovative modules
ilanguage	Innovative Language Modules	2012, 2013, 2016, 2017, 2019	pid syear fspr_id sk_id langatt_id	Data from specific innovative modules
ilottery	Innovative Lottery Experiment	2016	pid1 pid2	Data from specific innovative modules
inno	Variables from the Innovation Modules	2011 - last	pid syear	Data from most innovative modules
inno_h	Variables from the Innovation Modules on Household Level	2014, 2015, 2017, 2018	hid syear	Data from innovative modules that were surveyed at household level and are thus not included in “inno”
intv	Information on Interviewers	2011 - current	intid syear
irisk	Person-related Data from Innovative RISK Module	2014	N/A	Data from specific innovative modules
kidlong	Variables on Children in the Household	1998 - current	pid syear
pbrutto	Person-related Gross File	1998 - current	pid syear
pgen	Person-related Status and Generated Variables	1998 - current	pid syear
pl	Variables from the Individual Question Module	1998 - current	pid syear
ppath	Person-related Meta-dataset	N/A	pid
ppathl	Person-related Meta-dataset (long)	1998 - current	pid syear

Please note that there was no SOEP-IS survey in 2021. For simplicity, the “Years” column does not reflect this gap.
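
As a first check after unpacking the delivery, it can help to confirm that a dataset is really identified by the variables listed in the table above. The following is a minimal Stata sketch, not an official routine. The folder path stored in the global macro soepis is a hypothetical placeholder and must be adjusted to your own setup.

```stata
* Hypothetical location of the delivered .dta files; adjust to your setup
global soepis "C:/data/soep-is"

* Inspect a dataset without loading it into memory
describe using "${soepis}/pl.dta", short

* Load pl and confirm that pid and syear jointly identify each row
* (isid stops with an error if the combination is not unique)
use "${soepis}/pl.dta", clear
isid pid syear

* Survey years covered by the file
tabulate syear
```

The same check works for household-level files by replacing pid with hid, for example isid hid syear after loading hl.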

2. Finding what you need

You now have an overview of the various datasets and can search for the variables you need. SOEP-IS contains a large number of variables, and similar information is often available in different datasets. For example, income-related data can be found in pl or in pgen. Which variables should be used? This depends on the individual research project, but there are several ways to find out. It can be useful to combine different approaches when working with the data. Some possible ways of finding variables, or information about them, are described below. Not all of them are always necessary, and each has its own advantages and disadvantages.

1. Search in the datasets

Probably the most intuitive way to find variables is to open the dataset and search for keywords in the variable labels. However, this approach can be difficult because it is unclear which datasets you need and which words appear in the labels, and because similar information might exist in several datasets.

Advantages: quick, easy, intuitive

Disadvantages: requires knowledge of datasets and labels, provides no information on the related survey question
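
A keyword search of this kind can be sketched in Stata with lookfor, which searches variable names and variable labels. The folder path in the global macro soepis is again a hypothetical placeholder, and the keyword and the list of datasets are only examples.

```stata
* Hypothetical location of the delivered .dta files; adjust to your setup
global soepis "C:/data/soep-is"

* Search variable names and labels of a single dataset
use "${soepis}/pgen.dta", clear
label language          // lists the label languages defined in the file
lookfor income

* Repeat the search across several datasets; the file list is only an example
foreach f in pl pgen hgen {
    use "${soepis}/`f'.dta", clear
    display as text _newline "Dataset: `f'"
    lookfor income
}
```

Because lookfor only matches the labels of the currently active label language, a keyword has to be in the same language as the labels being displayed.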

2. Search in the questionnaires

There is a complete overview of the SOEP-IS questionnaires on paneldata.org/soep-is/instruments/. You can access the English and German metadata-based questionnaires (see column “Attachments” for PDFs), which contain the survey questions, the related variables, and the datasets they are stored in. This offers the highest degree of transparency regarding what was actually asked in the survey. Since 2022, these questionnaires have been used as a direct programming template and thus represent the questionnaire in the best possible way. For the same reason, they are rather technical and not always easy to understand. For example, long filter conditions sometimes extend beyond the end of the page. Another problem is that dataset and variable names have changed over the years, so older questionnaires might include outdated names. In most cases, however, the renaming was minimal and can be traced relatively easily. For example, letters were added (l0011 -> lb0011) or suffixes were added (l0011 -> lb0011_v1). Despite these drawbacks, the questionnaires still make it possible to find related information and variables relatively quickly by searching for keywords in the documents.

Advantages: detailed information on survey questions over the years, related variables are easy to find

Disadvantages: not always up to date in older years, very technical (e.g. filters)

3. Search in the codebooks

For each dataset you can find a codebook with a short overview of each variable. This makes it possible to get a first impression of the data without having to access it. Where available, the codebooks also include the latest version of the question behind a variable. As some of the values are displayed, you can also search for value labels in the codebooks. For variables with a large number of values, however, only some are displayed.

Advantages: overview of variables without access to data, limited search for value labels

Disadvantages: large PDF documents, overall limited information

4. Search in SOEP-IS-Companion

In the chapter Topics we provide a selection of the main topics of SOEP-IS and the respective variables. You can use this to find some of the most relevant variables in SOEP-IS. However, this is not a comprehensive list, and SOEP-IS offers many additional variables. It should therefore serve as a starting point or as a quick way to look up general information.

Advantages: topic related, easy

Disadvantages: just a selection of content and variables

5. Search in paneldata.org

You can use paneldata.org to look for all kinds of information. Paneldata allows you to search for variables and to find more information about generated variables. It offers comprehensive frequency counts, chronologies of variables, cross-study variable linkage via concepts, a syntax generator, and a topic list for content search in the SOEP.

Advantages: easy and comprehensive search, cross-study

Disadvantages: partially missing information/links, not all features are available for all data

3. Merging datasets

Once you know which variables you want to use, you may find that they are stored in different datasets. You can merge most SOEP-IS data using the identifiers pid, hid, and syear. The following is a brief overview of these most important identifiers:

pid: Never Changing Person ID - Each individual in SOEP has an unambiguous and never changing pid. It is constant over the years and can be used across datasets and studies.

hid: Current Wave HH Number - The hid identifies the household of the respondents in a wave. The hid can change from year to year, e.g. when people leave or switch households. However, each person only has one hid per year. The hid can be used across studies.

syear: The syear variable can be found in every dataset in the long format and indicates the survey year.

SOEP-Core already has an extensive guide on the identifiers in SOEP data

General guidelines

Merges with long datasets should usually include the syear variable.
Data at the household level should usually be merged using hid.
Data at the individual level should usually be merged using pid.

SOEP-Core already has an extensive guide on how to merge SOEP data with Stata
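
The general guidelines above can be illustrated with a short Stata sketch. It is only one possible way to combine the files, not the procedure from the SOEP-Core guide. It assumes that pl contains the household identifier hid in addition to pid and syear, and the folder path in the global macro soepis is a hypothetical placeholder.

```stata
* Hypothetical location of the delivered .dta files; adjust to your setup
global soepis "C:/data/soep-is"

* Individual level, long format: merge on pid and syear
use "${soepis}/pl.dta", clear
merge 1:1 pid syear using "${soepis}/pgen.dta"

* Check how many rows matched before restricting the sample
tabulate _merge
keep if _merge == 3
drop _merge

* Household level, long format: several persons share one hid per year,
* so this is a many-to-one merge on hid and syear
* (assumes that hid is available in pl)
merge m:1 hid syear using "${soepis}/hgen.dta", keep(master match) nogenerate
```

Looking at _merge before dropping unmatched rows is a deliberate choice here: it shows whether rows are lost because of the research design or because of a wrong identifier.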

4. Understanding the Data

Once you have all the variables you need, it is sometimes necessary to understand the origin and characteristics of the variables or distributions. Here are some tips that may help you understand the data:

The options described in step 2, Finding what you need, can be used to find more information.
The FAQ chapter What do the suffixes “_v1”, “_v2”, or “_h” in variable names mean? explains the meaning of variable names.
SOEP-Core provides a description of the missing conventions.
It might be useful to merge the sample1 variable and check whether specific items were only asked of certain samples.
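
The last tip can be sketched in Stata as follows. The sketch rests on several assumptions: pl contains hid, negative codes mark missing values as in the SOEP-Core missing conventions, and myvar is a hypothetical placeholder for a numeric item of interest. The folder path in the global macro soepis is also hypothetical.

```stata
* Hypothetical location of the delivered .dta files; adjust to your setup
global soepis "C:/data/soep-is"

use "${soepis}/pl.dta", clear

* hpath has one row per hid and includes sample1 (see the dataset table)
merge m:1 hid using "${soepis}/hpath.dta", keepusing(sample1) ///
    keep(master match) nogenerate

* myvar is a hypothetical placeholder; replace it with a real variable name.
* Assumption: negative codes are missing codes, non-negative codes are valid.
generate byte valid = (myvar >= 0) & !missing(myvar)

* Share of valid answers by sample: samples with 0 percent valid answers
* were most likely not asked the item
tabulate sample1 valid, row
```

Adding if syear == followed by a specific year to the tabulate command narrows the check to a single wave.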

5. Weighting and imputation

Even though the regular SOEP-IS samples are random probability samples, weights are needed to draw conclusions about the total population. This is because not everyone who is selected in the sampling process actually takes part in the survey; the weights compensate for the nonresponse that would otherwise bias the sample. SOEP-IS provides weights at the individual and the household level. The variables phrf and hhrf can be found in ppathl and hpathl, respectively.
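
A weighted estimate could be obtained in Stata along the following lines. This is a minimal sketch rather than a recommended analysis strategy. The variable myvar and the year 2019 are hypothetical examples, negative codes are assumed to mark missing values, and the folder path in the global macro soepis is a placeholder.

```stata
* Hypothetical location of the delivered .dta files; adjust to your setup
global soepis "C:/data/soep-is"

* Individual level: add the person weight phrf from ppathl
use "${soepis}/pl.dta", clear
merge 1:1 pid syear using "${soepis}/ppathl.dta", keepusing(phrf) ///
    keep(master match) nogenerate

* myvar is a hypothetical analysis variable; 2019 is an arbitrary example year
mean myvar if syear == 2019 & myvar >= 0 [pweight = phrf]

* Household level: add the household weight hhrf from hpathl
use "${soepis}/hl.dta", clear
merge 1:1 hid syear using "${soepis}/hpathl.dta", keepusing(hhrf) ///
    keep(master match) nogenerate
```

Weights are specified as pweight so that Stata treats them as sampling weights and reports robust standard errors.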

SOEP-IS also provides imputations of household income and individual gross and net income. The different imputed variables are stored in hgen and pgen.

6. Record Linkage and Specific Data Requests

SOEP-IS data can be linked with other data sources. Specifically, SOEP-IS data can be linked to administrative data of the Institute for Employment Research (IAB) and to administrative data of the German Pension Insurance. Linkage is only possible for respondents who have consented to it. So far, this consent has only been collected in SOEP-IS in 2019 and 2024 (in 2020 only a small subsample received the consent question).

Some variables are not included in the normal Data Distribution File. If you find items in the questionnaires or the documentation sites that are not part of the official data, you can send us an e-mail and we may be able to make the data available individually. However, this depends on the respective data and the circumstances.

Our Community Management is also available to answer any other questions you may have. SOEP-IS specific questions may be forwarded to the responsible members of the team.

Contact Information