Ingest (or Dispose)

Working with research data carries responsibility. This is particularly true in the social, behavioral, educational, and economic sciences, where people are often the subject of research and sensitive personal data therefore form its core. In this context, various ethical principles can come into conflict. This issue is presented in our article on research ethics.

Legal requirements, such as data protection and copyright laws, also play a central role in research data management. They regulate the handling of research data at EU, federal, or state level. A key responsibility of data managers and curators is not only to be aware of these requirements, but also to work actively towards compliance, to recognize risks and violations, and to develop procedures for adhering to these principles in their institutions.

Anonymization & Pseudonymization

When you ingest the selected research data into your data center, the anonymization and pseudonymization of sensitive data become relevant as part of the necessary data processing.

Anonymization: Anonymization refers to the process of altering personally identifiable data to ensure that no direct or indirect identification of a specific individual is possible. It involves removing or modifying any information that could lead to the identification of a person, so that the data can no longer be attributed to a specific individual.
Pseudonymization: Pseudonymization is the process of replacing personally identifiable data with pseudonyms or codes to make direct identification of a person more difficult. The link between the pseudonyms and the actual identities is stored in a separate table or database accessible only to authorized personnel.

[Source: Glossary | Practice in Short | Research Data and Research Data Management]

Both anonymization and pseudonymization are methods used to protect privacy and ensure data protection. In the social sciences, they are employed to ensure that sensitive data cannot be used to identify individuals or violate their privacy. These measures minimize the risk of unauthorized disclosure or misuse of personal data.
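
The pseudonymization process defined above can be illustrated with a minimal Python sketch. It is not a production procedure: the function name pseudonymize, the field name "name", and the "P-" code format are assumptions made for this example, and the sample records are invented. The point it shows is that the released records carry only a code, while the table linking codes to identities is returned separately so that it can be stored under restricted access.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field of each record with a random code.

    Returns the pseudonymized records and the key table linking codes
    to identities. The key table must be stored separately and be
    accessible only to authorized personnel.
    """
    key_table = {}  # pseudonym -> original identity (restricted access)
    assigned = {}   # original identity -> pseudonym (keeps codes consistent)
    pseudonymized = []
    for record in records:
        identity = record[id_field]
        if identity not in assigned:
            code = "P-" + secrets.token_hex(4)
            while code in key_table:  # regenerate on the unlikely collision
                code = "P-" + secrets.token_hex(4)
            assigned[identity] = code
            key_table[code] = identity
        # Copy every field except the identifying one, then add the code.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pseudonym"] = assigned[identity]
        pseudonymized.append(cleaned)
    return pseudonymized, key_table


# Invented sample data: the same person appears twice.
records = [
    {"name": "Alice Example", "score": 12},
    {"name": "Bob Example", "score": 17},
    {"name": "Alice Example", "score": 15},
]
data, key_table = pseudonymize(records)
```

Because the same person always receives the same code, records can still be linked for analysis. As long as the key table exists, the data are pseudonymized rather than anonymized.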

Common strategies for anonymizing sensitive data include:

  1. Aggregation: Aggregation involves consolidating personal data and presenting it in an aggregate form. This obscures individual information while keeping general trends and patterns.
  2. Generalization: Generalization involves altering personal data to make it more difficult to identify individuals. For example, age data can be grouped into age groups or precise locations can be generalized to regions.
  3. Data Masking: Data masking involves removing or obfuscating certain parts of personal data. This can include removing names, addresses, or other identifying information.
  4. Data Suppression: Data suppression involves removing specific records or variables to prevent the identification of individuals. This may be necessary, for example, in case of small samples or rare characteristics.
  5. Top-/Bottom-Coding: This strategy is often applied in statistics to handle outliers or extreme values. In top-coding, any value above a chosen upper threshold is set to that threshold, which makes it difficult to identify individuals with unusually high values. In bottom-coding, any value below a chosen lower threshold is set to that threshold, which makes it difficult to identify individuals with unusually low values. Top-/Bottom-Coding can be applied in various contexts, such as income data, where very high or very low incomes can be considered outliers. It protects individuals' privacy while preserving important information about the data distribution. However, it also affects statistical analysis, especially the shape of the distribution and parameter estimation, so it should be used with caution and with its implications for the analysis in mind.
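
The strategies listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended procedure: all function names, the 10-year age bands, the income thresholds of 1000 and 10000, the minimum group size of 2, and the sample records are invented for this example. Suitable values depend on the research context and the applicable data protection rules.

```python
from collections import Counter
from statistics import mean


def generalize_age(age, width=10):
    """Generalization: map an exact age to an age group such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_value(value, keep=2):
    """Data masking: keep the first `keep` characters, obfuscate the rest."""
    return value[:keep] + "*" * (len(value) - keep)


def top_bottom_code(value, lower, upper):
    """Top-/bottom-coding: set values outside [lower, upper] to the threshold."""
    return max(lower, min(value, upper))


def suppress_rare(records, field, min_count=2):
    """Data suppression: drop records whose value in `field` is too rare."""
    counts = Counter(r[field] for r in records)
    return [r for r in records if counts[r[field]] >= min_count]


def aggregate_mean(records, group_field, value_field):
    """Aggregation: report one mean per group instead of individual values."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[value_field])
    return {group: mean(values) for group, values in groups.items()}


# Invented sample data with one very high and one very low income.
people = [
    {"age": 34, "income": 2800},
    {"age": 37, "income": 3100},
    {"age": 52, "income": 250000},
    {"age": 58, "income": 400},
    {"age": 91, "income": 1900},
]

# Generalize ages and clamp incomes to the assumed thresholds.
coded = [
    {
        "age_group": generalize_age(p["age"]),
        "income": top_bottom_code(p["income"], lower=1000, upper=10000),
    }
    for p in people
]

# The single person in the '90-99' group is a rare case and is suppressed.
released = suppress_rare(coded, "age_group")

# Only group-level means remain in the aggregate.
summary = aggregate_mean(released, "age_group", "income")
```

In this sketch the strategies are chained, but each can also be used on its own. The top-coded income of 250000 illustrates the caveat from the list: the mean of the '50-59' group is computed from the clamped value, not the original one.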

The choice of anonymization strategy depends on various factors, such as the specific research context, data protection regulations, and data analysis requirements. It is advisable to adhere to applicable data protection policies and laws and to consult data privacy experts if necessary.

The Consortium for Research Data in Education has published helpful working papers that examine the anonymization of both qualitative and quantitative research data. The free anonymization tool QualiAnon supports the anonymization and pseudonymization of text data.

Access Controls

Finally, for data whose comprehensive anonymization cannot be sufficiently guaranteed, institutionally implemented access controls are an option.

One example is the recommendations of the German Psychological Society (DGPs) on the handling of research data. The RDC at ZPID has implemented these recommendations in the form of an Access Class Model: for data with special requirements regarding data protection and research ethics, ZPID offers various data release levels.
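
The general idea of graded release levels can be sketched as follows. The three level names and the rule that a user's clearance must be at least as high as the dataset's level are assumptions made for this illustration. They do not reproduce the actual classes or conditions of the ZPID Access Class Model or the DGPs recommendations.

```python
from enum import IntEnum


class AccessClass(IntEnum):
    """Hypothetical release levels, ordered from least to most restrictive."""
    PUBLIC = 0      # anyone may download
    REGISTERED = 1  # registered researchers only
    RESTRICTED = 2  # access only after an individual agreement


def may_access(user_clearance, dataset_class):
    """Grant access if the user's clearance covers the dataset's level."""
    return user_clearance >= dataset_class


# A registered researcher can read public data but not restricted data.
can_read_public = may_access(AccessClass.REGISTERED, AccessClass.PUBLIC)
can_read_restricted = may_access(AccessClass.REGISTERED, AccessClass.RESTRICTED)
```

In practice, such levels are usually tied to contractual and organizational conditions as well, which a simple comparison like this does not capture.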