Preservation action (Reappraise, Migrate)

  • Data Preservation

    After you have reviewed, selected, documented, and prepared the incoming research data, specific measures for data preservation come into play in the context of long-term data storage. The primary focus here is to ensure (technical) reusability so that the preserved research data do not become technically outdated and unusable for the research community.

    The need for specific preservation measures is driven by changing requirements of researchers who will reuse the data as well as technological developments:

    Examples of this include new data or file formats, new interfaces required by the target audience for working with new software programs or environments, new scientific standards or practices which require conversion into new units of measurement or additional parameters as background information, or even the expansion of the target audience to include lay people.

    (translated from Ludwig & Enke, 2013, p. 36).

    Measures

    Central to this is the ongoing documentation of the technologies used and the needs of the user target group. Observing technological developments (Technology Watch) and observing the target audience (Community Watch) are essential activities during the preservation phase and are crucial in determining whether measures to maintain long-term usability are required.

    Community Watch

    “The designated communities with their specialist and technical requirements and
    possibilities are the main point of reference in determining whether information objects can
    still be used or whether they risk becoming unusable.”

    (nestor – Working Group Preservation planning, Guideline for Preservation Planning, 2012, p. 26)

    The target audience is decisive in determining which data formats are currently prevalent. On the one hand, this means that rare formats are legitimate if they can be processed by the user target group. On the other hand, preservation measures may become necessary when the available data cannot be used by the target group, even if the data are stored in a commonly used format.

    Consequently, the precise and continuous observation of the target audience is an important activity in the context of data preservation for the digital long-term archiving of research data. To conduct this observation systematically, methods such as “annual interviews, surveys or workshops with representatives of the designated communities, and also receptive procedures such as participating in events of the designated communities or the targeted evaluation of user inquiries and requests” can be used (nestor – Working Group on Preservation Planning, Guideline for Preservation Planning, 2012, p. 26).

    Technology Watch

    In addition to observing the target audience, it is necessary to constantly evaluate current technological developments and possibilities. This makes it possible to reflect on the software environments currently in use and to consider innovative alternatives.

    Checksums

    Checksums are also relevant to the long-term preservation of research data in the social sciences, where large amounts of data are often collected and analyzed over a long period of time. Checksums make it possible to verify data integrity. This supports the quality and reliability of the data and is crucial for the reproducibility of research results. They are primarily used to confirm that data remains unchanged while it is preserved for archiving. This is particularly important because research data is often retained for an extended period and may be stored on different storage media or in different archiving systems. In addition, checksums can be used when transferring data between different institutions or research teams. By comparing checksums before and after the transfer, potential data loss or corruption can be detected.

    Specifically, a checksum is a numerical value calculated from a digital object. The value is computed from the bits of the file, so it changes even with minor alterations. When copying, the checksums of the original file and the copy can be compared to detect errors. Data repositories typically use cryptographic hash functions for file validation, which are applied to the bitstream and respond to the smallest changes with different checksums. The result of the algorithm is a fixed-length string that, for all practical purposes, changes as soon as even a single bit in the data changes. When the data is later restored, the checksum can be recalculated and compared to the original checksum. If the two checksums match, this indicates that the data is unchanged and intact.
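
    The copy-and-compare procedure described above can be sketched in Python with the standard-library hashlib module. This is a minimal illustration, not a repository's actual workflow: the helper name file_checksum and the file names are hypothetical, and SHA-256 is assumed as the algorithm.

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical demo files in a temporary working directory.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "survey_wave1.csv")
copy = os.path.join(workdir, "survey_wave1_copy.csv")

with open(original, "wb") as handle:
    handle.write(b"respondent_id,age\n1,34\n2,58\n")

# A faithful copy yields the same checksum as the original.
shutil.copyfile(original, copy)
copy_is_intact = file_checksum(original) == file_checksum(copy)

# Simulate corruption by overwriting the first byte of the copy.
with open(copy, "r+b") as handle:
    handle.seek(0)
    handle.write(b"R")
corruption_detected = file_checksum(original) != file_checksum(copy)

print(copy_is_intact, corruption_detected)
shutil.rmtree(workdir)
```

    Reading the file in chunks keeps memory use constant, which matters for large data files. A mismatch only shows that the two files differ, not where or why.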

    The checksum algorithms used in the social sciences are mostly general-purpose algorithms that are also used in other areas of computer science and data processing. Some of the most common checksum algorithms include:

    • MD5 (Message Digest Algorithm 5): MD5 is a widely used checksum algorithm that generates a 128-bit checksum. Although MD5 is still frequently used, it is now considered insecure for cryptographic applications because collisions (different inputs with the same checksum) can be constructed deliberately.
    • SHA-1 (Secure Hash Algorithm 1): SHA-1 is another well-known checksum algorithm that generates a 160-bit checksum. SHA-1 is also considered insecure for cryptographic applications.
    • SHA-256 (Secure Hash Algorithm 256-bit): SHA-256 belongs to the SHA-2 family, the successor to SHA-1, and generates a 256-bit checksum. It is considered more secure and is used in various fields, including the long-term archiving of research data.
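
    The checksum lengths listed above can be checked with Python's hashlib module. This is a small sketch with an arbitrary sample input. It assumes that all three algorithms are enabled in the local Python build, since some restricted builds disable MD5.

```python
import hashlib

data = b"example research data"  # arbitrary sample input

digest_bits = {}
for name in ("md5", "sha1", "sha256"):
    digest = hashlib.new(name, data)
    # digest_size is given in bytes; multiply by 8 to get bits.
    digest_bits[name] = digest.digest_size * 8
    print(name, digest_bits[name], digest.hexdigest())
```

    The loop reports 128, 160, and 256 bits respectively. The hexadecimal strings are a quarter as many characters long, because each hex character encodes four bits.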

    These checksum algorithms can be used in the social sciences to ensure the integrity of research data during archiving, transmission, or analysis processes. By comparing checksums before and after specific data preservation and archiving actions, potential data loss or corruption can be detected.
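
    A before-and-after comparison of this kind is often organized as a manifest, that is, a list of files with their checksums. The following Python sketch illustrates the idea under stated assumptions: the function names build_manifest and compare_manifests and the sample files are hypothetical, and whole files are read into memory, which is only suitable for small files.

```python
import hashlib
import os
import tempfile


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its SHA-256 hex digest."""
    manifest = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            manifest[os.path.relpath(path, directory)] = digest
    return manifest


def compare_manifests(before, after):
    """Return sorted lists of missing, added, and changed relative paths."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return missing, added, changed


with tempfile.TemporaryDirectory() as archive:
    # Hypothetical archive contents.
    for name, content in (("codebook.txt", b"v1"), ("data.csv", b"id,score\n1,7\n")):
        with open(os.path.join(archive, name), "wb") as handle:
            handle.write(content)

    before = build_manifest(archive)

    # Stand-in for a preservation action that went wrong:
    # one file is altered and another is lost.
    with open(os.path.join(archive, "data.csv"), "wb") as handle:
        handle.write(b"id,score\n1,9\n")
    os.remove(os.path.join(archive, "codebook.txt"))

    after = build_manifest(archive)
    missing, added, changed = compare_manifests(before, after)

print(missing, added, changed)  # ['codebook.txt'] [] ['data.csv']
```

    Storing the manifest alongside the archived data allows the same comparison to be repeated after each later transfer or migration.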

    Overall, checksums are an important tool to ensure the integrity of research data in the social sciences.