    Dataset: Synthetic PDF Testset for File Format Validation

    Alternate identifier:
    -
    Related identifier:
    -
    Creator/Author:
    Lindlar, Michelle https://orcid.org/0000-0003-3709-5608 [TIB]

    Tunnat, Yvonne [ZBW]

    Carl, Wilson [OPF]
    Contributors:
    -
    Title:
    Synthetic PDF Testset for File Format Validation
    Additional titles:
    -
    Description:
    (Abstract) This data set presents a corpus of lightweight files designed to test the validation criteria of JHOVE's PDF module against "well-formedness". Test cases are based on the structural requirements for PDF files in the ISO 32000-1:2008 standard. The basis for all test files is a single-page, one-line document with no special features such as linearization. While such a lightweight document allows checking against only a fragment of the standard's requirements, the focus was put on basic structure violations at the header, trailer, document catalog, page tree node and cross-reference levels. The test set also checks for basic violations at the page node, page resource and stream object levels. The accompanying spreadsheet briefly categorizes and describes the test set and includes the outcome of running the test set against JHOVE 1.16, PDF-hul 1.8 as well as Adobe Acrobat Professional XI Pro (11.0.15). The spreadsheet also includes a codecov coverage statistic for the test set in relation to the JHOVE 1.16, PDF-hul 1.8 module. Further information can be found in the paper "A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly", published in the proceedings of the 14th International Conference on Digital Preservation (Kyoto, Japan, September 25-29 2017). While the spreadsheet only contains results of running the test set against JHOVE, it can be used as a ground truth for any file format validation process.
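To illustrate the kind of header, trailer and cross-reference violations the description refers to, here is a minimal Python sketch. It is not JHOVE's PDF-hul logic and not part of the data set. The function name `basic_structure_problems` and the problem labels are hypothetical. The sketch assumes a classic cross-reference table (the `xref` and `trailer` keywords), which is the usual layout for a simple one-page file, and it would flag files that use cross-reference streams instead.

```python
import re
from typing import List

# ISO 32000-1 headers have the form %PDF-1.N with N from 0 to 7.
HEADER_RE = re.compile(rb"^%PDF-1\.[0-7]")
# The file should end with: startxref, a byte offset, then %%EOF.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def basic_structure_problems(data: bytes) -> List[str]:
    """Return labels for basic structural problems found in a PDF byte string.

    Hypothetical helper: assumes a classic cross-reference table.
    """
    problems = []
    if not HEADER_RE.match(data):
        problems.append("header")
    if b"trailer" not in data:
        problems.append("trailer")
    match = STARTXREF_RE.search(data[-1024:])
    if match is None:
        problems.append("startxref/EOF")
    elif not data[int(match.group(1)):].startswith(b"xref"):
        # The offset after startxref should point at the xref keyword.
        problems.append("xref offset")
    return problems
```

An empty list only means that none of these four checks fired. It does not establish well-formedness in the sense tested by the spreadsheet.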
    Keywords:
    PDF, file format validation, digital preservation, ISO 32000-1:2008
    Related information:
    -
    Language:
    English
    Publishers:
    Michelle Lindlar, Yvonne Tunnat
    Production year:
    2017
    Subject areas:
    Software Technology
    Resource type:
    Text
    Data source:
    -
    Software used:
    -
    Data processing:
    -
    Publication year:
    2017
    Rights holders:
    Michelle Lindlar
    Funding:
    -
    Status:
    Published
    Uploaded by:
    lindlar
    Created on:
    2017-09-11
    Archiving date:
    2017-11-05
    Archive size:
    613.9 kB
    Archive creator:
    lindlar
    Archive checksum:
    b19d4d5668bc8cb3cc8a1a3b0280e246 (MD5)
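A downloaded copy can be compared against the MD5 value above. The following Python sketch assumes that the checksum refers to the archive file as delivered by the repository. The helper name `md5_of_file` and the file name in the usage note are hypothetical.

```python
import hashlib

# MD5 value taken from the "Archive checksum" field of this record.
EXPECTED_MD5 = "b19d4d5668bc8cb3cc8a1a3b0280e246"


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("dataset.tar") == EXPECTED_MD5` would be the comparison for a local copy saved under the assumed name `dataset.tar`. MD5 is suitable here only as a check against transfer errors, not as protection against deliberate tampering.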
    Embargo period:
    -
    DOI: 10.22000/53
    Publication date: 2017-11-05
    Rights statement for the dataset
    This work is licensed under
    CC BY-SA 4.0
    Cite Dataset
    Lindlar, Michelle; Tunnat, Yvonne; Carl, Wilson (2017): Synthetic PDF Testset for File Format Validation. Michelle Lindlar, Yvonne Tunnat. DOI: 10.22000/53