Open Data Guide
This is a support document to guide open-data digital projects to submit an application for digital public good recognition by the DPGA.
Two dimensions of data openness
The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions.
Tips:
This guide helps in publishing metadata
And this guide helps in opening and publishing government data in a secure manner.
Criteria to consider data as "open"
Available: Data should be priced at no more than a reasonable cost of reproduction, preferably as a free download from the Internet. This pricing model is achievable because your agency should not incur any cost when it provides data for use.
In bulk: The data should be available as a complete set. If you have a register which is collected under statute, the entire register should be available for download. A web API or similar service may also be very useful, but it is not a substitute for bulk access.
In an open, machine-readable format: Re-use of data held by the public sector should not be subject to patent restrictions. More importantly, making sure that you are providing machine-readable formats allows for greatest re-use. To illustrate this, consider statistics published as PDF (Portable Document Format) documents, often used for high quality printing. While these statistics can be read by humans, they are very hard for a computer to use. This greatly limits the ability for others to re-use that data.
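To illustrate why a machine-readable format matters, here is a minimal sketch that reads statistics published as CSV using only the Python standard library. The column names (`region`, `out_of_school`) and the sample values are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be a downloaded file.
SAMPLE_CSV = """region,out_of_school
North,1200
South,800
East,450
"""

def total_by_column(csv_text: str, column: str) -> int:
    """Sum an integer column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row[column]) for row in reader)

print(total_by_column(SAMPLE_CSV, "out_of_school"))
```

The same figures locked inside a PDF table would first need error-prone text extraction before any such calculation could run.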
Digital public goods must be designed and developed to advance the Sustainable Development Goals (SDGs). A good way to provide evidence of this is:
Write a couple of clear sentences that explain the relationship between your content and the selected SDG(s), pointing to the specific targets you help accomplish.
Provide any link(s) of a blog post, media post, or public communication (organization mission statement or similar) that talks about any social, public, or relevant contribution to society. It is not necessary that these mention SDGs as long as it relates to the previous explanation.
You can use this SDG tracker tool to get an idea of the targets, initiatives, and data around each SDG.
The SDG Academy provides free, open educational resources from the world's leading experts on the sustainable development goals.
<Project name> helps advance SDG 4: Quality Education by providing freely accessible data on the number of children out of the public schooling system categorised by region. Governments can use this data to build better facilities in underserved communities. This is in alignment with target 4.6: Universal literacy and numeracy and target 4.1: Free primary and secondary education. The data is published under Creative Commons Attribution 4.0 International licence and can be copied and repurposed by all education stakeholders.
Collaboration with X local government to advance education - "www.link-to-the-article.com"
For open data sets the use of a Conformant licence is required. A screenshot of licences allowed:
A good way to provide evidence of the licence used is to list it in the footer of your website and include it in the root of your GitHub repository.
All data published under <project name> is licensed under (CC-BY-4.0) Creative Commons Attribution 4.0 International - https://creativecommons.org/licenses/by/4.0 and can be copied and repurposed, as well as remixed and shared.
<Project name> acts as an aggregator of data. We collect data from farmers across North India and publish it with statistical analysis on the correlation between crop yields and farmer suicides by state to better inform government policies and private sector aid to those places.
<Project name>'s governance policies can be found on GitHub <insert link> as well as on the Governance page of our website <insert link>. The data is stewarded by the <name of committee> who undertake a list of measures including <give examples> to ensure safety of users and usability of the data set. Any grievance or concern of misuse can be reported to <name + contact>.
[Note: This information should also be publicly, easily accessible]
Clear Ownership for data sets includes:
Declaring the type of organisation
Naming the data governance committee and declaring the policies
Naming the stewarding committee
Please identify which category your organisation belongs to according to Deloitte's five archetypal open data value propositions and provide evidence:
Suppliers: organisations that publish their data via an open interface to allow others to use and reuse it.
Aggregators: organisations that collect and aggregate open data and, sometimes, other proprietary data, typically on a particular theme, to find correlations, identify efficiencies or visualise complex relationships.
Developers: organisations and software entrepreneurs that design, build and sell web-based, tablet or smartphone applications for individual consumption.
Enrichers: organisations (typically larger, established businesses) that use open data to enhance their existing products and services through better insights.
Enablers: organisations that facilitate the supply or use of open data, such as the competition initiatives
This is the broader term that defines the policies, regulations and practices that govern data in an organisation. These policies make sure that all the data collected is standardised so that it is interoperable and that it is in line with an organisation's broader framework and vision. Stewardship is a part of this structure. An effective data governance program addresses three key questions or elements:
What data to govern
a) Reference data on business categories like plants, account groups, payment terms, shipment priorities, and so on.
b) Master data for business entities like customers, vendors, products, GL accounts and more.
c) Transactional data on the business events like orders, prices, invoices and so on.
How to govern data
a) A policy is a rule that helps an organisation govern the data and manage risks based on standards.
b) A business process is a series of related, structured activities performed by the data governance team to accomplish a specific objective.
c) A procedure is a sequence of steps or work instructions to complete an activity within a process.
What organisation mechanisms are required
a) Data owner, who is from the business, is accountable for the data and makes decisions on the right to access and usage.
b) Data stewards are from the various business units, and they are responsible for the content and context associated with the data.
c) Data custodians are from IT and they are responsible for the safe and secure custody, integration, and storage of data.
Stewardship is a mechanism for sharing data that fosters trust among stakeholders, enhances transparency and enables greater control over data. A steward acts as a trusted, neutral intermediary who engages and negotiates with stakeholders to represent their best interests while preserving the privacy of individuals. Stewards must possess relevant technical capacity. The purpose of this is two-fold:
The steward or an associated entity it oversees must carry out necessary data cleaning, pre-processing and processing of data. This ensures the data's quality, integrity and interoperability.
Stewards must ensure that data is protected both in storage and in transmission.
All data sets must be platform independent, i.e. they must not create mandatory dependencies for users of the data. If there are dependencies, users should be given an easy, no-cost way to move to other technologies, i.e. the dependencies must have alternatives and cannot be a mandatory part of the data set.
(1) Data on <Project name> can be found on <GitHub docs link>, which describes the context behind collecting farmers' data, the methodologies used, the technical requirements for accessing and downloading the data set, possible use cases, and resources to build on this set. It consists of structured, raw data such as name, age, income levels, and crop yield %.
(2) <Project name> acquires its data from paddy farmers spread across North India. It is used to help various government bodies formulate policies and enable the private sector to extend financial help to underserved communities.
It only contains information that would be relevant to formulate policies such as income levels, crop yield percentages, cost of production, and number of dependent members of the household. This data set enables the informed creation of farmer-first schemes and helps in the equitable distribution of public and private funds + resources.
(3) To ensure protection of farmers' data, the stewardship committee of <Project name> undertakes the following measures: We prevent re-identification of farmers to protect them against backlash by <insert measure>, we also encrypt all the data at source using <method>. Moreover, the data is only shared with authorised government agencies and private bodies. We authenticate their access using <method 1> and <method 2>.
(4) Users of this data set can choose to be notified of further updates by signing up on <insert link>. This link is publicly available on our website as well as the downloadable data set. Users who choose to get this data through our API <insert link> will also find information automatically updated for them on a periodic basis.
This is the space to provide more context as to the fields that the data set includes, how it was collected and how it is to be interpreted. This could include:
Type of data (Raw or Processed and Structured or unstructured)
Background
Methods + context
Use cases
Interpretation guidelines
Framework for utilising this data
Additional resources
This is a good open source static document generator.
There are 3 other fields you must specify:
Source + Benefit of data
It is important to take into account the people who have provided this data and the people this data is intended for. Doing so helps clearly outline who the source and beneficiaries of this data set are and it ensures that the advantages derived from using it are equitably distributed.
You can outline the steps taken for data minimisation by listing them on your website or GitHub repository. In a world of "information overload", it is necessary to ensure that all new data put into the public domain is absolutely vital and will advance important goals. This helps prevent a "data dump" where non-relevant, potentially sensitive information is put out into the public domain without any clear benefit and with the possibility of being misused or being unusable.
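One way to make a minimisation step concrete is an explicit allow-list of fields applied before publication. The sketch below is only an illustration: the field names are assumptions loosely based on the example fields mentioned in this guide, and the record values are invented.

```python
# Fields judged necessary for the stated purpose (an assumption for this sketch).
ALLOWED_FIELDS = ("income_level", "crop_yield_pct", "cost_of_production", "dependents")

def minimise(record: dict) -> dict:
    """Return a copy of the record containing only the allowed fields."""
    return {key: record[key] for key in ALLOWED_FIELDS if key in record}

# Hypothetical raw record as collected; name and phone are not needed for policy work.
raw = {
    "name": "Example Farmer",
    "phone": "000-000",
    "income_level": "low",
    "crop_yield_pct": 62.5,
    "cost_of_production": 18000,
    "dependents": 4,
}

published = minimise(raw)
print(published)
```

An allow-list fails safe: a newly collected field stays unpublished until someone deliberately adds it to the list.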
Steps taken for Data protection at source
It is the responsibility of the Data Stewards to ensure that the data protects the rights of the contributors while allowing the users to fully and freely benefit from it. Some methods they can take to protect the data at source include:
This may include innovative computational techniques that remove identifiers or prevent re-identification like data masking, synthetic data generation or recombinant sequencing.
Other measures may include encryption of data at the source through one-time hashing techniques which corresponds to the principle of anonymization at source. Data is de-personalized at the source through this process, which masks personal identifiers with another unique identifier.
Username and password combinations, geo-blocking restrictions and software that limits the usability of certain features, such as copying or pasting.
While the measures employed depend on the data type that is collected, what should be broadly considered from a technical standpoint are principles of data minimization and access control. Data minimization entails advocating minimal data collection based on explicitly identified purposes, limited retention policies, and deletion policies.
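The idea of masking a personal identifier with another unique identifier can be sketched with a keyed one-way hash. This is one possible approach, not a method prescribed by this guide: HMAC-SHA256 is used here as a common choice, and the key and identifier values are hypothetical.

```python
import hashlib
import hmac

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map a personal identifier to a stable token using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical key; a real key should be generated randomly and stored
# securely, separately from the published data.
KEY = b"replace-with-a-random-secret"

token_a = pseudonymise("farmer-001", KEY)
token_b = pseudonymise("farmer-001", KEY)
token_c = pseudonymise("farmer-002", KEY)

# The same input always yields the same token; different inputs yield different tokens.
print(token_a == token_b, token_a != token_c, len(token_a))
```

Replacing identifiers in this way does not by itself prevent re-identification, since combinations of the remaining fields (for example age, location and income) can still single out a person, so it is best paired with data minimisation and access control.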
Mechanisms for Updation of Data
A lot of information provided by the data set may change on a monthly / yearly / periodic basis. While the majority of the data is provided in bulk, it is necessary to have a method to update the data as and when required. This could be done through providing an additional API as well as having an optional system of communications where users of the data set can choose to be notified of updates to the set.
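As a rough sketch of how a user or a notification service might detect that a bulk file has been updated, the example below compares SHA-256 checksums of two releases. The file contents and the idea of publishing a checksum alongside each release are assumptions made for illustration.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a data set's bytes."""
    return hashlib.sha256(data).hexdigest()

def has_changed(latest: bytes, last_seen_checksum: str) -> bool:
    """True when the published bytes differ from the version seen before."""
    return checksum(latest) != last_seen_checksum

# Hypothetical successive releases of the same CSV file.
january = b"state,crop_yield_pct\nA,61\n"
february = b"state,crop_yield_pct\nA,64\n"

seen = checksum(january)
print(has_changed(january, seen))   # same bytes as before
print(has_changed(february, seen))  # a new release
```

Publishing the checksum and a release date next to each download lets users verify both that the data changed and that their copy is intact.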
Digital public goods must have the possibility of extracting data from the system in a non-proprietary format. A good way to provide evidence of this is to state the mechanisms from which data can be downloaded or exported publicly.
List of non-proprietary file formats.
Open API Specifications
Data can be directly exported and/or downloaded into the following open formats: CSV, XML, JSON.
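A minimal sketch of exporting the same records to the three formats named above, using only the Python standard library. The records, field names and element names (`records`, `record`) are hypothetical; a real project would export its own schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records used only for this illustration.
ROWS = [
    {"state": "A", "crop_yield_pct": "61"},
    {"state": "B", "crop_yield_pct": "58"},
]

def to_csv(rows: list) -> str:
    """Serialise rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows: list) -> str:
    """Serialise rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)

def to_xml(rows: list) -> str:
    """Serialise rows as XML, one <record> element per row."""
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(record, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(to_csv(ROWS))
print(to_json(ROWS))
print(to_xml(ROWS))
```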
Digital public goods must be designed and developed to comply with applicable privacy laws. A good way to provide evidence of this is:
Provide a link to your project/organisation's privacy policy.
State any privacy laws you comply with.
Data Protection and Privacy Legislation Worldwide.
Privacy policy generator and example.
<Project name> complies with laws like the GDPR, CCPA, CalOPPA and U.S. Federal Children's Online Privacy Protection Act of 1998. You can also access our privacy policy at www.project-website.org/privacy
Digital public goods must be designed and developed to align with relevant standards, best practices, and/or principles. A good way to provide evidence of this is to state all relevant data, technology or related best practices and open standards.
List of resources and best practices for open data
HINT:
For best practices regarding open source software solutions, particularly for organisations involved in developing and maintaining software and policy together, please refer to The Standard For Public Code.
<Project name> adheres to the Principles For Digital Development and Human Centred Design Principles. Evidence of this compliance can be found here <insert link>
Digital public goods must be designed to anticipate, prevent, and do no harm by design. A good way to provide evidence of this is to provide any links relevant to user terms and conditions, privacy policy, code of conduct or similar.
Definition for personal data (PII data).
Terms of use example.
These are reference docs for specific purposes:
Child Protection guidelines
Mobile Security Testing guidelines
Data protection impact assessment guidelines + template
You can access our privacy policy at www.project-website.org/privacy, code of conduct at www.project-website.org/code-of-conduct and terms of use at www.project-website.org/terms-of-use.
1. Possession
Who is generating the data?
Who is responsible for producing the data?
Who is responsible for collecting the data?
Who controls the publication of the data?
Who controls the access permissions of the data?
Who controls versioning the data?
Who is consuming the data?
Who can access the data?
Who can store the data?
2. Accountability
Who decides the goal of the data?
Who has an understanding of the goal of the data?
Who can ultimately verify the data is correct and has integrity?
Who is responsible for data loss?
Who is liable if data is corrupted?
Who fixes problems with the data?
Who can make agreements about the data?
3. Execution
Who decides what data needs to be collected?
Who decides what roles and organisations can access the data?
Who can add, change, and remove data from the system?
4. Production
Who can benefit from the data that is published or generated?
Who decides what the data is worth and sets pricing?
Who benefits when the data is sold?