Legal:Data publication guidelines
This policy or procedure is maintained by the Wikimedia Foundation. Please note that in the event of any differences in meaning or interpretation between the original English version of this content and a translation, the original English version takes precedence.
The right to privacy is at the core of how communities contribute to Wikimedia projects, and upholding this right is central to our human rights commitments. These data publication guidelines are the Wikimedia Foundation's best practices for managing risk in data publication. They complement our Data retention guidelines and contribute to our commitment to protect users' data, as elaborated in our privacy policy.
Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in Wikimedia Foundation systems.
Data publication risk tiering grid
Data classification | Confidential | Restricted | |
---|---|---|---|
Risk level | Tier 1: High risk | Tier 2: Medium risk | Tier 3: Low risk |
Data that could certainly be used to cause harm | Data that could likely or possibly be used to cause harm | Data that is unlikely to be used to cause harm or is private for administrative reasons | |
Examples (non-exhaustive list) | * Data containing PII
|
* High-level analyses of
|
* High-level analyses of
|
Response time goal | 3 work weeks | 5 work days | N/A |
Expected % of requests (internal metric) | 15% | 35% | 50% |
What this means for Wikimedia Foundation teams | |||
Follow-up actions |
|
|
|
Note: the country protection list is a reference guide to countries that are potentially dangerous for internet freedom. It is not indicative of the Foundation's working relationship with each country.
Frequently asked questions
- Q: What is the Risk Tiering Grid used for? The Risk Tiering Grid helps Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.
- Q: What are the key risks the Tiering Grid measures? The key risks are overuse and underuse of the grid. If too many items are triaged to Legal and Security, those teams become a bottleneck for necessary workflows. Conversely, if projects go live that privacy review would have halted or mitigated, the Foundation is exposed to privacy risks, including reputational, legal, and security risks.
- Q: Who are the intended audiences of the Tiering Grid? Teams that work with data in product and tech.
- Q: What has changed from the existing risk review process? The existing review process required every single schema and data project to undergo Legal review. That requirement was not being followed, and it was not practical for either data teams or Legal.
- Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?
  - Get Privacy approval
  - Anyone can initiate an update/amendment but approval must be sought across the board before implementing
  - Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)
- Q: What should I do if I am unsure whether to reach out to the Legal and Security teams? When in doubt, it is better to err on the side of caution and submit an L3SC request.
Threshold table
Use this table to determine whether your analysis is granular or high-level, which in turn informs the tier (risk level) of the analysis. Note: thresholds are determined based solely on the statistics being released. For example, if you are only releasing information about edits, you do not need to account for how many editors generated the edits.
Data unit type | "Granular" (classification by count) | "High-level" (classification by count) |
---|---|---|
Users (including unique devices) | <25 | ≥25 |
Edits | <50 | ≥50 |
App interactions | <100 | ≥100 |
Views | <250 | ≥250 |
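The threshold lookup can be expressed as a small helper. The following is a minimal sketch, not an official tool: the dictionary keys and the `classify` function name are illustrative assumptions, and only the numeric thresholds come from the table above.

```python
# Thresholds copied from the table above.
# The key names are illustrative assumptions, not official identifiers.
THRESHOLDS = {
    "users": 25,             # includes unique devices
    "edits": 50,
    "app_interactions": 100,
    "views": 250,
}


def classify(unit_type: str, count: int) -> str:
    """Return "granular" if count is below the threshold for unit_type,
    otherwise "high-level"."""
    threshold = THRESHOLDS[unit_type]
    return "granular" if count < threshold else "high-level"
```

For instance, `classify("users", 24)` returns `"granular"`, while `classify("users", 25)` returns `"high-level"`, because a count equal to the threshold already counts as high-level.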
For reverts, report the rate and a rough total if either the reverted edit count or the total edit count is less than the threshold. For example:
- If 8 out of 49 edits were reverted:
  - "16.3% reverted (out of <50 edits)"
- If 49 out of 49 edits were reverted:
  - "100% reverted (out of <50 edits)"
- If 20 out of 580 edits were reverted, either of:
  - "3.4% reverted (out of ~600 edits)"
  - "3.4% reverted (out of >500 edits)"
- If 50 out of 50 edits were reverted:
  - OK to leave as-is (both counts meet threshold)
This guidance also applies to reporting below-threshold percentages for other data types.
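The revert-reporting rule could be sketched as follows. This is one possible reading under stated assumptions: the `revert_summary` and `rough_total` names, the choice to round the total to one significant figure, and the wording used when both counts meet the threshold are all illustrative and not prescribed by these guidelines.

```python
EDIT_THRESHOLD = 50  # "Edits" row of the threshold table


def rough_total(n: int) -> int:
    """Round n to one significant figure (assumed rounding rule)."""
    magnitude = 10 ** (len(str(n)) - 1)
    return round(n / magnitude) * magnitude


def revert_summary(reverted: int, total: int, threshold: int = EDIT_THRESHOLD) -> str:
    """Format a revert rate without exposing below-threshold counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    # One decimal place, with a trailing ".0" dropped (100.0 -> "100").
    rate = f"{100 * reverted / total:.1f}".rstrip("0").rstrip(".")
    if reverted >= threshold and total >= threshold:
        # Both counts meet the threshold, so exact numbers may be shown.
        return f"{rate}% reverted ({reverted} out of {total} edits)"
    if total < threshold:
        return f"{rate}% reverted (out of <{threshold} edits)"
    # Total meets the threshold but the reverted count does not.
    return f"{rate}% reverted (out of ~{rough_total(total)} edits)"
```

With these assumptions, `revert_summary(8, 49)` yields `"16.3% reverted (out of <50 edits)"` and `revert_summary(20, 580)` yields `"3.4% reverted (out of ~600 edits)"`, matching the examples above.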
Publication risk mitigation checklist
This self-service checklist is intended to help data scientists and analysts lower the risk of a high or medium risk data publication and reduce unintentional disclosures of private information.
Before you post data publicly (which includes pushing a notebook to Gerrit or GitLab), have you:
- entered this data publication into the data publication log form?
- cleared outputs that display raw data?
- cleared outputs that display granular data (as defined in the threshold table above)?
- obfuscated rows that display granular data? For example:
Python | R |
---|---|
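# imagine we are doing an analysis of the number of *users* who tried a feature
# (df is assumed to be an existing pandas DataFrame)
# set constants
threshold = 25
col = "num_users"
# find the rows to obfuscate while the column is still numeric
below_threshold = df[col] < threshold
# cast to object so the column can hold both numbers and strings
df[col] = df[col].astype(object)
# obfuscate rows
df.loc[below_threshold, col] = f'<{threshold}'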
|
library(tidyverse)
library(glue)
# set constants
threshold <- 25
df <- df |>
  mutate(num_users = ifelse(num_users < threshold, glue("<{threshold}"), num_users))
|
- filtered out rows that display granular data? For example:
Python | R |
---|---|
# imagine we are doing an analysis of the *app interactions* that users performed
# set constants
threshold = 100
col = "num_interactions"
# filter out rows below threshold
df = df[df[col] >= threshold]
|
library(tidyverse)
# set constants
threshold <- 100
df <- df |>
  filter(num_interactions >= threshold)
|
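For the checklist items about clearing outputs, one blunt option is to strip every code-cell output from a notebook before pushing it. The sketch below assumes the notebook is stored as standard nbformat 4 JSON (a top-level `cells` list in which code cells carry `outputs` and `execution_count`). The `clear_outputs` name is illustrative, and clearing all outputs goes further than the checklist strictly requires.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (parsed nbformat 4 JSON) with all
    code-cell outputs and execution counts removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

A typical use would be to read the `.ipynb` file with `json.load`, pass the result through `clear_outputs`, and write it back with `json.dump` before committing.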
General risk heuristics
Below, "X > Y > Z" means that X is riskier than Y, which is in turn riskier than Z.
- Data type:
  - Geography:
    - city > (sub-national) region > country > subcontinent > continent > global
    - country protection list > non-country protection list
  - Device details:
    - raw User-Agent > browser or OS type > device type
    - raw IP > partially-redacted IP range
  - Temporal:
    - dt (raw timestamp) > hourly > daily > monthly
  - Combos of multiple keys > any key on its own (e.g. country + project > country or project)
- User activity type:
  - fundraising activity > editing activity > interaction activity > reading activity
- Wikimedia Foundation activity type:
  - data collection > data analysis
  - granular analysis > high-level analysis
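The "X > Y > Z" orderings can be encoded as ranked lists so that two levels of the same dimension can be compared. This is only an illustrative sketch: the `RISK_ORDER` dictionary, its short key names, and the `riskier` function are assumptions, and only three of the orderings above are included.

```python
# Each list runs from riskiest to least risky, following the heuristics above.
# Dimension and level names are shortened, illustrative labels.
RISK_ORDER = {
    "geography": ["city", "region", "country", "subcontinent", "continent", "global"],
    "temporal": ["dt", "hourly", "daily", "monthly"],
    "user_activity": ["fundraising", "editing", "interaction", "reading"],
}


def riskier(dimension: str, a: str, b: str) -> bool:
    """Return True if level a is riskier than level b within a dimension."""
    order = RISK_ORDER[dimension]
    return order.index(a) < order.index(b)
```

For example, `riskier("geography", "city", "country")` is `True`, reflecting that city-level data is riskier than country-level data.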
Contact us
If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.
Notes
- This process requires specialist help to ensure that the differential privacy (DP) algorithm is correctly configured, as well as adequate documentation.