Algorithms that “Don’t See Color”: Measuring Biases in Lookalike and Special Ad Audiences

Piotr Sapiezynski, Northeastern University, Boston, MA, USA ([Link]@[Link])
Avijit Ghosh, Northeastern University, Boston, MA, USA (ghosh.a@[Link])
Levi Kaplan, Northeastern University, Boston, MA, USA (kaplan.l@[Link])
Aaron Rieke, Upturn, Washington, DC, USA (aaron@[Link])
Alan Mislove, Northeastern University, Boston, MA, USA (amislove@[Link])

arXiv:1912.07579v3 [[Link]] 31 May 2022
ABSTRACT

Researchers and journalists have repeatedly shown that algorithms commonly used in domains such as credit, employment, healthcare, or criminal justice can have discriminatory effects. Some organizations have tried to mitigate these effects by simply removing sensitive features from an algorithm’s inputs. In this paper, we explore the limits of this approach using a unique opportunity. In 2019, Facebook agreed to settle a lawsuit by removing certain sensitive features from the inputs of an algorithm that identifies users similar to those provided by an advertiser for ad targeting, making both the modified and unmodified versions of the algorithm available to advertisers. We develop methodologies to measure biases along the lines of gender, age, and race in the audiences created by this modified algorithm, relative to the unmodified one. Our results provide experimental proof that merely removing demographic features from a real-world algorithmic system’s inputs can fail to prevent biased outputs. As a result, organizations using algorithms to help mediate access to important life opportunities should consider other approaches to mitigating discriminatory effects.

KEYWORDS

online advertising, fairness, process fairness

1 INTRODUCTION

Organizations use algorithmic models¹ (“algorithms”) in a variety of important domains, including healthcare [27], credit [19], employment [9, 23], and content distribution [3]. Unfortunately, these algorithms have been shown to sometimes have discriminatory effects that can often be challenging to detect, measure, and articulate.

Some have proposed mitigating discriminatory effects by removing demographic features from an algorithm’s inputs. For example, in 2019 the U.S. Department of Housing and Urban Development (HUD) proposed a rule that considered applying this approach to housing discrimination [12]. Because algorithms can effectively use omitted demographic features by combining other inputs that are each correlated with those features [5], such a rule could nullify any protection from discriminatory effects. This is particularly true in large-scale machine learning (ML) systems, which can take as input thousands or even millions of features [6].

In this paper, we leverage a unique opportunity created by a recent lawsuit settlement involving Facebook’s advertising platform to explore the limits of this approach. Specifically, we examine Facebook’s Lookalike Audiences targeting tool, which takes a list of Facebook users provided by an advertiser (called the source audience) and creates a new audience of users who share “common qualities” with those in the source audience. In March 2018, the National Fair Housing Alliance (NFHA) and others sued [13] Facebook over violations of the Fair Housing Act (FHA). When the case was settled in March 2019, Facebook agreed to modify the functionality of Lookalike Audiences when used to target housing, credit, and employment ads. In brief, Facebook created the Special Ad Audiences tool, which works like Lookalike Audiences, except its algorithm does not consider users’ age, gender, relationship status, religious views, school, political views, interests, or zip code when detecting common qualities.

We seek to learn whether the Special Ad Audience algorithm (which is not provided with certain demographic features) actually produces significantly less skewed audiences than the Lookalike Audience algorithm (which is). In other words, when provided with a source audience that skews heavily toward one demographic group over another, to what extent does each of these tools reproduce that skew? We focus on skews along the demographic features named in the settlement, enabling us to examine whether simply removing the protected features from an algorithm’s input is sufficient to eliminate skew along those features. To do so, we develop a methodology to examine the delivery of the same ads when using the two types of audiences, measuring the skew along the lines of gender, age, and race.

We show that our Special Ad audiences² are skewed to almost the same degree as Lookalike audiences, with many of the results being statistically indistinguishable. For example, when using a source audience that is all women, our Lookalike audience-targeted ad delivered to 96.1% women, while the Special Ad audience-targeted ad delivered to 91.2% women. We also provide evidence indicating that both Lookalike and Special Ad audiences carry—to a certain extent—the biases of the source audience in terms of race and political affiliation.

¹ Throughout this paper, we refer to a large class of algorithmic models using the now-common term “algorithms”, especially those created through statistical modeling and machine learning.
² Throughout the paper, we use “Lookalike Audience” or “Special Ad Audience” to refer to the general tools provided by Facebook, and “Lookalike audience” or “Special Ad audience” to refer to a particular audience.

To underscore the real-world impact of these results, we place ads as an employer who is seeking to find candidates “similar to” their current workforce. Using a source audience consisting of Facebook employees, we find that the resulting Special Ad audience skews heavily towards 25–34-year-old men. We also confirm that previous findings on how Facebook’s delivery mechanisms can cause further skews in who is shown ads hold for Special Ad Audiences.

Taken together, our results show that simply removing demographic features from the inputs of a large-scale, real-world algorithm will not always suffice to meaningfully change its outputs. At the same time, this work presents a methodology by which other algorithms could be studied.

To be clear, we are not claiming—and do not believe—that Facebook has incorrectly implemented Special Ad Audiences, or is in violation of its settlement agreement. Rather, the findings in this paper are a natural result of how complex algorithmic systems work in practice.

Ethics. The research has been reviewed by our Institutional Review Board and marked as exempt. Further, we minimized harm to Facebook users by only running “real” ads, i.e., if a user clicked on one of our ads, they were presented with a real-world site relevant to the content of the ad. We did not have any direct interaction with the users who were shown our ads, and did not collect any of their information. Finally, we minimized harm to Facebook by running and paying for our ads just like any other advertiser, as well as flagging them as employment ads whenever applicable.

2 BACKGROUND

In this section, we provide background on Facebook’s ad targeting tools and overview related work.

2.1 Facebook’s ad targeting tools

Facebook provides a range of targeting tools to help advertisers select an audience of users who will be eligible to see their ads. For example, advertisers can select users through combinations of targeting attributes, including over 1,000 demographic, behavioral, and interest-based features.

More germane to this paper and its methods, Facebook also offers a number of other, more advanced targeting tools. One such tool is Custom Audiences, which allows advertisers to indicate individual users that they wish to include in an audience. To use Custom Audiences, an advertiser uploads a list of personally identifiable information (PII), potentially including names, email addresses, phone numbers, dates of birth, and mobile identifiers. Facebook then compares those identifiers against its database of active users, and lets the advertiser include matched users in their target audience.

Another tool is Lookalike Audiences, which creates an audience of users who share “common qualities” with users in a Custom audience provided by the advertiser (called the source audience). The exact input qualities used by the algorithm in creating these audiences are not known; the documentation lists only two examples, demographic information and interests. Prior work has demonstrated that Lookalike Audiences can reproduce demographic skews present in source audiences [34].

2.2 Special Ad Audiences

In March 2018, the NFHA and others sued Facebook for allowing landlords and real estate brokers to exclude members of protected groups from receiving housing ads [13]. The lawsuit was settled in March 2019, and Facebook agreed to make a number of changes to its ad targeting tools. Facebook now refers to this modified Lookalike Audiences tool as Special Ad Audiences.

From an advertiser’s perspective, Special Ad Audiences are created in the same manner as Lookalike Audiences (i.e., based on a source Custom audience). The minimum size for both types of these algorithmically generated audiences is 1% of the population of the target location, regardless of the size of the source audience. In the case of the U.S., this means that the algorithm outputs audiences of 2.3 million users.

2.3 Related work

Greenberg distinguishes two kinds of fairness concerns, distributive and procedural [22]. The former aims to assure balanced outcomes, whereas the latter focuses on the process itself. Elimination of sensitive features, for example sex or race, from an algorithm’s input (as with Special Ad Audiences) falls into the procedural category. Such an approach in the legal context is also referred to as anti-classification, and it is encoded in current standards [11]. However, scholars and researchers have for decades critiqued this so-called “colorblind” approach to addressing historical inequality and discrimination [7]. Legal scholar Destiny Peery argues that “(1) colorblindness is, under most circumstances, undesirable given its recently discovered negative outcomes, particularly for the very groups or individuals it is meant to protect; (2) true colorblindness is unrealistic given the psychological salience of race; and (3) race consciousness in the law is necessary to ensure equal treatment of racial groups in regulated domains such as housing, education, and employment [30].” In the context of sentencing and mass incarceration, Traci Schlesinger concludes that “in the post-civil rights era, racial disparities are primarily produced and maintained by colorblind policies and practices [32].” Similar arguments have been made in the context of housing discrimination and a range of other domains [4].

Previous work in statistics and machine learning indicated that, in general, removing sensitive features does not reliably achieve fairness, for a number of reasons. First, certain features might serve as close proxies for the sensitive information; for example, due to housing segregation, a person’s zip code can be predictive of their race. Second, the removed information might be redundantly encoded by non-sensitive features or their combinations, and will then be reconstructed by the model if it is pertinent to the prediction task [10, 14, 36]. One such example is the fiasco of Amazon’s hiring algorithm [21]. Third, there are cases in which only certain intersections of values of otherwise non-sensitive features are to be protected [29]. Finally, even if none of the features or their combinations are unfair, their predictive performance might differ across sub-populations; in an effort to minimize the total error, the classifier will fit the majority group better than the minority [8, 31]. Taken together, these prior works paint a clear picture of process fairness, or fairness through unawareness, as insufficient to ensure fair outcomes. Unfortunately, despite this consensus among scholars and a few high-profile failures in practice, the 2019 settlement is still based on fairness through unawareness. In this article we investigate whether this particular implementation is closer to achieving the goal of fairness.
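
To make the redundant-encoding failure mode described above concrete, the following minimal, synthetic sketch (our own illustration, not an experiment from this paper; every feature name and probability is invented) shows how a model that never sees a sensitive feature can still reproduce a disparity through a correlated proxy such as a zip-code region.

    # Synthetic illustration of "redundant encoding": a model trained without the
    # sensitive feature still reproduces the disparity via a correlated proxy.
    # All names and numbers are made up for this sketch.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 50_000

    group = rng.integers(0, 2, size=n)                              # sensitive feature, never used as input
    zip_region = np.where(rng.random(n) < 0.8, group, 1 - group)    # proxy correlated with group
    outcome = (rng.random(n) < np.where(group == 1, 0.3, 0.7)).astype(int)  # historically biased label

    # Train only on the proxy; the sensitive feature has been "removed".
    model = LogisticRegression().fit(zip_region.reshape(-1, 1), outcome)
    scores = model.predict_proba(zip_region.reshape(-1, 1))[:, 1]

    print("mean score, group 0:", round(scores[group == 0].mean(), 3))
    print("mean score, group 1:", round(scores[group == 1].mean(), 3))

Dropping the sensitive column changes little here because the proxy carries most of the same information; the audits below probe the same mechanism at the scale of Facebook’s ad platform.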

Regardless of the particular approach to ML fairness, focusing on particular algorithms can be too narrow of a problem definition. Real-world algorithmic systems are often composed of multiple subsystems and can be discriminatory as a whole, even if built from a series of fair algorithms [15]. They need to be modeled along with the other components of the socio-technical systems they are embedded in [33]. The burden of these investigations lies on independent researchers and auditors, since the companies who operate these algorithms might not be incentivized to measure and address the externalities they cause [28].

3 METHODOLOGY

In this work we attempt to measure audience skews in terms of gender, age, race, and political views. Facebook Ad Manager reports the gender and age distribution of the audiences that received each ad, but it does not report information about the race or political views of these audiences. We therefore apply two different approaches to creating the audiences and measuring the effects.

3.1 Timing

The 2019 settlement [16] stipulated that the updated ad creation flow for special categories be implemented by September 30, 2019. All of our ads were created and run between October 20, 2019 and December 15, 2019, leaving Facebook ample time after the implementation deadline.

3.2 Measuring skews by gender and age

To measure the makeup of a target audience by gender and age, we create and run actual ads and then use the Facebook Ad Manager API to record how they are delivered. For these experiments, we need to provide an ad creative (consisting of the ad text, headline, image, and destination URL). Since the ad content influences delivery [3], we chose to use the same creative for all ads, unless otherwise noted: a generic ad for Google Web Search, which has basic text (“Search the web for information”) and a link to Google. We found that Facebook does not verify that an ad self-reported by an advertiser as a housing, credit, or employment ad is, in fact, such an ad. On the other hand, Facebook does automatically classify housing, credit, or employment ads as such even if the advertiser chooses not to disclose that information. Thus, the only way for us to run the same ad creative using both Lookalike and Special Ad audiences was to run a neutral ad that would not trigger the automatic classification.

Creating audiences. Recall that our goal is to measure whether Special Ad Audiences produce significantly less biased audiences than Lookalike Audiences. We therefore need to generate source audiences with controlled and known bias, from which we can create a Lookalike and a Special Ad audience. We replicate the approach from prior work [3], relying on publicly available voter records from New York and North Carolina. These records include registered voters’ gender, age, location (address), and (only in North Carolina) race.

Thus, for each demographic feature we wish to study, we first create a Custom audience based on the voter records (which we treat as ground truth). For example, when studying gender, we select a subset of the voters who are listed as female and use that list to create a Custom audience. We use each biased Custom audience to create both a Lookalike audience and a Special Ad audience, selecting users in the U.S. and choosing the smallest size option (1% of the population).

Data collection. Once the ads are running, we use Facebook’s Ad Manager tool to collect information about the demographics of the audiences that Facebook shows our ads to, broken down by age group, gender, and the intersections of these two characteristics.

Calculating and comparing gender skew. The Ad Manager tool reports the gender of each user as either female, male, or unknown. The unknown gender might refer to users who choose to self-report their gender as falling outside of the binary, or those who did not provide their gender. We note that in all experiments there are no more than 1% of such users, and we report the observed gender bias as the fraction of men p̂ in the reached audience. We also calculate the upper and lower 99% confidence limits (U.L. and L.L., respectively) around this fraction p̂ using the method presented by Agresti and Coull [2]:

    L.L. = \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + z_{\alpha/2}^2/n},
    U.L. = \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + z_{\alpha/2}^2/n}    (1)

We set z_{\alpha/2} = 2.576, corresponding to the 99% interval.

Finally, we verify whether the difference between the fractions observed for Lookalike and Special Ad audiences is statistically significant using the difference of proportions test:

    \Delta p_{LS} = (\hat{p}_L - \hat{p}_S) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_L(1-\hat{p}_L)}{n_L} + \frac{\hat{p}_S(1-\hat{p}_S)}{n_S}},    (2)

where p̂_L and p̂_S are the fractions of men who saw the ad in the Lookalike and Special Ad audiences, and n_L and n_S are the numbers of people reached in each of these audiences. Because we are testing the significance in seven experiments (one for each input proportion), we apply the Bonferroni correction for multiple hypothesis testing. We do so by setting z_{\alpha/2} to 3.189, corresponding to a Bonferroni-corrected p_val = 0.01/7 ≈ 0.00143. If the confidence interval includes 0, we cannot reject the hypothesis that the fraction of men is the same in the two audiences, and thus the result is not statistically significant.
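
Equations (1) and (2) are simple enough to compute directly. The following sketch shows one possible implementation using the constants stated above; the helper names are ours and are not part of any tooling described in the paper.

    # Sketch of Eq. (1): 99% confidence limits around a fraction (Agresti & Coull [2]),
    # and Eq. (2): difference-of-proportions test with a Bonferroni-corrected z value.
    from math import sqrt

    def proportion_ci(p_hat, n, z=2.576):
        """Lower/upper confidence limits around p_hat for n observations (Eq. 1)."""
        center = p_hat + z**2 / (2 * n)
        margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
        denom = 1 + z**2 / n
        return (center - margin) / denom, (center + margin) / denom

    def proportion_diff_ci(p_l, n_l, p_s, n_s, z=3.189):
        """Interval around p_l - p_s (Eq. 2); z=3.189 is the Bonferroni-corrected value."""
        margin = z * sqrt(p_l * (1 - p_l) / n_l + p_s * (1 - p_s) / n_s)
        return p_l - p_s - margin, p_l - p_s + margin

    # Example with made-up delivery counts: if the interval excludes 0, the skews of
    # the Lookalike and Special Ad audiences differ significantly.
    print(proportion_ci(0.96, 5000))
    low, high = proportion_diff_ci(0.96, 5000, 0.91, 5000)
    print((low, high), "significant" if low > 0 or high < 0 else "not significant")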

Calculating and comparing the age skew. The age of the users who were shown each ad is reported in groups: <18, 18–24, 25–34, 35–44, 45–54, 55–64, and 65+. We calculate the mean age and the confidence intervals around it using formulas specific to grouped data. First, we compute the mid-point M_i for each age range i,

    M_i = \frac{x_{\min_i} + x_{\max_i}}{2}    (3)

Next, we find the mean age \mu,

    \mu = \frac{\sum_i (M_i \times F_i)}{\sum_i F_i},    (4)

where F_i is the number of audience members in the age group i. We then compute the standard deviation around that mean,

    \sigma = \sqrt{\frac{\sum_i (F_i \times M_i^2) - (n \times \mu^2)}{n - 1}}    (5)

and the corresponding standard error,

    SE = \frac{\sigma}{\sqrt{n}}    (6)

The presented upper and lower confidence limits correspond to

    U.L. = \mu + k \times SE,    L.L. = \mu - k \times SE    (7)

respectively, where k is set to 2.576.

Finally, we verify whether the difference in mean ages between the Lookalike and Special Ad audiences is statistically significant. To achieve that, we compute the standard error of the difference,

    SE_{LS} = \sqrt{\frac{\sigma_L^2}{n_L} + \frac{\sigma_S^2}{n_S}},    (8)

and the 99% confidence interval around the difference between mean ages:

    \Delta\mu_{LS} = \mu_L - \mu_S \pm z_{\alpha/2} \times \sqrt{\frac{\sigma_L^2}{n_L} + \frac{\sigma_S^2}{n_S}}    (9)

We apply the Bonferroni correction for six tests and use z_{\alpha/2} set to 3.143. If the confidence interval includes 0, we cannot reject the hypothesis that the mean age is the same in the two audiences, and thus the difference is not statistically significant.

3.3 Measuring racial skews

When measuring racial skew in the audiences, we are unable to re-use the same methodology as for age and gender, which relied on Facebook’s ad delivery statistics. Instead, we develop an alternative methodology that relies on estimated daily results: Facebook’s estimate of the number of users matching the advertiser’s targeting criteria that can be reached daily within the specified budget. We set the daily budget to the maximum allowed value ($1M) to best approximate the total number of users that match the targeting criteria. Facebook returns these values as a range (e.g., “12,100 – 20,400 users”); throughout this procedure, we always use the lower value.³ The procedure has only two steps: audience creation and targeting. It does not involve running any ads and observing the skew in delivery, and is entirely based on the estimates of audience sizes provided by Facebook at the ad targeting step.

³ We used the midpoint and the upper value and found similar results.

We note that ours is not the first use of these estimates to infer the number of users that match different criteria. For example, Garcia et al. used them to estimate gender inequality across the globe [20], while Fatehkia et al. found they are highly predictive of a range of other social indicators [18].

Audience Creation. We start with the publicly available voter records from North Carolina, in which voters self-report their race and ethnicity. We focus on two groups: Non-Hispanic Black and Non-Hispanic white. For each group, we create two independent Custom audiences: one list of 10,000 randomly selected users with that race, and one list of 900,000 randomly selected users with that race. The latter audience does not contain any individuals already selected for the first list, and will be referred to as the reference audience.

We refer to these as w_10k and w_900k (white audiences) and b_10k and b_900k (Black audiences). We then have Facebook algorithmically generate Lookalike and Special Ad audiences using the smaller Custom audiences as input. We refer to the resulting audiences as L_w_10k (for the Lookalike audience based on w_10k), S_w_10k (for the Special Ad audience), L_b_10k, and S_b_10k.

Targeting. The goal of this step is to find the overlaps between the audiences with unknown race generated by the algorithms and the reference Custom audiences that we provided (with known race). Then we can say there is a race bias in the white Lookalike audience L_w_10k if the overlap between it and the white reference audience w_900k is higher than the overlap between it and the Black reference audience b_900k (and vice versa for an audience generated from a Black source audience). We also perform these overlap comparisons for Special Ad audiences to measure whether this effect persists despite removing sensitive features from the algorithm.

Our method relies on the fact that Facebook allows advertisers not only to specify which audiences to include in the targeting, but also which to exclude. Suppose we wish to obtain an estimate of the fraction of white users in L_w_10k. To do so, we first target the reference white audience w_900k and record the potential daily reach (e.g., 81,000). We then target L_w_10k and record the potential daily reach (e.g., 397,000). Finally, we target L_w_10k while excluding the w_900k audience, and record the potential daily reach (e.g., 360,000). Now, we can observe that excluding w_900k from L_w_10k caused the potential daily reach to drop by 37,000, indicating that approximately 46% (37,000/81,000) of w_900k were present in L_w_10k. We can then repeat the process excluding b_900k, and measure the fraction of the reference Black audience that is present in L_w_10k. By comparing the fractions of w_900k and b_900k that are present in L_w_10k, we obtain an estimate of the racial bias of L_w_10k.
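
The overlap estimation in the Targeting step above reduces to a subtraction and a division over three reach estimates. The sketch below uses the example numbers from the text; the function name is ours, and in practice the potential-daily-reach values have to be read off Facebook’s ad creation interface rather than computed locally.

    # Sketch of the Targeting-step arithmetic with the example reach numbers above.
    def overlap_fraction(reach_reference, reach_audience, reach_audience_excl_reference):
        """Estimate the fraction of the reference audience present in the audience."""
        drop = reach_audience - reach_audience_excl_reference
        return drop / reach_reference

    # Targeting w_900k reaches 81,000 users/day; L_w_10k reaches 397,000;
    # L_w_10k excluding w_900k reaches 360,000.
    frac_white = overlap_fraction(81_000, 397_000, 360_000)
    print(f"~{frac_white:.0%} of w_900k appears in L_w_10k")   # ~46%

    # Repeating the measurement while excluding b_900k gives the Black-reference
    # overlap; comparing the two fractions yields the racial-bias estimate.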

Measuring political skews. To measure political skews, we follow the exact same method as with measuring racial skews, but rather than constructing the audiences based on their reported race, we use their registered political affiliation as Democratic or Republican voters.

Limitations. Unlike in our experiments with gender and age, here we do not know the race of the vast majority of the audience. The Lookalike and Special Ad audiences that Facebook creates consist mostly of people who appear not to be in our voter records. There are multiple reasons why this might be the case: (1) we only looked at single-race, non-Hispanic white and Black voters, excluding all Hispanic voters, as well as those of other races and multi-racial voters; (2) the users in the created audiences could be located in other states, since when creating Lookalike and Special Ad audiences the advertiser can only select the country where those audiences would be located. Thus, the results we present in this section only refer to the fraction of voters with known race who are included in each Lookalike and Special Ad audience, not the racial composition of these audiences overall. Still, these estimates do give us a small window into the makeup of the Lookalike and Special Ad audiences.

4 RESULTS

We now present our experiments and analyze whether Lookalike and Special Ad Audiences show similar levels of skew.

4.1 Gender and age

We begin by focusing on gender, creating seven Custom audiences based on New York voter records. Each audience contains 10,000 voters, with varying fractions of men: 0%, 20%, 40%, 50%, 60%, 80%, and 100%. We run ads to the resulting Lookalike and Special Ad audiences, and compare the results in ad delivery as reported by Facebook’s advertiser interface.

[Figure 1 omitted. Panel A: gender skew in Lookalike and Special Ad audiences (fraction of men in the reached audience vs. fraction of men in the source audience, with the difference between the two audience types shown below). Panel B: age skew in Lookalike and Special Ad audiences (average age of reached users vs. age bracket of the source audience, with the difference in mean age shown below).]

Figure 1: A. Gender breakdown of ad delivery to Lookalike and Special Ad audiences created from source audiences with varying fractions of male users. The Special Ad audiences replicate the skew to a large extent. B. Age breakdown of ad delivery to Lookalike and Special Ad audiences created from source audiences with varying age brackets. Both Lookalike and Special Ad audiences follow the age distribution of the source audiences, but the latter shows a decrease of mean age by up to six years in the 65+ group.

Figure 1A presents a summary of the results of this experiment, and we make a number of observations. First, we can see that each Lookalike audience clearly mirrors its source audience along gender lines: the Lookalike audience derived from a male-only source audience delivers to over 99% men, and the Lookalike audience derived from a female-only source audience delivers to over 97% women. Second, we observe a slight male bias in our delivery, relative to the source audience: for example, the Lookalike audience derived from a source audience of 50% men actually delivered to approximately 70% men. This male bias has been observed by prior work [3, 26] and may be due to market effects or ad delivery effects (which affect both Lookalike and Special Ad audiences equally).

Third, and most importantly, when we compare the delivery of each Special Ad audience to its corresponding Lookalike audience, we observe a similar level of skew (one that in some cases is statistically indistinguishable). For example, the Special Ad audience derived from a male-only source audience delivers to over 95% men, despite being created without having access to users’ genders. As emphasized in the lower panel of Figure 1, the Special Ad audiences do show a bit less skew when compared to the Lookalike audiences for some of the input audiences, while still carrying over most of the skew from the source audience.

We follow an analogous procedure to create six Custom audiences, each consisting of individuals only in a specified age range. We then create Lookalike and Special Ad audiences, measure whether the age skews are reproduced, and present the results in Figure 1B.

4.2 Race

Next, we turn to examine the extent to which Special Ad Audiences can be biased along racial lines, in the same manner Lookalike Audiences were observed to be in past work [34]. We summarize the overlap between the Lookalike and Special Ad audiences and the large white and Black audiences in Table 1. Focusing on the table, we can immediately observe that both Lookalike audiences show significantly more overlap with the race of the source audience, suggesting that the makeup of the Lookalike audiences is racially biased. For example, the Lookalike audience created from b_10k contains 61% of the active users from b_900k but only 16% of the active users from w_900k (see Methodology for the explanation of the audience names). More importantly, the Special Ad audiences show a similar behavior (though, as before, perhaps with slightly less of a bias). Again, it is important to keep in mind that we can only make estimates of the fraction of w_900k and b_900k that overlap with the Lookalike and Special Ad audiences, and cannot comment on the majority of these audiences (as they likely fall outside of North Carolina). Thus, our results are not conclusive—but only suggestive—that the overall audiences are similarly biased. Below, we provide further robustness analysis of these results.

[Figure 2 omitted. Histograms of the frequency of overlaps between the reference audiences and the Lookalike and Special Ad audiences created from the w_10k and b_10k source audiences.]

Figure 2: Both Lookalike and Special Ad audiences created from source audiences of white users contain a higher fraction of white users than Black users. Conversely, audiences created from source audiences of Black users contain a higher fraction of Black users than white users.

                                          Percent overlap
    Source        Type                    Black (b_900k)    White (w_900k)
    100% Black    Lookalike (L_b_10k)     61.0              16.0
    100% Black    Special (S_b_10k)       62.3              12.3
    100% white    Lookalike (L_w_10k)     16.9              42.0
    100% white    Special (S_w_10k)       10.4              35.8

Table 1: Breakdown of overlap between audiences with known racial makeup and Lookalike and Special Ad audiences. While we do not know the race of the vast majority of the created audiences, we see large discrepancies in the race distribution among the known users.

                                          Percent overlap
    Source        Type                    Democrat (d_900k)    Republican (r_900k)
    Democrats     Lookalike (L_d_10k)     51.6                 31.8
    Democrats     Special (S_d_10k)       42.2                 25.8
    Republicans   Lookalike (L_r_10k)     28.1                 50.0
    Republicans   Special (S_r_10k)       25.0                 47.0

Table 2: Breakdown of overlap between source audiences with known political leaning and resulting Lookalike and Special Ad audiences. While we do not know the political leaning of the vast majority of the audiences, we see discrepancies in the distribution among the known users.

4.3 Robustness

Here, we verify that the presented results regarding race biases are robust to the random selection of the seed from which Lookalike and Special Ad audiences are created. Following the method described in Methodology, we use the two-sample Kolmogorov–Smirnov test to compare the distributions of overlaps presented in Figure 2. The findings are confirmed to be robust to the particular source audience choice. First, the racial skew observable in Lookalike audiences persists in Special Ad audiences and is statistically significant at p_val = 0.01 even with the Bonferroni correction for multiple hypothesis testing. Second, the differences between overlaps produced by Special Ad audiences and Lookalike audiences generated from the w_10k Custom audience are not statistically significant: Special Ad audiences generated from w_10k are just as biased as the corresponding Lookalike audiences. Third, the differences between overlaps produced by Special Ad audiences and Lookalike audiences generated from the b_10k Custom audience are small but statistically significant, and this difference comes from the Special Ad audiences being even more biased than the Lookalike audiences.

4.4 Political views

We next turn to measure the extent to which Lookalike and Special Ad Audiences can be biased along the lines of political views. As with race, Facebook does not provide a breakdown of ad delivery by users’ political views. Thus, we repeat the methodology we used for race, using voter records from North Carolina and focusing on the differences in delivery to users registered as Republicans and Democrats.

We report the results in Table 2. We can observe a skew along political views for Lookalike audiences (for example, the Lookalike audience created from users registered as Democrats contains 51% of d_900k but only 32% of r_900k). We can also observe that the Special Ad audiences show a skew as well, though to a somewhat lesser degree than the Lookalike audiences. As with the race experiments, we remind the reader that we can only observe the overlap between the created audiences and the large Democrat/Republican audiences; we are unable to measure the majority of the created audiences. However, the demonstrated skew suggests that there is a bias in the overall makeup of the created audiences.

4.5 Real-world use cases

Next, we test a “real-world” use case of Special Ad Audiences. We imagine an employer wants to use Facebook to advertise open positions to people who are similar to those already working for them. The employer might assume that since the Special Ad Audiences algorithm is not provided with protected features as inputs, it will allow them to reach users who are similar to their current employees but without gender, age, or racial biases. The employer would therefore upload a list of their current employees to create a Custom audience, ask Facebook to create a Special Ad audience from that, and then target job ads to the resulting Special Ad audience.

We play the role of this hypothetical employer (Facebook itself in this example, which provides employees with an @[Link] email address). We then run the following experiment: We first create a baseline audience by using randomly generated U.S. phone numbers, 11,000 of which Facebook matched to existing users. We then create a Custom audience consisting of 12M generated email addresses: all 2–5 letter combinations + @[Link], 11,000 of which Facebook matched to existing users; this is our audience of Facebook employees. We create Special Ad audiences based on each of these two Custom audiences. Finally, we run two generic job ads—each to one of these Special Ad audiences, at the same time, from the same account, with the same budget—and observe how they are delivered.

Figure 3 presents the results of the experiment. The Special Ad audience based on Facebook employees delivers to 88% men, compared to 54% in the baseline case. Further, the Special Ad audience based on Facebook employees delivers 48% to men aged 25–34, compared to 15% for the baseline audience. Note that Facebook themselves report that the actual skew among company employees is lower, with 63% male employees [1]. Overall, our results show that our hypothetical employer’s reliance on Special Ad audiences to avoid discrimination along protected classes was misplaced: their ad was ultimately delivered to an audience that was significantly biased along age and gender lines (and presumably reflective of Facebook’s employee population). Based on this singular experiment we cannot claim that the extent of the problem would be similar for other employers. Still, we do recommend that potential advertisers use the tool cautiously.

[Figure 3 omitted. Gender-and-age pyramids (women/men, age brackets 18–24 through 65+) comparing the Special Ad audience sourced from Facebook employees with the one sourced from random users; the overall fraction of men is 88% and 54%, respectively.]

Figure 3: Gender and age breakdown of a generic job ad delivery to a Special Ad audience based on random American users (in orange) and a Special Ad audience based on Facebook employees (in blue). The audience based on Facebook employees is predominantly male and 25-34.

[Figure 4 omitted. Gender-and-age pyramids for the delivery of the generic, artificial intelligence, and supermarket job ads to the same Special Ad audience, with the overall fraction of men annotated per ad (54%, 66%, and 28%).]

Figure 4: Gender and age breakdown of delivery of job ads to a Special Ad audience based on random American users. Facebook’s delivery optimization based on the ad content can lead to large skews despite the gender- and age-balanced target audience.

4.6 Content-based skew in delivery

Previous work [3, 24] demonstrated that the skew in delivery can be driven by Facebook’s estimated relevance of a particular ad copy to a particular group of people. Specifically, even when the target audience was held constant, Facebook would deliver our ads to different subpopulations: ads for supermarket jobs were shown primarily to women, while ads for jobs in the lumber industry were presented mostly to men. Here, we show that these effects persist also when using Special Ad Audiences. We run a generic job ad to a Special Ad audience created from a random set of 11,000 users, along with ads for supermarket and artificial intelligence jobs pointing to a search for either keyword on [Link]. Figure 4 shows that the different ads skew towards middle-aged women (in the case of supermarket jobs) or towards younger men (in the case of artificial intelligence jobs).

The results underline a crucial point: when designing fairness/anti-discrimination controls, one cannot just focus on one part of the algorithmic system. Instead, one must look at the whole socio-technical system, including how an algorithm is used by real people, how people adjust their behaviors in response to the algorithm, and how the algorithm adapts to people’s behaviors.

5 LEGAL IMPLICATIONS

At a high level, U.S. federal law prohibits discrimination in the marketing of housing, employment, and credit opportunities. Our findings might have near-term legal consequences for advertisers and even Facebook itself.

A creditor, employer, or housing provider who used biased Special Ad audiences in their marketing could run afoul of U.S. anti-discrimination laws. This could be exceptionally frustrating for an advertiser who believed that Special Ad Audiences was an appropriate, legally-compliant way to target their ads.

Facebook itself could also face legal scrutiny. In the U.S., Section 230 of the Communications Act of 1934 (as amended by the Communications Decency Act, specifically 47 USC § 230, Protection for private blocking and screening of offensive material) provides broad legal immunity to Internet platforms acting as publishers of third-party content. This immunity was a central issue in the litigation resulting in the settlement analyzed above. Although Facebook argued in court that advertisers are “wholly responsible for deciding where, how, and when to publish their ads” [17], this paper makes clear that Facebook can play a significant, opaque role by creating biased Lookalike and Special Ad audiences. If a court found that the operation of these tools constituted a “material contribution” to illegal conduct, Facebook’s ad platform could lose its immunity [35].

6 DISCUSSION

We demonstrated that both Lookalike and Special Ad Audiences can create similarly biased target audiences from the same source audiences. We are not claiming that Facebook incorrectly implemented Special Ad Audiences, nor are we suggesting they violated the settlement. Rather, our findings are a consequence of a complex algorithmic system at work.

Our findings have broad and narrow implications. Broadly, we demonstrate that simply removing demographic features from a complex algorithmic system can be insufficient to remove bias from its outputs, which is an important lesson for government and corporate policymakers. More specifically, we show that relative to Lookalike Audiences, Facebook’s Special Ad Audiences do little to reduce demographic biases in target audiences. As a result, we believe Special Ad Audiences will do little to mitigate discriminatory outcomes.

Absent any readily available algorithm-centered solutions to the presented problem, removing the Lookalike/Special Ad audience functionality, as well as disabling ad delivery optimization in the sensitive contexts of housing, employment, and credit ads, might be the appropriate interim approach.

ACKNOWLEDGEMENTS

The authors thank Ava Kofman and Ariana Tobin for suggesting the experiments presented in Section 4.5 as well as for going an extra mile (or two) for their ProPublica story around this work [25].

We also thank NaLette Brodnax for her feedback on the experimental design and Aleksandra Korolova for her comments on the manuscript. This work was funded in part by a grant from the Data Transparency Lab, NSF grants CNS-1916020 and CNS-1616234, and Mozilla Research Grant 2019H1.

REFERENCES

[1] Advancing Opportunity For All. [Link]
[2] Alan Agresti and Brent A. Coull. Approximate Is Better Than “Exact” For Interval Estimation Of Binomial Proportions. The American Statistician, 52(2):119–126, 1998.
[3] Muhammad Ali, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. Discrimination Through Optimization: How Facebook’s Ad Delivery Can Lead To Biased Outcomes. In ACM Conference on Computer Supported Cooperative Work, Austin, Texas, USA, November 2019.
[4] Michelle Wilde Anderson. Colorblind Segregation: Equal Protection As A Bar To Neighborhood Integration. Calif. L. Rev., 92:841, 2004.
[5] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness And Machine Learning. [Link], 2019. [Link]
[6] Joseph Blass. Algorithmic Advertising Discrimination. Northwestern University Law Review, 114(2):415–468, 2019.
[7] Eduardo Bonilla-Silva. Racism Without Racists: Color-blind Racism And The Persistence Of Racial Inequality In The United States. Rowman & Littlefield Publishers, 2006.
[8] Irene Chen, Fredrik D. Johansson, and David Sontag. Why Is My Classifier Discriminatory? In Advances in Neural Information Processing Systems, pages 3539–3550, 2018.
[9] Le Chen, Aniko Hannak, Ruijin Ma, and Christo Wilson. Investigating The Impact Of Gender On Rank In Resume Search Engines. In Annual Conference of the ACM Special Interest Group on Computer Human Interaction, Montreal, Canada, April 2018.
[10] Consumer Financial Protection Bureau. Using Publicly Available Information To Proxy For Unidentified Race And Ethnicity, 2014. [Link]gov/f/201409_cfpb_report_proxy-[Link].
[11] Sam Corbett-Davies and Sharad Goel. The Measure And Mismeasure Of Fairness: A Critical Review Of Fair Machine Learning. CoRR, abs/1808.00023, 2018.
[12] Department of Housing and Urban Development. HUD’s Implementation Of The Fair Housing Act’s Disparate Impact Standard, 2019. [Link]implementation-of-the-fair-housing-acts-disparate-impact-standard.
[13] Emily Dreyfuss. Facebook Changes Its Ad Tech To Stop Discrimination. WIRED, 2019. [Link]discrimination-settlement/.
[14] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
[15] Cynthia Dwork and Christina Ilvento. Fairness Under Composition. In 10th Innovations in Theoretical Computer Science Conference (ITCS 2019), volume 124 of Leibniz International Proceedings in Informatics (LIPIcs), pages 33:1–33:20, 2018.
[16] Exhibit A – Programmatic Relief. [Link]uploads/2019/03/[Link].
[17] Facebook Motion To Dismiss In Onuoha V. Facebook. [Link]com/recap/[Link].304918/[Link].
[18] Masoomali Fatehkia, Isabelle Tingzon, Ardie Orden, Stephanie Sy, Vedran Sekara, Manuel Garcia-Herranz, and Ingmar Weber. Mapping Socioeconomic Indicators Using Social Media Advertising Data. EPJ Data Science, 9(1):22, 2020.
[19] Marion Fourcade and Kieran Healy. Classification Situations: Life-chances In The Neoliberal Era. Accounting, Organizations and Society, 38(8):559–572, 2013.
[20] David Garcia, Yonas Mitike Kassa, Angel Cuevas, Manuel Cebrian, Esteban Moro, Iyad Rahwan, and Ruben Cuevas. Analyzing Gender Inequality Through Large-scale Facebook Advertising Data. Proceedings of the National Academy of Sciences, 115(27):6958–6963, 2018.
[21] Rachel Goodman. Why Amazon’s Automated Hiring Tool Discriminated Against Women, 2018. [Link]workplace/why-amazons-automated-hiring-tool-discriminated-against.
[22] Jerald Greenberg. A Taxonomy Of Organizational Justice Theories. Academy of Management Review, 12(1):9–22, 1987.
[23] Aniko Hannak, Claudia Wagner, David Garcia, Alan Mislove, Markus Strohmaier, and Christo Wilson. Bias In Online Freelance Marketplaces: Evidence From Taskrabbit And Fiverr. In ACM Conference on Computer Supported Cooperative Work, Portland, Oregon, USA, February 2017.
[24] Basileal Imana, Aleksandra Korolova, and John Heidemann. Auditing For Discrimination In Algorithms Delivering Job Ads. In Proceedings of the Web Conference 2021, pages 3767–3778, 2021.
[25] Ava Kofman and Ariana Tobin. Facebook Ads Can Still Discriminate Against Women and Older Workers, Despite a Civil Rights Settlement. [Link]against-women-and-older%2Dworkers-despite-a-civil-rights-settlement.
[26] Anja Lambrecht and Catherine Tucker. Algorithmic Bias? An Empirical Study Of Apparent Gender-based Discrimination In The Display Of STEM Career Ads. Management Science, 65(7):2966–2981, 2019.
[27] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting Racial Bias In An Algorithm Used To Manage The Health Of Populations. Science, 366(6464):447–453, 2019.
[28] Rebekah Overdorf, Bogdan Kulynych, Ero Balsa, Carmela Troncoso, and Seda Gürses. Questioning The Assumptions Behind Fairness Solutions. arXiv preprint arXiv:1811.11293, 2018.
[29] Dino Pedreschi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware Data Mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 2008.
[30] Destiny Peery. The Colorblind Ideal In A Race-conscious Reality: The Case For A New Legal Ideal For Race Relations. Nw. JL & Soc. Pol’y, 6:473, 2011.
[31] Piotr Sapiezyński, Valentin Kassarnig, Christo Wilson, Sune Lehmann, and Alan Mislove. Academic Performance Prediction In A Gender-imbalanced Environment. In Workshop on Responsible Recommendation, Como, Italy, August 2017.
[32] Traci Schlesinger. The Failure Of Race Neutral Policies: How Mandatory Terms And Sentencing Enhancements Contribute To Mass Racialized Incarceration. Crime & Delinquency, 57(1):56–81, 2011.
[33] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness And Abstraction In Sociotechnical Systems. In Conference on Fairness, Accountability, and Transparency, Atlanta, Georgia, USA, January 2019.
[34] Till Speicher, Muhammad Ali, Giridhari Venkatadri, Filipe Nunes Ribeiro, George Arvanitakis, Fabrício Benevenuto, Krishna P. Gummadi, Patrick Loiseau, and Alan Mislove. On The Potential For Discrimination In Online Targeted Advertising. In Conference on Fairness, Accountability, and Transparency, New York, New York, USA, February 2018.
[35] Upturn Amicus Brief In Onuoha V. Facebook. [Link]recap/[Link].304918/[Link].
[36] Samuel Yeom, Anupam Datta, and Matt Fredrikson. Hunting For Discriminatory Proxies In Linear Regression Models. In Advances in Neural Information Processing Systems, pages 4568–4578, 2018.
