SAS Institute: Volume 1 - Gender, Racial & Political Diversity Benchmarks - A 2022 Machine Learning Study of Alternative Metrics

  • Pages (approximate) 489
  • Author Philip M. Parker, PhD, INSEAD Chair Professor of Management Science
  • Region World
  • Item Code TBHF6AE1RE6D2
  • Published 2022
  • Please note ICON Group has a strict no refunds policy.
  • Price $ 995

Introduction

According to Wikipedia (2022), the interest in diversity metrics can find its origins in environmental, social, and corporate governance (ESG) approaches to board-level decision making:

“Environmental, social, and corporate governance (ESG) is an approach to evaluating the extent to which a corporation works on behalf of social goals that go beyond the role of a corporation to maximize profits on behalf of the corporation’s shareholders. Typically, the social goals advocated within an ESG perspective include working to achieve a certain set of environmental goals, as well as a set of goals having to do with supporting certain social movements, and a third set of goals having to do with whether the corporation is governed in a way that is consistent with the goals of the diversity, equity, and inclusion movement.”

It is the latter emphasis on governance that is the focus of this report, especially in evaluating the extent to which SAS Institute can be benchmarked against peers across diversity metrics. Again, Wikipedia.org notes the following:

“Diversity, equity, and inclusion (DEI) is a term used by organizations and training programs that attempt to ensure all people, regardless of race, gender, or other demographic attribute, are able to succeed in an organization. Diversity is the presence of differences that may include race, gender, religion, sexual orientation, ethnicity, nationality, socioeconomic status, language, (dis)ability, age, religious commitment, or political perspective. Populations that have been-and remain- underrepresented among practitioners in the field and marginalized in the broader society. Equity is promoting justice, impartiality and fairness within the procedures, processes, and distribution of resources by institutions or systems. Tackling equity issues requires an understanding of the root causes of outcome disparities within our society. Inclusion is an outcome to ensure those that are diverse actually feel and/or are welcomed. Inclusion outcomes are met when you, your institution, and your program are truly inviting to all. To the degree to which diverse individuals are able to participate fully in the decision-making processes and development opportunities within an organization or group.”

The definitions and discussions above are mirrored by many academic institutions. For example, the University of Washington provides the following definition which emphasizes the potential impact of DEI on the workplace (https://www.washington.edu/research/or/office-of-research-diversity-equity-and-inclusion/dei-definitions/):

“Diversity is the presence of differences that enrich our workplace. Some examples of diversity may include race, gender, religion, sexual orientation, ethnicity, nationality, socioeconomic status, language, (dis) ability, age, religious commitment, or political perspective in our workplace. There are many more. Equity is ensuring that access, resources, and opportunities are provided for all to succeed and grow, especially for those who are underrepresented and have been historically disadvantaged. Inclusion is a workplace culture that is welcoming to all people regardless of race, ethnicity, sex, gender identity, age, abilities, and religion and everyone is valued, respected and able to reach their full potential.”

Description

This report was created for senior leaders of SAS Institute involved in Diversity, Equity, & Inclusion (DEI), or Environmental, Social, & Governance (ESG) initiatives that require measurable benchmarks relating to public and internal disclosures of diversity metrics. These metrics can be used to track progress and benchmark across other organizations. The methodology for this report was developed by Professor Philip M. Parker, INSEAD Chair Professor of Management Science (Singapore campus, where he teaches Master’s and Ph.D. courses on machine learning and artificial intelligence).

This report leverages natural language processing, a branch of artificial intelligence, and machine learning to develop replicable diversity benchmarks for SAS Institute against companies that are either well known for such initiatives or those competing in “similar” markets. Self-reported affiliations define “publicly visible” persons who are included in the study (e.g., board members, executives, or public-facing employees), and are used to benchmark SAS Institute against the following notable firms and competitors, themselves identified via algorithm (a mix of companies, brand names, multinationals, subsidiaries, or other organizational forms):

Accenture, Adobe, Amazon, Apple, BAE Systems, Cisco, Cloudera, Dell, Deloitte, Ericsson, Facebook, FICO, Google, Hitachi Vantara, HP, IBM, Informatica, Intel, Juniper Networks, Micro Focus, Microsoft, MongoDB, Nokia, OpenText, Oracle, Palantir Technologies, PWC, Red Hat, Salesforce, Samsung, SAP, Schneider Electric, Splunk, Tableau Software, Tata Consultancy Services, Teradata, Verint Systems, VMware, Wipro, Zendesk

The metrics covered (prior probability distributions) include gender assigned at birth, gender identity, sexual orientation, geographic diversity, race/ethnicity, physical characteristics (age, skin color, eye color, hair color, weight, body mass), religion, and political beliefs. As expected, there are substantial variances across companies competing with SAS Institute. The report introduction concludes with ideas on how SAS Institute can improve upon the estimates presented in this study.

About the Author/Editor

Professor Philip M. Parker, PhD (Wharton) is the INSEAD Chaired Professor of Management Science and teaches INSEAD’s MBA, executive, and PhD courses on artificial intelligence and machine learning. He has also taught as a visiting Professor at MIT, Harvard University, Stanford University, UCSD, and UCLA. He pioneered the use of algorithms to generate original content, across a variety of formats, and received a patent for his approach in 2007. His work has been presented at numerous public forums (G8, White House, Davos, TEDx, etc.) and covered extensively in the press (Huffington Post, New York Times, Singularity Hub, etc.). Parker has worked with numerous multinational companies and consulting firms to develop implementation-oriented programs and projects, including McKinsey & Company, PWC, SAP, Google, Jardine Matheson, Tata Group, Citibank, Ericsson, ABB, Thomson Corporation, and a number of large financial and technology firms, to name a few.

Excerpt

1. The Problem

Of the issues mentioned above (e.g., diversity, equity and inclusion), this report focuses solely on diversity. One of the problems in discussing diversity is accepting appropriate input metrics and/or criteria upon which metrics are established. Legal considerations vary dramatically from one country to another when considering the types of data that can be collected on employees or that might be used to establish diversity metrics (e.g., it may be strictly forbidden to keep records on an employee’s gender identity, race, ethnicity, religious beliefs, or political affiliations). For the purposes of this report, we recommend an approach that can be replicated, and, in some cases, improved upon, should a company wish to fine-tune the benchmarks presented here. As such, this report should be seen as a starting point, or a first phase in developing diversity metrics and benchmarks for SAS Institute. The advantage of the approach taken here is that it relies only on public information, preserves individual privacy, and has been shown to be rigorous across a number of academic disciplines, especially in the natural and medical sciences.

2. Caveat – Priors

There is one very important note of caution when interpreting the statistics presented in this report. All of the data presented should be interpreted as “priors” based on publicly available information. According to HandWiki.org, the world's largest wiki-style encyclopedia dedicated to science, technology and computing: "In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account."

The statistics in this report should be interpreted as prior probability distributions, not as final measurements. In the field of diversity metrics, there can never be full information. One may never be able to know, with certainty, a person’s religious beliefs, less so for everyone in a firm. However, this information asymmetry does not preclude us from developing actionable or relatively accurate priors. Intuitively, one might ask “How many arms per capita do Singaporeans have?” The only way to know for sure is to perform a census of all Singaporeans and summarize the counts. Of course, one need not do so. Rather, we might form an opinion, based on public observations, that the likely number is roughly 2.0. Clearly this is a pretty good prior (i.e., everyone has 2 arms). However, we also know (via intuition) that there are more people with fewer than two arms than with more than two arms. Looking for more public information, we might discover a study that indicates (e.g., for the United States; https://www.ishn.com/articles/97844-statistics-on-hand-and-arm-loss) that a certain number of persons are born with fewer than two arms, and others have amputations due to accidents and/or medical conditions. Taking these additional cases into account, we come to a number more like 1.99999 arms per capita for Singaporeans. In other words, the prior probability distribution is something like 99.9 percent of Singaporeans have 2 arms each, while the remainder have fewer than 2. Again, this is just a prior: an informed calculation that can be updated when more information is available. The updated estimate can in turn serve as the prior in a third, future study, and so on. Estimates improve over time as more data are collected. This report is a “first pass” at generating priors on diversity metrics for SAS Institute. Since the data reported here are priors, a more in-depth study (e.g., using internal corporate records) can better estimate and/or update the statistics given in this report.
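
The idea of updating a prior as evidence arrives can be illustrated with a small sketch. The report does not state which updating rule it uses; the example below assumes a conjugate Beta-Binomial update, a common textbook choice, and all counts are hypothetical values chosen only for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate Beta-Binomial update: return the posterior (alpha, beta)."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: belief worth 1,000 observations, 999 of them with the trait.
prior = (999.0, 1.0)

# Hypothetical follow-up study: 4,990 persons with the trait, 10 without.
posterior = update_beta(*prior, successes=4990, failures=10)

# The posterior can itself be used as the prior for a third study, and so on.
print(beta_mean(*prior), beta_mean(*posterior))
```

The point of the sketch is only the mechanics: each study shifts the estimate in proportion to how much evidence it contributes relative to what was already believed.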

3. Criteria

Diversity can be measured in a number of ways. Traditional metrics include job function percentages (e.g., X% are employed in white collar jobs, Y% are employed in blue collar jobs), salaries, years of tenure, industry background, etc. In this study we report priors covering benchmarks of greater concern to DEI initiatives, including gender assigned at birth, gender identity, sexual orientation, geographic diversity, racial diversity, age diversity, height diversity, weight diversity, body mass index diversity, eye color diversity, hair color diversity, religious diversity and political diversity. These criteria were chosen for this study because they receive attention in the literature on DEI initiatives (in some cases, regulatory bodies are requesting such metrics in public disclosures to shareholders, mostly for employees in visible or senior functions).

A central point is to not emphasize absolute measures of diversity, but rather how SAS Institute compares to peers (well-known employers or those competing in similar sectors of the economy). It is this measure that is the most relevant to strategic planning exercises. For example, a firm may be found to have low diversity due to the preferences of men or women to fill certain positions within an industry. However, that same firm can be found to have very high diversity when compared to peer benchmarks. It may be the latter benchmark that is of most concern to stakeholders.

Again, these relative benchmarks are priors. Custom, in-depth, follow-up studies can be conducted to improve upon the metrics reported here. This report, therefore, has an additional pedagogical objective. The idea is to present a methodology that can be scaled without having to conduct individual surveys and can generate results that preserve the privacy of employees.

4. Methodology: Privacy & Replicability

Many countries (e.g., Germany) have very strict laws on confidentiality. In some countries, it is forbidden to employ, make offers of employment, or conduct employee search based on any criteria relating to race/ethnicity, gender or individual characteristics. Further, in some jurisdictions, gathering, storing or transmitting statistics that cover any aspects of an employee’s race, religion, political orientation, gender, gender identity or sexual orientation is forbidden. Such constraints have proven extremely onerous on organizations and researchers who know that such data are essential to understanding the impact of DEI or similar initiatives on measured outcomes. For some disciplines, knowing an individual’s background is critical. For example, a patient being an Ashkenazi Jew can have a correlation to disease prevalence (e.g., Bloom syndrome, Canavan disease, cystic fibrosis, familial dysautonomia, Fanconi anemia, Gaucher disease, mucolipidosis type IV, Niemann-Pick disease type A, etc.). Similarly, the impact of medical screenings or treatments (especially dosage amounts) for various diseases, including cancer, can systematically vary from one person to another based on race, gender, gender identity, or sexual orientation.

How, then, do medical researchers make scientific inferences and recommendations on these criteria without recording patient data on race, religion or gender? Recently, researchers have used natural language processing, as we do in this study. We are especially cautious in that we do not use any data supplied by SAS Institute or their peers and competitors. We have not reported or used granular information on specific individuals, nor do we need to know anything about the employees other than their self-reported (publicly known) affiliation with SAS Institute.

Our methodology is similar in nature to work by Mazieres and Roth (2018) in their study “Large-scale diversity estimation through surname origin inference” (https://hal.archives-ouvertes.fr/hal-01766665/document). Please see the references in that paper for supporting research. The basic idea is that one simply needs a list and frequency of surnames and given names within an organization, and this will be strongly correlated with other diversity metrics – again, as priors for Bayesian inference. For example, there is greater than 99.98% probability that a person is of African descent (and probably lives in Togo) if their surname is Kougbenya. There is a 24.79% probability that someone with the surname Parker is of African descent (if living in the United States), and there is a 0.27% probability that someone with the surname Merkel is of African descent (irrespective of where they live). One can make similar inferences for religion (e.g., there is a 79.706 percent probability that someone with the surname Mohamed has Islam as their religious faith). Similarly, the literature has shown that given names can be used to estimate priors for metrics relating to gender and sexual orientation. For example, a person with the given name of Cindy has a 99.7% probability of being recorded as a biological female at birth; someone with the given name of Terry has an 18.6% probability of being recorded as a biological female at birth. Similar metrics exist for geographic diversity (e.g., persons with the last name Swänson are likely to be from Sweden or have ancestors from Sweden if they currently reside elsewhere). Of course, people can intermarry and have mixed ancestry. The law of large numbers accommodates this, again thinking of benchmarking relative to the peers of SAS Institute.
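
As a rough illustration of how name-level probabilities could be rolled up into an organization-level prior, the sketch below averages per-name distributions over a roster. This is an assumption about the aggregation step, not the report's documented procedure. The Cindy and Terry female probabilities come from the text above; the male complements, the roster, and the function name `aggregate_prior` are hypothetical.

```python
from collections import Counter


def aggregate_prior(names, name_priors):
    """Average per-name category probabilities over a roster of names.

    Names missing from the lookup are skipped and counted separately.
    Returns (organization-level distribution, number of skipped names).
    """
    totals = Counter()
    used = 0
    skipped = 0
    for name in names:
        dist = name_priors.get(name.lower())
        if dist is None:
            skipped += 1
            continue
        used += 1
        for category, probability in dist.items():
            totals[category] += probability
    if used == 0:
        return {}, skipped
    return {category: total / used for category, total in totals.items()}, skipped


# Female probabilities from the text; male values assume a binary complement.
name_priors = {
    "cindy": {"female": 0.997, "male": 0.003},
    "terry": {"female": 0.186, "male": 0.814},
}

# Hypothetical roster; "Zed" is deliberately absent from the lookup.
roster = ["Cindy", "Terry", "Terry", "Zed"]
org_prior, skipped = aggregate_prior(roster, name_priors)
print(org_prior, skipped)
```

With only a handful of names the result is noisy; the argument in the text is that such noise averages out as the roster grows.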

In a similar fashion to the analogy of “arms per capita in Singapore”, the above “textual” inferencing has been successful in the medical and physical sciences and proves remarkably robust to organizations with many employees (e.g., the given names collected from an “all-male company” benchmarks to having low gender diversity; the prior is that there are 0% females employed in the organization).

5. Diversity Metrics

We use a variety of diversity outcome measures. This topic finds its origins in the natural sciences (e.g., species diversity) and economics (market concentration metrics). For each diversity criterion, a table is presented with the prior probability distributions across outcomes. For example, the following table illustrates such priors for the sexual orientations for a sample firm:

Sexual orientation | %
Heterosexual always | 91.47
Heterosexual mostly | 5.12
Homosexual mostly | 1.72
Bisexual | 0.59
Asexual / other | 0.59
Homosexual always | 0.5

For the same firm, one can calculate the following well known metrics of diversity:

Diversity metric | Definition | Value
Herfindahl–Hirschman Index | Sum of the squared percentages | 0.84
Normalized Herfindahl Index | Transformation of HHI between 0 and 1 | 0.8
Shannon's Diversity Index | Proportional abundances of the types (negative sum of %*ln(%)) | 0.39
Richness R | Number of slices in the pie (nonzero entries) | 6
CR1 - Berger–Parker Index | Largest percentage | 0.91
CR2 - Concentration Ratio | Sum of 2 largest percentages | 0.97
CR3 - Concentration Ratio | Sum of 3 largest percentages | 0.98
CR4 - Concentration Ratio | Sum of 4 largest percentages | 0.99
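
The metrics above can be recomputed from the sample firm's percentages with a few lines of code. The sketch below is an illustration rather than the report's implementation. It assumes percentages are converted to fractions by dividing by 100, and that the normalized index is (HHI - 1/N) / (1 - 1/N) with N the number of nonzero categories; the report does not spell out either choice.

```python
import math


def diversity_metrics(percentages):
    """Compute concentration and diversity metrics from category percentages."""
    # Convert to fractional shares, drop empty categories, largest first.
    shares = sorted((p / 100.0 for p in percentages if p > 0), reverse=True)
    n = len(shares)
    hhi = sum(s * s for s in shares)
    # Assumed normalization: rescales HHI from [1/N, 1] to [0, 1].
    normalized = (hhi - 1.0 / n) / (1.0 - 1.0 / n) if n > 1 else 1.0
    shannon = -sum(s * math.log(s) for s in shares)
    return {
        "hhi": hhi,
        "normalized_hhi": normalized,
        "shannon": shannon,
        "richness": n,
        # CR1 (Berger-Parker) through CR4: cumulative sums of the largest shares.
        "cr": [sum(shares[:k]) for k in range(1, 5)],
    }


# Percentages from the sample firm's orientation table.
sample = [91.47, 5.12, 1.72, 0.59, 0.59, 0.5]
metrics = diversity_metrics(sample)
print({k: v for k, v in metrics.items()})
```

Under these assumptions the sketch agrees with the tabulated values at the precision shown, which suggests the table follows the same conventions.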

In addition, each firm is benchmarked, on a percentile basis vis-à-vis peer firms, for these metrics. It is interesting to note the zero-sum nature of diversity metrics in their aggregate. Increasing one category to improve diversity, by definition, can decrease another; that other category might be one that is desirable to increase – also for reasons of diversity. Similarly, increasing diversity in one organization may come at the expense of reducing diversity in other organizations who might also seek to increase their diversity. Diversity in the aggregate across firms or organizations may remain unchanged.
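
A percentile benchmark of this kind can be sketched as follows. The report does not define its percentile convention, so the example assumes "percent of peers at or below the firm's value"; the peer values are made-up toy numbers, not results from the study.

```python
def percentile_rank(value, peer_values):
    """Percent of peer values that are less than or equal to `value`."""
    if not peer_values:
        raise ValueError("peer_values must not be empty")
    at_or_below = sum(1 for v in peer_values if v <= value)
    return 100.0 * at_or_below / len(peer_values)


# Toy Shannon index values for five hypothetical peers (including the focal firm).
peer_shannon = [0.20, 0.31, 0.39, 0.45, 0.52]
focal_rank = percentile_rank(0.39, peer_shannon)
print(focal_rank)
```

Other conventions (strictly below, or midpoint handling of ties) would shift the numbers slightly without changing the ordering of firms.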

6. Peer Selection

This study focuses on the employee as a unit of observation. They have, of course, many employment options including being a homemaker, being self-employed or working for SAS Institute or one of its competitors. Because the labor market is rather fluid, employees have the option to work for firms that might be well known, even if they do not compete with SAS Institute. For this reason, benchmark companies are selected based on three criteria. First, common occupation choices are included (e.g., self-employed, teacher or homemaker). Second, large employers that do not directly compete with SAS Institute are included since these are the names most recognizable to employees. Finally, companies that have been independently identified as competing with SAS Institute are included. For this last group of benchmarks, an unsupervised machine learning technique, called asymmetric multidimensional scaling (MDS), is used to identify companies which are most likely to be reasonable benchmarks. The core input data to this method are the names of “competitors” (e.g., company names, brands, etc.) that have previously or currently been active in the same product markets. These activities are determined using text mining of public documents, such as market research studies or public regulatory filings which require the disclosure of competitors. It should be noted that companies that no longer exist may be included as these may nevertheless yield insights into earlier dynamics. Likewise, a brand (e.g., Google) may appear as a benchmark, in contrast to the related company (e.g., Alphabet, Inc.). This provides a richer set of benchmarks to consider.

7. Samples

The report uses a number of customized crawlers to identify publicly visible employees of SAS Institute. Many of these might be considered to have a strong influence on management or leadership. Employee names can be found across companies from a number of sources, including public filings (e.g., regulatory, patents), social media (e.g., LinkedIn, Twitter, researchgate.net, Google Scholar), or career portals that allow the posting of resumes.

In the case of SAS Institute, the following is a sample of surnames:

Aadnesen, Abul-hajj, Adams, Alexander, Allemang, Arangala, Arnold, Baker, Baldauff, Ballard, Barber, Barton, Battala, Belmaggio, Berry, Bisen, Blackmon, Blakeley, Board, Borchardt, Boswell, Boyd, Boyle, Brannock, Brousseau, Brown, Bruckstein, Bruno, Bulk, Burke, Burniston, Bush, Calandrino, Carlton, Carr, Carter, Carville, Casey, Castelloe, Chabot, Chen, Chiao, Choi, Christie, Cline, Cohen, Coleman, Coles, Connelly, Conner, Coppotelli, Cotter, Cragen, Craib, Cumbee, Cybrynski, Defelice, Destephano, Diamond, Dietz, Dillman, Disantostefano, Dolson, Doninger, Dorney, Dotson, Doudt, Downes, Dremann, Eastwood, Eisner, Elhertani, Elmenhurst, Elnaccash, Ewing, Faenza, Feldman, Fitzgerald, Fried, Fulk, Furlow, Fury, Gaines, Gambucci, Garcowski, Garza, Ghattamaneni, Gill, Gjestvang-lucky, Gomez, Goodwin, Gordon, Gottimukkala, Gramm, Graves, Gray, Gregg, Grubaugh, Guan, Guarnaccia, Guscott-schultz, Gutschick, Hahl, Harvey, Hebrank, Heda, Heffernan, Helmkamp, Henderson, Hendrie, Herman, Hess, Hession, Hext, Hicks, Hikl, Hill, Holly, Holmgrain, Hopper, Horwitz, Houston, Hristov, Hsieh, Hsueh, Hull, Hunt, Hutchens, Hynd, Jensen, John, Jordan, Joshi, Juvvadi, Kalat, Kamalakanthan, Kapler, Karmakar, Kelly, Kharva, Khatib, Kilburn, Klenz, Kocher, Kohlmayr, Kraftsow, Kraus, Kuipers, Kulkarni, Lampert, Lankhaar, Larson, Leeman-munk, Leisner, Levey, Lievense, Ligtenberg, Lincoln, Liput, Lockhart, Lodge, Lohse, Losinger, Lourduraj, Lyne, Magee, Mahmood, Mamorbor, Mangum, Marcantonio, Marthinsen, Martin, Maughn, Mcdaniel, Mcelmurry, Mcguirk, Mclaughlin, Mclaurin, Mclester, Meadows, Memory, Miethe, Milavetz, Miller, Milley, Moore, Mountain, Muhlada, Murthy, Musacchia, Musolino, Musser, Myers, Myxter-iino, Nagae, Nargi, Nichols, Norton, Oberle, Obrien, Olinger, Overton, Owens, Page, Parham, Parker, Parks, Parsons, Patch, Patel, Peace, Pedraza, Pegoraro, Perkinson, Pianko, Pietrucki, Poole, Pratt, Prieb, Puertolas, Ragey, Ramage, Ranajee, Reagan, Redford, Reeves, Rickenbrode, Roberts, 
Robinson, Rossnagel, Rostovtseva, Roth, Sabourin, Saccoccio, Sakowski, Salci, Sall, Saravanja, Sarella, Schnurman, Schoaff, Scott, Seavey, Seldin, Shah, Shekton, Shelton, Sherock, Shirzad, Singleton, Sloane, Sofarelli, Sourirajan, Sourisak, Spann, Sparks, Spears, Spikes, Stallmann, Stockett, Stump, Sullivan, Sunchu, Tamayo, Tamburro, Tanzini, Tareen, Tate, Temares, Thacher, Tharp, Thompson, Tinney, Tomski, Trawinski, Underberg, Valsaraj, Vandusen, Vantland, Veress, Vericker, Vezzetti, Vilker, Wagoner, Walkee, Wallenberger, Warner, Watkins, Weathers, Weigandt, West, Whitaker, Wikstrom, Wilkie, Williams, Wolf, Xiao, Yang, Zaromb, Zhou.

In the case of given names, the following is a sample of names associated with SAS Institute:

Adam, Adolfo, Akkina, Alan, Alice, Alycia, Amellaly, Anand, Andres, Angela, Anna, Anne, Anthony, Arati, Asres, Atanu, Audrey, Austin, Avriel, Barbara, Barry, Bassel, Basselhajj, Beau, Benhao, Bernie, Beth, Bill, Birgitte, Blake, Blithe, Bobbie, Brandon, Brenda, Brian, Bryan, Cameron, Carla, Carl-philip, Carol, Casey, Catherine, Cathy, Chad, Chandana, Charles, Chris, Chrisopher, Christine, Christopher, Clark, Clarke, Colleen, Connie, Cynthia, Dana, Daniel, Darrell, Darren, Daud, Dave, Davetta, David, Deanna, Deborah, Debra, Dedde, Deva, Devi, Diana, Diane, Dianne, Dinaker, Dragos, Dudley, Earl, Eddie, Edward, Edwin, Eileen, Elizabeth, Ellis, Emad, Emel, Emily, Eric, Erica, Erin, Evelyne, Fang, Farzad, Franciscus, Frank, Freddy, Frederick, Fruzsina, Gabriel, Gale, Gary, Gearge, Gerardette, Gitte, Gjon, Glen, Glenn, Grace, Graciano, Graeme, Gregory, Halil, Hannah, Heather, Heidi, Henry, Holly, Howard, Jack, Jackie, Jacob, James, Janie, Jared, Jeanne, Jeff, Jeffery, Jeffrey, Jennifer, Jensen, Jerry, Jerzy, Jesse, Jodi, Jody, Joel, John, Johnaton, Jolene, Jordan, Joseph, Juan, Judy, Julia, Julie, June, Justin, Kannan, Karen, Kate, Katherine, Kathleen, Kathryn, Keith, Kenji, Kenneth, Kerrie, Kevin, Kimberly, Lanchien, Larnell, Larry, Laura, Laurel, Lawrence, Leah, Leatrice, Leilani, Lemuel, Lewis, Lillian, Lily, Linda, Lisa, Lise, Lori, Louise, Lynn, Madison, Malcolm, Manuel, Marc, Margaret, Marie, Mark, Marroy, Marry, Martha, Marty, Mary, Marya, Matt, Maureen, Mauro, Meijian, Melanie, Melinda, Michelle, Mike, Mildred, Minda, Minhyo, Minni, Mohammad, Monica, Murugiah, Nabaruna, Nabil, Nancy, Natalia, Natalie, Nathan, Naveen, Neely, Neil, Nicholas, Nick, Nicole, Norm, Ognian, Oita, Padraic, Paloma, Patrice, Patricia, Paul, Phil, Philip, Philippe, Phillip, Poornachandr, Poornachandran, Prairie, Pranesh, Prashant, Radhika, Ravi, Rebecca, Reese, Reji, Robin, Rollanda, Ross, Rupinder, Russ, Ruth, Ryan, Samuel, Sandra, Sanjay, Sara, Sarah, Sassan, Scott, Shahrzad, Shannon, 
Sharon, Sheila, Sherrine, Simon, Sridhar, Sriram, Stan, Stephan, Stephanie, Stephen, Steve, Steven, Sunil, Susan, Susanna, Suzanne, Tamara, Tamisa, Tammy, Tarek, Teala, Tejas, Timothy, Tina, Toni, Traci, Tracy, Troy, Tyler, Varunraj, Vernon, Vicci, Vicki, Vijay, Virginia, Walter, Wayne, Weiling, Will, William, Willis, Xiangqian, Xiaohui, Xunlei, Yado, Yongqiao.

These names are then evaluated against massive databases (e.g., over 300,000,000 named individuals and billions of observations) that have been classified by gender assigned at birth, geographic distribution, religious beliefs, political affiliations (e.g., public disclosures of donations to political parties or causes) and various metrics of racial and ethnic ancestries. These are often estimated using national statistical distributions reported for each country represented in the sample (e.g., persons having ancestry from a country are assumed to likely have the religious beliefs from that country, in roughly the same proportions as the general population from that country - e.g., most persons from Syria self-report having Islam as their religion).

8. Extrapolations

In many cases, extrapolations are made to generate priors associated with gender identity, sexual orientation, age, weight, hair color, eye color and so on. For each of these, broad statistical averages are used to estimate prior probability distributions. In the extreme, for example, if the firm has 100% women (assigned female at birth), then the percent of gay (non-transsexual) men in the firm is likely to be close to 0%. While estimates used in the extrapolations might be questioned in the absolute, because the method is equally applied across benchmarks, the relative rankings across firms are not affected. In other words, the focus should be on relative rankings (reported as percentiles), and not necessarily the absolute numbers (even though these represent defensible priors). This entails the same philosophy illustrated earlier in our discussion of “arms per capita” calculations.
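
One way to read this extrapolation step is as a chain of probabilities: the share of a group in the firm multiplied by a population-level rate within that group. The sketch below assumes that reading; the conditional rates are placeholder values invented for illustration and are not taken from the report or from any survey.

```python
def extrapolate_joint(group_shares, conditional_rates):
    """Return P(group, trait) = P(group) * P(trait | group) for every pair."""
    joint = {}
    for group, share in group_shares.items():
        for trait, rate in conditional_rates[group].items():
            joint[(group, trait)] = share * rate
    return joint


# Placeholder population-level rates (hypothetical, for illustration only).
rates = {
    "male": {"gay": 0.04, "other": 0.96},
    "female": {"gay": 0.03, "other": 0.97},
}

# Extreme case from the text: a firm with 100% women.
all_female = extrapolate_joint({"male": 0.0, "female": 1.0}, rates)

# A mixed firm, for comparison.
mixed = extrapolate_joint({"male": 0.4, "female": 0.6}, rates)
print(all_female[("male", "gay")], mixed[("male", "gay")])
```

Because the same rates are applied to every benchmark firm, errors in the rates shift absolute values but leave the ordering driven by the group shares.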

9. Example – Gender Identity & Sexual Orientation

To illustrate extrapolation methodologies which are used to calculate priors from larger datasets, the following describes assumptions used in deriving estimates for gender identity and sexual orientation. Similar approaches are used for other metrics. For geography, the report uses country of origin. For racial characteristics, country of ancestry leads to ethnic and racial diversity metrics which themselves affect physical features (skin, hair and eye color). Given names follow age distributions (e.g., names have weighted average ages, or popularity over time, affecting likelihoods of employment at or above minimum ages for employment – i.e., there is a low probability that someone called “Gertrude” is young, versus “Britney”). Geography and gender lead to body weight estimates, etc. The following discussion illustrates this process using gender identity and sexual orientation.

Referring to gender identity, we start with gender assigned at birth (which is highly correlated with gender identity but is not the same construct). This report considers biological gender as being correlated with gender assigned at birth (declared on birth certificates). Using natural language processing (NLP), a branch of text analysis in the field of artificial intelligence, the core concept is to use statistical inferences based on a child’s given name. Given names have known probability distributions across biological genders. Using a proprietary database, created by ICON Group International, approximately 30 million given names are assigned a probability of being either male or female when assigned on birth certificates (which varies across countries). Based on known assignments across billions of individuals across over 200 countries or sovereign territories (e.g., for one country a name can be mostly female, but in others it can be mostly male), the following table illustrates a few examples from that database for a given country:

Given name | Percent Male | Percent Female
Robert | 99.6 | 0.4
Sally | 0.3 | 99.7
Terry | 81.4 | 18.6
Kim | 16.3 | 83.7
Pat | 40.0 | 60.0
Odile | 0.0 | 100.0
Didier | 100.0 | 0.0

As observed, some given names are almost always assigned to one gender or another, or can be somewhat ambiguous. Across a sample of given names, however, the aggregate approximates the distribution of gender assigned at birth. Most given names are strongly associated with a single gender (relatively few are ambiguous).

This extrapolation holds even for relatively small samples. As an example, consider Halogen Ventures, a USA-based venture capital firm focusing on investing in women-run businesses. The application of NLP illustrates the types of deviations that might arise. The first names of their management team posted in late 2020, and the associated gender probabilities when limited to the United States, are as follows (see https://halogenvc.com/team):

Given name | Percent Male | Percent Female | Most Likely
JESSE* | 97.2 | 2.8 | Male
ALEXA | 0.2 | 99.8 | Female
ASHLEY | 1.8 | 98.2 | Female
JONATHAN | 99.6 | 0.4 | Male
LINDSEY | 4.7 | 95.3 | Female
REBECCA | 0.3 | 99.7 | Female
SHEILA | 0.3 | 99.7 | Female
SONJA | 0.1 | 99.9 | Female
TAMI | 0.2 | 99.8 | Female
TIM | 99.8 | 0.2 | Male
UMAIMAH* | n/a | n/a | n/a
JONATHAN | 99.6 | 0.4 | Male
Average | 36.7 | 63.3 | Female - oriented
Actual | 25.0 | 75.0 | Female - oriented

Of note are the two records above marked with asterisks (*). In the first case, JESSE is mostly a name for children assigned male at birth. Inspection of the management team at Halogen Ventures, however, indicates that JESSE is female. In addition, one sees the given name UMAIMAH, which is unassigned (i.e., there is an insufficient number of children born with this given name to calculate a meaningful probability). In this case, this observation cannot be used to measure gender diversity assigned at birth. Despite such shortcomings, and this extremely small sample size, the use of given names nevertheless indicates that this firm shows a female orientation (a conclusion confirmed with visual inspection, where 75 percent of leaders are female). When larger samples of names are employed, deviations such as the ones illustrated above are randomly distributed, and aggregate accuracy increases (i.e., errors tend to cancel each other out as the sample sizes grow). In what follows, this study will compare SAS Institute to a number of companies, as benchmarks, for which large sample sizes of given names are available from public sources (approaching or exceeding 1000 observations each in many cases). Given that the same error distributions exist across companies, relative rankings across the firms considered are achieved without bias.
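
The "Average" row of the Halogen Ventures table can be reproduced by averaging the per-name probabilities while skipping the unassigned name. The sketch below uses the values from that table; the function name `average_gender_prior` is hypothetical, and a simple unweighted mean is assumed.

```python
def average_gender_prior(rows):
    """Average (male %, female %) over rows, skipping unassigned names.

    Returns (mean male %, mean female %, number of skipped rows).
    """
    usable = [(m, f) for _, m, f in rows if m is not None and f is not None]
    if not usable:
        raise ValueError("no usable rows")
    n = len(usable)
    mean_male = sum(m for m, _ in usable) / n
    mean_female = sum(f for _, f in usable) / n
    return mean_male, mean_female, len(rows) - n


# (given name, percent male, percent female); None marks an unassigned name.
team = [
    ("JESSE", 97.2, 2.8), ("ALEXA", 0.2, 99.8), ("ASHLEY", 1.8, 98.2),
    ("JONATHAN", 99.6, 0.4), ("LINDSEY", 4.7, 95.3), ("REBECCA", 0.3, 99.7),
    ("SHEILA", 0.3, 99.7), ("SONJA", 0.1, 99.9), ("TAMI", 0.2, 99.8),
    ("TIM", 99.8, 0.2), ("UMAIMAH", None, None), ("JONATHAN", 99.6, 0.4),
]

avg_male, avg_female, skipped_names = average_gender_prior(team)
print(round(avg_male, 1), round(avg_female, 1), skipped_names)
```

The averages match the table's 36.7 and 63.3 when the eleven assigned names are used, with the single misclassified name (JESSE) accounting for much of the gap to the actual 25/75 split.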

Gender identity, as opposed to gender assigned at birth, has received substantial attention in the popular and academic press. In this study, we begin with a general categorization that is most often considered to exist at birth: (1) male, (2) female, and (3) intersex. Birth certificates typically classify gender only as binary. Intersex births, however, do occur, and estimates of their frequency vary.

In 2015, the United Nations Office of the High Commissioner for Human Rights published "Free & Equal Campaign Fact Sheet: Intersex", which states the following:

“Intersex people are born with sex characteristics (including genitals, gonads and chromosome patterns) that do not fit typical binary notions of male or female bodies. Intersex is an umbrella term used to describe a wide range of natural bodily variations. In some cases, intersex traits are visible at birth while in others, they are not apparent until puberty. Some chromosomal intersex variations may not be physically apparent at all.”

As illustrated in the research of Leonard Sax (2002), "How Common is Intersex? A Response to Anne Fausto-Sterling", Journal of Sex Research 39(3): 174–178, estimates of the number of people who are intersex vary depending on a number of conditions or assumptions. As summarized in the article on Intersex (Wikipedia.org), estimates vary from a high of 1.7% of births, to lower...

Table of Contents

  • 1 Executive Summary
  • 2 Methodology
  • 2.1 Background
  • 2.2 The Problem
  • 2.3 Caveat – Priors
  • 2.4 Criteria
  • 2.5 Methodology: Privacy & Replicability
  • 2.6 Diversity Metrics
  • 2.7 Peer Selection
  • 2.8 Samples
  • 2.9 Extrapolations
  • 2.10 Example – Gender Identity & Sexual Orientation
  • 2.11 Geographic Diversity
  • 2.12 Racial Diversity
  • 2.13 BIPOC Diversity
  • 2.14 Religious Diversity
  • 2.15 Age, Height, Weight, Hair, Eye Color Diversity
  • 2.16 Disability Diversity
  • 2.17 Smoking Diversity
  • 2.18 Obesity Diversity
  • 2.19 Political Diversity
  • 2.20 Further Research
  • 3 Gender Diversity, Assigned at Birth
  • 3.1 SAS Institute - Gender Diversity, Assigned at Birth
  • 3.2 Benchmarks - Top Employers
  • 3.3 Benchmarks - Competitors
  • 4 Gender Identity
  • 4.1 SAS Institute - Gender Identity
  • 4.2 Benchmarks - Top Employers
  • 4.3 Benchmarks - Competitors
  • 5 Sexual Orientation
  • 5.1 SAS Institute - Sexual Orientation
  • 5.2 Benchmarks - Top Employers
  • 5.3 Benchmarks - Competitors
  • 6 Pronoun Use (Youth < 25 Years Old)
  • 6.1 SAS Institute - Pronoun Use (Youth < 25 Years Old)
  • 6.2 Benchmarks - Top Employers
  • 6.3 Benchmarks - Competitors
  • 7 Geographic Diversity
  • 7.1 SAS Institute - Geographic Diversity
  • 7.2 Benchmarks - Top Employers
  • 7.3 Benchmarks - Competitors
  • 8 Racial Diversity
  • 8.1 SAS Institute - Racial Diversity
  • 8.2 Benchmarks - Top Employers
  • 8.3 Benchmarks - Competitors
  • 9 BIPOC Diversity
  • 9.1 SAS Institute - BIPOC Diversity
  • 9.2 Benchmarks - Top Employers
  • 9.3 Benchmarks - Competitors
  • 10 Religious Diversity
  • 10.1 SAS Institute - Religious Diversity
  • 10.2 Benchmarks - Top Employers
  • 10.3 Benchmarks - Competitors
  • 11 Age Diversity
  • 11.1 SAS Institute - Age Diversity
  • 11.2 Benchmarks - Top Employers
  • 11.3 Benchmarks - Competitors
  • 12 Height Diversity
  • 12.1 SAS Institute - Height Diversity
  • 12.2 Benchmarks - Top Employers
  • 12.3 Benchmarks - Competitors
  • 13 Weight Diversity
  • 13.1 SAS Institute - Weight Diversity
  • 13.2 Benchmarks - Top Employers
  • 13.3 Benchmarks - Competitors
  • 14 Body Mass Index Diversity
  • 14.1 SAS Institute - Body Mass Index Diversity
  • 14.2 Benchmarks - Top Employers
  • 14.3 Benchmarks - Competitors
  • 15 Eye Color Diversity
  • 15.1 SAS Institute - Eye Color Diversity
  • 15.2 Benchmarks - Top Employers
  • 15.3 Benchmarks - Competitors
  • 16 Hair Color Diversity
  • 16.1 SAS Institute - Hair Color Diversity
  • 16.2 Benchmarks - Top Employers
  • 16.3 Benchmarks - Competitors
  • 17 Disability Diversity
  • 17.1 SAS Institute - Disability Diversity
  • 17.2 Benchmarks - Top Employers
  • 17.3 Benchmarks - Competitors
  • 18 Smoking Diversity
  • 18.1 SAS Institute - Smoking Diversity
  • 18.2 Benchmarks - Top Employers
  • 18.3 Benchmarks - Competitors
  • 19 Obesity Diversity
  • 19.1 SAS Institute - Obesity Diversity
  • 19.2 Benchmarks - Top Employers
  • 19.3 Benchmarks - Competitors
  • 20 Political Diversity
  • 20.1 SAS Institute - Political Diversity
  • 20.2 Benchmarks - Top Employers
  • 20.3 Benchmarks - Competitors
  • 21 DISCLAIMERS, WARRANTIES, AND USER AGREEMENT PROVISIONS
  • 21.1 DISCLAIMERS & SAFE HARBOR
  • 21.2 ICON GROUP INTERNATIONAL, INC. USER AGREEMENT PROVISIONS