Abstracts: Big Data & HPS 2023

“Integrating Hermeneutic and Digital Analyses of Past Scientific Worldviews through a Kuhnian Lens: The Case of Biometrika”

Nicola Bertoldi (Université Catholique de Louvain)

Thomas Kuhn’s (2000; 2022) mature inquiries into the nature and dynamics of scientific knowledge arguably dealt with one fundamental problem: gaining a hermeneutic understanding of past scientific worldviews by reconstructing the complex entanglements of beliefs about the world and meanings of key scientific terms that structured them, and by tracing the historical development of those same meaning and belief structures. In particular, Kuhn’s article “Commensurability, Comparability, Communicability” (Kuhn 2000) foregrounds the possibility of addressing precisely this issue by representing the meaning and belief structure specific to a particular scientific worldview as a lexical network. In this ideal network, each node corresponds to a term denoting a given aspect of the phenomenal world, while the edges radiating from that node represent the criteria for identifying the term’s appropriate referents. Such criteria “tie some terms together and distance them from others, thus building a multidimensional structure within the lexicon” (Kuhn 2000, 52). The resulting structure is therefore both a structure of meaning and a structure of belief: the criteria of denotation that determine the network’s edges, and the clustering of nodes that those edges produce, define the meanings of the terms constituting the network and, at the same time, embody beliefs about the structure of the phenomenal world. However, to what extent can Kuhn’s sketched analogy between scientific meaning and belief structures, on the one hand, and lexical networks, on the other, serve as a guideline for integrating hermeneutic and digital analyses of past scientific worldviews?

Our contribution outlines an answer to this question by applying Kuhn’s analogy to a concrete historical case study, namely the window on the history of biometry and statistics provided by the digital archive of the journal Biometrika. Published since 1901, Biometrika constitutes a vantage point for exploring the birth and death of one scientific speciality, “classical” biometry in the vein of Francis Galton, Karl Pearson and W. F. R. Weldon, and the historical development of another, statistics in its modern, post-Pearsonian form. More specifically, we expand on the insights that Bertoldi et al. (under review) have obtained through an LDA-based topic-modelling analysis of a corpus of 5,596 research articles published in Biometrika between 1901 and 2020. By adopting a synchronic, diachronic and author-based topic-modelling approach, Bertoldi et al. have obtained three main results: first, a correlation network of topical clusters that appear to correspond to the conceptual domains of classical biometry and of various statistical sub-specialities; second, clear trends in the statistical overrepresentation of particular topics in different periods of Biometrika’s editorial history; third, a correlation network of (sub-)communities of authors, based on their respective topical profiles, that highlights the tight association of those authors with one or more topics in the corpus. In light of those insights, we discuss possible ways of building a dynamic lexical network for Biometrika’s corpus that approximates Kuhn’s ideal representation of a scientific meaning and belief structure by intertwining distant-reading and close-reading analyses of the evolution of some key concepts – on the model of Bertoldi and Pence’s (forthcoming) work on the history of the biological and statistical concept of “population”.
Furthermore, we assess the possible outcomes of adopting hybrid topic-modelling methodologies, e.g., ontology-based ones (Allahyari et al. 2017).
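Kuhn’s lexical-network analogy can be made concrete even with very simple means. The following Python sketch is offered purely for illustration – it is not the authors’ actual pipeline, and the toy documents and threshold are invented – but it shows how a term co-occurrence network of the relevant kind can be built from a corpus:

```python
from collections import Counter
from itertools import combinations

# Toy documents standing in for Biometrika abstracts (illustrative only).
docs = [
    "correlation of inherited characters in a population sample",
    "regression and correlation in biometric population studies",
    "test of significance for a sample mean under the normal distribution",
    "distribution of the test statistic for small sample sizes",
]

def cooccurrence_network(docs, min_count=2):
    """Build a lexical network: nodes are terms, and weighted edges count
    how often two terms appear in the same document."""
    edges = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))
        for a, b in combinations(terms, 2):
            edges[(a, b)] += 1
    return {pair: w for pair, w in edges.items() if w >= min_count}

network = cooccurrence_network(docs)
# Terms that repeatedly co-occur cluster together, approximating the
# criteria that "tie some terms together and distance them from others".
```

In this toy corpus, “correlation” and “population” end up linked in one cluster and “distribution” and “test” in another, mirroring, in miniature, how topical clusters can be read off a lexical network.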

“Both Qualitative and Quantitative Research on Science are Useful – Especially in a Possible CSHPSSS!”

Marion Blute (University of Toronto)

Most work in the history and philosophy of science focuses on singularities – this piece of work, that researcher’s work, this method, that school of thought, etc. The conceptual analysis that philosophers of science perform can often be scientifically useful. Partially under the influence of the late great philosopher of science David Hull (1935-2010) and the late Werner Callebaut (1952-2014) of the Konrad Lorenz Institute, founder of the journal Biological Theory, at ISHPSSB and elsewhere, most of my own work on evolution – biological, cultural and gene-culture – has been of that type. Social scientists as well as philosophers and historians sometimes do such qualitative work too, but more often their focus is on big data, or perhaps ‘biggish’ data. This is as true of the study of science as it is of the social sciences in general. One difference between the study of science and of other topics in the social sciences is that quantitative work on science is increasingly concentrated in one journal, Scientometrics, while it is spread more widely on other subjects. However, in addition to qualitative work on evolution, across 50 years I have periodically used various bibliometric data to ask and answer questions about science and scholarship. In 1972, while a graduate student and before the Web of Science or its precursors were available, I used data from the Science and Technology Division of the U.S. Library of Congress on the number of scientific and technical serials published by nation to show that, contrary to much talk at the time about the exponential growth of science, the dependence of scientific productivity on economic development is best described by a power function, i.e. the rate of scientific growth is related linearly to the rate of economic growth. Fifty years later, in 2022, I showed using the Web of Science and Google Scholar that Gabriel Tarde is neglected relative to other 19th-century sociologists.
Between these, using various bibliometric data, sometimes with graduate students Saleema Saioud and Paul Armstrong, I showed for example that:

  • postmodernism had had its day, social constructionism had become unstable, and globalization was still growing in 2006;
  • the sociology of science had not been eclipsed by other descriptors such as the sociology of knowledge, the social studies of science or social epistemology in 2010;
  • Donald Campbell’s phrase ‘evolutionary epistemology’ had been more successful than others such as ‘generalized Darwinism’, ‘universal Darwinism’, or ‘variation and selective retention’ as a label for extensions of Darwinism in 2013;
  • 2/3 of the abstracts from a meeting held at the Royal Society in London on “New Trends in Evolutionary Biology: Biological, Philosophic and Social Science Perspectives” tilted towards a new rather than an extended evolutionary synthesis in 2017.

Since both qualitative and quantitative approaches are useful in the study of science, I recently suggested that CSHPS be renamed CSHPSSS, i.e. the Canadian Society for the History, Philosophy and Social Studies of Science. After all, there is the very successful ISHPSSB example!

“Demonstrating the Value of Wikipedia for Understanding the History of Science by Textually Analyzing 97,909 Profiles of Scientists to Identify Factors Related to Impactful Science”

Brett Buttliere (University of Warsaw), Veslava Osinska (Nicolaus Copernicus University), & Adam Kola (Nicolaus Copernicus University)

The paper argues that Wikipedia represents an excellent, universal, and computationally accessible resource for studying the history and philosophy of science. We demonstrate these arguments by collecting the profiles of 97,909 scientists and showing that we can predict how long a page is, and how often it is viewed, from the words used to describe the individual and their work. All analyses are done in R, using packages made available by the Wikimedia Foundation or by researchers building on its open APIs, including ‘WikipediR’, ‘WikipediaR’, ‘WikidataR’, and ‘pageviews’, along with more general analysis packages like ‘psych’, ‘htmltools’, ‘htm2txt’, ‘tibble’, ‘stringi’, ‘stringr’, ‘tidytext’, ‘qdapRegex’, ‘dplyr’, ‘rvest’, ‘jsonlite’, ‘purrr’, ‘data.table’, ‘sjPlot’, ‘emmeans’, and ‘ggplot2’. Specifically, we demonstrate that indicators of conflict – a combination of negative and positive sentiment indicating that some people agree and some disagree – explain approximately 90% of the variance in text length and 45% of the variance in how often a page has been viewed per month since 2015. Subsampling 154 scientists who have been named across 20 lists of the most influential scientists ever suggests that they are exceptional on these metrics, especially in terms of negations. We believe the results demonstrate not only the value of Wikipedia as a data source, but also the need to be open to different ways of thinking, and the potential benefits of moving toward areas of uncertainty and disagreement, with good intentions and data, rather than away from them. This study limited itself to the English versions of pages, accessed at a single time point, but future studies will examine different versions of the same pages and profiles over time, especially as we study how highly viewed profiles and pages are developed – as instances of the development of knowledge about a particular subject.
Wikipedia in general is an open-first organization, with a long history of making its data available to researchers and encouraging them to engage. It is also the largest, most trusted (in terms of access) and most open knowledge base available to scientists today. As for the topic and focus on conflict, it can be summarized by noting that there is little need to do science about things which are clear and which people already agree about. Future studies in this direction will examine more sophisticated text-analysis programs to identify this conflict, and a better conceptualization of the types of conflict that matter most.
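The study’s analyses are done in R; purely as an illustration of the idea of a conflict indicator – high only when positive and negative sentiment co-occur – a minimal Python sketch might look as follows (the tiny word lists and the scoring rule are invented placeholders, not the study’s lexicons or model):

```python
# Invented placeholder lexicons, NOT the sentiment dictionaries used
# in the actual study; illustrative only.
POSITIVE = {"celebrated", "influential", "praised", "groundbreaking"}
NEGATIVE = {"controversial", "criticized", "disputed", "rejected", "not"}

def conflict_score(text):
    """Count positive and negative tokens; conflict is their product,
    so it is high only when BOTH kinds of sentiment are present."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos * neg

calm = "an influential and celebrated physicist"
contested = "a celebrated but controversial theory criticized by many was not rejected"
# Praise alone yields zero conflict; praise plus criticism yields a
# positive score, the pattern the abstract associates with longer,
# more-viewed pages.
```

A multiplicative combination is only one choice; the point is that agreement alone, positive or negative, scores zero, while disagreement does not.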

“Delineating Philosophy of Medicine. A Data-driven Approach”

Vilius Dranseika & Piotr Bystranowski (Jagiellonian University)

Recent discussions tend to describe philosophy of medicine as (a) a branch of philosophy of science and (b) a field that is distinct from medical ethics (e.g., Gifford 2011; Schramme 2017). For instance, Thompson & Upshur (2018, 5) write: “[W]e treat philosophy of medicine as a branch of philosophy of science. Consequently, ethics does not play a large role”. If these characterizations are accurate, there should exist various observable patterns (citation patterns, publishing patterns, patterns in topical composition of papers, patterns in self-identification of practitioners etc.) that could be studied by triangulating various metascience approaches.

In this paper, we attempt to delineate and characterize philosophy of medicine in a data-driven way. While our primary aim is to contribute to the understanding of philosophy of medicine, we also hope that our approach can be used more broadly to study how scientific disciplines are related to neighboring fields. To achieve this, we draw on several different data sources (e.g., full-text corpus of approximately twenty thousand articles from seven leading journals in philosophy of medicine and bioethics, Web of Science citation data, self-selected keywords scholars use to describe their areas of interest on Google Scholar) and apply several different analytic approaches (e.g., topic modelling, community detection algorithms, citation analyses).
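As a minimal illustration of one of the analytic approaches mentioned – community detection on citation data – the core idea can be sketched in a few lines of stdlib Python (a toy example with invented data; real analyses typically use modularity-based algorithms rather than simple thresholding):

```python
from collections import deque

# Toy weighted co-citation graph: edge weights count shared citations.
# Two dense groups joined by one weak tie (illustrative only; the labels
# might stand for philosophy-of-medicine vs. bioethics journals).
weighted_edges = {
    ("a", "b"): 5, ("a", "c"): 4, ("b", "c"): 6,   # group 1
    ("d", "e"): 5, ("d", "f"): 3, ("e", "f"): 4,   # group 2
    ("c", "d"): 1,                                  # weak bridge
}

def communities(weighted_edges, min_weight=2):
    """Drop ties weaker than min_weight, then return the connected
    components of what remains (found by BFS) as candidate communities."""
    graph = {}
    for (u, v), w in weighted_edges.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if w >= min_weight:
            graph[u].add(v)
            graph[v].add(u)
    seen, comps = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(graph[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

comps = communities(weighted_edges)
# The weak bridge is cut, leaving two communities: {a,b,c} and {d,e,f}.
```

Distinguishability of two fields then amounts to the claim that the bridge ties between their journal clusters are weak relative to the ties within each cluster.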

Triangulating several different metascience approaches draws the following picture. While a data-driven characterization of philosophy of medicine reliably captures classical issues associated with a relatively narrow understanding of philosophy of science (e.g., discussions on causality and explanation in medicine, the epistemology of medical diagnosis, concepts of disease and health), it also captures topics that would require a broader notion of philosophy of science (e.g., phenomenological and biopolitical reflections on medicine, the role of religion in medical practice). Furthermore, while philosophy of medicine seems to be distinguishable from medical ethics along several parameters (from association with different sets of journals to differences in the self-identification of practitioners), this distinction is made somewhat problematic by several factors. First, metaethical reflections on bioethics (most notably, the principlism debate) are firmly rooted in philosophy of medicine, thus encouraging us to qualify claims to the effect that “ethics does not play a large role” in philosophy of medicine. Second, philosophical discussions of beginning-of-life issues (most notably, debates on the metaphysical and normative status of embryos) exhibit a pattern that does not align neatly with a clear distinction between philosophy of medicine and bioethics. While these debates have some properties strongly associated with philosophy of medicine (a relatively high rate of citations to philosophy journals, framing in terms of conceptual analysis), they also have properties that are unusual for philosophy of medicine (for instance, being more prominent in bioethics than in philosophy of medicine journals).

KEYNOTE: “Sampling World History. An Overview of Methodological Choices Underpinning the Seshat Global History Databank”

Pieter Francois (University of Oxford)

This paper highlights and contextualizes some of the key research design choices upon which the Seshat Global History Databank project is based. Some of these methodological choices that the project faced will be relevant for other global historical datasets which aim to capture cross-cultural diversity. A special focus will be placed on the sampling scheme underpinning Seshat.

“The Turn of Datacentrism in the Digital Humanities and the Sciences”

Giovanni Galli (Urbino University) & Beatrice Tioli (Modena Institute of Historical Research)

Since the development of big data technologies, there have been continuous and tremendous updates of the methods and platforms used to extract, manipulate and mash up many typologies of data within the frame of the Digital Humanities (DH) and the sciences. Here we focus in particular on two areas: the use of big data in the historical project “Time Machine: Big Data of the Past for the Future of Europe” (Verwayen, Fallon, Gouet-Brunet, & Raines 2019) and the use of big data in scientific practices facing the Covid-19 crisis. The starting point common to the analysis of data use in both domains is the definition of datacentrism in DH and in the sciences. In the first section, we analyse the definitions of dark data and FAIR data in the context of historical datasets and how the Time Machine project deals with these issues. In the second section, we focus on the impact of Covid-19 on scientific practices, presenting some imaginaries of data use sketched by Leonelli (2021a; 2021b) as an analytical tool to picture the data framework at play in the Covid-19 crisis. The concept of imaginaries helps cope with the amount of uncertainty faced by the scientific branches dealing with Covid-19 and reveals some epistemological issues regarding the notion of data. In the third section, we argue for a conception of data alternative to both the representational and the relational views: data have representational properties even though their significance and interpretation can vary according to the scientific contexts in which they are in use. This account is coherent with the view according to which big data and machine learning both influence and are influenced by scientific methodology (Hansen & Quinon 2023), involving continuous conceptual shifts rather than a paradigm change.
In conclusion, we draw out some differences between the epistemological features of conceptual shifts in DH areas of expertise and in scientific enquiry engaged with the Covid-19 crisis.

“Probing Socio-Epistemic Dynamics in High-Energy Physics Using the Inspire HEP Database”

Lucas Gautheron (Bergische Universität Wuppertal)

In this paper, I defend the relevance of big-data methods inspired by ‘middle-range’ theories and conceptual frameworks from the philosophy and sociology of science for analysing historical data about the sciences. To this end, I present results from an original case study concerning the disunity of high-energy physics (HEP), the field of physics dedicated to the fundamental entities of nature and their interactions.

High-energy physics has indeed been subject to an intense division of labor between theorists and experimentalists for decades. The degree of specialisation is such that theorists themselves are roughly divided between pure theorists on the one hand and phenomenologists on the other, the latter being more directly interested in connecting their models to experiments. Yet, despite their methodological, ontological, and linguistic differences, these ‘subcultures’ of high-energy physics (as Galison names them) have still been able to communicate and to coordinate their efforts, at least until the 2000s. In order to explain the sense of unity of physics despite its striking diversity, Galison introduced the concept of ‘trading zones’ – sites where distinct scientific cultures are able to communicate and exchange knowledge despite their differences – which provides a fruitful middle-range theory for addressing the plurality of the sciences.

To present this argument, I will proceed threefold. First, I introduce the database that I use for this quantitative analysis, Inspire HEP, discussing both its advantages and limitations.

Then, I explore the evolution of the relationship between the theoretical and phenomenological subcultures through a quantitative longitudinal analysis of the social and semantic dimensions of the scientific literature between 1980 and 2020. Inspired by Galison’s notion of ‘subcultures’, this analysis reveals the evolution in the magnitude of the divergence between ‘theory’ and ‘phenomenology’, as well as the changes in its linguistic component, by identifying the concepts that have been most specific to each subculture over time. I then evaluate the magnitude of ‘trades’ between these subcultures and locate the concepts that sustain such ‘trades’ over time, by combining citation and semantic data from the literature. This analysis shows that trades between ‘theory’ and ‘phenomenology’ have become less and less frequent over the past twenty years.
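One simple way to operationalize “concepts most specific to each subculture” – offered here only as an illustrative sketch under invented toy data, not as the paper’s actual method – is a smoothed log-odds comparison of term frequencies between the two sub-corpora:

```python
import math
from collections import Counter

# Toy abstracts for two HEP 'subcultures' (illustrative only).
theory = ["string duality and conformal symmetry",
          "duality in supersymmetric gauge theory"]
pheno = ["collider signatures of supersymmetric particles",
         "cross section limits from collider data"]

def specificity(corpus_a, corpus_b):
    """Smoothed log-odds of each term between two corpora: positive
    values mark terms specific to corpus_a, negative to corpus_b."""
    ca = Counter(t for doc in corpus_a for t in doc.split())
    cb = Counter(t for doc in corpus_b for t in doc.split())
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    return {t: math.log((ca[t] + 1) / (na + len(vocab)))
               - math.log((cb[t] + 1) / (nb + len(vocab)))
            for t in vocab}

scores = specificity(theory, pheno)
# 'duality' comes out theory-specific, 'collider' phenomenology-specific.
```

Computed per time slice, such scores yield exactly the kind of longitudinal picture of each subculture’s characteristic vocabulary that the analysis describes.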

Finally, I examine the case of supersymmetry, a symmetry postulated in the early 1970s which has given rise to several prolific research programs within the field. Regarding ‘supersymmetry’ as a ‘boundary object’ (an object to which practitioners may ascribe different meanings and purposes), I apply a topic model trained on HEP abstracts in order to unveil the plurality of the contexts in which this symmetry arises in the literature. I then use the same model to unveil the diverging dynamics of the research programs involving this symmetry – in particular, the recent decline in ‘phenomenological’ supersymmetry research.

This case study emphasises the fragility of the unity of physics, in contrast to reductionist theses about the unity of science. Most importantly, this study shows how quantitative analyses of historical data generated by scientific practice can benefit from concepts and middle-range theories offered by the philosophy of science and the STS tradition.

“What Empirical Network Analysis can do for &HPS: The Case of Model Transfer”

Catherine Herfeld (University of Zurich)

Philosophers and historians of science have recently started to discuss the role of empirical methods for their field. This is a pressing methodological issue, given that formal-mathematical models, experimental tools, ethnographic approaches, and simulation techniques are already accepted in philosophy of science, while historians of science draw increasingly on scientometric methods and tools from the digital humanities. The need to discuss the usefulness of empirical methods is also fostered by the fact that the data relevant to studying the development, social organization, and procedures of science are readily available.

Those tendencies have immediate methodological implications for Integrated History and Philosophy of Science (&HPS) as a field that aims to contribute to philosophy of science while drawing heavily on the history of science. Given that the main method of &HPS is the use of (historical) case studies, the question arises whether, and if so how, research in &HPS can benefit from those methods. This paper aims to advance this debate by discussing the usefulness of one such method, empirical network analysis, for research in &HPS. I thereby focus on its potential for developing a general theory of science with the ambition of simultaneously fulfilling a descriptive and a normative role in studying and reflecting on scientific theory selection and change.

At least since the naturalistic turn in philosophy of science, some philosophers of science have worried that the time of developing general theories about core philosophical issues is over. What has often been labelled ‘armchair philosophy’ addressed questions about theory change in a way that abstracted from how scientists actually come to accept and use a new theory, considering only the rational features of theory change – with limited success. Historians of science have taught us that knowledge production and theory change happen in a historical, social, and cultural context and involve both the original creation of novel theoretical ideas and the diffusion of those ideas within a community of researchers. Although researchers in &HPS question the long-upheld positivist distinction between the context of discovery and the context of justification, they agree that questions about how science progresses, how theories are chosen, or how theories change should still be answered systematically, rigorously, and on the conceptual level.

To balance the relation between specific case studies and abstract philosophical concepts, scholars in &HPS need instruments to carefully yet systematically study when and how, for example, rational and non-rational factors influence theory choice, progress, and scientific change, and to draw justified inferences from case studies. In this talk, I propose that empirical network analysis provides such a systematic approach, one that stays true to the historical details while allowing for concept development, refinement, and analysis. I argue that empirical network analysis is a particularly useful method for research in &HPS because it addresses a set of methodological issues arising from the use of (historical) case studies. More specifically, my claim is that empirical network analysis can enable case studies in &HPS to better fulfil their core functions of concept generation, concept refinement, and empirical justification. It does so by allowing for a systematic iterative process between the concrete level of the case study and the abstract level of a philosophical concept. Discussing the example of model transfer in science in greater detail, I will lay out the advantages of empirical network analysis for developing a general theory of scientific change but caution that it cannot replace more traditional philosophical methods. Rather, I suggest, it must rely on them to fully develop its potential towards such a goal.

“Epigenetic This, Epigenetic That: Comparing Two Digital Humanities Methods for Analyzing a Slippery Scientific Term”

Stefan Linquist & Brady Fullerton (University of Guelph)

We compared two digital humanities methods in the analysis of a contested scientific term. “Epigenetics” is as enigmatic as it is popular. Some authors argue that its meaning has become diluted over time as the term has come to describe a widening range of entities and mechanisms (Haig 2012). Others propose both a Waddingtonian “broad sense” and a mechanistic “narrow sense” definition to capture its various scientific uses (Stotz & Griffiths 2016). We evaluated these proposals by first replicating a recent analysis by Linquist and Fullerton (2021). We analyzed the 1,100 most frequently cited abstracts on epigenetics across four disciplines: proximal biology, biomedicine, general biology, and evolution. Each abstract was coded for its heritability commitments (if any) and functional interpretation. A second study applied LDA topic modelling to the same corpus, thus providing a useful methodological comparison. The two methods converged on a discipline-relative ambiguity. Within disciplines such as biomedicine or molecular biology that focus on proximate mechanisms, “epigenetic(s)” refers to a range of molecular structures while specifying nothing in particular about their heritability. This proximal conception was associated with the functions of gene regulation and disease. In contrast, a second, relatively uncommon sense of “epigenetics” is restricted to a small proportion of evolutionary abstracts. It refers to many of the same molecular structures, but regards them as trans-generationally inherited and associated with adaptive phenotypic plasticity. This finding underscores the benefit of digital tools in complementing traditional conceptual analysis. Philosophers should be cautious not to conflate the relatively uncommon evolutionary sense of epigenetics with the more widely used proximal conception.

“Mapping the Contours of the Emerging Discipline of Astrobiology with Text-Mining Approaches”

Christophe Malaterre & Francis Lareau (Université du Québec à Montréal)

Astrobiology is often defined as the study of the origin, evolution, distribution, and future of life on Earth and in the universe. Its contours, however, have not always been straightforward, and according to some still are not. In the present contribution, we investigate the history of this nascent discipline, from its early days in the late 1960s up to today, by applying text-mining approaches to the complete full-text corpus of its three flagship journals: Origins of Life and Evolution of Biospheres (1968-2020), Astrobiology (2001-2020), and the International Journal of Astrobiology (2002-2020). We notably show how such computational and quantitative methods can help map the different research topics present in this emerging discipline and investigate the significant changes that have occurred over the past four decades or so. We also show how specific underlying groups of authors – which can be called “Hidden Communities of Interest” (HCoIs) – can be retrieved from corpus analyses and used to understand the changing sociological landscape of this scientific endeavor, from the time when it was labeled exobiology and endorsed a strong origins-of-life perspective to the present, characterized by a more space- and planetary-sciences view of the discipline. This research illustrates the type of contribution that text-mining approaches can make to science studies.

“A Computational Complement to Case Studies”

Maximilian Noichl (University of Vienna, University of Bamberg)

How do philosophers of science gain knowledge about science? As Mizrahi (2020) has shown empirically, conducting case studies has become one dominant answer to this question (see also Knuuttila & Loettgers 2016). Many episodes in the history of science come with their own unique features, justifying the use of such a detail-oriented method. This justification runs into trouble, though, when case studies are used to characterize a larger configuration of interacting scientists or institutions, whose parts, embedded in a larger disciplinary context, have lost their uniqueness. Larger structures, like the relative prominence of competing methods, models, or subjects, might get lost or suffer distortions through the biases introduced by the focus on smaller parts of the literature.

As is well documented, the scientific literature grows at a breathtaking pace (see Bornmann & Mutz 2015, Larsen & von Ins 2010). The problem of the representativeness of case studies thus grows worse the closer their subjects come to the present. This has brought onto the scene proponents of computational methods (see e.g. Mizrahi 2020; Pence & Ramsey 2018), who argue that through the automatic mining of large databases, more adequate representations might be found.

In this contribution, we introduce a novel computational way of assessing the degree to which individual case studies can be expected to remain justified even when they are drawn from a large body of literature. This method is meant to complement and assess case studies, not replace them. We do so by providing a technique for measuring the structural similarity between overlapping sub-samples drawn from comprehensive corpora of scientific literature. The general conceptual structure of this approach is as follows. To address the question of the generalizability of case studies, we take many small sub-samples from a large known corpus. Using machine-learning methods, we identify structures in these samples – for example, groups of articles that use similar language, or that refer to the same methodological foundations. We then measure how well the structures expressed by the small samples fit the structures that arise when the whole corpus is considered.
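The fit between sub-sample structure and whole-corpus structure can be illustrated with a pair-counting agreement measure such as the Rand index. The sketch below uses invented toy cluster assignments and should not be taken as the study’s actual similarity measure:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pair-counting agreement between two clusterings of the SAME items:
    the share of item pairs on which the clusterings agree about being
    grouped together or apart."""
    items = sorted(set(labels_a) & set(labels_b))
    agree = total = 0
    for x, y in combinations(items, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b
        total += 1
    return agree / total

# Hypothetical cluster assignments: full-corpus structure vs. the
# structure recovered from a small sub-sample of the same articles.
full = {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 2}
sub  = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}  # p5 not sampled
# Here the sub-sample perfectly preserves the groupings on the shared
# articles, so the index is 1.0; distorted sub-samples score lower.
```

Averaging such scores over many random sub-samples is one way of estimating how faithfully small samples reflect the field-level structure.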

The results allow us to propose a general upper bound on the degree to which we can expect case studies on small samples to faithfully represent a whole field. We can further present answers to the question of whether scientific case studies ought to focus on elite samples of highly cited literature, or whether they can be expected to increase their epistemic value by consciously incorporating marginalized literature. Apart from these general points, the proposed method can also be used to suggest improvements to the scope of individual case studies through interactive exploration of computationally refurbished material.

“History and Philosophy of the Immune Epitope Database”

James Overton (Knocean, Inc.)

As historians and philosophers of science consider the potential benefits and pitfalls of building databases and big data applications for HPS, it is worthwhile to reflect on the history of scientific databases, the challenges they have faced, and the approaches they have used to overcome these challenges. We present a first-hand account of the history of the Immune Epitope Database (IEDB) and how it has participated in the Open Biological and Biomedical Ontologies (OBO) community of open-source scientific database projects.

The IEDB (https://iedb.org) is a freely accessible database funded by the National Institute of Allergy and Infectious Diseases (NIAID) in the United States. An immune epitope is anything that is recognized by the adaptive immune system – often a fragment of a protein from a bacterium or virus that has infected a cell. The IEDB’s mandate is to curate epitopes from all publications on allergy and immunology, including autoimmunity and transplantation. There have been exclusions for immunology of cancer, which is the purview of the National Cancer Institute (NCI), but recently the NCI funded the CEDAR project to extend the IEDB’s approach in this direction. The IEDB employs a team of specialist curators with PhD-level education in immunology, who follow an extensive set of guidelines (publicly available, with public revision history dating back to 2008) as they curate journal articles and data submissions into the database. The IEDB has curated 23,554 publications – more than 95% of all the relevant published literature, from 1960 to the present day.

The IEDB contains millions of rows across dozens of tables, some of which have hundreds of columns. Many of those columns use controlled vocabulary from open-source scientific ontologies. These “domain ontologies” are computational artifacts that include both human-readable labels, synonyms, definitions, and other annotations, and machine-readable logical axioms. The IEDB has been a key contributor to, and user of, many Open Bio Ontologies. OBO is an open community of open-source ontology projects for biology and biomedicine, with a set of shared best practices, principles, tools, and infrastructure. The OBO community includes the Gene Ontology, the Human Disease Ontology, the Ontology for Biomedical Investigations, and a growing list of more than 200 others.

OBO projects keep a full history of changes, so anyone can trace the history of each terminology as it has evolved over ten, twenty, or sometimes thirty years. These open-source projects are developed through online collaboration, and archives of email, issue trackers, version control history, and meeting minutes are often available, free for anyone to access.

For historians and philosophers of science, this offers an embarrassment of riches. Scientists are expressing detailed understanding of their disciplines in databases with open curation guidelines, and ontologies expressed with textual definitions and logical axioms, with version control history tracking every change, and often public records of the discussions behind each change. Any HPS big data project interested in the recent history of science should take account of these riches, and any HPS database project about the more distant past should learn lessons from the past few decades of scientific databases.

“Scientific Disciplines and the Scientonomic Ontology”

Paul Patton & Cyrus Al-Zayadi (University of Toronto)

Scientonomy seeks to construct a standard ontology for the systems of ideas and social institutions that constitute the scientific enterprise. Such a standard ontology will facilitate applying big data approaches to the history of science, by making it possible to create a database of intellectual history and formulate and test hypotheses about general patterns of scientific change. Here we focus on the role of scientific disciplines in our ontology. As we define them, they are a ubiquitous feature of science. Social communities devoted to pursuing particular areas of knowledge emerged in the late eighteenth and nineteenth centuries as the sciences professionalized. But the practice of classifying areas of knowledge is much older and more widespread and serves generally to organize teaching and writing. For this reason, we will treat disciplines first as categories of knowledge, and then, sometimes, as a basis for disciplinary communities.

In our standard ontology, an epistemic agent is an agent capable of taking epistemic stances, like acceptance and rejection, towards epistemic elements like theories or questions. A communal epistemic agent is one that assesses theories by means that make the resulting stance belong distinctively to the community (Patton 2019). The knowledge accepted by such an agent consists of a mosaic of theories and questions, related to one another hierarchically. Questions presuppose theories, and other theories answer those questions. A question Q’ is a subquestion of another question Q if a direct answer to Q’ partially answers Q (Barseghyan 2018; Barseghyan & Levesley 2021; Rawleigh 2018).

A discipline is identified by its core questions. These are questions essential to the discipline. Subquestions of the core questions, and theories which answer the core questions and their subquestions, are contained within a discipline. A discipline consists of a set of core questions, and a delineating theory stating that this set of questions constitutes the core questions of the discipline. A discipline exists if some epistemic agent accepts its delineating theory. A discipline is accepted by an epistemic agent when that agent accepts both its delineating theory and its core questions. The importance of the distinction can be appreciated by considering that the community of modern astronomers accepts the delineating theory that a discipline of astrology exists, but does not accept its core questions (Patton & Al-Zayadi 2021).

Social communities devoted to producing new knowledge in particular disciplines appeared as natural science faculties were established at European universities. Disciplinary communities have a collective intentionality to answer questions contained within their discipline by formulating and assessing theories. They typically form part of a larger community that accepts both the discipline, and the theory that the disciplinary community is expert regarding the questions that discipline contains. They thereby maintain relations of authority delegation with it. When this relationship exists, the larger community accepts whatever theories the disciplinary community accepts as answers to questions contained within the discipline (Patton & Al-Zayadi 2021).

Keynote: “Scientific Disagreement: A Textual Analysis Perspective”

Charles H. Pence (Université Catholique de Louvain)

My talk has two aims. First, I will discuss the uncertainty surrounding concepts in biodiversity and taxonomy, along with approaches developed in the philosophy of biology to resolve or eliminate it. It has long been recognized that disagreement is rampant in taxonomy, and that this disagreement over species inventories has a direct (and a potentially damaging) effect on our understanding of biodiversity, with concomitant worries about practice in conservation biology. Philosophers have offered a number of diagnoses of this state of affairs, including various kinds of fatalism, proposals for standardization, and careful analyses of the roles of social and ethical values in conservation. Second, I will present preliminary work from my group using empirical analyses of the literature in taxonomy with the goal of better understanding the factors that shape and modulate taxonomic disagreement – and thus, we hope, better understanding where this disagreement might negatively affect conservation efforts.

“Computational History of Chemistry: The Expansion of the Chemical Space”

Guillermo Restrepo & Jürgen Jost (Max Planck Institute for Mathematics in the Sciences)

The increasing amounts of data and computing power are turning computational approaches into an integral part of historians’ tools. Beyond providing novel ways to solve historical questions, computational history allows for asking and answering novel questions about large-scale patterns. Chemistry, the science with the largest output of publications – driven by the exponential growth of new substances and reactions – is not short of data. This information is today collected in huge electronic databases, which not only bring this corpus of information to our fingertips but also offer many possibilities for conducting computational analyses that shed light on the history of chemistry and the evolution of chemical knowledge.

Here we summarise our results on the historical unfolding of the chemical space, understood as the collection of chemicals and reactions reported over the years in the scientific literature. Records of how the space is realized through chemical reactions exist in more than 200 years of scientific publications, now available in electronic databases. By analyzing millions of reactions stored in the Reaxys® database, we found that the exploration of the chemical space has been marked by three statistical regimes and shaped by social and scientific factors. The first regime was dominated by volatile inorganic production and ended about 1860, when structural theory ushered in a century of guided production, the organic regime. After 1980 began the least volatile regime, the current organometallic one. We found a stable 4.4% annual growth rate in the production of new compounds, unaffected in the long run either by the World Wars or by the introduction of new theories, although the World Wars did delay production. Moreover, we found that chemists have been conservative in the selection of their starting materials but have been historically motivated to unveil new compounds of the space.
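As a back-of-the-envelope illustration (our arithmetic, not the authors’ code), a constant 4.4% annual growth rate implies that annual production of new compounds doubles roughly every 16 years:

```python
# Doubling time for compound growth at a constant annual rate:
# t = ln(2) / ln(1 + r)
import math

rate = 0.044  # 4.4% annual growth rate reported in the talk
doubling_time = math.log(2) / math.log(1 + rate)
print(round(doubling_time, 1))  # ~16.1 years
```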

“Towards a Database of Intellectual History: Digital Linguistic Strategies for Identifying Theories Accepted in 18th-century England”

Grace Shan (University of Toronto)

From web-based phylogenetic trees to planetarium software that simulates historical configurations of astronomical objects, scientific databases and their user interfaces provide a useful way of retrieving time-dependent snapshots of various catalogued aspects of the world. One of the goals of the scientonomy community is to create one such tool—an interactive database of intellectual history—that would allow snapshots of historical and contemporary scientific worldviews to be retrieved at will according to user-specified dates and locations.

In order for a database of intellectual history to take shape, its contents must be organized according to a well-defined ontology of epistemic entities and relations. The scientonomic ontology has established three key concepts in addition to the given parameters of time and location: the notions of epistemic agent, epistemic element, and epistemic stance. An epistemic agent is an individual or an epistemic community that can take epistemic stances towards epistemic elements. An epistemic element is a question or an answer to a question, i.e. a theory. An epistemic stance is an attitude taken towards an epistemic element; e.g. a theory can be accepted, used, and/or pursued by an agent. Thus, each historical record stored in the database will indicate that agent A took stance S toward epistemic element E at time T (e.g. “The existence of the Higgs boson [element] has been accepted [stance] by the physics community [agent] since 2013 [time]”).
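The record structure described above can be sketched as a minimal data model. This is our illustration, not the project’s actual schema; the type and field names are invented.

```python
# Minimal sketch of one record in a database of intellectual history:
# agent A took stance S toward epistemic element E at time T.
from dataclasses import dataclass
from enum import Enum

class Stance(Enum):
    ACCEPTED = "accepted"
    USED = "used"
    PURSUED = "pursued"

@dataclass
class EpistemicRecord:
    agent: str      # individual or communal epistemic agent
    stance: Stance  # attitude taken toward the element
    element: str    # a question or a theory
    since: int      # year from which the stance is attested

higgs = EpistemicRecord(
    agent="the physics community",
    stance=Stance.ACCEPTED,
    element="the existence of the Higgs boson",
    since=2013,
)
print(f"{higgs.element} has been {higgs.stance.value} "
      f"by {higgs.agent} since {higgs.since}")
```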

In order to populate the database, the primary and secondary historical sources indicative of respective belief systems need to be comprehensively studied. The imperative to sample these documents thoroughly, combined with the bulk of sources that need to be considered for each communal or individual agent, might mean that it will be some time before a substantial database of intellectual history is realized. There may, however, be a way to accelerate this work. Nowadays, large-scale humanistic studies are increasingly enabled by computational methods of processing large quantities of textual sources. One such method is digital corpus analytics, the computer-aided process of identifying linguistic trends within a large collection of texts.

This study is a preliminary investigation into the feasibility of using digitally identified linguistic trends to reconstruct historically accepted epistemic elements. Our pilot project focuses on identifying theories accepted in 18th-century England. It conducts a linguistic analysis of the Royal Society Corpus 6.0, a grammatically tagged corpus of all publications of the Philosophical Transactions of the Royal Society of London from its beginning in 1665 to 1920 (Fischer et al. 2020; Kermes et al. 2016). The study probes the question of whether there are any linguistic trends that could indicate that a certain theory was accepted at some point in the 18th century. The answer is a resounding yes: the study consolidates linguistic indicators of theory acceptance, based on noun-adjective combinations, that can computationally retrieve hundreds of potentially accepted theories from the Royal Society Corpus. This efficiency is arguably a step forward in the construction of a database of intellectual history, but the novel methodology still needs to find its place within scientonomic strategies for interpreting empirical observations and within the broader treatment of the English intellectual landscape, as well as to establish its potential coexistence with other methods of identifying theory acceptance in 18th-century England.
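The kind of retrieval the study describes can be sketched as a scan over a part-of-speech-tagged token stream for adjective–noun bigrams. This is a hypothetical illustration: the tag set, the example phrases, and any resemblance to the study’s actual indicators are our assumptions, not the pilot project’s code.

```python
# Scan a POS-tagged token stream for adjective-noun combinations of the
# sort that might flag accepted theories (phrases invented for illustration).
tagged = [
    ("the", "DET"), ("Newtonian", "ADJ"), ("philosophy", "NOUN"),
    ("explains", "VERB"), ("the", "DET"), ("received", "ADJ"),
    ("doctrine", "NOUN"), ("of", "ADP"), ("gravitation", "NOUN"),
]

def adj_noun_pairs(tokens):
    """Yield (adjective, noun) bigrams from a POS-tagged token list."""
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            yield (w1, w2)

print(list(adj_noun_pairs(tagged)))
# [('Newtonian', 'philosophy'), ('received', 'doctrine')]
```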

Keynote: “Big Data in Religious History”

Rachel A. Spicer (The London School of Economics and Political Science)

The Database of Religious History (DRH) is a large historical database, consisting of both quantitative and qualitative data, launched in 2015. It is composed of a series of polls, each containing a wide range of questions about religion. Academic scholars complete entries by providing data on their area of expertise, in the form of answers to specific questions contained within a poll. The creation of the DRH posed many challenges. In particular, how should the historical record be converted into quantitative data points? The need for flexibility of coding to accurately quantify the historical record had to be balanced with the need for precision in analysis, with consideration given to the needs of experts from across disciplines: history, religious studies, archeology, data science, etc. In this presentation I will discuss how the DRH was designed to accommodate the needs of a diverse group of scholars, the challenges this presents for analysis, and how these challenges have been addressed in recent analyses.

“The Oxymoron of Coding Uncertainty in Big Data Out-of-Context: Insights from Mundane Decisions in Professional Statistical Work”

Samantha Vilkins (Queensland University of Technology)

“In most parts of the world it has long been illegal to die of anything except causes on the official list — although the list of causes is regularly revised.” (Ian Hacking 1991, “How Should We Do A History of Statistics?”)

Critical discussion of the application of big data and computational methods in the humanities is often presented as a tension between the boons of scale, which minimise the influence of individual errors, and the loss of localised and detailed interpretation. This framing leads to problematizations – and therefore solutions – that focus on increasing the resolution of data, attempting to code in nuances and uncertainties, and bringing in maximum inputs for maximum outputs.

A different rendering is suggested: that such a loss of information is itself a positive (as it allows for wider viewpoints), or otherwise that the loss is merely one of the neutral processes of capturing and translating data for new contexts and uses – processes that must be accepted and worked within as limitations, rather than problems to be solved or, worse, ignored. Unfortunately, the seductions of quantification and the authority we grant it only encourage such ignorance. But such a view would lead to different problematizations: the worry of drawing incorrect conclusions from incomplete data becomes a focus not on more data but on limiting conclusions. Questions of ‘can’ become, instead, negotiations over extents of utility and further purpose.

A particularly evident facet of this is the challenge of translating uncertainty across contexts. The problematization of how to ‘code uncertainty’ into data is oxymoronic: uncertainty is wholly antithetical to the precision of coding. Attempts to communicate incomplete data or unknowns – for example, with greyed-out sections of maps – are precise reconstructions of chaos, introduced information. Communicating not just the incompleteness of data as a binary presence or absence, but other detailed conditional characteristics regarding data capture and intention, is part of a reframing from concerns about limited data to the purposeful limiting of conclusions.

For certain subject areas, such considerations of translating uncertainty across many contexts are neither a novel nor a minor concern. However, there is great disparity between areas seen as contested or controversial, where these considerations might take priority, and areas which have settled and let one particular viewpoint overtake others. Such social perspectives, as history tells us, may always be overturned, but the hard-coding and data loss of statistics create a sort of data inertia that is difficult to resist.

Using a theoretical perspective from the sociology of quantification, and drawing on qualitative interviews with professionals working on contemporary public statistics in Australia, I highlight the sorts of challenges and acknowledgements required in holding together such immense scopes in tabulations, the issues of constantly communicating out of context, and new perspectives on how specific and localized ‘context’ can be. On uncertainty specifically, the interviews bring to light conceptualizations of uncertainty for different datasets and domains that are by nature non-computable, antithetical to commensuration. How then do we incorporate these into a growing total datafication project?

I discuss perspectives on solutions, particularly industry-wide regulation approaches from other fields in statistics and politics, as they might be applied to historical interpretation. I conclude with reference both to Wittgenstein’s argument that the potency of mathematics lies in its being useful, rather than necessarily true, and to Desrosières’s argument that the potency of statistics lies in their simultaneous rigidity and flexibility. These reframe concerns for big data projects towards use-above-truth: a focus on the bounds and contexts of multiple contextual utilities over one universal truth hard-coded into data, the way social lives cannot be.

“Datasets: A Narratological and Semiotic Perspective”

Joel West (University of Toronto)

While we use big datasets to train AIs, we may also question the way that these datasets create meaning. The methodology chosen to understand these datasets is based on the semiotics of C. S. Peirce and the narratology of Mieke Bal: Peirce’s semiotics holds that a sign is meaningful to its recipient, while Bal’s narratology examines the manner in which we choose information to become discursive. The relationship between a dataset and the way that it becomes a kind of discourse is discussed on the basis of these two ideas. A dataset and its meaning are shown to be asymptotic to each other. The idea here is that these datasets become not just a kind of discourse but that these discourses are also theory-laden, based on the presumptions of the learner, just as the meaning of a story carries a semiotic meaning.

“Local Contingencies: Modelling ‘Artificial Intelligence’ in Mid-Century America”

Joseph Wilson (University of Toronto)

Yanni Alexander Loukissas’ book All Data Are Local: Thinking Critically in a Data-Driven Society (2019) provides a powerful framework for using data critically when constructing historical narratives in science studies. I apply Loukissas’ framework to a corpus of scientific papers from the ‘golden age’ of artificial intelligence research (from 1956-1976) to uncover some of the contingencies and local attachments that these artifacts retain, despite my ability to render them a seamless corpus of machine-readable text. In his book, Loukissas asks researchers to consider four lessons in data analysis that will be relevant here:

1. That data have complex attachments to place (NB: The syntactic treatment of ‘data’ as a plural reinforces the ‘multifaceted perspective’ Loukissas argues is necessary for robust analysis);

2. That data are collected from heterogeneous sources, each with their own local attachments;

3. That data and the algorithms used to parse them are inextricably entangled; and

4. That the interface(s) used to present data recontextualizes them in subtle but ontologically significant ways.

For a Master’s degree in 2019 I assembled a corpus of 32 academic papers (385,642 tokens) in order to trace how the metaphor of ‘artificial intelligence’ (that is, an explicit comparison of computational processes to those of the human brain) was first used and how it spread. Although I was able to trace the spread of this foundational metaphor throughout this twenty-year period of incredible growth (Wilson 2019; Wilson 2023), there remained some persistent ‘local effects’ that revealed the quirks of a particular writer or a particular school of academic thought. From the insistence of MIT’s Oliver Selfridge that John Milton’s Paradise Lost should be used as a foundational metaphor for the field (casting what we now call ‘neurons’ or ‘perceptrons’ as demons in Milton’s vision of hell), to Newell and Simon’s model of computation, developed at Carnegie Mellon, that used trees as a metaphor for decision-making and growth, local acceptance of the concept of ‘artificial intelligence’ always retained markers of the scientific vernacular used by specific institutions. Moving forward, it is important to retain space in our analyses for both ‘global’ trends and consistencies in data and ‘local’ variables and idiosyncrasies. In this way we can re-centre humans in the study of the History and Philosophy of Science.

“Human Enhancement and Related Topics in Bioethical Discussions: A Computational Approach”

Tomasz Żuradzki (Jagiellonian University in Kraków)

This paper uses topic modeling and citation data to systematically analyze scholarly discussions on ethical and regulatory issues stemming from the direct manipulation of the human genome and other recent developments in genetic engineering.

Although the direct manipulation of the genome of organisms (e.g. plants for agriculture) was embraced by scientists years ago, and discussions on regulatory issues concerning genetic engineering have been vivid since the 1970s (e.g. the Asilomar Conference on Recombinant DNA in 1975), the development of the CRISPR/Cas9 method in 2012 is considered a revolution due to its efficiency and cost-effectiveness. In 2015, CRISPR/Cas9 germline modifications were first used in non-viable human embryos, opening a real possibility of making permanent, heritable changes to the human genome.

These technological developments are related to one of the most central challenges in ethics: shall we care only about the benefits and harms to particular identified people, or also about welfare in the world, which may involve creating “better” people in the future? Some scholars claim that ethical and regulatory issues stemming from genetic engineering are foundational for at least some parts of bioethics; e.g., the editor-in-chief of The American Journal of Bioethics (AJOB) stated in the 100th-anniversary issue of the journal: “Dolly the sheep gave birth to AJOB, that the journal issued from developing embryonic stem cells” (Magnus 2013). A standard manner in which practitioners of an academic discipline reflect upon the development of their field is through “close reading” of selected texts, which is mediated by their personal experience and academic interests. Here is a typical statement based on such an approach: “enhancement is coming to the forefront of bioethical scholarship” since this topic “combines cutting-edge science with mainstream philosophy” (Harris 2012).

The approach we adopt in this paper takes seriously the epistemological question of how one can justify this type of statement. Referring to our previous studies based on a corpus of about 20,000 texts published since 1971 in seven leading journals in the field of bioethics (Bystranowski, Dranseika, & Żuradzki 2022a; 2022b), we use a ‘distant reading’ approach based on topic modeling and Web of Science citation data. We concentrate on the topic we previously interpreted as Enhancement (characterized by the terms ‘enhancement’, ‘enhance’, ‘technology’, ‘intervention’, ‘cognitive’, ‘capacity’, ‘trait’, ‘morally’, ‘improve’, ‘bioenhancement’), which was “the biggest winner” in terms of relative growth in our corpus (an increase in mean prominence from 0.03% in 1971-75 to 0.97% in 2016-20). We also include in our analyses the four correlated topics that most frequently appear together with Enhancement in the same texts: Germline, Ecology, Offspring, and Genetics. We delineate a sub-corpus of papers that “belong” to this five-topic cluster, which we interpret as the core of bioethical discussions on ethical or regulatory challenges stemming from genetic engineering.
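The prominence measure described above can be sketched as follows. This is our minimal illustration, not the authors’ pipeline; the document weights are invented, chosen only to echo the reported 0.03% and 0.97% figures.

```python
# Mean prominence of one topic in five-year windows, given per-document
# topic weights such as those produced by an LDA model (values invented).
import statistics

# (publication year, weight of the Enhancement topic in that document)
docs = [(1972, 0.0002), (1974, 0.0004), (2017, 0.010), (2019, 0.009)]

def mean_prominence(docs, start, end):
    """Average topic weight over documents published in [start, end]."""
    weights = [w for year, w in docs if start <= year <= end]
    return statistics.mean(weights)

early = mean_prominence(docs, 1971, 1975)  # cf. the reported 0.03%
late = mean_prominence(docs, 2016, 2020)   # cf. the reported 0.97%
print(f"relative growth: {late / early:.0f}x")
```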

This enables us to conduct several interesting analyses: Which ethical and regulatory challenges seem the most important for bioethics? How closely does the field follow recent scientific breakthroughs? To which philosophical problems, if any, does bioethics refer while discussing genetic engineering and related topics? The results of our study may be interpreted as undermining the claim that bioethical discussions on these matters “combine cutting-edge science with mainstream philosophy”.