Practical Author Name Disambiguation under Metadata Constraints: A Contrastive Learning Approach for Astronomy Literature


We are excited to present our new paper, "Practical Author Name Disambiguation under Metadata Constraints," where we tackle the critical challenge of correctly linking publications to their authors. Accurately grouping a researcher’s body of work is essential for ensuring proper credit, guiding funding allocations, and informing hiring decisions. However, this task is often hindered by widespread name ambiguity: a common name like "J. Smith" can appear in over 20,000 distinct records in systems like NASA/ADS. Many existing algorithms rely on extensive metadata such as emails or affiliations to resolve this, but these features are frequently missing or inconsistent in large digital libraries. To bridge this gap, we introduce the Neural Author Name Disambiguator (NAND), a scalable method that identifies researchers using only widely available data: names, titles, and abstracts.

Our approach formulates disambiguation as a similarity learning problem, using a Siamese neural network trained with contrastive learning to distinguish between authors. We leverage foundation models: Chars2Vec to handle name variations and SPECTER to capture the semantic content of titles and abstracts. To validate our model, we constructed and released the Large-Scale Physics ORCID-Linked (LSPO) dataset, a new benchmark connecting over 550,000 NASA/ADS publications to unique ORCID identifiers. On this dataset, NAND achieved up to 94% accuracy and an F1-score above 95%, showing that disambiguation can be scalable and reliable even without complete metadata. We are releasing both the model and the dataset to support open science and future development in this area.
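To make the similarity-learning idea concrete, here is a minimal sketch of a pairwise contrastive loss of the kind commonly used with Siamese networks. It is an illustration under assumptions, not the paper's implementation. The function names, the Euclidean distance, the `margin` and `threshold` values, and the use of plain NumPy arrays as stand-ins for the network's output embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, same_author, margin=1.0):
    """Classic pairwise contrastive loss (illustrative, not NAND's exact loss).

    emb_a, emb_b : arrays of shape (n_pairs, dim), the embeddings produced
                   by the two branches of a Siamese network.
    same_author  : array of shape (n_pairs,), 1 if both records belong to
                   the same author, 0 otherwise.
    margin       : assumed hyperparameter; different-author pairs farther
                   apart than this contribute no loss.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    y = np.asarray(same_author, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Same-author pairs are pulled together (penalize any distance).
    pull = y * dist ** 2
    # Different-author pairs are pushed apart until they clear the margin.
    push = (1.0 - y) * np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(pull + push))


def predict_same_author(emb_a, emb_b, threshold=0.5):
    """Label a pair as the same author when its embeddings are close.

    The threshold is an assumed value; in practice it would be tuned
    on held-out labeled pairs.
    """
    diff = np.asarray(emb_a, dtype=float) - np.asarray(emb_b, dtype=float)
    return np.linalg.norm(diff, axis=1) < threshold
```

In a full pipeline, each input embedding would be derived from a record's name, title, and abstract features before being passed through the shared network; here they are simply arrays so that the loss itself stays easy to inspect.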