
OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery

14 pages · Published: April 19, 2026

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential in advancing scientific knowledge and addressing complex challenges. In this work, we introduce OmniScience, a specialized large reasoning model for general science, developed through three key components: (1) domain adaptive pretraining on a carefully curated corpus of scientific literature, (2) instruction tuning on a specialized dataset to guide the model on domain-specific tasks, and (3) reasoning-based knowledge distillation through fine-tuning to significantly enhance its ability to generate contextually relevant and logically sound responses. We demonstrate the versatility of OmniScience by developing a battery agent that efficiently ranks molecules as potential electrolyte solvents or additives. Comprehensive evaluations reveal that OmniScience is competitive with state-of-the-art large reasoning models on the GPQA Diamond and domain-specific battery benchmarks, while outperforming all public reasoning and non-reasoning models with similar parameter counts. We further demonstrate via ablation experiments that domain adaptive pretraining and reasoning-based knowledge distillation are critical to attaining our performance levels across benchmarks.
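The abstract outlines a three-stage pipeline (domain adaptive pretraining, instruction tuning, reasoning-based knowledge distillation). The paper's training code is not reproduced on this page, so the sketch below only illustrates how the third stage is commonly framed: supervised fine-tuning of a base model on teacher-generated reasoning traces. The model name, example records, and hyperparameters are placeholders, not the authors' configuration.

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-70B"  # placeholder base model; the keyphrases mention Llama 3.1 70B

# Each record pairs a scientific prompt with a teacher model's reasoning trace and answer.
# The record below is an invented placeholder, not data from the paper.
distill_data = [
    {
        "prompt": "Rank these candidate electrolyte solvents by expected oxidative stability.",
        "response": "<think>Compare HOMO levels and known electrochemical windows...</think> "
                    "Fluorinated carbonates rank highest.",
    },
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often ship without a pad token

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    # Concatenate prompt and reasoning-annotated response, then apply the standard
    # causal-LM objective; padding positions are masked out of the loss.
    texts = [ex["prompt"] + "\n" + ex["response"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(distill_data, batch_size=1, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    loss = model(**batch).loss  # next-token prediction over prompt + reasoning trace
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In practice this stage would run over a curated reasoning corpus (the keyphrases mention the s1K reasoning dataset) with distributed training; the loop above only shows the causal-LM objective applied to prompt-plus-trace sequences.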

Keyphrases: agentic framework, battery research, domain adaptive pretraining (dapt), electrolyte solvent screening, gpqa diamond benchmark, instruction tuning, large language models (llms), llama 3.1 70b, molecular ranking, omniscience, reasoning based knowledge distillation, retrieval augmented generation (rag), s1k reasoning dataset, scientific discovery, scientific literature corpus, scientific reasoning

In: Jernej Masnec, Hamid Reza Karimian, Parisa Kordjamshidi and Yan Li (editors). Proceedings of AI for Accelerated Research Symposium, vol 3, pages 36-49.

BibTeX entry
@inproceedings{AIAS2025:OmniScience_Domain_Specialized_LLM,
  author    = {Vignesh Prabhakar and Md Amirul Islam and Adam Atanas and Yao-Ting Wang and Joah Han and Aastha Jhunjhunwala and Rucha Apte and Robert Clark and Kang Xu and Zihan Wang and Kai Liu},
  title     = {OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery},
  booktitle = {Proceedings of AI for Accelerated Research Symposium},
  editor    = {Jernej Masnec and Hamid Reza Karimian and Parisa Kordjamshidi and Yan Li},
  series    = {EPiC Series in Technology},
  volume    = {3},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2516-2322},
  url       = {/publications/paper/wmdt},
  doi       = {10.29007/5h18},
  pages     = {36-49},
  year      = {2026}}