Stanford’s Oil Production Greenhouse Gas Emissions Estimator (OPGEE) team needed to extract structured data from decades of technical papers in the Oil & Gas sector, each taking researchers 4–5 hours to process manually. With over 30,000 papers in backlog, completing the task by hand would have required 40+ years of manual effort.
57Blocks developed the first domain-specific LLM extraction model for the oil industry, trained before GPT’s public release. Using an ensemble approach, domain-tuned LLMs, and prompt-engineering pipelines, the system automatically extracted and normalized emissions-related data from unstructured academic and industry papers.
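The ensemble idea described above can be sketched in a few lines. This is an illustrative mock, not 57Blocks' actual pipeline: the field names, the mock extractors standing in for domain-tuned LLM calls, and the per-field majority-vote merge are all assumptions for demonstration.

```python
from collections import Counter

# Hypothetical field set; the real OPGEE extraction schema is not shown in the source.
FIELDS = ["api_gravity", "gas_oil_ratio", "water_cut"]

def majority_vote(candidates):
    """Pick the most common non-null value among model outputs."""
    votes = Counter(v for v in candidates if v is not None)
    return votes.most_common(1)[0][0] if votes else None

def ensemble_extract(paper_text, extractors):
    """Run several extractors on one paper and merge per-field by majority vote."""
    outputs = [ex(paper_text) for ex in extractors]
    return {f: majority_vote([o.get(f) for o in outputs]) for f in FIELDS}

# Mock extractors standing in for calls to separately prompted / tuned LLMs.
def extractor_a(text):
    return {"api_gravity": 32.1, "gas_oil_ratio": 850, "water_cut": 0.15}

def extractor_b(text):
    return {"api_gravity": 32.1, "gas_oil_ratio": 850, "water_cut": None}

def extractor_c(text):
    return {"api_gravity": 31.9, "gas_oil_ratio": 850, "water_cut": 0.15}

result = ensemble_extract("…paper text…", [extractor_a, extractor_b, extractor_c])
print(result)  # -> {'api_gravity': 32.1, 'gas_oil_ratio': 850, 'water_cut': 0.15}
```

Voting across independently prompted models is one common way an ensemble can smooth over any single model's extraction errors; a production system would also normalize units and track provenance per field.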
- Reduced extraction time from 4–5 hours to under 10 minutes per paper
- Achieved ~96% accuracy across diverse publication sources
- Lowered extraction cost to less than $1 per paper
- Built scalable pipelines supporting cross-disciplinary research at Stanford
Enabled OPGEE researchers to process tens of thousands of legacy papers efficiently, unlocking decades of emissions data and accelerating climate-impact modeling for the global energy sector.
Read the publication in ScienceDirect →

This research was funded by the Aramco Services Company and the Natural Gas Initiative at Stanford University. We thank the Microsoft Accelerate Foundation Models Research Program and Kenji Takeda for providing Azure OpenAI services and credits. We appreciate 57Blocks and Lastmile.ai for their GitHub and cloud computing infrastructure support. Thanks to Jill Marie O’Nan for writing suggestions, and to Thuy Nguyen and Cerise Burns for managing administrative tasks.