Built the First LLM Extraction Model for Oil Industry Research Papers (Pre-GPT Era)
#Text Extraction LLM

Stanford’s Oil Production Greenhouse Gas Emissions Estimator (OPGEE) team needed to extract structured data from decades of technical papers in the Oil & Gas sector, each taking researchers 4–5 hours to process manually. With over 30,000 papers in backlog, completing the task by hand would have required 40+ years of manual effort.

57Blocks developed the first domain-specific LLM extraction model for the oil industry, trained before GPT’s public release. Using an ensemble approach, domain-tuned LLMs, and prompt-engineering pipelines, the system automatically extracted and normalized emissions-related data from unstructured academic and industry papers.
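The case study does not publish implementation details, but the ensemble idea it describes can be sketched as follows: several extractor models each propose a value for a field, the values are normalized, and a field-level majority vote picks the final answer. The field name, the normalization rule, and the mock model outputs below are illustrative assumptions, not taken from the OPGEE system.

```python
from collections import Counter

def normalize_api_gravity(raw: str) -> str:
    """Normalize a raw API-gravity string like '32 deg API' to '32.0'.
    (Hypothetical field and rule, for illustration only.)"""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    return f"{float(digits):.1f}"

def ensemble_vote(candidates: list[str]) -> str:
    """Pick the most common normalized value across model outputs."""
    normalized = [normalize_api_gravity(c) for c in candidates]
    value, _count = Counter(normalized).most_common(1)[0]
    return value

# Three mock outputs for the same field extracted from one paper
outputs = ["32 deg API", "32.0", "API gravity: 32"]
print(ensemble_vote(outputs))  # "32.0"
```

Voting on normalized values rather than raw strings is what lets models that phrase the same answer differently still agree, which is one plausible way an ensemble could reach high accuracy across heterogeneous publication sources.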

- Reduced extraction time from 4–5 hours to under 10 minutes per paper
- Achieved ~96% accuracy across diverse publication sources
- Lowered extraction cost to less than $1 per paper
- Built scalable pipelines supporting cross-disciplinary research at Stanford

Business Impact

Enabled OPGEE researchers to process tens of thousands of legacy papers efficiently, unlocking decades of emissions data and accelerating climate-impact modeling for the global energy sector.

Read the publication on ScienceDirect →

Acknowledgments

This research was funded by the Aramco Services Company and the Natural Gas Initiative at Stanford University. We thank the Microsoft Accelerate Foundation Models Research Program and Kenji Takeda for providing Azure OpenAI services and credits. We appreciate 57Blocks and Lastmile.ai for their GitHub and cloud computing infrastructure support. Thanks to Jill Marie O’Nan for writing suggestions, and to Thuy Nguyen and Cerise Burns for managing administrative tasks.
