Datasets Powering the Food Industry

Datasets are a key ingredient that feeds data-driven tools for the food system. They are repositories of statistics, images, or other relevant information about a particular subject. In the food system, collections of information on agriculture, food retail, or nutrition, for example, are among the most important reservoirs of data used to develop food- and agriculture-specific apps or machine learning systems.

Because data collection and creation are time- and resource-intensive processes, open datasets that are accessible to the public are among the most popular sources to draw from when building applications.

If you’ve ever wondered what datasets are powering AI tools for food production, distribution, and consumption, here’s a sampling of some of the most comprehensive and popular open datasets related to the food system that can be found on GitHub, Kaggle, or Google’s Dataset Search.

Agriculture and Food Production Datasets

ImageNet is a massive dataset of over 14 million images culled from the internet, including major subsets classified under plant, vegetable, and food and beverage categories. The original brainchild of Dr. Fei-Fei Li, this compilation of images has been used by researchers for more than a decade.

The Inter-university Consortium for Political and Social Research created the United States Agriculture Data, 1840 – 2012 dataset by compiling statistics from the United States Census Bureau and the United States Department of Agriculture, including “data about the number, types, output, and prices of various agricultural products” for the past 170+ years. 

PlantVillage’s dataset includes about 55,000 images of more than a dozen crops, labeled to classify a range of plant diseases, from bacteria on bell peppers to raspberry ringspot virus.

Food Distribution Datasets

The United States Food and Nutrition Service published a Summer Food Service Program (SFSP) Mobile Route Maker dataset that compiles data from a publicly funded meal distribution program in order to optimize food delivery routes and serve program participants more efficiently and effectively.

The International Association of Refrigerated Warehouses, a subsidiary of the Global Cold Chain Alliance, indexes hundreds of “Public Refrigerated Warehouses” across the continental United States to improve cold-chain logistics, storage, and warehousing. 

Data Driven Detroit created a repository of grocery stores in the Detroit area to help users identify which ones sell fresh food, a particularly useful dataset given that approximately 33% of the city’s population is food insecure.

Food Consumption Datasets

The USDA National Nutrient Database for Standard Reference provides nutrition information for nearly 8,000 different foods and is identified as one of the most important and comprehensive “sources of food composition data in the United States.” It is also said to “provide the foundation for most food composition databases in the public and private sectors.”

Instacart’s “Online Grocery Dataset” offers a record of 3 million orders collected from the app that can help to inform the future of online grocery shopping and delivery. 

Finally, the Yelp Dataset includes more than 5 million user-generated restaurant and business reviews that can be used for a multitude of applications, including supporting inspections by state and local public health agencies. 
