
# Datasets

Terminology

"Cleaning the ingredients" means removing amounts, comments, etc. from the ingredients. For example, "2 m.-große Tomate(n)" and "400 g Tomate(n) , in Stückchen, frisch oder aus der Dose" are both simplified to the actual ingredient "Tomate".

"Cleaning the steps/instructions" means replacing the shortened form of the ingredient (e.g. "Spargel") with the full, cleaned ingredient (e.g. "Spargel_grün").

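To make the second definition concrete, here is a minimal Python sketch of cleaning a single step. It is not the code used to build the datasets. It assumes that the shortened form of an ingredient is the part before the first underscore of the cleaned ingredient (as in "Spargel" and "Spargel_grün"); the function name `clean_step` is hypothetical.

```python
import re


def clean_step(sentence, cleaned_ingredients):
    """Replace shortened ingredient names with the full cleaned ingredient.

    Assumption: the shortened form is the text before the first underscore.
    """
    for full in cleaned_ingredients:
        short = full.split("_", 1)[0]
        if short == full:
            # No underscore, so there is nothing to expand.
            continue
        # \b treats "_" as a word character, so "Spargel" inside an already
        # cleaned "Spargel_grün" is not matched a second time.
        pattern = r"\b" + re.escape(short) + r"\b"
        sentence = re.sub(pattern, lambda match: full, sentence)
    return sentence


print(clean_step("Den Spargel schälen.", ["Spargel_grün", "Butter"]))
```
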
Full Datasets

A full dataset contains the complete recipes and additional information. Each recipe is structured as follows:

Recipe URL (String)

    image: Image URL (String)

    name: Name of the recipe (String)

    quantity: Amount the recipe makes, e.g. number of portions (String)

    ingredients: List of ingredients the recipe calls for (List of Strings)

    instructions: the instructions on how to make the recipe (String or List of Strings)

    comments: List of user comments about the recipe (List of Strings)
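
As an illustration of the structure above, the following Python sketch loads a dataset and normalizes the instructions field, which can be a String or a List of Strings depending on the file. It assumes (this is not stated above) that each JSON file is a single object mapping the recipe URL to a dictionary with the listed fields. The helper names `load_recipes` and `instructions_as_list` and the sample recipe are made up for illustration.

```python
import json


def load_recipes(path):
    """Load a dataset file; assumed layout: {recipe_url: {field: value, ...}}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def instructions_as_list(recipe):
    """Return the instructions as a list of strings, whichever form is stored."""
    instructions = recipe.get("instructions", [])
    if isinstance(instructions, str):
        # Datasets that store the instructions as one String: wrap it in a list.
        return [instructions] if instructions else []
    return list(instructions)


# Made-up recipe, only to show the assumed shape of one entry.
sample = {
    "https://example.org/recipe/1": {
        "image": "https://example.org/recipe/1.jpg",
        "name": "Tomatensuppe",
        "quantity": "4 Portionen",
        "ingredients": ["Tomate", "Zwiebel"],
        "instructions": "Tomate würfeln. Zwiebel anbraten.",
        "comments": [],
    }
}

for url, recipe in sample.items():
    print(url, recipe["name"], instructions_as_list(recipe))
```

With a real file, the same loop would run over `load_recipes("dataset_fin.json")` instead of `sample`.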

dataset_fin.json: Entire dataset (combined dataset parts) as retrieved from Chefkoch. Instructions of each recipe are one String.

dataset_test.json: A test dataset that only contains a few recipes. This dataset was used to test code, as running it with the entire dataset often takes a while. Ingredients are cleaned. Instructions are separated into sentences (List of Strings).

dataset_cleaned_nice.json: Entire dataset with cleaned ingredients. Instructions of each recipe are one String.

dataset_sep_sentences.json: Entire dataset with cleaned ingredients. Instructions are separated into sentences (List of Strings).

dataset_cleaned_steps.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings).

dataset_cleaned_steps_not_empty.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing "instructions": [].

full_dataset_vers1.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into blocks of multiple sentences with up to 512 tokens, separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing "instructions": [].

full_dataset_vers2.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into blocks of multiple sentences with up to 512 tokens, not separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing "instructions": [].
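
The block layout of full_dataset_vers1.json and full_dataset_vers2.json can be pictured with the following Python sketch. It is only an approximation under stated assumptions: tokens are counted by splitting on whitespace (the tokenizer actually used is not documented here), and [SEP] is assumed to sit between the sentences inside a block. The function name `pack_sentences` is hypothetical.

```python
def pack_sentences(sentences, max_tokens=512, separator=" "):
    """Greedily pack sentences into blocks of at most max_tokens tokens.

    Assumption: a token is a whitespace-separated word. A single sentence
    longer than max_tokens still becomes its own (oversized) block.
    """
    sep_len = len(separator.split())  # 0 for " ", 1 for " [SEP] "
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        extra = sep_len if current else 0
        if current and current_len + extra + n > max_tokens:
            # The block is full: close it and start a new one.
            blocks.append(separator.join(current))
            current, current_len, extra = [], 0, 0
        current.append(sentence)
        current_len += extra + n
    if current:
        blocks.append(separator.join(current))
    return blocks


steps = ["a b c", "d e", "f g h i"]
print(pack_sentences(steps, max_tokens=6, separator=" [SEP] "))  # vers1-like
print(pack_sentences(steps, max_tokens=5, separator=" "))        # vers2-like
```
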

full_dataset_vers3.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences. Recipes without instructions were removed manually by searching for recipes containing "instructions": [].
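
Recipes without instructions were removed by hand for the files above. The same filtering could be done programmatically, for example with the Python sketch below. It assumes that a dataset is a dictionary mapping each recipe URL to its recipe; the function name `drop_recipes_without_instructions` and the sample data are hypothetical.

```python
def drop_recipes_without_instructions(recipes):
    """Keep only recipes whose "instructions" value is non-empty."""
    return {
        url: recipe
        for url, recipe in recipes.items()
        if recipe.get("instructions")
    }


# Made-up data: the second recipe has an empty instruction list.
recipes = {
    "https://example.org/recipe/1": {"instructions": ["Tomate würfeln."]},
    "https://example.org/recipe/2": {"instructions": []},
}

filtered = drop_recipes_without_instructions(recipes)
print(list(filtered))
```
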

## Occurrences

Cleaned Ingredients

In these datasets, each cleaned ingredient is listed exactly once, as a key. The value of each key is the number of occurrences of that ingredient across all recipe ingredient lists.

all_ingredients_nice.json: Contains all cleaned ingredients. Umlauts are written as such.

mult_ingredients_nice.json: Contains all cleaned ingredients that occurred more than 20 times across all ingredient lists (the threshold can be changed). Umlauts are written as such.

mult_ingredients_sorted.json: Contains all cleaned ingredients that occurred more than 20 times across all ingredient lists (the threshold can be changed). Umlauts are written as such. Ingredients are sorted by number of occurrences.
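
A minimal Python sketch of how such occurrence files could be derived from a full dataset is shown below. It is an illustration, not the original script. It assumes a dataset that maps recipe URLs to recipes, a descending sort order, and that "written as such" means umlauts are not escaped in the JSON output (`ensure_ascii=False`). All function names and the sample data are hypothetical.

```python
import json
from collections import Counter


def count_ingredients(recipes):
    """Count how often each cleaned ingredient appears in all ingredient lists."""
    counts = Counter()
    for recipe in recipes.values():
        counts.update(recipe.get("ingredients", []))
    return counts


def frequent_ingredients(counts, min_count=20):
    """Keep ingredients that occur more than min_count times."""
    return {ing: n for ing, n in counts.items() if n > min_count}


def sort_by_count(counts):
    """Sort by number of occurrences (descending order is an assumption)."""
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


# Made-up data; the threshold is lowered to 1 so the tiny example has output.
recipes = {
    "u1": {"ingredients": ["Tomate", "Möhre"]},
    "u2": {"ingredients": ["Tomate"]},
    "u3": {"ingredients": ["Tomate", "Möhre", "Salz"]},
}
counts = count_ingredients(recipes)
ordered = sort_by_count(frequent_ingredients(counts, min_count=1))
print(json.dumps(ordered, ensure_ascii=False))
```
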

Cleaned Steps

cleaned_steps_occurrance.json: Contains all cleaned ingredients and how many times each ingredient occurs in the cleaned steps. This matters for training the model later on, because the model is unlikely to learn an ingredient well if it occurs only a few times in the steps.
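
The counting could look roughly like the Python sketch below. It assumes that the instructions are stored as a List of Strings (as in the cleaned-steps datasets) and that an occurrence is a whole-word match of the cleaned ingredient; the actual matching rule used for this file is not documented here. The function name `count_in_steps` and the sample data are hypothetical.

```python
import re
from collections import Counter


def count_in_steps(recipes, ingredients):
    """Count whole-word occurrences of each ingredient in all cleaned steps."""
    counts = Counter({ing: 0 for ing in ingredients})
    # "_" counts as a word character, so "Spargel" does not match
    # inside "Spargel_grün".
    patterns = {
        ing: re.compile(r"(?<!\w)" + re.escape(ing) + r"(?!\w)")
        for ing in ingredients
    }
    for recipe in recipes.values():
        for sentence in recipe.get("instructions", []):
            for ing, pattern in patterns.items():
                counts[ing] += len(pattern.findall(sentence))
    return counts


# Made-up data for illustration.
recipes = {
    "u1": {"instructions": ["Den Spargel_grün schälen.",
                            "Spargel_grün in Butter braten."]},
    "u2": {"instructions": ["Butter schmelzen."]},
}
step_counts = count_in_steps(recipes, ["Spargel_grün", "Spargel", "Butter", "Tomate"])
print(dict(step_counts))
```

Ingredients with a count of zero or close to zero are the ones for which little training signal exists.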

food_categories directory

These JSON files list all ingredients that could belong to each category. The files are: bread.json, fish.json, meats.json, pasta.json, rolls.json, sausage.json

Instructions Only

cleaned_sep_sentences.json: Contains only the instructions of each recipe. Each set of instructions is separated into sentences, and each step is cleaned. The structure is:

 Recipe URL (String)

    instructions: the instructions on how to make the recipe (List of Strings)

cleaned_sep_sentences_not_empty.json: Same as cleaned_sep_sentences.json, but with recipes that have no instructions removed.
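
For orientation, an instructions-only file could be derived from a full dataset along the lines of this Python sketch. It assumes that both files map the recipe URL to a dictionary of fields; the function name `instructions_only` and the sample data are hypothetical.

```python
def instructions_only(recipes):
    """Reduce each recipe to its "instructions" field, keyed by recipe URL."""
    return {
        url: {"instructions": recipe.get("instructions", [])}
        for url, recipe in recipes.items()
    }


# Made-up recipe for illustration.
full = {
    "https://example.org/recipe/1": {
        "name": "Spargelpfanne",
        "ingredients": ["Spargel_grün", "Butter"],
        "instructions": ["Den Spargel_grün schälen.", "In Butter braten."],
    }
}
reduced = instructions_only(full)
print(reduced)
```
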

Other

ground_truth.json: Ground truth used for evaluation.

synonyms.json: Synonyms of the ingredients found in the ground truth.