# Datasets
Terminology
"Cleaning the ingredients" means removing amounts, comments, etc. from the ingredients. For example, "2 m.-große Tomate(n)" and "400 g Tomate(n) , in Stückchen, frisch oder aus der Dose" are both simplified to the actual ingredient "Tomate".
"Cleaning the steps/instructions" means replacing the shortened form of the ingredient (e.g. "Spargel") with the full, cleaned ingredient (e.g. "Spargel_grün").
Full Datasets
A full dataset contains the complete recipes and additional information. Each recipe is structured as follows:
Recipe URL (String)
image: Image URL (String)
name: Name of the recipe (String)
quantity: Amount the recipe makes, e.g. number of portions (String)
ingredients: List of ingredients the recipe calls for (List of Strings)
instructions: the instructions on how to make the recipe (String or List of Strings)
comments: List of user comments (Strings) about the recipe
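The structure above can be read with Python's standard `json` module. This is a minimal sketch, assuming each dataset file is a JSON object that maps the recipe URL to the recipe's fields; the sample URL and values are made up for illustration, and `instructions_as_list` is a hypothetical helper that covers both documented shapes of `instructions`.

```python
import json
from typing import Any, Dict, List

# Made-up sample that mirrors the documented fields (assumption: URL is the key).
sample = {
    "https://example.org/recipe/1": {
        "image": "https://example.org/recipe/1.jpg",
        "name": "Tomatensuppe",
        "quantity": "4 Portionen",
        "ingredients": ["Tomate", "Zwiebel"],
        "instructions": ["Die Zwiebel würfeln.", "Die Tomate dazugeben."],
        "comments": ["Sehr lecker!"],
    }
}


def instructions_as_list(recipe: Dict[str, Any]) -> List[str]:
    """Return the instructions as a list, whether stored as one String or a List of Strings."""
    instructions = recipe["instructions"]
    return [instructions] if isinstance(instructions, str) else list(instructions)


# Round-trip through JSON text to mimic reading a dataset file from disk.
recipes = json.loads(json.dumps(sample, ensure_ascii=False))
for url, recipe in recipes.items():
    print(url, recipe["name"], len(instructions_as_list(recipe)))
```

To read a real file, replace the round-trip with `json.load(open(path, encoding="utf-8"))`, where `path` points at one of the dataset files.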
dataset_parts: Directory containing the dataset parts that were pulled from chefkoch.
dataset_fin.json: Entire dataset (combined dataset parts) as retrieved from chefkoch. Instructions of each recipe are one String.
dataset_test.json: A test dataset that only contains a few recipes. This dataset was used to test code, as running it with the entire dataset often takes a while. Ingredients are cleaned. Instructions are separated into sentences (List of Strings).
dataset_cleaned_nice.json: Entire dataset with cleaned ingredients. Instructions of each recipe are one String.
dataset_sep_sentences.json: Entire dataset with cleaned ingredients. Instructions are separated into sentences (List of Strings).
dataset_cleaned_steps.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings).
dataset_cleaned_steps_not_empty.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing "instructions": [].
full_dataset.json: Entire dataset with cleaned ingredients and instructions. Instructions are separated into blocks of multiple sentences with up to 512 tokens. Recipes without instructions were removed manually by searching for recipes containing "instructions": [].
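The removal of recipes without instructions was done manually for the files above. The same filter could be expressed in a few lines of Python; the sketch below is an alternative, not the procedure that was used, and it assumes the dataset is a dict keyed by recipe URL. `drop_empty_instructions` is a hypothetical helper name.

```python
from typing import Any, Dict


def drop_empty_instructions(recipes: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep only recipes whose "instructions" value is a non-empty String or List."""
    return {url: recipe for url, recipe in recipes.items() if recipe.get("instructions")}


# Made-up recipes for illustration only.
recipes = {
    "url-a": {"name": "A", "instructions": ["Den Spargel_grün schälen."]},
    "url-b": {"name": "B", "instructions": []},
}
kept = drop_empty_instructions(recipes)
print(sorted(kept))
```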
## Occurrences
Cleaned Ingredients
In these datasets each cleaned ingredient is only listed once (key). The value of each key is the number of occurrences of that ingredient in all recipe ingredient lists.
all_ingredients_nice.json: Contains all cleaned ingredients. Umlauts are written as such.
mult_ingredients_nice.json: Contains all cleaned ingredients that occurred more than 20 times (the threshold can be changed) in all ingredient lists. Umlauts are written as such.
mult_ingredients_sorted.json: Contains all cleaned ingredients that occurred more than 20 times (the threshold can be changed) in all ingredient lists. Umlauts are written as such. Ingredients are sorted by number of occurrences.
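Occurrence files of this shape can be derived from a dataset with `collections.Counter`. The following is a sketch under assumptions: the dataset is a dict keyed by recipe URL, `count_ingredients` and `frequent_sorted` are hypothetical helper names, and the sort direction (most frequent first) is a guess, since the README does not state it.

```python
import json
from collections import Counter
from typing import Any, Dict


def count_ingredients(recipes: Dict[str, Dict[str, Any]]) -> Counter:
    """Count how often each cleaned ingredient appears across all ingredient lists."""
    counts: Counter = Counter()
    for recipe in recipes.values():
        counts.update(recipe["ingredients"])
    return counts


def frequent_sorted(counts: Counter, min_count: int = 20) -> Dict[str, int]:
    """Keep ingredients with more than min_count occurrences, most frequent first."""
    frequent = {ing: n for ing, n in counts.items() if n > min_count}
    return dict(sorted(frequent.items(), key=lambda kv: kv[1], reverse=True))


# Made-up counts for illustration only.
counts = Counter({"Tomate": 30, "Möhre": 50, "Safran": 2})
result = frequent_sorted(counts)
# ensure_ascii=False keeps umlauts readable instead of escaping them as \uXXXX.
print(json.dumps(result, ensure_ascii=False))
```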
Cleaned Steps
cleaned_steps_occurrance.json: contains all cleaned ingredients and how many times each ingredient occurs in the cleaned steps. This matters for training the model later on, because the model cannot be trained well on ingredients that occur only a few times in the steps.
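A count like this could be computed as sketched below. The sketch assumes that every cleaned ingredient is a single underscore-joined token (as in "Spargel_grün") and that the steps are stored per recipe as a List of Strings; `count_in_steps` is a hypothetical helper name, and multi-word ingredients would need a different matching strategy.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def count_in_steps(steps_by_recipe: Dict[str, List[str]], ingredients: Iterable[str]) -> Counter:
    """Count how often each cleaned ingredient appears as a token in the cleaned steps."""
    wanted = set(ingredients)
    counts = Counter({ing: 0 for ing in wanted})  # keep ingredients that never occur
    for steps in steps_by_recipe.values():
        for step in steps:
            # \w+ treats "_" and umlauts as word characters, so "Spargel_grün" is one token.
            for token in re.findall(r"\w+", step):
                if token in wanted:
                    counts[token] += 1
    return counts


# Made-up steps for illustration only.
steps = {"url-a": ["Den Spargel_grün schälen.", "Spargel_grün und Butter anbraten."]}
occurrences = count_in_steps(steps, ["Spargel_grün", "Butter", "Salz"])
print(occurrences.most_common())
```

Ingredients with a count near zero are the ones for which the steps offer too few training examples.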
Ingredient sets
These JSON files list the ingredients that could belong to each category: bread.json, fish.json, meats.json, pasta.json, rolls.json, sausage.json
Instructions Only
cleaned_sep_sentences.json: contains only the instructions of each recipe. Each set of instructions is separated into its sentences. Each step is cleaned. The structure is:
Recipe URL (String)
instructions: the instructions on how to make the recipe (List of Strings)
cleaned_sep_sentences_not_empty.json: same as above, but with recipes that have no instructions removed
complete_dataset512.json: sentences of a recipe's instructions are combined until the token count nears 512. No [SEP] tokens.
complete_dataset_SEP512.json: sentences of a recipe's instructions are combined until the token count nears 512. [SEP] tokens included.
model_datapoints.txt and model_datapoints_SEP.txt: list of only the datapoints from complete_dataset... .json
training_data.txt: instruction datapoints from recipes set aside for training
testing_data.txt: instruction datapoints from recipes set aside for testing
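The way sentences are combined for complete_dataset512.json and complete_dataset_SEP512.json can be pictured as greedy packing. The sketch below is an illustration, not the code that produced those files: `pack_sentences` is a hypothetical name, tokens are counted by whitespace splitting because the README does not name the tokenizer, and the [SEP] markers themselves are not counted toward the limit.

```python
from typing import Callable, List, Optional


def pack_sentences(
    sentences: List[str],
    max_tokens: int = 512,
    sep: Optional[str] = None,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily join consecutive sentences into blocks of at most max_tokens tokens."""
    joiner = f" {sep} " if sep else " "
    blocks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current block if adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            blocks.append(joiner.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        blocks.append(joiner.join(current))
    return blocks


# Made-up sentences and a tiny limit so the packing is easy to follow.
sentences = ["a b", "c d", "e"]
print(pack_sentences(sentences, max_tokens=4))
print(pack_sentences(sentences, max_tokens=4, sep="[SEP]"))
```

A single sentence longer than `max_tokens` ends up in a block of its own and still exceeds the limit; a real pipeline would have to truncate or split it.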
Other
ground_truth.json: ground truth used for evaluation
synonyms.json: synonyms of ingredients found in ground truth