# Datasets

## Terminology
"Cleaning the ingredients" means removing amounts, comments, and similar annotations from each ingredient entry. For example, "2 m.-große Tomate(n)" and "400 g Tomate(n) , in Stückchen, frisch oder aus der Dose" are both simplified to the actual ingredient, "Tomate".
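
The following is a minimal sketch of ingredient cleaning, not the code that produced these datasets. It assumes that comments follow the first comma and that the ingredient is the last word before it, which holds for the two examples above but not for multi-word ingredients. The function name `clean_ingredient` is hypothetical.

```python
import re


def clean_ingredient(raw: str) -> str:
    """Naive cleaner: reduce a raw ingredient line to a single ingredient word."""
    # Assumption: comments follow the first comma, so keep only the text before it.
    head = raw.split(",", 1)[0]
    # Drop parenthesised plural suffixes such as "(n)".
    head = re.sub(r"\([^)]*\)", "", head)
    # Assumption: amounts, units, and size adjectives precede the ingredient,
    # so the last remaining word is taken as the ingredient.
    words = head.split()
    return words[-1] if words else ""


print(clean_ingredient("2 m.-große Tomate(n)"))
print(clean_ingredient("400 g Tomate(n) , in Stückchen, frisch oder aus der Dose"))
```

A real cleaner would need a unit list and handling for multi-word ingredients; this sketch only illustrates the idea.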

"Cleaning the steps/instructions" means replacing the shortened form of an ingredient (e.g. "Spargel") with the full, cleaned ingredient (e.g. "Spargel_grün").
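
Step cleaning could be sketched as below. This is an illustration under assumptions: the short form is taken to be the part of the cleaned ingredient before the first underscore (as in "Spargel_grün"), and the function name `clean_step` is hypothetical.

```python
import re


def clean_step(step: str, cleaned_ingredients: list[str]) -> str:
    """Replace short ingredient names in a step with their full cleaned form."""
    for full in cleaned_ingredients:
        if "_" not in full:
            continue  # no qualifier, nothing to expand
        # Assumption: the short form is the part before the first underscore.
        short = full.split("_", 1)[0]
        # "_" is a word character, so \b does not match inside "Spargel_grün";
        # names that are already expanded are left alone.
        pattern = r"\b" + re.escape(short) + r"\b"
        step = re.sub(pattern, lambda _match, full=full: full, step)
    return step


print(clean_step("Den Spargel schälen und waschen.", ["Spargel_grün", "Butter"]))
```

If a recipe lists two variants of the same base ingredient, the first one in the list wins in this sketch; resolving such cases properly needs more context than a regular expression provides.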

## Full Datasets
A full dataset contains the complete recipes and additional information. Each recipe is structured as follows:

- Recipe URL (String)
- image: Image URL (String)
- name: Name of the recipe (String)
- quantity: Amount the recipe makes, e.g. number of portions (String)
- ingredients: List of ingredients the recipe calls for (List of Strings)
- instructions: Instructions on how to make the recipe (String or List of Strings)
- comments: List of user comments about the recipe (List of Strings)
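
As a sketch of how such a recipe might be read, the example below parses a made-up one-recipe dataset. It assumes the top level is a JSON object keyed by recipe URL with the remaining fields nested under it; the URL and all values are invented for illustration.

```python
import json

# Hypothetical one-recipe dataset; the URL and all values are made up.
raw = """
{
  "https://example.com/recipe/1": {
    "image": "https://example.com/recipe/1.jpg",
    "name": "Tomatensuppe",
    "quantity": "4 Portionen",
    "ingredients": ["Tomate", "Zwiebel"],
    "instructions": ["Tomate würfeln.", "Zwiebel anbraten."],
    "comments": ["Sehr lecker!"]
  }
}
"""

dataset = json.loads(raw)
for url, recipe in dataset.items():
    steps = recipe["instructions"]
    # Instructions are either one String or a List of Strings.
    if isinstance(steps, str):
        steps = [steps]
    print(url, recipe["name"], len(recipe["ingredients"]), len(steps))
```

To read one of the files instead, open it with `encoding="utf-8"` and pass the file object to `json.load`.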

**dataset_parts**: Directory containing the dataset parts that were pulled from Chefkoch.

**dataset_fin.json**: Entire dataset (the combined dataset parts) as retrieved from Chefkoch. The instructions of each recipe are one String.

**dataset_test.json**: A test dataset that contains only a few recipes. It was used to test code, since running code on the entire dataset often takes a while. Ingredients are cleaned. Instructions are separated into sentences (List of Strings).

**dataset_cleaned_nice.json**: Entire dataset with cleaned ingredients. The instructions of each recipe are one String.

**dataset_sep_sentences.json**: Entire dataset with cleaned ingredients. Instructions are separated into sentences (List of Strings).

**dataset_cleaned_steps.json**: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings).

**dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.
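
The removal was done manually, but a programmatic equivalent might look like the sketch below. It assumes the dataset is a dictionary keyed by recipe URL; `drop_empty` is a hypothetical helper and the example URLs are made up.

```python
def drop_empty(dataset: dict) -> dict:
    """Keep only recipes whose instructions are present and non-empty."""
    return {url: recipe for url, recipe in dataset.items() if recipe.get("instructions")}


# Made-up example: the second recipe has no instructions.
example = {
    "https://example.com/a": {"instructions": ["Wasser kochen."]},
    "https://example.com/b": {"instructions": []},
}
filtered = drop_empty(example)
print(list(filtered))
```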

**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are separated into blocks of multiple sentences with up to 512 tokens each. Recipes without instructions were removed manually by searching for recipes containing **"instructions": []**.

## Occurrences
### Cleaned Ingredients
In these datasets, each cleaned ingredient is listed only once, as a key. The value of each key is the number of occurrences of that ingredient across all recipe ingredient lists.
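
A mapping of this kind can be built with `collections.Counter`, as in the sketch below. The ingredient lists are made up, and serialising with `ensure_ascii=False` is one way to keep umlauts readable in the output, not necessarily how these files were written.

```python
import json
from collections import Counter

# Hypothetical cleaned ingredient lists of three recipes.
ingredient_lists = [
    ["Tomate", "Zwiebel", "Salz"],
    ["Tomate", "Salz"],
    ["Spargel_grün", "Salz"],
]

occurrences = Counter()
for ingredients in ingredient_lists:
    occurrences.update(ingredients)

# ensure_ascii=False writes "ü" as such instead of an escape sequence.
text = json.dumps(occurrences, ensure_ascii=False)
print(text)
```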

**all_ingredients_nice.json**: Contains all cleaned ingredients. Umlauts are written as such.

**mult_ingredients_nice.json**: Contains all cleaned ingredients that occurred more than 20 times (the threshold can be changed) in all ingredient lists. Umlauts are written as such.

**mult_ingredients_sorted.json**: Contains all cleaned ingredients that occurred more than 20 times (the threshold can be changed) in all ingredient lists. Umlauts are written as such. Ingredients are sorted by number of occurrences.
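
Filtering and sorting could be sketched as follows. The counts are invented, `frequent_sorted` is a hypothetical helper, and descending order is an assumption, since the sort direction is not stated here.

```python
def frequent_sorted(occurrences: dict, min_count: int = 20) -> dict:
    """Keep ingredients seen more than min_count times, most frequent first."""
    kept = {name: n for name, n in occurrences.items() if n > min_count}
    # Assumption: descending order by number of occurrences.
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Made-up counts for illustration.
counts = {"Salz": 950, "Safran": 12, "Tomate": 310, "Butter": 21}
print(frequent_sorted(counts))
```

Changing `min_count` corresponds to changing the threshold of 20 mentioned above.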

### Cleaned Steps
**cleaned_steps_occurrance.json**: Contains all cleaned ingredients and how many times each ingredient occurs in the cleaned steps. This matters for training the model later on, because training will work poorly for ingredients that occur only a few times in the steps.

## Ingredient Sets
These JSON files contain all ingredients that could belong to the respective category. The files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**

## Instructions Only
**cleaned_sep_sentences.json**: Contains only the instructions of each recipe. Each set of instructions is separated into its sentences, and each step is cleaned. The structure is:

- Recipe URL (String)
- instructions: Instructions on how to make the recipe (List of Strings)

**cleaned_sep_sentences_not_empty.json**: Same as above, but with recipes that have no instructions removed.

**complete_dataset512.json**: The sentences of a recipe's instructions are combined until the token count nears 512. No [SEP] tokens.

**complete_dataset_SEP512.json**: The sentences of a recipe's instructions are combined until the token count nears 512. [SEP] tokens included.
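
Combining sentences up to a token budget could be sketched with a greedy packer like the one below. It is an illustration only: whitespace-separated words stand in for the real tokenizer, placing [SEP] between sentences is an assumption, and `chunk_sentences` is a hypothetical helper.

```python
def chunk_sentences(sentences, max_tokens=512, sep=None):
    """Greedily pack consecutive sentences into blocks of about max_tokens."""
    blocks, current, used = [], [], 0
    for sentence in sentences:
        # Assumption: whitespace-separated words approximate the token count.
        n = len(sentence.split())
        # A separator between two sentences is counted as one extra token.
        extra = 1 if (sep and current) else 0
        if current and used + extra + n > max_tokens:
            blocks.append(current)
            current, used, extra = [], 0, 0
        # A single sentence longer than max_tokens still forms its own block.
        current.append(sentence)
        used += extra + n
    if current:
        blocks.append(current)
    joiner = f" {sep} " if sep else " "
    return [joiner.join(block) for block in blocks]


steps = ["Spargel_grün schälen.", "Wasser kochen.", "Spargel_grün zehn Minuten garen."]
print(chunk_sentences(steps, max_tokens=5))
print(chunk_sentences(steps, max_tokens=5, sep="[SEP]"))
```

The small `max_tokens=5` only keeps the example short; with the model's own tokenizer the counts would differ from the word counts used here.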

**model_datapoints.txt** and **model_datapoints_SEP.txt**: List of only the datapoints from **complete_dataset... .json**.

**training_data.txt**: Instruction datapoints from recipes set aside for training.

**testing_data.txt**: Instruction datapoints from recipes set aside for testing.

## Other
**ground_truth.json**: Ground truth used for evaluation.

**synonyms.json**: Synonyms of ingredients found in the ground truth.