added to README files, added full dataset versions to data

README.md (16 changed lines)
@@ -3,4 +3,18 @@
 German FoodBERT models for ingredient substitute recommendation

 The 3 German FoodBERT versions can be found under https://cloud.marquis.site/s/ZUVIIIQv6yznBj6
-The zip has to be unpacked in final_Versions/
+The zip has to be unpacked in **final_Versions/**
+
+More info about each step can be found in the README files in each directory. The overall order is:
+
+- crawl_recipes
+- clean_dataset
+- train_model
+- evaluation
+
+Dataset versions can be found in **data**.
+
+## Run Configuration
+
+All Python scripts should be run from the base directory (from here) using Python 3.9.
+Example: python evaluation/final_eval.py
@@ -25,8 +25,6 @@ A full dataset contains the complete recipes and additional information. Each re
 comments: List of user comments (Strings) about the recipe

-**dataset_parts**: Directory containing the dataset parts that were pulled from chefkoch.
-
 **dataset_fin.json**: Entire dataset (combined dataset parts) as retrieved from chefkoch. Instructions of each recipe are one String.

 **dataset_test.json**: A test dataset that only contains a few recipes. This dataset was used to test code,
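As a rough illustration of the record layout described above, a single recipe entry in **dataset_fin.json** could look like the following sketch. Only the **comments** and **instructions** fields are confirmed by this README; the other field names are assumptions for illustration.

```python
# Hypothetical recipe record; field names other than "comments" and
# "instructions" are assumptions, not confirmed by the README.
recipe = {
    "url": "https://www.chefkoch.de/rezepte/example",   # assumed field
    "ingredients": ["Mehl", "Zucker", "Eier"],          # assumed field
    "instructions": "Mehl und Zucker mischen. Eier unterrühren.",  # one String in dataset_fin.json
    "comments": ["Sehr lecker!", "Hat gut funktioniert."],         # List of user comments (Strings)
}
```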
@@ -42,8 +40,14 @@ Instructions are separated into sentences (List of Strings).
 **dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
 separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.

-**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
-separated into blocks of multiple sentences with up to 512 tokens. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+**full_dataset_vers1.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+
+**full_dataset_vers2.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, not separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+
+**full_dataset_vers3.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into sentences. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.

 ## Occurrences
 ### Cleaned Ingredients
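The three **full_dataset_vers...** variants above differ only in how instruction sentences are grouped. A minimal sketch of the grouping, assuming whitespace tokenization in place of the real BERT tokenizer (and ignoring the tokens that [SEP] itself adds — both simplifications):

```python
def build_blocks(sentences, max_tokens=512, use_sep=True):
    """Greedily combine instruction sentences into blocks of at most
    max_tokens tokens, joined by [SEP] (vers1) or plain spaces (vers2).
    Whitespace split is a placeholder for the actual tokenizer."""
    joiner = " [SEP] " if use_sep else " "
    blocks, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(sent.split())  # placeholder tokenizer
        if current and current_len + n_tokens > max_tokens:
            blocks.append(joiner.join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        blocks.append(joiner.join(current))
    return blocks
```

Version 3 skips the grouping entirely and keeps one sentence per datapoint.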
@@ -64,7 +68,7 @@ Ingredients are sorted by number of occurrences.
 ingredient occurs in the cleaned steps. This is important for training the model later on,
 as training will not be good for ingredients that only occur a few times in the steps.

-## Ingredient sets
+## food_categories directory
 These json files contain all ingredients that could belong to the different categories.
 These files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**
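The occurrence counting described above can be sketched as a single pass over the cleaned steps. This assumes multi-word ingredients are joined with underscores (e.g. Paprika_rot) so each ingredient is one token, matching the ground-truth entries elsewhere in this commit:

```python
from collections import Counter

def count_ingredient_occurrences(recipes, ingredients):
    """Count how often each ingredient token appears in the cleaned
    instruction sentences of all recipes. Assumes one token per
    ingredient (multi-word ingredients joined with underscores)."""
    counts = Counter()
    ingredient_set = set(ingredients)
    for recipe in recipes:
        for sentence in recipe["instructions"]:
            for token in sentence.split():
                if token in ingredient_set:
                    counts[token] += 1
    return counts
```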
@@ -78,18 +82,7 @@ instructions is separated into its sentences. Each step is cleaned. The structur

 **cleaned_sep_sentences_not_empty.json**: same as above, but with recipes that don't have any instructions removed

-**complete_dataset512.json**: sentences of instructions of a recipe are combined until the token count nears 512. No [SEP]
-tokens.
-
-**complete_dataset_SEP512.json**: sentences of instructions of a recipe are combined until the token count nears 512. [SEP]
-tokens included.
-
-**model_datapoints.txt** and **model_datapoints_SEP.txt**: list of only the datapoints from **complete_dataset... .json**
-
-**training_data.txt**: instruction datapoints from recipes set aside for training
-
-**testing_data.txt**: instruction datapoints from recipes set aside for testing
-
 ## Other
 **ground_truth.json**: ground truth used for evaluation

 **synonyms.json**: synonyms of ingredients found in ground truth
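Using **ground_truth.json** (ingredient mapped to a list of accepted substitutes), per-ingredient precision and recall can be computed roughly as the evaluation scripts do. This sketch omits the synonym handling from **synonyms.json** for brevity:

```python
def precision_recall(ground_truth, predicted):
    """Per-ingredient precision/recall of predicted substitutes against
    the ground truth. A simplified sketch of the metrics printed by the
    evaluation scripts (synonym handling omitted)."""
    results = {}
    for ingred, true_subs in ground_truth.items():
        preds = predicted.get(ingred, [])
        hits = [s for s in preds if s in true_subs]
        precision = len(hits) / len(preds) if preds else 0.0
        recall = len(hits) / len(true_subs) if true_subs else 0.0
        results[ingred] = (precision, recall)
    return results
```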
BIN  data/full_dataset_vers1.json  (new file, binary file not shown)
BIN  data/full_dataset_vers2.json  (new file, binary file not shown)
BIN  data/full_dataset_vers3.json  (new file, binary file not shown)
@@ -0,0 +1,18 @@
+Some parameters (model version, etc.) need to be adjusted in all scripts.
+
+## Generate Substitute Recommendations
+**generate_substitutes.py** is used to generate the substitute recommendations for each model using various scoring thresholds. Model version and scoring threshold need to be specified.
+
+## Prepare Data for Evaluation
+**find_ground_truth_ingredients.py** was used to find "rare" and "frequent" ingredients for the ground truth.
+
+Ingredients for which no substitute recommendations are found need to be added to the substitute JSON file. This is done using **add_unused_ingredients.py**.
+
+## Evaluation
+An intermediate evaluation was done using **stats_engl_substitutes_compare.py** to gain insight into the various versions of the substitute recommendations. However, this script is not used for the final evaluation.
+
+The ingredient substitute recommendations made using each FoodBERT version can be evaluated using **final_eval.py**.
+The version that is to be used has to be adjusted in the first line of main().
+
+Stats for the dataset and the ground truth can be found using **dataset_stats.py** and **ground_truth_stats.py**, respectively.
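The exact interface of **generate_substitutes.py** is not shown here; conceptually, applying a scoring threshold to candidate substitute pairs could look like the following sketch, where the (base, substitute, score) tuple shape is an assumption:

```python
def filter_by_threshold(scored_pairs, threshold):
    """Keep only substitute pairs whose score meets the threshold and
    group them per base ingredient. scored_pairs is assumed to be a
    list of (base_ingredient, substitute, score) tuples."""
    substitutes = {}
    for base, subst, score in scored_pairs:
        if score >= threshold:
            substitutes.setdefault(base, []).append(subst)
    return substitutes
```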
@@ -1,523 +0,0 @@
import json
import statistics

data_path = "data/"
occurances_path = "mult_ingredients_nice.json"
ground_truth_path = "ground_truth.json"
engl_data_path = "evaluation/engl_data/"

evaluation_path = "evaluation/"
synonyms_path = "synonyms.json"

found_substitutes_path = "final_Versions/models/vers2/eval/complete_substitute_pairs_50.json"
# model_name = "Versions/vers3/"

german_ground_truth = {
    "Karotte": ["Pastinake", "Steckrübe", "Staudensellerie", "Kürbis", "Süßkartoffel", "Rettich", "Radieschen", "Kartoffel", "Paprika_rot", "Butternusskürbis", "Petersilienwurzel"],
    "Kirsche": ["Aprikose", "Pflaume", "Nektarine", "Himbeeren", "Weintrauben", "Erdbeere", "Johannisbeeren", "Brombeeren", "Beeren_gemischte", "Pfirsich", "Cranberries", "Cranberries_getrocknet", "Blaubeeren", "Maraschino", "Beeren", "Trockenpflaumen"],
    "Huhn": ["Truthahn", "Kaninchen", "Austernpilze", "Kalbfleisch", "Fisch", "Tofu", "Rindfleisch", "Tofu_fester", "Schweinefleisch", "Seitan", "Ente", "Lamm", "Pilze", "Shrimps", "Wachtel", "Gans", "Wildfleisch"],
    "Petersilie": ["Kerbel", "Koriander", "Estragon", "Basilikum", "Oregano", "Liebstöckel", "Dill", "Koriandergrün", "Rosmarin", "Kapern", "Thymian", "Schnittlauch", "Minze", "Basilikum_getrockneter", "Oregano_getrocknet", "Thymian_getrocknet"],
    "Schokolade": ["Nutella", "Kakaopulver_Instant", "Zucker", "Marmelade", "Marshmallow", "Kakao", "Süßigkeiten", "Erdnussbutter"],
    "Frühstücksspeck": ["Pancetta", "Schinken_Prosciutto", "Speck", "Schinken_rohen", "Parmaschinken", "Schinken", "Salami", "Chorizo", "Wurst_Krakauer", "Schweineschwarte", "Schinkenwürfel", "Croûtons", "Speckwürfel", "Kochschinken", "Corned_Beef", "Wurst_Mortadella"],
    "Grünkohl": ["Spinat", "Chinakohl", "Lauch", "Endiviensalat", "Mangold", "Wirsing", "Kohl", "Blumenkohl", "Brunnenkresse", "Rucola", "Blattspinat", "Kopfsalat", "Römersalat", "Babyspinat"],
    "Zucker": ["Honig", "Stevia", "Süßstoff", "Stevia_flüssig", "Süßstoff_flüssigen", "Reissirup", "Ahornsirup", "Kondensmilch_gezuckerte", "Agavendicksaft", "Schokolade", "Vanille", "Melasse", "Zuckerrübensirup", "Sirup"],
    "Brie": ["Camembert", "Gorgonzola", "Schmelzkäse", "Cheddarkäse", "Ziegenkäse", "Doppelrahmfrischkäse", "Blauschimmelkäse", "Roquefort", "Gouda", "Käse_Fontina", "Käse_Provolone", "Feta_Käse", "Scheiblettenkäse"],
    "Truthahn": ["Huhn", "Kaninchen", "Ente", "Kochschinken", "Fasan", "Gans", "Rindfleisch", "Lammfleisch", "Schweinefleisch", "Roastbeef", "Kalbfleisch", "Geflügelfleisch", "Hähnchenfilet", "Hühnerkeule", "Wachtel", "schweinekotelett", "Wildfleisch"]
}

def no_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)
    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    found_ground_ingr = {}
    correctly_found = 0
    incorrectly_found = 0
    average_precision = 0.0
    average_recall = 0.0
    number_correct_subs_found_overall = []
    total_number_subs_found_overall = []
    # base ingredient without synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        if get_occurrences:
            occurrences = occurrences_dict[base_ingred]
        found_substitutes = model_substitutes_dict[base_ingred].copy()

        # if len(found_substitutes) > 30:
        #     found_substitutes = found_substitutes[:30]

        found = []
        # remove synonyms of base ingredient
        new_found_substitutes = []
        for subst in found_substitutes:
            if base_ingred in synonyms_dict.keys():
                if subst not in synonyms_dict[base_ingred]:
                    new_found_substitutes.append(subst)
            else:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            if subst in synonyms_dict.keys():
                for synon in synonyms_dict[subst]:
                    if synon in found_substitutes:
                        if synon not in found and subst not in found:
                            found.append(subst)
                        found_substitutes.remove(synon)
        # if base_ingred == "Erdbeere":
        print(base_ingred + ": " + str(found_substitutes))
        found_ground_ingr[base_ingred] = found
        # print(base_ingred + ": ")
        # if get_occurrences:
        #     print("occurrences in dataset: " + str(occurrences))
        # print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        # print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        # print("correctly found substitutes: " + str(found))
        # print("incorrectly found substitutes: " + str(found_substitutes))
        # print("-----------------------------\n")
        if len(found) > 0:
            average_precision += len(found) / (len(found) + len(found_substitutes))
            # print(len(found))
            average_recall += len(found) / len(ground_truth_dict[base_ingred])
        correctly_found += len(found)
        incorrectly_found += len(found_substitutes)
        number_correct_subs_found_overall.append(len(found))
        total_number_subs_found_overall.append(len(found) + len(found_substitutes))

    print("average precision: " + str(average_precision / 40))
    print("average recall: " + str(average_recall / 40))
    print("median number of correctly found subs: " + str(statistics.median(number_correct_subs_found_overall)))
    print("median number of found subs overall: " + str(statistics.median(total_number_subs_found_overall)))
    return found_ground_ingr

def merge_lists(all_lists):
    max_len = 0
    min_len = 99999
    output = []
    for curr_list in all_lists:
        if len(curr_list) < min_len:
            min_len = len(curr_list)
        if len(curr_list) > max_len:
            max_len = len(curr_list)
    for index_counter in range(max_len):
        for curr_list in all_lists:
            if index_counter < len(curr_list):
                if curr_list[index_counter] not in output:
                    output.append(curr_list[index_counter])
    return output

def with_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)

    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    correctly_found = 0
    incorrectly_found = 0
    average_precision = 0.0
    average_recall = 0.0
    number_correct_subs_found_overall = []
    total_number_subs_found_overall = []

    found_ground_ingr = {}
    # base ingredient with synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        base_synonyms = [base_ingred]
        if get_occurrences:
            occurrences = 0
        # get list of all synonyms of base ingredient
        if base_ingred in synonyms_dict.keys():
            synonyms = synonyms_dict[base_ingred]
            base_synonyms = base_synonyms + synonyms
            found_substitutes = []
            all_substitutes = []
            # get top 30 substitutes of each base synonym
            for synon in base_synonyms:
                if get_occurrences:
                    occurrences += occurrences_dict[synon]
                all_substitutes.append(model_substitutes_dict[synon].copy())
                # synon_subs = model_substitutes_dict[synon].copy()
                # if len(synon_subs) > 30:
                #     synon_subs = synon_subs[:30]
                # for sub in synon_subs:
                #     if sub not in found_substitutes:
                #         found_substitutes.append(sub)
            found_substitutes = merge_lists(all_substitutes)
        else:
            found_substitutes = model_substitutes_dict[base_ingred].copy()

        if len(found_substitutes) > 30:
            found_substitutes = found_substitutes[:30]

        found = []

        # remove all base synonyms from found substitutes
        new_found_substitutes = []
        for subst in found_substitutes:
            if subst not in base_synonyms:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            if subst in synonyms_dict.keys():
                for synon in synonyms_dict[subst]:
                    if synon in found_substitutes:
                        if synon not in found and subst not in found:
                            found.append(subst)
                        found_substitutes.remove(synon)

        found_ground_ingr[base_ingred] = found
        # print(base_ingred + ": ")
        # if get_occurrences:
        #     print("occurrences in dataset: " + str(occurrences))
        # print("number of synonyms incl. original word: " + str(len(base_synonyms)))
        # print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        # print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        # print("correctly found substitutes: " + str(found))
        # print("incorrectly found substitutes: " + str(found_substitutes))
        # print("-----------------------------\n")

        if len(found) > 0:
            average_precision += len(found) / (len(found) + len(found_substitutes))
            average_recall += len(found) / len(ground_truth_dict[base_ingred])
        correctly_found += len(found)
        incorrectly_found += len(found_substitutes)
        number_correct_subs_found_overall.append(len(found))
        total_number_subs_found_overall.append(len(found) + len(found_substitutes))

    print("average precision: " + str(average_precision / 40))
    print("average recall: " + str(average_recall / 40))
    print("median number of correctly found subs: " + str(statistics.median(number_correct_subs_found_overall)))
    print("median number of found subs overall: " + str(statistics.median(total_number_subs_found_overall)))

    return found_ground_ingr

def translate_engl_ground_truth(ground_truth, ger_transl):
    new_ground_truth = {}
    for base_ingr in ground_truth.keys():
        new_ground_truth[ger_transl[base_ingr]] = []
        for subst in ground_truth[base_ingr]:
            if subst in ger_transl.keys():
                new_ground_truth[ger_transl[base_ingr]].append(ger_transl[subst])

    return new_ground_truth

def with_base_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)

    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    found_ground_ingr = {}
    # base ingredient with synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        base_synonyms = [base_ingred]
        if get_occurrences:
            occurrences = 0
        # get list of all synonyms of base ingredient
        if base_ingred in synonyms_dict.keys():
            synonyms = synonyms_dict[base_ingred]
            base_synonyms = base_synonyms + synonyms
            found_substitutes = []
            all_substitutes = []
            # get top 30 substitutes of each base synonym
            for synon in base_synonyms:
                if get_occurrences:
                    occurrences += occurrences_dict[synon]
                all_substitutes.append(model_substitutes_dict[synon].copy())

            found_substitutes = merge_lists(all_substitutes)
        else:
            found_substitutes = model_substitutes_dict[base_ingred].copy()

        if len(found_substitutes) > 30:
            found_substitutes = found_substitutes[:30]

        found = []

        # remove all base synonyms from found substitutes
        new_found_substitutes = []
        for subst in found_substitutes:
            if subst not in base_synonyms:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            # if subst in synonyms_dict.keys():
            #     for synon in synonyms_dict[subst]:
            #         if synon in found_substitutes:
            #             if synon not in found and subst not in found:
            #                 found.append(subst)
            #             found_substitutes.remove(synon)

        found_ground_ingr[base_ingred] = found
        print(base_ingred + ": ")
        if get_occurrences:
            print("occurrences in dataset: " + str(occurrences))
        print("number of synonyms incl. original word: " + str(len(base_synonyms)))
        print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        print("correctly found substitutes: " + str(found))
        print("incorrectly found substitutes: " + str(found_substitutes))
        print("-----------------------------\n")

    return found_ground_ingr

def engl_compare():
    # with open(data_path + occurances_path, "r") as whole_json_file:
    #     occurrences_dict = json.load(whole_json_file)

    with open(engl_data_path + "translation.json", "r") as whole_json_file:
        ger_transl = json.load(whole_json_file)

    # with open(data_path + synonyms_path, "r") as whole_json_file:
    #     synonyms_dict = json.load(whole_json_file)

    with open(found_substitutes_path, "r") as whole_json_file:
        model_substitutes_dict = json.load(whole_json_file)

    with open(engl_data_path + "substitute_pairs_foodbert_text.json", "r") as whole_json_file:
        engl_list = json.load(whole_json_file)

    with open(engl_data_path + "engl_ground_truth.json", "r") as whole_json_file:
        engl_ground_truth = json.load(whole_json_file)

    engl_dict = {}
    for foo in engl_list:
        if foo[0] in engl_dict.keys():
            engl_dict[foo[0]].append(foo[1])
        else:
            engl_dict[foo[0]] = [foo[1]]

    translated_ground_truth = translate_engl_ground_truth(engl_ground_truth, ger_transl)

    # without any synonyms
    print("Engl compare without any synonyms:")
    engl_replacements = {}
    # ger_replacements = {}
    for ingred in engl_ground_truth.keys():
        found = []
        incorr = []
        found_ger = []
        incorr_ger = []
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        # ger_replacements[ingred] = 0
        if ingred in engl_dict.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in engl_dict[ingred]:
                    engl_replacements[ingred]["engl"] += 1
                    found.append(sub)
        if ger_transl[ingred] in model_substitutes_dict.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in model_substitutes_dict[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1
                    found_ger.append(sub)
                    # ger_replacements[ingred] += 1
        for found_sub in engl_dict[ingred]:
            if found_sub not in engl_ground_truth[ingred]:
                incorr.append(found_sub)
        for found_sub in model_substitutes_dict[ger_transl[ingred]]:
            if found_sub not in translated_ground_truth[ger_transl[ingred]]:
                incorr_ger.append(found_sub)

        print(ger_transl[ingred] + ": ")
        print("number of found substitutes: " + str(len(found_ger)) + "/" + str(len(translated_ground_truth[ger_transl[ingred]])))
        print("correctly found substitutes: " + str(len(found_ger)) + "/" + str(len(found_ger) + len(incorr_ger)))
        print("correctly found substitutes: " + str(found_ger))
        print("incorrectly found substitutes: " + str(incorr_ger))
        print("-----------------------------\n")

        print(ingred + ": ")
        print("number of found substitutes: " + str(len(found)) + "/" + str(len(engl_ground_truth[ingred])))
        print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(incorr)))
        print("correctly found substitutes: " + str(found))
        print("incorrectly found substitutes: " + str(incorr))
        print("-----------------------------\n")

    with open(evaluation_path + "engl_comparison_results/engl_no_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms of substitutes
    print("Engl compare with synonyms of substitutes only:")
    # german
    new_german_result = no_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = no_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict, get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_sub_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms for substitutes and base words
    print("Engl compare with synonyms of both:")
    # german
    new_german_result = with_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = with_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict, get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_all_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms for base words
    print("Engl compare with synonyms of base words only:")

    # german
    new_german_result = with_base_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = with_base_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict,
                                         get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_base_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    print("test")

def main():
    # compare english and german results
    # engl_compare()

    print("--------------------------------------------------------")
    print("--------------------------------------------------------")
    print("--------------------------------------------------------\n")

    # get results, synonyms only used in substitutes
    no_synonyms()

    print("--------------------------------------------------------")
    print("--------------------------------------------------------")
    print("--------------------------------------------------------\n")

    # get results, synonyms used in substitutes and base ingredients
    with_synonyms()


main()
@@ -1,3 +1,19 @@
+## German FoodBERT Models
 Unzip German FoodBERT models here!

 They can be found under https://cloud.marquis.site/s/ZUVIIIQv6yznBj6
+
+## Datasets
+Each model has a folder "dataset" with the following files:
+
+**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. This is the same file as found for each version in the main data directory.
+
+**complete_dataset.json**: dataset containing only URLs and instructions, separated depending on the version
+
+**model_datapoints.txt**: list of only the instruction datapoints from **complete_dataset.json**
+
+**training_data.txt**: instruction datapoints from recipes set aside for training
+
+**testing_data.txt**: instruction datapoints from recipes set aside for testing