added to README files, added full dataset versions to data

README.md (16 changed lines)
@@ -3,4 +3,18 @@
 German FoodBERT models for ingredient substitute recommendation

 The 3 German FoodBERT versions can be found under https://cloud.marquis.site/s/ZUVIIIQv6yznBj6
-The zip has to be unpacked in final_Versions/
+The zip has to be unpacked in **final_Versions/**
+
+More info about each step can be found in the README files in each directory. The overall order is:
+
+- crawl_recipes
+- clean_dataset
+- train_model
+- evaluation
+
+Dataset versions can be found in **data**.
+
+## Run Configuration
+
+All Python scripts should be run from the base directory (from here) using Python 3.9.
+Example: python evaluation/final_eval.py
@@ -25,8 +25,6 @@ A full dataset contains the complete recipes and additional information. Each re
 comments: List of user comments (Strings) about the recipe

-**dataset_parts**: Directory containing the dataset parts that were pulled from chefkoch.
-
 **dataset_fin.json**: Entire dataset (combined dataset parts) as retrieved from chefkoch. Instructions of each recipe are one String.

 **dataset_test.json**: A test dataset that only contains a few recipes. This dataset was used to test code,
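As a rough illustration of the record layout described above, a single recipe entry in **dataset_fin.json** could look like the following sketch. Only the **comments** and **instructions** fields are confirmed by this README; the other field names are assumptions for illustration.

```python
# Hypothetical recipe record; field names other than "comments" and
# "instructions" are assumptions, not confirmed by the README.
recipe = {
    "url": "https://www.chefkoch.de/rezepte/example",   # assumed field
    "ingredients": ["Mehl", "Zucker", "Eier"],          # assumed field
    "instructions": "Mehl und Zucker mischen. Eier unterrühren.",  # one String in dataset_fin.json
    "comments": ["Sehr lecker!", "Hat gut funktioniert."],         # List of user comments (Strings)
}
```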
@@ -42,8 +40,14 @@ Instructions are separated into sentences (List of Strings).
 **dataset_cleaned_steps_not_empty.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
 separated into sentences (List of Strings). Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.

-**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
-separated into blocks of multiple sentences with up to 512 tokens. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+**full_dataset_vers1.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+
+**full_dataset_vers2.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into blocks of multiple sentences with up to 512 tokens, not separated by [SEP]. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.
+
+**full_dataset_vers3.json**: Entire dataset with cleaned ingredients and instructions. Instructions are
+separated into sentences. Recipes without instructions were removed manually by searching for recipes containing **instructions": []**.

 ## Occurrences
 ### Cleaned Ingredients
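The three **full_dataset_vers...** variants above differ only in how instruction sentences are grouped. A minimal sketch of the grouping, assuming whitespace tokenization in place of the real BERT tokenizer (and ignoring the tokens that [SEP] itself adds — both simplifications):

```python
def build_blocks(sentences, max_tokens=512, use_sep=True):
    """Greedily combine instruction sentences into blocks of at most
    max_tokens tokens, joined by [SEP] (vers1) or plain spaces (vers2).
    Whitespace split is a placeholder for the actual tokenizer."""
    joiner = " [SEP] " if use_sep else " "
    blocks, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(sent.split())  # placeholder tokenizer
        if current and current_len + n_tokens > max_tokens:
            blocks.append(joiner.join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        blocks.append(joiner.join(current))
    return blocks
```

Version 3 skips the grouping entirely and keeps one sentence per datapoint.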
@@ -64,7 +68,7 @@ Ingredients are sorted by number of occurrences.
 ingredient occurs in the cleaned steps. This is important for training the model later on,
 as training will not be good for ingredients that only occur a few times in the steps.

-## Ingredient sets
+## food_categories directory
 These json files contain all ingredients that could belong to the different categories.
 These files are: **bread.json**, **fish.json**, **meats.json**, **pasta.json**, **rolls.json**, **sausage.json**
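The occurrence counting described above can be sketched as a single pass over the cleaned steps. This assumes multi-word ingredients are joined with underscores (e.g. Paprika_rot) so each ingredient is one token, matching the ground-truth entries elsewhere in this commit:

```python
from collections import Counter

def count_ingredient_occurrences(recipes, ingredients):
    """Count how often each ingredient token appears in the cleaned
    instruction sentences of all recipes. Assumes one token per
    ingredient (multi-word ingredients joined with underscores)."""
    counts = Counter()
    ingredient_set = set(ingredients)
    for recipe in recipes:
        for sentence in recipe["instructions"]:
            for token in sentence.split():
                if token in ingredient_set:
                    counts[token] += 1
    return counts
```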
@@ -78,18 +82,7 @@ instructions is separated into its sentences. Each step is cleaned. The structur

 **cleaned_sep_sentences_not_empty.json**: same as above, but with recipes that don't have any instructions removed

-**complete_dataset512.json**: sentences of instructions of a recipe are combined until the token count nears 512. No [SEP]
-tokens.
-
-**complete_dataset_SEP512.json**: sentences of instructions of a recipe are combined until the token count nears 512. [SEP]
-tokens included.
-
-**model_datapoints.txt** and **model_datapoints_SEP.txt**: list of only the datapoints from **complete_dataset... .json**
-
-**training_data.txt**: instruction datapoints from recipes set aside for training
-
-**testing_data.txt**: instruction datapoints from recipes set aside for testing
-
 ## Other
 **ground_truth.json**: ground truth used for evaluation

 **synonyms.json**: synonyms of ingredients found in ground truth
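Using **ground_truth.json** (ingredient mapped to a list of accepted substitutes), per-ingredient precision and recall can be computed roughly as the evaluation scripts do. This sketch omits the synonym handling from **synonyms.json** for brevity:

```python
def precision_recall(ground_truth, predicted):
    """Per-ingredient precision/recall of predicted substitutes against
    the ground truth. A simplified sketch of the metrics printed by the
    evaluation scripts (synonym handling omitted)."""
    results = {}
    for ingred, true_subs in ground_truth.items():
        preds = predicted.get(ingred, [])
        hits = [s for s in preds if s in true_subs]
        precision = len(hits) / len(preds) if preds else 0.0
        recall = len(hits) / len(true_subs) if true_subs else 0.0
        results[ingred] = (precision, recall)
    return results
```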
BIN  data/full_dataset_vers1.json  (new file, binary file not shown)
BIN  data/full_dataset_vers2.json  (new file, binary file not shown)
BIN  data/full_dataset_vers3.json  (new file, binary file not shown)
@@ -0,0 +1,18 @@
+Some parameters (model version, etc.) need to be adjusted in all scripts.
+
+## Generate Substitute Recommendations
+**generate_substitutes.py** is used to generate the substitute recommendations for each model using various scoring thresholds. Model version and scoring threshold need to be specified.
+
+## Prepare Data for Evaluation
+**find_ground_truth_ingredients.py** was used to find "rare" and "frequent" ingredients for the ground truth.
+
+Ingredients for which no substitute recommendations are found need to be added to the substitute JSON file. This is done using **add_unused_ingredients.py**.
+
+## Evaluation
+An intermediate evaluation was done using **stats_engl_substitutes_compare.py** to gain insight into the various versions of the substitute recommendations. However, this script is not used for the final evaluation.
+
+The ingredient substitute recommendations made using each FoodBERT version can be evaluated using **final_eval.py**.
+The version that is to be used has to be adjusted in the first line of main().
+
+Stats for the dataset and the ground truth can be found using **dataset_stats.py** and **ground_truth_stats.py**, respectively.
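The exact interface of **generate_substitutes.py** is not shown here; conceptually, applying a scoring threshold to candidate substitute pairs could look like the following sketch, where the (base, substitute, score) tuple shape is an assumption:

```python
def filter_by_threshold(scored_pairs, threshold):
    """Keep only substitute pairs whose score meets the threshold and
    group them per base ingredient. scored_pairs is assumed to be a
    list of (base_ingredient, substitute, score) tuples."""
    substitutes = {}
    for base, subst, score in scored_pairs:
        if score >= threshold:
            substitutes.setdefault(base, []).append(subst)
    return substitutes
```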
@@ -1,523 +0,0 @@
import json
import statistics

data_path = "data/"
occurances_path = "mult_ingredients_nice.json"
ground_truth_path = "ground_truth.json"
engl_data_path = "evaluation/engl_data/"

evaluation_path = "evaluation/"
synonyms_path = "synonyms.json"

found_substitutes_path = "final_Versions/models/vers2/eval/complete_substitute_pairs_50.json"
# model_name = "Versions/vers3/"

german_ground_truth = {
    "Karotte": ["Pastinake", "Steckrübe", "Staudensellerie", "Kürbis", "Süßkartoffel", "Rettich", "Radieschen", "Kartoffel", "Paprika_rot", "Butternusskürbis", "Petersilienwurzel"],
    "Kirsche": ["Aprikose", "Pflaume", "Nektarine", "Himbeeren", "Weintrauben", "Erdbeere", "Johannisbeeren", "Brombeeren", "Beeren_gemischte", "Pfirsich", "Cranberries", "Cranberries_getrocknet", "Blaubeeren", "Maraschino", "Beeren", "Trockenpflaumen"],
    "Huhn": ["Truthahn", "Kaninchen", "Austernpilze", "Kalbfleisch", "Fisch", "Tofu", "Rindfleisch", "Tofu_fester", "Schweinefleisch", "Seitan", "Ente", "Lamm", "Pilze", "Shrimps", "Wachtel", "Gans", "Wildfleisch"],
    "Petersilie": ["Kerbel", "Koriander", "Estragon", "Basilikum", "Oregano", "Liebstöckel", "Dill", "Koriandergrün", "Rosmarin", "Kapern", "Thymian", "Schnittlauch", "Minze", "Basilikum_getrockneter", "Oregano_getrocknet", "Thymian_getrocknet"],
    "Schokolade": ["Nutella", "Kakaopulver_Instant", "Zucker", "Marmelade", "Marshmallow", "Kakao", "Süßigkeiten", "Erdnussbutter"],
    "Frühstücksspeck": ["Pancetta", "Schinken_Prosciutto", "Speck", "Schinken_rohen", "Parmaschinken", "Schinken", "Salami", "Chorizo", "Wurst_Krakauer", "Schweineschwarte", "Schinkenwürfel", "Croûtons", "Speckwürfel", "Kochschinken", "Corned_Beef", "Wurst_Mortadella"],
    "Grünkohl": ["Spinat", "Chinakohl", "Lauch", "Endiviensalat", "Mangold", "Wirsing", "Kohl", "Blumenkohl", "Brunnenkresse", "Rucola", "Blattspinat", "Kopfsalat", "Römersalat", "Babyspinat"],
    "Zucker": ["Honig", "Stevia", "Süßstoff", "Stevia_flüssig", "Süßstoff_flüssigen", "Reissirup", "Ahornsirup", "Kondensmilch_gezuckerte", "Agavendicksaft", "Schokolade", "Vanille", "Melasse", "Zuckerrübensirup", "Sirup"],
    "Brie": ["Camembert", "Gorgonzola", "Schmelzkäse", "Cheddarkäse", "Ziegenkäse", "Doppelrahmfrischkäse", "Blauschimmelkäse", "Roquefort", "Gouda", "Käse_Fontina", "Käse_Provolone", "Feta_Käse", "Scheiblettenkäse"],
    "Truthahn": ["Huhn", "Kaninchen", "Ente", "Kochschinken", "Fasan", "Gans", "Rindfleisch", "Lammfleisch", "Schweinefleisch", "Roastbeef", "Kalbfleisch", "Geflügelfleisch", "Hähnchenfilet", "Hühnerkeule", "Wachtel", "schweinekotelett", "Wildfleisch"]
}

def no_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)
    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    found_ground_ingr = {}
    correctly_found = 0
    incorrectly_found = 0
    average_precision = 0.0
    average_recall = 0.0
    number_correct_subs_found_overall = []
    total_number_subs_found_overall = []
    # base ingredient without synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        if get_occurrences:
            occurrences = occurrences_dict[base_ingred]
        found_substitutes = model_substitutes_dict[base_ingred].copy()

        # if len(found_substitutes) > 30:
        #     found_substitutes = found_substitutes[:30]

        found = []
        # remove synonyms of base ingredient
        new_found_substitutes = []
        for subst in found_substitutes:
            if base_ingred in synonyms_dict.keys():
                if subst not in synonyms_dict[base_ingred]:
                    new_found_substitutes.append(subst)
            else:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            if subst in synonyms_dict.keys():
                for synon in synonyms_dict[subst]:
                    if synon in found_substitutes:
                        if synon not in found and subst not in found:
                            found.append(subst)
                        found_substitutes.remove(synon)
        # if base_ingred == "Erdbeere":
        print(base_ingred + ": " + str(found_substitutes))
        found_ground_ingr[base_ingred] = found
        # print(base_ingred + ": ")
        # if get_occurrences:
        #     print("occurrences in dataset: " + str(occurrences))
        # print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        # print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        # print("correctly found substitutes: " + str(found))
        # print("incorrectly found substitutes: " + str(found_substitutes))
        # print("-----------------------------\n")
        if len(found) > 0:
            average_precision += len(found) / (len(found) + len(found_substitutes))
            # print(len(found))
            average_recall += len(found) / len(ground_truth_dict[base_ingred])
        correctly_found += len(found)
        incorrectly_found += len(found_substitutes)
        number_correct_subs_found_overall.append(len(found))
        total_number_subs_found_overall.append(len(found) + len(found_substitutes))

    print("average precision: " + str(average_precision / 40))
    print("average recall: " + str(average_recall / 40))
    print("median number of correctly found subs: " + str(statistics.median(number_correct_subs_found_overall)))
    print("median number of found subs overall: " + str(statistics.median(total_number_subs_found_overall)))
    return found_ground_ingr

def merge_lists(all_lists):
    max_len = 0
    min_len = 99999
    output = []
    for curr_list in all_lists:
        if len(curr_list) < min_len:
            min_len = len(curr_list)
        if len(curr_list) > max_len:
            max_len = len(curr_list)
    for index_counter in range(max_len):
        for curr_list in all_lists:
            if index_counter < len(curr_list):
                if curr_list[index_counter] not in output:
                    output.append(curr_list[index_counter])
    return output

def with_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)

    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    correctly_found = 0
    incorrectly_found = 0
    average_precision = 0.0
    average_recall = 0.0
    number_correct_subs_found_overall = []
    total_number_subs_found_overall = []

    found_ground_ingr = {}
    # base ingredient with synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        base_synonyms = [base_ingred]
        if get_occurrences:
            occurrences = 0
        # get list of all synonyms of base ingredient
        if base_ingred in synonyms_dict.keys():
            synonyms = synonyms_dict[base_ingred]
            base_synonyms = base_synonyms + synonyms
            found_substitutes = []
            all_substitutes = []
            # get top 30 substitutes of each base synonym
            for synon in base_synonyms:
                if get_occurrences:
                    occurrences += occurrences_dict[synon]
                all_substitutes.append(model_substitutes_dict[synon].copy())
                # synon_subs = model_substitutes_dict[synon].copy()
                # if len(synon_subs) > 30:
                #     synon_subs = synon_subs[:30]
                # for sub in synon_subs:
                #     if sub not in found_substitutes:
                #         found_substitutes.append(sub)
            found_substitutes = merge_lists(all_substitutes)
        else:
            found_substitutes = model_substitutes_dict[base_ingred].copy()

        if len(found_substitutes) > 30:
            found_substitutes = found_substitutes[:30]

        found = []

        # remove all base synonyms from found substitutes
        new_found_substitutes = []
        for subst in found_substitutes:
            if subst not in base_synonyms:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            if subst in synonyms_dict.keys():
                for synon in synonyms_dict[subst]:
                    if synon in found_substitutes:
                        if synon not in found and subst not in found:
                            found.append(subst)
                        found_substitutes.remove(synon)

        found_ground_ingr[base_ingred] = found
        # print(base_ingred + ": ")
        # if get_occurrences:
        #     print("occurrences in dataset: " + str(occurrences))
        # print("number of synonyms incl. original word: " + str(len(base_synonyms)))
        # print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        # print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        # print("correctly found substitutes: " + str(found))
        # print("incorrectly found substitutes: " + str(found_substitutes))
        # print("-----------------------------\n")

        if len(found) > 0:
            average_precision += len(found) / (len(found) + len(found_substitutes))
            average_recall += len(found) / len(ground_truth_dict[base_ingred])
        correctly_found += len(found)
        incorrectly_found += len(found_substitutes)
        number_correct_subs_found_overall.append(len(found))
        total_number_subs_found_overall.append(len(found) + len(found_substitutes))

    print("average precision: " + str(average_precision / 40))
    print("average recall: " + str(average_recall / 40))
    print("median number of correctly found subs: " + str(statistics.median(number_correct_subs_found_overall)))
    print("median number of found subs overall: " + str(statistics.median(total_number_subs_found_overall)))

    return found_ground_ingr

def translate_engl_ground_truth(ground_truth, ger_transl):
    new_ground_truth = {}
    for base_ingr in ground_truth.keys():
        new_ground_truth[ger_transl[base_ingr]] = []
        for subst in ground_truth[base_ingr]:
            if subst in ger_transl.keys():
                new_ground_truth[ger_transl[base_ingr]].append(ger_transl[subst])

    return new_ground_truth

def with_base_synonyms(ground_truth_dict=None, found_substitutes_dict=None, get_occurrences=True, synonyms=True):
    if get_occurrences:
        with open(data_path + occurances_path, "r") as whole_json_file:
            occurrences_dict = json.load(whole_json_file)

    if not ground_truth_dict:
        with open(data_path + ground_truth_path, "r") as whole_json_file:
            ground_truth_dict = json.load(whole_json_file)

    if synonyms:
        with open(data_path + synonyms_path, "r") as whole_json_file:
            synonyms_dict = json.load(whole_json_file)
    else:
        synonyms_dict = {}

    if not found_substitutes_dict:
        with open(found_substitutes_path, "r") as whole_json_file:
            model_substitutes_dict = json.load(whole_json_file)
    else:
        model_substitutes_dict = found_substitutes_dict

    found_ground_ingr = {}
    # base ingredient with synonyms, substitutes with synonyms
    for base_ingred in ground_truth_dict.keys():
        base_synonyms = [base_ingred]
        if get_occurrences:
            occurrences = 0
        # get list of all synonyms of base ingredient
        if base_ingred in synonyms_dict.keys():
            synonyms = synonyms_dict[base_ingred]
            base_synonyms = base_synonyms + synonyms
            found_substitutes = []
            all_substitutes = []
            # get top 30 substitutes of each base synonym
            for synon in base_synonyms:
                if get_occurrences:
                    occurrences += occurrences_dict[synon]
                all_substitutes.append(model_substitutes_dict[synon].copy())

            found_substitutes = merge_lists(all_substitutes)
        else:
            found_substitutes = model_substitutes_dict[base_ingred].copy()

        if len(found_substitutes) > 30:
            found_substitutes = found_substitutes[:30]

        found = []

        # remove all base synonyms from found substitutes
        new_found_substitutes = []
        for subst in found_substitutes:
            if subst not in base_synonyms:
                new_found_substitutes.append(subst)
        found_substitutes = new_found_substitutes

        # check which substitutes were found
        for subst in ground_truth_dict[base_ingred]:
            # only add substitute if not already added
            if subst in found_substitutes and subst not in found:
                found.append(subst)
                found_substitutes.remove(subst)

            # check if synonyms of substitute were found
            # check if ingredient has synonyms
            # if subst in synonyms_dict.keys():
            #     for synon in synonyms_dict[subst]:
            #         if synon in found_substitutes:
            #             if synon not in found and subst not in found:
            #                 found.append(subst)
            #             found_substitutes.remove(synon)

        found_ground_ingr[base_ingred] = found
        print(base_ingred + ": ")
        if get_occurrences:
            print("occurrences in dataset: " + str(occurrences))
        print("number of synonyms incl. original word: " + str(len(base_synonyms)))
        print("number of found substitutes: " + str(len(found)) + "/" + str(len(ground_truth_dict[base_ingred])))
        print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(found_substitutes)))
        print("correctly found substitutes: " + str(found))
        print("incorrectly found substitutes: " + str(found_substitutes))
        print("-----------------------------\n")

    return found_ground_ingr

def engl_compare():
    # with open(data_path + occurances_path, "r") as whole_json_file:
    #     occurrences_dict = json.load(whole_json_file)

    with open(engl_data_path + "translation.json", "r") as whole_json_file:
        ger_transl = json.load(whole_json_file)

    # with open(data_path + synonyms_path, "r") as whole_json_file:
    #     synonyms_dict = json.load(whole_json_file)

    with open(found_substitutes_path, "r") as whole_json_file:
        model_substitutes_dict = json.load(whole_json_file)

    with open(engl_data_path + "substitute_pairs_foodbert_text.json", "r") as whole_json_file:
        engl_list = json.load(whole_json_file)

    with open(engl_data_path + "engl_ground_truth.json", "r") as whole_json_file:
        engl_ground_truth = json.load(whole_json_file)

    engl_dict = {}
    for foo in engl_list:
        if foo[0] in engl_dict.keys():
            engl_dict[foo[0]].append(foo[1])
        else:
            engl_dict[foo[0]] = [foo[1]]

    translated_ground_truth = translate_engl_ground_truth(engl_ground_truth, ger_transl)

    # without any synonyms
    print("Engl compare without any synonyms:")
    engl_replacements = {}
    # ger_replacements = {}
    for ingred in engl_ground_truth.keys():
        found = []
        incorr = []
        found_ger = []
        incorr_ger = []
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        # ger_replacements[ingred] = 0
        if ingred in engl_dict.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in engl_dict[ingred]:
                    engl_replacements[ingred]["engl"] += 1
                    found.append(sub)
        if ger_transl[ingred] in model_substitutes_dict.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in model_substitutes_dict[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1
                    found_ger.append(sub)
                    # ger_replacements[ingred] += 1
        for found_sub in engl_dict[ingred]:
            if found_sub not in engl_ground_truth[ingred]:
                incorr.append(found_sub)
        for found_sub in model_substitutes_dict[ger_transl[ingred]]:
            if found_sub not in translated_ground_truth[ger_transl[ingred]]:
                incorr_ger.append(found_sub)

        print(ger_transl[ingred] + ": ")
        print("number of found substitutes: " + str(len(found_ger)) + "/" + str(len(translated_ground_truth[ger_transl[ingred]])))
        print("correctly found substitutes: " + str(len(found_ger)) + "/" + str(len(found_ger) + len(incorr_ger)))
        print("correctly found substitutes: " + str(found_ger))
        print("incorrectly found substitutes: " + str(incorr_ger))
        print("-----------------------------\n")

        print(ingred + ": ")
        print("number of found substitutes: " + str(len(found)) + "/" + str(len(engl_ground_truth[ingred])))
        print("correctly found substitutes: " + str(len(found)) + "/" + str(len(found) + len(incorr)))
        print("correctly found substitutes: " + str(found))
        print("incorrectly found substitutes: " + str(incorr))
        print("-----------------------------\n")

    with open(evaluation_path + "engl_comparison_results/engl_no_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms of substitutes
    print("Engl compare with synonyms of substitutes only:")
    # german
    new_german_result = no_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = no_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict, get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_sub_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms for substitutes and base words
    print("Engl compare with synonyms of both:")
    # german
    new_german_result = with_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = with_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict, get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_all_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    # with synonyms for base words
    print("Engl compare with synonyms of base words only:")

    # german
    new_german_result = with_base_synonyms(ground_truth_dict=translated_ground_truth, get_occurrences=False)
    # engl
    new_engl_result = with_base_synonyms(ground_truth_dict=engl_ground_truth, found_substitutes_dict=engl_dict,
                                         get_occurrences=False, synonyms=False)

    engl_replacements = {}
    for ingred in engl_ground_truth.keys():
        engl_replacements[ingred] = {}
        engl_replacements[ingred]["engl"] = 0
        engl_replacements[ingred]["ger"] = 0
        if ingred in new_engl_result.keys():
            for sub in engl_ground_truth[ingred]:
                if sub in new_engl_result[ingred]:
                    engl_replacements[ingred]["engl"] += 1
        if ger_transl[ingred] in new_german_result.keys():
            for sub in german_ground_truth[ger_transl[ingred]]:
                if sub in new_german_result[ger_transl[ingred]]:
                    engl_replacements[ingred]["ger"] += 1

    with open(evaluation_path + "engl_comparison_results/engl_base_syn.json", 'w') as f:
        json.dump(engl_replacements, f, ensure_ascii=False, indent=4)

    print("test")

def main():
    # compare english and german results
    # engl_compare()

    print("--------------------------------------------------------")
    print("--------------------------------------------------------")
    print("--------------------------------------------------------\n")

    # get results, synonyms only used in substitutes
    no_synonyms()

    print("--------------------------------------------------------")
    print("--------------------------------------------------------")
    print("--------------------------------------------------------\n")

    # get results, synonyms used in substitutes and base ingredients
    with_synonyms()


main()
@@ -1,3 +1,19 @@
+## German FoodBERT Models
 Unzip German FoodBERT models here!

 They can be found under https://cloud.marquis.site/s/ZUVIIIQv6yznBj6
+
+## Datasets
+Each model has a folder "dataset" with the following files:
+
+**full_dataset.json**: Entire dataset with cleaned ingredients and instructions. This is the same file as found for each version in the main data directory.
+
+**complete_dataset.json**: dataset containing only URLs and instructions, separated depending on the version
+
+**model_datapoints.txt**: list of only the instruction datapoints from **complete_dataset.json**
+
+**training_data.txt**: instruction datapoints from recipes set aside for training
+
+**testing_data.txt**: instruction datapoints from recipes set aside for testing