Includes real-world queries on SEC regulations, financial responsibility rules, and market oversight.
Emphasizes cross-rule reasoning and compliance interpretation.

1K+ Regulatory Finance Questions

Finance GraphRAG Benchmark Dataset

Benchmark dataset of 1K+ real-world financial regulatory questions designed to test RAG models on SEC rules, compliance, and cross-rule reasoning.

The higher the value, the better

The closer to zero or the lower the value, the better

The closer the value is to 0 or 1, or the lower the number, the better

Consent to use marketing and receive advertisements

We provide a variety of information such as new product news, event information, and customer benefits related to the service.

Create premium datasets

Is the contact information correct?

We hope this research data supports your success.

1. Where is your workplace or research institution?

2. Please enter the name of the ongoing project or research material/thesis title.

Instructions for downloading research datasets

azoo service information

● Please read the product registration request form carefully and fill it out correctly. If the registration information and the information in the original data are different, the product registration application may be canceled.

● The seller owns and is responsible for the copyright of the original data. If a copyright problem occurs in the processing of data and liability arises, it is not azoo's fault.

● Modification of data files once registered is not possible. Therefore, please check whether the data matches your request.

● At least 100 pieces of original data are required. If there is not enough data, synthetic data cannot be created.

● The sales amount for each data is determined through azoo's screening process. Sellers can specify their desired selling price, but this may vary depending on the quality of the data.

Name

About 50 to 200 characters
[Example] Sikshin Co., Ltd. is Korea's leading food tech B2B platform company that provides a big data-based nationwide restaurant search and recommendation service and a corporate mobile electronic meal ticket service.

A. Fill out the information for the product you wish to register in the document format.

B. The azoo administrator verifies and reviews the registered product information.

C. If it is confirmed as a product that can be traded on azoo, a contract is drawn up and sent to the seller for signing.

D. Once the seller completes signing the contract, file registration (upload, DTS) is possible.

E. After the data de-identification process and final product inspection, product registration is completed (product sales begin).

E. Once data registration is complete, the azoo administrator will inspect the data file. During this process, product registration may be rejected if it is different from the product registration request or if the data processing is not appropriate.

F. Once the suitability of the data file is confirmed, the data will be sold as original or processed. The processing process may take 3 to 5 business days.

Premium service

You have already downloaded this dataset. Would you like to download it again?

Compensation for Writing Research Feedback

Can’t find the data you want?

Account information for receiving payment

Report on research experience using datasets

Please set a password of 8 to 16 characters that includes numbers, lowercase English letters, and uppercase English letters.

Search for the necessary data

When training a model, it is trained separately using real data and synthetic data, and then tested on a separate set of real data to measure real accuracy and synthetic accuracy respectively. The real accuracy serves as a benchmark for evaluating the synthetic accuracy. High accuracy: Indicates that the model is well-trained and performs well on each dataset. Low accuracy: Signals that the model may not be functioning properly, possibly due to training issues or data problems. Data accuracy is a crucial metric for determining whether the data can effectively replace real data in training machine learning models.

The accuracy rate is determined by comparing the classification accuracy of the model when trained on synthetic data to the accuracy when trained on real data. This metric measures how well a model, trained on synthetic data, performs in comparison to a model trained on real data. A higher downstream classification accuracy rate indicates that the synthetic data allows the model to perform similarly to how it would with real data, suggesting that the synthetic data is of high quality and representative of the real data. This comparison is crucial for determining the effectiveness of synthetic data in practical applications where accurate classification is essential.
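The comparison described above can be sketched as follows. This is a minimal illustration, not azoo's implementation: it assumes the accuracy rate is the ratio of synthetic accuracy to real accuracy, and the labels and predictions are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def accuracy_rate(real_accuracy, synthetic_accuracy):
    """Synthetic accuracy relative to real accuracy (assumed here to be a ratio)."""
    if real_accuracy <= 0:
        raise ValueError("real accuracy must be positive")
    return synthetic_accuracy / real_accuracy


# Hypothetical labels for a held-out test set of real data.
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical predictions from a model trained on real data.
pred_real_model = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
# Hypothetical predictions from the same model trained on synthetic data.
pred_synth_model = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

real_acc = accuracy(y_test, pred_real_model)
synth_acc = accuracy(y_test, pred_synth_model)
rate = accuracy_rate(real_acc, synth_acc)
print(f"real={real_acc:.2f} synthetic={synth_acc:.2f} rate={rate:.2f}")
```

Under this assumption, a rate near 1 means the model trained on synthetic data performs about as well as the one trained on real data.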

The real average confidence reflects how confident the model is when trained on real data. The average confidence is calculated by taking the mean of the individual confidence scores across all test examples. A high real average confidence (e.g., above 0.8) indicates that the model has learned meaningful patterns from the real data and is generally confident in its predictions, which often suggests good model performance. A low real average confidence (e.g., below 0.6) suggests that the model is uncertain in its predictions and may struggle with generalization.

The synthetic average confidence shows how well the synthetic data supports the model in making confident predictions. A high synthetic average confidence (e.g., above 0.7) indicates that the synthetic data allows the model to make confident predictions, suggesting that the synthetic data closely resembles the real data in terms of complexity and distribution. A low synthetic average confidence (e.g., below 0.5) suggests that the synthetic data lowers the confidence in the model's predictions, meaning the synthetic data may not effectively represent the real data. In this case, the model could be uncertain or misclassify more often.
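As a rough sketch of how an average confidence might be computed, the example below assumes that the confidence of one prediction is the highest class probability the model assigns to that test example. That definition and the probability values are assumptions for illustration only.

```python
def average_confidence(probabilities):
    """Mean per-example confidence, taken as the highest class probability."""
    if not probabilities:
        raise ValueError("need at least one example")
    return sum(max(row) for row in probabilities) / len(probabilities)


# Hypothetical class probabilities on the same real test set.
real_probs = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.05, 0.95]]   # trained on real data
synth_probs = [[0.6, 0.4], [0.45, 0.55], [0.7, 0.3], [0.5, 0.5]]    # trained on synthetic data

real_conf = average_confidence(real_probs)
synth_conf = average_confidence(synth_probs)
print(f"real={real_conf:.4f} synthetic={synth_conf:.4f}")
```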

The goal of downstream evaluation is to assess how well synthetic data can replace original data by measuring the performance of a classification model when trained on synthetic data compared to the original data. Specifically, the metrics in the table were measured using mlnet-base for feature extraction and a 2-layer MLP classifier, with settings such as learning rate = [placeholder], weight decay = [placeholder], and number of epochs = [placeholder].

The F1 score evaluates both how well the model finds correct answers (recall) and how accurate the model's predictions are (precision), providing a more balanced measure that is useful for imbalanced datasets. High F1 score: This indicates that the model is good at both catching the right answers (few misses) and avoiding mistakes. Low F1 score: This indicates that the model is not effectively learning from the real data, which might point to issues with the dataset quality, class imbalance, or training process. If the synthetic F1 score is significantly lower than the real F1 score, it implies that the synthetic data is poorly constructed or lacks sufficient similarity to real data.

The F1 score rate compares the F1 score of the model when trained on synthetic data to the F1 score when trained on real data. High F1 score rate: This indicates that the synthetic data allows the model to perform similarly to how it would with real data, suggesting that the synthetic data is of high quality and able to replicate the real data's patterns in a balanced way. Low F1 score rate: This indicates that the synthetic data does not support the model in the same way real data does, possibly due to a lack of diversity or accuracy in representing the real-world distribution of the classes.
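To make the relationship between precision, recall, and F1 concrete, here is a small sketch for a single positive class. The labels are hypothetical, and treating the F1 score rate as the ratio of synthetic F1 to real F1 is an assumption, not something the description states.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical real test labels and predictions from two models.
y_test = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "cat"]
pred_real_model = ["cat", "cat", "cat", "dog", "dog", "cat", "dog", "cat"]
pred_synth_model = ["cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat"]

real_p, real_r, real_f1 = precision_recall_f1(y_test, pred_real_model, "cat")
synth_p, synth_r, synth_f1 = precision_recall_f1(y_test, pred_synth_model, "cat")
f1_rate = synth_f1 / real_f1  # assumed definition of the F1 score rate
print(f"real F1={real_f1:.4f} synthetic F1={synth_f1:.4f} rate={f1_rate:.4f}")
```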

Precision score measures how accurately the model makes positive predictions. In other words, it refers to the probability that when the model predicts "cat," it is actually a cat. This score is especially important when the cost of false positives is high. High precision score: This means the model's positive predictions are likely to be truly positive, with fewer mistakes. For example, if the model predicts a cat, it is likely to actually be a cat. Low precision score: This means the model makes many mistakes, often predicting incorrect objects as positive. For example, it might predict a dog or another object as a cat. A low precision score could be a sign that the model has not learned properly. If the synthetic precision score is significantly lower than the real precision score, the synthetic data may lack diversity or accuracy.

The precision rate compares the precision of a model trained on synthetic data with that of a model trained on real data, with both models tested on real data. High precision rate: This means the synthetic data helps the model make predictions as accurately as real data, indicating that the synthetic data is of good quality and similar to real data. Low precision rate: This suggests the synthetic data reduces the model's performance, leading to more incorrect predictions. This metric is useful for evaluating how effective synthetic data is in situations where accurate predictions are crucial.

The recall score measures how well the model, trained on either real or synthetic data, can correctly identify the right answers. Recall measures how well the model finds all relevant instances, which is especially important for imbalanced datasets. A low real recall score may indicate issues such as dataset quality, class imbalance, or problems in the training process. High recall score: Indicates that the model accurately identifies most of the correct answers with minimal misses. For synthetic data, it suggests a high resemblance to real data. Low recall score: Suggests the model struggles to identify the majority of correct answers. If synthetic recall score is significantly lower than the real recall score, it implies insufficient similarity between synthetic and real data.

The recall rate compares how many correct answers the model identifies when trained on synthetic data versus real data. High recall rate: This means the synthetic data helps the model find most of the correct answers, similar to how it would with real data, indicating high-quality synthetic data. Low recall rate: This suggests the synthetic data doesn't help the model find many correct answers, possibly due to not reflecting the real data distribution well. This comparison is important for evaluating whether synthetic data works effectively in real-world applications, ensuring the model doesn't miss important instances.

BERTScore is a metric used to evaluate the semantic similarity between two texts, utilizing deep learning models like BERT (Bidirectional Encoder Representations from Transformers) to capture the meaning of the texts. This metric calculates similarity by comparing the contextual embeddings of each token, allowing for a more detailed evaluation of the texts' semantic content. The mean BERTScore value is calculated with 95% confidence by selecting a diverse set of real text data and generated synthetic text data, and then measuring BERTScore in a one-to-one comparison. High BERTScore: If the value is 0.9 or higher, it indicates that the generated text is almost semantically identical to the real text. This can increase the risk of sensitive information leakage. Low BERTScore: If the value is 0.6 or lower, it indicates lower similarity and a reduced risk of leakage.

ROUGE is a tool used to evaluate the similarity between two texts, primarily for comparing the quality of summaries or generated texts with reference texts. n-grams: These are sequences of n consecutive words in a text. For example, in "The cat sat on the mat," the 2-grams (or bigrams) are "The cat," "cat sat," and so on. ROUGE-N: Measures the overlap of n-grams between the generated text and the reference text. ROUGE-2 specifically evaluates the overlap of bigrams. High ROUGE-2: If the score is 0.8 or higher, it means the generated text is similar to the real text. Low ROUGE-2: If the score is 0.5 or lower, it indicates lower similarity and a reduced risk of leakage.
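The bigram overlap behind ROUGE-2 can be illustrated with a short sketch. It assumes lowercased whitespace tokenization with no stemming or punctuation handling, which is simpler than what standard ROUGE packages do, and the second sentence is a made-up example.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a text using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(reference, generated):
    """ROUGE-2 precision, recall, and F1 from clipped bigram overlap."""
    ref = bigrams(reference)
    gen = bigrams(generated)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ref & gen).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(gen.values()) if gen else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_2("The cat sat on the mat", "The cat lay on the mat")
print(scores)
```

Here three of the five reference bigrams ("the cat", "on the", "the mat") also appear in the generated text, so recall and precision are both 0.6.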

Coverage is a metric that measures how well the generated dataset reflects the original dataset. It evaluates whether the generated data sufficiently captures the diverse characteristics and categories of the original data. Coverage is calculated by checking if each sample from the original data has a similar sample in the generated data. If the coverage value is above 0.8: It indicates that more than 80% of the original data is well-represented, and the generated dataset effectively expresses diversity. If the coverage value is between 0.5 and 0.8: It suggests that some diversity is reflected, but certain characteristics may be underrepresented. If the coverage value is below 0.5: It indicates that less than half of the original data is reflected, suggesting that the generated data lacks diversity.
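One way to picture the coverage check is the sketch below. It assumes that "similar" means a Euclidean distance within a fixed threshold between feature vectors. Both the distance measure and the threshold are assumptions, and the points are hypothetical.

```python
import math


def coverage(original, generated, threshold):
    """Fraction of original samples with at least one generated sample within threshold."""
    if not original:
        raise ValueError("need at least one original sample")
    covered = sum(
        1
        for x in original
        if any(math.dist(x, g) <= threshold for g in generated)
    )
    return covered / len(original)


# Hypothetical 2-D feature vectors.
original = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
generated = [(0.1, 0.0), (1.0, 1.2), (5.5, 5.0)]

score = coverage(original, generated, threshold=0.6)
print(score)
```

In this toy data the original point (9, 9) has no nearby generated sample, so three of the four original samples are covered.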

MMD (Maximum Mean Discrepancy) is a metric used to assess the similarity between two probability distributions. It is commonly used to compare generated data with real data. MMD measures the difference between two distributions using kernel mean embedding and allows for flexible comparisons without assuming a specific distribution. High MMD score: A score above 0.05 indicates that the two distributions may differ. Low MMD score: Indicates that the generated data is similar to the real data. A score close to 0 is preferable, and a score below 0.01 suggests that the two data distributions are nearly indistinguishable.
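For illustration, a squared MMD can be estimated with a Gaussian (RBF) kernel as sketched below. The kernel choice, the `gamma` value, and the use of the simple biased estimator are assumptions. The description does not say which kernel or estimator is used, or whether the thresholds above refer to MMD or its square.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical 2-D samples.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.9), (1.1, 1.0)]   # close to the real sample
far = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0), (4.0, 4.0)]    # shifted away from it

mmd_near = mmd_squared(real, near)
mmd_far = mmd_squared(real, far)
print(f"near={mmd_near:.4f} far={mmd_far:.4f}")
```

The estimate is zero when the two samples are identical and grows as the generated sample moves away from the real one.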

Indistinguishability measures the ability of a machine learning model to distinguish between real and synthetic data. After the model learns the distribution of real data, it is tested on a combination of real and synthetic data to evaluate its classification accuracy. A low accuracy close to 0.5 indicates that the model frequently confuses synthetic data with real data, suggesting that the synthetic data closely resembles the real data. Higher accuracy indicates that the model effectively distinguishes synthetic data from real data. To better interpret these results, consider comparing the rate of real data misclassified as synthetic with the rate of synthetic data misclassified as real to gain deeper insights.
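The two misclassification rates mentioned above can be read off the discriminator's predictions as in the following sketch. The function name and the labels are hypothetical, and the discriminator itself is assumed to exist elsewhere.

```python
def indistinguishability_report(labels, predictions):
    """Accuracy plus the two error rates of a real-vs-synthetic discriminator."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    n_real = sum(1 for t, _ in pairs if t == "real")
    n_synth = sum(1 for t, _ in pairs if t == "synthetic")
    if n_real == 0 or n_synth == 0:
        raise ValueError("need both real and synthetic examples")
    correct = sum(1 for t, p in pairs if t == p)
    real_as_synth = sum(1 for t, p in pairs if t == "real" and p == "synthetic")
    synth_as_real = sum(1 for t, p in pairs if t == "synthetic" and p == "real")
    return {
        "accuracy": correct / len(pairs),
        "real_as_synthetic_rate": real_as_synth / n_real,
        "synthetic_as_real_rate": synth_as_real / n_synth,
    }


# Hypothetical mixed test set and discriminator output.
labels = ["real"] * 4 + ["synthetic"] * 4
predictions = ["real", "real", "synthetic", "real",
               "real", "synthetic", "real", "synthetic"]

report = indistinguishability_report(labels, predictions)
print(report)
```

In this example half of the synthetic samples pass as real while only a quarter of the real samples are flagged as synthetic, which is the kind of asymmetry the overall accuracy alone would hide.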

BERTScore is a metric used to evaluate the semantic similarity between two texts, utilizing deep learning models like BERT (Bidirectional Encoder Representations from Transformers) to capture the meaning of the texts.

High BERTScore: If the value is 0.9 or higher, it indicates that the generated text is almost semantically identical to the real text. This can increase the risk of sensitive information leakage.
Low BERTScore: If the value is 0.6 or lower, it indicates lower similarity and a reduced risk of leakage.

Inference risk measures the risk of inferring sensitive information about the original data from synthetic data. It evaluates the likelihood of extracting original data information from synthetic data and is calculated by comparing the distance between synthetic and original data.

High Duplication Rate: Indicates lower data diversity and potential quality issues, which can reduce the reliability of analysis and models.
Interpretation: A lower risk value means that sensitive information is less likely to be inferred from the synthetic data, indicating higher data security.

Indistinguishability measures how well a machine learning model can differentiate between real and synthetic data. After the model has learned the patterns of real data, it is tested on mixed data to evaluate its accuracy.

High accuracy: Means the model effectively distinguishes between synthetic and real data.
Low accuracy close to 0.5: Indicates that the model frequently confuses synthetic data with real data, suggesting that synthetic data is quite similar to real data.

Downstream classification accuracy is an indicator used to evaluate the usefulness of synthetic data. It measures whether synthetic data performs similarly to real data. The method involves training the same model separately on real data and synthetic data, and then comparing the accuracies of the two models.

Interpretation: A high accuracy rate means that the model trained on synthetic data performs similarly to the one trained on real data, indicating that the synthetic data is of high quality and well represents the real data.

MMD (Maximum Mean Discrepancy) is a metric used to assess the similarity between two probability distributions. It is commonly used to compare generated data with real data.

High MMD score: A score above 0.05 indicates that the two distributions may differ.
Low MMD score: Indicates that the generated data is similar to the real data. A score close to 0 is preferable, and a score below 0.01 suggests that the two data distributions are nearly indistinguishable.

Finance GraphRAG Benchmark Dataset

Total Price

Data Information

See how this benchmark
could work for your model.
