OpenAI's ChatGPT shows promise as a fertility advisor, despite limitations

The widespread use of the internet by both healthcare providers and patients for accessing healthcare information has led to extensive exploration of fertility-related content. However, the lack of verified medical accuracy in the vast number of search results related to infertility is a concerning issue.

With advancements in Natural Language Processing (NLP), a field of Artificial Intelligence (AI) focused on understanding and generating human language, computers have become capable of learning and engaging in human-like conversations. OpenAI has recently developed an AI chatbot named ChatGPT, which allows users to interact with a computer interface in a conversational manner.

The recent evolution of ChatGPT

ChatGPT stands out for its ability to perform various language tasks, such as writing articles, answering questions, and even telling jokes. These functionalities have been made possible by recent advancements in deep learning (DL) algorithms.

An example of such a DL algorithm is Generative Pretrained Transformer 3 (GPT-3), a model notable for its scale: it contains 175 billion parameters and was trained on a corpus of 57 billion words drawn from diverse sources.

In November 2022, OpenAI released ChatGPT, a chatbot built on the GPT-3.5 model. It quickly became the fastest-growing consumer application in history, amassing over 100 million users within two months of its release.

While ChatGPT has potential as a clinical tool for patients seeking medical information, it has important limitations in this role. As of February 2023, its training data extended only through 2021, so it lacks the most recent information. There are also concerns about plagiarized and inaccurate content in its output.

Given the user-friendly interface and human-like language capabilities, patients are drawn to using this application to inquire about their health and receive answers. Therefore, it is crucial to evaluate the model's performance as a clinical tool and determine whether it provides reliable information or potentially misleading responses.

About the study

The present study aimed to evaluate the consistency of ChatGPT, specifically the "Feb 13" version, in providing accurate answers to fertility-related clinical questions that a patient might pose to the chatbot. The performance assessment of ChatGPT was conducted across three distinct domains.

The first domain centered on frequently asked questions (FAQs) about infertility sourced from the website of the United States Centers for Disease Control and Prevention (CDC). Seventeen common queries, such as "what is infertility?" and "how do doctors treat infertility?", were selected for this evaluation. During a single session, these questions were entered into ChatGPT, and the responses generated by the chatbot were compared against the answers provided by the CDC.
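One simple, automated way to compare a chatbot's answer against a reference answer is to measure response length and surface-level textual similarity. The sketch below is purely illustrative of that idea and is not the study's actual method (the researchers relied on expert review and text-analysis metrics); the example texts are hypothetical:

```python
from difflib import SequenceMatcher

def compare_answers(chatbot_answer: str, reference_answer: str) -> dict:
    """Compare a chatbot answer to a reference answer by word count and
    surface-level textual similarity (0.0 = disjoint, 1.0 = identical)."""
    similarity = SequenceMatcher(
        None, chatbot_answer.lower(), reference_answer.lower()
    ).ratio()
    return {
        "chatbot_words": len(chatbot_answer.split()),
        "reference_words": len(reference_answer.split()),
        "similarity": round(similarity, 2),
    }

# Hypothetical example texts, not taken from the study:
result = compare_answers(
    "Infertility is the inability to conceive after 12 months of trying.",
    "Infertility means not being able to get pregnant after one year of trying.",
)
print(result)
```

A ratio near 1.0 indicates near-verbatim overlap; moderate values suggest paraphrased but related content, which is closer to what one would expect from a chatbot summarizing established guidance.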

The second domain involved the utilization of essential fertility-related surveys. The Cardiff Fertility Knowledge Scale (CFKS) questionnaire, comprising questions about fertility, misconceptions, and risk factors associated with impaired fertility, was employed in this domain. Additionally, the Fertility and Infertility Treatment Knowledge Score (FIT-KS) survey questionnaire was utilized to assess the performance of ChatGPT.

The third domain focused on evaluating the chatbot's capacity to deliver medical advice consistent with established clinical standards. This domain was structured based on the American Society for Reproductive Medicine (ASRM) Committee Opinion titled "Optimizing Natural Fertility."

Study findings

In the first domain, ChatGPT's responses to questions about infertility closely resembled the CDC's answers, and average response length did not differ between the two sources.

When the reliability of ChatGPT's content was assessed, no significant differences in factual accuracy were identified between the CDC answers and ChatGPT's answers, nor were there discernible differences in sentiment polarity or subjectivity. Notably, only 6.12% of ChatGPT's factual statements were incorrect, and only one statement cited a reference.
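Sentiment polarity (negative-to-positive tone) and subjectivity (opinion versus fact) are standard text-analysis metrics, typically computed from word-level lexicons by tools such as TextBlob. The following toy sketch illustrates only the lexicon idea; the five-word lexicon is hypothetical, not data from any real tool or from the study:

```python
# Toy sentiment scorer illustrating polarity/subjectivity lexicons.
# The lexicon below is hypothetical; real tools use large curated word lists.
LEXICON = {
    # word: (polarity in [-1, 1], subjectivity in [0, 1])
    "effective": (0.6, 0.7),
    "safe": (0.5, 0.4),
    "risk": (-0.4, 0.5),
    "harmful": (-0.7, 0.8),
    "common": (0.0, 0.1),
}

def sentiment(text: str) -> tuple:
    """Average polarity and subjectivity over lexicon words found in text."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return (0.0, 0.0)  # neutral and objective when no lexicon words match
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return (polarity, subjectivity)

print(sentiment("treatment is effective and safe"))  # positive, fairly subjective
```

Comparing such scores across two texts, as the study did for CDC and ChatGPT answers, tests whether one source reads as markedly more opinionated or emotionally loaded than the other.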

Moving to the second domain, ChatGPT achieved high scores that corresponded to the 87th percentile of Bunting's 2013 international cohort for the CFKS questionnaire and the 95th percentile based on Kudesia's 2017 cohort for the FIT-KS questionnaire. For all questions, ChatGPT provided context and justification for its chosen answers. Additionally, there was only one instance where ChatGPT provided an inconclusive answer, which was neither deemed correct nor incorrect.

In the third domain, ChatGPT successfully reproduced the missing facts for all seven summary statements from the document "Optimizing Natural Fertility" published by the ASRM. For each response, ChatGPT explicitly highlighted the fact that was omitted from the statement and did not provide conflicting information. Consistent results were obtained across all repeated administrations in this domain.

Limitations

The present study has certain limitations, one of which is that only a single version of ChatGPT was evaluated. With the introduction of similar models such as AI-powered Microsoft Bing and Google Bard, patients will have access to alternative chatbots, and the nature and availability of these platforms can change rapidly.

Although ChatGPT offers prompt responses, there is a risk that it may draw information from unreliable sources. Furthermore, the consistency of the model's performance may change across future iterations. It is therefore crucial to assess how volatile the model's responses are when it is exposed to updated data from various sources.

Source: ScienceDirect

