Physician Review Websites: Understanding Patient Satisfaction with Ophthalmologists Using Natural Language Processing

Introduction. The presence and influence of physician review websites (PRWs) have increased significantly in the field of medicine. This study aims to better understand the determinants of patient satisfaction with, and sentiment toward, ophthalmologists using natural language processing of Healthgrades reviews.
Methods. Healthgrades is a PRW where patients submit verified reviews, containing a star rating and a narrative review, of US-based ophthalmologists. This was a quantitative observational study conducted on May 23, 2022. We identified associations between physician demographics and both the sentiment analysis scores of narrative reviews and star ratings using Student's t-tests and one-way ANOVA tests. After natural language processing of the reviews, a logistic regression explored the impact of the most frequent words on the positivity of a given review.
Results. This study examined a total of 16700 reviews of 1125 ophthalmologists. Ophthalmologists of younger age and male gender received statistically significantly higher star ratings and sentiment analysis scores; location of practice did not significantly affect scores. Textual analysis revealed that words describing the physician's personality, such as “friendly” and “caring,” increased the likelihood of reviews being positive more than descriptors of the visit's effectiveness, such as “results” and “efficient.”
Conclusion. Younger and male ophthalmologists received higher star ratings and sentiment analysis scores. Additionally, results indicated that words describing the ophthalmologist's pleasant personality and the visit's effectiveness most positively impacted a review, whereas descriptors of a wait or an unpleasant personality most negatively impacted a review.


Introduction
Over the last few decades, there has been a widespread increase in the utilization of and reliance on the Internet and other digital technologies, a trend that has inevitably made its way into the healthcare landscape [1][2][3]. More recently, the surge in digital activity following the COVID-19 pandemic has pulled the relationship between healthcare and the Internet closer together [4]. Physician-rating websites (PRWs), which utilize patient reviews to rate, evaluate, and rank physicians, have grown in popularity as a way to gain valuable information about a physician before making a decision [5,6]. It is estimated that at least one in six practicing physicians in the United States has been reviewed online, and prior survey studies suggest that these websites can have a significant impact on patient decision-making [7,8]. For instance, Hanauer et al. [7] found that 35% of surveyed individuals chose a physician based on their positive ratings, and 37% chose against a physician based on their negative ratings. A deeper understanding of PRWs, their trends, and their true impact on patient decision-making is critical to ensuring that physicians understand both their public perception and their patients' sentiments.
Within the field of ophthalmology, there is a need for deeper exploration of the physician review and rating landscape. Some smaller-scale studies have been performed. For instance, Skrzypecki and Przybek in 2018 analyzed the online ratings of 105 ophthalmologists who exemplified "outstanding scientific performance," based on their overall number of citations or Hirsch index, and found that there was no correlation between these academic and scientific achievement metrics and patient ratings on Healthgrades (https://healthgrades.com) [9]. Within the subspecialty of Ophthalmic Plastic and Reconstructive Surgery, Vu et al. [10] looked at the Healthgrades evaluations of 612 US-based members of the American Society of Ophthalmic Plastic and Reconstructive Surgery (ASOPRS) and found that most ASOPRS surgeons had at least one rating on the website. While ratings were generally high, long wait times were correlated strongly with lower recommendation scores. Smith et al. performed a larger-scale study [11], in which over 80,000 online reviews were analyzed. They found higher scores among ophthalmologists who added photographs or a short biography to their page, those who had shorter wait times, younger physicians, and those without a history of malpractice claims. There remains the opportunity to further understand the PRW landscape within ophthalmology through more large-scale studies and also through investigating the nuances of patient reviews, such as with more advanced technologies like machine learning and natural language processing.
Using natural language processing (NLP), this study explores PRWs through the analysis of star ratings and sentiment analysis scores in association with demographics and patient-written review content. Its aim is to use computational analysis to learn what specific factors contribute to overall sentiment and ratings of ophthalmologists on PRWs. By gaining a better understanding of the most common reasons and sentiments behind a positive or negative PRW review, ophthalmologists may be better equipped to optimize their public-facing profile as well as provide more patient-centric care by adjusting to patient sentiment. Furthermore, the identification of certain qualities or demographics that lead to more positive sentiment may also elucidate bias within the platform and subsequently generate a deeper knowledge base for appropriate initiatives aimed at increased transparency.

Data Collection.
Patient-written reviews of US-based ophthalmologists were collected from the online PRW Healthgrades (https://healthgrades.com) on May 23, 2022, using web scraping [12], which automates the otherwise manual extraction of patient review data via a web browser. We selected Healthgrades for analysis in this quantitative observational study due to its popularity amongst physicians and patients as well as the fact that it was more permissive of web scraping compared to other physician review sites. In addition, Healthgrades includes all physicians who have active profiles on the National Provider Identifier Registry. The data collected consisted of physicians' demographic information (gender, age, and location of practice), narrative reviews (words/phrases used and average rating), and a star rating, which is scored out of five and is assigned to each physician based on the average scores of their patient-written reviews. Only patients with access to the Internet and an Internet-accessing electronic device were able to submit reviews to Healthgrades. The inclusion criteria comprised as many physicians classified as "Ophthalmologist" as Healthgrades allowed to be extracted with search location parameters set to "National." The exclusion criteria omitted any reviews given to physicians with fewer than six total reviews.

Natural Language Processing (VADER Sentiment Analysis).
In this study, sentiment analysis is used to study the opinions, feelings, and views of patients about their physicians through computational analysis of Healthgrades written reviews. The Valence Aware Dictionary and sEntiment Reasoner (VADER) is a commonly used Python package in the field of NLP for extracting sentiment from text on social platforms [13]. The VADER model translates written phrases into normalized scores that represent the positivity or negativity of the sentiment, taking into account writing features such as punctuation, capitalization, and degree modifiers. It accomplishes this by first converting each word to a sentiment "valence" score from negative four to positive four that accounts for polarity and intensity. Subsequently, these scores are summed and adjusted based on the general, grammatical, and syntactical rules set by VADER. Finally, the sum is normalized to a compound score that ranges from negative one to positive one. We selected VADER given its past uses in online patient review analysis as well as its sensitivity to the language and format of social media platforms [14][15][16].
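The three steps described above (per-word valence, summation, normalization) can be sketched in a few lines of Python. This is a simplified illustration, not the VADER package itself: the lexicon `TOY_LEXICON` and its values are hypothetical, the grammatical and syntactical adjustment rules are omitted, and the normalization form x / sqrt(x^2 + alpha) with alpha = 15 reflects our understanding of the open-source implementation and should be treated as an assumption.

```python
import math

# Hypothetical mini-lexicon on VADER's -4..+4 valence scale.
# Values are illustrative only and are NOT taken from the real VADER lexicon.
TOY_LEXICON = {"friendly": 2.0, "caring": 2.0, "rude": -2.0, "waited": -0.5}


def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Map an unbounded summed valence onto the interval [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


def toy_compound(text: str) -> float:
    """Sum per-word valences (no punctuation/negation rules) and normalize."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    raw = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return normalize(raw)


if __name__ == "__main__":
    for review in ("Friendly and caring staff!", "Rude. We waited."):
        print(review, "->", round(toy_compound(review), 3))
```

Because the normalization is bounded, adding further positive words pushes the compound score toward one without ever exceeding it, which is why a fixed cutoff on the compound score can be used to label reviews.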

Data Analysis.
We trained a linear regression model to describe the relationship between the average VADER score and the Healthgrades star rating of each sampled ophthalmologist. If the VADER sentiment analysis were valid according to the Healthgrades star rating, the graph would display a linear relationship between the two variables. The R-squared value of this model was also used to optimize the minimum number of reviews required per physician to be included in this study. This method of determining the exclusion criteria was also used in the studies by Tang et al. [14][15][16]. Student's t-tests were used to evaluate relationships between gender and both sentiment analysis scores and star ratings. One-way ANOVA tests were used to compare age and location of practice with both sentiment analysis scores and star ratings; the directionality of significant results was validated with a Tukey test. Geographic subgroups were defined as Northeast, Southeast, Southwest, Midwest, and West [17]. Age subgroups were defined in increments of ten years to optimize ease of analysis and evenness of distribution. Prior to any text analysis, we completed natural language preprocessing, which tokenized the text and removed stop words. The word frequency analysis identified every unique word in the patient-written reviews and documented its frequency. We repeated the same procedure for bigrams, or two-word pairs. Further, a multivariate logistic regression model quantified the most impactful words or bigrams in determining whether or not a review was positive, the cutoff for which was a VADER sentiment score greater than 0.5.
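As a rough illustration of the preprocessing and frequency steps, the sketch below tokenizes reviews, removes stop words, and counts single words and bigrams using only the Python standard library. The stop-word list, the tokenization pattern, and the choice to form bigrams after stop-word removal are assumptions made for this example; the study does not specify these details.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "very", "i", "to", "for", "of"}
POSITIVE_CUTOFF = 0.5  # a review is labeled positive if its VADER compound score exceeds this


def preprocess(review: str) -> list[str]:
    """Lowercase, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def count_terms(reviews):
    """Return (single-word counts, bigram counts) over all reviews."""
    unigrams, bigrams = Counter(), Counter()
    for review in reviews:
        tokens = preprocess(review)
        unigrams.update(tokens)
        # Adjacent token pairs, formed after stop-word removal (an assumption).
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def is_positive(compound_score: float) -> bool:
    """Apply the cutoff described in the text (strictly greater than 0.5)."""
    return compound_score > POSITIVE_CUTOFF


if __name__ == "__main__":
    sample = ["The staff was friendly and helpful.", "Friendly and helpful doctor"]
    words, pairs = count_terms(sample)
    print(words.most_common(3))
    print(pairs.most_common(1))
```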

Results
Out of 23815 reviews web scraped from Healthgrades on May 23, 2022, inclusion and exclusion criteria produced a total of 16700 reviews of 1125 ophthalmologists in the United States for analysis. The gender identity, age, and geographic location of the physician were identified as reported on the website (Table 1).

Model Validation.
We tested the validity of VADER sentiment analysis scores using a linear regression between the average VADER score and the Healthgrades star rating. We found a positive correlation (R-squared = 0.647; p < 0.001), showing VADER scores to be largely in agreement with star ratings (Figure 1).
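For readers unfamiliar with the R-squared statistic reported here, the following sketch fits an ordinary least-squares line and computes R-squared as one minus the ratio of residual to total sum of squares. The per-physician values are made up for illustration, and treating the average VADER score as the predictor and the star rating as the response is an assumption; the study does not state which variable was regressed on which.

```python
def fit_line_r_squared(x, y):
    """Ordinary least-squares fit y = intercept + slope * x, plus R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical per-physician averages, for illustration only.
    avg_vader = [0.2, 0.5, 0.7, 0.9]
    star_rating = [3.0, 4.0, 4.5, 5.0]
    slope, intercept, r2 = fit_line_r_squared(avg_vader, star_rating)
    print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

An R-squared of 0.647 would mean that roughly 65% of the variance in one variable is accounted for by a linear function of the other.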

Gender, Age, and Location of Practice Analysis.
We completed gender analysis of patient reviews using Student's t-tests. Results indicated that male ophthalmologists received higher star ratings (4.61 v. 4.55; p < 0.001) than female ophthalmologists. The differences in sentiment analysis scores (0.645 v. 0.624; p < 0.002) between male and female ophthalmologists were also shown to be statistically significant (Table 2).
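The two-sample Student's t statistic behind a comparison of this kind can be computed as in the sketch below, which uses the pooled-variance form. The sample values are invented for illustration and are not the study's data; obtaining a p-value additionally requires the t distribution with the returned degrees of freedom, typically from a statistics library.

```python
import math
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


if __name__ == "__main__":
    # Hypothetical star ratings for two groups of physicians.
    group_a = [4.8, 4.6, 4.7, 4.5]
    group_b = [4.4, 4.6, 4.3, 4.5]
    t_stat, df = student_t(group_a, group_b)
    print(f"t = {t_stat:.3f}, df = {df}")
```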
We completed location analysis by separating ophthalmologists into five geographical subgroups: West, Midwest, Southwest, Southeast, and Northeast. We studied the relationship between average star ratings and sentiment analysis scores with location of practice using one-way ANOVA tests. Unlike gender and age, differences in star ratings and sentiment analysis scores across locations of practice were not statistically significant (Table 6).
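A one-way ANOVA compares the variation between subgroup means with the variation within subgroups. The sketch below computes the F statistic and its degrees of freedom from first principles; the group values are hypothetical, and the p-value (which requires the F distribution) and the Tukey follow-up test are not shown.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical scores for three subgroups, for illustration only.
    subgroups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(one_way_anova_f(subgroups))
```

A small F statistic, as would be expected for the location comparison, indicates that subgroup means differ little relative to the spread within each subgroup.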

Single-word and Bigram Frequency Analysis.
Single-words and bigrams most frequently used in the best and worst patient-written Healthgrades reviews were identified (Tables 7 and 8). Clinically irrelevant words and bigrams were omitted from this analysis. For instance, single-words such as "great" and "said" or bigrams such as "best," "ophthalmologist" and "worst," "enemy" were frequently used, but proved irrelevant to understanding the reasons behind patient sentiment through linguistic analysis.
Of the included single-words, both the best and worst reviews had high frequencies of words describing the ophthalmologist's approach and atmosphere as well as the visit's effectiveness and efficiency. The best reviews included words such as "friendly," "caring," "kind," and "comfortable" as well as "results," "helpful," and "efficient." The worst reviews included words such as "rude," "unprofessional," "arrogant," and "condescending" as well as "waiting," "waited," and "rushed" (Table 7).

Multivariate Logistic Regression.
A multivariate logistic regression was performed using clinically relevant words to determine the likelihood that a specific single-word or bigram would be included in a positive patient-written review (Table 9).
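To show how a logistic regression yields an odds ratio for each word, the sketch below fits a small model by batch gradient descent and exponentiates the coefficients. It is a minimal illustration under stated assumptions: the features are binary word-presence indicators, the toy data are invented, no regularization is applied, and the optimizer settings (`lr`, `epochs`) are arbitrary. The study does not describe its software or fitting procedure.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def odds_ratios(weights):
    """exp(coefficient): multiplicative change in odds when the word is present."""
    return [math.exp(w) for w in weights]


if __name__ == "__main__":
    # Hypothetical reviews: columns are presence of "friendly" and "rude";
    # label 1 means the review's VADER compound score exceeded 0.5.
    X = [[1, 0]] * 4 + [[0, 1]] * 4 + [[0, 0]] * 2
    y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
    w, b = fit_logistic(X, y)
    print(dict(zip(["friendly", "rude"], odds_ratios(w))))
```

An odds ratio above one means the word's presence raises the odds that a review is positive, holding the other words fixed; a value below one means it lowers them.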

Spread of Star Scores.
There was a total of 14620 five-star and 1458 one-star reviews, in contrast with 196 two-star, 140 three-star, and 286 four-star reviews (Table 10). One- and five-star reviews made up around 96% of total reviews, while two-, three-, and four-star reviews made up around 4%.
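The shares quoted above follow directly from the reported counts, as this short check shows (the counts are those given in the text; the variable names are ours):

```python
# Review counts by star rating, as reported in the text.
star_counts = {5: 14620, 4: 286, 3: 140, 2: 196, 1: 1458}

total = sum(star_counts.values())  # 16700, matching the total number of reviews analyzed
extreme_share = (star_counts[1] + star_counts[5]) / total
middle_share = (star_counts[2] + star_counts[3] + star_counts[4]) / total

print(f"one- and five-star: {extreme_share:.1%}")
print(f"two- to four-star:  {middle_share:.1%}")
```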

Discussion
Given the increase in popularity of PRWs in the past decade, it is important for ophthalmologists to better understand these platforms, which often serve as filter mechanisms for patients who are searching for a new ophthalmologist [5,6,18]. The present study investigates patient-written reviews of US ophthalmologists on the popular physician review website, Healthgrades. The statistical relationships between certain physician demographics and both star rating and average sentiment score were studied. Also, single-word and bigram frequencies were noted and analyzed via a multivariate logistic regression for their impact on determining whether or not a review was positive, i.e., having a VADER score > 0.5.
Journal of Ophthalmology
To the knowledge of the authors, only three studies in the field of ophthalmology have explored PRWs [9][10][11]. This study represents the second-largest analysis of ophthalmology-specific PRWs, with 16700 reviews of 1125 ophthalmologists. It also represents the first study to use natural language processing to gauge patient sentiment from patient-written reviews, offering unprecedented granular insight into the impact of word choice on sentiment analysis outcomes, and thereby star rating outcomes. This study's results suggest that younger and male ophthalmologists tend to receive higher star ratings and reviews with higher sentiment analysis scores on Healthgrades. Moreover, the multivariate logistic regression indicated that being "confident," "friendly," and "caring" held a greater odds ratio in determining the positivity of a sentiment analysis score than outcome-pertinent diction such as "results" and "efficient." In contrast to the results reported by Vu et al. on ASOPRS surgeons visible on Healthgrades, this study found that male ophthalmologists received higher star ratings than female ophthalmologists on Healthgrades [10,11]. Conversely, Smith et al. analyzed two PRWs, Healthgrades and Vitals, and determined no statistically significant difference between male and female ophthalmologists in star rating. Given the greater number and broader source of patient-written reviews that were analyzed by Smith et al., it is likely that PRWs in general do not indicate that male ophthalmologists are more favorably received by patients compared with female ophthalmologists. However, it is of note that when we analyzed Healthgrades in isolation, we found the aforementioned trend to be statistically significant. Furthermore, sentiment analysis scores, which were not analyzed by previous studies, were also shown to favor male ophthalmologists on Healthgrades.
As such, additional research should explore sentiment analysis score differences between male and female ophthalmologists in a wider range of PRWs as well as in the different subspecialties of ophthalmology, including but not limited to oculoplastics.
Age analysis determined that younger ophthalmologists received higher star ratings and sentiment analysis scores than older ophthalmologists, a finding that was in accordance with those of Smith et al. [11]. This trend does not appear to be unique to the field of ophthalmology, as Tang et al. found similar results among reviews of hand and spine surgeons [14,15]. However, the geographic location of practice did not yield any statistically significant differences with respect to star ratings or sentiment analysis scores.
Single-word frequency analysis suggested that the highest-rated reviews more often contained words about the ophthalmologist's personality traits such as "friendly," "caring," and "kind" than the visit's effectiveness such as "results," "helpful," or "efficient," which was also reflected in the odds ratios of these words as shown in the multivariate logistic regression. Bigram analysis showed similar trends, with "friendly," "helpful" and "kind," "caring" being the two most frequent bigrams in positively rated reviews. The presence of "helpful" in the top bigram does, however, indicate the importance of both personality and effectiveness in patient satisfaction. In corroboration, the multivariate analysis underscored that both factors were highly correlated with more positive reviews. Analysis of negative reviews revealed that the worst reviews most often involved words that related to timing issues, such as "waiting," "waited," and "rushed." Likewise, Smith et al. noted that ophthalmologists with longer wait times were more likely to receive lower star ratings [11]. Studies that analyzed patient satisfaction through physical surveys such as the Press Ganey survey also found that longer waiting times corresponded to lower satisfaction scores [19,20]. Bigrams most often found in the worst reviews had much lower frequencies than those in positive reviews. Still, three of the most frequently used were related to physician availability, namely, "never," "return;" "refused," "see;" and "waiting," "hours."

Most frequent bigrams in the best and worst reviews, with frequency counts:

Best reviews                Frequency    Worst reviews                  Frequency
"Friendly," "helpful"       135          "Never," "return"              14
"Kind," "caring"            113          "Refused," "see"               10
"Friendly," "efficient"     70           "Staff," "unprofessional"      9
"Truly," "cares"            64           "Waiting," "hours"             9
"Great," "results"          61           "Billed," "insurance"          8
"Staff," "best"             59           "Condescending," "rude"        8
"Everyone," "friendly"      51           "Horrible," "bedside"          7

Lastly, bigram analysis of both the best and worst reviews included references to the staff, underscoring that the factors contributing to star ratings were the responsibility not only of the physician but also of every staff member at the practice. Granted, multivariate analysis did not show a statistically significant relationship between "best staff" and positive reviews. However, it is notable that "staff" in isolation showed a statistically significant association with positive reviews. A limitation of PRWs at large is that they inevitably attract the most extreme of opinions. The data suggest that an exceedingly positive or negative experience is more likely to encourage patients to write online reviews than a mediocre experience. For instance, one- and five-star reviews constituted around 96% of the total number of reviews (Table 10). The problem this presents is that it limits the quality of feedback that PRWs can provide ophthalmologists, as it represents a skewed sample of experiences. Consequently, this study advises ophthalmologists to generate their own patient surveys upon completion of the visit. These surveys should be composed of questions that are informed by the single-word, bigram, and multivariate analyses in this study, so that the feedback they collect is patient-centered. By doing so, questions will cater to patients' true concerns and values, in contrast to uninformed surveys that typically cater to the practice's concerns and values. As such, they can offer ophthalmologists a more comprehensive and patient-centric evaluation of their practice than PRWs.
This study itself contains a number of limitations that may warrant follow-up studies. First, it is an observational study that explores only one PRW, Healthgrades, and a broader evaluation of PRWs may yield different and more informed results. It does not include the sentiments of patients who do not have access to the Internet, which may bias against certain groups. Moreover, since the inclusion criteria included ophthalmologists with at least six reviews, it is possible that the ophthalmologists studied were those who either encouraged their patients to submit reviews or, for positive or negative reasons, attracted a higher number of reviews.

Conclusion
This study assessed various determinants of ophthalmologist star ratings and their reviews' sentiment analysis scores. These determinants ranged from demographic information such as age, gender, and location to the diction of their patient-written reviews, which revealed the aspects of the ophthalmic care provided that were most or least appreciated by patients. These findings allow us some understanding of patient mindsets when seeking care and thus should be kept in mind for optimizing practice growth and improving patient care. There was a positive correlation between the VADER score and the Healthgrades star rating, supporting the use of VADER sentiment analysis as a barometer of patient satisfaction. The NLP results of this study suggest that ophthalmologists seeking positive online reviews and presumably increased patient satisfaction should treat patients with an emphasis on being "confident," "kind," and "efficient" while also minimizing wait times. This may also be applied to training of staff, scheduling, the creation of practice workflows, and other arenas of practice implementation and governance. For optimization of patient care, ophthalmologists may consider using the single-word, bigram, and multivariate analyses completed by this study to administer their own patient surveys. When comparing their reviews and ratings to those of their colleagues, ophthalmologists should note that Healthgrades reviews tend to be biased in favor of younger and male members of the field. However, these findings may also expose selected priorities of these subgroups in patient interactions. Notably, results of the present study's sentiment analysis and word frequency analysis are limited to the Healthgrades database. Future studies may use sentiment analysis to explore multiple PRWs as well as distinguish results between subspecialties of ophthalmology.

Data Availability
All data analyzed in this paper is publicly available through https://www.Healthgrades.com.