A Geo-Social Characterization of Health Impact from Air Pollution in Mexico Valley

The impact of the air pollution phenomenon has long been studied, but most often with a fragmented approach, without closely looking at the relationship between the different components that characterize it, such as sensor-based data, health data from institutional databases, and data on how it is perceived by human beings in social media. The research developed in this study introduces an integrated methodological framework that analyses sensor data on air pollution distributed in space and time, combined with health data and social data narratives that reflect how different communities perceive this phenomenon in space and time. It explores how these heterogeneous sources can be combined to better understand the impact of air pollution phenomena at the large-city level in the Valley of Mexico. We introduce a spatio-temporal data integration and mining framework that aims to discover trends and insights regarding the distribution of the impact of an air pollution phenomenon in terms of human health and perception. The main peculiarity of our methodological framework is the integration of different large data sources by combining a series of methods: NLP (topic modeling), data mining (data cubes, unsupervised learning, and clustering), and GIS capabilities (spatial interpolation, choropleth maps) that together provide a better understanding of the quantitative and qualitative patterns emerging at different spatial scales and temporal granularities. Overall, this shows how social data, when combined with quantitative data, can provide a better understanding of the impact of a given phenomenon, such as air pollution.


Introduction
Many environmental and urban phenomena (e.g., crime figures, air pollution, and noise levels) have long been described by institutional, sensor-based databases or time-consuming surveys [1][2][3][4]. While these studies provide broad quantitative insights for the analysis and understanding of a phenomenon of interest, and many open data initiatives have facilitated their availability, they do not provide instant and qualitative reports of what is happening in the city. This is the case, for example, when the aim is to study how inhabitants perceive the impact of crime patterns or pollution levels. In recent years, social networks have provided valuable alternatives to record citizens' opinions on the progress of urban facts and events, distributed across the city, at different levels of scale, and over time [5,6].
Narratives expressed in social networks provide spontaneous and personal descriptions of geo-social and environmental phenomena that might provide complementary insights when combined with a repository of descriptive data. Indeed, social media data are far from being as objective as sensor-based or statistical databases. However, they have the advantage of being regularly expressed, almost freely available, and generally much more qualitative, although they are surely heterogeneous, relatively imprecise, and not always complete. On the methodological side, the unstructured component of social data opens a new avenue of research, as such narratives require Natural Language Processing (NLP) algorithms whose objective is to extract and reorganize the data into a form that can be manipulated. Therefore, two questions arise. First, how can these two data sources be reconciled at different levels of description in space, time, and according to the embedded semantics? Second, can social data complement descriptive data available through open data sources, enhancing a knowledge discovery process and thus enriching the understanding of a phenomenon?
This is the main methodological aim of this research, which explores how open social data associated with open data on a given phenomenon can lead to a better understanding of an urban or environmental phenomenon: not only how it is distributed in the city over time but also how it is perceived by people acting in the city. The context of our research is the impact of air pollution as recorded in the city and in the region, and how it is perceived by human beings at the local level.
The reason for this choice is that air pollution provides a context with substantial quantitative data available in the city over time and is a sensitive issue for city inhabitants, as it has important health impacts. In particular, it has been observed that humans are very reactive in social networks when the observed pollution levels worsen. Humans are also keen to describe symptoms associated with respiratory diseases, thus expressing a potential qualitative relationship with air pollution levels. We observe three fundamental problems in the observation of air pollution: (1) enriching the understanding of air pollution from social networks in combination with open data, (2) extracting narratives that describe health effects in social networks, and (3) providing a workflow able to enrich social perceptions with trends obtained from environmental and health databases.
From a methodological point of view, our research develops a framework that applies a series of data mining, natural language processing (NLP), and geographic information systems (GIS) techniques. The goal is to extend the understanding of air pollution impact from a regional perspective to an individual granularity by providing a workflow capable of, first, enriching social insights with historical environmental and health trends from open data sources and, second, allowing data integration and exploration. The main steps of our methodological framework are as follows: (1) record the historical behavior of a phenomenon, representing regional trends over time using multidimensional data cubes combining open and social data; (2) search for phenomenon narratives and emerging themes or topics in space and time; (3) characterize the emerging social perceptions using a Topic Modeling process (an NLP algorithm) that aims at discovering the semantic structures hidden in large text datasets, where the application of this algorithm should highlight the main themes that could reveal embedded social-health patterns associated with air pollution phenomena; and (4) present and highlight emerging trends using spatio-temporal analysis tools, such as choropleth maps and kernel interpolations, on the distribution and concentration of PM10, PM2.5, and CO pollutants. The framework should provide a global analysis of the correlation between pollution patterns in space and time, respiratory diseases, and trends revealed by health data and their perception through social media. The peculiarity of our approach is that it combines sensor-based pollution data, health data, and human perceptions in an integrated framework that favors the analysis and discovery of the respective objective and subjective impact of air pollution and urban health phenomena.
The main advantage of our approach is that it analyses how trends and insights can be obtained through a robust integration of open data, health data, and people's perceptions as extracted from social media. It provides a relatively complete view of a given phenomenon, from quantitative descriptive data to qualitative patterns derived from social media narratives. However, the work might still be extended by first providing a stronger computational integration of the different NLP, data mining, and GIS capabilities. A further degree of flexibility might also be provided at the interface level by allowing health experts to "play with the data" to explore the patterns that emerge throughout all steps of data processing. Finally, additional visualization capabilities might be developed at different levels of spatial scale and temporal granularity. The remainder of the paper is organized as follows. Section 2 introduces the related work, while Section 3 develops the methodological principles of our approach. Section 4 presents the results and experiments, while finally, Section 5 draws the conclusions and outlines future work.

Related Work
While many previous works have applied GIS, NLP, and data mining together to some extent to analyze urban patterns, our methodological framework goes further by providing an integrated and robust approach that combines their potential in time and space, and by applying the whole approach to a relatively large dataset composed of sensor-based data, historical health data, and human narratives as expressed in social media. The main peculiarity of our approach is that it supports a spatiotemporal correlation of quantitative and qualitative data that provides a better understanding of the real and perceived impact of air pollution phenomena in an urban environment.

Spatial Data Analysis Applied to Air Pollution and Health Studies.
In related work, hourly data of PM2.5 (fine particulate matter with diameters of generally 2.5 micrometers and smaller), PM10 (particles with diameters of 10 micrometers and smaller), and CO (carbon monoxide) pollutants were collected from 336 Chinese cities for two years to uncover the geographic and time variations and influential factors of these pollutants [7]. The study showed that all the pollutants exhibited significant weekly and diurnal cycles. These results highlighted the impact of meteorology on air pollution in China, the geographical and temporal variations, and the role played by a series of additional factors.
Under similar principles, but this time with SO2, NO2, PM2.5, PM10, CO, and O3 pollutants recorded over one year, another study aggregated an air quality index in 338 Chinese cities [8]. The air quality index values showed remarkable spatiotemporal variations across the country. The main findings were that air quality index values generally remained high throughout the country. Spatially, high or low index values were found in cities located in the North or South of China, and high index values were observed in the West and East of China. It also appears that the concentrations of PM contribute significantly to the air quality index. The study presents trends and spatial patterns of pollutants in cities, with clusters of high pollutants in the southwest of Xinjiang province and clusters of low pollutants in cities in southern and northeast China. However, despite the interest of these two approaches, no comparison with health data or with people's perception of the health impact of these pollution patterns has been developed.
In related work, a different approach studied risk perception as an indicator of the public perception of air pollution [9]. The authors found significant differences in public risk perception and attitude toward air pollution amongst regions. They applied a hierarchical linear model to explore the effects of demographic, environmental, and economic factors on the trends that appear. They found that PM2.5 has a significant influence on perceived risk factors and a negative correlation between risk perception and user satisfaction. However, health data, as potentially available from governmental institutions, were not integrated into the framework. Another study investigated the effects of ambient air pollution on hospital admissions for cardiovascular and respiratory diseases in the city of Bangkok [10]. The study analyses daily air pollution concentrations (O3, NO2, SO2, PM10, and CO) and meteorological variables from January 2006 to December 2014, together with daily hospital admissions for cardiovascular and respiratory diseases. A time series analysis examined the effects of air pollution on hospital admissions and other potential confounders. The results showed a series of clear patterns and evidence of the effects of air pollution (O3, NO2, SO2, PM10, and CO) on hospital admissions for cardiovascular and respiratory diseases. Here, no correlation was made with public perception of these health patterns.
Another study examined the variability of the impact of air pollution on life satisfaction in the cities of Beijing and Shanghai [11]. A robust negative impact of air pollution on subjective well-being was demonstrated. The authors applied a surface interpolation technique to pollution data sensed at different monitoring sites to spatially estimate SO2, NO2, PM10, and PM2.5 pollution; the results showed that all pollutants have robust negative impacts. The present work uses a similar interpolation technique. Related work explores fine particulate matter (PM2.5) and its relationship with lung cancer incidence in France [12]; the lung cancer burden attributable to PM2.5 exposure corresponded to 3.6% of all cases treated in 2015. The study uses a spatially refined nationwide chemistry-transport model with a spatial resolution of 2 km, neighborhood-scale population density data, and a relative risk from a published meta-analysis. However, the approach is purely quantitative and does not integrate additional qualitative data that could be drawn from people's perceptions of such health impacts.
Overall, these studies show the importance of detecting historical trends from open data sources about air pollution and health, as well as contextualizing such phenomena. Although air pollution impacts were either derived from health databases or, to a certain degree, social media, no sound cross-comparison and correlations were explored to evaluate the overall quantitative and qualitative impact of these health patterns over space and time.

Geo-Social Studies.
A general and conceptual framework has been developed to study specific situations of interest (e.g., epidemic outbreaks) using large-scale spatiotemporal multimedia streams [13]. Flu reports, as well as growth rates, were specifically extracted and aggregated from a case study. However, no explorations were reported to uncover trends emerging from new data as part of a knowledge discovery process.
Another related work introduced a topic tracking system to identify, monitor, analyze, and visualize important local events posted on Twitter in urban environments [14]. The main idea was to obtain not only the spatial distribution of certain geo-topics but also to analyze the evolution of the patterns that emerged. However, the data were limited to social data and did not integrate additional data derived from other resources.
A relevant example of the integration of social networks with sensor-based data to study the impact of air pollution has been developed in previous work [15]. The authors developed an implementation of a keyword-based geo-social search mechanism to look for spatial patterns in air quality complaints as revealed by Twitter data. The results showed a significant correlation over time in a series of cities in France, Brazil, and China. With the help of a dictionary of pollution-related terms, relevant posts are identified, classified, and then mapped to different urban neighborhoods and cross-compared with socio-cultural differences as they appeared from the city layout. From these historical patterns, some predictions are also generated. While sharing the methodological principles of this previous work, our methodological framework goes further by also analyzing the correlation with respiratory disease cases (e.g., "headache," "chest pain"). In addition, a series of space-time visualizations are generated at different scales, achieving a better understanding of the patterns under review. Last but not least, our approach covers two years of tweets and 30 years of air quality measurements in a large urban region with over 20 million inhabitants.

Methodology
This section introduces the main principles of our methodology and geo-social framework. The large and complex incoming data comprise three components: (1) recorded and structured digital data from long or short periods [open and structured data], (2) time-stamped and geo-referenced public posts from social networks [unstructured data], and (3) geographical information associated with air pollution phenomena [structured data]. The first component, air quality data, contains the measurements per year of the following pollutants: PM10 (particles with diameters of 10 micrometers and smaller), PM2.5 (fine particulate matter with a diameter of generally 2.5 micrometers and smaller), and CO (carbon monoxide). These data are gathered through a network of 40 monitoring stations distributed in the Valley of Mexico and administered by the Mexico City Government [16]. CO, PM10, and PM2.5 are the main parameters considered since these pollutants are produced by transport and industry in urban environments [7] and are related to respiratory diseases [10,17,18].
Health data contain information on the hospital unit and patient, diseases treated, date, and hospital postal address [19]. Population data describe the population profile in terms of locality, city, state, sex, and age (1990, 2000, and 2010). The open geospatial data set contains the administrative boundaries of Mexico. Finally, narratives are collected through a web extraction process with the public Twitter API. The following challenges were identified: (a) integration of complex structured and unstructured data obtained from open data and social networks, respectively; (b) analysis of geographic and temporal data with their representation; (c) integration of all data for pattern matching, mining, and exploration to gain additional insights; and (d) definition of search parameters on what, where, and when to extract narratives from Twitter. The data sets used in this work (structured data and gathered tweets) are available in [20].
This geo-social methodology analyses an air pollution phenomenon from a social perspective in different places and at different times. The methodology looks for trends, insights, and patterns in space and time that have occurred in the last three decades. The approach considers the heterogeneous and complex nature of the data, where new open/social media datasets appear regularly. Figure 1 depicts the main principles behind this methodology and summarizes the three stages of the methodological workflow. The first stage comprises the extraction of the geographic patterns of air pollution (i.e., how the pollutants have been distributed in Mexico Valley over the last 30 years). The idea is to identify the main and significant historical trends of air pollution in the Mexico Valley and to provide a reference to guide the extraction of social data (where and when to extract data). The second stage is the search for emerging social issues or topics, where social data are required to understand what people are mainly saying about air pollution, detecting topics related to different social contexts in space and time (e.g., where and when health and air pollution issues emerge). Here, unsupervised learning methods are applied to extract hidden topics (e.g., headache) in large text data sets that describe their semantics [21].
Finally, the last step of the framework applies geo-social pattern integration, where the goal is to look for correlations between air pollution data and open data such as health, diseases, and symptoms reflected by social media. For example, the framework looks for a relationship between PM10, PM2.5, and CO and the terms for diseases and respiratory symptoms (e.g., health impact expressed as headache) reported in social media.
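As a minimal sketch of what such a correlation check could look like, the following Python fragment computes a Pearson coefficient between a monthly pollutant series and a monthly count of symptom-related tweets. The choice of the Pearson coefficient, the function name `pearson`, and all numeric values are assumptions made for illustration; they are not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Illustrative values only (hypothetical, not results from the paper):
# monthly mean PM10 and monthly number of tweets mentioning "headache".
monthly_pm10 = [40.0, 55.0, 70.0, 65.0, 50.0, 35.0]
monthly_headache_tweets = [12, 20, 31, 27, 18, 9]

r = pearson(monthly_pm10, monthly_headache_tweets)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 or -1 would only suggest an association between the two series; it would not by itself establish that pollution causes the reported symptoms.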

Data Cube Principles.
The main principle of a data cube architecture is to organize, blend, and summarize a large database, commonly derived from open data, into a new structure (a data cube implementation phase). We have kept the spatial star schema introduced in related work primarily oriented to high-dimensional data analysis [22].
Data cubes categorize information according to multiple hierarchies or semantic dimensions, allowing analysis of location, pollutants, and health in a multidimensional way. The goal is to detect emergent patterns in open data as they appear in space and time. We differentiate between fact tables (e.g., records of pollution measurements that are analyzed quantitatively) and dimension tables (e.g., tables that describe facts using categorical or numeric information, such as time or location).
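The distinction between fact and dimension tables can be illustrated with a small in-memory sketch. The table layout, keys, attribute names, and values below are hypothetical and only show how a fact row is joined to its dimensions and summarized with mean, min, and max at a chosen granularity; the schema actually used in the study follows [22] and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Dimension tables (all keys, names, and attributes are hypothetical).
dim_location = {
    "S1": {"station": "Station 1", "region": "north-center"},
    "S2": {"station": "Station 2", "region": "south"},
}
dim_time = {
    "T1": {"year": 2019, "month": 1, "day": 10},
    "T2": {"year": 2019, "month": 1, "day": 11},
    "T3": {"year": 2019, "month": 5, "day": 14},
}

# Fact table rows: (location key, time key, pollutant, measured value).
# Toy values for illustration only.
facts = [
    ("S1", "T1", "CO", 2.0),
    ("S1", "T2", "CO", 3.0),
    ("S1", "T3", "CO", 1.5),
    ("S2", "T1", "CO", 1.0),
    ("S1", "T1", "PM10", 60.0),
]


def summarize_cube(facts, pollutant, levels):
    """Group one pollutant's facts by the given dimension attributes."""
    groups = defaultdict(list)
    for loc_key, time_key, name, value in facts:
        if name != pollutant:
            continue
        # Join the fact row with its two dimension rows.
        attributes = {**dim_location[loc_key], **dim_time[time_key]}
        groups[tuple(attributes[level] for level in levels)].append(value)
    return {
        key: {"mean": mean(vals), "min": min(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


# Coarse view (per region), then a finer view (per region and month).
co_by_region = summarize_cube(facts, "CO", ("region",))
co_by_region_month = summarize_cube(facts, "CO", ("region", "month"))
print(co_by_region_month)
```

Adding an attribute to `levels` moves to a finer granularity, and removing one moves to a coarser summary, which mirrors the drill-down and drill-up operations used in data cube exploration.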
Drill-down operations for data exploration are performed considering the various dimensions and granularities by applying statistical operations such as mean, min, and max values to quantitative columns. For example, to identify the periods and regions with the highest pollution levels, the "average" measure is applied to the <[pollutantMeasure]> column, and location information, such as city limits or the geographic coordinates of the monitoring station, is included by performing drill-across operations. This average or summary is computed individually for each pollutant. The first type of data cube is oriented to air pollution data using the time and location dimensions. The latitude/longitude coordinates give the location from where a tweet was fetched; this location is available when the set of tweets is extracted. This social cube will be complemented by additional new topics derived from the next stage.
In contrast to PM2.5 and PM10 trends, the highest average CO pollution levels rose from September (1.45 PPM, parts per million) to December (2.2 PPM) and then decreased from January (2.2 PPM) to June (1.45 PPM). The historic maximum values were recorded in January (35 PPM), May (36.9 PPM), and December (35.40 PPM). The average values reflected tolerable air quality with respect to the limit for the concentration in ambient air of 11 PPM for an 8-hour average. Accordingly, September-December and January-May are critical periods to guide the social data extraction process from Twitter in the search for a better understanding of the impact of air pollution. The main idea is to identify the relevant months in which to extract social data at stage 2 (i.e., unsupervised learning from social data).
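The comparison against the 8-hour CO limit mentioned above can be made concrete with a short sketch. It assumes that the limit is checked against the mean of every run of eight consecutive hourly values; the exact averaging rule of the applicable standard is not stated in the text, and the function names and sample values are hypothetical.

```python
def rolling_means(values, window=8):
    """Mean of every run of `window` consecutive hourly values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def exceeds_limit(hourly_co_ppm, limit_ppm=11.0, window=8):
    """True if any 8-hour mean is above the limit (assumed rule)."""
    return any(m > limit_ppm for m in rolling_means(hourly_co_ppm, window))


# Toy day with a single one-hour spike (illustrative values only).
quiet_day_with_spike = [2.0] * 7 + [35.0] + [2.0] * 16
# Toy day with a sustained episode.
sustained_episode = [2.0] * 8 + [12.0] * 8 + [2.0] * 8

print(exceeds_limit(quiet_day_with_spike))
print(exceeds_limit(sustained_episode))
```

Under this assumed rule, a short spike can leave every 8-hour mean below the limit while a sustained episode does not, which is consistent with high historic maxima coexisting with tolerable average values.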

Geographical Data and Temporal Analysis.
The purpose is to visualize the pollution trends that emerge by considering health and social data integration with pollutant measures as they appear in space and time. A specific focus is placed on a multi-criteria analysis of CO, PM10, and PM2.5 pollutant data in conjunction with respiratory diseases in the Valley of Mexico, such as "chronic obstructive pulmonary disease (COPD)," "acute lower respiratory illness (ALRI)," "cerebrovascular disease (CEV)," "ischemic heart disease (IHD)," and "lung cancer (LC)." The aim is to explore their possible association, using data cubes to summarize (i.e., drill-up operation) 30 years of pollutant records.

Mobile Information Systems
Spatial analysis is done by transforming tabular data into map layers [23]. The respiratory disease data, geographical boundaries, and pollutant values stored in data cubes are transformed into layers. Then, they are overlaid to observe emerging patterns according to different time granularities over the last three decades (respiratory disease data are available at a daily granularity) (Figures 3(b) and 3(c)).
Pollution trends remain constant in the municipalities and mayoralties of the north-central region of Mexico Valley and match high-density population areas and the highest number of respiratory disease reports. It is worth mentioning that in Mexico, people are generally treated in hospitals far from their homes because hospitals are assigned according to the ailment and not according to proximity to the patient's home. Therefore, large patient movements are generated (e.g., journeys from home to the hospital of up to two hours on average).

Pollutant Distribution Analysis. Spatial interpolation has been applied to estimate pollutant values at locations where no data were available because no monitoring stations were present. The purpose is to estimate and derive a surface pattern of the highest pollutant distribution concentrations using all pollutant values measured per date-time (yy/mm/dd, hours) over the last 30 years. We applied a kernel interpolation [24] that allows us to present the geographical and temporal distribution of these pollutants. It is worth mentioning that kernel interpolation has already been applied to interpolate environmental phenomena, for instance in air pollution modeling [25]. The parameters used in the kernel interpolation are as follows: density in this area and heavy transportation infrastructure. A considerable concentration of respiratory diseases could be related to this. In addition, the patterns are helpful in identifying where to extract data from social networks. This is the kind of pattern that needs to be enriched by social discussions related to this region, in order to better understand why the pollutants are dispersed in these directions and how people react to them (an integrated visualization of social-health insights and of the regions affected by one or more pollutants).
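To illustrate the general idea of estimating pollutant values between monitoring stations, the sketch below uses a simple Gaussian kernel-weighted average (a Nadaraya-Watson style smoother) over planar coordinates. This is a simplified stand-in and not necessarily the kernel interpolation method of [24]; the bandwidth, coordinates, station values, and function names are all hypothetical.

```python
from math import exp


def kernel_estimate(stations, x, y, bandwidth):
    """Gaussian kernel-weighted mean of station values at point (x, y).

    `stations` is a list of (x, y, value) tuples in planar units (e.g., km).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        squared_distance = (sx - x) ** 2 + (sy - y) ** 2
        weight = exp(-squared_distance / (2 * bandwidth ** 2))
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("point is too far from every station")
    return weighted_sum / weight_total


def interpolate_grid(stations, xs, ys, bandwidth):
    """Estimate a surface as rows (one per y) of values (one per x)."""
    return [[kernel_estimate(stations, x, y, bandwidth) for x in xs] for y in ys]


# Two hypothetical stations 10 km apart with toy PM10 values.
toy_stations = [(0.0, 0.0, 40.0), (10.0, 0.0, 80.0)]
surface = interpolate_grid(toy_stations, [0.0, 5.0, 10.0], [0.0], bandwidth=3.0)
print(surface)
```

A smaller bandwidth keeps each estimate close to the nearest station, while a larger one produces a smoother surface; a GIS implementation would typically also handle map projections and barriers, which this sketch ignores.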

Stage 2: Unsupervised Learning from Social Data.
The goal is to detect relevant spatio-temporal topics from Twitter social narratives. An unsupervised learning approach identifies which relevant topics are reflected in the tweets. The first task is to identify how many tweets belong to each identified topic (e.g., the "headache" topic groups 3% of the tweet dataset). For instance, a dataset of thousands of tweets can be classified into several topics and a high number of categories. This reflects one of the main challenges of our study: to find out which and how many topic categories could be contained in the whole dataset. This stage is divided into three tasks: data extraction, topic discovery, and topic visualization.

Data Extraction.
The data extraction process applied to the Twitter network uses the historical trends obtained in the previous phase as search parameters (where and when) to refine the temporal period and spatial extent to explore (e.g., September-December, north-center). These search parameters provide a starting reference for orienting the social data extraction. Pollutant data are monitored monthly over September-December and January-May, as concentrations are higher during these periods. The first extraction challenge is to define a suitable search parameter (what to extract) that can identify many tweets related to causes or effects of air pollutants (e.g., traffic concentration, excessive and illegal use of fireworks, health and environmental effects). We collected more than 38,000 geo-referenced tweets from 2018 to 2019. The extraction process runs as a semi-automatic web process that collects and monitors narratives about air pollution. The extraction search parameters are keywords, hashtags, terms, location, and an extraction buffer (i.e., 5 to 10 kilometers to cover the largest area of each municipality or city). The extraction process still needs to be refined to extract additional tweets that could be directly or indirectly associated with air pollution. The next section addresses this.
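The combination of what, when, and where search parameters can be sketched as a filter over already collected, geo-referenced posts. The snippet does not call the Twitter API; the record layout, the keyword list, the center coordinates (an approximate point in Mexico City), and the sample tweets are assumptions made for illustration.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def matches(tweet, keywords, months, center, buffer_km):
    """Apply the what (keywords), when (months), and where (buffer) filters."""
    text = tweet["text"].lower()
    return (
        any(keyword in text for keyword in keywords)
        and tweet["date"].month in months
        and haversine_km(tweet["lat"], tweet["lon"], *center) <= buffer_km
    )


# Hypothetical search parameters: critical months from the historical
# trends (September-December and January-May) and a 10 km buffer.
KEYWORDS = ("contaminación", "dolor de pecho")
CRITICAL_MONTHS = {9, 10, 11, 12, 1, 2, 3, 4, 5}
CENTER = (19.43, -99.13)  # approximate, for illustration only

sample_tweets = [
    {"id": 1, "text": "Mucha contaminación hoy en la ciudad",
     "date": date(2019, 5, 14), "lat": 19.45, "lon": -99.14},
    {"id": 2, "text": "Mucha contaminación hoy",
     "date": date(2019, 7, 1), "lat": 19.45, "lon": -99.14},
    {"id": 3, "text": "Dolor de pecho por el aire",
     "date": date(2019, 12, 3), "lat": 20.50, "lon": -99.13},
    {"id": 4, "text": "Buen día a todos",
     "date": date(2019, 12, 3), "lat": 19.45, "lon": -99.14},
]

selected_ids = [
    t["id"] for t in sample_tweets
    if matches(t, KEYWORDS, CRITICAL_MONTHS, CENTER, buffer_km=10.0)
]
print(selected_ids)
```

In this toy set only the first post passes all three filters: the second falls outside the critical months, the third lies outside the buffer, and the fourth contains no keyword.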

Discovering Topics.
This task attempts to discover the underlying semantic structure in text datasets to identify recurring patterns; these are called topics. The patterns may or may not correspond to our intuitive notion of a topic. However, they are useful for analyzing relationships between concepts contained in a set of tweets.
From the tweet datasets extracted from the previous step, the next step is to discover relevant topics from narratives. For instance, "headache" is a term or topic that can be associated with a group of tweets that collect people's opinions about the health impact of exposure to high air pollutant concentrations. Furthermore, these topics might be associated with additional words (e.g., bi-gram or tri-gram terms) that can reflect some instances of pollution levels.
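As a small illustration of how bi-gram and tri-gram terms can be obtained from tweet text, the sketch below counts word n-grams after whitespace tokenization. The tokenization rule and the sample sentences are assumptions; a real pipeline would also normalize punctuation, accents, and stop words.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, n):
    """Count n-grams over a collection of texts (lowercased, split on spaces)."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts


# Hypothetical tweets for illustration.
sample_texts = [
    "dolor de cabeza por la contaminación",
    "otra vez dolor de cabeza",
]
trigram_counts = count_ngrams(sample_texts, 3)
print(trigram_counts.most_common(1))
```

Frequent n-grams such as "dolor de cabeza" can then be treated as single candidate terms when looking for topics.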
We retained a Topic Modeling technique previously applied to discover relevant topics in related work [26]. This approach is a common mechanism in NLP, and it is based on statistical and linguistic inference. The main input is a set of tweets, and the output is a list of terms or topics where each topic represents a group of tweets. The whole process combines NLP techniques such as bigram and trigram extraction, the K-means algorithm for textual clustering, and the Latent Dirichlet Allocation (LDA) algorithm for Topic Modeling. The design of the process has been inspired by a series of related works where the K-means and LDA algorithms have been combined successfully in similar text mining scenarios [27][28][29].
Our process architecture is illustrated in Listing 1. The output includes a list of topics (e.g., a name identified during the LDA process), cluster size, and tweet IDs. The overall sequence is as follows: (1) setting hyperparameters; for example, the "K" parameter, the number of topics, is chosen by trial and error over many executions of LDA models, evaluating performance so that the same terms are not repeated across different topics and the topics remain meaningful and interpretable [30]; (2) defining stop words, that is, terms and symbols to be eliminated during the analyses at each execution step (e.g., bad words, repeated terms, key terms used for data extraction); (3) running K-means; (4) running LDA; and (5) using the discovered topics to create social data cubes for further visualization.
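The LDA step of this sequence can be sketched with a compact collapsed Gibbs sampler written in plain Python. This is an illustrative toy and not the implementation of Listing 1: the K-means step is omitted, and the hyperparameter values, stop-word list, tweet texts, and function names are all assumptions. The topic label is simply taken as the most frequent word of each topic.

```python
import random
from collections import Counter

# Hypothetical stop words and tweets, for illustration only.
STOP_WORDS = {"a", "and", "from", "the", "me", "on", "again", "today"}
toy_tweets = {
    101: "headache and burning eyes from the smog today",
    102: "smog gives me a headache and chest pain",
    103: "traffic jam on the highway again",
    104: "heavy traffic and cars everywhere on the highway",
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document and per-topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]          # tokens of doc d in topic t
    topic_word = [Counter() for _ in range(k)]   # occurrences of word w in topic t
    topic_total = [0] * k                        # tokens assigned to topic t
    assignments = []
    for d, doc in enumerate(docs):               # random initialization
        zs = [rng.randrange(k) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]            # remove the current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z            # record the resampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def summarize_topics(tweet_ids, doc_topic, topic_word):
    """Group tweets by dominant topic: topic label, cluster size, tweet IDs."""
    clusters = {}
    for tweet_id, counts in zip(tweet_ids, doc_topic):
        if not any(counts):
            continue                             # skip tweets with no tokens
        dominant = max(range(len(counts)), key=counts.__getitem__)
        clusters.setdefault(dominant, []).append(tweet_id)
    return [
        {"topic": topic_word[t].most_common(1)[0][0],
         "cluster_size": len(ids), "tweet_ids": ids}
        for t, ids in sorted(clusters.items())
    ]


docs = [tokenize(text) for text in toy_tweets.values()]
doc_topic, topic_word = lda_gibbs(docs, k=2)
topic_summary = summarize_topics(list(toy_tweets), doc_topic, topic_word)
for row in topic_summary:
    print(row)
```

With a corpus this small the resulting topics depend on the random seed, so the sketch only shows the shape of the output (topic label, cluster size, tweet IDs), not a meaningful result. In practice an established library would normally be preferred over a hand-written sampler.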
In order to visualize the topics in space and time, the next step extends the structure of the social cube (mentioned in section 4.1) with the topic modeling results. This is done by adding the dimensions <[TopicName]>, which identifies the topic name, and <[ClusterSize]>, which denotes the number of tweets grouped by topic. The "location" and "date" of the tweets that belong to the topic are included (i.e., information obtained when the tweet is extracted). The principle behind this is to cross-relate the data in order to connect each cluster with the information of its tweets. The final structure considers these dimensions; the resulting social cube organizes the topics by date and/or location. For example, it can be used to compute the maximum number of tweets with the topic "headache" grouped for a specific location, or how many times the term "headache" appeared per month in 2019. This can be done for each category identified as a key term. In general, this highlights the main topics that appear in social data when pollutant concentrations increase or decrease, and what other events are involved during this temporal frame.
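A minimal sketch of such a social cube is given below: each row links a tweet to its topic, date, and location, and counts are aggregated over any combination of these dimensions. The row layout, the `social_cube` function, and the sample rows are hypothetical and only illustrate the kind of query described above.

```python
from collections import Counter
from datetime import date

# Hypothetical rows produced by the topic modeling step.
topic_rows = [
    {"tweet_id": 1, "topic": "headache", "date": date(2019, 5, 14), "location": "north-center"},
    {"tweet_id": 2, "topic": "headache", "date": date(2019, 5, 15), "location": "north-center"},
    {"tweet_id": 3, "topic": "headache", "date": date(2019, 9, 2), "location": "south"},
    {"tweet_id": 4, "topic": "traffic", "date": date(2019, 5, 14), "location": "north-center"},
]

# How each dimension value is read from a row.
EXTRACTORS = {
    "topic": lambda row: row["topic"],
    "location": lambda row: row["location"],
    "month": lambda row: (row["date"].year, row["date"].month),
}


def social_cube(rows, dims):
    """Count tweets for every combination of the requested dimensions."""
    return Counter(tuple(EXTRACTORS[d](row) for d in dims) for row in rows)


by_topic_month = social_cube(topic_rows, ("topic", "month"))
by_topic_location = social_cube(topic_rows, ("topic", "location"))

# How many "headache" tweets appeared in May 2019?
print(by_topic_month[("headache", (2019, 5))])
# Which location has the largest "headache" cluster?
headache_by_location = {
    loc: n for (topic, loc), n in by_topic_location.items() if topic == "headache"
}
print(max(headache_by_location, key=headache_by_location.get))
```

The count per cell plays the role of the cluster size, so the same structure can feed a choropleth map (by location) or a timeline (by month).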
From the results obtained, and as a sort of recursive process, the search can be refined at the data extraction level introduced in the previous section, which might help the extraction process by providing better search parameters. For instance, when the initial extraction process was executed in 2018, the terms "air pollution" and "chest pain" appeared prominently in the topic extraction process. The latter term has in fact been used by people as a local and popular expression to describe a symptom caused by exposure to high levels of air pollutants. In 2019, the search process was improved by including the term "chest pain" as a search parameter. This data extraction generated a more specialized set of tweets related to air pollution events.

Topics Visualization.
The purpose of this task is to visualize the discovered topics. A common technique for doing so is the word cloud [31]. We found that the topic "pollution" ("contaminación" in Spanish) is the largest one: 27% of the tweets collected contain textual narratives associated with air pollution. The term "traffic" ("trafico" in Spanish) corresponds to 11% of all gathered tweets, while "health" ("salud" in Spanish) represents 20% of the gathered tweets. The "rain" term ("lluvia" in Spanish) represents 4% of the gathered tweets. This reflects the fact that in May 2019 in Mexico City, there was an environmental contingency (i.e., the highest concentration of pollutants). This forced the Mexican government to restrict the circulation of most cars and industrial activities, at a time of year when the weather is the hottest. However, two days of rain led people to express that rain helps reduce the concentration of pollutants. Over these two days, the rain topic was the main one discussed, rather than the high index of pollutant concentration.
There are other categories of minor topics, such as "headache" ("dolor de cabeza"), "chest pain" ("dolor de pecho"), "burning eyes" ("ardor ojos"), and "eyes" ("ojos"), among others. Together, they represent about 7% of the collected tweets and describe the physical effects of air pollution. Finally, a significant trend is that the importance and extent of the topics related to air pollution constantly evolve, especially when specific events associated with new terms appear (e.g., rain or the Popocatepetl volcano). Discussions on air pollution are taken into account and classified with different degrees of "active" or "inactive" depending on how present they are. Figure 5 shows the magnitude with which the social topics associated with air pollution emerged in 2019.
For instance, 11% of the tweets collected belong to the topic "traffic," which is described by different events and contexts and emerged in 2019. The "traffic" and "pollution" topics emerged with different magnitudes. The topics "environmental contingency" and "pollution" emerged from February to September, which coincided with an increase in PM10, PM2.5, and CO values from high to low. The topic "rain" emerged from February to May. It reappears in September as "acid rain," in tweets that describe poor water quality, and as "heavy rain," which matches the rainy season in Mexico City.
Summing up, we discovered at least three relevant social topics associated with the "air pollution" phenomenon: "health," "pyrotechnics," and "mass traffic," which can be refined when performing social narrative extractions. In order to highlight the topic trends from a geographical perspective, another social cube is materialized by location. It then appears that topics with large cluster sizes are concentrated in the north-center of the Valley of Mexico, where pollutant concentrations and population density are highest. The emerging topics found are "pollution," "headache," "pyrotechnics," "traffic," and other terms related to respiratory diseases.

Stage 3: Social Patterns in Space and Time.
The purpose is to derive insights that reflect geographic, temporal, and social patterns. This is done by analyzing the trends and patterns discovered in the previous stages through word clouds, social maps, and, overall, GIS overlays. The "social map" concept is introduced to overlay geographical topic distributions with other data layers (e.g., demography, respiratory diseases, and air pollution concentration surfaces). The overlay operation can also be applied to other events that occurred in the same location at different times. For example, the layers of respiratory symptom topics are overlaid with the "respiratory disease cases" and "pollution interpolation" layers. The overall process is applied similarly to other data layers. In order to interpret geo-social patterns, the topics and trends discovered in phase 1 can be used as input query parameters for cross-analyzing tweets with open data.
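The idea of a social map can be illustrated with a deliberately simplified overlay. The sketch below joins layers on a shared place name instead of intersecting geometries, so it is an attribute-based stand-in for a GIS overlay rather than the operation used in the study. The function name `overlay_layers`, the layer layout, the topic assignments, and the air quality labels are all assumptions; only the three case counts are taken from the figures reported later in this section.

```python
def overlay_layers(*layers):
    """Merge the attributes of locations that are present in every layer."""
    if not layers:
        return {}
    shared = set(layers[0])
    for layer in layers[1:]:
        shared &= set(layer)
    merged = {}
    for place in sorted(shared):
        record = {}
        for layer in layers:
            record.update(layer[place])
        merged[place] = record
    return merged


# Hypothetical topic layer (topic assignments are invented for the example).
topic_layer = {
    "Cuauhtémoc": {"topics": ["headache", "eyes"]},
    "Gustavo A. Madero": {"topics": ["throat"]},
    "Tlalpan": {"topics": ["rain"]},
}

# Respiratory disease cases as quoted in the Experiments and Results section.
disease_layer = {
    "Cuauhtémoc": {"respiratory_cases": 14250},
    "Gustavo A. Madero": {"respiratory_cases": 11835},
    "Ciudad Nezahualcóyotl": {"respiratory_cases": 3738},
}

# Hypothetical air quality classes standing in for the interpolated surface.
pollution_layer = {
    "Cuauhtémoc": {"air_quality": "bad"},
    "Gustavo A. Madero": {"air_quality": "regular"},
    "Tlalpan": {"air_quality": "regular"},
}

social_map = overlay_layers(topic_layer, disease_layer, pollution_layer)
```

Only the locations covered by all three layers survive the overlay, which is what makes co-occurring symptoms, disease cases, and poor air quality visible in one record. A real implementation would intersect point and polygon geometries in a GIS library instead of matching names.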

Experiments and Results
The objective of the experiments is to provide a geo-social characterization of the health impact of the air pollution phenomenon using open data sources and social narratives. The main insight discovered is that the social discourses reflecting the air pollution phenomenon vary not only as a function of pollutant concentrations but also follow a more general behavioral pattern that combines a series of additional dimensions and factors. The emerging trends reflect the roles of the population distribution, major social events, governmental policies, and additional environmental factors. The approach is oriented toward the three most important general categories of topics that appear: "health," "pyrotechnics," and "traffic." These topics correlate highly with the concentrations of CO, PM10, and PM2.5 pollutants, which are distributed around the northeast of the Valley of Mexico. The main topics that emerge in the north of the Valley of Mexico from July to September are "health (breathe, throat)," "stress," and "traffic." Over the year, "traffic" appeared from April to June, "stress and traffic" from July to September, and, finally, "health" topics (e.g., breathing, throat), "pyrotechnics," and "traffic" from October to December.
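A correlation between a topic and a pollutant can be checked on two aligned monthly series. The text does not state which measure was used, so the sketch below assumes Pearson's coefficient; the function name and both series are invented for illustration and are not results from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical monthly values: tweets in the "traffic" topic and mean CO level.
traffic_tweets = [120, 95, 140, 180, 210, 160]
co_concentration = [1.1, 0.9, 1.3, 1.6, 1.9, 1.5]

r = pearson(traffic_tweets, co_concentration)
```

A value of `r` close to 1 would indicate that the topic grows together with the pollutant, while a value close to 0 would indicate no linear relationship. With only a few monthly points, such a coefficient is an exploratory indicator, not evidence of causation.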
The pollutant concentrations from open data are compared with social data patterns. The topics illustrated in Figure 6 coincide with regular air quality values, although higher values were also recorded. This could explain why people reported physical discomfort due to bad air quality during this period, that is, when air quality ranged between regular and bad levels. Figure 6 shows some health impacts of air pollution, such as the "headache," "burning eyes," and "throat" topics, as they emerged when PM10 and PM2.5 values increased from October to December (there is also a significant increase when pyrotechnics emerged as a topic). Overall, "burning eyes" is the most frequently mentioned impact of air pollution as reflected by social data patterns. The mentioned topics also emerged in the center, north, and east of the Valley of Mexico.

The next experiment considered health-related topics associated with pollution narratives: "nose," "eye," "head," "chest," "throat," "chest pain," "watery eyes," "headache," and "burning eyes." These body parts and respiratory symptoms are often associated with physical discomfort caused by air pollution. These topics emerged in locations with large numbers of reported respiratory diseases and during the most polluted months (January, February, March, May, October, and December). In addition, Figure 7 shows a historical increase in respiratory disease cases from October to December, considering only the 10 years of available data (2005-2015) provided by the Mexican government. Figure 8(a) overlays respiratory diseases and health topics. The location of each topic is given by the tweet locations, while the color intensity denotes the topic magnitude. According to the last 5 years of respiratory disease data (2005-2015), there is a relevant number of cases in the northern cities of the Valley of Mexico.
For example, in the east, "Ciudad Nezahualcóyotl" (3738 cases), and counties such as "Cuauhtémoc" (14250 cases) and "Gustavo A. Madero (GAM)" (11835 cases), are places where respiratory disease cases converged with respiratory symptom topics such as "headache" ("dolor de cabeza"), "eyes" ("ojos"), and "throat" ("garganta"). Figure 9 shows an overlay between the health symptom topics and the interpolation surface of PM10, PM2.5, and CO pollutant concentrations in 2019. The source describes the concentration of these pollutants in the north of the Valley of Mexico. The locations of the respiratory symptoms "breathe" ("respirar"), "headache" ("dolor de cabeza"), "chest pain" ("duele el pecho"), "head" ("cabeza"), and "burning eyes" ("arden los ojos") are in regions where the air quality is regular or bad (orange and red colors). These topics represent groups of citizen narratives that describe respiratory symptoms and high pollution levels.
Summing up, the most recurrent social discourses associated with the health impact of the air pollution phenomenon emerged in relation to the following facts:
(1) "Health" topics emerge under the combination of a high population index in the area, a high index of pollutant concentration, and rainy or non-rainy periods (south and north).
(2) Health patterns appear, not surprisingly, to be strongly correlated with the distribution of population and pollutants, and under the influence of air pollution conditions.
(3) Respiratory symptoms emerge in social networks when respiratory disease cases increase in hospitals.
Full forms and acronyms mentioned in the study are shown in Table 1.

Conclusions and Future Work
The methodology and computational approach presented in this study combine sensor-based data with social media narratives in order to describe and study the negative effects of air pollution in the Valley of Mexico. By combining social networks and open data, the study revealed a series of health geo-social patterns associated with the impact of the air pollution phenomenon. The social media narratives related to air pollution reveal the main topics that emerge when high levels of PM10, PM2.5, and CO arise. In general, the approach can also be used as a monitoring and predictive mechanism to anticipate some pollution patterns and thus enable countermeasures.

The fundamental principle behind our approach is to extract social narrative patterns that reflect people's perception of an air pollution phenomenon at the scale of a large city and over time. For example, by considering the historically most polluted months and regions obtained by mining 30 years of data, search terms are refined by observing the narratives that emerged from the tweets gathered over a 2-year period. Our approach shows how social and open data complement each other to describe the impact of an environmental phenomenon, from the regional level to the personal level, and how they can be combined with government and open data on respiratory disease cases to contrast health findings as they emerged from social networks. The geo-social characterization of air pollution is derived from an implementable workflow framework that applies data mining, GIS methods, and NLP techniques whenever necessary. We introduced a social characterization of an air pollution phenomenon that describes the spatio-temporal dynamics of social narratives related to air quality impacts.
An additional contribution is a data exploration process based on topics and trends, which are further used as smart parameters to design data exploration mechanisms that support the spatio-temporal cross-analysis of tweets and open data. Another interesting aspect of the approach is that it shows that a given topic emerges not only when some associated patterns arise but also when some triggering precursor conditions or circumstances are likely to activate it.
Our framework combines traditional techniques from Natural Language Processing (topic modeling), Unsupervised Learning (clustering), Geographic Information Systems (spatial interpolation and choropleth maps), and Data Mining (data cube modeling) to integrate knowledge about the impact of air pollution from regional to individual granularity. Our work can be improved by considering new methods for spatio-temporal topic tracking, sentiment analysis, deep learning, semantic modeling with ontologies to improve the extraction of social data, and new big data technologies. The whole methodological workflow framework, which combines three complementary resources (i.e., text analysis, spatial data mining, and topic modeling), has the advantage of being reproducible and applicable to other phenomena at different spatial and temporal scales. Future work will focus on integrating deep learning to analyze tweet patterns and on studying the health and social impacts of the COVID-19 pandemic associated with air pollution patterns.

Data Availability
The data sets used in this work (structured data and gathered tweets) are available at http://antacom.org.mx/opendata/airpollutioncdmx.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.