Search of Fuzzy Periods in the Works of Poetry of Different Authors

We applied a new method for identifying fuzzy periods, which takes insertions and deletions of characters into account, to the study of works of poetry. The technique employs a genetic algorithm, dynamic programming, and the Monte Carlo method. In the present work, it was applied to poems written by famous Russian and foreign classic authors. A total of 95 poems were studied, and fuzzy periods of high statistical significance were identified in more than half of them. A correlation between the stressed vowel letters in a poem and the positions of the fuzzy periods was shown. The present study shows that a work of poetry contains both a semantic component and fuzzy periods of letters; hence, a poem could have a psychological impact on the audience.


Introduction
Works of poetry could be considered as a superposition of the semantic content and an acoustic wave, understood as a certain periodicity of sound alternation. A poet is thus capable of combining the semantic content with a certain acoustic wave in a work of poetry. While the meaning of a poetic text is readily understood by each person, the acoustic wave embedded in it is perceived rather intuitively, as a kind of musicality that often fascinates listeners and exerts a certain psychological impact on them [1]. In order to understand the mechanism of this impact, it would be very interesting to identify and study quantitatively the acoustic wave embedded in a work of poetry, in the form of a certain periodicity of the poetic text [2,3]. To solve this problem, it seems important to develop and apply new mathematical methods that could demonstrate quantitatively the existence of an acoustic wave in a work of poetry in the form of fuzzy periods and provide quantitative characteristics of the periodicity found. This task is important because the quantitative determination of acoustic waves would make it possible to classify the acoustic waves found in works of poetry. We could then correlate a certain type of acoustic wave with its impact on a listener.
By fuzzy periods [4,5] we mean periods for which the similarity between individual periods is insignificant or absent altogether, so that the periodicity becomes statistically significant only over a set of more than 2 periods [6]. The concept can be illustrated with an example. Consider a sequence of the following form: (qzwrt)(qzwrt)(qzwrt)(qzwrt)(qzwrt)(qzwrt)(qzwrt) . . . This sequence is characterized by a perfect periodicity with a period of 5 letters. Here, each period is enclosed in parentheses for clarity. The separate periods are absolutely similar to one another, and such periodicity is easily identified using previously described techniques. Now consider a case in which each position of the period can hold only letters from a definite, limited set; for example, the sets of letters for the five period positions are {q,i,u,s,t}; {u,c,i,a,s,r}; {o,p,f,g,l,k,w}; {a,b,n,m,v}; {p,f,g,h,t,j,r}. Let us create a character sequence by randomly taking, for each period position, one letter from the corresponding set; a sequence of the following form can then be obtained: (iroap)(tufng)(sslmt)(uawaj)(qcgbf)(siknh)(sipvr) . . . The resulting character sequence lacks absolute periodicity. However, if the sequence is sufficiently long, it can be seen that only certain letters of the alphabet occur in each position of the period. Such a sequence is characterized by fuzzy periods, which cannot be identified by pairwise comparison of any two periods but can be detected using a set of more than 2 periods.
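The construction of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_fuzzy_sequence` and the fixed seed are hypothetical, while the five position sets are copied from the example above.

```python
import random

# Allowed letters for each of the 5 period positions (taken from the example in the text).
POSITION_SETS = ["qiust", "uciasr", "opfglkw", "abnmv", "pfghtjr"]

def make_fuzzy_sequence(num_periods, seed=0):
    """Draw one letter per period position from that position's set, for each period."""
    rng = random.Random(seed)
    periods = []
    for _ in range(num_periods):
        periods.append("".join(rng.choice(letters) for letters in POSITION_SETS))
    return "".join(periods)

if __name__ == "__main__":
    # Two individual periods rarely look alike, yet every position is restricted to its set.
    print(make_fuzzy_sequence(7))
```

Comparing any two periods of the output directly will usually show little similarity; the restriction on each position only becomes visible when many periods are stacked on top of one another.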
Nowadays, several mathematical techniques are employed for the detection of fuzzy periods in character and numerical sequences. These include the wavelet transform [7] and the Fourier transform [8]. Previously, the information decomposition (ID) technique was developed [4]. The difference between the ID technique and the Fourier transform is that the ID technique can be used to analyze a character sequence without recoding it into a numerical series. Such a method of analysis makes it possible to obtain results that are unattainable with the Fourier transform. This allowed fuzzy periods to be revealed in DNA sequences [5], amino acid sequences [6], and several works of poetry [4]. However, the ID technique, like the other methods discussed above, cannot find a statistically significant fuzzy period in the presence of insertions and deletions of characters, which in literary works may arise from peculiarities of pronunciation. For example, certain sounds may not be pronounced at all or may be pronounced with a certain accent. Consequently, most of the fuzzy periods contained in a sequence could not be determined using the previously developed methods.
As of today, there are mathematical approaches based on dynamic programming that allow the accurate identification of fuzzy periods in time series or character sequences in the presence of character insertions or deletions [9,10]. All these techniques construct a multiple alignment of periods, and they are based either on pairwise alignment of periods followed by the creation of a guide tree or on the search for seeds or common words in the periods. An initial multiple alignment of periods is then produced and optimized in one way or another, including the use of hidden Markov models, iterative procedures, and some other techniques [10-12]. However, none of the developed approaches can construct a multiple alignment if no statistically significant pairwise alignment exists among the analyzed sequences. In that case, either a statistically significant guide tree for the progressive alignment cannot be created, or the sequences are so different that statistically significant seeds or common words cannot be found. It turns out that, at present, it is impossible to construct a multiple alignment for significantly different sequences (periods). In this sense, all the developed approaches are "blind" and will not identify a statistically significant multiple alignment in significantly different sequences (periods). Such an alignment could be found if a multiple alignment could be constructed by applying dynamic programming directly to all the analyzed sequences. However, this is an NP-complete problem [13,14], and such an approach requires enormous computing resources that are not available at present and are unlikely to become available in the near future.
Previously, a new technique was developed for identifying fuzzy periods in character sequences, which takes insertions and deletions of characters into consideration [15,16]. This technique is based on a new solution to the NP-complete problem of the multiple alignment of sequences (periods). The method employs a genetic algorithm, techniques for optimizing weight matrices, dynamic programming, and the Monte Carlo method. It enables the identification of fuzzy periods of a character sequence with insertions and deletions in previously unknown positions. It is important to note that this analysis requires only the symbolic sequence itself (the text of the poetic work); other information about the poetic work, including the placement of stresses and features of pronunciation, is not required. In the present work, this approach was applied to the search for fuzzy periods in the poems of famous Russian and English-speaking poets. We showed that fuzzy periods can be found in more than half of the works of poetry. This study shows that a work of poetry contains both a semantic component and fuzzy periods, which could be responsible for the psychological impact of a poem on the audience. Fuzzy periods may be a reflection of the sound "wave" that exists in a poetic work.

Fuzzy Periods Search Technique Algorithm Used with Consideration of Characters' Insertions and Deletions
At the beginning of the work, the poem is transformed so that all spaces are deleted, uppercase letters are changed to lowercase, and punctuation marks are changed to spaces (Figure 1, Paragraph 1). The character sequence S is thus created from the transformed work of poetry for further evaluation. In Figure 1, Paragraph 2, the set Q(n) of random matrices of dimension k×n is generated, where n is the period length and k is the size of the alphabet of the original sequence. In Figure 1, Paragraph 3, the random matrices are modified and optimized, which is required for constructing the alignment of the sequence S. Then, in Figure 1, Paragraph 4, a search is conducted for the matrix that yields the greatest value of the similarity function when the sequence S is aligned. For this purpose, a genetic algorithm and dynamic programming are applied. At each phase of the genetic algorithm, for each matrix and the sequence S, we calculate the maximum value $E_{\max}$ of the similarity function using dynamic programming. In this case, $E_{\max}$ serves as the fitness function, and each matrix is a genotype. The genetic algorithm is then applied to the set Q(n), which causes the mutation, multiplication, and destruction of matrices. As a result, we find the matrix $M_{\max}$ that possesses the greatest $E_{\max}$ value, which we denote as $mE_{\max}$. In order to estimate the statistical significance of $mE_{\max}$, we generate the set R of random sequences (Figure 1, Paragraph 5). This set is generated by randomly mixing the sequence S. Then, for each sequence of the set R, the maximum value $E_{\max}$ of the similarity function for the matrix $M_{\max}$ is determined (Figure 1, Paragraph 6). This makes it possible to calculate the mean and variance of $E_{\max}$ over the set R and then to calculate Z(n) (Figure 1, Paragraph 7). These calculations were performed for n values from 2 to 100. As a result of the algorithm, the dependence of Z on n was obtained and was denoted as Z(n). Let us consider Paragraphs 1-7 in more detail.

Figure 1, stages 4-7: 4. Search for the matrix q_m from the set Q(n) that has the greatest value of the similarity function F(n). 5. Generation of the set R of random sequences. 6. Alignment of each random sequence with respect to the optimized matrix q_m. 7. Calculation of Z(n). If n is less than 100, then n = n + 1; otherwise STOP.

Spaces Removal, Replacing Uppercase Letters with Lowercase Letters and Replacing Punctuation Marks with Spaces.
The poem to be studied for the presence of fuzzy periods is submitted to the program input. The program replaces all uppercase letters with lowercase letters and deletes all spaces (Figure 1, Paragraph 1). Next, the program replaces punctuation marks such as the period, comma, dash, colon, question mark, and exclamation mark, as well as the end of each line, with the space character. The space plays the role of a pause [17]. The space is included in the alphabet as an additional character; thus, an alphabet of k characters is used (k = 34 for Russian works and k = 27 for works in English). At the program output, a transformed work consisting only of the characters of the given alphabet is obtained. This work is regarded as the sequence S of length N.
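A minimal sketch of this preparation step for English text is given below. The function name `transform_poem`, the exact set of pause marks (the marks listed above plus the en and em dash), and the choice to drop every other character (semicolons, apostrophes, digits) are assumptions made for illustration, since the text does not list the complete set of handled characters.

```python
import string

# Marks converted to a pause (space): those named in the text plus the end of line.
# Treating en/em dashes as dashes is an assumption.
PAUSE_MARKS = set(".,-:?!\n") | {"\u2013", "\u2014"}

def transform_poem(text, letters=string.ascii_lowercase):
    """Lowercase the text, drop original spaces, and turn pause marks into spaces."""
    out = []
    for ch in text.lower():
        if ch in PAUSE_MARKS:
            out.append(" ")      # a pause becomes the extra alphabet character
        elif ch in letters:
            out.append(ch)       # ordinary letter of the alphabet
        # original spaces and any other characters are dropped (assumption)
    return "".join(out)

if __name__ == "__main__":
    print(repr(transform_poem("Some say, in Ice!\nAnd")))
```

With the 26 lowercase Latin letters and the space, the output alphabet has 27 characters, in agreement with k = 27 for English works. Adjacent pause marks (for example, a comma at the end of a line) yield consecutive spaces in this sketch; whether the original program collapses them is not stated.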
The method of preparing the text can introduce certain distortions into the periodicity present in poetic works. However, an imperfect preparation method can only reduce the statistical significance of the periodicity, since shortcomings in the preparation of the text are compensated by additional insertions or deletions. This means that, with this text preparation, some existing periods may not be detected as significant. However, the periods that are discovered do exist in the analyzed poems.

Creating Random Matrices for the n Length Period.
The random matrices of dimension k×n, where k is the size of the alphabet and n is the period length, are generated as follows. Each matrix element m(i, j), i = 1, ..., k, j = 1, ..., n, is randomly set to 0 or 1 with equal probability. A total of $10^5$ such matrices are created for each period length n, where n varies in steps of 1 from 2 to 100. Out of the created random matrices, for each period length n we selected only $10^3$ matrices that were separated from one another by at least a certain distance in the k×n space. For this purpose, each matrix is considered as a point in the k×n space, and only those points are taken that are located at a distance of not less than $D_0$ from each other. The distance D between two points is calculated from the elements $m_1(i, j)$ of the random matrix $m_1$ and the elements $m_2(i, j)$ of the random matrix $m_2$. A matrix (a point in the k×n space) was added to the set Q(n) if the distance D between it and every matrix (point) already included in this set was greater than $D_0$. The first generated random matrix was included in the set Q(n) immediately (Figure 1, Paragraph 2).
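The selection of well-separated random matrices can be sketched as follows. The Euclidean distance used here is an assumption (the distance formula is not legible in this text), and the function names, the threshold `d0`, and the small set sizes in the example are hypothetical; the paper itself reports 10^5 candidates and 10^3 selected matrices per period length.

```python
import math
import random

def random_binary_matrix(k, n, rng):
    """k x n matrix whose elements are 0 or 1 with equal probability."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(k)]

def distance(m1, m2):
    """ASSUMPTION: Euclidean distance between two matrices viewed as points."""
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))

def build_matrix_set(k, n, d0, target_size, max_candidates, seed=0):
    """Keep a candidate only if it is farther than d0 from every matrix kept so far."""
    rng = random.Random(seed)
    selected = []
    for _ in range(max_candidates):
        candidate = random_binary_matrix(k, n, rng)
        # all() over an empty list is True, so the first candidate is always kept.
        if all(distance(candidate, m) > d0 for m in selected):
            selected.append(candidate)
            if len(selected) == target_size:
                break
    return selected

if __name__ == "__main__":
    q = build_matrix_set(k=27, n=4, d0=5.0, target_size=20, max_candidates=1000)
    print(len(q))
```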

Modification and Optimization of Random Matrices for the n Length Period.
Then, the generated random matrices of the set Q(n) were modified (Figure 1, Paragraph 3). The goal was to ensure that the distribution functions of $mE_{\max}$ obtained from different matrices of the set Q(n) on the set R of random sequences were identical. For this purpose, the algorithm described in [15] was used: each matrix was modified so that the values of $R^2$ and $K_d$ would be identical for all the matrices:

$$R^2 = \sum_{i=1}^{k}\sum_{j=1}^{n} m(i,j)^2, \qquad (3)$$

$$K_d = \sum_{i=1}^{k}\sum_{j=1}^{n} m(i,j)\,p(i,j), \qquad (4)$$

where m(i, j) is the matrix element and p(i, j) = f(i)t(j), while $\sum_{i,j} p(i,j) = 1$, f(i) = b(i)/N, where b(i) is the number of characters of type i in the sequence S with $\sum_i b(i) = N$, and t(j) = 1/n for any j. Equation (3) is the equation of a sphere of radius R in the k×n space. Equation (4) is the equation of a plane in the k×n space. Then, the modified matrices were optimized using a genetic algorithm and dynamic programming [15] (Figure 1, Paragraph 3). The genetic algorithm was applied to the entire set Q(n) at once, in order to find the matrix $M_{\max}$ and the subsequence of S that have the greatest $E_{\max}$. $E_{\max}$ is the maximum of the similarity function in the search for a local alignment [18] of the sequence S with respect to a given matrix of the set Q(n) [19]. Each matrix in the genetic algorithm is a genotype, and $E_{\max}$ acts as the fitness function. This procedure has been described in [15]. As a result of this algorithm, we obtained the matrix $M_{\max}$ (let us call it $mM_{\max}$), as well as the fragment of the sequence S (the best-aligned subsequence) that possessed the maximum value of the similarity function when aligned with the matrix $mM_{\max}$.
3.4. Generation of the R Random Sequences Set.
Random sequences were generated from the best-aligned subsequence of S obtained at the previous stage (Figure 1, Paragraph 5). The set R of random sequences was created by randomly mixing the characters of this subsequence. For this purpose, a sequence of random numbers of the same length as the subsequence was generated with a random number generator. This sequence was then sorted in ascending order, and the permutation performed was recorded. Thereafter, the same permutation was applied to the subsequence. In total, 200 random sequences were created and included in the set R.
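The shuffling procedure described above (sort a sequence of random numbers, then apply the same permutation to the characters) can be sketched as follows. The function names are hypothetical; the set size of 200 is taken from the text.

```python
import random

def shuffle_by_sorting(subsequence, rng):
    """Shuffle characters by sorting random keys and reusing the resulting permutation."""
    keys = [rng.random() for _ in subsequence]
    # 'order' records the permutation that sorts the random keys in ascending order.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return "".join(subsequence[i] for i in order)

def make_random_set(subsequence, size=200, seed=0):
    """Build the set R of shuffled copies of the subsequence."""
    rng = random.Random(seed)
    return [shuffle_by_sorting(subsequence, rng) for _ in range(size)]

if __name__ == "__main__":
    r_set = make_random_set("somesay inice", size=3)
    print(r_set)
```

Each shuffled sequence keeps the letter composition of the original subsequence, so any difference in alignment score is due to the order of the characters only.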

Random Sequences Alignment in respect to the Optimized Matrix.
For the obtained optimized matrix $M_{\max}$ (Section 3.3), the mean and variance of $E_{\max}$ were calculated. To do this, we constructed the local alignment of each random sequence of the set R with respect to the optimized matrix (Figure 1, Paragraph 6) [20]. Using this algorithm, we searched for the best local alignment between each sequence of the set R and the sequence of column numbers of the optimized matrix. For this purpose, the matrix F of the similarity function was filled using the optimized matrix $mM_{\max}$:

$$F(i,j) = \max\{0,\; F(i-1,j-1) + mM_{\max}(s(i), l),\; F(i-1,j) - d,\; F(i,j-1) - d\}, \qquad (5)$$

where s(i) is an element of the character sequence and d is the penalty for the insertion or deletion of a character in the character sequence. Here, i and j vary from 1 to N, and l = j − n·int((j−1)/n). This means that the matrix column with the number l always corresponds to the index j. The matrix F has dimension N×N, where N is the length of the character sequence. After filling the matrix F, the following value was used: $E_{\max} = \max_{i,j} F(i,j)$.
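The following sketch shows one way to compute the maximum of a local-alignment similarity function against a cyclically repeated weight matrix. It assumes a Smith-Waterman style recurrence with a linear gap penalty `d` and zero boundary values; the function name and the toy matrix are hypothetical, and the genetic-algorithm optimization of the matrix is not shown.

```python
def local_alignment_score(seq, matrix, alphabet, d):
    """Maximum local-alignment score of seq against the period positions 1..n repeated.

    matrix[a][c] is the weight of letter alphabet[a] at period position c (0-based).
    """
    n = len(matrix[0])
    length = len(seq)
    letter_row = {ch: a for a, ch in enumerate(alphabet)}
    prev = [0.0] * (length + 1)          # row i-1 of the similarity matrix
    best = 0.0
    for i in range(1, length + 1):
        cur = [0.0] * (length + 1)
        a = letter_row[seq[i - 1]]
        for j in range(1, length + 1):
            col = (j - 1) % n            # 0-based form of l = j - n*int((j-1)/n)
            cur[j] = max(0.0,
                         prev[j - 1] + matrix[a][col],  # letter placed at this position
                         prev[j] - d,                   # gap: extra letter in the sequence
                         cur[j - 1] - d)                # gap: skipped period position
            best = max(best, cur[j])
        prev = cur
    return best

if __name__ == "__main__":
    toy = [[1.0, -1.0],   # 'a' is favoured at position 1
           [-1.0, 1.0]]   # 'b' is favoured at position 2
    print(local_alignment_score("abab", toy, "ab", d=1.0))
```

Only two rows of the similarity matrix are kept in memory, because the maximum score, and not the alignment path, is needed for the statistical estimate.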

Z(n) Calculation.
As a result of the calculations using formula (5), the value of $E_{\max}$ was found for each random sequence of the set R. Then, the mean $\overline{E_{\max}}$ and the variance $D(E_{\max})$ were calculated over the set of $E_{\max}$ values obtained for the random sequences, and Z(n) was calculated as follows:

$$Z(n) = \frac{mE_{\max} - \overline{E_{\max}}}{\sqrt{D(E_{\max})}}.$$

All calculations were performed for period lengths n from 2 to 100.
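The Monte Carlo estimate of significance reduces to a standard score. A minimal sketch is shown below; the function name `z_score` is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption.

```python
import statistics

def z_score(observed, random_scores):
    """Standard score of the observed value relative to scores from shuffled sequences."""
    mean = statistics.fmean(random_scores)
    variance = statistics.pvariance(random_scores, mu=mean)  # population variance (assumption)
    return (observed - mean) / variance ** 0.5

if __name__ == "__main__":
    # 'observed' plays the role of the best score for the real text,
    # the list plays the role of the scores obtained on the random set.
    print(z_score(10.0, [1.0, 2.0, 3.0, 4.0, 5.0]))
```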

Constructing Multiple Alignment
After the algorithm had finished, the period length n with the greatest Z value was selected. For this period length, a local alignment was constructed, consisting of two sequences located one below the other. The first was a sequence of indices that varied periodically from 1 to n (denoted as index S). The second was the character sequence, namely the transformed poem (denoted as symbol S). The local alignment was used to construct the multiple alignment in the following way. The local alignment was cut into short fragments: each time the index in the index S sequence reached the value n, the alignment fragment was cut off, and so on until the very end of the local alignment. Thus, a set of alignment fragments was obtained. It should be noted that both the index S and the symbol S sequences could contain the "*" character, which denotes an insertion or a deletion. Then, the multiple alignment was constructed from the obtained alignment fragments. The process of constructing the multiple alignment is described in detail in [21].
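The cutting step can be sketched as follows. The representation of the local alignment as two parallel lists and the function name `cut_into_periods` are assumptions made for illustration; the subsequent assembly of the fragments into a multiple alignment follows [21] and is not reproduced here.

```python
def cut_into_periods(index_s, symbol_s, n):
    """Cut a local alignment into fragments, closing a fragment when the index reaches n.

    index_s  : period indices 1..n, or "*" where the index sequence has a gap
    symbol_s : aligned characters of the text, or "*" where the text has a gap
    """
    fragments, current = [], []
    for idx, sym in zip(index_s, symbol_s):
        current.append((idx, sym))
        if idx == n:                 # the period is complete: start a new fragment
            fragments.append(current)
            current = []
    if current:                      # trailing incomplete period
        fragments.append(current)
    return fragments

if __name__ == "__main__":
    idx = [3, 1, 2, 3, 1, "*", 2, 3, 1]
    for fragment in cut_into_periods(idx, "abcdefghi", n=3):
        print(fragment)
```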

Calculating the Chi-Square Distribution Using Multiple Alignment and Its Transfer into Normal Arguments
For the columns of the multiple alignment, the chi-square statistic was calculated. The column numbers were the period positions, and the "*" character was not involved in the calculation. The number of letters in the entire local alignment was counted and denoted by L. The number of letters of type i in the entire local alignment was also counted and denoted as a(i), where i is a letter of the alphabet. Then the probability of a letter of type i within the entire local alignment was calculated as p(i) = a(i)/L. The number of letters in column j, not counting "*", was denoted as c(j), where j varied from 1 to n. The value o(i, j) denoted the number of letters of type i in column j. As a result, the chi-square value was calculated for each column with (n−1) degrees of freedom:

$$\chi^2(j) = \sum_{i} \frac{\left(o(i,j) - c(j)\,p(i)\right)^2}{c(j)\,p(i)}.$$

Thereafter, the obtained chi-square value was transformed into an argument of the normal distribution. For this purpose, the Wilson-Hilferty approximation was used, based on the fact that, with increasing $\nu$, the distribution of $(\chi^2_\nu/\nu)^{1/3}$ approaches the normal distribution with the mathematical expectation $\mu = 1 - 2/(9\nu)$ and the standard deviation $\sigma = \sqrt{2/(9\nu)}$ [22]. As a result, we obtained a formula for converting to a normal distribution with the number of degrees of freedom equal to (n−1):

$$x = \frac{\left(\chi^2/(n-1)\right)^{1/3} - \left(1 - \dfrac{2}{9(n-1)}\right)}{\sqrt{\dfrac{2}{9(n-1)}}}, \qquad (7)$$

where n is the period length.
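A sketch of both calculations is given below, assuming that each column of the multiple alignment is available as a string of characters. The function names are hypothetical; the number of degrees of freedom is passed in explicitly so that the value stated in the text can be used.

```python
import math
from collections import Counter

def column_chi_squares(columns):
    """Chi-square of each column against the letter frequencies of the whole alignment."""
    cleaned = [[ch for ch in col if ch != "*"] for col in columns]  # "*" is ignored
    total = sum(len(col) for col in cleaned)
    overall = Counter(ch for col in cleaned for ch in col)
    results = []
    for col in cleaned:
        if not col:                       # empty column: nothing to compare
            results.append(0.0)
            continue
        observed = Counter(col)
        chi2 = 0.0
        for letter, count in overall.items():
            expected = len(col) * count / total
            chi2 += (observed[letter] - expected) ** 2 / expected
        results.append(chi2)
    return results

def wilson_hilferty(chi2, dof):
    """Convert a chi-square value with dof degrees of freedom to a normal argument."""
    mu = 1.0 - 2.0 / (9.0 * dof)
    sigma = math.sqrt(2.0 / (9.0 * dof))
    return ((chi2 / dof) ** (1.0 / 3.0) - mu) / sigma

if __name__ == "__main__":
    print(column_chi_squares(["aab", "bb*a"]))
    print(wilson_hilferty(2.0, 2))
```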

Calculating Mutual Information
In order to check the relationship between the stresses in a poem and the period positions, we calculated the mutual information between the stressed vowels and the period positions. In the multiple alignment, the stressed vowel letters were replaced by 1 and all other letters were replaced by 0. Then, the number of zeros and ones in each column of the multiple alignment was counted. As a result, a table of dimension 2×n was constructed, in which the first row contained the number of zeros in each column and the second row contained the number of ones in each column. Finally, the mutual information was calculated according to the following formula [23]:

$$I = \sum_{i}\sum_{j} x_{ij}\ln x_{ij} - \sum_{i} x_{i\cdot}\ln x_{i\cdot} - \sum_{j} x_{\cdot j}\ln x_{\cdot j} + N\ln N,$$

where $x_{i\cdot} = \sum_j x_{ij}$, $x_{\cdot j} = \sum_i x_{ij}$, $N = \sum_i\sum_j x_{ij}$ (here N is the table total), and $x_{ij}$ are the table elements. The value 2I is distributed as chi-square with (r−1)(n−1) degrees of freedom, where r = 2 is the number of rows in the table. In order to convert it to an argument of the normal distribution, formula (7) was used. The resulting value thus reflects the correlation of the stressed vowel letters and all the other letters with the period positions. If a correlation is present, the converted value of 2I should be greater than 4.0.
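The mutual information of a contingency table can be sketched as follows. The function name is hypothetical, and the convention that a zero cell contributes nothing to the sum (0·ln 0 = 0) is an assumption not spelled out in the text.

```python
import math

def mutual_information(table):
    """I = sum x_ij ln x_ij - sum x_i. ln x_i. - sum x_.j ln x_.j + N ln N."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0   # 0 * ln 0 is taken as 0

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return (sum(xlogx(x) for row in table for x in row)
            - sum(xlogx(x) for x in row_totals)
            - sum(xlogx(x) for x in col_totals)
            + xlogx(total))

if __name__ == "__main__":
    # First row: unstressed counts per period position; second row: stressed counts.
    print(2 * mutual_information([[10, 0], [0, 10]]))
```

For a table whose rows are proportional (no association between stress and position), the value is zero; it grows as the stressed letters concentrate in particular period positions.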

Study of Artificial Sequences
The developed algorithm was first applied to the study of artificial sequences, one of which had the form [abcdefg]45 (the string abcdefg repeated 45 times); the sequence had an alphabet of 7 letters. Random substitutions were then introduced into this sequence (the number of random substitutions is given as a percentage of the initial sequence length). Figure 2 shows the result of applying the developed algorithm to an artificial sequence randomly changed by 50%. It can be seen that Z takes its maximum value at a period length of 7 letters, while significant maxima are also observed for period lengths that are multiples of 7, although these maxima gradually decrease. Figure 3 shows the result for the artificial sequence randomly changed by 60% with the addition of 8 insertions and 12 deletions of letters, as well as for a random sequence (obtained by random mixing of the initial sequence). It can be seen that there are no fuzzy periods in the random sequence. The test results show that the technique confidently identifies fuzzy periods in the presence of insertions and deletions of characters, as well as random substitutions of characters.
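A test sequence of this kind can be generated with the sketch below. The function name and seed are hypothetical, and two details are assumptions: the positions to change are drawn without repetition, and a substituted letter is drawn uniformly from the alphabet, so it may coincide with the original letter.

```python
import random

def substitute_randomly(sequence, alphabet, fraction, seed=0):
    """Replace a given fraction of positions with letters drawn at random from the alphabet."""
    rng = random.Random(seed)
    chars = list(sequence)
    num_changes = int(fraction * len(chars))
    for pos in rng.sample(range(len(chars)), num_changes):
        chars[pos] = rng.choice(alphabet)   # may re-draw the original letter (assumption)
    return "".join(chars)

if __name__ == "__main__":
    base = "abcdefg" * 45                   # perfect period of 7 letters, length 315
    noisy = substitute_randomly(base, "abcdefg", fraction=0.5)
    print(noisy[:35])
```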

Searching for Periodicity in Works of Poetry
Afterwards, the developed algorithm was applied to identify fuzzy periods in works of poetry written in Russian and English. Some of the results, namely the fuzzy periods discovered in the poems of famous classic authors, are presented below. The poem "I remember a wonderful moment..." by the Russian classic A. Pushkin was written in iambic tetrameter (the iamb is a two-syllable verse meter with the stress on the second syllable of the foot; the foot in this case consists of two syllables). After applying the set of programs to the poem, a graph of the dependence of Z on the period length n was obtained (Figure 4). The maximum value Z = 7.04 exceeds the threshold value $Z_0$ and is reached at n = 4; i.e., the fuzzy period length is equal to 4. The threshold value $Z_0$ was determined experimentally, by calculation on random sequences obtained from the converted poem (the initial sequence) by adding a large number of random substitutions to it. It was calculated that the probability of Z > 6.0 occurring by chance was less than 5% for all the analyzed poems. William Blake's poem "Spring" was written in the two-foot trochee (the trochee is a two-syllable verse meter with the stress on the first syllable of the foot). The maximum value Z = 9.58 (Figure 5) is reached at n = 8, which means that the length of the fuzzy period is equal to 8. It should be noted that, in some cases, a sufficiently great value of Z is also registered for lengths close to the period with the maximum Z, for example, at n = 7. This is due to the addition of superfluous insertions and deletions, which simulate a periodicity close to the one found. The mutual information value, obtained from 2I by formula (7), was 4.68, which indicated the presence of a correlation between the stressed vowels and the period positions. An evaluation of the multiple alignment positions (Section 5) shows that the 6th period position is the most significant.
The poetic meter of William Blake's famous poem "The Lamb" is based on the trochee. The maximum value Z = 10.17 (Figure 6) is reached at n = 11; i.e., the length of the fuzzy period is equal to 11. It is interesting to note that the Z values are also great for period lengths close to n = 11. However, this happens because of additional insertions and deletions, which simulate the main period. The mutual information value for this poem, obtained from 2I by formula (7), is equal to 7.96, which indicates a strong relationship between the stresses in the poem and the period positions. A study of the multiple alignment positions (Section 5) shows that the 10th period position is the most significant.
The poem "Fire and Ice" by Robert Frost was also studied; it was written in iambic meter with a variable line length of either 4 or 8 syllables. The maximum value Z = 7.55 (Figure 7) was reached at n = 13; i.e., the length of the fuzzy period is equal to 13 letters.
The mutual information value, obtained from 2I by formula (7), is 6.00. After calculating the chi-square values, it turned out that the 4th period position in the multiple alignment is the most significant. Table 2 presents the multiple alignment. Selecting the most common letter in each period position of the multiple alignment gives the following set of letters: iresoleshith. In an additional study, we considered the possible interrelation between the positions of the fuzzy period and the stressed letters in the poem, using "Fire and Ice" by Robert Frost as an example. In this example, the stresses were placed manually. All the stressed letters in the poem were replaced by 1 and all the other letters by 0. After this, it became evident that there is a relationship between the stresses in the poem and the period positions in the multiple alignment, since the first period position consists almost entirely of 1s (Table 3).
The poem "Remember Thee" by George Gordon Byron was written in iambic tetrameter. The maximum value Z = 6.16 (Figure 8) was obtained at n = 15; i.e., the length of the fuzzy period is equal to 15. The mutual information value, obtained from 2I by formula (7), is 4.90, and the most significant position is the 11th period position in the multiple alignment, which consists almost entirely of the letter "h." If the most frequent letter is selected in each period position of the multiple alignment and, when several letters are equally frequent in a column, the one used most rarely in the poem is chosen, then the following set is obtained: lremombertheea.
It should be noted that, in English, the stress falls on a sound rather than on a letter. Since a sound can be written with several vowel letters, for the sake of definiteness the first letter of the sound was marked as the stressed letter. As for the placement of stresses in the poems, we placed them according to the poetic meter (iamb, trochee, etc.). Thus, in the case of the trochee, which is characterized by an alternation of a stressed and then an unstressed syllable, the stresses were placed according to this pattern. However, as noted earlier, the search for the periods themselves uses only the text itself and requires no other information.
In total, 95 poems by Russian and foreign poets were studied. In more than half of the poems studied, fuzzy periods with Z > 6.0 were found. The remaining poems also had fuzzy periods, but with Z lower than 6.0. These results could be explained by the fact that not all poems have a "clear structure" and rhyme. Poems also often combine several poetic meters, which makes the periodicity difficult to detect. Another explanation is that, in many cases, a large number of insertions or deletions of symbols in the text is required to detect fuzzy periods. Such causes can lead to a relatively low level of statistical significance of the fuzzy periods.
Table 1 shows that short fuzzy periods are encountered most often in Russian poems, whereas in English poems the fuzzy periods are longer. This can be explained by the structure of the language. For example, Russian is characterized by a frequent alternation of vowel and consonant letters, whereas in English it is more common for several consonant or vowel letters to occur consecutively, which in turn lengthens the period.

Conclusions
The present study aimed to evaluate the efficiency of a new technique for searching for fuzzy periods accompanied by insertions and deletions [15] in the texts of poems. The applied mathematical method analyzes the text of a poetic work and does not require any other information. It should also be noted that we were unable to find a statistically significant periodicity in ordinary novels in either Russian or English. This indicates that, among the linguistic texts studied, fuzzy periods are observed only in works of poetry. In general, the results of the present study suggest that a poet uses certain acoustic waves when writing a poem, and that poets use a fairly diverse set of acoustic wavelengths. The fuzzy periodicity of the text is probably a reflection of such acoustic waves. It could be assumed that the acoustic wave is rather important for ensuring the psychological impact on the audience when a poem is recited.

Figure 1: The main stages of the algorithm used for calculating Z(n) of the analyzed sequence S. Stages 1-3: 1. Remove spaces, replace capital letters with lowercase letters, and replace punctuation marks with spaces in the work, creating the sequence S. 2. Create the random matrices Q(n) for a period of length n. 3. Modify and optimize the random matrices for a period of length n.

Figure 2: Graph of Z(n) for an artificial sequence randomly changed by 50%.

Figure 7: Graph of Z(n) for the poem "Fire and Ice" by Robert Frost.

Table 1: Lengths of fuzzy periods found in 95 works of poetry by different authors.

Table 2: Multiple alignment of Robert Frost's poem "Fire and Ice." The zero line shows the positions of the period.

Table 3: Multiple alignment with letters replaced by 0 and 1 for Robert Frost's poem "Fire and Ice." The zero line shows the positions of the period.