AI-Assisted Failure Location Platform for Optical Network

In this paper, we apply a customized AI module to an OTDR device and combine it with an optical power monitoring module to realize an AI-assisted optical network fault location mechanism for the high-density interconnection scenario of data centers. The mechanism makes full use of the data from optical links. Based on the link data, the AI module predicts the links that may fail, and the optical power module then monitors these target links, so that the mechanism can quickly locate and respond to faulty links. In our tests, introducing the AI model improved the average fault detection efficiency of the link by 98.41%.


Introduction
As data centers grow larger and their topologies become more complex, a data center failure becomes a disaster that can cause the loss of huge amounts of data and the interruption of large computations. At the same time, as the number of devices and links increases rapidly, failures in the optical networks of data centers become more frequent and the number of alarms grows, which makes faults difficult to locate and slower to rectify. How to locate a fault quickly and accurately among a large number of alarming devices has proven to be a thorny problem [1].
As reported by the Federal Communications Commission (FCC), more than one-third of service disruptions are caused by fiber-cable problems [2]. Therefore, automatic monitoring and diagnosis of optical fiber links are very beneficial. Introducing machine learning (ML) in data centers not only revolutionizes the traditional, mainly manual approach to fiber-optic network fault management [3] but also helps optical network operators plan and schedule their maintenance activities more efficiently [4], thereby saving CAPEX/OPEX and reducing the mean time to repair (MTTR) by quickly discovering and pinpointing link faults. This enables operators to more easily meet service level agreements (SLAs) and improve customer satisfaction by reducing downtime and improving network quality. In 2018, Rafique et al. [5,6] proposed an optical layer fault detection architecture based on machine learning and defined four types of optical layer faults. They suggested collecting optical power monitoring data through the southbound interface of the SDON, analyzing the data with an ANN algorithm, and uploading the analysis results through the northbound interface. In the same year, Huawei put forward an optical service fault prediction scheme combining artificial intelligence and big data technology, mainly taking the bit error rate (BER) and optical power as input to predict optical service faults, and cooperated with operators to carry out initial verification on a live OTN network. The prediction accuracy was 85%, which not only improves the robustness of the network but also reduces the cost of network inspection. Chen et al. [7] proposed a DNN-based optical transmission link fault detection scheme in which an unsupervised clustering module and a supervised DNN module were integrated to analyze the internal relationship between optical power and the alarm log in order to detect link faults.
However, the above work only addresses fault prediction and does not consider the problem of fault location.
The optical time-domain reflectometer (OTDR) is the most common tool for quality evaluation and fault location of optical fibers [8]. At present, the commonly used data center fault monitoring scheme combines optical switch polling with optical power monitoring. However, with high-density interconnection of optical networks in data centers, fault detection in this way still consumes a lot of time, which hinders troubleshooting and repair. In [4], the authors propose an OTDR optimization scheme based on LSTM, in which an LSTM model predicts possible faults from OTDR detection results. However, this method requires continuous use of the OTDR to detect link conditions, and the existing data center operation and maintenance data cannot be fully utilized. In this paper, based on the model that was realized in [9], an AI auxiliary judgment and failure location platform is designed and implemented. Using the operational data collected from the optical network links, the AI module predicts possible link failures. According to the prediction result, the platform sends instructions to the optical switch and monitors the optical power of the optical links that may fail. Once the optical power falls below the threshold, the OTDR is enabled for link detection. In our tests, the average fault detection efficiency of the link increased by 98.41%.
This paper is organized as follows: Section 2 describes the system architecture and introduces the equipment. Section 3 introduces the AI model that is used in our system. Practical application and performance analysis of the platform are discussed in Section 4. Conclusions are drawn in Section 5.

The Architecture of the Platform
The architecture diagram of our test is shown in Figure 1. The AI-assisted monitoring platform collects data from the optical link in real time. These data are used by the AI model to predict the status of the optical link. According to the prediction result, the platform issues instructions to the optical switch array, which cycles through the predicted failure links in turn until the next instruction arrives. At the same time, equipment A monitors the power of the link and starts the OTDR to detect the link when the power is lower than the threshold. This workflow is shown in Figure 2. Figure 3 shows the architecture of equipment A. In Figure 3, the laser produces a 1650 nm laser burst driven by the pulse generator. The pulse enters the optical link through the circulator. Uplink light from the optical link enters the WDM filter module through the circulator. Uplink light and 1650 nm backward scattering light enter module B, which performs OTDR data acquisition and processing, and module C, which calculates optical power. The calculation result is sent to the AI-assisted monitoring platform.
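The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's actual implementation: the function names, the `LinkReport` type, and the threshold value of -30 dBm are all hypothetical (the paper does not state the threshold), and the power meter and OTDR are passed in as plain callables so that the sketch runs without hardware.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical threshold; the paper does not give the actual value.
POWER_THRESHOLD_DBM = -30.0


@dataclass
class LinkReport:
    link_id: int
    power_dbm: float
    otdr_distance_m: Optional[float]  # None when the OTDR was not triggered


def monitor_predicted_links(
    predicted_links: Iterable[int],
    read_power_dbm: Callable[[int], float],
    run_otdr: Callable[[int], float],
    threshold_dbm: float = POWER_THRESHOLD_DBM,
) -> List[LinkReport]:
    """Visit each link predicted to fail, read its optical power, and
    start an OTDR measurement only when the power is below the threshold."""
    reports = []
    for link_id in predicted_links:
        power = read_power_dbm(link_id)
        distance = run_otdr(link_id) if power < threshold_dbm else None
        reports.append(LinkReport(link_id, power, distance))
    return reports


# Stub devices for illustration. The two power readings and the breakpoint
# distance reuse values reported in the result analysis; link 7 is made up.
stub_power = {3: -54.457, 7: -7.979}
reports = monitor_predicted_links(
    predicted_links=[3, 7],
    read_power_dbm=lambda link_id: stub_power[link_id],
    run_otdr=lambda link_id: 9852.35,
)
for report in reports:
    print(report)
```

Passing the devices in as callables keeps the decision logic (predict, switch, check power, then probe) separate from any particular instrument driver.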

AI Model Used in the Platform
This section includes a theoretical introduction and the results of the failure prediction model. Part A is mainly about the LSTM model for each feature. Part B shows the classification result of the SVM model.

LSTM Model.
A typical LSTM neural network consists of a memory cell, an input gate, a forget gate, and an output gate, as shown in Figure 4. The memory cell takes input from the output of the LSTM neural network in the last iteration. The input gate obtains a new input point from outside and processes the newly arriving data. The forget gate decides when to forget the output results, which selects the optimal time lag for the input sequence. The output gate takes all the calculated results and generates the output of the LSTM neural network cell. Compared with traditional RNNs, LSTM mitigates the problem of vanishing or exploding gradients while learning faster.
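To make the role of each gate concrete, the following sketch implements one time step of a scalar LSTM cell in plain Python, using the standard textbook formulation. It is an illustration only: the parameter layout (`params` mapping each gate name to a weight on the input, a weight on the previous output, and a bias) is an assumption made here, and the paper does not specify the configuration of the model in [9].

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (standard formulation).

    params maps each of "input", "forget", "output", "candidate"
    to a tuple (w_x, w_h, b): input weight, recurrent weight, bias.
    Returns the new output h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))       # how much new information to admit
    f = sigmoid(pre_activation("forget"))      # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))      # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate")) # candidate value for the cell state

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Run a short, made-up input sequence through the cell with arbitrary weights.
demo_params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.1]:
    h, c = lstm_step(x, h, c, demo_params)
    print(f"x={x:.1f}  h={h:.4f}  c={c:.4f}")
```

The additive update `c = f * c_prev + i * g` is what lets gradients flow across many time steps, which is the property the comparison with traditional RNNs refers to.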
We chose six features for the training of the LSTM model, namely laser bias current, input optical power, output optical power, OSNR, temperature in the model, and detection point temperature. We show LSTM results for four of these features below. Other results can be seen in [9]. In each figure, the left image shows the loss of the LSTM model in training and validation, and the right image compares the test data with the LSTM model's prediction result. Figure 5 shows the results of using LSTM for laser bias current prediction. The validation loss is less than 0.001, and the model predicts the laser bias current with high accuracy. Figure 6 shows the results of using LSTM for input optical power prediction. The validation loss is less than 0.005, and the model predicts the input optical power with high accuracy. Figure 7 shows the results of using LSTM for output optical power prediction. A few predictions are less accurate, but overall the results are accurate. Figure 8 shows the results of using LSTM for OSNR prediction. The predicted OSNR values are relatively low compared with the actual data, which will be optimized in follow-up work.
Figure 1: Diagram of AI-assisted failure location platform.

SVM Model.
The SVM is essentially a binary classification algorithm that selects the support vectors from the training data and uses them to establish a decision function [10,11]. In practical applications where the data are not linearly separable, the kernel function of the SVM realizes a mapping from a low-dimensional space to a high-dimensional space and transforms the two types of points in the low-dimensional space into linearly separable points, as shown in Figure 9.
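The idea of mapping to a higher-dimensional space can be illustrated with a small, self-contained example. The data and the feature map phi(x) = (x, x^2) below are made up for illustration and are not taken from the paper; the paper does not state which kernel its SVM uses.

```python
def feature_map(x: float) -> tuple:
    """Map a 1-D point to 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)


def kernel(x: float, y: float) -> float:
    """Kernel equal to the inner product <phi(x), phi(y)> = x*y + (x*y)^2,
    computed without building phi explicitly."""
    return x * y + (x * y) ** 2


def separable_by_threshold(a, b) -> bool:
    """True if a single threshold on the real line separates the two sets."""
    return max(a) < min(b) or max(b) < min(a)


# Two made-up classes on a line: one near the origin, one on both sides of it.
inner = [-1.0, -0.5, 0.0, 0.5, 1.0]
outer = [-3.0, -2.5, 2.5, 3.0]

# In the original 1-D space no single threshold separates the classes.
separable_1d = separable_by_threshold(inner, outer)

# After the mapping, the second coordinate (x^2) separates them.
inner_sq = [feature_map(x)[1] for x in inner]
outer_sq = [feature_map(x)[1] for x in outer]
separable_mapped = separable_by_threshold(inner_sq, outer_sq)

print("separable in 1-D:", separable_1d)
print("separable after mapping:", separable_mapped)

# The kernel returns the same value as the explicit inner product.
px, py = feature_map(2.0), feature_map(3.0)
explicit = px[0] * py[0] + px[1] * py[1]
print("explicit inner product:", explicit, " kernel:", kernel(2.0, 3.0))
```

Because the classifier only needs inner products between mapped points, evaluating `kernel` directly avoids ever constructing the high-dimensional vectors.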
The trained SVM model is used to classify the optical network status data predicted by the LSTM model and to judge whether it belongs to the failure state. We compare the classification results of the SVM model with the true results and calculate the accuracy according to (1). The resulting accuracy is 90.63%. When we calculate the failure accuracy according to (2), the result is 99.38%, which means the AI module can predict almost all failures.
In (1) and (2), TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative. As shown in Figure 10, to facilitate the presentation of the results, we divided the SVM classification results into ten groups and counted the accuracy, TN, FN, and corresponding true network failure numbers for each group. The fluctuation in accuracy between groups is related to the distribution of the results.
From Figure 10, the number of TN is very close to the actual number of failures, which means that the failure prediction accuracy is very high. The FN results show that some faultless links are predicted to be faulty links, and we compensate for this deficiency by monitoring the optical power of the links predicted to be faulty.
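Equations (1) and (2) are not reproduced in this text, so the following sketch is a reconstruction under assumptions rather than the authors' exact formulas. It assumes that (1) is the standard accuracy, (TP + TN) / (TP + TN + FP + FN), and that (2) is the share of actual failures that were predicted as failures. Following the discussion of Figure 10, the negative class is taken to be the failure state, so the actual failures are TN + FP and the assumed failure accuracy is TN / (TN + FP). The counts in the example are hypothetical and are not the paper's data.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard overall accuracy (assumed form of equation (1))."""
    return (tp + tn) / (tp + tn + fp + fn)


def failure_accuracy(tn: int, fp: int) -> float:
    """Share of actual failures that were predicted as failures
    (assumed form of equation (2), with negative = failure)."""
    return tn / (tn + fp)


# Hypothetical counts for one group of classification results.
tp, tn, fp, fn = 50, 40, 2, 8

print(f"accuracy         = {accuracy(tp, tn, fp, fn):.4f}")
print(f"failure accuracy = {failure_accuracy(tn, fp):.4f}")
```

Under this convention, FN counts faultless links flagged as faulty. These lower the overall accuracy but not the failure accuracy, which is consistent with the gap between the two reported figures.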

Result Analysis
This section shows the performance of the platform in practical application. Figure 11 shows the details of link channel 3 in normal condition when the optical switch array changes the link in turn without an AI module. "Optical power" shows the current power of channel 3, whose value is −7.979 dBm. "Distance" represents the length of the optical link. "OTDR" is set to "manual," which means the parameters shown in the figure are the result of manually turning on the OTDR probe. Figure 12 shows the logs of optical switch polling.
When the AI module predicts a link failure, it sends an instruction to the optical switch array and records prediction logs in the platform. Figure 13 is a screenshot of the recorded prediction logs. Figure 14 shows the logs of the optical switch array changing the link according to the AI prediction result. Figure 15 shows the optical power monitoring results when the link failure predicted by the AI module occurs. The OTDR mode is set to auto, which means that OTDR detection starts automatically when the optical power is abnormal. The optical power currently detected on link channel 3 is −54.457 dBm. The value of "distance" is 9852.35, which means there is a breakpoint at 9852.35 m. The curve of the OTDR detection is shown in Figure 16.
We can see from Figure 16 that there is a dramatic change in the curve near 10,000 meters, which is the position of the breakpoint. Figure 17 compares the time consumption of the conventional polling detection method and the AI-based detection method when a random fault occurs among 1024 links. The calculation formula is (3). The introduction of an AI model increases the average failure detection efficiency for failure links by 98.41%.
where t1 represents the time consumption of discovering failure links without using the AI model, and t2 represents the time consumption of discovering failure links with the AI model.
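Equation (3) is not reproduced in this text. A form consistent with the definitions of t1 and t2 above, offered here as an assumption rather than as the authors' exact formula, is the relative reduction in detection time (the symbol \eta is introduced here for illustration):

```latex
\eta = \frac{t_1 - t_2}{t_1} \times 100\%
```

Under this assumed form, the reported value of 98.41% corresponds to t2 = (1 − 0.9841) t1 ≈ 0.0159 t1, that is, the AI-based method finds a failed link in roughly 1/63 of the time needed by conventional polling.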

Conclusion
In this paper, we design an AI-assisted optical link failure prediction and failure location platform based on an AI module and test its performance. Optical power monitoring compensates for the shortcoming of the AI model, which may predict a normal state as a failure state. At the same time, the introduction of the AI model increases the average failure detection efficiency for a failure link by 98.41%. This greatly improves the efficiency of failure detection and location in data centers.

Data Availability
The data used to support the findings of this study are restricted by the China Mobile Communications Corporation in order to protect patient privacy or endangered species. Data are available from WeiJi at jiwww@sdu.edu.cn for researchers who meet the criteria for access to confidential data.

Conflicts of Interest
The authors declare that they have no conflicts of interest.