This paper focuses on compressor systems associated with major production deferments. An advanced machine-learning approach is presented for determining anomalous behavior to predict a potential trip and probable root cause with sufficient warning to allow for intervention. This predictive-maintenance approach has the potential to reduce downtime associated with rotating-equipment failures.
Introduction
The first step in using a machine-learning system is to train the model to identify normal and abnormal operating conditions. The model can then classify real-time data from the equipment and indicate when the equipment’s performance strays outside the identified steady state. The ability to identify anomalies is a major difference between the proposed approach and traditional monitoring tools. With advances in digital technologies, correlations and warnings can be produced in a matter of minutes, allowing engineers to take appropriate preventive action when they receive a failure warning.
The authors used historical data from 2016 in their analysis of the system’s effectiveness in predicting failures. The proof-of-concept system correctly predicted 11 trip events over the course of the year, almost 50% of the 23 failures that occurred during that period. One of the more important findings was that the machine-learning model predicted many failures hours in advance; in one case, it gave 36 hours’ notice. The median warning time for eight events that were subsequently analyzed was approximately 7 hours.
Support Vector Machines (SVMs)
SVMs are used in this study as a classifier for detecting abnormal machine states. SVMs were developed for binary classification. Some authors have argued that the SVM classifier yields better results than techniques such as linear discriminant analysis and back-propagation neural networks.
The compressor operates under normal working conditions most of the time, which makes a two-class classification highly unbalanced. Because of this, one-class classification using an SVM is implemented. The algorithm is trained only on normal data and creates a representation of that data. When newly observed points differ substantially from the modeled class, they are labeled as outliers. Both linear and radial kernel functions are explored.
One of the properties of SVMs is that they can create a nonlinear decision boundary by projecting the data through a nonlinear function into a higher-dimensional space (Fig. 1). The one-class SVM creates a binary function that captures the region of the input space where most of the data exist. The resulting function returns +1 for the region defined by the training data points and –1 everywhere else.
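As a minimal sketch of this setup, assuming scikit-learn’s OneClassSVM and randomly generated stand-in data (the kernel choice, nu value, and array shapes below are illustrative, not the paper’s configuration):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data: rows are observations, columns are analog tags.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(5000, 12))      # normal operating data used for training
X_new = rng.normal(size=(100, 12)) + 3.0    # later observations to be classified

# Fit the one-class SVM on normal data only; a radial kernel is shown,
# but kernel="linear" can be explored the same way.
scaler = StandardScaler().fit(X_normal)
model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
model.fit(scaler.transform(X_normal))

# predict() returns +1 inside the learned normal region and -1 everywhere else.
labels = model.predict(scaler.transform(X_new))
print("fraction flagged abnormal:", float(np.mean(labels == -1)))
```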
The low-pressure compressor (LPC), a production-critical piece of equipment, is bypassed or manually shut down when engineers notice an anomaly. This is the primary reason the data set contains many manual shutdowns rather than real trips. For instance, in an analysis of 2017 data, of 31 documented deferments, four were manual shutdowns, seven were process standby, and 19 were classified as breakdowns. Of these 19 breakdowns, only six were caused by a fault with the LPC; the remaining 13 resulted from problems with subsystems upstream of the LPC. It was, therefore, not possible to discern all failure modes of the LPC or repeated instances of the same failure mode. Accordingly, a one-class SVM was chosen as an appropriate strategy to model the normal working of the LPC. This would enable identification of any abnormal event, irrespective of the mode of failure.
To train the SVM for the LPC and its processes, approximately 300 analog tags were identified. Analog tags are fewer in number than digital tags, but their data points are continuous rather than discrete. These 300 tags were split into two models, LPC and process. The LPC model consisted of almost 230 input tags, and the process model consisted of almost 70 tags. The process model included a number of subsystems. The tags for each model underwent preprocessing steps before being fed into the respective SVM model.
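As an illustration of this split and the preprocessing, the following sketch assumes pandas and uses hypothetical tag names and thresholds; the low-variance filtering mirrors the Removal of Low-Variance Tags step described later, while the gap-filling is purely an assumption:

```python
import pandas as pd

# Hypothetical tag lists standing in for the paper's ~300 analog tags,
# split into the LPC model (~230 tags) and the process model (~70 tags).
lpc_tags = [f"LPC_TAG_{i:03d}" for i in range(230)]
process_tags = [f"PROC_TAG_{i:03d}" for i in range(70)]

def preprocess(df: pd.DataFrame, variance_threshold: float = 1e-6) -> pd.DataFrame:
    """Illustrative preprocessing: fill short gaps and drop near-constant
    (low-variance) tags before the data are fed to the one-class SVM."""
    df = df.ffill(limit=5)
    return df.loc[:, df.var() > variance_threshold]

# lpc_features = preprocess(raw_data[lpc_tags])
# process_features = preprocess(raw_data[process_tags])
```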
To identify the normal working condition of the LPC, two strategies were explored:
- Events log. Every event on the LPC is recorded as an error and later classified as a breakdown or other type of deferment.
- Process steady state. This is determined on the basis of the positioning of key valves that indicate the LPC is online and producing.

In addition, time margins were created around each deferment for both the LPC and process models. For the LPC model, data points up to 2 hours before and after a deferment were considered abnormal; for the process model, a margin of 6 hours was used.
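A minimal labeling sketch under these assumptions follows (pandas is assumed; the deferment timestamps, index, and helper name are hypothetical):

```python
import pandas as pd

def label_abnormal(index: pd.DatetimeIndex,
                   deferment_times: list,
                   margin_hours: int) -> pd.Series:
    """Mark every observation within +/- margin_hours of a deferment as abnormal."""
    abnormal = pd.Series(False, index=index)
    margin = pd.Timedelta(hours=margin_hours)
    for t in deferment_times:
        abnormal[(index >= t - margin) & (index <= t + margin)] = True
    return abnormal

# Example: 2-hour margin for the LPC model, 6-hour margin for the process model.
# lpc_abnormal = label_abnormal(data.index, deferments, margin_hours=2)
# process_abnormal = label_abnormal(data.index, deferments, margin_hours=6)
```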
Feature Engineering. Several methods, outlined in the complete paper, were adopted to generate features beyond the raw data in order to improve training and the accuracy of the SVM models.
The best-performing one-class SVM model defines the normal working condition of the machine on the basis of the process steady state and, for both the LPC and process models, uses data aggregated over a lagged rolling window to calculate the rate of change as a training feature. The test data stream is input every 10 minutes, and the same preprocessing steps used to train the model are applied. A rolling window of 1 week was identified as the best measure for smoothing the data and calculating the rate of change from the current observation. Model output contains a reduced list of tags, approximately 200 for the LPC model and 30 for the process model; the reduction in the number of tags is explained by the preprocessing steps Removal of Low-Variance Tags and Removal of Alarms. The output of the SVM model is Boolean, indicating a data point as either normal or abnormal. This continuous stream of output indicates the state of the LPC and its process in near-real time.
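The rolling-window feature can be sketched roughly as follows, assuming pandas and a datetime-indexed frame of tag values; only the 1-week window length comes from the paper, and the exact rate-of-change formula is an assumption:

```python
import pandas as pd

def add_rate_of_change(df: pd.DataFrame, window: str = "7D") -> pd.DataFrame:
    """Smooth each tag with a 1-week rolling mean, then express the current
    observation as a rate of change relative to that lagged baseline."""
    baseline = df.rolling(window).mean()
    rate_of_change = (df - baseline) / baseline.abs().clip(lower=1e-9)
    return rate_of_change.add_suffix("_roc")

# features = add_rate_of_change(lpc_features)   # fed to the one-class LPC SVM
```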
Whenever the one-class SVM model flips from normal to abnormal (from true to false), it is considered an alert, or a change in the state of the LPC. Whenever a flip is encountered, the values of all tags are placed in descending order, and the top 10 tags are picked and reported. Root-cause identification is thus a derived mechanism that takes the SVM output as its only input.
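One plausible reading of that flip-and-rank logic, sketched with pandas (the helper names, the use of scaled tag values, and the series shown in the comments are hypothetical):

```python
import pandas as pd

def detect_flips(states: pd.Series) -> pd.Series:
    """True at timestamps where the SVM output changes from normal (True) to abnormal (False)."""
    return states.shift(1, fill_value=True) & ~states

def top_root_cause_tags(tag_values: pd.Series, n: int = 10) -> list:
    """Rank the (scaled) tag values at the flip in descending order and return the top n."""
    return tag_values.sort_values(ascending=False).head(n).index.tolist()

# flips = detect_flips(svm_output)            # Boolean series, True at each alert
# for t in flips[flips].index:
#     print(t, top_root_cause_tags(scaled_data.loc[t]))
```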
The performance data for the LPC are stored in a time-series database. A work flow was built in a commercial software program to carry out the end-to-end extract/transform/load and data-processing task. This work flow is deployed on a server and run every 10 minutes. Each time the work flow is run, the software program ingests the previous 10 minutes of data, at a 1-minute frequency, for the approximately 300 selected compressor and process tags. These data are passed through various modules to be cleaned, transformed, and validated. The data are then passed through modules embedded within the work flow; standard libraries are used to generate the binary outlier classifications from the compressor data with an SVM algorithm. The software then outputs the results, which are simply a Boolean true or false indicating a failure prediction, to a structured-query-language database. When the results change from true to false and stay false for more than 10 minutes, the software sends an email alerting that a failure has been predicted.
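A compact sketch of one such 10-minute cycle is shown below; the fetch, SQL-write, alert, and state-lookup helpers are placeholders standing in for the commercial work flow and are not the paper’s actual implementation:

```python
def run_cycle(fetch_last_10_min, preprocess, scaler, model,
              write_to_sql, last_result_was_abnormal, send_alert_email):
    """One 10-minute cycle: pull the latest data, classify it, persist the result,
    and alert when the abnormal state has persisted for more than one cycle."""
    raw = fetch_last_10_min()                             # ~300 tags at 1-minute frequency
    features = preprocess(raw)                            # same steps used during training
    labels = model.predict(scaler.transform(features))    # +1 normal, -1 abnormal
    is_normal = bool((labels == 1).all())

    write_to_sql(timestamp=raw.index[-1], normal=is_normal)
    if not is_normal and last_result_was_abnormal():
        send_alert_email("LPC failure prediction: abnormal state for more than 10 minutes")
```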
The continuous output from the two SVM models can be fed into any dashboard for visual inspection. An autogenerated email is also configured to alert engineers about the state of the LPC and the top 10 tags that can be considered potential root causes. Accuracy is measured on the basis of the number of flips generated by the model in each time period (e.g., 1 week or 1 day). The model should not flip or raise alerts so frequently that it generates random noise, yet it should flag trips sufficiently in advance for engineers to act. The current best model for an LPC generated approximately 70 flips over a 6-month period, an average of 11 alerts a month, or roughly two alerts a week. However, the alerts tended to accumulate ahead of an impending trip, indicating a buildup; the random noise is therefore fewer than two alerts a week.
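As an illustration of that accuracy measure, flips can simply be counted per period from the Boolean output; this small sketch assumes pandas and a datetime-indexed Boolean series named svm_output, and is not part of the paper’s tooling:

```python
import pandas as pd

def flips_per_period(states: pd.Series, period: str = "W") -> pd.Series:
    """Count normal-to-abnormal flips of the Boolean SVM output per period."""
    flips = states.shift(1, fill_value=True) & ~states
    return flips.resample(period).sum()

# print(flips_per_period(svm_output, "W"))   # flips per week
# print(flips_per_period(svm_output, "D"))   # flips per day
```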
The root-cause identification has performed well and has shown clear patterns when similar types of failures were encountered. For instance, in events in which the feedback arm came loose and flow could not be controlled properly, leading to a trip, the algorithm identified “LPC Standard Flow” as the top root cause. In one instance, the alert was given almost 12 hours in advance, whereas, in others, it was given only a few minutes in advance; this is attributed to manual interventions that failed and resulted in trips. In most cases, sufficient warning time exists for engineers to take preventive measures.