Machine Learning and Domain Knowledge

Machine Learning (ML) is neither hype nor bubble. It is an inherent part of digitization feeding it with first-class data. Superfast computers, cheap memory, and standardization (on algorithms and models interoperability) turn ML into an enticing opportunity in any business.

To realize the opportunity business direly needs Data Engineers. Unlike Data Scientists, they are capable of transforming the normal business activities representing the domain knowledge into a stream of data and data-based decisions - an input to ML. This task requires hands-on experience in both fields. The more you have, the deeper you are in your belief that ML is your business partner for good.

Not the complexity of ML's math, but a lack of broad expertise in both fields is the reason why the internet search for meaningful applications of ML in desalination produces meager results focused on predicting reverse osmosis fouling or rotating equipment vibration.

ML experts are well aware of the above-mentioned challenge. That's why the ML network is the fastest-growing community. It shares any tips-and-tricks helping to train a conventional mind on the ML vision of the world.

ML vision is a double-edged sword. It does not accelerate business unless it rebuilds its foundation first. ML renders all equal - CEOs, VPs, project managers, and engineers because what matters is everyone's contribution, not a title. This quality of ML is a huge disadvantage for non-digitally minded businesses fighting to preserve archaic decision-making hierarchy.

ML equally identifies data wastes - data that is not used in important decision-making due to its ambiguity, incompleteness, or inconsistency, and intuitive decisions not backed by any data.

ML disrupts engineering thinking as it transforms its purpose from amassing and memorizing unclassified intangible knowledge to producing tangible ML models (along with algorithms and data). They may be stored, shared, collected, and updated. ML models are modern bitcoins and a new rapidly expanding market.

Let's delve into some ML problems in desalination. All of them may be readily resolved today. We start with plant engineering.

RO membrane performance prediction

Undoubtedly, reverse osmosis performance prediction starts the list. The target is to build an ML model replacing multiple software applications for reverse osmosis membrane performance prediction offered by the membrane manufacturers. Such a model is at the core of the digital twin of desalination plant operation and maintenance demonstrated by crenger.com here.

It is a supervised - learning case ideal for neural networks (NN). For simplicity, we disregard the fact that NN is actually a part of deep learning - a subtype of ML.

The database should equally represent all the commercial membranes and at least contain the following columns - membrane type, membrane area, feed temperature, feed pressure, feed salinity, feed flow rate, osmotic pressure, product output, product salinity, and pressure differential. The last three features are the ML output. Here the success lies in the leverage of feature engineering with theoretical relationships.

Predicting equipment costs and lead times

This task may be defined as follows: predict the cost of the equipment piece based on previous quotations and market price dynamics. As a rule, every equipment class has a list of price-forming features - numerical and categorical. Below is an excerpt from the features list autogenerated by crenger.com for each equipment type.

Pump: material, schedule, design, head, flowrate, impeller quantity, impeller diameter, price
Valve: material, schedule, design, size, price

As in most cases, costs are non-linear with a substantial weight of categorical features, feedforward NNs with two hidden layers are an excellent starting point here.

Prediction of lead times re-uses the same features and the batch size.

Predicting equipment performance degradation

What is common between seawater intake piping, filters, reverse osmosis membranes, and pumps? The performance degradation of all these pieces is modeled using the same technique - simple recurrent neural networks (SRNN). For pumps, performance degradation is focused on vibration and efficiency change, and for other pieces - on fouling.

Here SRNN is fed continuously with measured temporal data and it predicts the future state. The temporal network implements 2 forward-sliding windows. The first is used for reading the lately added data, and the second is for performance prediction.

This class of problems poses an interesting question - how to extend the black box of the ML model pinned to a fixed design to new designs? The answer lies in applying nondimensional analysis as part of feature engineering to the input data describing physical phenomena.

Scheduling the equipment maintenance

Luckily, in the above-mentioned cases, the performance degradation is reversible. So scheduling the maintenance frequency is the next ML problem. It may be illustrated by the operation of the seawater intake station. It is connected to the off-shore seawater intake head with long piping.

Its unrestrained biofouling leads to a decrease in the intake station pit water level, it minimum being defined by the pump design. To mitigate biofouling the intake station is often equipped with a chlorination system. What dosing rate, duration, and frequency to apply? (Let's put aside the pros and cons of chlorination compared to other alternatives.)

Two applicable ML techniques for this sort of problem are simulated annealing and genetic algorithms. Although they are trained without a training set, they are still supervised in that feedback from the neural network’s output is constantly used to train the network. Simply training data is not supplied ahead of time. This problem requires setting the boundary conditions and selecting the score function.

The mentioned algorithms may be readily applied to reverse osmosis membrane fouling control and replacement rate selection. Both problems use the ML model for membrane performance prediction described previously. Now it is used for online performance normalization and fouling calculation.

The above-mentioned ML models' tandem leads us to the fascinating problem of ML models' collective behavior. How stable is it when they are connected in series or in parallel? I first encountered this problem when developing the algorithms for summing up the performance curves of the pumps connected in series. They are part of the PlantDesigner platform.

P&ID items recognition

The piping and Instrumentation Diagram (P&ID) is a roadmap for the desalination project. It is an image densely packed with interlinked symbols each one pointing to some equipment piece or a collection of pieces.

Images are best learned with convolutional neural networks (CNN). The program for the CNN model trained to recognize the P&ID symbols includes only 60-odd lines of code (Python or Java). An immediate use of this model is for P&ID development auditing - whether all the P&ID items are sized and described.

The second far more interesting problem of P&ID item recognition is the P&ID classification. Its categories may include intake station, chemical dosing system, pumping system, ultrafiltration, and others. Here we do not need to start from scratch thanks to advances in natural language processing (NLP) working with words instead of symbols. So instead of word embeddings and bag-of-words, we will use symbol embeddings and bag-of-symbols.

Similar techniques may be applied to the plant layout parts identification.

Plant auto-wiring

Equipment pieces used for assembling the plant may be sources of discrete or analog signals or be power consumers differentiated by voltage. The plant wiring starts from mapping these pieces to the plant layout. The second step is selecting the location and the number of cable terminals - analog and discrete junction boxes, PLC cabinets, and motor control centers (MCC) to which all these pieces shall be wired. This is a classic clustering problem related to unsupervised learning, with K-means being the algorithm of choice.

What if instead of analog and discrete junction boxes RIO drops with Fieldbus-like communication are used? How to loop with one cable through all the RIO drops? This is a typical Travelling Salesman Problem (TSP) reliably solved by ML.

The mentioned problems triggered the development of the PlantDesigner platform for plant auto-wiring.

Purchase items clustering

Once equipment pieces have been sized, they shall be grouped in purchase orders. As any manufacturer specializes in a limited range of products, orders shall contain only similar equipment types. It is the same clustering problem mentioned above with one significant difference. Now similarity is not a static feature, it is an output of the products researched on the market or procured in the past.

After clustering the equipment pieces into orders, they shall be linked to guiding documentation - terms and conditions, warranties, specifications, inspections, tests procedures, etc. This process is a typical classification problem extensively discussed in natural language processing.

Good data generators

The lack of good data for the ML model training is a serious obstacle for the ML application. Engineering offers an unlimited source of good data - digital twins. This is an approach selected by crenger.com. For example, digital twin for the plant operation and maintenance models the performance degradation, equipment pieces failures, and maintenance works. In other words, a digital twin is a context for ML development.