-
PDF
- Split View
-
Views
-
Cite
Cite
Haifeng Zhang, Sanyi Yuan, Huahui Zeng, Huan Yuan, Yang Gao, Shangxu Wang, Automatic velocity analysis using interpretable multimode neural networks, Geophysical Journal International, Volume 235, Issue 1, October 2023, Pages 216–230, https://doi.org/10.1093/gji/ggad202
- Share Icon Share
SUMMARY
Seismic velocity analysis is the basis for seismic imaging and understanding complex subsurface geological structures. Although the performance of automatic velocity analysis methods based on Common Middle Point (CMP) data or Velocity Spectra (VS) is encouraging, particularly deep learning methods. However, most methods ignore the complementarity between CMP data and VS data, and only one of them is selected for velocity modelling. We propose a multimodal neural network (MMN) that combines the advantages of CMP data details representation and simplification of VS. MMN includes multilayer convolution structures and auto-encoder structures, which are used to extract time–space amplitude information from CMP gathers and energy groups features from VS data, respectively. This paper compared MMN with the CMP single-modal network (CSN) and the velocity spectra single-modal network (VSSN). Based on synthetic data, we investigated their differences in terms of continuity, accuracy, noise resistance and generalization. The MMN prediction results makes a trade-off between the overall continuity and local details. Visualization analysis of the intermediate feature maps explains the MMN velocity prediction mechanism, that is, the multi-angle representation and complementary fusion of velocity information. Finally, the performance of the proposed method is demonstrated using the braided river deposited field data example.
1 INTRODUCTION
Seismic velocity is a critical parameter in the entire seismic exploration process. An accurate velocity model is required for pre-stack imaging (Al‐Yahya 1989), seismic interpretation and even seismic inversion (Hou & Marfurt 2002; Russell 1988; Biswas et al. 2019). Many seismic velocity analysis methods, such as tomographic inversion (Hu et al. 2021; Taillandier et al. 2009; Li et al. 2019) based on traveltime, amplitude versus offset (AVO) velocity analysis (Zhang et al. 2022) based on amplitude, and full-waveform inversion (FWI, Virieux & Operto 2009) based on waveform, have been developed over the past decades.The traveltime-based inversion method updates the velocity model by the difference between the forward traveltime and the observed traveltime. To establish forward relationship, velocity spectrum or ray tracing are usually used. Anderson & Dziewonski (1984) and Stork (1992) based on the ray tracing equation, performed tomographic velocity inversion according to the picked traveltime. However, traveltime picking is difficult and sensitive to noise. Therefore, some scholars (Taner & Koehler 1969) simplify the relationship between traveltime and velocity, and different velocities forward velocity spectra with different energy distributions. Velocity is obtained by manually picking up VS, but there are some issues such as high cost and low picking efficiency. Toldi (1989) first proposed an automatic velocity pickup method based on VS, which maximizes stack energy as the objective function for inverting the velocity. Based on the VS, Meshbey et al. (2002), Fomel (2009) and Almarzoug & Ahmed (2012) derive the variational equation from velocity to traveltime and regard velocity picking as a variational problem. These automatic picking methods (Xu & White 1996; Zhu & Gibson 2018; Lumley 1997; Zhang & Claerbout 1990; Zhang 1991) have achieved good results for relatively simple geological structures. The amplitude-based method is an effective velocity inversion method. Scholars such as Swan (2001), Ratcliffe & Roberts (2003) and She et al. (2018) proposed mining velocity information from AVO analysis, constructing a residual velocity indicator from AVO analysis in Normal Moveout (NMO) gathers or CMP gathers. However, these methods only use traveltime or amplitude, so the source of velocity information is single. Waveform inversion employs more velocity information. For example, FWI (Tarantola 1984; Virieux & Operto 2009; He & Wang 2021) is a high-resolution velocity modelling method based on wavefield matching of shot gather data or CMP data. Symes (2008) and Cameron et al. (2008) proposed migration velocity analysis method, with the criterion of achieving the best imaging quality as the indicator of velocity accuracy. Overall, utilizing multiple information for velocity inversion is currently a trend. The VS or CMP data are usually used in velocity inversion.
While developing conventional geophysical methods, Araya-Polo et al. (2019), Simon et al. (2023) and Yang & Ma (2019) utilized machine learning (ML) and deep learning (DL) algorithms (LeCun et al. 2015; Goodfellow et al. 2020) to mine traveltime, amplitude and waveform information from VS or CMP data for velocity modeling. Such as clustering algorithm based on VS (Smith 2017; Ahmad & Hashmi 2016; Chen 2018a, b; Wei et al. 2018; Wang et al. 2022; Wang 2021a, b). These methods can identify the energy groups peaks of VS, thereby obtaining the stacking velocity pairs. Clustering (Hung et al. 2013) is a label-free method. However, it requires high-quality velocity spectra [e.g. high signal-to-noise ratio (SNR), high resolution, etc.]. Compared with unsupervised methods, deep learning introduces expert knowledge through labels. Models trained end-to-end can generate stable predictions on low SNR data. For example, Zhang et al. (2019) proposed an automatic velocity picking method that uses You Only Look Once (YOLO) and Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber 1997) to detect energy groups in the VS and then maps the results to stacking velocities. Park & Sacchi (2020) proposed VGG16 classification network with global and local velocity spectra as inputs for estimating stacking velocity. The VS is the most intuitive way to represent the relationship between traveltime and velocity, . But, these VS-based methods can only obtain velocity from traveltime. Some researchers have proposed to use neural networks to directly extract richer velocity information from CMP data, spanning the VS. For example, Biswas et al. (2018), Fabien-Ouellet & Sarkar (2020) and used Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) (Cho et al. 2014, 2018) to estimate seismic velocity based on the CMP gathers. However, CMP gathers usually contain multiples and random noises, which lead to complex appearances and low SNR. This makes it difficult to extract complete traveltime and amplitude information, and velocity inversion is challenging and less stable. Therefore, it relies on a large amount of labelled data set. Based on the flatness of events, Calderón-Macıás et al. (1998), Ma et al. (2018) and Ferreira et al. (2020) estimated the velocity deviation. Pairing residual velocity with Common Image Gathers (CIGs) can generate a large number of training data set, but the richness of labelled data set is low, and the limitation of small samples cannot be avoided.
These studies indicate that CMP data and velocity spectra are the are the primary representations of seismic velocity. As amplitude, waveform and traveltime are related to velocity inversion, CMP data and velocity spectra are the main modes (Baltrušaitis et al. 2018). On the other hand, it shows that few scholars can combine these two types of data to predict velocity. This paper proposes an MMN that takes CMP data and VS as input. The MMN uses convolution and auto-encoder structure in the branches to extract and fuse velocity features, and embeds the Dix formula to simultaneously estimate the interval velocity and Root-Mean-Square(RMS) velocity. First, the physical relationship between CMP gather, VS and velocity is analysed to provide a reference for the design of MMN. Second, disparate complexity Marmousi data (from noise-free to noisy) and out-of-sample Overthrust data were tested to analyse the differences between the three networks. Using two concepts in multimodal learning, ‘representation’ and ‘fusion’, the intermediate feature maps of the MMN were visualized, improving the interpretability of velocity prediction. Finally, the proposed method was validated using field data from a developing multistage river. Tests on the synthetic and real data examples demonstrate MMN has the ability to extract and fuse multimodal information, and successfully predict velocity models.
2 METHODS
2.1 Math-physical relationship between CMP data and velocity
Seismic velocity is usually related to CMP data and VS. Therefore, this paper first analyses the forward process from seismic velocity to CMP data from the perspective of traveltime and amplitude. Assuming that CMP gathers can be effectively represented by ray theory and a convolution model, the forward modelling of CMP data focuses on traveltime and amplitude factors.
According to the ray theory and Fermat’s principle, the approximate relationship between traveltime and velocity can be expressed as follows:
where the horizontal distance is represented by x, and the depth by z. |$L(h)$| is the ray path, |$T(h)$| is the ray traveltime, h is the offset and |${v}_{{\mathop{\rm int}} }(x,z)$| is the depth domain interval velocity. When the velocity model is known, the ray propagation path |$L(h)$| and the minimum traveltime |$T(h)$|are calculated by solving the ray tracing equation (Nowack 1992; Um & Thurber 1987). Considering that the amplitude at different offsets is related to velocity, according to the approximation of Zoeppritz equation (Shuey 1985), the approximate relationship between reflection coefficient and velocity can be expressed as follows:
where |${v}_{{\mathop{\rm int}} }({t}_i)$| is the interval velocity in the time domain, |$\rho $| is the density, R is the reflection coefficient with an angle of |$\theta $| and Gis a gradient and related to the velocity. Eq. (3) gives a description of the amplitude variation with angle. After obtaining traveltime and reflection coefficient, based on convolution model, forward modelling of CMP gather can be expressed as follows:
where |${{\bf R}}$| is the reflection coefficient matrix, |${{\bf N}}$| is the inverse NMO operator (Castle 1994; Miller 1992), |${{\bf W}}$| is the wavelet matrix and |${{\bf S}}$| is the CMP gather. Fis the operator generating function, which can calculate the transformation matrix |${{\bf N}}$| based on the traveltime and offset.
The physical relationship indicates that CMP data contain traveltime and amplitude information, both of which are functions of seismic velocity (as shown in Fig. 1). The polarity (positive or negative) and magnitude of the reflection coefficient can represent the interval velocity trend (increase or decrease) and gradient. The variation pattern of amplitude with offset can provide detailed velocity inversion information. Theoretically, velocity inversion information can be entirely provided by CMP data. However, weak reflection, wavelet interference, crossing events and multiple waves make the appearance of CMP gathers complex and the SNR low (as shown in Fig. 1). On the other hand, the relationship between traveltime and velocity expressed in eq. (1) is based on ray tracing equations, and the mapping relationship is complex. Therefore, deep learning algorithms based on CMP gathers have a high prediction difficulty and poor stability, and rely on the number of labelled samples.

Schematic diagram of the physical relationship between CMP gathers, velocity spectra and velocity. The CMP data express velocity through amplitude and traveltime, providing velocity details. The VS is generated based on hyperbolic prior and energy stacking, simplifying velocity information. In terms of expressing velocity, these two complement each other.
2.2 Math-physical relationship between velocity spectrum and velocity
Regarding these issues with CMP gathers, we analysed the VS generation process and tried to find another data source to reduce the reliance on training sample. The approximate relationship between VS and seismic velocity can be expressed as:
where m is the number of samples in the time window, k is the number of traces and |${v}_{\mathrm{ test}}$| is the scanning velocity. Considering the delay of the waveform, a hyperbolic band is formed based on the velocity and time window. The average energy inside the band is |$E( {t,{v}_{\mathrm{ test}}} )$|. The maximum average energy is achieved when the scanning velocity is RMS velocity. |${{\bf E}}$| is the VS matrix.
Based on |${t}^2 + {( {{\raise0.7ex\hbox{${{h}_k}$} \!{/ {\vphantom {{{h}_k} {{v}_{\mathrm{ test}}({t}_i)}}}} \!\lower0.7ex\hbox{${{v}_{test}({t}_i)}$}}} )}^2$|, different t–v pairs can be used to extract amplitudes with different traveltimes on the CMP gather, which can then be stacked to establish the relationship between energy groups and velocity. In other words, the VS can simplify the relationship between velocity and traveltime in the form of energy groups. Additionally, the stacking process can attenuate noise, especially random noise. However, if seismic velocity is estimated solely based on the VS, there will be a lack of detailed information. This is because energy stacking (as shown in eq. 6) is a lossy irreversible operation that hides or even removes the amplitude and reflection coefficient of CMP gathers that are related to velocity inversion. Consequently, the VS can only reflect low-frequency the velocity trends.
The physical relationship analysis between the two types of data mentioned above is summarized in Fig. 1. CMP data focus on expressing the amplitude–velocity relationship over the entire time and offset, providing local details. The VS contains a priori knowledge of hyperbola, simplifying the complex ray propagation relationship between traveltime and velocity. In terms of noise resistance, the VS can provide assistance. Therefore, CMP data and VS are complementary in expressing velocity.
2.3 MMN inversion architecture
Based on the forward modelling process between multimodal data and velocity, MMN is proposed to simulate the inversion process. That is, using MMN to automatically extract features from CMP data and VS for seismic velocity prediction, which consists of convolutional layers and gated recurrent units (GRU). Considering the two key points of multimodal learning: representation and fusion, three modules are designed, namely CMP representation module, VS representation module and Fusion module, for feature observation and comparison. As shown in Figs 2(a)–(c), these three modules can be decomposed and combined to form three different networks: CSN with only CMP input, the VSSN with only VS input and the MMN with CMP and VS input. As an example, the MMN is given here. The CMP representation module expands the receptive field with multiple convolution layers and max-pooling layers, so that a larger space–time information in CMP gather can be considered. The input size of this module is |${n}_t \times {n}_x \times 1$|, where |${n}_t$| and |${n}_x$|, respectively, represent the time and offset lengths. Through two types of convolutional layers with filter shapes of 2 × 2×|${\raise0.7ex\hbox{${{n}_v}$} \!{/ {\vphantom {{{n}_v} 2}}} \!\lower0.7ex\hbox{$2$}}$| (yellow solid box) and 3 × 3×|${n}_v$| (blue solid box), both with a stride of 1, where |${n}_v$| is the number of scanned velocities, the number of feature maps increases from 1 to |${n}_v$|. The a max-pooling layer is then used to receive waveform information with different offsets in order to reduce dimensionality. The second dimension is compressed to 1 along the offset, and the size of the extracted velocity feature map is |${n}_t \times {n}_v \times 1$|. Note that the filter's size can be adjusted to fit the situation, but the multilayer convolution kernel window should start large and end small to extract larger field-of-view data. The VS representation module's input size is |${n}_t \times {n}_v \times 1$|. Because VS mode mainly expresses velocity in the form of energy groups, this module adopts an auto-encoder structure to retain energy group characteristics. First, the blue and yellow solid box convolutions are used with a stride of 2 to reduce the size of the VS to 1/4 of the original size, achieving downsampling with a size of |${\raise0.7ex\hbox{${{n}_t}$} \!{/ {\vphantom {{{n}_t} 4}}} \!\lower0.7ex\hbox{$4$}} \times {\raise0.7ex\hbox{${{n}_v}$} \!{/ {\vphantom {{{n}_v} 4}}} \!\lower0.7ex\hbox{$4$}} \times {\raise0.7ex\hbox{${{n}_v}$} \!{/ {\vphantom {{{n}_v} 2}}} \!\lower0.7ex\hbox{$2$}}$|. Then, the deconvolution is performed with filters of shape 2 × 2×|${\raise0.7ex\hbox{${{n}_v}$} \!{/ {\vphantom {{{n}_v} 2}}} \!\lower0.7ex\hbox{$2$}}$| (yellow dotted box) and 3 × 3 × 1 (blue dotted box) with a stride of 2 to perform upsampling and recovery the size to |${n}_t \times {n}_v \times 1$|. Another velocity feature map is extracted from the VS branch, which can simply and directly reflect the relationship between traveltime and velocity, guiding neural network learning and accelerating convergence. The fusion module integrates velocity labels for parameter training and combines two velocity feature maps to achieve complementarity in amplitude and traveltime. The two velocity feature maps are connected to produce a dual-channel output of size |${n}_t \times {n}_v \times 2$|, which is fused using convolution (blue solid box) and deconvolution (blue dotted box). Considering that seismic velocities may have correlations at different time, GRU is used to capture this temporal relationship. The neural network outputs interval velocity features and then embeds the Dix formula to convert the interval velocity features to RMS velocity, and the real interval velocity and RMS velocity are used as labels for evaluation. The network is multitasking and outputs two types of seismic velocities, both of which are vectors of |${n}_t \times 1$|. Since this paper focuses on the impact of multimodality on velocity prediction, multitasking is not studied in detail. Under the assumption of horizontal stratification, Dix (1955) formula can be expressed as follows:
where |${\rm{\Delta }}t$| is the time sampling rate. Clearly, the Dix constraint can control the RMS velocity to increase from shallow to deep. Given network parameters, CMP gathers and VS, the MMN can output two vectors. The following objective function was chosen to assess the quality of the MMN:
where N is the number of samples, |${\theta }_{\mathrm{ net}}$| is the weight and bias parameter in the MMN, |${\mathrm{ net}}_{{\mathop{\rm int}} }({{{\bf S}}}_i,{{{\bf E}}}_i,{\theta }_{\mathrm{ net}})$| and |${\mathrm{ net}}_{{\rm{rms}}}({{{\bf S}}}_i,{{{\bf E}}}_i,{\theta }_{\mathrm{ net}})$| represent the interval velocity and RMS velocity calculated by the MMN. |${v}_{{\mathop{\rm int}} ,i}$| and|${v}_{{\rm{rms}},i}$| represent the interval velocity and RMS velocity labels. The objective function's first and second terms represent the interval and the RMS velocity losses, respectively. Like MMN, the objective functions of CSN and VSSN are made up of two terms. CSN only inputs CMP data, and VSSN only inputs VS data. Each iteration calculates the total loss and then uses Adam (Bock et al. 2018; Kingma & Ba 2014) backpropagation to update the network parameters. When the losses of the training and verification sets converge to the minimum value, the network's training is complete. The trained network can estimate RMS and interval velocity that satisfy the Dix distribution.

Network architecture diagram. (a)–(c) represent the VSSN, CSN and MMN network structures, respectively. In (c), the red arrow indicates the CMP feature map, the blue arrow indicates the VS feature map and the green arrow indicates the fused feature map.
2.4 Synthetic data set preparation
The synthetic data set was generated using the 2-D Marmousi model (Martin et al. 2006). We convert the depth domain velocity model to time domain and then prepare labels using the true interval velocity model (Fig. 3a) and RMS velocity model (Fig. 3b). We horizontally thinned the Marmousi model, cropped both sides, and obtained 737 horizontal samples with a 12 m interval. The grids of the Marmousi model is 749 × 737. As shown in Fig. 3(a), there exist thrust imbricate structure, as well as three large-scale faults, an anticline, an unconformity surface and wedges developed in the middle of the model from shallow to deep. We used ray tracing and AVO forward modelling techniques with a Ricker wavelet source of 35 Hz in frequency. By extracting gathers, the 737 CMP gathers are obtained. As shown in Fig. 3(c), the time sampling interval is 0.002 s, the number of time samples is 749, the max fold number is 37 and the max offset is 4000 m. The scanning velocity range of 1.5–3.5 km s−1 is used to calculate the VS (Fig. 3d). Before network training, we use the Z-score normalization to pre-process velocity label and CMP gathers in the training, validation, and test data set:
where Y is the raw velocity label (interval velocity or RMS velocity) or CMP data. |${\mu }_Y$| and |${\sigma }_Y$| represent the mean value and standard deviation of velocity label or CMP data and |$\overline Y $| are the normalized values of Y. Since the energy group was normalized when calculating the VS, the VS is not processed here.

2-D Marmousi model data. (a) and (b) show the interval velocity model and the RMS velocity model. (c) and (d) show examples of a single CMP gather and a VS, respectively. The red lines in (a) denote the positions of the training samples, and the blue lines indicate the positions of the validation samples.
3 EXAMPLES
In this section, we first investigate the distinction of the MMN, CSN and VSSN networks on the Marmousi model in noisy or noise-free scenarios. Then, the trained model is directly used to predict the Overthrust model to evaluate MMN's generalization. We compare the intermediate feature maps of MMN in the test set and explain the mechanism of MMN velocity prediction. Finally, the proposed method is tested using the field data.
3.1 Noise-free synthetic data example
Physical analysis suggests that MMN has a more significant advantage in small-sample conditions, we extract 10 corresponding CMP data, velocity spectra, interval velocity and RMS velocity as the training set at the position indicated by the red line in Fig. 3(a). The samples at CMP 456, 641 and 671 are extracted to as the validation set, as indicated by the blue line in Fig. 3(a). During the training, the learning rate and batch size are set to 0.0001 and 10, respectively, and the test set consists of the true Marmousi model. The statistical training and validation loss curves (i.e. the mean value of 50 training and validation loss curves) of the three networks are shown in Fig. 4. The black, blue and cyan lines respectively indicate the total loss, interval velocity loss and RMS velocity loss of the training set. The red, magenta and green lines represent the corresponding loss of the validation set. The loss curves of all three networks can converge. The loss values of MMN are the smallest, indicating that MMN extracts more velocity information by combining multimodal data and has better generalization performance. In addition, the loss values of MMN and VSSN remain unchanged after about 250 epochs, while CSN requires 600 epochs to converge. This reflects the simplification effect of velocity spectra, which allows MMN and VSSN to converge faster.

Comparisons between (a) the VSSN-, (b) the CSN- and (c) the MMN-based average training and validation loss curves via statistical tests for the Marmousi data. In each subgraph, the total loss, interval velocity loss and RMS velocity loss curves are plotted.
The trained network is applied to the test data set and the results are shown in Fig. 5. The interval velocity models predicted by VSSN, CSN, MMN and are depicted in Figs 5(b)–(d). While the predicted RMS velocity model is depicted in Figs 5(f)–(h). First, the prediction results are compared with the interpolation/extrapolation results (Figs 5a and e) using the corresponding 10 velocity labels. The interval velocities predicted by the three networks are all better than the interpolation results, and the RMS velocities are also more detailed. The large differences between the interpolated results and the true model demonstrates how difficult it is to express the entire velocity model with only 10 samples. However, the neural network method, particularly MMN, achieved satisfactory results, demonstrating the effectiveness and generalization of MMN. Furthermore, the prediction results of interval velocity can show the ability of three networks to describe high-frequency details. The interval velocity model predicted by MMN (Fig. 5d) and CSN (Fig. 5c) are similar, with clear fault and wedges. However, there are also differences, mainly in the characterization of wedges and anticline structures at 1.3–1.5 s. On the other hand, the VSSN-predicted interval velocity (Fig. 5b) is significantly different from the true, and the structural details such as wedges, fold and anticlines are lost, and the geological boundary is blurred. Only using VS is difficult to characterize the interval velocity details of complex formation units.

Comparisons between the interpolation- and the neural-network-constructed velocity models for the Marmousi data. (a)–(d) show the interval velocity models obtained from interpolation, VSSN, CSN and MMN methods, respectively. (e)–(h) show the constructed RMS velocity models.
Low-frequency RMS velocity prediction results are used to evaluate the stability of the three networks. The results predicted by MMN (Fig. 5d) are better than those of the other two single-mode networks. In the thrust imbricate region between 300 and 600 CMP, the lateral resolution of the RMS velocity model predicted by MMN or CSN (Fig. 5h or g) is higher than that of VSSN (Fig. 5f), specially at 0.8–1.2 s. Regarding the lateral continuity of RMS velocity, MMN and VSSN performed well, but the CSN prediction (Fig. 5g) is unstable, with poor lateral continuity and ‘spike’ anomalies in the shallow layer (0.4–0.8 s). These results indicate that MMN can combine the advantages of the two modes in local detail representation and overall stability.
In Fig. 6, we compare the single-trace NMO performance of CSN, VSSN and MMN. The first column (Figs 6a, f and k) represents the three CMP gathers (CMP 456, 641 and 671). The second column (Figs 6b, g and l), third column (Figs 6c, h and m) and fourth column (Figs 6d, I and n) columns show the NMO gathers estimated using VSSN, CSN and MMN velocities, respectively. The velocity spectra and the RMS velocity curves are plotted in the fifth column. In the velocity spectra, the RMS velocity curves predicted by MMN or VSSN (red or green lines) essentially track the energy groups, which can make the events in NMO gathers flatter. The predicted velocity of CSN (yellow lines) deviates from the energy groups, especially in the weak reflection region shown in the red box and the event interference region shown in the green box. This is because it is difficult to capture the complete reflection events in these complex regions, and velocity estimation based on the hyperbolic events in the CMP gathers is inaccurate, resulting in some events (red or green box) upwards or downwards. However, there are still energy groups in the VS indicating RMS velocities, so the results of MMN and VSSN are relatively accurate.

Comparisons between single- and multimode-estimated single-trace RMS velocity results and their NMO performance. (a)–(e) are the 456CMP gather, VSSN-based NMO gather, CSN-based NMO gather, MMN-based NMO gather and VS with velocity curves. (f)–(j) are the corresponding results of 641CMP and (k)–(o) are the corresponding results of 671CMP. In (e), (j) and (o), the green, yellow and red lines represent the RMS velocities predicted by VSSN, CSN and MMN, respectively. In the boxed areas, compared with the CSN method, the MMN and VSSN methods have smaller velocity errors and better NMO effects.
3.2 Noisy synthetic data example
Seismic data are often noisy, and noise-free data (Fig. 7a) are far from the real situation. MMN and CSN have essential differences in noise resistance. Therefore, 40 per cent energy random noise is introduced into the CMP gather (Fig. 7c), and the corresponding VS (Fig. 7d) was calculated. Figs 8(a)–(c) show the interval velocity models predicted by VSSN, CSN and MMN after introducing noise. The RMS velocity results are shown in Figs 8(d)–(f). Even in the case of noise, the interval velocity model predicted by MMN (Fig. 8c) is essentially the same as the result without noise (Fig. 5c), and it can still depict the boundary of the geological body. However, the CSN result (Fig. 8b) has significant differences compared to the result without noise, with unclear shallow structural contours. Similarly, the RMS velocity predicted by CSN (Fig. 8e) appears large error in shallow layers (about 0–0.4 s). This is due to the fact that noise has a greater impact on CMP gathers, especially in shallow weak reflection region, where noise can cover effective signal. RMS velocity predicted by MMN or VSSN still retains the Marousi model's lateral trend and is less affected by noise. Compared with Figs 7(b) and (d), it is also clear that random noise has little effect on the VS. As a result, using VS can improve the network's antinoise capability.

Comparisons between noise-free and noisy CMP data and their corresponding velocity spectra. (a) and (b) show the CMP gather and VS without noise. (c) and (d) show the CMP gather with 40 per cent random noise and the generated VS from (c).

Comparisons between VSSN-, CSN- and MMN-constructed velocity models results for the noisy Marmousi data. (a)–(c) show the interval velocity models obtained by the VSSN, CSN and MMN methods, respectively. (d)–(f) show the constructed RMS velocity models.
To quantify the differences between the three networks, 50 statistical tests were performed in the case of CMP gathers with and without noise, and the average error (AE) was counted to produce Table 1. In both noise-free and noisy cases, MMN had the smallest AE (the bold values in Table 1), indicating its good adaptability and robustness. In the noise-free case, AE of interval velocities predicted by MMN and CSN were 68.4887 and 70.1883, respectively, while that of VSSN was 118.998. This reflects the shortcomings of VS in predicting interval velocity. After adding noise, the AE of the velocity models constructed by the three network increase, the RMS velocity AE of CSN increased by 11.153, while that of VSSN only increased by 0.3096, which was a little change. These results are consistent with the qualitative analysis, suggesting that the velocity spectral lacks information to characterize high-frequency interval velocities, but it can enhance the network's antinoise ability.
Comparisons among three networks for velocity prediction using the noise-free and noisy data.
Method . | VSSN . | CSN . | MMN . | |||
---|---|---|---|---|---|---|
Criterion (AE) . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . |
Noise-free data | 118.998 | 33.5618 | 70.1833 | 25.6167 | 68.4887 | 21.7017 |
Noisy data | 119.549 | 33.8714 | 107.8431 | 36.7697 | 93.3670 | 31.9566 |
Method . | VSSN . | CSN . | MMN . | |||
---|---|---|---|---|---|---|
Criterion (AE) . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . |
Noise-free data | 118.998 | 33.5618 | 70.1833 | 25.6167 | 68.4887 | 21.7017 |
Noisy data | 119.549 | 33.8714 | 107.8431 | 36.7697 | 93.3670 | 31.9566 |
Comparisons among three networks for velocity prediction using the noise-free and noisy data.
Method . | VSSN . | CSN . | MMN . | |||
---|---|---|---|---|---|---|
Criterion (AE) . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . |
Noise-free data | 118.998 | 33.5618 | 70.1833 | 25.6167 | 68.4887 | 21.7017 |
Noisy data | 119.549 | 33.8714 | 107.8431 | 36.7697 | 93.3670 | 31.9566 |
Method . | VSSN . | CSN . | MMN . | |||
---|---|---|---|---|---|---|
Criterion (AE) . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . | |${v}_{{\mathop{\rm int}} }$| . | |${v}_{\mathrm{ rms}}$| . |
Noise-free data | 118.998 | 33.5618 | 70.1833 | 25.6167 | 68.4887 | 21.7017 |
Noisy data | 119.549 | 33.8714 | 107.8431 | 36.7697 | 93.3670 | 31.9566 |
3.3 Out-of-sample synthetic data example
For the generalization, the comparison between MMN prediction results (Figs 5d and h) and interpolation results (Figs 5a and e) can reflect MMN generalization to a certain extent. However, this is not enough, and it is necessary to use the Overrust model for ‘out-of-sample’ testing. A velocity model (Fig. 9a) with a grid size of 749 × 737 is obtained by resampling and clipping, and the interval velocity range is controlled between 1.5 and 5.5 km s−1. Figs 9(a) and (e) show the interval velocity model and the RMS velocity model. Forward modelling method and geometry system are consistent with Marmousi model. The source is also zero-phase Ricker wavelet with main frequency of 35 Hz. The corresponding CMP data and VS are generated as the test set. Based on the training set (10 samples) of the Marmousi model, three trained network models are directly used to predict the Overthrust model. Figs 9(b)–(d), respectively, describe the interval velocity models predicted by MMN, CSN and VSSN, and Figs 9(f)–(h), respectively, show the predicted RMS velocity models. For the Overthrust model that has never been involved in training, MMN can still produce reasonable results, which are consistent with the true velocity model in the overall geological structure and lateral trend. Furthermore, comparing the results of the three networks, the interval velocity predicted by VSSN still lacks detailed features, and the overall stability and continuity of RMS velocity predicted by CSN are poor, consistent with the above conclusions.

Comparisons between VSSN-, CSN- and MMN-constructed velocity results for the Overthrust model. (a) and (e) are the true interval velocities and RMS velocities. (b)–(d) show the interval velocity models obtained by the VSSN, CSN and MMN methods, respectively. (f)–(h) show the constructed RMS velocity models.
3.4 Mechanism of MMN prediction
Representation and fusion are the focus of MMN velocity prediction, which involves utilizing the complementarity and redundancy of multimode to represent and summarize velocity features, and connecting these features to perform comprehensive velocity prediction. Therefore, from the perspective of representation and fusion, the intermediate feature map before and after fusion is visualized to explain why MMN can successfully predict velocity. CMP 150 and 450 were selected from the test set because of their different velocity distributions. The conclusions based on these two samples are generalizable. The red, blue and green arrows shown in Fig. 2(c), respectively, indicate the feature maps extracted from the CMP branch, VS branch, and fused trunk, named ‘CMP feature map’ (Fig. 10a or d), ‘VS feature map’ (Fig. 10b or e) and ‘fused feature map’ (Fig. 10c or f). Figs 10(a)–(c) are the three intermediate feature maps extracted from CMP 150, and Figs 10(d)–(f) are the results from CMP 450. The size of the three intermediate feature maps is |${n}_t \times {n}_v \times 1$|, and the actual size of the feature maps based on the Marmousi model is |$749 \times 64 \times 1$|. CMP feature maps have significant features at the red arrow indication, as shown in Figs 10(a) and (d). Similarly, the fused feature map (Fig. 10c or f) has obvious features at the green arrow location. As artificial intelligence can easily extract information from the obvious positions of features, the feature values are extract from the columns indicated by the arrows to form the CMP feature curves (Fig. 11b or g) and the fused feature curves (Fig. 11d or i). Correspondingly, the VS feature curves (Fig. 11c or h) is formed by selecting a value from the VS's obvious feature (energy groups).

Comparisons between three types of intermediate feature maps. (a)–(c) are the CMP, VS and fusion feature maps from CMP 150. (d)–(f) are the corresponding feature maps extracted from CMP 450. The arrows in the figure point to the columns with obvious features.

Comparisons between three types of feature curves. (a)–(e) are the NMO gather, CMP feature curves, VS feature curves, fusion feature curves and true layer velocity related to CMP 150. (f)–(j) are the corresponding information related to CMP 450.
The above three types of feature curves (Figs 11b–d or g–i) have obvious response at the reflection events of NMO gather (Fig. 11a or f). Meanwhile, these feature curves also maintain a strong correlation with the interval velocity (Fig. 11e or j), indicating that MMN can extract and summarize feature curves with similar attributes. However, these feature curves also have some differences. For example, the VS feature curve is mainly dominated by low frequency and represents the overall velocity trend and low-frequency velocity background. CMP feature curves have high-frequency responses and can represent local velocity details. Therefore, MMN can characterize seismic velocity in many aspects. The effective representation of each mode is the basis for successful prediction of velocity based on multimodal information. Before fusion, each CMP feature curves (Fig. 11b or g) has some differences. The CMP feature are rich in local detail but appear redundant and complex, with a low overall stability. The VS feature curves (Fig. 11c or h) stably correspond to the strong reflection events. However, some velocity details may be omitted at the non-energy and sparse energy clusters. After fusion, the fused feature maps (Fig. 10c or f) show consistency, and its feature curves (Fig. 11d or i) are more stable than the CMP feature curves, and have more details than VS feature curves. This demonstrates that MMN can fuse velocity features from different modalities to obtain more comprehensive velocity information. This information may be hidden or not apparent in a single modality, but can be uncovered through multimodal fusion.
3.5 Field data example
In this subsection, we use the real pre-stack 3-D seismic data from an oil field in western China to test the availability of MMN in practical application. The 3-D survey (Fig. 12a) consists of 200 ILines, each with 1101 common reflection point (CRP) gathers, for a total of 220 200 gathers with a trace spacing of 50 m. Each gather has 2000 time samples with an interval of 2 ms. The seismic data in the work area are preprocessed before obtaining the CRP gather, including seismic noise attenuation (Sang et al. 2021), linear interference wave suppression and so on. The target reservoir is 1.4–2.0 s. According to geological knowledge, the reservoir unit develops multistage rivers, particularly reticulated rivers and meandering rivers near the target horizon (black line in Fig. 13). Due to the significant differences between the rivers and the surrounding rocks, the seismic profile shows local amplitude anomalies, which is referred to as a ‘bright spot.’

The field seismic data. (a) is the 3-D survey. (b) is a CRP gather. (c) is a VS from (b). In (a), the red points indicate the training sample position.

Comparisons between VSSN-, CSN- and MMN-constructed velocity model for the field seismic data. (a)–(d) show the 3-D stacking velocity models obtained by the interpolation, VSSN, CSN and MMN methods, respectively. There is one target horizon visible on the velocity model, which appears as a horizontal black line.
Each CRP gather's VS is generated using the average energy criterion. Figs 12(b) and (c) depict one of the CRP gathers and the corresponding VS. As shown by the red dotted in Fig. 12(a), 100 manually picked velocities were randomly selected as labels, and the corresponding 100 CRP gathers and velocity spectra constituted the training set to train MMN, CSN and VSSN, respectively. To evaluate MMN's performance, 440 stacking velocities are manually picked up with 50 × 10 grid. Then the stacking velocity model of the 3-D work area is obtained via interpolation, shown in Fig. 13(a) as a reference model. The trained network predicts the entire 3-D work area with 1 × 1 grid, and the results are smoothed. Figs 13(b)–(d) show stacking velocity models based on VSSN, CSN and MMN. In the overall trend of velocity, the three predicted velocity models agree well with the reference model. However, the CSN-based predictions (Fig. 13c) are unstable, with lateral discontinuities near 2 s. This is due to interference and noise between reflection events near 2 s in the CRP gather (Fig. 12b), while the VS in these regions (Fig. 12c) has clear energy groups response. Therefore, the MMN and VSSN predictions show good lateral continuity in these areas. In contrast, the VS in the 2.5–3 s region has no obvious energy group characteristics, so the velocity details of VSSN (Fig. 13b) are not as rich as those of MMN and VSSN. The MMN makes a trade-off between continuity and detail features.
We further analyse the MMN-predicted stacked velocity profiles and the corresponding imaging results. In the reservoir area of InLine888, as shown in Figs 14(a) and (b), the MMN prediction (Fig. 14b) indicated by these red arrows exhibits higher resolution and richer velocity details compared to interpolated stacking velocities (Fig. 14a). This is mainly because the velocity is picked manually based on 50 × 10 grid, and the subsequent interpolation will lead to the distortion of velocity model. MMN can achieve 1 × 1 grid dense picking. Even after smoothing processing, each velocity curve information is derived from the corresponding seismic data. Overall, as shown in Fig. 15 for stacking profile comparison, the performance of MMN is comparable to that of the results based on interpolation, depicting clear main geological structures such as folds and sandstone bright spots. The magnified subfigure on the target horizon is shown in red boxes in Figs 15(a) and (b), and the MMN method has clearer abnormal amplitude. Fig. 15(c) depicts the VS picking results at this location. The velocity curve predicted by MMN is closer to the energy cluster at 1.4 s, which improves the imaging accuracy of the ‘bright spot’ and conforms to the river response characteristics.

Comparisons between (a) interpolation- and (b) MMN-constructed stacking velocity profile on ILine 888. The red arrows indicate the differences in the details of the velocity profile.

Comparisons between the interpolation- and MMN-derived imaging profile on ILine 888. (a) and (b) are the stacking imaging results based on Figs 14(a) and (b), respectively. (c) is the VS corresponding to the red box in (a). The black line in (a) represents the target horizon. The red box and its magnified subfigure in (a) and (b) indicate the location of ‘bright spot’. The yellow line in (c) represents the MMN-predicted velocity, and the red line represents the interpolated velocity.
In order to further evaluate the imaging of the MMN on the rivers, the coherence attribute of post-stack seismic data is extracted along the target horizon (black solid line in Fig. 15) to observe the rivers boundary. The river outlines of the two slices are similar, and the river branches and general trends are mostly consistent. The MMN method (Fig. 16b) provides a continuous and complete river boundary at the three locations indicated by the red arrows. This is due to the fact that the MMN-predicted velocity model is more precise than that obtained by the interpolation method, and the corresponding stacking profiles have a more obvious ‘bright spot’ response.

Comparisons between the interpolation- and MMN-derived coherent slices using along the target horizon. (a) and (b) are coherent slices based on Fig. 15(a) and (b), respectively. The MMN method shows a clearer river boundary (as indicated by the red arrows).
4 DISCUSSION
The use of MMN has some constraints. For example, in the generalization test of the Overthrust model, it is necessary to ensure that the forward modelling method, source wavelet and observation system are consistent with the training set, and the range of velocity distribution should not differ too much. Only meet these conditions can MMN achieve better results on ‘out-of-sample models’. However, it is difficult for field data. Therefore, we cannot directly use the trained model to predict field data. For example, in this paper, we examined the velocity distribution of synthetic and field data as shown in Fig. 17. Due to the differences in velocity distribution between the synthetic data (blue and yellow lines) and the field data (green and red lines), it is necessary to manually pick labels to fine-tune the network to adapt to new wavelets, observation systems and velocity ranges. Based on these constraint conditions and the characteristics of pre-stack gathers in field data, a large number of training samples are generated using a diffusion probabilistic models (Ho et al. 2020) for pre-training the model, which is the focus of our next research on transfer learning.

Comparison of synthetic data and field data velocity distribution. The blue and yellow lines describe the velocity distribution based on the test and training sets from the Marmousi model, respectively. The green and red lines describe the velocity distribution based on the manually picked velocities and the training set velocities from field data, respectively.
It is worth noting that the method proposed in this paper combines multiple information to ensure the stability of the predicted velocity, the velocity model has a certain degree of lateral continuity. But this method is essentially based on single trace, it is difficult to consider the velocity relationship between traces. In general, the time-domain velocity modelling and imaging are interactive, and that horizon of final interval velocity should be consistent with the strong reflection events of imaging profile, that is, the geological model should be consistent. Therefore, using the geological model of the imaging profile to constrain the lateral trend of the velocity model is a further research for us.
5 CONCLUSIONS
We propose an MMN to simulate the seismic velocity inversion process by combining characteristics of CMP gather and VS. The MMN is highly stable, precise and converges quickly by combing the amplitude information in CMP data and simplified traveltime information in VS, while two single-mode networks cannot simultaneously achieve high continuity and high detail. Through test with the Overrust model, it is verified that the MMN is different from interpolation methods, as it can extract velocity information from multimodal data and has strong generalization ability. In addition, the visualization analysis of intermediate feature maps explains the success of MMN prediction lies in its ability to characterize low-frequency velocity backgrounds and high-frequency velocity details, and then complementarily fuse them. Compared to manual picking, the MMN network is automatic and more efficient, enabling high-density velocity picking and providing accurate velocity models for river sand bodies, and even describing river boundaries more clearly.
DATA AND MATERIALS AVAILABILITY
Data associated with this research are available and can be obtained by contacting the corresponding author.
ACKNOWLEDGEMENTS
This work was financially supported by the National Key R&D Program of China (2018YFA0702504), the National Natural Science Foundation of China (41974140 and 42174152) and the Strategic Cooperation Technology Projects of CNPC and CUPB (ZLZX2020-03).
CONFLICT OF INTEREST
The authors declared that they have no conflicts of interest to this work.