The integration of IoT technologies into cross-media art and design presents significant opportunities for dynamic interaction, user-driven personalization, and multi-sensory engagement. However, conventional systems often lack scalability, real-time adaptability, and coherent data fusion strategies, particularly when managing complex and heterogeneous input streams. To address these limitations, this paper proposes an IoT-based framework that leverages interconnected devices, real-time data processing, and a deep learning backbone to enable intelligent, responsive behavior in cross-media art installations.
Proposed framework
The IoT-enhanced framework developed in this work is designed to enable the real-time acquisition, transmission, and classification of multimodal data, namely visual, auditory, and motion inputs, thereby supporting context-aware artistic interaction. IoT-enabled sensing devices, including RGB cameras, MEMS microphones, and inertial measurement units (IMUs) comprising accelerometers and gyroscopes, are deployed to collect raw environmental and user-related data. These devices communicate with a cloud-based infrastructure through lightweight, low-latency protocols such as MQTT and UDP over Wi-Fi, enabling asynchronous and scalable data transfer24,25. To ensure high data fidelity, a dedicated media control layer performs real-time data synchronisation, noise filtering, redundancy elimination, and inconsistency reduction. Moreover, the framework incorporates a data privacy and integrity module that manages encryption, anonymisation, and secure storage operations, making it suitable for high-throughput and privacy-sensitive deployments in artistic environments. At the core of the processing pipeline lies the DeepFusionNet architecture, which employs Convolutional Neural Networks (CNNs) for spatial feature extraction, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers for modeling temporal dependencies in motion and audio streams, and fully connected layers for multimodal fusion and classification26,27. The model does not generate artistic content; instead, it classifies input contexts to trigger predefined artistic responses in real-time, interactive installations. To optimise system performance, DeepFusionNet is trained using hyperparameter tuning and evaluated on various performance metrics, including accuracy, sensitivity, F1-score, and Matthews Correlation Coefficient (MCC)28. These metrics confirm the model's reliability in accurately interpreting and responding to dynamic, real-world multimodal data. The complete pipeline, encompassing IoT-based data acquisition, secure cloud-based processing, DeepFusionNet integration, and performance evaluation, is illustrated in Fig. 1.
As shown in Fig. 1, to further support real-time responsiveness, the sensing devices were configured with modality-specific parameters. The RGB cameras captured video at 30 frames per second (fps) with a 1280 × 720 resolution, while the MEMS microphones recorded audio at a sampling rate of 44.1 kHz with a 16-bit depth. Inertial measurement units (IMUs), comprising accelerometers and gyroscopes, streamed motion data at 100 Hz, providing sufficient granularity for gesture and posture recognition. All modalities were synchronized using a timestamp-based alignment strategy, supported by the Network Time Protocol (NTP) to ensure consistent timing across devices. A lightweight buffering mechanism was applied to correct minor delays (less than 10 ms), enabling coherent fusion of visual, auditory, and motion streams within the DeepFusionNet pipeline.
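As a concrete illustration of this acquisition step, the sketch below publishes timestamped IMU samples over MQTT with the paho-mqtt client. The broker address, topic name, and sensor-reading helper are hypothetical placeholders rather than details of the deployed installation; only the 100 Hz rate and timestamp-based alignment follow the description above.

```python
# Minimal sketch of timestamped IMU publishing over MQTT (assumed setup);
# broker, topic, and read_imu() are placeholders. paho-mqtt >= 2.0 additionally
# expects a CallbackAPIVersion argument when constructing the client.
import json
import time

import paho.mqtt.client as mqtt

BROKER = "broker.example.org"   # hypothetical cloud-side broker
TOPIC = "installation/imu"      # hypothetical per-modality topic

def read_imu():
    """Placeholder for a 6-axis accelerometer/gyroscope read."""
    return [0.0] * 6

client = mqtt.Client()
client.connect(BROKER, 1883)

period = 1.0 / 100.0            # IMU data streamed at 100 Hz, as described above
try:
    while True:
        sample = {
            "ts": time.time(),  # host clock disciplined by NTP supplies the timestamp
            "imu": read_imu(),
        }
        # QoS 0 keeps latency low; the media control layer downstream
        # handles synchronization, filtering, and redundancy removal.
        client.publish(TOPIC, json.dumps(sample), qos=0)
        time.sleep(period)
except KeyboardInterrupt:
    client.disconnect()
```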
Dataset collection
To support the development and evaluation of the proposed framework, a multimodal dataset was constructed using both real-time data collected from IoT devices and publicly available sources. High-resolution visual data, including images and short video clips, were captured using cameras and lightweight sensors and further supplemented with samples from established datasets such as COCO and Flickr8k, which offer diverse and semantically annotated image content. For auditory input, sound recordings were collected through microphones and sound sensors and augmented using open datasets such as UrbanSound8K and FreeSound. These sources encompass a diverse range of labeled environmental and ambient sounds suitable for classification tasks. Motion-related data were acquired from accelerometers and gyroscopes integrated into IoT devices and enriched with samples from the Human Activity Recognition (HAR) dataset available on Kaggle.
All collected data were transferred to a centralized cloud environment where preprocessing operations, such as noise filtering, normalization, synchronization, and modality alignment, were applied. Quality control procedures ensured the consistency and reliability of the dataset, while data privacy mechanisms were incorporated to protect any sensitive information captured during real-time data acquisition. The final multimodal dataset comprises synchronized visual, auditory, and motion features aligned with interaction contexts. This dataset enables DeepFusionNet to perform effective classification of user states and environmental conditions, which are then used to trigger predefined artistic responses within the cross-media framework. To define these interaction contexts, a labeling scheme was implemented that categorized user-system states into predefined classes such as Active, Idle, Exploratory, and Engaged. Annotations were assigned through a combination of manual expert labeling and semi-automatic logging of sensor events (e.g., motion bursts, sound thresholds, or visual presence). Manual annotations were performed by two independent reviewers to ensure reliability, with discrepancies resolved by consensus. This hybrid approach enabled consistent and reproducible mapping between multimodal signals and contextual input states, which then served as the ground-truth labels for model training and evaluation.
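To make the semi-automatic part of this annotation concrete, the sketch below maps windowed sensor statistics to a candidate context class. The class names match the scheme above, but the feature names and thresholds are illustrative assumptions; proposed labels would still pass through the two-reviewer manual check described here.

```python
# Illustrative rule-based proposal of context labels from sensor events;
# thresholds and feature names are assumptions, the classes follow the paper.
def propose_label(motion_energy: float, sound_level: float, person_present: bool) -> str:
    """Map windowed sensor statistics (normalized to [0, 1]) to a candidate class."""
    if not person_present:
        return "Idle"
    if motion_energy > 0.8 and sound_level > 0.6:   # sustained motion and sound
        return "Engaged"
    if motion_energy > 0.3:                         # motion bursts
        return "Active"
    return "Exploratory"                            # present but low activity

# Candidate labels are then reviewed by two annotators and
# disagreements resolved by consensus, as described above.
print(propose_label(motion_energy=0.45, sound_level=0.2, person_present=True))  # -> "Active"
```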
Let the dataset \(D\) for cross-media art and design be defined as a collection of three primary modalities: visual, auditory, and motion data.
$$D=\{ D_v, D_a, D_m \}$$
(1)
Where \(D_v\) represents the visual data, \(D_a\) the auditory data, and \(D_m\) the motion data. The visual data \(D_v\) consists of a set of image and video samples:
$$D_v=\{ x_{vi} \mid x_{vi} \in \mathbb{R}^{H \times W \times C} \}$$
(2)
Where \(H\), \(W\), and \(C\) denote the height, width, and number of color channels, respectively. The auditory data \(D_a\) consists of audio signals represented as sequences of discrete samples:
$$D_a=\{ x_{ai} \mid x_{ai} \in \mathbb{R}^{T} \}$$
(3)
Where \(T\) is the duration of the audio sample in terms of the number of time steps. The motion data \(D_m\) are captured from sensors and represented as a sequence of features:
$$D_m=\{ x_{mi} \mid x_{mi} \in \mathbb{R}^{d} \}$$
(4)
Where \(d\) is the number of motion-related features (e.g., acceleration, orientation). The full dataset \(D\) is then represented as:
$$D=\bigcup_{i=1}^{N} \left\{ \left( x_{v_i}, x_{a_i}, x_{m_i}, y_i \right) \right\}$$
(5)
Where \(y_i\) is the corresponding label or annotation for the sample and \(N\) is the total number of samples.
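For illustration, one record of Eq. (5) can be stored as a tuple of NumPy arrays whose shapes follow Eqs. (2)–(4); the concrete dimensions used below (a 224 × 224 × 3 frame, one second of 44.1 kHz audio, six motion features) are example values, not fixed requirements.

```python
# Sketch of a single dataset record (x_v, x_a, x_m, y) from Eq. (5).
import numpy as np

H, W, C = 224, 224, 3      # visual frame: height, width, color channels (Eq. 2)
T = 44_100                 # one second of 44.1 kHz audio samples (Eq. 3)
d = 6                      # 3-axis accelerometer + 3-axis gyroscope features (Eq. 4)

sample = (
    np.zeros((H, W, C), dtype=np.float32),   # x_v: visual input
    np.zeros(T, dtype=np.float32),           # x_a: auditory input
    np.zeros(d, dtype=np.float32),           # x_m: motion features
    1,                                       # y: interaction-context label index
)
dataset = [sample]                           # D as the union of N such tuples
```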
Preprocessing
Data preprocessing is a crucial aspect of feature extraction, as it determines which form of multimodal data is best suited for analysis by deep learning models. It ensures high-quality data, free of missing, redundant, or inconsistent entries, before subsequent data analysis. Modality-specific preprocessing was carried out to optimize the quality and consistency of multimodal inputs for DeepFusionNet. In the visual stream, all images and video frames were resized to 224 × 224 pixels to align with CNN input requirements and normalized channel-wise using dataset-specific means and standard deviations29. To enhance generalization and reduce overfitting, augmentations such as random rotations (± 15°), horizontal flips, and brightness adjustments were applied. In the audio stream, raw waveforms were converted into Mel-spectrograms with 128 frequency bins and a fixed time window. Noise filtering, using a 300–3400 Hz band-pass filter, removed irrelevant frequency components. Robustness was further enhanced by the injection of Gaussian noise, random time-shifting within a 50–200 ms range, and pitch scaling. In the motion stream, accelerometer and gyroscope readings were smoothed with a moving-average filter to suppress jitter, normalized per-axis to ensure device independence, and corrected for missing values using k-nearest neighbor (k = 5) interpolation. Additional augmentation through temporal stretching and axis-rotation perturbations simulated natural variations in user movements30. Together, these procedures ensured standardized, high-quality, and diverse multimodal representations, enabling DeepFusionNet to achieve reliable fusion and real-time classification in cross-media art applications.
For the visual modality, inputs are standardized by normalizing each pixel channel-wise, subtracting the mean and dividing by the standard deviation. This ensures that all images share a consistent scale, reducing bias from lighting variations or color imbalances:
$$I'_c(x,y)=\frac{I_c(x,y)-\mu_c}{\sigma_c}$$
(6)
To improve robustness, images are randomly rotated within a controlled range, which increases sample diversity and prevents overfitting to specific orientations:
$$I'_{rot}(x,y)=I(R_{\theta} \cdot [x,y]^{T}), \quad \theta \in [-15^{\circ}, +15^{\circ}]$$
(7)
For the audio modality, the raw audio waveform is converted into Mel-spectrograms, which capture frequency information on a perceptually meaningful scale. This transformation provides a compact and discriminative time–frequency representation for learning:
$$S(m,n)=\sum_{k=0}^{K-1} \left| X(n,k) \cdot H_m(k) \right|^{2}$$
(8)
To increase generalization, Gaussian noise is added during augmentation, simulating environmental variability and ensuring that the model is not overly sensitive to small perturbations:
$$A'(t)=A(t)+\eta \cdot N(0,\sigma^{2})$$
(9)
Motion-modality sensor signals often contain high-frequency jitter; therefore, smoothing is applied using a moving-average filter. This reduces transient noise while retaining the overall shape of the motion patterns:
$$M'_t=\frac{1}{w}\sum_{i=0}^{w-1} M(t-i)$$
(10)
In cases where sensor readings are missing, values are imputed using a k-nearest neighbor method, which replaces missing entries with the mean of nearby patterns, maintaining temporal consistency:
$$M'_{missing}=\frac{1}{k}\sum_{j=1}^{k} M_{NN_j}$$
(11)
These formulations formalize the preprocessing pipeline across all modalities, ensuring that visual inputs are standardized and augmented, audio signals are transformed into robust spectrogram representations with noise tolerance, and motion data are smoothed, normalized, and imputed for reliability. Together, they provide high-quality, consistent, and diverse multimodal features that significantly enhance the effectiveness of DeepFusionNet in real-time classification tasks.
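A compact sketch of Eqs. (6)–(11) is given below: channel-wise image normalization, random rotation, Mel-spectrogram extraction, Gaussian-noise injection, moving-average smoothing, and k-nearest-neighbor mean imputation. It assumes librosa and SciPy are available; the window sizes and noise levels are illustrative defaults except where the text fixes them (128 Mel bins, k = 5).

```python
# Per-modality preprocessing sketch for Eqs. (6)-(11); parameter values are
# illustrative unless stated in the text.
import numpy as np
import librosa
from scipy.ndimage import rotate

def normalize_image(img, mean, std):                   # Eq. (6): channel-wise standardization
    return (img - mean) / std

def random_rotate(img, max_deg=15):                    # Eq. (7): rotation within +/- 15 degrees
    theta = np.random.uniform(-max_deg, max_deg)
    return rotate(img, theta, reshape=False, mode="nearest")

def mel_spectrogram(wave, sr=44_100, n_mels=128):      # Eq. (8): 128-bin Mel-spectrogram
    return librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=n_mels)

def add_gaussian_noise(wave, eta=0.05, sigma=1.0):     # Eq. (9): additive Gaussian noise
    return wave + eta * np.random.normal(0.0, sigma, size=wave.shape)

def moving_average(motion, w=5):                       # Eq. (10): w-sample smoothing window
    return np.convolve(motion, np.ones(w) / w, mode="same")

def knn_impute(motion, k=5):                           # Eq. (11): mean of the k nearest valid samples
    out = motion.copy()
    valid = np.flatnonzero(~np.isnan(out))
    for idx in np.flatnonzero(np.isnan(out)):
        nearest = valid[np.argsort(np.abs(valid - idx))[:k]]
        out[idx] = out[nearest].mean()
    return out
```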
The first step is data normalization, which scales all inputs to zero mean without weighting them toward outliers31,32. For data cleaning, outlier values are removed through statistical analysis, and treatments for different patterns of missing data include mean or k-NN imputation33,34. Additionally, data augmentation diversifies the dataset by incorporating newly created synthetic samples, while formatting standardization establishes a unified format across different types of data. These preprocessing steps enhance data quality, ensuring that only the most suitable features are extracted and reducing the risk of producing a model with low accuracy.
Normalization scales features to a standardized range. For a feature vector \(x=[x_1, x_2, \dots, x_n]\), Min-Max scaling is defined as:
$$x'=\frac{x-\min(x)}{\max(x)-\min(x)}, \quad \forall x_i \in x,\; 0 \leq x'_i \leq 1$$
(12)
Z-score normalization standardizes data using:
$$x'=\frac{x-\mu}{\sigma}$$
(13)
Where:
$$\mu=\frac{1}{n}\sum_{i=1}^{n} x_i$$
(14)
$$\sigma=\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( x_i-\mu \right)^{2}}$$
(15)
Outliers are identified using Z-scores for each data point \(x_i\):
$$z_i=\frac{x_i-\mu}{\sigma}, \quad \text{outlier condition: } \left| z_i \right| > k$$
(16)
Here, \(k\) is a predefined threshold (commonly \(k = 3\)). Outliers are then excluded by:
$$x_{cleaned}=\{ x_i \in x \mid \left| z_i \right| \leq k \}$$
(17)
Missing data \(x_{missing}\) are imputed using the mean:
$$x_{missing}=\mu=\frac{1}{n}\sum_{i=1}^{n} x_i$$
(18)
The preprocessing relies on specific parameters, such as the threshold \(k = 3\) for outlier detection with the Z-score method to remove extreme data points. For k-NN imputation, \(k = 5\) is used, where missing values are filled based on the average of the five nearest neighbors, ensuring reliable data while maintaining computational efficiency:
$$x_{missing}=\frac{1}{k}\sum_{j=1}^{k} x_j, \quad x_j \in \text{NearestNeighbors}(x_{missing})$$
(19)
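The tabular cleaning steps of Eqs. (12)–(18) reduce to a few NumPy operations, sketched below under the assumption that each feature is a one-dimensional array; the threshold k = 3 follows the text, and the toy vector is illustrative.

```python
# Sketch of Min-Max scaling, Z-score standardization, Z-score outlier removal,
# and mean imputation (Eqs. 12-18); feature vectors are assumed 1-D.
import numpy as np

def min_max_scale(x):                        # Eq. (12)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):                              # Eqs. (13)-(15)
    return (x - x.mean()) / x.std()

def remove_outliers(x, k=3.0):               # Eqs. (16)-(17), k = 3 as in the text
    return x[np.abs(z_score(x)) <= k]

def mean_impute(x):                          # Eq. (18): replace missing entries with the mean
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

x = np.array([1.0, 2.0, np.nan, 3.0, 4.0])   # toy feature vector
cleaned = remove_outliers(min_max_scale(mean_impute(x)))
```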
Augmentation generates synthetic samples. For image data \(I(x,y)\):
$$I'=R(S(I(x,y)))$$
(20)
$$\left\{ \begin{gathered} R=\text{rotation matrix} \hfill \\ S=\text{scaling factor matrix} \hfill \end{gathered} \right.$$
(21)
For audio data \(A(t)\), noise \(N\) and time shifts \(\Delta t\) are applied:
$$A'(t)=A(t+\Delta t)+\eta \cdot N, \quad \eta \in [0,1]$$
(22)
Here, \(\Delta t\) is the time shift, and \(\eta\) controls the noise intensity.
Standardization unifies data dimensions. For image datasets:
$$I'=\left\{ \begin{gathered} x', y'=\text{Resize}(I(x,y), D) \hfill \\ D=[H, W] \hfill \\ H=\text{height} \hfill \\ W=\text{width} \hfill \end{gathered} \right.$$
(23)
For time-series data \(T(t)\):
$$T'(t')=\text{Interpolate}(T(t), \Delta t')$$
(24)
Preprocessing thus uses these mathematical formulations to optimize data quality, consistency, and compatibility, which significantly enhances the performance of deep learning models.
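The augmentation and standardization operations of Eqs. (20)–(24) can be sketched with OpenCV and NumPy as follows; the rotation angle, scale factor, and target size are illustrative values (224 × 224 matches the preprocessing described earlier).

```python
# Sketch of image rotation/scaling, audio time-shift with additive noise,
# image resizing, and time-series interpolation (Eqs. 20-24).
import numpy as np
import cv2

def augment_image(img, angle=10.0, scale=1.1):        # Eqs. (20)-(21): R and S applied to I(x, y)
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    return cv2.warpAffine(img, m, (w, h))

def augment_audio(a, shift_samples, eta=0.1):         # Eq. (22): A(t + dt) + eta * N
    noise = np.random.normal(0.0, 1.0, size=a.shape)
    return np.roll(a, shift_samples) + eta * noise

def standardize_image(img, size=(224, 224)):          # Eq. (23): resize to D = [H, W]
    return cv2.resize(img, size)

def resample_series(t, values, t_new):                # Eq. (24): interpolate onto a new time grid
    return np.interp(t_new, t, values)
```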
DeepFusionNet
DeepFusionNet is a hybrid deep learning architecture developed to perform efficient multimodal data classification in the context of cross-media art and design. The model is designed to process heterogeneous input streams, including visual, auditory, and motion data, and to extract high-level features that inform the behavior of interactive systems. The architecture begins with Convolutional Neural Networks (CNNs), which extract spatial features from visual data and from the spectrogram representations of auditory signals. Pooling layers follow the convolutional layers to reduce dimensionality while preserving essential spatial features35. These representations are then passed through flattening layers, which convert multi-dimensional feature maps into vectors suitable for dense processing36,37. To handle temporal characteristics in motion and audio data, the model incorporates Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers. These layers are responsible for learning temporal dependencies and are particularly effective in processing time-aligned multimodal sequences. The outputs of the CNN, LSTM, and GRU components are concatenated and passed through fully connected layers, which perform feature fusion and produce the final classification. The model is optimized using advanced training techniques, including hyperparameter tuning and loss-function minimization, to achieve robust generalization across multimodal datasets38,39. Unlike generative models, DeepFusionNet is specifically designed to classify interaction contexts, such as user gestures or environmental states, and to trigger predefined artistic responses within cross-media installations. By combining the strengths of CNNs for spatial data, recurrent layers for temporal data, and dense layers for integration, DeepFusionNet provides a scalable and adaptable solution for real-time multimodal classification. Its performance in the creative domain is characterized by low classification error, high responsiveness, and applicability to diverse interactive design tasks. The full architecture of DeepFusionNet, including its multimodal input processing pipeline and component relationships, is illustrated in Fig. 2.
DeepFusionNet is an advanced deep learning model developed for fusion applications involving multimodal data, such as vision, audio, and motion-sensor data. CNNs are among the neural networks used in this study, alongside LSTM and GRU networks40,41. For input images, and for spectrograms in the case of auditory data, the CNN applies convolutions to extract spatial features. Let the input be represented as \(X_{visual}\) or \(X_{audio}\), where each data type is processed separately:
$$F_{cnn}=\text{CNN}(X)$$
(25)
Where \(F_{cnn}\) is the feature map obtained from the convolution layers; these features are typically passed through activation functions \(\sigma\) (e.g., ReLU) to introduce non-linearity. For sequential data (such as motion sequences or time-series features derived from audio), temporal dependencies are modeled using LSTMs and GRUs. For the LSTM, the input at time step \(t\) is \(x_t\), and the LSTM computes the following updates at each time step:
$$i_t=\sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$$
(26)
$$f_t=\sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$$
(27)
$$o_t=\sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)$$
(28)
$$c_t=f_t \cdot c_{t-1} + i_t \cdot \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$$
(29)
$$h_t=o_t \cdot \tanh(c_t)$$
(30)
Where \(h_t\) represents the hidden state, \(i_t\), \(f_t\), and \(o_t\) are the input, forget, and output gates, respectively, and \(c_t\) is the cell state. For the GRU, the gates are simpler and are given by:
$$r_t=\sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r)$$
(31)
$$z_t=\sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z)$$
(32)
$$\tilde{h}_t=\tanh(W_h \cdot x_t + U_h \cdot (r_t \cdot h_{t-1}) + b_h)$$
(33)
$$h_t=(1-z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t$$
(34)
Where \(r_t\) and \(z_t\) are the reset and update gates, respectively.
In DeepFusionNet, the model combines the spatial features extracted by the CNN layers with the temporal features captured by the LSTM/GRU layers. This fusion process enables the network to integrate both types of information for enhanced multimodal understanding. The fusion is performed as follows:
$$F_{fused}=\sigma\left( W_c \cdot \left[ F_{cnn} \oplus F_{lstm/gru} \right] + b \right)$$
(35)
In the fusion process, \(F_{cnn}\) represents the spatial features extracted from the CNN layers (visual data), and \(F_{lstm/gru}\) represents the temporal features from the LSTM/GRU layers (motion/audio data). The \(\oplus\) operation denotes the concatenation of these features along the feature dimension. The concatenated features are then linearly transformed by the weight matrix \(W_c\), with a bias term \(b\) added after the transformation. The activation function \(\sigma(\cdot)\), typically ReLU or sigmoid, introduces non-linearity into the transformed features. This fusion strategy integrates both spatial and temporal information, enabling the model to perform accurate classification or decision-making for cross-media art tasks. The resulting fused features are then processed for downstream tasks, such as classification or regression.
The fused features are passed through fully connected layers, represented as:
$$y=\sigma(W_{fc} \cdot F_{fused} + b_{fc})$$
(36)
The model parameters \(\Theta\) are optimized by minimizing a loss function \(L\):
$$L=-\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$
(37)
Using backpropagation and optimizers such as Adam, the model iteratively updates \(\Theta\):
$$\Theta^{(t+1)}=\Theta^{(t)}-\eta \cdot \nabla_{\Theta} L$$
(38)
Where \(\eta\) is the learning rate and \(t\) is the iteration step.
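A minimal PyTorch sketch of this architecture is given below: a CNN branch for images or spectrograms, LSTM and GRU branches for motion and audio sequences, concatenation-based fusion (Eq. 35), fully connected classification (Eq. 36), a cross-entropy loss in the spirit of Eq. (37) (the equation gives the binary form; the sketch uses the multi-class variant for four context classes), and Adam updates (Eq. 38). The layer widths, sequence lengths, and four-class output are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal DeepFusionNet-style sketch in PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class DeepFusionNetSketch(nn.Module):
    def __init__(self, motion_dim=6, audio_dim=128, hidden=64, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(                    # spatial features, F_cnn
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(motion_dim, hidden, batch_first=True)   # temporal motion features
        self.gru = nn.GRU(audio_dim, hidden, batch_first=True)      # temporal audio features
        self.fuse = nn.Sequential(                   # Eqs. (35)-(36): fusion + classification
            nn.Linear(32 + 2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, image, motion_seq, audio_seq):
        f_cnn = self.cnn(image)                      # (B, 32)
        _, (h_lstm, _) = self.lstm(motion_seq)       # final hidden states
        _, h_gru = self.gru(audio_seq)
        fused = torch.cat([f_cnn, h_lstm[-1], h_gru[-1]], dim=1)    # feature concatenation
        return self.fuse(fused)                      # class logits

# Training step: cross-entropy loss minimized with Adam, as in Eqs. (37)-(38).
model = DeepFusionNetSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

image = torch.randn(8, 3, 224, 224)
motion = torch.randn(8, 100, 6)        # 1 s of 100 Hz IMU data
audio = torch.randn(8, 50, 128)        # 50 Mel-spectrogram frames
labels = torch.randint(0, 4, (8,))

logits = model(image, motion, audio)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```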
Performance evaluation of DeepFusionNet
The performance of DeepFusionNet is assessed using objective functions appropriate to the task at hand, which in this case is primarily classification. Classification quality is measured using key metrics, such as accuracy, precision, recall (also known as sensitivity), F1-score, and specificity, to evaluate the model's success. This assessment evaluates its capacity to classify multimodal input (visual, auditory, and motion data) and elicit predefined artistic responses. Accuracy is the percentage of instances in which the model makes correct predictions42,43. Precision and recall determine whether the model accurately detects positive and negative cases. The F1-score combines precision and recall, which is particularly important when working with imbalanced data. Specificity quantifies the model's accuracy in classifying negative cases. Moreover, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC-ROC) characterize the trade-off between sensitivity and specificity across different thresholds44. In binary classification problems, the Matthews Correlation Coefficient (MCC) is also employed because it considers all values in the confusion matrix, thereby providing a balanced performance measure, especially in imbalanced settings.
$$Accuracy=\frac{T^{+}+T^{-}}{T^{+}+F^{+}+T^{-}+F^{-}}, \quad 0 \leq Acc \leq 1$$
(39)
$$Specificity=\frac{T^{-}}{F^{+}+T^{-}}, \quad 0 \leq Sp \leq 1$$
(40)
$$Sensitivity=\frac{T^{+}}{T^{+}+F^{-}}, \quad 0 \leq Sn \leq 1$$
(41)
$$MCC=\frac{\left(T^{-} \cdot T^{+}\right)-\left(F^{-} \cdot F^{+}\right)}{\sqrt{\left(F^{+}+T^{+}\right)\left(T^{+}+F^{-}\right)\left(F^{+}+T^{-}\right)\left(T^{-}+F^{-}\right)}}, \quad -1 \leq MCC \leq 1$$
(42)
Where \(T^{+}\) denotes true positives, \(F^{+}\) false positives, \(T^{-}\) true negatives, and \(F^{-}\) false negatives, respectively.
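The classification metrics of Eqs. (39)–(42) follow directly from the four confusion-matrix counts, as the short sketch below shows; the counts used in the example call are toy values.

```python
# Sketch of accuracy, specificity, sensitivity, and MCC (Eqs. 39-42)
# computed from confusion-matrix counts.
import math

def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)                     # Eq. (39)
    specificity = tn / (fp + tn)                                   # Eq. (40)
    sensitivity = tp / (tp + fn)                                   # Eq. (41)
    denom = math.sqrt((fp + tp) * (tp + fn) * (fp + tn) * (tn + fn))
    mcc = ((tn * tp) - (fn * fp)) / denom if denom else 0.0        # Eq. (42)
    return accuracy, specificity, sensitivity, mcc

print(classification_metrics(tp=90, fp=10, tn=85, fn=15))          # toy counts
```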
Regression criteria are also reported, using the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) alongside the classification metrics. In addition, the computational efficiency of DeepFusionNet is evaluated, focusing explicitly on training time, inference time, and memory consumption. These measurements are essential for understanding the practical deployment of the model, especially in real-time settings where latency and memory are critical considerations. To determine the comparative efficiency of DeepFusionNet, it is contrasted with baseline models, testing its capability with multimodal data and its flexibility under complex test conditions. These comparisons demonstrate the effectiveness of DeepFusionNet in multimodal classification problems, particularly in scenarios where multiple data streams must be processed in real time. Although the classification metrics assess the extent to which the model can classify input states (i.e., the states that elicit predefined artistic responses), it is important to note that artistic merit is not evaluated in this analysis. The emphasis is placed on the reliability with which the system elicits the correct artistic responses for the categorical states of the inputs. The aesthetic quality of the output, including graphic and sound stimuli, will be assessed in subsequent research through surveys or expert users to provide a holistic overview of the model's performance in interactive art installations.
Mean Absolute Error (MAE):
$$MAE=\frac{1}{N}\sum_{i=1}^{N} \left| y_i-\hat{y}_i \right|$$
(43)
Root Mean Squared Error (RMSE):
$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i-\hat{y}_i \right)^{2}}$$
(44)
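Similarly, the regression metrics of Eqs. (43)–(44) are simple NumPy reductions; the arrays below are placeholder values for illustration.

```python
# Sketch of MAE and RMSE (Eqs. 43-44) with placeholder predictions.
import numpy as np

y_true = np.array([0.2, 0.8, 0.5])
y_pred = np.array([0.25, 0.7, 0.55])

mae = np.mean(np.abs(y_true - y_pred))            # Eq. (43)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # Eq. (44)
print(mae, rmse)
```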
These regression metrics quantify the deviation between the values predicted by DeepFusionNet and the ground-truth targets, complementing the classification metrics in the overall assessment of the framework's effectiveness.


