Jin-Kook Lee, Youngjin Yoo, Seung Hyun Cha, Generative early architectural visualizations: incorporating architect’s style-trained models, Journal of Computational Design and Engineering, Volume 11, Issue 5, October 2024, Pages 40–59, https://doi.org/10.1093/jcde/qwae065
Abstract
This study introduces a novel approach to architectural visualization using generative artificial intelligence (AI), particularly emphasizing text-to-image technology, to markedly improve the visualization process from the initial design phase within the architecture, engineering, and construction industry. By creating more than 10 000 images incorporating an architect’s personal style and characteristics into a residential house model, the effectiveness of base AI models was evaluated. Furthermore, various architectural styles were integrated to enhance the visualization process. This method involved additional training for styles with low similarity rates, which required extensive data preparation and their integration into the base AI model. Demonstrated to be effective across multiple scenarios, this technique markedly enhances the efficiency and speed of producing architectural visualization images. Highlighting the vast potential of AI in design visualization, our study emphasizes the technology’s shift toward facilitating more user-centered and personalized design applications.

Generative artificial intelligence (AI) adeptly identifies diverse architectural styles and features.
We explored the effectiveness of AI models in architectural visualization.
We established datasets and introduced enhanced training for unique architectural nuances.
We streamlined prompts, data preparations, and training methods for varied architectural designs.
Our findings showcase 20 unique architects’ styles, seamlessly integrating them with the “AI renderer”.
1. Introduction
The field of architecture has long recognized the pivotal role of visualization in conveying design concepts, ideas, and spatial arrangements (Greenberg, 1974). Visualization bridges the gap between abstract concepts and tangible representations, allowing architects to translate their creative visions into visual forms that can be easily comprehended by various audiences (Chen, 2004). By creating realistic or conceptual visualizations, architects can explore design alternatives, evaluate spatial qualities, and make informed decisions while designing (Kunze et al., 2012). Consequently, architectural visualization aids in effective decision-making for architects, designers, clients, and stakeholders (Akin, 1978).
Visualization techniques have advanced to ensure uninterrupted and efficient design progress. The contemporary approach involves a sequence of steps: three-dimensional (3D) modeling upon design finalization, GPU rendering, and post-editing (Fig. 1a; Yildirim & Yavuz, 2012). These advancements allow architects to create high-fidelity visualizations, offering a comprehensive understanding of unbuilt structures (Koutamanis, 2000). Despite considerable improvements in speed and quality, visualization remains complex and time-consuming. It is often replaced by using general reference images for ideation or deferred until later stages for presentation (Bouchlaghem et al., 2005).

Overview of the research. (a) Conventional visualization approach and (b) proposed approach. Refer to Fig. 2 for the details of (b).
The emergence of artificial intelligence (AI) and machine learning (ML)-based image generation models enables rapid image creation from textual descriptions (Ramesh et al., 2021; Saharia et al., 2022b). Implementing such technology in architecture fosters innovative solutions and accelerates the design process. However, publicly available pretrained image generation AI models are usually trained on common objects and cannot be directly applied to specialized fields such as architecture. Existing research mainly focuses on reviewing the potential of these pretrained AI models and image editing techniques (Ploennigs & Berger, 2023). Therefore, further research is needed to tailor and enhance the performance of these models using domain-specific architectural images for their application.
This study investigates the generation of architectural visualizations, including reference images and initial renderings, through AI-based image generation methods (Fig. 1b). Furthermore, it assesses the performance of the default AI model with respect to architects’ styles and features (Section 3). Based on these findings, the design styles and features of various architects are defined (Section 4). Section 5 addresses styles with lower similarity rates through additional training. Finally, Section 6 demonstrates the practical applications of these visualization methods across diverse scenarios.
2. Background
2.1. Development of architectural visualization techniques
The evolution of architectural visualization techniques has paralleled technological advancements. Traditionally, architects relied on hand-drawn sketches, paintings, and physical models to communicate their design concepts (Al-Kodmany, 2001; Atilola et al., 2016). While expressive, these methods were limited in scale, accuracy, and time efficiency. The advent of computer-aided design (CAD), followed by building information modeling (BIM) and rendering engines, introduced photorealistic rendering techniques, marking a pivotal shift in architectural visualization (Fonseca et al., 2013; Xu et al., 2023).
2D CAD drawings enabled precise and modifiable digital representations (Chiu, 1995). The development of 3D modeling enabled a deeper understanding of spatial relationships and volumetric compositions (Eastman, 1999; Hong et al., 2022; Ma et al., 2023; Xu et al., 2016; Yan et al., 2011). In combination with rendering engines, these tools created photorealistic renderings, capturing intricate material textures and replicating lighting conditions (David et al., 2022; Li et al., 2017). Photorealistic visualization technologies now extend beyond the limitations of flat screens, immersing stakeholders in virtual environments and allowing interactive experience with designs (Han & Leite, 2022; Korkut & Surer, 2023; Lee et al., 2023b; Ma et al., 2023).
Advanced visualization tools, such as CAD, BIM, and rendering engines, have led to precise and authentic visualizations, resulting in considerable time efficiency, cost reductions, streamlined editing, efficient data storage, and innovative alternatives (Azhar, 2011; Chen et al., 2023; Lee et al., 2023a). However, creating highly realistic visualizations still requires meticulous modeling, high-end computer specifications, significant time, and specialized skills (Azhar, 2011; Fonseca et al., 2017). As a result, accessibility to these advanced visualization technologies remains somewhat constrained.
Ongoing research aims to refine and create more user-friendly visualization tools. Automation is emerging as a promising solution, simplifying and expediting the visualization process. Leveraging ML and AI, architects can automate various aspects of visualization, allowing more time for in-depth design exploration and decision-making (Castro et al., 2021; Lee et al., 2024; Ploennigs & Berger, 2023; Qian et al., 2023).
2.2. ML and generative artificial intelligence
The evolution from ML to generative artificial intelligence (Gen AI) marks a transformative journey in the field of AI. Initially, ML focused on enabling computers to learn patterns and make predictions from data (Bishop, 2006; Janiesch et al., 2021). AI now processes diverse data types, including text, images, videos, and audio; identifies objects; analyzes patterns; and provides predictions across various domains, including architecture (Kakooee & Dillenburger, 2024; Katsigiannis et al., 2023; Lee et al., 2012; Mathew et al., 2021; Park & Cha, 2023; Song et al., 2020a; Wei et al., 2022). In the field of architecture, many researchers have dedicated significant effort to processing images and videos related to design and construction sites (Park & Hyun, 2022; Qian et al., 2023; Zhang et al., 2022). They have focused on developing technologies for object recognition within these visual materials and, based on these advancements, propose methods to identify various elements, from design styles to construction site rule violations, aiming to manage related data more effectively. These technologies are becoming more streamlined and sophisticated, requiring less manual engineering, as their development accelerates at an unprecedented pace (LeCun et al., 2015). AI now extends beyond data analysis and classification into the generation of new data based on learned content (Kim et al., 2024).
Gen AI focuses on creating data that mirrors specific input datasets. In 2014, generative adversarial networks (GANs) were introduced, marking a paradigm shift in Gen AI. Goodfellow et al. (2014) pioneered GANs, generating realistic synthetic data through a competitive interplay between a generator and a discriminator. In architecture, GANs have been used to differentiate and classify interior design styles based on learned data (Kim et al., 2019). Cho et al. (2020) transformed hand-drawn architectural blueprints into vectorized drawings. Kikuchi et al. (2022) visualized future scenarios by editing existing buildings from videos. Rahbar et al. (2022) generated architectural layouts for particular topological conditions and geometrical constraints. These studies have advanced research in generative imagery, including image generation and editing (Goetschalckx et al., 2019; Karras et al., 2019).
In 2015, Sohl-Dickstein et al. (2015) proposed the concept of diffusion models (DMs), combining generative models with natural language. DMs use a hierarchy of denoising autoencoders for high-quality image synthesis through a reversed diffusion process (Ho et al., 2020; Song et al., 2020b). Notably, DMs avoid issues such as mode collapse, training instabilities, and vast parameter counts observed in GANs and variational autoencoders (Rombach et al., 2022). They apply to diverse tasks, including text-to-image (txt2img) generation, which creates images from text prompts; image-to-image (img2img) generation, which modifies existing images based on a text prompt; as well as inpainting, outpainting, up-scaling, and stroke-based synthesis (Kawar et al., 2022; Kim et al., 2023; Li et al., 2017; Lugmayr et al., 2022; Meng et al., 2021; Saharia et al., 2022a). From GANs to DMs, ML-based Gen AI redefines the boundaries of AI creativity (Oppenlaender, 2022) and holds immense potential across industries, including architecture.
2.3. Potential of image generation AI for architectural visualization
In 2020, large language models (LLMs) emerged as transformative tools in natural language processing. Notably, OpenAI’s GPT-3 employs transformer architectures to comprehend and generate human-like text (OpenAI & Pilipiszyn, 2021). This evolution has positively impacted AI models that generate images from text (txt2img generation models; Ploennigs & Berger, 2023), as well as DMs. Representative image generation AI platforms built on LLMs include Midjourney (Midjourney Inc., 2022), DALL·E2 (OpenAI, 2022), and Stable Diffusion (SD; Stability AI, 2022).
Midjourney (Midjourney Inc., 2022) employs transformers to generate detailed images based on textual descriptions (Oppenlaender, 2022). DALL·E2 (OpenAI, 2022) introduces an encoder–decoder architecture capable of producing images from textual descriptions containing novel combinations of objects and concepts (Ramesh et al., 2021). SD (Stability AI, 2022) is a recently proposed txt2img model that utilizes a latent diffusion process to create images from text, allowing gradual refinement of images (Rombach et al., 2022). Each platform provides the ability to obtain high-quality images through txt2img and img2img approaches, with the potential to achieve desired results through meticulous prompt engineering.
Currently, various studies have explored Gen AI, particularly image generation AI, in creative fields like the arts. Oppenlaender (2022) investigated AI-based image generation and editing, viewing AI as a tool to extend human creativity. In architecture, Ploennigs and Berger (2023) conducted case studies on various image generation platforms to explore their potential for architectural visualization. Jo et al. (2024) introduced a method for rendering building façades using regional styles. However, these studies either focused on reviewing platforms’ capabilities and limitations (Ploennigs & Berger, 2023) or were limited to a single regional style and façade (Jo et al., 2024). Hence, there is a need for further research on the diverse applications of Gen AI’s visualization technology.
AI-assisted visualization in architecture can automate labor-intensive tasks, transforming abstract concepts into vivid representations. Regardless of design stages or material readiness, this approach can expedite the visualization process and accelerate design cycles. Consequently, architects will get more time to focus on crucial design decisions and facilitate effective communication with clients, collaborators, and the public (Epstein et al., 2023; Oppenlaender, 2022; Ploennigs & Berger, 2023). Therefore, our purpose is to explore and amplify the potential of integrating Gen AI into the visualization process to enhance the architectural design process. For this purpose, this study, focusing on architects’ styles, systematically trains architectural expertise to fine-tune the performance of existing models and demonstrates various applications in design process based on these enhancements (Fig. 2).

Early architectural visualization by architect’s style and feature using Gen AI.
Due to the varying strengths and weaknesses of each platform, users must select the platform that aligns with their desired outcomes and functionalities. In this study, SD (Stability AI, 2022) is identified as highly suitable for AI-aided architectural visualization generation, due to its learning capacity, stability during training, controlled generation process, and consistent production of high-quality images from textual prompts. The following sections will explore the practical implementation and implications of the SD model within architectural design.
3. Image Generation Based on Architect’s Style and Feature
3.1. Image generation with Gen AI
There are two primary approaches to image generation in SD: txt2img and img2img. The former generates images solely from textual prompts, while the latter takes both a textual prompt and a seed image as input, modifying the seed image based on the prompt. The txt2img approach is therefore useful for flexible idea visualization unconstrained by form, while img2img is beneficial for tailored, continuous idea development. Depending on the nature of each approach, they can be utilized for various tasks such as generating reference images and rendering images. These two image generation approaches can be formally defined as follows:
The “|$generate()$|” function in an image generation AI model (M) creates an image (|$Im{g_{\rm G}}$|) based on the generation parameters (|$Para{m_{\rm G}}$|) and text prompts (|${P_{\rm t}}$|) that describe the target. When using the “|$generate()$|” function with a seed image (|${\rm{}}Im{g_{\rm S}}$|) in the img2img approach, the |${\rm{}}Im{g_{\rm S}}$| is processed according to the processing parameters (|$Para{m_{\rm P}}$|), allowing M to generate |$Im{g_{\rm G}}$| based on both |${\rm{}}Im{g_{\rm S}}$| and |${P_{\rm t}}$|. |${P_{\rm t}}$| is derived from the desired target image (|$Im{g_{\rm t}}$|), through the process “|$getprompts()$|”, serving as a textual representation of |$Im{g_{\rm t}}$|. As a result, |$Im{g_{\rm G}}$| demonstrates a resemblance to |$Im{g_{\rm t}}$| and predominantly belongs to a group sharing similarities with the target:
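In symbolic form, following the definitions above, the txt2img and img2img modes can be expressed as:

$$Img_{\rm G} = generate(M,\ Param_{\rm G},\ P_{\rm t}), \qquad P_{\rm t} = getprompts(Img_{\rm t})$$

$$Img_{\rm G} = generate(M,\ Param_{\rm G},\ Param_{\rm P},\ P_{\rm t},\ Img_{\rm S})$$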
|$Para{m_{\rm G}}$| comprises four components necessary for defining image generation. These essential elements include |$resolution$|, which determines the image dimensions in pixels; |$sampling{\rm{\,\,}}\textit{method}$|, which refers to the type of technique used to extract samples from the latent space; |$sampling{\rm{\,\,}}\textit{steps}$|, which determine the number of intermediate stages between the initial and final states during the diffusion process, substantially impacting the level of detail in results; and |$CFG{\rm{\,\,}}\textit{scale}$| (classifier-free guidance scale), which indicates the level of autonomy or reliance on predefined classifiers in the AI model. |$Para{m_{\rm P}}$| comprises three components, with the |$processor$| referring to the type of methods used to recognize |${\rm{}}Im{g_{\rm S}}$|. Depending on the |$processor$|, the method of detecting |${\rm{}}Im{g_{\rm S}}$| varies, such as detecting boundaries based on image contrast or detecting shapes based on image depth or distance. |$control{\rm{\,\,}}\textit{weight}$| indicates how closely the detected shape of the seed image will be adhered to, representing the degree of allowance for change, while |$control{\rm{\,\,}}\textit{mode}$| indicates whether the prompt or the seed image is given more priority.
|${P_{\rm t}}$| is composed of two types of prompts: scene description prompt (|$\mathrm{ SDP}$|) and resolution quality prompt (|$\mathrm{ RQP}$|). While it is possible to generate images by providing only scene and context descriptions, the probability of obtaining the desired image and quality may be low. Therefore, it is necessary to employ prompt engineering to describe the |$Im{g_{\rm t}}$| systematically and precisely, as illustrated in Table 1. The |$\mathrm{ SDP}$| encompasses not only main description but also the graphic style and composition of resultant images. The |$\mathrm{ RQP}$| pertains to prompts related to the image’s resolution quality, allowing users to achieve the desired image quality. Lastly, to prevent errors and dissimilar image results, it is crucial to utilize negative prompts to exclude keywords that should be avoided or do not align with the |$Im{g_{\rm t}}$|.
| Type | Content | Positive prompt example | Negative prompt example |
|---|---|---|---|
| SDP | Main description (about scene and context) | A house with Mondrian’s color palette, located in a forest, a cat sitting in a chair, kids running around the house, etc. | Dogs, department, tower, cars, located in a city, at night, etc. |
| SDP | Graphic style | Professional photograph, photorealistic rendering, etc. | Watercolor painting, oil painting, drawing, sketch, cartoonish, etc. |
| SDP | Composition (angle, lighting, etc.) | Full shot, deep depth of field, high-key lighting, natural lighting, two-point perspective, etc. | Bird’s-eye view, isometric, portrait, cropped view, etc. |
| RQP | Resolution | Realistic shadows, enhanced-detail, v-ray rendering, full HD, masterpiece, highly detailed, high quality, 8k, etc. | Low quality, too much noise, normal quality, watermark, blurry textured, blurry, noise, faint, text, etc. |
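To make the structure of |${P_{\rm t}}$| concrete, the components in Table 1 can be concatenated into a positive and a negative prompt string before generation. The snippet below is a minimal illustration using keywords from Table 1; the variable names are illustrative and not part of the study’s implementation.

```python
# Assemble P_t from SDP (scene description) and RQP (resolution quality) components.
sdp_positive = [
    "a house with Mondrian's color palette, located in a forest",               # main description
    "professional photograph, photorealistic rendering",                        # graphic style
    "full shot, deep depth of field, natural lighting, two-point perspective",  # composition
]
rqp_positive = ["realistic shadows, v-ray rendering, highly detailed, high quality, 8k"]

negative = [
    "located in a city, at night",                  # unwanted scene content
    "watercolor painting, sketch, cartoonish",      # unwanted graphic styles
    "bird's-eye view, isometric, cropped view",     # unwanted compositions
    "low quality, watermark, blurry, noise, text",  # quality issues to exclude
]

positive_prompt = ", ".join(sdp_positive + rqp_positive)
negative_prompt = ", ".join(negative)
print(positive_prompt)
print(negative_prompt)
```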
3.2. Image generation test for architects’ styles
An intensive image generation test was conducted to evaluate the performance of the SD model, specifically for architectural visualization reflecting architects’ design styles and features. The primary objective of the test was to assess the extent to which the pretrained model recognizes architects’ styles. The test focused primarily on txt2img because it is relatively unconstrained, generating images from the same model as img2img but without a seed image, which allowed us to assess the optimum performance of the pretrained model. Therefore, all the images for the test were generated based on Equation (1), with the target scenes set as residential houses reflecting various architects’ design styles. Each architect’s style was treated as an independent variable, and we randomly selected 20 architects and applied their styles.
By providing detailed prompts, images that closely resemble the target scenes can be generated. However, to accurately discern whether the SD default model recognizes specific design styles and features used by real-world architects, certain style-related words were intentionally omitted. Consequently, for |${P_{\rm t}}$|, the main description prompt was given as “architect name-inspired residential house” to ensure a more precise comparison between the independent variables. To facilitate this comparison, the “photorealistic rendering prompt set” was used: a collection of positive and negative prompts specifically designed to generate high-resolution images with a photorealistic rendering style. Its positive prompt comprises professional photograph, photorealistic rendering, realistic, enhance-detail, v-ray rendering, full HD, masterpiece, highly detailed, high quality, 8k, two-point perspective, exterior view, full shot, deep depth of field, f/22, high-key lighting, natural lighting, and realistic shadows; its negative prompt comprises low quality, bad proportion, awkward shadows, unrealistic lighting, pixelated textures, too much noise, unrealistic reflections, normal quality, watermark, bad perspective, confusing details, blurry textured, blurry, noise, cloudy, faint, and text. This set contains style-related keywords commonly employed in architectural visualization, such as photorealistic rendering for the graphic style and the two-point perspective view for the composition. All the images were generated with the SD default model on a local PC (equipped with an NVIDIA RTX series GPU and 16 GB of RAM) at a resolution of 1024 × 512 pixels. More than 10 000 images were generated in this test (Table 2).
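The study does not publish its generation code; the following sketch shows how an equivalent txt2img call could be issued with the open-source diffusers library, using the photorealistic rendering prompt set and the 1024 × 512 resolution described above. The checkpoint identifier, sampling steps, and CFG scale are illustrative assumptions rather than the study’s exact settings.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint ID for an SD v1.5 base model (the paper runs a local "pruned v1.5" checkpoint).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "Louis Kahn-inspired residential house, professional photograph, "
    "photorealistic rendering, realistic, v-ray rendering, full HD, masterpiece, "
    "highly detailed, 8k, two-point perspective, exterior view, full shot, "
    "deep depth of field, high-key lighting, natural lighting, realistic shadows"
)
negative_prompt = (
    "low quality, bad proportion, awkward shadows, unrealistic lighting, "
    "too much noise, watermark, bad perspective, blurry, text"
)

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    width=1024, height=512,     # resolution used in the test
    num_inference_steps=30,     # sampling steps (illustrative value)
    guidance_scale=7.0,         # CFG scale (illustrative value)
).images[0]
image.save("kahn_inspired_house.png")
```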


3.3. Demand of additional training
There are various methods and tools available to evaluate the fidelity of the generated images (|$Im{g_{\rm G}}$|) to their input text (|${P_{\rm t}}$|) in order to assess the performance of a default model (M). The human preference classifier (Wu et al., 2023a, b) and the CLIP score (Hessel et al., 2022) are representative evaluation metrics for assessing the human preference score in txt2img synthesis. The first approach can measure the extent of misalignment with human preferences by identifying instances such as floating pillars, awkwardly positioned furniture, or discrepancies in appearance. The second approach measures the similarity between text prompts and images by assessing whether the images contain characteristic elements of an architect’s style. Additionally, qualitative methods such as surveys and observations can be used for assessment. This study focuses on introducing a novel visualization method while recognizing the subjective nature of style assessment. Therefore, the evaluation was conducted both quantitatively and qualitatively, based on prior research on each architect’s styles and features.
The quantitative evaluation of the performance of M was conducted to assess the |$Similarity$| between target images (|$Im{g_{\rm t}}$|) and those generated via AI (|$Im{g_{\rm G}}$|) based on the CLIP score. The |$Similarity$| is calculated by dividing the CLIP score (|$Score$|) of each |$Im{g_{\rm G}}$| by the average CLIP score (|$Average{\rm{\,\,}}\textit{Score}$|) of the actual project images of each architect, designated as the targets in this study (Equation 8). If the |$Similarity( {Im{g_{\rm G}}} )$| reaches the target similarity (|$Tsm$|), the |$Im{g_{\rm G}}$| is classified into the target group (Equation 9). This evaluation was performed on a randomly selected sample of 100 |$Im{g_{\rm G}}$| for each architect, with the |$Tsm$| set at 90%:
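With $Score(\cdot)$ denoting the CLIP score, this evaluation can be expressed as:

$$Similarity(Img_{\rm G}) = \frac{Score(Img_{\rm G})}{Average\ Score(Img_{\rm t})} \times 100\%$$

$$Img_{\rm G} \in \text{target group} \quad \text{if} \quad Similarity(Img_{\rm G}) \ge Tsm$$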
The generated results were also qualitatively evaluated based on three criteria regarding how well they reflected the |${P_{\rm t}}$|. With respect to the main description prompt, the evaluation focused on (i) style fidelity, determining how accurately the design characteristics of specific architects were represented, and (ii) domain fidelity, assessing whether the distinctive features of a particular building type, in this case, a residential house, were accurately reflected. Additionally, the photorealistic rendering prompt set, used across all the tests, was examined for (iii) image quality, assessing how closely the graphic style, composition, and resolution matched the desired output.
The results of these tests showed that the current SD model mostly achieved high domain fidelity and image quality. However, variations in style fidelity were observed between different architects, regardless of their prominence in the field. Figure 3 illustrates the proportion of images, among 100 sample images per architect, that showed a |$Similarity$| of ≥90%. According to Fig. 3, for eight of the architects, a majority of the |$Im{g_{\rm G}}$| showed a |$Similarity$| of less than 90% compared with the |$Average{\rm{\,\,}}\textit{Score}$| of the actual project images of each architect. For these architects, the generated images exhibited generic Western-style residential houses with relatively lower image quality and detail (Table 2). To address the limited recognition of certain architects’ styles, additional training of the existing image generation model is required. Hence, we conducted additional training by defining these architects’ design styles and features.

Pretrained model’s performance for each architect’s style (percentage of |$Im{g_{\rm G}}$| belonging to |$Im{g_{\rm t}}$| within each sample).
4. Definition of Architects’ Styles
4.1. Operational definition of architects’ styles
According to Schapiro (1961), style comprises constant forms, elements, qualities, and expressions. These characteristics are used to distinguish differences between periods, groups, or individual designers (Ackerman, 1963; Chan, 1992; Crook, 1987; Smithies, 1981). Chan (1994) defined style as the set of common features present in artifacts, introducing a taxonomic approach to defining architectural styles in his study. Various qualitative and quantitative methods, including Chan’s, have been employed to define design styles in diverse fields (Huang et al., 2016). In industrial design, Hyun et al. (2015) quantified car styles, adapting Chan’s methodology.
Building on these studies, this research aims to define an architect’s style based on established concepts and use it to train and generate images. According to Chan (1994), style is composed of physical forms, patterns, or distinct characteristics. A style can be quantified by measuring the similarity between projects based on the repetition of common features across projects. A higher frequency of features contributes to a more coherent and strongly recognizable style, though certain features are more effective than others.
Drawing from the aforementioned concepts, in this section, an architect’s style can be defined as follows:
Within the architect’s style (|${S_{\rm A}}$|), various visual features exist (Moussavi, 2015). However, this study places emphasis on the form (|$form$|), materiality (|$materiality$|), and structure (|$structure$|) features. The |$form$| feature pertains to formal characteristics, such as whether the geometry is predominantly curved or straight (Ching, 2023); the |$materiality$| feature denotes a visually prominent aspect, encompassing the primary materials employed (Hartoonian, 2016); and the |$structure$| feature encompasses the connectivity of interior and exterior spaces, based on systems such as framing and load-bearing systems (Sandaker et al., 2022). Each style exhibits distinct degrees and measurements of its features. When a weight (W) is applied to a style, W influences each feature.
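In symbolic terms, the style and its weighted application can be summarized as:

$$S_{\rm A} = \{\, form,\ materiality,\ structure \,\}$$

$$W \cdot S_{\rm A} = \{\, W \cdot form,\ W \cdot materiality,\ W \cdot structure \,\}$$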
Similar to Chan’s (1994) research, we focused on interpreting each architect’s style through features rather than the substance of styles. While the degree and measurement of style are not extensively covered in this study, we can still control the intensity of a style, which is proportional to each defined element, as shown in Equation (11). This phenomenon is visually illustrated in Fig. 4, specifically with respect to form.

Txt2img generation applying the Zaha Hadid style with different weights (W).
4.2. Fusion of architects’ styles
In this section, we employed the SD model to implement design fusions, integrating various architectural styles. The aim was to observe how each feature of a style, as defined in the previous section, influences other styles. To conduct style fusion, involving merging, extracting, and adjusting weight (W), we used the |$\mathrm{ SDP}$| established in Table 3. The supplementary |${P_{\rm t}}$|, photorealistic rendering prompt set, and |$Para{m_{\rm G}}$| used for image generation were the same as used in previous tests. Based on the fusion results, we observed how each feature of a style and its associated weight influence and interact with other styles.
| SDP | Positive prompt | Negative prompt |
|---|---|---|
| Merge (A + B) | Architect A and Architect B-inspired residential house | None |
| Extract (A − B) | Architect A-inspired residential house | Architect B’s design features |
| Weight (W) | Utilize (parentheses) for the words and place a colon and a number between 0 and 2 next to them, with 1 representing 100% effectiveness. | — |
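Applying the rules in Table 3, style fusion reduces to prompt construction. The strings below are illustrative examples of the merge, extract, and weight operations; the architect pair mirrors the Kahn–Gaudi example discussed later in this section.

```python
# Merge (A + B): both styles appear in the positive prompt.
merge_positive = "Louis Kahn and Antoni Gaudi-inspired residential house"

# Extract (A - B): style A in the positive prompt, style B's features in the negative prompt.
extract_positive = "Louis Kahn-inspired residential house"
extract_negative = "Antoni Gaudi's design features"

# Weight (W): wrap the words in parentheses and append ":number" (0-2, with 1 = 100% effectiveness).
weighted_positive = "(Louis Kahn-inspired:1.3) and (Antoni Gaudi-inspired:0.7) residential house"
```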
Through style fusion using image generation AI, we were able to identify the rough rules that govern how each feature interacts with others. When merging two or more styles, the features are combined in a visually harmonious way. Adjusting W to prioritize one style over the other resulted in the distinct traits of that style being prominently reflected. However, the feature extraction between styles worked only when there were similarities or overlapping features between them; otherwise, there was no visual impact.
As depicted in Fig. 5, the architectural styles of Louis Kahn and Antoni Gaudi are contrasting: Gaudi’s style showcases curvilinear shapes, employs a variety of colors and mosaics, and features a more closed structure, whereas Kahn’s design style emphasizes rectilinear shapes and predominantly employs concrete, resulting in an overall monochromatic appearance with more open elements such as cloisters.

In the fusion of these contrasting styles, as shown in the lower section of Fig. 5, characteristics of both styles blend together depending on the value of W. While most results reflected Kahn’s monochromatic materiality and structure system, a greater application of Gaudi’s style highlighted one of his key features: organic and curvilinear forms. Conversely, when Kahn’s style was more pronounced, the curvature was restrained, leading to a more subdued expression. Through design fusions, it was observed that architects’ styles can be proportionally applied and can be visually distinguished. This process can also help achieve new design styles where the overall design features of both styles are harmoniously combined, by adjusting weights associated with either.
5. Additional Training of the Model for Architectural Visualization
5.1. Additional training of existing model
This study focuses on conducting additional training, particularly using the low-rank adaptation (LoRA) method (Hu et al., 2021), to generate images that belong to |$Im{g_{\rm t}}$|. The LoRA method reparameterizes the weight matrices used for updates by focusing on specific targets rather than updating all of the model’s weights. This approach is advantageous as it reduces computational costs and memory usage, while also remaining effective with smaller datasets. The resulting LoRA model is compact and can be efficiently swapped and utilized across multiple base models.
If the majority of generated images (|$Im{g_{\rm G}}$|) do not belong to the target image (|$Im{g_{\rm t}}$|) group, the existing model (M) needs to be replaced with an alternative model (|$M{\rm{^{\prime}}}$|). In this study, the performance of M is assessed by calculating the |$Similarity$| of |$Im{g_{\rm G}}$| to |$Im{g_{\rm t}}$| based on CLIP scores. Accordingly, if, among the number (n) of randomly selected |$Im{g_{\rm G}}$|, the proportion whose |$Similarity$| exceeds the target similarity (|$Tsm$|) fails to reach the majority criterion (|$\mu $|), M needs to be replaced with |$M{\rm{^{\prime}}}$|, as described in Equation (12). For all architects, |$Tsm$| is uniformly set at 90%, and |$\mu $| is set at 70% to evaluate the accuracy of the model’s output along with the consistency and stability of the results. To improve accuracy and stability, |$M{\rm{^{\prime}}}$| can be either substituted or upgraded through additional training.
The trained model for the target (|${M_{\rm t}}$|) can be generated with the “|$train( \,\, )$|” function, using the base model (M), hyperparameters (|$Hyperparam$|) to control the training process, and a training dataset specific to the target (|${D_{\rm t}}$|):
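Symbolically:

$$M_{\rm t} = train(M,\ Hyperparam,\ D_{\rm t})$$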
Among these, |$Hyperparam$| significantly influences the model’s learning process and the subsequent performance of |${M_{\rm t}}$|. These |$Hyperparam$| involve diverse and extensive settings, with many detailed parameters. However, in this study, we focused on three key hyperparameters: train batch size (|$B{S_{\rm t}}$|), epochs (|$epoch$|), and learning rate (|$\alpha $|). |$B{S_{\rm t}}$| refers to the number of samples processed together in each training iteration; |$epoch$| represents the number of complete passes over all datasets during training; and |$\alpha $| determines the learning step size between iterations, controlling the training speed and the rate at which error and loss decrease. These |$Hyperparam$| play a crucial role in shaping the training process and ultimately impact the effectiveness of |${M_{\rm t}}$|:
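That is:

$$Hyperparam = \{\, BS_{\rm t},\ epoch,\ \alpha \,\}$$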
To accurately generate the target images, a systematic additional training method was proposed, as illustrated in Fig. 6, based on the previous definitions. The additional training process consists of two steps: (i) dataset preparation, which involves data collection, preprocessing, and keyword extraction, and (ii) model training, in which the prepared dataset is added to the base model using predefined hyperparameters. Through this process, we obtained a trained LoRA model that has learned the target characteristics.

By applying this model (|${M_{\rm t}}$|) to the existing image generation function, images that closely resemble |$Im{g_{\rm t}}$| with a higher similarity than before were obtained. When using |${M_{\rm t}}$|, it is necessary to input the application weight (W), a value between 0 and 1, where 0 represents 0% and 1 represents 100%.
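In a diffusers-based workflow (an assumption; the study does not name its runtime), attaching a style-trained LoRA file and scaling its contribution by W might look like the sketch below. The file name and scale value are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the style-trained LoRA (M_t); the file name is hypothetical.
pipe.load_lora_weights("loras", weight_name="louis_kahn_style.safetensors")

image = pipe(
    "Louis Kahn style residential house, photorealistic rendering, highly detailed, 8k",
    negative_prompt="low quality, watermark, blurry, text",
    width=1024, height=512,
    cross_attention_kwargs={"scale": 0.8},  # application weight W = 0.8 (80% of the trained style)
).images[0]
image.save("kahn_lora_house.png")
```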
5.2. Data preparation for additional training
Few-shot learning requires high-quality training data with consistent content. For additional training using the LoRA method (Hu et al., 2021), a dataset (|${D_{\rm t}}$|) containing image data (|$Im{g_{\rm D}}$|) and corresponding annotation text data (|$Tx{t_{\rm D}}$|) is essential. |${D_{\rm t}}$| for additional training is defined as follows:
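In set form:

$$D_{\rm t} = \{\, ( Img_{\rm D},\ Txt_{\rm D} ) \,\}, \qquad Txt_{\rm D} = getannotation(Img_{\rm D})$$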
To ensure high-quality image data and content consistency between them, careful selection of images representing the target is crucial. The |$Im{g_{\rm D}}$| should align with the main description prompt, the desired composition, and the desired image quality. It is also important to avoid images that include excessive information, as it might interfere with the training process. Preprocessing steps such as image resizing and cropping help eliminate unnecessary content beforehand.
The text data, denoted as |$Tx{t_{\rm D}}$|, is always trained in conjunction with the corresponding |$Im{g_{\rm D}}$|. |$Tx{t_{\rm D}}$| is extracted from |$Im{g_{\rm D}}$| using the “getannotation()” operator, describing the target content and characteristics present within the |$Im{g_{\rm D}}$|. To ensure a successful and efficient learning process, it is crucial that the |$Tx{t_{\rm D}}$| accurately and clearly describes the |$Im{g_{\rm D}}$|, based on three components, as specified in Table 4. They include the representative name (N); the annotation of specific features (|$SF$|), which covers the three features of style defined in Section 4.1; and the annotation of general features (|$GF$|). Including abstract content in the |$Tx{t_{\rm D}}$| can be beneficial; however, it is essential to include objective information that visually distinguishes and supports such intangible aspects.
| Component | Description | Example for architect’s design style |
|---|---|---|
| Representative N | A pronoun or a word that activates the trained model. This component is essential. | Architect’s name, artist’s name, interior design style name, etc. |
| Annotation of |$SF$| | Specific tangible and abstract features that distinguish the target from others. These annotations may repeat throughout the training dataset. | Form, materiality, structure, architectural components, idea, theory, movement (e.g., modernism), emotion, etc. |
| Annotation of |$GF$| | Description of visual features about both the target and its context that do not belong to |$SF$|. These annotations are general, can vary, and may not repeat. | Secondary materiality, a place where the project seems to be located, hour, presence of vegetation, etc. |
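As a concrete illustration of the annotation structure in Table 4 (using keywords similar to the SANAA example in Table 5), a single |$Tx{t_{\rm D}}$| file paired with one training image might be written as follows; the wording and file names are illustrative rather than actual entries from the study’s dataset.

```python
# Illustrative Txt_D caption for one Img_D file (e.g., sanaa_001.txt paired with sanaa_001.png).
annotation = (
    "SANAA style, "                                            # representative name N
    "minimalist, transparency, glass walls, white color, "     # specific features SF
    "fine steel columns, thin ceilings, curved shape, "
    "trees in the background, grass on the ground, sunny day"  # general features GF
)
with open("sanaa_001.txt", "w") as f:
    f.write(annotation)
```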
5.3. Additional training for architects’ styles and features
Additional training was conducted focusing on architects whose styles showed low or no similarity in the image generation test described in Section 3.2. Few-shot learning was implemented using the previously defined additional training method, and the performance of the default model (M) was compared with that of the trained model (|${M_{\rm t}}$|) by generating images with each model. The images were generated using the parameters (|$Para{m_{\rm G}}$|) and prompts (|${P_{\rm t}}$|) described in Section 3.2 and Equation (16).
For certain architects, M generated images (|$Im{g_{\rm G}}$|) with an average |$Similarity$| of less than 90% compared with the target images (|$Im{g_{\rm t}}$|). Thus, when |${M_{\rm t}}$| was not applied, the specific features of those styles were not represented. However, when the trained model was used, these features were correctly displayed in proportion to the weights (W) assigned. As shown in Fig. 7, the average |$Similarity$| between |$Im{g_{\rm G}}$| and |$Im{g_{\rm t}}$|, as well as the proportion of images with a |$Similarity$| of ≥90%, increased significantly after using |${M_{\rm t}}$|. In the case of Louis Kahn, the |$Similarity$| level with the target improved by approximately 16% when |${M_{\rm t}}$| was applied at 100%, and the proportion of images belonging to the target group increased by approximately 3.4 times. After the additional training, the model was able to generate images applying architects’ styles and features, as well as combining them for style fusion.

Txt2img generation test results with different weights of the Louis Kahn style-trained model.
6. Demonstration
6.1. Overview of demonstration
Throughout this research, we observed that image generation AI can rapidly produce high-quality architectural visualization based solely on textual prompts. When applied in architecture, this technology allows architects to effortlessly generate design reference images and visualizations from the very initial stages of the design process. This section demonstrates the practical application of image generation AI, particularly SD, with various architects’ styles, focusing on different types of residential building visualization.
First, additional training expands the image generation model’s capabilities and stylistic spectrum, extending it to architects’ styles that the existing model may not recognize. Using txt2img generation, users can generate exterior visualization images of buildings with architects’ styles and features from text, building a reference database. This approach can be used to obtain architectural visualization images with a single architect’s style applied to building exteriors, and it enables the combination or extraction of different styles to create new alternatives.
By applying img2img generation to massing models produced during the initial design phases, we could instantly generate rendering images from various viewpoints. A user-friendly interface was also demonstrated, allowing users to apply this img2img technology more conveniently beyond text-based outputs.
6.2. Additional training and architect’s style and feature model files
The implementation of design styles and features of various architects in image generation AI is described in this section. With additional training, the image generation AI allows users to easily obtain desired images according to their needs, even with a small dataset. This additional training was demonstrated based on Equation (14), targeting architects with low similarity rates in the default model (M). Figure 8 presents the steps of additional training for the selected architects: (i) data preparation, including preprocessing and keyword extraction, and (ii) additional training with the dataset (|${D_{\rm t}}$|). By following this procedure, the additional training aimed to enhance the model’s ability to generate users’ desired images that accurately reflect each architect’s distinctive features and characteristics.

Additional training process (example of SANAA style). Developed from Fig. 6.
As shown in Fig. 8, the image data (|$Im{g_{\rm D}}$|) includes photographs of the projects from reputable sources, such as the architects’ official websites and globally recognized architecture media platforms [e.g., Archdaily (2008), DIVISARE (1998), and Dezeen (2006)]. We aimed to include every project undertaken by the architects. To ensure high-quality training images, we selected representative photographs for each project based on two criteria: (i) the entire facade of the architectural structure is visible, and (ii) the photograph uses a two- or one-point perspective. Additionally, we preprocessed the collected images to optimize the training process, resizing the images and cropping out unnecessary elements in the surroundings that could potentially interfere with learning the target architect’s style and features.
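The resizing and cropping step described above can be scripted; the snippet below is a minimal sketch using Pillow, where the square 512-pixel target size is an assumption typical of SD v1.5 training data rather than a value reported in the paper, and the folder names are illustrative.

```python
from pathlib import Path
from PIL import Image

SIZE = 512  # assumed training resolution; not stated explicitly in the paper
src, dst = Path("raw_images"), Path("dataset")
dst.mkdir(exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square to trim distracting surroundings, then resize.
    w, h = img.size
    s = min(w, h)
    img = img.crop(((w - s) // 2, (h - s) // 2, (w + s) // 2, (h + s) // 2))
    img = img.resize((SIZE, SIZE), Image.LANCZOS)
    img.save(dst / f"{path.stem}.png")
```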
Existing interviews with architects and experts, as well as precedent research on the architects’ styles and projects, were used to construct the text data (|$Tx{t_{\rm D}}$|) for each image. Each |$Tx{t_{\rm D}}$| for training the architects’ styles consisted of three categories of annotation (Table 5), based on Table 4 from Section 5.2. First, we appended “style” to the target architect’s name (e.g., SANAA style, Louis Kahn style, etc.) as the representative label for this additional training. Based on prior research and interviews, frequently used keywords related to the architect’s style, including its form, materiality, and structure, were selected. Additionally, objectively observable visual features, such as secondary materials, weather, and the surrounding environment, were included. The generated |$Tx{t_{\rm D}}$| files were saved with the same names as the corresponding |$Im{g_{\rm D}}$| files and trained together as one |${D_{\rm t}}$|.
| Component | Reference source | Used image annotations |
|---|---|---|
| Representative N | SANAA (1995) | SANAA style |
| Annotation of |$SF$| | The extensive use of uniform skins is evident. White, homogeneous surfaces are often used. The use of poured concrete and other uniform materialities can be assimilated to white. Repeating densely small steel columns or other structural elements, they transform the void into a porous solid (Vandenbulcke, 2012). | Minimalist, simplicity, elegance, sensitivity, transparency, translucency, openness, homogeneous, monolith, horizons, glass walls, curved shape, white color, fine steel columns (pilotis), thin ceilings, repetition, etc. |
| Annotation of |$GF$| | Based on observation | Bush, trees in the background, grass in the ground, in the park, sunny days, etc. |
The additional training was conducted with eight architects who exhibited considerably low similarity rates in the preliminary image generation test: SANAA, Renzo Piano, I.M. Pei, Le Corbusier, Shigeru Ban, Tadao Ando, Luis Barragan, and Louis Kahn. A relatively small |${D_{\rm t}}$| was created for each architect based on the aforementioned process, depending on the number of their real projects. The prepared |${D_{\rm t}}$| were added to the M, the pruned v1.5 checkpoint, using the DreamBooth LoRA approach. The |$Hyperparam$| used for training included a train batch size of 1, 100 epochs, and a learning rate of 0.0001. The training duration ranged from 25 to 40 min on a local PC, proportional to the size of |${D_{\rm t}}$|. Consequently, a single trained model (|${M_{\rm t}}$|) file with the safetensors extension was generated for each architect.
When the |${M_{\rm t}}$| file was applied to the M (Equation 16), it generated architectural exterior images that closely resembled the design styles of the architects, unlike when using the M alone. As demonstrated by the training example for the SANAA style in Table 6, it was possible to accurately depict the design features of architects and implement their specific styles by training on a dataset of around 165 images within a short period. A total of eight |${M_{\rm t}}$| files were constructed, each implementing the style of one of the eight architects. All |${M_{\rm t}}$| files, added to the M, could generate high-quality images comparable with the output images shown in Section 6.3.


The performance of the |${M_{\rm t}}$| was evaluated by calculating the |$Similarity$| of generated images (|$Im{g_{\rm G}}$|) to their target. Figure 9 presents the |$Similarity$|-based performance evaluation results of |${M_{\rm t}}$| and M for the architects with low performance as described in Section 3.3. According to Fig. 9, the proportion of images with a similarity of ≥90% increased by ∼5 times on average after using |${M_{\rm t}}$|, despite being generated with the same prompts and parameters. These results suggest that, under the same conditions, |${M_{\rm t}}$| can generate images that more accurately and effectively reflect the prompts, thereby enhancing the visualization process and quality.

Additionally, a survey was conducted to qualitatively measure and validate the performance of |${M_{\rm t}}$| using the human evaluation score (HES). Each question in the survey presented one image generated by |${M_{\rm t}}$| and one by M, using the same prompts and parameters. Participants were asked to choose, of the two images, the |$Im{g_{\rm G}}$| that better matched the provided description of the architect’s style (Fig. 10). The HES for each model was calculated as the percentage of times the image generated by that model (|${{\mathit{ Img}}}_{{{\rm{G}}_{{i}}}}^{{M}}$|) was chosen out of the total number of responses [the number of participants (N) multiplied by the number of questions (Q)], as shown in Equation (22):
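With $chosen(\cdot)$ as an indicator of whether a given image was selected (the function name is introduced here only for readability), the score can be written as:

$$HES(M) = \frac{\sum_{i=1}^{N \times Q} chosen\!\left( Img_{{\rm G}_i}^{M} \right)}{N \times Q} \times 100\%$$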

A survey consisting of 80 multiple-choice questions (10 questions per architect) was conducted with 21 professionals in architectural design, focusing on the eight architects’ styles for which the models were additionally trained. The results of the survey, including the individual HES for each architect and the total HES, are presented in Table 7. All eight |${M_{\rm t}}$| demonstrated higher HES compared with the M. On average, ∼94.88% of the images generated based on |${M_{\rm t}}$| (|${{Img}}_{\rm{G}}^{{{{M}}_{\rm t}}}$|) were selected as better reflecting the styles of each architect. This indicates that |${M_{\rm t}}$| can capture the nuances and subtleties of architectural styles from the perspective of human perception and judgment.
| Category | I.M. Pei | Luis Barragan | Le Corbusier | Louis Kahn | Renzo Piano | Shigeru Ban | SANAA | Tadao Ando | Total |
|---|---|---|---|---|---|---|---|---|---|
| HES(M) | 3.81 | 1.43 | 3.33 | 7.62 | 8.57 | 5.71 | 3.33 | 7.14 | 5.12 |
| HES(M_t) | 96.19 | 98.57 | 96.67 | 92.38 | 91.43 | 94.29 | 96.67 | 92.86 | 94.88 |
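To make Equation (22) concrete, the following short sketch computes HES from a list of pairwise choices; the tallies are hypothetical (chosen to be consistent with the I.M. Pei column of Table 7, with 21 participants × 10 questions) and only the arithmetic follows the definition above.

```python
# Worked illustration of Equation (22): HES as the percentage of pairwise choices won.
# `responses` is a hypothetical record of which model's image each participant chose
# on each question; only the arithmetic mirrors the definition in the text.
def human_evaluation_score(responses, model_name):
    """HES(model) = (# times the model's image was chosen) / (N * Q) * 100."""
    total = len(responses)                 # N participants x Q questions entries
    chosen = sum(1 for choice in responses if choice == model_name)
    return 100.0 * chosen / total

# Hypothetical tallies consistent with the I.M. Pei column: 21 x 10 = 210 responses.
responses = ["M_t"] * 202 + ["M"] * 8
print(human_evaluation_score(responses, "M_t"))   # ~96.19
print(human_evaluation_score(responses, "M"))     # ~3.81
```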
6.3. Architects’ design styled image generation
This section outlines the acquisition of creative reference images that reflect architects' design styles. Using the image generation approach proposed in this research, users can generate the desired exterior images of buildings reflecting specific architects' styles within a short timeframe. Furthermore, image generation AI allows users to create and obtain a wider range of architectural image references by merging or extracting two or more architects' styles.
Twenty internationally known architects were selected for this study, including recipients of the Pritzker Prize, often referred to as the Nobel Prize of architecture, and architects who have had a significant influence on design. We generated architectural visualizations applying their styles based on Equations (1) and (16). The pruned v1.5 checkpoint was used as the base M, and for architects with lower similarity, a trained model (|${M_{\rm t}}$|) from the previous section was additionally employed. The detailed text prompts (|${P_{\rm t}}$|) and generation parameters (|$Para{m_{\rm G}}$|) used for each architect's style and their fusion are specified in Tables 8 and 9. For style fusion, |${P_{\rm t}}$| was composed following the rules outlined in Section 4.2, as summarized in Table 3. Except for the content prompts, all other conditions were kept the same to compare the results of applying a single style and multiple styles. Each image took 5 seconds on average to generate in the local PC environment.
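The exact prompt-composition rules are given in Section 4.2 and Table 3 and are not restated here; the sketch below is a hypothetical illustration of how single, merged, and extracted style prompts might be assembled, and may differ from the authors' rules.

```python
# Hypothetical illustration of composing text prompts (P_t) for single, merged, and
# "extracted" styles; the paper's actual composition rules are given in Section 4.2
# and Table 3 and may differ from this sketch.
BASE_CONTENT = "exterior of a two-storey residential house, photorealistic rendering"

def single_style(architect: str) -> str:
    return f"{BASE_CONTENT}, in the style of {architect}"

def merged_styles(a: str, b: str) -> str:
    # Fuse two architects' styles in one prompt.
    return f"{BASE_CONTENT}, in the style of {a} and {b}"

def extracted_style(keep: str, remove_trait: str) -> tuple[str, str]:
    # Keep one style while suppressing a shared trait via the negative prompt.
    positive = f"{BASE_CONTENT}, in the style of {keep}"
    negative = remove_trait                 # e.g., "exposed concrete"
    return positive, negative

print(merged_styles("Zaha Hadid", "Shigeru Ban"))
print(extracted_style("Louis Kahn", "exposed concrete"))
```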




Tables 8 and 9 provide evidence that the majority of these images accurately capture the architectural characteristics and elements associated with each architect's style. Even in cases where an architect had no prior experience with residential projects, the images maintained the scale and programmatic characteristics of residential buildings. When distinct styles were merged, various features, such as form, materiality, and structure, were moderated, diminished, or enhanced. When Zaha Hadid's style was merged with Shigeru Ban's, Zaha Hadid's curvilinear form was moderated while Shigeru Ban's wooden grid shell was added and emphasized. When one style was extracted from the other, the characteristic common to the two architects, concrete as a material, was removed, resulting in a residential house image with an entirely different metallic material; Louis Kahn's distinctive form was retained, as it does not overlap with Tadao Ando's style.
Visualizations of residential building exteriors were generated for all 20 individual styles, producing high-quality images that effectively captured each architect's characteristics through |${M_{\rm t}}$|. Furthermore, numerous style fusions were implemented, ultimately generating ∼11 fused architectural styles by combining the styles of nine architects. These generated outputs (|$Im{g_{\rm G}}$|) can offer a range of ideas and inspirations even in the initial phases of architectural design, facilitating rapid and effective communication throughout the design process.
6.4. AIBIM-Design: AI-assisted rendering tool
In this section, we introduce AIBIM-Design, a user-friendly interface that models building masses and generates images from these masses using the img2img method. With AIBIM-Design's main interface, depicted in Fig. 11, users can automatically model building mass alternatives according to their needs, such as the floor area ratio, building regulations, number of floors, and total area, extracted from the input site plan. Additionally, users can manually modify the model later or even draw and model the blueprint themselves from the start. Once the model is complete, the img2img rendering interface within the same platform allows users to create high-quality visualization images.
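The mass-generation logic of AIBIM-Design is not published in this paper; the toy sketch below only illustrates the kind of arithmetic such a feature involves, assuming a simplified relation between site area, floor area ratio (FAR), and building coverage ratio.

```python
# Toy illustration (not AIBIM-Design's implementation) of deriving simple mass
# alternatives from site area, floor area ratio (FAR), and building coverage ratio.
def mass_alternatives(site_area_m2: float, far: float, coverage_ratio: float,
                      floor_height_m: float = 3.0):
    """Yield (footprint, floors, height) options that respect FAR and coverage limits."""
    max_total_floor_area = site_area_m2 * far        # FAR caps the total floor area
    max_footprint = site_area_m2 * coverage_ratio    # coverage ratio caps the footprint
    for floors in range(1, 6):
        footprint = min(max_footprint, max_total_floor_area / floors)
        yield {"floors": floors,
               "footprint_m2": round(footprint, 1),
               "height_m": floors * floor_height_m,
               "total_floor_area_m2": round(footprint * floors, 1)}

# Example: a 300 m2 site with FAR 1.5 and 60% building coverage.
for option in mass_alternatives(site_area_m2=300, far=1.5, coverage_ratio=0.6):
    print(option)
```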

Main interface of AIBIM-Design. (1) Drawing area with view control bar; (2) design and drawing tool palette; (3) properties palette; and (4) spatial information browser.
The img2img rendering interface, shown in Fig. 12, enables real-time manipulation and exploration of various perspectives of the 3D model. Once users choose their desired view, they can easily generate visualizations from that perspective in seconds. This is achieved by selecting preferred architects’ styles and providing the details through prompts and parameters in the corresponding section. Users can experiment with alternative images multiple times until they attain the desired outcome. Once they achieve an image that closely aligns with their target, they can save the original image file.

Img2img rendering interface in AIBIM-Design. (1) Trained model (|${M_{\rm t}}$|) options and prompt (|${P_{\rm t}}$|) input box; (2) parameter (|$Para{m_{\rm G}}$| and |$Para{m_{\rm P}}$|) setting bar; (3) 3D model linked-seed image selection area; and (4) output preview area.
The scenarios below describe the visualization of an architectural mass model using the img2img method, specifically through the AIBIM-Design renderer tool. Users input an image of the mass model as the seed image, along with textual prompts (|${P_{\rm t}}$|) describing their requirements and preferences, and the visualizations are then rendered. With this technique, architects can easily generate images, accelerating decision-making and communication.
Image generation was conducted based on Equations (2) and (17) using three seed images (|$Im{g_{\rm S}}$|) with different perspectives, as shown in Fig. 13, by applying the individual style of each architect and by combining different architects' styles. The results (Tables 10 and 11) revealed that, although the predetermined building volume limited how fully the styles could be implemented, elements such as materials, structure, openings, and colors were successfully applied and reflected the corresponding styles. Using this interface with the img2img approach, architects can access more concrete design alternatives during the initial phase, helping them make informed decisions and refine their designs efficiently.
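For illustration only, the following sketch shows an img2img generation from a mass-model seed image (|$Im{g_{\rm S}}$|), assuming the diffusers img2img pipeline; the model paths, prompt, strength, and other parameters are placeholders rather than the |$Para{m_{\rm G}}$| values reported in Tables 10 and 11.

```python
# Minimal sketch of img2img generation from a mass-model seed image (Img_S), assuming the
# diffusers library; paths, prompt, and parameters are placeholders, not the values used
# for Tables 10 and 11.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "models/pruned-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("output/tadao_ando_lora")   # optional architect-style LoRA (M_t)

seed_image = Image.open("mass_model_frontal.png").convert("RGB").resize((512, 512))

image = pipe(
    prompt="residential house exterior in the style of Tadao Ando, exposed concrete",
    image=seed_image,
    strength=0.6,            # how far the output may depart from the seed mass model
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
image.save("rendered_alternative.png")
```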
Seed images for img2img generation. (1) Frontal perspective; (2) angular perspective; and (3) isometric view. [Captured from 3D model linked-seed image selection area of Img2img rendering interface in AIBIM-Design (part (3) of Fig. 12)].




There are several limitations to this demonstration. It was conducted within the scope of specific building types and styles, showcasing 20 architects' styles applied to residential buildings. The AI-generated images exhibit high-quality design alternatives for visualization; however, not all generated images belong to the target category, and in some cases, alternative designs that are practically infeasible for construction and realization may be produced. During the additional training of the architects' styles, the training data were extracted using information that was as objective as possible, and the majority of each architect's projects were considered. Despite these efforts, biases may remain in the additional training and in the stylistic fidelity of the generated images.
7. Conclusions
Visualization serves as a conduit for effective decision-making and communication. However, the process of visualization is difficult; it involves multiple intricate and sophisticated tasks. Driven by its significance and inherent complexities, this paper introduced a novel approach and a tool that leverages AI to create visual representations based on textual input. This approach involved additional training for styles with initially lower similarity rates, which required intensive data preparation and integration into the AI model. This technique has proven effective across multiple scenarios, significantly enhancing the efficiency and speed of architectural visualization image production. In this study, over 10 000 images were generated incorporating an architect’s personal style and characteristics into residential house models, to assess the base AI model’s effectiveness. The study highlights the vast potential of AI in design visualization, emphasizing a shift towards facilitating more user-centered and personalized design applications.
This research demonstrates how generative AI can transform the architectural visualization process, making it more efficient and responsive to individual styles. The developed additional training process ensures that the AI model can effectively learn and replicate specific architectural styles, improving the relevance and quality of the generated images. This approach allows for a broader range of visual representations, providing architects with powerful tools to explore and communicate their design ideas more effectively.
While our study shows promising results, it has limitations. The generated outputs are raster graphics images and do not include actual materials such as 3D model files. Not all generated images necessarily belong to the target category, and some designs may be impractical for construction. Additionally, biases in the supplementary training data may affect the fidelity of the generated images in terms of style.
Future research should focus on developing specialized training models based on more diverse and detailed variations in the training data for the enhancement of the model’s efficacy. Additionally, exploring other visualization forms by combining different AI models can lead to more systematic and multi-modal alternatives and representations, contributing to a more integrated and efficient design process.
Conflict of interest statement
The authors state that they do not have any known financial interests or personal relationships that could have influenced the findings of the study.
Author Contributions
Jin-Kook Lee (Conceptualization, Methodology, Visualization, Software, Project administration, Writing—original draft, Writing—review & editing), Youngjin Yoo (Investigation, Methodology, Visualization, Data curation, Software, Writing—original draft, Writing—review & editing), and Seung Hyun Cha (Visualization, Software, Investigation, Writing—review & editing).
Acknowledgments
This work was supported in 2024 by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant No. RS-2021-KA163269). This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Grant No. NRF-2022R1A2C1093310).
Data Availability
Most of the training data and/or models used in this study can be provided by the first or corresponding author (PI) upon reasonable request, along with the links in the references and the technical resource section. Additionally, the paper includes references to the archives and links provided by the PI. (Contact author: [email protected])