Chapter 1
Most of my career has been spent working across Italy, not just in one city or region, but throughout the entire country. If some aspects of what I describe seem unfamiliar, it may be because countries like Germany and the United States, as I've observed when working with companies abroad, often have more streamlined workflows and working methods. However, when budgets are tight, the challenges and constraints faced in Italy are common across the world.
A video mixer is a fundamental hardware or software device in audiovisual production that allows seamless and imperceptible switching between different video sources. While the user interface of mixers varies in complexity and functionality depending on the model, it shares a common basic structure. This allows professionals to have operational familiarity with its interface without having to learn a new tool from scratch.
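The basic operation behind that seamless switching, a cut or a dissolve between two sources, reduces to simple per-pixel arithmetic. Here is a minimal sketch in Python with NumPy; the frame sizes and the linear blend are illustrative choices of mine, not any specific mixer's implementation:

```python
import numpy as np

def crossfade(frame_a: np.ndarray, frame_b: np.ndarray, t: float) -> np.ndarray:
    """Linear dissolve between two frames.

    t = 0.0 returns frame_a, t = 1.0 returns frame_b,
    intermediate values blend the two sources.
    """
    # Blend in float to avoid uint8 overflow, then convert back.
    mixed = (1.0 - t) * frame_a.astype(np.float32) + t * frame_b.astype(np.float32)
    return np.clip(mixed, 0, 255).astype(np.uint8)

# A hard cut is just the degenerate case t = 0 or t = 1.
black = np.zeros((1080, 1920, 3), dtype=np.uint8)
white = np.full((1080, 1920, 3), 255, dtype=np.uint8)
half = crossfade(black, white, 0.5)  # a mid-gray frame
```

Everything a mixer does on top of this (wipes, keys, effects) is a variation on the same idea: a per-pixel function of two or more sources.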
In contexts such as conferences and conventions, signal processors or graphic mixers are often used. Although they don't offer all the functionalities of a complete video mixer, these devices allow the combination of video sources from computers (presentations, images, videos) with traditional video signals (cameras, media players), offering a simpler and more immediate solution. However, integrating RGB signals (from computers) with YUV signals (typical of cameras) has historically been a challenge due to differences in color models and the handling of chromatic information.
The YUV format, widely used in video systems, can lose color detail, especially in the red and blue components of the image, because the color information is subsampled. The notation 4:4:4, 4:2:2, etc., indicates how much of the original chrominance information is captured and transmitted. In a 4:4:4 system, all components (Y, U, V) have the same resolution; the first '4' refers to Y, the luminance, essentially a black-and-white image from which, combined with U and V, the green component is also derived. In 4:2:2 systems, the chrominance components (U and V, related to blue and red) are sampled at half the horizontal resolution of the luminance. Schemes such as 4:2:0 or 4:1:1 discard even more chrominance information, resulting in more noticeable quality loss.
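To make the subsampling concrete, here is a toy sketch of the 4:2:0 idea: the U and V planes keep one sample per 2×2 block and are stretched back for display. Real codecs apply proper RGB↔YUV conversion matrices and filtering; this deliberately simplified version only shows where the detail is lost:

```python
import numpy as np

def subsample_420(u: np.ndarray, v: np.ndarray):
    """Keep one chroma sample per 2x2 block, as in 4:2:0."""
    return u[::2, ::2], v[::2, ::2]

def upsample(plane: np.ndarray) -> np.ndarray:
    """Nearest-neighbour reconstruction back to full resolution."""
    return plane.repeat(2, axis=0).repeat(2, axis=1)

# A full-resolution chroma plane with fine detail: after the round trip,
# each 2x2 block shares a single value, while luminance is untouched.
u = np.arange(16, dtype=np.uint8).reshape(4, 4)
v = u.copy()
u_small, v_small = subsample_420(u, v)
u_rec = upsample(u_small)
```

In a 1080p frame this means the color channels effectively carry a 960×540 image, which is why saturated reds and blues are the first to show artifacts.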
Despite white balance adjustments, which calibrate what is rendered as white under the current lighting conditions, color rendering can still differ significantly between RGB and YUV systems. For instance, achieving a natural skin tone under certain lighting conditions might require setting the white balance in a way that compromises the accurate reproduction of other colors.
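As an illustration of that white-balance trade-off, a simple (and deliberately naive) correction scales each channel so that a measured reference patch becomes neutral; the gains that fix the white inevitably shift other colors too. The measured values below are hypothetical, not from any real camera:

```python
import numpy as np

def white_balance_gains(white_patch_rgb):
    """Per-channel gains that map a measured 'white' to neutral gray.

    white_patch_rgb: the RGB value a white card actually recorded
    under the current lighting (a hypothetical measurement).
    """
    patch = np.asarray(white_patch_rgb, dtype=np.float64)
    return patch.mean() / patch

def apply_gains(rgb, gains):
    return np.clip(np.asarray(rgb, dtype=np.float64) * gains, 0, 255)

# Under warm tungsten light a white card might record more red than blue.
measured_white = [240, 200, 160]
gains = white_balance_gains(measured_white)

# The white card is now neutral...
balanced_white = apply_gains(measured_white, gains)
# ...but a saturated red object is shifted as well, away from its true value.
balanced_red = apply_gains([200, 30, 30], gains)
```

The correction that neutralizes the white card dims the red channel and boosts the blue one everywhere in the frame, which is exactly the compromise described above.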
Another significant challenge in integrating video signals involves EDID (Extended Display Identification Data) when using interfaces such as HDMI or DisplayPort. EDID is a protocol that allows displays to communicate their capabilities to the source device, ensuring video signal compatibility. However, issues with EDID communication can lead to incompatibilities or a loss in image quality. Some of the most common problems associated with EDID include:
- Format Incompatibility: If a capture device does not properly support EDID, it may be unable to interpret or utilize video signals from an HDMI or DisplayPort source. This can result in issues such as missing signals, incorrect resolutions, or distorted images.
- Resolutions and Refresh Rates: Some capture devices may struggle to handle high resolutions or refresh rates if the monitor or video source sends incorrect or non-standardized information via EDID.
- Signal Blocking: In some cases, EDID can block the video signal if it detects a discrepancy between the monitor's declared capabilities and what is actually supported. This can prevent external capture devices from acquiring the signal.
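Many of these failures can be spotted by inspecting the raw 128-byte base block a display returns. The layout used below (fixed 8-byte header, packed manufacturer ID, checksum byte) follows the published EDID structure; the sketch is a minimal sanity check, not a full parser:

```python
def parse_edid_base(edid: bytes) -> str:
    """Validate an EDID base block and extract the manufacturer ID.

    Returns the three-letter PNP manufacturer code, or raises on the
    kinds of corruption that can produce 'no signal' style failures.
    """
    if len(edid) < 128:
        raise ValueError("EDID base block must be 128 bytes")
    # Fixed 8-byte header: 00 FF FF FF FF FF FF 00
    if edid[:8] != b"\x00\xff\xff\xff\xff\xff\xff\x00":
        raise ValueError("bad EDID header (garbled or truncated data)")
    # All 128 bytes must sum to 0 modulo 256 (byte 127 is the checksum).
    if sum(edid[:128]) % 256 != 0:
        raise ValueError("EDID checksum failed")
    # Bytes 8-9: three 5-bit letters packed big-endian ('A' = 1).
    word = (edid[8] << 8) | edid[9]
    letters = [(word >> shift) & 0x1F for shift in (10, 5, 0)]
    return "".join(chr(ord("A") + code - 1) for code in letters)
```

On Linux the raw block is typically readable from the files under /sys/class/drm/, which makes checks like this useful when a capture device refuses a source for no apparent reason.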
In the early days of digital video, much emphasis was placed on EDID because, unlike analog signals to TVs and projectors, which could be distributed freely, digital technology changed the rules. It did so alongside a somewhat murky marketing strategy that did not clearly distinguish between Full HD and HD Ready, labeling everything simply as HD. However, not all problems stem from EDID alone. The advent of digital also brought with it a content protection system, particularly active over HDMI and DisplayPort: HDCP (High-bandwidth Digital Content Protection).
HDCP interacts with EDID in that it includes information about the device's HDCP compatibility. Previously, one could freely take a video signal and distribute it to multiple devices, such as TVs, projectors, and recorders. Now, HDCP analyzes the EDID of each device to decide whether or not to transmit protected content. In cases of incompatibility, this can result in signal blockage. If a source device (like a Blu-ray player or gaming console) detects that the receiving device does not support HDCP or is not authorized, it may block the signal or reduce its quality. This is not a direct action of EDID but rather the HDCP system using some information provided by EDID.
Recorders and Capture Devices: Many professional video recording or capture devices are designed to be compatible with HDCP, allowing the recording of protected content under certain circumstances (e.g., for legitimate broadcast use). However, consumer capture devices may not have this authorization, resulting in a black screen.
The market has responded to these challenges by favoring signal processors or graphic mixers to manage signals from consumer devices using HDMI, DisplayPort, and DVI cables.
Projecting a presentation, an operation that seems straightforward, has become surprisingly problematic and costly in the modern professional context. Presenters often use their own PCs for presentations, with PowerPoint being the predominant software.
PowerPoint was originally developed by Forethought, Inc. and launched in 1987. Later acquired by Microsoft, it became an integral part of the Microsoft Office suite. In the 2000s, Microsoft collaborated with Apple and Adobe to develop the Office Open XML format, which includes the PPTX format introduced with PowerPoint 2007. This XML-based format was designed to improve compatibility, security, and file management compared to the previous binary formats.
However, despite PowerPoint's cross-platform compatibility, it brings with it a series of persistent issues:
- Incomplete Information Saving: PowerPoint may not always save a file with all the available information. Videos, audio, or images included in the presentation may not be physically available, or they may have incompatible codecs that prevent them from playing. This means that a meticulously prepared presentation on one computer may not function correctly on another device.
- Formatting Issues: The absence of the same font used to create the presentation can cause the formatting of the slides to be lost, making the result unviewable. Imagine seeing a presentation with text out of place, cut off, or in completely different fonts—the visual impact and professionalism are severely compromised.
- Specific Hardware Problems: Besides software issues, there are also hardware problems related to the presenters' PCs. The graphics card of some laptops may not support simultaneous output at two different resolutions, and the integrated monitor may not be Full HD. This can cause difficulties in mirroring the display or managing resolutions between the laptop and the projector.
- System Sounds and Updates: During a presentation, it is common to hear unwanted system sounds or be interrupted by software update notifications. There is nothing worse than seeing a presenter interrupt their speech to close an update window or disable a system alarm.
- Sudden Reboots: Another potentially catastrophic problem is the sudden reboot of the computer, often caused by automatic updates or system errors. This can interrupt a presentation at the worst possible moment, causing discomfort for both the presenter and the audience.
- Generic Incompatibilities: Finally, there are generic incompatibilities between different hardware and software. Not all laptops are built the same, and differences in drivers, operating systems, and configurations can cause unpredictable problems during the presentation.
So, why not let each presenter bring their own computer? As Bob Dylan would say, "The answer is blowin' in the wind!" While democratic, this approach would only multiply the above-mentioned problems, making the situation even more unpredictable.
The solution, of course, is to use Linux, SDI signals up to 100 meters, and fiber optics beyond 150 meters, completely forgetting all the problems of HDCP and EDID, and using software that allows for animated images and text compatible with the entire universe! Will this be what artificial intelligence brings us? At the moment, however, it's not possible to estimate when this simplified future will arrive. To solve these issues, mixed techniques are used, sometimes very bold, not without high costs and technical complexities.
For more information on the internal structure of PPTX files and the collaboration between Microsoft, Apple, and Adobe, see the Microsoft Learn documentation on the PPTX format.
Do graphic mixers solve these problems? Yes.
- Pros: They have powerful hardware that allows them to modify the signal taken from a consumer source and return a compatible signal. This doesn't mean that the same thing couldn't be done with converters; it's just that it was something companies already budgeted for when video was analog, so it has remained.
- Cons: They are often very expensive and are used not only to convert signals but also as full-fledged video mixers, without offering the same artistic operational capabilities.
The use of these machines has led to two schools of thought:
- Television: Graphics become a YUV video, turning into just another input for the video mixer.
- Graphic: The video mixer’s signal is inserted into the graphics. This operation presents various technical problems, often not resolved correctly, causing visible defects in the videos embedded within the graphic images.
Despite their convenience, power, and versatility, performing the typical operations of a video mixer on a signal processor is very complicated and requires specific experience. On the other hand, some video mixers have procedures that are not developed in a user-friendly way, making them a bit intimidating, especially at the beginning.
My idea is, therefore, to create, within the limits of possibility, a video mixing software that has a simple yet complete interface to economically carry out all those activities that take place in the conference field.
A software that can be used by anyone with minimal experience with video mixers, but also has a vocation for live compositing and, later on, why not, even for the animation of remote cameras, lights, and audio. I love silent films, but unfortunately, they are no longer in high demand, so I started studying how to sync audio with video, and sooner or later, I'll decide to include it.
I haven't found a single book on this subject, and it hasn't been easy to put this information together. What I want to write is a sort of introduction to what you will find in my code on GitHub OpenPyVision. In the landscape of technical and scientific literature, it is rare to find texts that specifically address the mathematics of video mixing. All the texts are focused on static images. This book stands out much like the famous "Al-Kitāb al-mukhtaṣar fī ḥisāb al-jabr wa-l-muqābala" by Muhammad ibn Musa al-Khwarizmi. Just as Al-Khwarizmi revolutionized mathematics by introducing algebra and providing tools to solve complex equations, this text aims to provide a deep and practical understanding of the issues related to video mixing.
The goal is to deviate from the traditional approaches found in other books on the subject, offering instead a direct and practical guide to understanding and solving the specific problems of video mixing. Just as "Al-Jabr" marked a turning point in the history of mathematics, this book aims to be a reference point for anyone wishing to delve into the mathematics applied to video mixing and video transitions.
My passion for video mixing stems from an eclectic professional journey. After studying film directing, I specialized in special effects and animation, delving into software like "Shake" and "Nuke" and taking online courses led by industry experts such as Steve Wright. However, in Italy, I found few opportunities to apply these skills in large-scale projects, which often required large teams and dedicated resources.
A turning point came just before the pandemic when a television production I was working with shut down. I seized the opportunity to explore new avenues, taking advantage of a basic income to deepen my knowledge in machine learning and data analysis. During this period of study and experimentation, I began to see connections between my skills in special effects and the new knowledge I was acquiring. I started using matrix computation libraries, similar to those used in machine learning, to process images and videos. I also experimented with ChatGPT to automate certain tasks and correct errors in my code.
In my spare time, I developed small personal projects, including an initial draft of a video mixing program. My enthusiasm for this project grew, and I began to think that perhaps I could actually create something useful and innovative. I continued to work in the audiovisual field, trying to integrate my new skills into my projects. I experimented with image processing algorithms and honed my programming abilities.
The idea for OpenPyVision was born from the desire to combine my passions and skills to create an accessible, intuitive, and powerful video mixing software that could meet the needs of professionals in the industry and simplify their workflow.
Why does cinema use 24 frames per second (fps), while video signals use 25, 30, 50, or even 60? The answer lies in a fascinating journey through the history of cinema and television, a path that takes us from the magic lanterns of the 17th century to today's smartphones.
Cinema is an invention with a rich and complex history, characterized by numerous pioneers around the world who, perhaps simultaneously, drew inspiration from ideas or machines seen at international fairs. This period also witnessed many legal battles to establish who the true inventor of cinema was. The Lumière brothers, Auguste and Louis, are often considered the pioneers of cinema, as they organized the first public film screening with their cinematograph on December 28, 1895. They showcased short documentaries like "Workers Leaving the Lumière Factory" and "The Arrival of a Train at La Ciotat Station." Their approach focused on realism and documenting everyday life.
Meanwhile, in the United States, Thomas Alva Edison and his assistant William Kennedy Laurie Dickson developed the Kinetoscope and Kinetograph in the 1890s. The Kinetoscope allowed a single person to view a short film through an eyepiece. Although Edison's inventions were innovative, his vision of cinema was more individualistic compared to the Lumière brothers, who favored collective public screenings. Historically, the two inventions that led to these ideas were likely the magic lantern and the zoetrope.
In the mid-1600s, the so-called "magic lantern" existed, which projected images painted on glass slides onto walls, illuminated by an oil lamp or candle. A sort of early slide projector, it is usually attributed to Christiaan Huygens, although references to friars and Jesuits discussing similar devices suggest something akin to it might already have existed. The zoetrope was a cylinder with a series of images that, when spun at a certain speed, created the illusion of movement. The praxinoscope combined these two principles, featuring a central cylinder with the image, a candle inside the cylinder, and the images projected onto a semi-transparent outer cylinder.
But why 24 fps? During the silent film era, movies were shot at variable speeds, often between 16 and 20 fps. Early cameras had a crank to advance the film, with a mechanism to help maintain a consistent speed, but it was still possible to speed up or slow down the motion during both filming and projection. Increasing the speed gave a comedic effect, while slowing it down heightened suspense. This choice likely stemmed from empirical considerations based on these manual machines sold as entertainment devices or showcased at fairs and town festivals, which recreated the illusion of movement.
With the advent of sound, it became necessary to standardize projection speed to ensure audio synchronization, and 24 fps proved to be an adequate speed for achieving smooth motion while keeping film costs manageable.
It's unclear who first had the idea of transmitting a record via radio or in which country it became popular first—perhaps in England or the United States. Certainly, the managers of early radio stations quickly recognized the potential of this new invention.
In the early decades of cinema, films were silent, accompanied by live music performed in the theater. This limited the cinematic experience, as audiences could not hear dialogues or sound effects. While some argued that silent films represented an art form, major investors did not fully agree.
The broadcast of music via radio quickly became popular. Radio stations began regularly broadcasting music and producing live shows, such as radio dramas, reaching a wide and diverse audience. This changed the entertainment landscape, making music accessible to anyone with a radio.
The popularity of radio sparked competition among entrepreneurs in various entertainment sectors, including film producers. The public became accustomed to hearing voices and music broadcast in real-time, and this expectation transferred to cinema, where the demand for films combining images and sounds began to emerge. The breakthrough came in 1927 with the film "The Jazz Singer," produced by Warner Bros. This film used the Vitaphone system, which synchronized a recorded soundtrack on a disc with the projected images. "The Jazz Singer" featured synchronized dialogue and song segments, offering audiences a new cinematic experience.
The success of "The Jazz Singer" marked the beginning of the sound film era. Film studios quickly adopted the new technology, transforming how films were produced and consumed. Sound films allowed for richer and more engaging storytelling, with dialogue, music, and sound effects enhancing the visual experience. This transition was not without challenges: studios had to address technical issues such as soundproofing sets and using bulky microphones, while many silent film actors struggled to adapt. The impact of these innovations extended beyond entertainment: governments and authoritarian regimes recognized the strategic potential of the new technologies.
Technological experiments were not limited to the fields of cinema and radio. States and governments recognized the strategic importance of these innovations. In Italy, for example, radio became essential for military communications, and early experiments in television broadcasting were interrupted.
In Germany, however, Adolf Hitler's Nazi regime exploited new technologies for propaganda. During the 1936 Olympic Games, Germany experimented with the first television broadcasts, using technology to promote the regime's image and demonstrate its technical supremacy.
Initially, television was mechanical. The invention is often credited to Paul Nipkow, a German engineer who developed a mechanical disc with spiral holes from the outside toward the center. As the disc spun, each hole passed in front of the image, allowing light to pass through in sequence. Behind the disc, a photoelectric cell captured the light passing through the holes. The intensity of the light varied depending on the brightness of each point on the scanned image. The photoelectric cell converted these light variations into electrical signals of varying intensity. The produced electrical signals could be transmitted via cable or radio waves. To reconstruct the image, a receiving device used an identical Nipkow disc, synchronized with the transmitting disc. The received signal modulated a light source (such as a neon lamp) behind the receiving disc. As the disc spun, the modulated light passed through the holes, reconstructing the original image line by line on a screen.
However, mechanical television had many limitations. The transmitted image was small, the need for precise synchronization between the transmitter and receiver was critical and often problematic, and mechanical scanning limited transmission speed, compromising the quality of moving images. Despite these limitations, Nipkow's disc was fundamental to the development of early mechanical television systems and paved the way for the electronic scanning technologies that followed, such as Vladimir Zworykin's iconoscope tube in the 1920s.
It was only with the advent of the cathode-ray tube that television began to become a commercially viable product. Commercial radios, such as RCA (Radio Corporation of America), played a key role in developing electronic television in the United States. While Europe was at war, experimentation and dissemination of televisions continued in the United States, albeit limited to a few pioneers and technology enthusiasts. Most American families did not yet own a television. Television broadcasts were available only in certain cities, mainly where experimental stations were present. In 1939, during the New York World's Fair, RCA presented electronic television to the public, marking a crucial moment in television history.
In 1940, the earliest television broadcasts were in black and white and entirely live, as video recorders had yet to be developed. Images were often captured from 16mm film projections, marking the emergence of the first market for television-specific content. Shortly thereafter, the National Television System Committee (NTSC) established a standard of 525 lines of resolution and a frame rate of 30 frames per second (fps), with interlaced scanning at 60 Hz.
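The interlaced scanning mentioned here splits every frame into two half-height fields of alternating lines. A rough sketch of the idea in Python (the odd/even numbering convention differs between standards; this is only one choice):

```python
import numpy as np

def split_fields(frame: np.ndarray):
    """Split a progressive frame into its two interlaced fields."""
    even_field = frame[0::2]  # lines 0, 2, 4, ...
    odd_field = frame[1::2]   # lines 1, 3, 5, ...
    return even_field, odd_field

def weave(even_field: np.ndarray, odd_field: np.ndarray) -> np.ndarray:
    """Reassemble a full frame from two fields ('weave' deinterlacing)."""
    height = even_field.shape[0] + odd_field.shape[0]
    frame = np.empty((height,) + even_field.shape[1:], dtype=even_field.dtype)
    frame[0::2] = even_field
    frame[1::2] = odd_field
    return frame

# 30 progressive frames per second become 60 fields per second:
frame = np.arange(8, dtype=np.uint8).reshape(4, 2)
even, odd = split_fields(frame)
rebuilt = weave(even, odd)
```

When the two fields come from different instants in time, weaving them back together produces the familiar "combing" artifacts on moving objects, which is why deinterlacing remains a nontrivial problem.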
Both cinema and television adopted the 4:3 (1.33:1) aspect ratio. Cinema had used this ratio since the silent era; when the optical soundtrack was later printed alongside the image, reducing the width available for the picture, the frame was adjusted (the Academy ratio, 1.37:1) to stay close to the same proportions. Early televisions adopted this aspect ratio for several reasons: it was compatible with existing cinematic content, facilitating the broadcast of already-produced films and shorts, and it was practical for transmitting and displaying images on the screens of the time.
These technological and formatting choices laid the groundwork for television's evolution as a mass communication medium, influencing content production and television set design for decades. While the United States and Latin America adopted a 60 Hz standard for electricity distribution, influencing the choice of 30 fps for TV broadcasts to synchronize screen refresh rates with the power grid frequency and avoid visible interference, Europe's situation was more complex. Many European countries had already standardized at 50 Hz before World War II. For instance, Germany standardized at 50 Hz in the 1920s, and the UK began this process in the 1930s, completing it in the 1950s. The European standardization was a gradual process that began before the war, paused during it, and resumed afterward. The Marshall Plan aided in Europe's post-war reconstruction and modernization but did not enforce uniform standards, possibly aiming to recover pre-war technologies and infrastructures.
In television's early days, broadcasts were indeed at 30 full frames per second (in the U.S.), known as progressive scanning. However, early TVs suffered from visible flicker and unstable image brightness at that rate. To address this, engineers developed the interlaced scanning system. Instead of transmitting 30 complete images per second, the interlaced system divided each frame into two fields, one containing the odd lines and the other the even lines. These fields were transmitted alternately at a frequency of 60 fields per second in the U.S. or 50 fields per second in Europe. This is the origin of distinctions like 25 and 30 progressive frames versus 50 and 60 interlaced fields. Since images were traced line by line, the resolution was described in lines: 525 in the American system.

1.9.1 Why 59.94?
While Americans in 1939 watched "Gone with the Wind" in color, Europe and Asia were engulfed in turmoil. In Italy, regular black and white broadcasts began on January 3, 1954, at 11:00 AM, while the U.S. was on the cusp of regular color transmissions.
Introducing color broadcasts in the U.S. was an engineering feat that led to a frequency shift from 60 Hz to 59.94 Hz. The challenge was maintaining compatibility with black-and-white TVs, allowing them to receive and decode the new signal seamlessly. The modulation of the chrominance signal (containing the color information) had to avoid interfering with the luminance signal (containing the brightness information) and with the audio subcarrier. The color subcarrier frequency was tied to the horizontal scan frequency and the frame rate of the TV signal. By reducing the field rate by a factor of 1000/1001, from 60 Hz to 60 × 1000/1001 ≈ 59.94 Hz (more precisely, 59.94005994… Hz), engineers achieved better spectral separation between the chrominance and audio signals, reducing interference. With this choice the horizontal line frequency became 525 lines × 29.97 frames per second ≈ 15,734.266 Hz, and the color subcarrier was set at 455/2 times that line frequency, approximately 3.579545 MHz, ensuring minimal visible interference on existing black-and-white sets.

1.9.2 The Smartphone
From 1955 to today, numerous historically significant inventions have emerged, with various players vying for the best market share. Among these, perhaps the most revolutionary—comparable in impact to cinema and television—is the smartphone. While personal computers improved digital literacy and bridged the digital divide in some countries, they didn't entirely make technology universally and immediately accessible. Despite Windows and the internet opening a window to the world, challenges remained.
In this context, smartphones achieved greater success by reducing the inherent difficulties of computers and offering a solution always at hand. They simplified access to information and communication, becoming indispensable tools in daily life and radically transforming how people interact with technology and each other. Smartphones introduce new challenges and ways of viewing images—they can be watched both horizontally and vertically and allow direct user interaction. Video and audio on-demand streaming have transformed media consumption. Mobile gaming has become a significant entertainment industry. New content formats, like vertical stories and short videos, gained popularity thanks to smartphones. Movements like citizen journalism and user-generated content have democratized news production and entertainment.
They've made remote work possible, offering job flexibility. Apps for time management and productivity help people better organize their lives. However, the lines between work and personal life have blurred, and new professions related to mobile, like app developers and influencers, have emerged. Smartphones are evolving to become lighter, foldable, and wearable, integrating augmented and virtual reality increasingly. Artificial intelligence is likely to become even more central, reminiscent of the blend between the magic lantern and praxinoscope that once inspired cinema.
However, smartphones collect vast amounts of personal data, making privacy and security growing concerns. Cybersecurity and personal data protection have become essential. They can be used for surveillance and activity tracking, and the spread of misinformation and fake news is facilitated by their ubiquity. Additionally, they contribute to energy consumption and carbon footprints. The production and disposal of electronic waste are significant problems, although efforts are underway to make smartphone production more sustainable.
While my field isn't traditionally considered academic, it's of collective interest. Previously, only major broadcasters and networks engaged in such endeavors; now, having a sort of television studio at home is within everyone's reach. However, understanding how to piece together images or why certain methods are preferred isn't easily found in textbooks. This is my attempt to address some of the challenges related to producing images using what I consider the other great revolution alongside smartphones, albeit one that hasn't garnered the same attention: Open Source.
Copyright (c) 2024 Alessio Michelassi