AI Generator Rewriter

AI Generator Rewriter — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Matchbox Educable Noughts and Crosses Engine

    Matchbox Educable Noughts and Crosses Engine

    The Matchbox Educable Noughts and Crosses Engine (sometimes called the Machine Educable Noughts and Crosses Engine or MENACE) was a mechanical computer made from 304 matchboxes designed and built by artificial intelligence researcher Donald Michie and his colleague Roger Chambers, in 1961. It was designed to play human opponents in games of noughts and crosses (tic-tac-toe) by returning a move for any given state of play and to refine its strategy through reinforcement learning. This was one of the first types of artificial intelligence. Michie and Chambers did not have immediate access to a computer; they worked around this by building the engine out of matchboxes. The matchboxes they used each represented a single possible layout of a noughts and crosses grid. When the computer first played, it would randomly choose moves based on the current layout. As it played more games, through a reinforcement loop, it disqualified strategies that led to losing games, and supplemented strategies that led to winning games. Michie held a tournament against MENACE in 1961, wherein he experimented with different openings. Following MENACE's maiden tournament against Michie, it demonstrated successful artificial intelligence in its strategy. Michie's essays on MENACE's weight initialisation and the BOXES algorithm used by MENACE became popular in the field of computer science research. Michie was honoured for his contribution to machine learning research, and was twice commissioned to program a MENACE simulation on an actual computer. == Origin == Donald Michie (1923–2007) had been on the team decrypting the German Tunny Code during World War II. Fifteen years later, he wanted to further display his mathematical and computational prowess with an early convolutional neural network. Since computer equipment was not obtainable for such uses, and Michie did not have a computer readily available, he decided to display and demonstrate artificial intelligence in a more esoteric format and constructed a functional mechanical computer out of matchboxes and beads. MENACE was constructed as the result of a bet with a computer science colleague who postulated that such a machine was impossible. Michie undertook the task of collecting and defining each matchbox as a "fun project", later turned into a demonstration tool. Michie completed his essay on MENACE in 1963, "Experiments on the mechanization of game-learning", as well as his essay on the BOXES Algorithm, written with R. A. Chambers and had built up an AI research unit in Hope Park Square, Edinburgh, Scotland. MENACE learned by playing successive matches of noughts and crosses. Each time, it would eliminate a losing strategy by the human player confiscating the beads that corresponded to each move. It reinforced winning strategies by making the moves more likely, by supplying extra beads. This was one of the earliest versions of the Reinforcement Loop, the schematic algorithm of looping the algorithm, dropping unsuccessful strategies until only the winning ones remain. This model starts as completely random, and gradually learns. == Composition == MENACE was made from 304 matchboxes glued together in an arrangement similar to a chest of drawers. Each box had a code number, which was keyed into a chart. This chart had drawings of tic-tac-toe game grids with various configurations of X, O, and empty squares, corresponding to all possible permutations a game could go through as it progressed. After removing duplicate arrangements (ones that were simply rotations or mirror images of other configurations), MENACE used 304 permutations in its chart and thus that many matchboxes. Each individual matchbox tray contained a collection of coloured beads. Each colour represented a move on a square on the game grid, and so matchboxes with arrangements where positions on the grid were already taken would not have beads for that position. Additionally, at the front of the tray were two extra pieces of card in a "V" shape, the point of the "V" pointing at the front of the matchbox. Michie and his artificial intelligence team called MENACE's algorithm "Boxes", after the apparatus used for the machine. The first stage "Boxes" operated in five phases, each setting a definition and a precedent for the rules of the algorithm in relation to the game. == Operation == MENACE played first, as O, since all matchboxes represented permutations only relevant to the "X" player. To retrieve MENACE's choice of move, the opponent or operator located the matchbox that matched the current game state, or a rotation or mirror image of it. For example, at the start of a game, this would be the matchbox for an empty grid. The tray would be removed and lightly shaken so as to move the beads around. Then, the bead that had rolled into the point of the "V" shape at the front of the tray was the move MENACE had chosen to make. Its colour was then used as the position to play on, and, after accounting for any rotations or flips needed based on the chosen matchbox configuration's relation to the current grid, the O would be placed on that square. Then the player performed their move, the new state was located, a new move selected, and so on, until the game was finished. When the game had finished, the human player observed the game's outcome. As a game was played, each matchbox that was used for MENACE's turn had its tray returned to it ajar, and the bead used kept aside, so that MENACE's choice of moves and the game states they belonged to were recorded. Michie described his reinforcement system with "reward" and "punishment". Once the game was finished, if MENACE had won, it would then receive a "reward" for its victory. The removed beads showed the sequence of the winning moves. These were returned to their respective trays, easily identifiable since they were slightly open, as well as three bonus beads of the same colour. In this way, in future games MENACE would become more likely to repeat those winning moves, reinforcing winning strategies. If it lost, the removed beads were not returned, "punishing" MENACE, and meaning that in future it would be less likely, and eventually incapable if that colour of bead became absent, to repeat the moves that cause a loss. If the game was a draw, one additional bead was added to each box. == Results in practice == === Optimal strategy === Noughts and crosses has a well-known optimal strategy. A player must place their symbol in a way that blocks the other player from achieving any rows while simultaneously making a row themself. However, if both players use this strategy, the game always ends in a draw. If the human player is familiar with the optimal strategy, and MENACE can quickly learn it, then the games will eventually only end in draws. The likelihood of the computer winning increases quickly when the computer plays against a random-playing opponent. When playing against a player using optimal strategy, the odds of a draw grow to 100%. In Donald Michie's official tournament against MENACE in 1961 he used optimal strategy, and he and the computer began to draw consistently after twenty games. Michie's tournament had the following milestones: Michie began by consistently opening with "Variant 0", the middle square. At 15 games, MENACE abandoned all non-corner openings. At just over 20, Michie switched to consistently using "Variant 1", the bottom-right square. At 60, he returned to Variant 0. As he neared 80 games, he moved to "Variant 2", the top-middle. At 110, he switched to "Variant 3", the top right. At 135, he switched to "Variant 4", middle-right. At 190, he returned to Variant 1, and at 210, he returned to Variant 0. The trend in changes of beads in the "2" boxes runs: === Correlation === Depending on the strategy employed by the human player, MENACE produces a different trend on scatter graphs of wins. Using a random turn from the human player results in an almost-perfect positive trend. Playing the optimal strategy returns a slightly slower increase. The reinforcement does not create a perfect standard of wins; the algorithm will draw random uncertain conclusions each time. After the j-th round, the correlation of near-perfect play runs: 1 − D D − D ( j + 2 ) ∑ i = 0 j D ( j i + 1 ) V i {\displaystyle {1-D \over D-D^{(j+2)}}\sum _{i=0}^{j}D^{(ji+1)}V_{i}} Where Vi is the outcome (+1 is win, 0 is draw and -1 is loss) and D is the decay factor (average of past values of wins and losses). Below, Mn is the multiplier for the n-th round of the game. == Legacy == Donald Michie's MENACE proved that a computer could learn from failure and success to become good at a task. It used what would become core principles within the field of machine learning before they had been properly theorised. For example, the combination of how MENACE starts with equal numbers of types of beads in each matchbox, and how these are then selected at random, creates a learning behaviour similar to weight initialisation

    Read more →
  • Speech segmentation

    Speech segmentation

    Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing. In the field of automatic pronunciation assessment, the process of segmenting an utterance against expected word(s) is called forced alignment. Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Though it seems that coarticulation—a phenomenon which may happen between adjacent words just as easily as within a single word—presents the main challenge in speech segmentation across languages, some other problems and strategies employed in solving those problems can be seen in the following sections. This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese, compared to writing systems which indicate speech segmentation between words by a word divider, such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese). == Lexical recognition == In natural languages, the meaning of a complex spoken sentence can be understood by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and combining those meanings according to the grammar rules of the language. Though lexical recognition is not thought to be used by infants in their first year, due to their highly limited vocabularies, it is one of the major processes involved in speech segmentation for adults. Three main models of lexical recognition exist in current research: first, whole-word access, which argues that words have a whole-word representation in the lexicon; second, decomposition, which argues that morphologically complex words are broken down into their morphemes (roots, stems, inflections, etc.) and then interpreted and; third, the view that whole-word and decomposition models are both used, but that the whole-word model provides some computational advantages and is therefore dominant in lexical recognition. To give an example, in a whole-word model, the word "cats" might be stored and searched for by letter, first "c", then "ca", "cat", and finally "cats". The same word, in a decompositional model, would likely be stored under the root word "cat" and could be searched for after removing the "s" suffix. "Falling", similarly, would be stored as "fall" and suffixed with the "ing" inflection. Though proponents of the decompositional model recognize that a morpheme-by-morpheme analysis may require significantly more computation, they argue that the unpacking of morphological information is necessary for other processes (such as syntactic structure) which may occur parallel to lexical searches. As a whole, research into systems of human lexical recognition is limited due to little experimental evidence that fully discriminates between the three main models. In any case, lexical recognition likely contributes significantly to speech segmentation through the contextual clues it provides, given that it is a heavily probabilistic system—based on the statistical likelihood of certain words or constituents occurring together. For example, one can imagine a situation where a person might say "I bought my dog at a ____ shop" and the missing word's vowel is pronounced as in "net", "sweat", or "pet". While the probability of "netshop" is extremely low, since "netshop" isn't currently a compound or phrase in English, and "sweatshop" also seems contextually improbable, "pet shop" is a good fit because it is a common phrase and is also related to the word "dog". Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics which draws on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer. Lexical recognition is of particular value in the field of computer speech recognition, since the ability to build and search a network of semantically connected ideas would greatly increase the effectiveness of speech-recognition software. Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video sub-titling, and linguistic research. Automatic segmentation and alignment software is commercially available. == Phonotactic cues == For most spoken languages, the boundaries between lexical units are difficult to identify; phonotactics are one answer to this issue. One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word. The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, may be a relic of alphabetic heritage for some language communities. In fact, the way vowels are produced depends on the surrounding consonants just as consonants are affected by surrounding vowels; this is called coarticulation. For example, in the word "kit", the [k] is farther forward than when we say 'caught'. But also, the vowel in "kick" is phonetically different from the vowel in "kit", though we normally do not hear this. In addition, there are language-specific changes which occur in casual speech which makes it quite different from spelling. For example, in English, the phrase "hit you" could often be more appropriately spelled "hitcha". From a decompositional perspective, in many cases, phonotactics play a part in letting speakers know where to draw word boundaries. In English, the word "strawberry" is perceived by speakers as consisting (phonetically) of two parts: "straw" and "berry". Other interpretations such as "stra" and "wberry" are inhibited by English phonotactics, which does not allow the cluster "wb" word-initially. Other such examples are "day/dream" and "mile/stone" which are unlikely to be interpreted as "da/ydream" or "mil/estone" due to the phonotactic probability or improbability of certain clusters. The sentence "Five women left", which could be phonetically transcribed as [faɪvwɪmɘnlɛft], is marked since neither /vw/ in /faɪvwɪmɘn/ nor /nl/ in /wɪmɘnlɛft/ are allowed as syllable onsets or codas in English phonotactics. These phonotactic cues often allow speakers to easily distinguish the boundaries in words. Vowel harmony in languages like Finnish can also serve to provide phonotactic cues. While the system does not allow front vowels and back vowels to exist together within one morpheme, compounds allow two morphemes to maintain their own vowel harmony while coexisting in a word. Therefore, in compounds such as "selkä/ongelma" ('back problem') where vowel harmony is distinct between two constituents in a compound, the boundary will be wherever the switch in harmony takes place—between the "ä" and the "ö" in this case. Still, there are instances where phonotactics may not aid in segmentation. Words with unclear clusters or uncontrasted vowel harmony as in "opinto/uudistus" ('student reform') do not offer phonotactic clues as to how they are segmented. From the perspective of the whole-word model, however, these words are thought be stored as full words, so the constituent parts would not necessarily be relevant to lexical recognition. == In infants and non-natives == Infants are one major focus of research in speech segmentation. Since infants have not yet acquired a lexicon capable of providing extensive contextual clues or probability-based word searches within their first year, as mentioned above, they must often rely primarily upon phonotactic and rhythmic cues (with prosody being the dominant cue), all

    Read more →
  • Lexical Markup Framework

    Lexical Markup Framework

    Language resource management – Lexical markup framework (LMF; ISO 24613), produced by ISO/TC 37, is the ISO standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The scope is standardization of principles and methods relating to language resources in the contexts of multilingual communication. == Objectives == The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of large number of individual electronic resources to form extensive global electronic resources. Types of individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications are to be used for both small and large lexicons, for both simple and complex lexicons, for both written and spoken lexical representations. The descriptions range from morphology, syntax, computational semantics to computer-assisted translation. The covered languages are not restricted to European languages but cover all natural languages. The range of targeted NLP applications is not restricted. LMF is able to represent most lexicons, including WordNet, EDR and PAROLE lexicons. == History == In the past, lexicon standardization has been studied and developed by a series of projects like GENELEX, EDR, EAGLES, MULTEXT, PAROLE, SIMPLE and ISLE. Then, the ISO/TC 37 National delegations decided to address standards dedicated to NLP and lexicon representation. The work on LMF started in Summer 2003 by a new work item proposal issued by the US delegation. In Fall 2003, the French delegation issued a technical proposition for a data model dedicated to NLP lexicons. In early 2004, the ISO/TC 37 committee decided to form a common ISO project with Nicoletta Calzolari (CNR-ILC Italy) as convenor and Gil Francopoulo (Tagmatica France) and Monte George (ANSI, United States) as editors. The first step in developing LMF was to design an overall framework based on the general features of existing lexicons and to develop a consistent terminology to describe the components of those lexicons. The next step was the actual design of a comprehensive model that best represented all of the lexicons in detail. A large panel of 60 experts contributed a wide range of requirements for LMF that covered many types of NLP lexicons. The editors of LMF worked closely with the panel of experts to identify the best solutions and reach a consensus on the design of LMF. Special attention was paid to the morphology in order to provide powerful mechanisms for handling problems in several languages that were known as difficult to handle. 13 versions have been written, dispatched (to the National nominated experts), commented and discussed during various ISO technical meetings. After five years of work, including numerous face-to-face meetings and e-mail exchanges, the editors arrived at a coherent UML model. In conclusion, LMF should be considered a synthesis of the state of the art in NLP lexicon field. == Current stage == The ISO number is 24613. The LMF specification has been published officially as an International Standard on 17 November 2008. == As one of the members of the ISO/TC 37 family of standards == The ISO/TC 37 standards are currently elaborated as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO 24611 a.k.a. MAF, ISO 24612 a.k.a. LAF, ISO 24615 a.k.a. SynAF, and ISO 24617-1 a.k.a. SemAF/Time), feature structures (ISO 24610), multimedia containers (ISO 24616 a.k.a. MLIF), and lexicons (ISO 24613). These standards are based on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), scripts codes (ISO 15924), country codes (ISO 3166) and Unicode (ISO 10646). The two level organization forms a coherent family of standards with the following common and simple rules: the high level specification provides structural elements that are adorned by the standardized constants; the low level specifications provide standardized constants as metadata. == Key standards == The linguistics constants like /feminine/ or /transitive/ are not defined within LMF but are recorded in the Data Category Registry (DCR) that is maintained as a global resource by ISO/TC 37 in compliance with ISO/IEC 11179-3:2003. And these constants are used to adorn the high level structural elements. The LMF specification complies with the modeling principles of Unified Modeling Language (UML) as defined by Object Management Group (OMG). The structure is specified by means of UML class diagrams. The examples are presented by means of UML instance (or object) diagrams. An XML DTD is given in an annex of the LMF document. == Model structure == LMF is composed of the following components: The core package that is the structural skeleton which describes the basic hierarchy of information in a lexical entry. Extensions of the core package which are expressed in a framework that describes the reuse of the core components in conjunction with the additional components required for a specific lexical resource. The extensions are specifically dedicated to morphology, MRD, NLP syntax, NLP semantics, NLP multilingual notations, NLP morphological patterns, multiword expression patterns, and constraint expression patterns. == Example == In the following example, the lexical entry is associated with a lemma clergyman and two inflected forms clergyman and clergymen. The language coding is set for the whole lexical resource. The language value is set for the whole lexicon as shown in the following UML instance diagram. The elements Lexical Resource, Global Information, Lexicon, Lexical Entry, Lemma, and Word Form define the structure of the lexicon. They are specified within the LMF document. On the contrary, languageCoding, language, partOfSpeech, commonNoun, writtenForm, grammaticalNumber, singular, plural are data categories that are taken from the Data Category Registry. These marks adorn the structure. The values ISO 639-3, clergyman, clergymen are plain character strings. The value eng is taken from the list of languages as defined by ISO 639-3. With some additional information like dtdVersion and feat, the same data can be expressed by the following XML fragment: This example is rather simple, while LMF can represent much more complex linguistic descriptions the XML tagging is correspondingly complex. == Selected publications about LMF == The first publication about the LMF specification as it has been ratified by ISO (this paper became (in 2015) the 9th most cited paper within the Language Resources and Evaluation conferences from LREC papers): Language Resources and Evaluation LREC-2006/Genoa: Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, Claudia Soria: Lexical Markup Framework (LMF) About semantic representation: Gesellschaft für linguistische Datenverarbeitung GLDV-2007/Tübingen: Gil Francopoulo, Nuria Bel, Monte George Nicoletta Calzolari, Monica Monachini, Mandy Pet, Claudia Soria: Lexical Markup Framework ISO standard for semantic information in NLP lexicons About African languages: Traitement Automatique des langues naturelles, Marseille, 2014: Mouhamadou Khoule, Mouhamad Ndiankho Thiam, El Hadj Mamadou Nguer: Toward the establishment of a LMF-based Wolof language lexicon (Vers la mise en place d'un lexique basé sur LMF pour la langue wolof) [in French] About Asian languages: Lexicography, Journal of ASIALEX, Springer 2014: Lexical Markup Framework: Gil Francopoulo, Chu-Ren Huang: An ISO Standard for Electronic Lexicons and its Implications for Asian Languages DOI 10.1007/s40607-014-0006-z About European languages: COLING 2010: Verena Henrich, Erhard Hinrichs: Standardizing Wordnets in the ISO Standard LMF: Wordnet-LMF for GermaNet EACL 2012: Judith Eckle-Kohler, Iryna Gurevych: Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability EACL 2012: Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M Meyer, Christian Wirth: UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. About Semitic languages: Journal of Natural Language Engineering, Cambridge University Press (to appear in Spring 2015): Aida Khemakhem, Bilel Gargouri, Abdelmajid Ben Hamadou, Gil Francopoulo: ISO Standard Modeling of a large Arabic Dictionary. Proceedings of the seventh Global Wordnet Conference 2014: Nadia B M Karmani, Hsan Soussou, Adel M Alimi: Building a standardized Wordnet in the ISO LMF for aeb language. Proceedings of the workshop: HLT & NLP within Arabic world, LREC 2008: Noureddine Loukil, Kais Haddar, Abdelmajid Ben Hamadou: Towards a syntactic lexicon of Arabic Verbs. Traitement Automatique des Langues Naturelles, Toulouse (in French) 2007: Khemakhem A, Gargouri B, Abdelwahed A, Francopoulo G: Modélisation des paradigmes de fl

    Read more →
  • Geometric hashing

    Geometric hashing

    In computer science, geometric hashing is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to other object representations and transformations. In an off-line step, the objects are encoded by treating each pair of points as a geometric basis. The remaining points can be represented in an invariant fashion with respect to this basis using two parameters. For each point, its quantized transformed coordinates are stored in the hash table as a key, and indices of the basis points as a value. Then a new pair of basis points is selected, and the process is repeated. In the on-line (recognition) step, randomly selected pairs of data points are considered as candidate bases. For each candidate basis, the remaining data points are encoded according to the basis and possible correspondences from the object are found in the previously constructed table. The candidate basis is accepted if a sufficiently large number of the data points index a consistent object basis. Geometric hashing was originally suggested in computer vision for object recognition in 2D and 3D, but later was applied to different problems such as structural alignment of proteins. == Geometric hashing in computer vision == Geometric hashing is a method used for object recognition. Let’s say that we want to check if a model image can be seen in an input image. This can be accomplished with geometric hashing. The method could be used to recognize one of the multiple objects in a base, in this case the hash table should store not only the pose information but also the index of object model in the base. === Example === For simplicity, this example will not use too many point features and assume that their descriptors are given by their coordinates only (in practice local descriptors such as SIFT could be used for indexing). ==== Training Phase ==== Find the model's feature points. Assume that 5 feature points are found in the model image with the coordinates ( 12 , 17 ) ; {\displaystyle (12,17);} ( 45 , 13 ) ; {\displaystyle (45,13);} ( 40 , 46 ) ; {\displaystyle (40,46);} ( 20 , 35 ) ; {\displaystyle (20,35);} ( 35 , 25 ) {\displaystyle (35,25)} , see the picture. Introduce a basis to describe the locations of the feature points. For 2D space and similarity transformation the basis is defined by a pair of points. The point of origin is placed in the middle of the segment connecting the two points (P2, P4 in our example), the x ′ {\displaystyle x'} axis is directed towards one of them, the y ′ {\displaystyle y'} is orthogonal and goes through the origin. The scale is selected such that absolute value of x ′ {\displaystyle x'} for both basis points is 1. Describe feature locations with respect to that basis, i.e. compute the projections to the new coordinate axes. The coordinates should be discretised to make recognition robust to noise, we take the bin size 0.25. We thus get the coordinates ( − 0.75 , − 1.25 ) ; {\displaystyle (-0.75,-1.25);} ( 1.00 , 0.00 ) ; {\displaystyle (1.00,0.00);} ( − 0.50 , 1.25 ) ; {\displaystyle (-0.50,1.25);} ( − 1.00 , 0.00 ) ; {\displaystyle (-1.00,0.00);} ( 0.00 , 0.25 ) {\displaystyle (0.00,0.25)} Store the basis in a hash table indexed by the features (only transformed coordinates in this case). If there were more objects to match with, we should also store the object number along with the basis pair. Repeat the process for a different basis pair (Step 2). It is needed to handle occlusions. Ideally, all the non-colinear pairs should be enumerated. We provide the hash table after two iterations, the pair (P1, P3) is selected for the second one. Hash Table: Most hash tables cannot have identical keys mapped to different values. So in real life one won’t encode basis keys (1.0, 0.0) and (-1.0, 0.0) in a hash table. ==== Recognition Phase ==== Find interesting feature points in the input image. Choose an arbitrary basis. If there isn't a suitable arbitrary basis, then it is likely that the input image does not contain the target object. Describe coordinates of the feature points in the new basis. Quantize obtained coordinates as it was done before. Compare all the transformed point features in the input image with the hash table. If the point features are identical or similar, then increase the count for the corresponding basis (and the type of object, if any). For each basis such that the count exceeds a certain threshold, verify the hypothesis that it corresponds to an image basis chosen in Step 2. Transfer the image coordinate system to the model one (for the supposed object) and try to match them. If successful, the object is found. Otherwise, go back to Step 2. === Finding mirrored pattern === It seems that this method is only capable of handling scaling, translation, and rotation. However, the input image may contain the object in mirror transform. Therefore, geometric hashing should be able to find the object, too. There are two ways to detect mirrored objects. For the vector graph, make the left side positive, and the right side negative. Multiplying the x position by -1 will give the same result. Use 3 points for the basis. This allows detecting mirror images (or objects). Actually, using 3 points for the basis is another approach for geometric hashing. === Geometric hashing in higher-dimensions === Similar to the example above, hashing applies to higher-dimensional data. For three-dimensional data points, three points are also needed for the basis. The first two points define the x-axis, and the third point defines the y-axis (with the first point). The z-axis is perpendicular to the created axis using the right-hand rule. Notice that the order of the points affects the resulting basis

    Read more →
  • Ray tracing (graphics)

    Ray tracing (graphics)

    In 3D computer graphics, ray tracing is a technique for modeling light transport for use in a wide variety of rendering algorithms for generating digital images. On a spectrum of computational cost and visual fidelity, ray tracing-based rendering techniques, such as ray casting, recursive ray tracing, distribution ray tracing, photon mapping and path tracing, are generally slower and higher fidelity than scanline rendering methods. Thus, ray tracing was first deployed in applications where taking a relatively long time to render could be tolerated, such as still CGI images, and film and television visual effects (VFX), but was less suited to real-time applications such as video games, where speed is critical in rendering each frame. Since 2018, however, hardware acceleration for real-time ray tracing has become standard on new commercial graphics cards, and graphics APIs have followed suit, allowing developers to use hybrid ray tracing and rasterization-based rendering in games and other real-time applications with a lesser hit to frame render times. Ray tracing is capable of simulating a variety of optical effects, such as reflection, refraction, soft shadows, scattering, depth of field, motion blur, caustics, ambient occlusion and dispersion phenomena (such as chromatic aberration). It can also be used to trace the path of sound waves in a similar fashion to light waves, making it a viable option for more immersive sound design in video games by rendering realistic reverberation and echoes. In fact, any physical wave or particle phenomenon with approximately linear motion can be simulated with ray tracing. Ray tracing–based rendering techniques that sample light over a domain typically generate multiple rays and often rely on denoising to reduce the resulting noise. == History == The idea of ray tracing comes from as early as the 16th century, when it was described by Albrecht Dürer, who is credited for its invention. Dürer described multiple techniques for projecting 3-D scenes onto an image plane. Some of these project chosen geometry onto the image plane, as is done with rasterization today. Others determine what geometry is visible along a given ray, as is done with ray tracing. Using a computer for ray tracing to generate shaded pictures was first accomplished by Arthur Appel in 1968. Appel used ray tracing for primary visibility (determining the closest surface to the camera at each image point) by tracing a ray through each point to be shaded into the scene to identify the visible surface. The closest surface intersected by the ray was the visible one. This non-recursive ray tracing-based rendering algorithm is today called "ray casting". His algorithm then traced secondary rays to the light source from each point being shaded to determine whether the point was in shadow or not. Later, in 1971, Goldstein and Nagel of MAGI (Mathematical Applications Group, Inc.) published "3-D Visual Simulation", wherein ray tracing was used to make shaded pictures of solids. At the ray-surface intersection point found, they computed the surface normal and, knowing the position of the light source, computed the brightness of the pixel on the screen. Their publication describes a short (30-second) film "made using the University of Maryland's display hardware outfitted with a 16mm camera. The film showed the helicopter and a simple ground-level gun emplacement. The helicopter was programmed to undergo a series of maneuvers including turns, take-offs, and landings, etc., until it eventually is shot down and crashed." A CDC 6600 computer was used. MAGI produced an animation video called MAGI/SynthaVision Sampler in 1974. Another early instance of ray casting came in 1976, when Scott Roth created a flip book animation in Bob Sproull's computer graphics course at Caltech. The scanned pages are shown as a video in the accompanying image. Roth's computer program noted an edge point at a pixel location if the ray intersected a bounded plane different from that of its neighbors. Of course, a ray could intersect multiple planes in space, but only the surface point closest to the camera was noted as visible. The platform was a DEC PDP-10, a Tektronix storage-tube display, and a printer which would create an image of the display on rolling thermal paper. Roth extended the framework, introduced the term ray casting in the context of computer graphics and solid modeling, and in 1982 published his work while at GM Research Labs. Turner Whitted was the first to show recursive ray tracing for mirror reflection and for refraction through translucent objects, with an angle determined by the solid's index of refraction, and to use ray tracing for anti-aliasing. Whitted also showed ray traced shadows. He produced a recursive ray traced film called The Compleat Angler in 1979 while an engineer at Bell Labs. Whitted's deeply recursive ray tracing algorithm reframed rendering from being primarily a matter of surface visibility determination to being a matter of light transport. His paper inspired a series of subsequent work by others that included distribution ray tracing and finally unbiased path tracing, which provides the rendering equation framework that has allowed computer-generated imagery to be faithful to reality. For decades, global illumination in major films using computer-generated imagery was approximated with additional lights. Ray tracing-based rendering eventually changed that by enabling physically based light transport. Early feature films rendered entirely using path tracing include Monster House (2006), Cloudy with a Chance of Meatballs (2009), and Monsters University (2013). == Algorithm overview == Optical ray tracing describes a method for producing visual images constructed in 3D computer graphics environments, with more photorealism than either ray casting or scanline rendering techniques. It works by tracing a path from an imaginary eye through each pixel in a virtual screen, and calculating the color of the object visible through it. Scenes in ray tracing are described mathematically by a programmer or by a visual artist (normally using intermediary tools). Scenes may also incorporate data from images and models captured by means such as digital photography. Typically, each ray must be tested for intersection with some subset of all the objects in the scene. Once the nearest object has been identified, the algorithm will estimate the incoming light at the point of intersection, examine the material properties of the object, and combine this information to calculate the final color of the pixel. Certain illumination algorithms and reflective or translucent materials may require more rays to be re-cast into the scene. It may at first seem counterintuitive or "backward" to send rays away from the camera, rather than into it (as actual light does in reality), but doing so is many orders of magnitude more efficient. Since the overwhelming majority of light rays from a given light source do not make it directly into the viewer's eye, a "forward" simulation could potentially waste a tremendous amount of computation on light paths that are never recorded. Therefore, the shortcut taken in ray tracing is to presuppose that a given ray intersects the view frame. After either a maximum number of reflections or a ray traveling a certain distance without intersection, the ray ceases to travel and the pixel's value is updated. === Calculate rays for rectangular viewport === On input we have (in calculation we use vector normalization and cross product): E ∈ R 3 {\displaystyle E\in \mathbb {R^{3}} } eye position T ∈ R 3 {\displaystyle T\in \mathbb {R^{3}} } target position θ ∈ [ 0 , π ] {\displaystyle \theta \in [0,\pi ]} field of view - for humans, we can assume ≈ π / 2 rad = 90 ∘ {\displaystyle \approx \pi /2{\text{ rad}}=90^{\circ }} m , k ∈ N {\displaystyle m,k\in \mathbb {N} } numbers of square pixels on viewport vertical and horizontal direction i , j ∈ N , 1 ≤ i ≤ k ∧ 1 ≤ j ≤ m {\displaystyle i,j\in \mathbb {N} ,1\leq i\leq k\land 1\leq j\leq m} numbers of actual pixel v → ∈ R 3 {\displaystyle {\vec {v}}\in \mathbb {R^{3}} } vertical vector which indicates where is up and down, usually v → = [ 0 , 1 , 0 ] {\displaystyle {\vec {v}}=[0,1,0]} - roll component which determine viewport rotation around point C (where the axis of rotation is the ET section) The idea is to find the position of each viewport pixel center P i j {\displaystyle P_{ij}} which allows us to find the line going from eye E {\displaystyle E} through that pixel and finally get the ray described by point E {\displaystyle E} and vector R → i j = P i j − E {\displaystyle {\vec {R}}_{ij}=P_{ij}-E} (or its normalization r → i j {\displaystyle {\vec {r}}_{ij}} ). First we need to find the coordinates of the bottom left viewport pixel P 1 m {\displaystyle P_{1m}} and find the next pixel by making a shift along directions parallel to viewport (vectors b → n {\displaystyle {\vec {b}}_{n

    Read more →
  • Text-to-video model

    Text-to-video model

    A text-to-video model is a form of generative artificial intelligence that uses a natural language description as input to produce a video relevant to the input text. Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models. == Models == There are different models, including open source models. Chinese-language input CogVideo is the earliest text-to-video model "of 9.4 billion parameters" to be developed, with its demo version of open source codes first presented on GitHub in 2022. That year, Meta Platforms released a partial text-to-video model called "Make-A-Video", and Google's Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with 3D U-Net. === 2023 === In February 2023, Runway released Gen-1 and Gen-2, among the first commercially available text-to-video and video-to-video models accessible to the public through a web interface. Gen-1, initially released as a video-to-video model, allowed users to transform existing video footage using text or image prompts. Gen-2, introduced in March 2023 and made publicly available in June 2023, added text-to-video capabilities, enabling users to generate videos from text prompts alone. In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation. The VideoFusion model decomposes the diffusion process into two components: base noise and residual noise, which are shared across frames to ensure temporal coherence. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences. In the same month, Adobe introduced Firefly AI as part of its features. === 2024 === In January 2024, Google announced development of a text-to-video model named Lumiere which is anticipated to integrate advanced video editing capabilities. Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearances, and motion for controllable video synthesis of avatars. In June 2024, Luma Labs launched its Dream Machine video tool. That same month, Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China, through its subsidiary, Faceu Technology. By September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining other established AI model companies like Zhipu AI, Baichuan, and Moonshot AI, which contribute to China's involvement in AI technology. In December 2024 Lightricks launched LTX Video as an open source model. === 2025 === Alternative approaches to text-to-video models include Google's Phenaki, Hour One, Colossyan, Runway's Gen-3 Alpha, and OpenAI's Sora, Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged. FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA. Google was preparing to launch a video generation tool named Veo for YouTube Shorts in 2025. In May 2025, Google launched the Veo 3 iteration of the model. It was noted for its impressive audio generation capabilities, which were a previous limitation for text-to-video models. In July 2025 Lightricks released an update to LTX Video capable of generating clips reaching 60 seconds, and in October 2025 it released LTX-2, with audio capabilities built in. === 2026 === In February 2026, ByteDance released Seedance 2.0, it was noted for its impressive realistic generation, motion and camera control and 15 second generation, however the model faced huge critiscism from Motion Picture Association for copyright infringement. After viewing a viral clip of a fight between actors Brad Pitt and Tom Cruise, Rhett Reese, who is the co-writer of Deadpool & Wolverine and Zombieland announced that on social media "I hate to say it. It’s likely over for us," further stating that "In next to no time, one person is going to be able to sit at a computer and create a movie indistinguishable from what Hollywood now releases." == Architecture and training == There are several architectures that have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using Recurrent Neural Networks (RNNs) such as long short-term memory (LSTM) networks, which has been used for Pixel Transformation Models and Stochastic Video Generation Models, which aid in consistency and realism respectively. An alternative for these include transformer models. Generative adversarial networks (GANs), Variational autoencoders (VAEs), — which can aid in the prediction of human motion — and diffusion models have also been used to develop the image generation aspects of the model. Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M. These datasets contain millions of original videos of interest, generated videos, captioned-videos, and textual information that help train models for accuracy. Text-video datasets used to train models include, but are not limited to PromptSource, DiffusionDB, and VidProM. These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts. The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. This predictive process is subject to decline in quality as the length of the video increases due to resource limitations. The Will Smith Eating Spaghetti test is a benchmark for models. == Limitations == Despite the rapid evolution of text-to-video models in their performance, a primary limitation is that they are very computationally heavy which limits its capacity to provide high quality and lengthy outputs. Additionally, these models require a large amount of specific training data to be able to generate high quality and coherent outputs, which brings about the issue of accessibility. Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model's ability to align generated video with the user's intended message. Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation. Another issue with the outputs is that text or fine details in AI-generated videos often appear garbled, a problem that stable diffusion models also struggle with. Examples include distorted hands and unreadable text. == Ethics == The deployment of text-to-video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent. Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences. == Impacts and applications == Text-to-video models offer a broad range of applications that may benefit various fields, from educational and promotional to creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate content. During the Russo-Ukrainian war, fake videos made with artificial intelligence were created as part of a propaganda war against Ukraine and shared in social media. These included depictions of children in the Ukrainian Armed Forces, fake ads targeting children encouraging them to denounce critics of the Ukrainian government, or fictitious statements by Ukrainian President Volodymyr Zelenskyy about the country's surrender, among others. === Movies === Kaur vs Kore is the first Indian feature film made using generative AI which features dual role for the AI character of Sunny Leone, set to release in 2026. Chiranjeevi Hanuman – The Eternal is an Indian movie made entirely using Generative AI created by Vijay Subramaniam which is set for theatrical release in 2026. The movie was widely criticised by the Film makers in the Bollywood industr

    Read more →
  • Smart data capture

    Smart data capture

    Smart data capture (SDC), also known as 'intelligent data capture' or 'automated data capture', describes the branch of technology concerned with using computer vision techniques like optical character recognition (OCR), barcode scanning, object recognition and other similar technologies to extract and process information from semi-structured and unstructured data sources. IDC characterize smart data capture as an integrated hardware, software, and connectivity strategy to help organizations enable the capture of data in an efficient, repeatable, scalable, and future-proof way. Data is captured visually from barcodes, text, IDs and other objects - often from many sources simultaneously - before being converted and prepared for digital use, typically by artificial intelligence-powered software. An important feature of SDC is that it focuses not just on capturing data more efficiently but serving up easy-to-access, actionable insights at the instant of data collection to both frontline and desk-based workers, aiding decision-making and making it a two-way process. Smart data capture automates and accelerates capture, applying insights in real time and automating processes based on extracted input. Smart data capture is designed to be repeatable and scalable to reduce low-level manual tasks and eliminate human error. To achieve this goal, smart data capture solutions are often made available using specialist software installed on commodity hardware such as smartphones. However, some solutions may rely on specialized hardware such as dedicated scanning devices, wearables or shop floor robots. == Differences from OCR == Optical character recognition applications are typically concerned with the actual data capture process; they are intended to faithfully reproduce text, words, letters and symbols from a printed document. Smart data capture is multimodal, capable of extracting data from a wider range of semi-structured and unstructured sources, going beyond basic text recognition to offer a wider scope of applications. By extending functionality to provide actionable insights at the point of capture, SDC is also a two-way process (capture-display), while OCR is more commonly one-way (capture only), primarily used for data input. Smart data capture solutions typically have two parts: Data capture (which includes OCR, barcode scanning, object recognition) Functionality that then uses this data to provide actionable insights at the point of capture. == Applications == Smart data capture can be applied to almost any industry and application that requires visual information capture and interpretation. This may include: Retail Warehouse inventory control Logistics, handling and shipping Manufacturing Field service Healthcare Transport and travel Fraud detection

    Read more →
  • Document mosaicing

    Document mosaicing

    Document mosaicing is a process that stitches multiple, overlapping snapshot images of a document together to produce one large, high resolution composite. The document is slid under a stationary, over-the-desk camera by hand until all parts of the document are snapshotted by the camera's field of view. As the document slid under the camera, all motion of the document is coarsely tracked by the vision system. The document is periodically snapshotted such that the successive snapshots are overlap by about 50%. The system then finds the overlapped pairs and stitches them together repeatedly until all pairs are stitched together as one piece of document. The document mosaicing can be divided into four main processes. Tracking Feature detecting Correspondences establishing Images mosaicing. == Tracking (simple correlation process) == In this process, the motion of the document slid under the camera is coarsely tracked by the system. Tracking is performed by a process called simple correlation process. In the first frame of snapshots, a small patch is extracted from the center of the image as a correlation template. The correlation process is performed in the four times size of the patch area of the next frame. The motion of the paper is indicated by the peak in the correlation function. The peak in the correlation function indicates the motion of the paper. The template is resampled from this frame and the tracking continues until the template reaches the edge of the document. After the template reaches the edge of the document, another snapshot is taken and the tracking process performs repeatedly until the whole document is imaged. The snapshots are stored in an ordered list to facilitate pairing the overlapped images in later processes. == Feature detecting for efficient matching == Feature detection is the process of finding the transformation that aligns one image with another. There are two main approaches for feature detection. Feature-based approach : Motion parameters are estimated from point correspondences. This approach is suitable for the case that there is plenty supply of stable and detectable features. Featureless approach : When the motion between the two images is small, the motion parameters are estimated using optical flow. On the other hand, when the motion between the two images is large, the motion parameters are estimated using generalised cross-correlation. However, this approach requires a computationally expensive resources. Each image is segmented into a hierarchy of columns, lines, and words to match the organised sets of features across images. Skew angle estimation and columns, lines and words finding are the examples of feature detection operations. === Skew angle estimation === Firstly, the angle that the rows of text make with the image raster lines (skew angle) is estimated. It is assumed to lie in the range of ±20°. A small patch of text in the image is selected randomly and then rotated in the range of ±20° until the variance of the pixel intensities of the patch summed along the raster lines is maximised. To ensure that the found skew angle is accurate, the document mosaic system performs calculation at many image patches and derive the final estimation by finding the average of the individual angles weighted by the variance of the pixel intensities of each patch. === Columns, lines and words finding === In this operation, the de-skewed document is intuitively segmented into a hierarchy of columns, lines and words. The sensitivity to illumination and page coloration of the de-skewed document can be removed by applying a Sobel operator to the de-skewed image and thresholding the output to obtain the binary gradient, de-skewed image. The operation can be roughly separated into 3 steps: column segmentation, line segmentation and word segmentation. Columns are easily segmented from the binary gradient, de-skewed images by summing pixels vertically. Baselines of each row are segmented in the same way as the column segmentation process but horizontally. Finally, individual words are segmented by applying the vertical process at each segmented row. These segmentations are important because the document mosaic is created by matching the lower right corners of words in overlapping images pair. Moreover, the segmentation operation can organize the list of images in the context of a hierarchy of rows and column reliably. The segmentation operation involves a considerable amount of summing in the binary gradient, de-skewed images, which done by construct a matrix of partial sums whose elements are given by p i y = ∑ u = 1 i ∑ v = 1 j b u v {\displaystyle p_{iy}=\sum _{u=1}^{i}\sum _{v=1}^{j}b_{uv}} The matrix of partial sums is calculated in one pass through the binary gradient, de-skewed image. ∑ u = u 1 u 2 ∑ v = v 1 v 2 b u v = p u 2 v 2 + p u 1 v 1 − p u 1 v 2 − p u 2 v 1 {\displaystyle \sum _{u=u_{1}}^{u_{2}}\sum _{v=v_{1}}^{v_{2}}b_{uv}=p_{u_{2}v_{2}}+p_{u_{1}v_{1}}-p_{u_{1}v_{2}}-p_{u_{2}v_{1}}} == Correspondences establishing == The two images are now organized in hierarchy of linked lists in following structure : image=list of columns row=list of words column=list of row word=length (in pixels) At the bottom of the structure, the length of each word is recorded for establishing correspondence between two images to reduce to search only the corresponding structures for the groups of words with the matching lengths. === Seed match finding === A seed match finding is done by comparing each row in image1 with each row in image2. The two rows are then compared to each other by every word. If the length (in pixel) of the two words (one from image1 and one from image2) and their immediate neighbours agree with each other within a predefined tolerance threshold (5 pixels, for example), then they are assumed to match. The row of each image is assumed a match if there are three or more word matches between the two rows. The seed match finding operation is terminated when two pairs of consecutive row match are found. === Match list building === After finishing a seed match finding operation, the next process is to build the match list to generate the correspondences points of the two images. The process is done by searching the matching pairs of rows away from the seed row. == Images mosaicing == Given the list of corresponding points of the two images, finding the transformation of the overlapping portion of the images is the next process. Assuming a pinhole camera model, the transformation between pixels (u,v) of image 1 and pixels (u0, v0) of image 2 is demonstrated by a plane-to-plane projectivity. [ s u ′ s v ′ s ] = [ p 11 p 12 p 13 p 21 p 22 p 23 p 31 p 32 1 ] [ u v 1 ] E q .1 {\displaystyle \left[{\begin{array}{c}su'\\sv'\\s\end{array}}\right]=\left[{\begin{array}{ccc}p_{11}&p_{12}&p_{13}\\p_{21}&p_{22}&p_{23}\\p_{31}&p_{32}&1\end{array}}\right]\left[{\begin{array}{c}u\\v\\1\end{array}}\right]\qquad Eq.1} The parameters of the projectivity is found from four pairs of matching points. RANSAC regression technique is used to reject outlying matches and estimate the projectivity from the remaining good matches. The projectivity is fine-tuned using correlation at the corners of the overlapping portion to obtain four correspondences to sub-pixel accuracy. Therefore, image1 is then transformed into image2's coordinate system using Eq.1. The typical result of the process is shown in Figure 5. === Many images coping === Finally, the whole page composition is built up by mapping all the images into the coordinate system of an "anchor" image, which is normally the one nearest the page center. The transformations to the anchor frame are calculated by concatenating the pair-wise transformations found earlier. The raw document mosaic is shown in Figure 6. However, there might be a problem of non-consecutive images that are overlap. This problem can be solved by performing Hierarchical sub-mosaics. As shown in Figure 7, image1 and image2 are registered, as are image3 and image4, creating two sub-mosaics. These two sub-mosaics are later stitched together in another mosaicing process. == Applied areas == There are various areas that the technique of document mosaicing can be applied to such as : Text segmentation of images of documents Document Recognition Interaction with paper on the digital desk Video mosaics for virtual environments Image registration techniques == Relevant research papers == Huang, T.S.; Netravali, A.N. (1994). "Motion and structure from feature correspondences: A review". Proceedings of the IEEE. 82 (2): 252–268. doi:10.1109/5.265351. D.G. Lowe. [1] Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston, 1985. Irani, M.; Peleg, S. (1991). "Improving resolution by image registration". CVGIP: Graphical Models and Image Processing. 53 (3): 231–239. doi:10.1016/1049-9652(91)90045-L. S2CID 4834546. Shivakumara, P.; Kumar, G. Hemantha; Guru, D. S.; Nagabhushan, P. (2006). "

    Read more →
  • Plotting algorithms for the Mandelbrot set

    Plotting algorithms for the Mandelbrot set

    There are many programs and algorithms used to plot the Mandelbrot set and other fractals, some of which are described in fractal-generating software. These programs use a variety of algorithms to determine the color of individual pixels efficiently. == Escape time algorithm == The simplest algorithm for generating a representation of the Mandelbrot set is known as the "escape time" algorithm. A repeating calculation is performed for each x, y point in the plot area and based on the behavior of that calculation, a color is chosen for that pixel. === Unoptimized naïve escape time algorithm === In both the unoptimized and optimized escape time algorithms, the x and y locations of each point are used as starting values in a repeating, or iterating calculation (described in detail below). The result of each iteration is used as the starting values for the next. The values are checked during each iteration to see whether they have reached a critical "escape" condition, or "bailout". If that condition is reached, the calculation is stopped, the pixel is drawn, and the next x, y point is examined. For some starting values, escape occurs quickly, after only a small number of iterations. For starting values very close to but not in the set, it may take hundreds or thousands of iterations to escape. For values within the Mandelbrot set, escape will never occur. The programmer or user must choose how many iterations–or how much "depth"–they wish to examine. The higher the maximal number of iterations, the more detail and subtlety emerge in the final image, but the longer time it will take to calculate the fractal image. Escape conditions can be simple or complex. Because no complex number with a real or imaginary part greater than 2 can be part of the set, a common bailout is to escape when either coefficient exceeds 2. A more computationally complex method that detects escapes sooner, is to compute distance from the origin using the Pythagorean theorem, i.e., to determine the absolute value, or modulus, of the complex number. If this value exceeds 2, or equivalently, when the sum of the squares of the real and imaginary parts exceed 4, the point has reached escape. More computationally intensive rendering variations include the Buddhabrot method, which finds escaping points and plots their iterated coordinates. The color of each point represents how quickly the values reached the escape point. Often black is used to show values that fail to escape before the iteration limit, and gradually brighter colors are used for points that escape. This gives a visual representation of how many cycles were required before reaching the escape condition. To render such an image, the region of the complex plane we are considering is subdivided into a certain number of pixels. To color any such pixel, let c {\displaystyle c} be the midpoint of that pixel. We now iterate the critical point 0 under P c {\displaystyle P_{c}} , checking at each step whether the orbit point has modulus larger than 2. When this is the case, we know that c {\displaystyle c} does not belong to the Mandelbrot set, and we color our pixel according to the number of iterations used to find out. Otherwise, we keep iterating up to a fixed number of steps, after which we decide that our parameter is "probably" in the Mandelbrot set, or at least very close to it, and color the pixel black. In pseudocode, this algorithm would look as follows. The algorithm does not use complex numbers and manually simulates complex-number operations using two real numbers, for those who do not have a complex data type. The program may be simplified if the programming language includes complex-data-type operations. for each pixel (Px, Py) on the screen do x0 := scaled x coordinate of pixel (scaled to lie in the Mandelbrot X scale (-2.00, 0.47)) y0 := scaled y coordinate of pixel (scaled to lie in the Mandelbrot Y scale (-1.12, 1.12)) x := 0.0 y := 0.0 iteration := 0 max_iteration := 1000 while (xx + yy ≤ 22 AND iteration < max_iteration) do xtemp := xx - yy + x0 y := 2xy + y0 x := xtemp iteration := iteration + 1 color := palette[iteration] plot(Px, Py, color) Here, relating the pseudocode to c {\displaystyle c} , z {\displaystyle z} and P c {\displaystyle P_{c}} : z = x + i y {\displaystyle z=x+iy\ } z 2 = x 2 + 2 i x y {\displaystyle z^{2}=x^{2}+2ixy} - y 2 {\displaystyle y^{2}\ } c = x 0 + i y 0 {\displaystyle c=x_{0}+iy_{0}\ } and so, as can be seen in the pseudocode in the computation of x and y: x = R e ⁡ ( z 2 + c ) = x 2 − y 2 + x 0 {\displaystyle x=\mathop {\mathrm {Re} } (z^{2}+c)=x^{2}-y^{2}+x_{0}} and y = I m ⁡ ( z 2 + c ) = 2 x y + y 0 . {\displaystyle y=\mathop {\mathrm {Im} } (z^{2}+c)=2xy+y_{0}.\ } To get colorful images of the set, the assignment of a color to each value of the number of executed iterations can be made using one of a variety of functions (linear, exponential, etc.). One practical way, without slowing down calculations, is to use the number of executed iterations as an entry to a palette initialized at startup. If the color table has, for instance, 500 entries, then the color selection is n mod 500, where n is the number of iterations. === Optimized escape time algorithms === The code in the previous section uses an unoptimized inner while loop for clarity. In the unoptimized version, one must perform five multiplications per iteration. To reduce the number of multiplications the following code for the inner while loop may be used instead: x2:= 0 y2:= 0 w:= 0 while (x2 + y2 ≤ 4 and iteration < max_iteration) do x:= x2 - y2 + x0 y:= w - x2 - y2 + y0 x2:= x x y2:= y y w:= (x + y) (x + y) iteration:= iteration + 1 The above code works via some algebraic simplification of the complex multiplication: ( i y + x ) 2 = − y 2 + 2 i y x + x 2 = x 2 − y 2 + 2 i y x {\displaystyle {\begin{aligned}(iy+x)^{2}&=-y^{2}+2iyx+x^{2}\\&=x^{2}-y^{2}+2iyx\end{aligned}}} Using the above identity, the number of multiplications can be reduced to three instead of five. The above inner while loop can be further optimized by expanding w to w = x 2 + 2 x y + y 2 {\displaystyle w=x^{2}+2xy+y^{2}} Substituting w into y = w − x 2 − y 2 + y 0 {\displaystyle y=w-x^{2}-y^{2}+y_{0}} yields y = 2 x y + y 0 {\displaystyle y=2xy+y_{0}} and hence calculating w is no longer needed. The further optimized pseudocode for the above is: x:= 0 y:= 0 x2:= 0 y2:= 0 while (x2 + y2 ≤ 4 and iteration < max_iteration) do x2:= x x y2:= y y y:= 2 x y + y0 x:= x2 - y2 + x0 iteration:= iteration + 1 Note that in the above pseudocode, 2 x y {\displaystyle 2xy} seems to increase the number of multiplications by 1, but since 2 is the multiplier the code can be optimized via ( x + x ) y {\displaystyle (x+x)y} . == Coloring algorithms == In addition to plotting the set, a variety of algorithms have been developed to efficiently color the set in an aesthetically pleasing way show structures of the data (scientific visualisation) === Histogram coloring === A more complex coloring method involves using a histogram which pairs each pixel with said pixel's maximum iteration count before escape/bailout. This method will equally distribute colors to the same overall area, and, importantly, is independent of the maximum number of iterations chosen. This algorithm has four passes. The first pass involves calculating the iteration counts associated with each pixel (but without any pixels being plotted). These are stored in an array IterationCounts[x][y], where x and y are the x and y coordinates of said pixel on the screen respectively. The first step of the second pass is to create an array NumIterationsPerPixel[n], where the array size n is the maximum iteration count. Next, one must iterate over the array of pixel-iteration count pairs IterationCounts[x][y], and retrieve each pixel's saved iteration count, i, via e.g. i = IterationCounts[x][y]. After each pixel's iteration count i is retrieved, it is necessary to index the NumIterationsPerPixel array at i and increment the indexed value (which is initially zero) -- e.g. NumIterationsPerPixel[i] = NumIterationsPerPixel[i] + 1. for (x = 0; x < width; x++) do for (y = 0; y < height; y++) do i:= IterationCounts[x][y] NumIterationsPerPixel[i]++ The third pass iterates through the NumIterationsPerPixel array and adds up all the stored values, saving them in total. The array index represents the number of pixels that reached that iteration count before bailout. total: = 0 for (i = 0; i < max_iterations; i++) do total += NumIterationsPerPixel[i] After this, the fourth pass begins and all the values in the IterationCounts array are indexed, and, for each iteration count i, associated with each pixel, the count is added to a global sum of all the iteration counts from 1 to i in the NumIterationsPerPixel array . This value is then normalized by dividing the sum by the total value computed earlier. hue[][]:= 0.0 for (x = 0; x < width; x++) do for (y = 0; y < height; y++) do iteration:= Iteration

    Read more →
  • Image analysis

    Image analysis

    Image analysis or imagery analysis is the extraction of meaningful information from images; mainly from digital images by means of digital image processing techniques. Image analysis tasks can be as simple as reading bar coded tags or as sophisticated as identifying a person from their face. Computers are indispensable for the analysis of large amounts of data, for tasks that require complex computation, or for the extraction of quantitative information. On the other hand, the human visual cortex is an excellent image analysis apparatus, especially for extracting higher-level information, and for many applications — including medicine, security, and remote sensing — human analysts still cannot be replaced by computers. For this reason, many important image analysis tools such as edge detectors and neural networks are inspired by human visual perception models. == Digital == Digital Image Analysis or Computer Image Analysis is when a computer or electrical device automatically studies an image to obtain useful information from it. Note that the device is often a computer but may also be an electrical circuit, a digital camera or a mobile phone. It involves the fields of computer or machine vision, and medical imaging, and makes heavy use of pattern recognition, digital geometry, and signal processing. This field of computer science developed in the 1950s at academic institutions such as the MIT A.I. Lab, originally as a branch of artificial intelligence and robotics. It is the quantitative or qualitative characterization of two-dimensional (2D) or three-dimensional (3D) digital images. 2D images are, for example, to be analyzed in computer vision, and 3D images in medical imaging. The field was established in the 1950s—1970s, for example with pioneering contributions by Azriel Rosenfeld, Herbert Freeman, Jack E. Bresenham, or King-Sun Fu. == Techniques == There are many different techniques used in automatically analysing images. Each technique may be useful for a small range of tasks, however there still aren't any known methods of image analysis that are generic enough for wide ranges of tasks, compared to the abilities of a human's image analysing capabilities. Examples of image analysis techniques in different fields include: 2D and 3D object recognition, image segmentation, motion detection e.g. Single particle tracking, video tracking, optical flow, medical scan analysis, 3D Pose Estimation. == Deep learning == Since the early 2010s, deep learning methods have substantially advanced the field of image analysis. In 2012, a deep convolutional neural network (CNN) known as AlexNet achieved a significant reduction in error rates on the ImageNet large-scale image classification benchmark, demonstrating the effectiveness of deep learning for visual recognition tasks. Subsequent architectures such as ResNet introduced residual connections that enabled training of much deeper networks, further improving accuracy across image analysis tasks. Real-time object detection became practical with frameworks such as YOLO (You Only Look Once), which unified detection and classification into a single network pass. In 2020, the Vision Transformer (ViT) demonstrated that transformer architectures, originally developed for natural language processing, could achieve competitive results on image classification when applied directly to sequences of image patches. More recently, foundation models trained on large-scale datasets have enabled zero-shot generalisation across image analysis tasks. The Segment Anything Model (SAM), trained on over one billion masks, can segment arbitrary objects in images without task-specific fine-tuning. These advances have made image analysis techniques increasingly accessible through browser-based tools and open-source implementations. == Applications == The applications of digital image analysis are continuously expanding through all areas of science and industry, including: anatomy, allows for precise measurements, visualization, and statistical analysis of anatomical structures. assay micro plate reading, such as detecting where a chemical was manufactured. astronomy, such as calculating the size of a planet. automated species identification (e.g. plant and animal species) defense error level analysis filtering machine vision, such as to automatically count items in a factory conveyor belt. materials science, such as determining if a metal weld has cracks. medicine, such as detecting cancer in a mammography scan. metallography, such as determining the mineral content of a rock sample. microscopy, such as counting the germs in a swab. automatic number plate recognition; optical character recognition, such as automatic license plate detection. remote sensing, such as detecting intruders in a house, and producing land cover/land use maps. robotics, such as to avoid steering into an obstacle. security, such as detecting a person's eye color or hair color. == Object-based == Object-based image analysis (OBIA) involves two typical processes, segmentation and classification. Segmentation helps to group pixels into homogeneous objects. The objects typically correspond to individual features of interest, although over-segmentation or under-segmentation is very likely. Classification then can be performed at object levels, using various statistics of the objects as features in the classifier. Statistics can include geometry, context and texture of image objects. Over-segmentation is often preferred over under-segmentation when classifying high-resolution images. Object-based image analysis has been applied in many fields, such as cell biology, medicine, earth sciences, and remote sensing. For example, it can detect changes of cellular shapes in the process of cell differentiation.; it has also been widely used in the mapping community to generate land cover. When applied to earth images, OBIA is known as geographic object-based image analysis (GEOBIA), defined as "a sub-discipline of geoinformation science devoted to (...) partitioning remote sensing (RS) imagery into meaningful image-objects, and assessing their characteristics through spatial, spectral and temporal scale". The international GEOBIA conference has been held biannually since 2006. OBIA techniques are implemented in software such as eCognition or the Orfeo toolbox.

    Read more →
  • Topological deep learning

    Topological deep learning

    Topological deep learning (TDL) is a research field that extends deep learning to handle complex, non-Euclidean data structures. Traditional deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), excel in processing data on regular grids and sequences. However, scientific and real-world data often exhibit more intricate data domains encountered in scientific computations, including point clouds, meshes, time series, scalar fields graphs, or general topological spaces like simplicial complexes and CW complexes. TDL addresses this by incorporating topological concepts to process data with higher-order relationships, such as interactions among multiple entities and complex hierarchies. This approach leverages structures like simplicial complexes and hypergraphs to capture global dependencies and qualitative spatial properties, offering a more nuanced representation of data. TDL also encompasses methods from computational and algebraic topology that permit studying properties of neural networks and their training process, such as their predictive performance or generalization properties. The mathematical foundations of TDL are algebraic topology, differential topology, and geometric topology. Therefore, TDL can be generalized for data on differentiable manifolds, knots, links, tangles, curves, etc. == History and motivation == Traditional techniques from deep learning often operate under the assumption that a dataset is residing in a highly-structured space (like images, where convolutional neural networks exhibit outstanding performance over alternative methods) or a Euclidean space. The prevalence of new types of data, in particular graphs, meshes, and molecules, resulted in the development of new techniques, culminating in the field of geometric deep learning, which originally proposed a signal-processing perspective for treating such data types. While originally confined to graphs, where connectivity is defined based on nodes and edges, follow-up work extended concepts to a larger variety of data types, including simplicial complexes and CW complexes, with recent work proposing a unified perspective of message-passing on general combinatorial complexes. An independent perspective on different types of data originated from topological data analysis, which proposed a new framework for describing structural information of data, i.e., their "shape," that is inherently aware of multiple scales in data, ranging from local information to global information. While at first restricted to smaller datasets, subsequent work developed new descriptors that efficiently summarized topological information of datasets to make them available for traditional machine-learning techniques, such as support vector machines or random forests. Such descriptors ranged from new techniques for feature engineering over new ways of providing suitable coordinates for topological descriptors, or the creation of more efficient dissimilarity measures. Contemporary research in this field is largely concerned with either integrating information about the underlying data topology into existing deep-learning models or obtaining novel ways of training on topological domains. == Learning on topological spaces == One of the core concepts in topological deep learning is considering the domain upon which this data is defined and supported. In case of Euclidean data, such as images, this domain is a grid, upon which the pixel value of the image is supported. In a more general setting this domain might be a topological domain. Studying and developing deep learning models that are supported ln topological domains constitute the essence of topological deep learning. Next, we introduce the most common topological domains that are encountered in a deep learning setting. These domains include, but not limited to, graphs, simplicial complexes, cell complexes, combinatorial complexes and hypergraphs. Given a finite set S of abstract entities, a neighborhood function N {\displaystyle {\mathcal {N}}} on S is an assignment that attach to every point x {\displaystyle x} in S a subset of S or a relation. Such a function can be induced by equipping S with an auxiliary structure. Edges provide one way of defining relations among the entities of S. More specifically, edges in a graph allow one to define the notion of neighborhood using, for instance, the one hop neighborhood notion. Edges however, limited in their modeling capacity as they can only be used to model binary relations among entities of S since every edge is connected typically to two entities. In many applications, it is desirable to permit relations that incorporate more than two entities. The idea of using relations that involve more than two entities is central to topological domains. Such higher-order relations allow for a broader range of neighborhood functions to be defined on S to capture multi-way interactions among entities of S. Next we review the main properties, advantages, and disadvantages of some commonly studied topological domains in the context of deep learning, including (abstract) simplicial complexes, regular cell complexes, hypergraphs, and combinatorial complexes. ==== Comparisons among topological domains ==== Each of the enumerated topological domains has its own characteristics, advantages, and limitations: Simplicial complexes Simplest form of higher-order domains. Extensions of graph-based models. Admit hierarchical structures, making them suitable for various applications. Hodge theory can be naturally defined on simplicial complexes. Require relations to be subsets of larger relations, imposing constraints on the structure. Cell Complexes Generalize simplicial complexes. Provide more flexibility in defining higher-order relations. Each cell in a cell complex is homeomorphic to an open ball, attached together via attaching maps. Boundary cells of each cell in a cell complex are also cells in the complex. Represented combinatorially via incidence matrices. Hypergraphs Allow arbitrary set-type relations among entities. Relations are not imposed by other relations, providing more flexibility. Do not explicitly encode the dimension of cells or relations. Useful when relations in the data do not adhere to constraints imposed by other models like simplicial and cell complexes. Combinatorial Complexes : Generalize and bridge the gaps between simplicial complexes, cell complexes, and hypergraphs. Allow for hierarchical structures and set-type relations. Combine features of other complexes while providing more flexibility in modeling relations. Can be represented combinatorially, similar to cell complexes. ==== Hierarchical structure and set-type relations ==== The properties of simplicial complexes, cell complexes, and hypergraphs give rise to two main features of relations on higher-order domains, namely hierarchies of relations and set-type relations. ===== Rank function ===== A rank function on a higher-order domain X is an order-preserving function rk: X → Z, where rk(x) attaches a non-negative integer value to each relation x in X, preserving set inclusion in X. Cell and simplicial complexes are common examples of higher-order domains equipped with rank functions and therefore with hierarchies of relations. ===== Set-type relations ===== Relations in a higher-order domain are called set-type relations if the existence of a relation is not implied by another relation in the domain. Hypergraphs constitute examples of higher-order domains equipped with set-type relations. Given the modeling limitations of simplicial complexes, cell complexes, and hypergraphs, we develop the combinatorial complex, a higher-order domain that features both hierarchies of relations and set-type relations. The learning tasks in TDL can be broadly classified into three categories: Cell classification: Predict targets for each cell in a complex. Examples include triangular mesh segmentation, where the task is to predict the class of each face or edge in a given mesh. Complex classification: Predict targets for an entire complex. For example, predict the class of each input mesh. Cell prediction: Predict properties of cell-cell interactions in a complex, and in some cases, predict whether a cell exists in the complex. An example is the prediction of linkages among entities in hyperedges of a hypergraph. In practice, to perform the aforementioned tasks, deep learning models designed for specific topological spaces must be constructed and implemented. These models, known as topological neural networks, are tailored to operate effectively within these spaces. === Topological neural networks === Central to TDL are topological neural networks (TNNs), specialized architectures designed to operate on data structured in topological domains. Unlike traditional neural networks tailored for grid-like structures, TNNs are adept at handling more intricate data representations, such as graphs

    Read more →
  • 1.58-bit large language model

    1.58-bit large language model

    A 1.58-bit large language model (also known as a ternary LLM) is a type of large language model (LLM) designed to be computationally efficient. It achieves this by using weights that are restricted to only three values: -1, 0, and +1. This restriction significantly reduces the model's memory footprint and allows for faster processing, as computationally expensive multiplication operations can be replaced with lower-cost additions. This contrasts with traditional models that use 16-bit floating-point numbers (FP16 or BF16) for their weights. Studies have shown that for models up to several billion parameters, the performance of 1.58-bit LLMs on various tasks is comparable to their full-precision counterparts. This approach could enable powerful AI to run on less specialized and lower-power hardware. The name "1.58-bit" comes from the fact that a system with three states contains log 2 ⁡ 3 ≈ 1.58 {\displaystyle \log _{2}3\approx 1.58} bits of information. These models are sometimes also referred to as 1-bit LLMs in research papers, although this term can also refer to true binary models (with weights of -1 and +1). == BitNet == In 2024, Ma et al., researchers at Microsoft, declared that their 1.58-bit model, BitNet b1.58 is comparable in performance to the 16-bit Llama 2 and opens the era of 1-bit LLM. BitNet creators did not use the post-training quantization of weights but instead relied on the new BitLinear transform that replaced the nn.Linear layer of the traditional transformer design. In 2025, Microsoft researchers had released an open-weights and open inference code model BitNet b1.58 2B4T demonstrating performance competitive with the full precision models at 2B parameters and 4T training tokens. == Post-training quantization == BitNet derives its performance from being trained natively in 1.58 bit instead of being quantized from a full-precision model after training. Still, training is an expensive process and it would be desirable to be able to somehow convert an existing model to 1.58 bits. In 2024, HuggingFace reported a way to gradually ramp up the 1.58-bit quantization in fine-tuning an existing model down to 1.58 bits. == Critique == Some researchers point out that the scaling laws of large language models favor the low-bit weights only in case of undertrained models. As the number of training tokens increases, the deficiencies of low-bit quantization surface.

    Read more →
  • Symbaloo

    Symbaloo

    Symbaloo is a cloud-based site that allows users to organize and categorize web links in the form of buttons. Symbaloo works from a web browser and can be configured as a homepage, allowing users to create a personalized virtual desktop accessible from any device with an Internet connection. Symbaloo users, which must be previously registered, have a page with a grid of buttons that can be configured to link to a specific page. The site allows users to assign different colors to the buttons for easy visual classification. Symbaloo allows a single user to create different pages or screens with buttons. These screens called webmix are useful to separate topics and links can be shared with other users, making them public and sending the link via email. As of 2015 Symbaloo has 6 million users worldwide and mainly used as an online education resource. Symbaloo's slogan is "Start Simple".

    Read more →
  • Phrase structure grammar

    Phrase structure grammar

    The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue (Post canonical systems). Some authors, however, reserve the term for more restricted grammars in the Chomsky hierarchy: context-sensitive grammars or context-free grammars. In a broader sense, phrase structure grammars are also known as constituency grammars. The defining character of phrase structure grammars is thus their adherence to the constituency relation, as opposed to the dependency relation of dependency grammars. == History == In 1956, Chomsky wrote, "A phrase-structure grammar is defined by a finite vocabulary (alphabet) Vp, and a finite set Σ of initial strings in Vp, and a finite set F of rules of the form: X → Y, where X and Y are strings in Vp." == Constituency relation == In linguistics, phrase structure grammars are all those grammars that are based on the constituency relation, as opposed to the dependency relation associated with dependency grammars; hence, phrase structure grammars are also known as constituency grammars. Any of several related theories for the parsing of natural language qualify as constituency grammars, and most of them have been developed from Chomsky's work, including Government and binding theory Generalized phrase structure grammar Head-driven phrase structure grammar Lexical functional grammar The minimalist program Nanosyntax Further grammar frameworks and formalisms also qualify as constituency-based, although they may not think of themselves as having spawned from Chomsky's work, e.g. Arc pair grammar, and Categorial grammar.

    Read more →
  • Robot Monk Xian'er

    Robot Monk Xian'er

    Robot Monk Xian'er (Chinese: 贤二机器僧) is a humanoid robot based on the cartoon character Xian'er. It was developed by a team of monks, volunteers and AI experts from Beijing Longquan Monastery in Beijing, China. He can follow human instructions to make body movements, read scriptures and play Buddhist music. He can chat and respond to people's emotional and spiritual questions with Buddhist wisdom. As a chatbot, Robot Monk Xian'er is available on certain public platforms including WeChat and Facebook. Over the years, master Xuecheng, the abbot of Beijing Longquan Monastery, replied to thousands of questions on Sina Weibo. These questions and their answers become the data source of the chatbot.

    Read more →