OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B
The previous analysis I posted here dealt with the OCR process and analysis results obtained with Microsoft Florence-2-base and Alibaba Cloud Qwen2-VL-2B.
DeepSeek has recently released a new model named Janus, described as a “novel autoregressive framework [that] unifies multimodal understanding and generation”. It has been trained on a corpus of approximately 500B text tokens and supports image inputs at a resolution of 384 x 384. Unlike other multimodal models, Janus uses a single, unified transformer architecture for processing, and it is described as addressing “the limitations of previous approaches by decoupling visual encoding into separate pathways”. Janus-1.3B was released on October 18, 2024.
After analyzing the OCR and image-analysis results of Florence-2-base and Qwen2-VL-2B, it seemed interesting to compare them with those of the very recently released Janus-1.3B.
This study was conducted under the same conditions as the previous one. The model was run on Google Colab. The same commands and instructions given to Florence-2-base and Qwen2-VL-2B were given to Janus-1.3B as well. The material was the same corpus of images containing text (examples of handwritten and typed text from different periods and in different languages, including an artwork containing elements of text).
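The protocol described above (every command applied to every image) can be sketched as a small loop. This is only an illustrative outline, not the notebook used for the study: the image file names are hypothetical, and `fake_model` is a placeholder standing in for the real model call, whose exact API is not shown in this article.

```python
from typing import Callable, Dict, List

# The four commands quoted throughout this study.
COMMANDS = ["CAPTION", "DETAILED_CAPTION", "MORE_DETAILED_CAPTION", "OCR"]


def run_study(
    images: List[str],
    run_model: Callable[[str, str], str],
) -> Dict[str, Dict[str, str]]:
    """Apply every command to every image and collect the outputs."""
    results: Dict[str, Dict[str, str]] = {}
    for image in images:
        results[image] = {
            command: run_model(image, command) for command in COMMANDS
        }
    return results


def fake_model(image_path: str, command: str) -> str:
    # Placeholder (assumption): replace with the actual call to the model.
    return f"{command} output for {image_path}"


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    outputs = run_study(["constitution.jpg", "van_gogh_letter.jpg"], fake_model)
    for image, per_command in outputs.items():
        for command, text in per_command.items():
            print(image, command, text, sep=" | ")
```

Passing the model call in as a function keeps the loop identical for each model being compared, so only the callable has to change.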
Analysis of the U.S. Constitution by Janus-1.3B
The first image analyzed for this study by Janus-1.3B was an extract of the first page of the U.S. Constitution from 1787.
We first used the command “CAPTION” to analyze this image. Here is the description obtained with Janus-1.3B: “A piece of paper that says "Wet the People" on it”.
The second command “DETAILED_CAPTION” led to the following: “In this image we can see a paper with some text on it”.
The third command was the “MORE_DETAILED_CAPTION”: "The image is a page of a book. The title of the book is wet the people. The page is tan colored. The writing on the page is written in black ink. There is a signature at the bottom of the page. The signature is in a foreign language. The book is old and worn. There are words written in a cursive language".
As these results show, Janus-1.3B's analyses of the image and of the text are more literal: although they carefully describe the appearance and aesthetic qualities of the page, they give no context about the document itself.
Interestingly, the model also states (in English) that the text is written in a “foreign language”. One possible explanation is that Janus-1.3B was released by the Chinese company DeepSeek, which might have led it to treat English as a foreign language. The cursive style of the letters might also have caused a misinterpretation.
For comparison, here are the descriptions obtained with Florence-2-base, for which it seems clear that the model drew on its own knowledge to analyze the image:
Although Florence-2-base automatically (and rightly) identifies the document as the U.S. Constitution, its results also underline the mistakes that can arise when a model bases its analysis on its own prior knowledge (in this case, when it indicated that the U.S. Constitution was signed by John F. Kennedy, probably by association with other events in U.S. history).
Janus-1.3B's OCR processing of the U.S. Constitution led to this result:
"Wethe Peopleof the Unitedinsure domestic Frangultty provide for this common defense, promote theand our Pouterity, al ordan and establish this Constitution for the United!Kircle 1.Statian. All aguatine Poree lown gandale sutting by the bockitian, and hil hul muitie of the houd and houStation. 2. The The The of The The Tho of The Thon of the Thon and the thon of thon and thon.The thon is the tht of the thom of the fonon and hon. The thon on the thot of the sotn of the bonon, and the honon of honn of thom.The hanan of the lonon in the hanon of a hon, the hs the ht of thn. The han on the hinon of tn the hil of the uonon".
As we can see, although the model seems to read the individual letters correctly at first, Janus-1.3B struggles to transcribe whole words and to respect their position in the layout of the document, which leads to spelling mistakes and inaccurate output. The model also fell into a repetition loop rather quickly, unlike the other models we used previously. Janus-1.3B has trouble processing different types of layout when analyzing text in images.
Analysis of a handwritten text in French: the Vincent Van Gogh letter
The next text was already analyzed with Florence-2-base and Qwen2-VL-2B in our previous study: a handwritten letter in French, written by Vincent Van Gogh in 1888, which also contains a drawing.
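The looping behavior noted above can be flagged automatically. The sketch below is one possible heuristic, not something used in this study: it measures the share of word trigrams that repeat an earlier trigram, and the function name and the choice of n = 3 are assumptions made for illustration.

```python
import re


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the share of word n-grams that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values close to 1.0 suggest that the
    output is stuck repeating the same phrases.
    """
    words = re.findall(r"\w+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Excerpt from the Janus-1.3B OCR output quoted above.
    excerpt = (
        "The The The of The The Tho of The Thon of the Thon "
        "and the thon of thon and thon."
    )
    print(round(repeated_ngram_ratio(excerpt), 2))
```

A threshold for deciding that an output has looped would have to be tuned on real transcriptions; none is proposed here.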
Here is the result obtained by Janus-1.3B with the “CAPTION” command: “A piece of paper with a drawing of a man and a tree on it”. Unlike Florence-2-base and Qwen2-VL-2B, Janus-1.3B did not give an interpretation of the text. The description of the image is accurate but brief, without any context or explanation.
We pushed the model to describe the image further with the “DETAILED_CAPTION”, which led to this result: “In this image we can see a paper. On the paper there is a drawing of a person, tree and some text”. Once again, the result is accurate and literal, and it contains no context and no attempt at explaining the image or transcribing the text.
The “MORE_DETAILED_CAPTION” result was the following: “A piece of paper with writing on it. There is a drawing on the paper. The drawing is black and white. The writing is in a foreign language”. Janus-1.3B correctly detected the presence of the drawing (in black and white) as well as the fact that the text is written in a foreign language (this time French, which it does not identify).
Here is Janus-1.3B OCR transcription of the French letter: “Est Céguilo ont la le live cle JelvostreJn Euy Delacrout anna' que l'arricleJn la couleurdus la grummane clesart. cluclofin ole ch. Blanc.demandes leundone cela de mau purst ofVernon vido nont pas du celu g nilsle lisent. Fépense mus a Rembrufplus g'n'l ne peut praire dans mesclado.Dolci Grugquis de me clermire tule entrainenco m juneur, Jumme Solleil, Gel yurt jum e nuy owne. loberau vielat le samur 1l'utre blue de prese/tide de 30”.
Compared with Qwen2-VL-2B, which delivered the best transcription of the French letter (and accurately transcribed the names of the artists mentioned, such as Delacroix), or Florence-2-base, which made an attempt at deciphering and transcribing the French text, this result shows that Janus-1.3B clearly has trouble with this text, and especially with French. The fact that the letter is handwritten might have made it even more difficult for Janus-1.3B to transcribe it correctly.
Analysis of a printed text in English: The New York Times front page from 1912
To compare Janus-1.3B's ability to OCR and analyze text with that of Florence-2-base and Qwen2-VL-2B, this study used another image already analyzed with the two other models: the front page of The New York Times from April 16, 1912, announcing the sinking of the Titanic.
The “CAPTION” command led to this brief description: “A newspaper article about the New York Times”. The phrasing used by Janus-1.3B is worth noting: it indicates that this is not an article “from” The New York Times but apparently “about” it. This result shows the model has correctly recognized the layout and format of the newspaper as well as its name. However, it does not deliver any further information or context, such as the date or the event described on the front page.
The “DETAILED_CAPTION” was the following: “In this image we can see a paper. On the paper there is a picture of a ship. Also something is written on the paper”. With this instruction, the model once again focuses more on the aesthetic qualities of the image than on its textual content. While it mentions the picture of the ship, it makes no attempt to explain or contextualize it. Janus-1.3B mentions that “something is written on the paper” but it does not process any elements of the text.
The “MORE_DETAILED_CAPTION” command brought this result: “The New York Times is written in black and white. There is a picture of a ship in the water. The ship is large and has a lot of smoke coming out of it”. Once more, Janus-1.3B correctly identifies the main elements of the image (the name of the newspaper and the picture of a ship, which it describes) but does not deliver any other information about the text. It does not identify which event the page refers to, even though the sinking of the Titanic is clearly stated at the top of the article. Unlike Florence-2-base and Qwen2-VL-2B, Janus-1.3B is less focused on context and does not try to retrieve information from its own knowledge, which can have both a positive and a negative impact on the results. Because the model does not draw on contextual knowledge, it does not bring a bias to the content, misunderstand it or introduce mistakes into the description. The drawback is that, without any attempt to contextualize or analyze the image, the transcription and description remain very brief, shallow and incomplete.
When applied to Janus-1.3B, the “OCR” command led to the following transcription:
"Wall the News That'sThe New York Times.THE WEATHER.Fit to Print.JINY YORK, TURBAL, IL. 22-WENT-POIN PAGLE.WILL THE WEATHER,WILLI. J. T. TUNDAY, J.T. 22 - TWENT-POUR PAIGLE.COM.TITANIC SINKS FOUR HOURS AFTER HITTING ICEBERG;866 RESCUED BY CARPATHIA, PROBABLY 1250 PERISH;ISMA SAFE, MRS. ASTOR MAYBE, NOTED NAMES MISSINGCol. Astorstraus and Bride,Biggest Liner Pungersand Mj. But. Boardto Bottomof A.20 A.M.ROULE OF SEAT FOLLOWEDRESCULES THERE TOO HAN,PICKED UP ATTERHOUSESWOMEN AND CHILDREN PIRTYCUNDER HOPFULL ALL DAYSEA SEARCH FOR OTHERSFRANKEN HOPPULL, ALL DAYTHEAD OF THE LINE AROUNDOLIVING SENDS FOR THE NEWThe Lost Titanic Belfing Towed Out of Belfast Harbor.PARTAL LIST. OF THE SAVED.The Lostitanic Belfing, Towed out of Belast Harbor.LIVING SENDS OF THE SAVED.In addition to Mr. Wilkins, Mr. Willeman, The Mr. William, the Mr.Willeman and Mr. Willam, The M. Wilman, the Mrs. Wileman, and the Brides of the Cetetet, The St.Wilman and Mrs. Mr. williams, The Mrs.Wileman and the Mrs, The Marlast Harbor.The lostitanic Belfing, LIST OF THE SAFED.Coffee St. F. M.F.M., I.T., The Str.W. W. WILM. The Mrs. Wollman, who is the M.Walesman, I'm. The Marleman and I. Walesman will be the Marlant, and I'm, The Man. The Man is the Man.The Man, who was the Man, and The Man, the Man's The Man was the Woman, The The Man's Man, The Woman's Man. He's the Man-The Man was The Man-Towel, and he's the man-The man was the man. The man was a Man-and I'm the Man Man, He's a Man, he's a man-and the Man to the Man -The Man-A".
Compared to the other results, the OCR output for The New York Times front page indicates that Janus-1.3B can transcribe printed text fairly well, especially when processing individual letters. However, it makes no attempt to place the words in a more readable and accurate layout. While part of the article is correctly transcribed, in some parts there is no spacing between the words. The model also had difficulty processing and correctly transcribing the large amount of information in the header (name, slogan, date, location, article headlines, etc.). It also fell into a repetition loop at the end.
Analysis of a handwritten text in English: the letter written by Queen Elizabeth II
The next text analyzed by Janus-1.3B was a letter written in 1945 by the future Queen Elizabeth II, which we had also analyzed previously with Florence-2-base and Qwen2-VL-2B.
In this image, the text is handwritten, this time in English. As it is only an extract from one page of the letter, the signature of then Princess Elizabeth does not appear. Only the “Buckingham Palace” heading of the letter gives context.
The first command given to Janus-1.3B was “CAPTION”, which led to the following description: “A letter written in cursive is dated April 24, 1945”. Once again, the description is accurate, including the date it mentions, but the model makes no attempt to process, understand or explain the context, based on either the text or the image as a whole.
The “DETAILED_CAPTION” led to this result: “In this image we can see a paper with some text and a stamp on it”. Interestingly, even when instructed to give more information, Janus-1.3B produced a less specific description: it noted the presence of a stamp but this time did not transcribe the date.
The “MORE_DETAILED_CAPTION” command produced this more complete description: “A letter is written on a cream colored paper. The letters are written in cursive. At the top of the paper is a red stamp with the words "Buckingham Palace" written on it”. In this example, Janus-1.3B again processes and describes the image correctly (it gives the color of the paper, mentions that the letters are cursive and notes the red “Buckingham Palace” stamp). This shows its ability to analyze the aesthetic elements of the image as well as some of the text. Still, it is interesting to point out that, unlike Florence-2-base and Qwen2-VL-2B, which made many attempts at contextualizing the content, Janus-1.3B does not try this at all. The analyses by Florence-2-base and Qwen2-VL-2B led to very descriptive and almost imaginative results, for example trying to find a potential recipient for the letter by drawing on their own knowledge (Florence-2-base even claimed it was written to Queen Elizabeth’s future husband, Prince Philip, which is not the case at all), whereas Janus-1.3B makes no such attempt.
The OCR analysis of the Queen Elizabeth II letter was the following: “a April1945.BUCKINGHAM PALACEDear may,J was so delghilled toreceive to your letter of goodwishes,for my birthday.Thankyn s so much for thinking of me.I'm sorry to cheer That youare on sick leave, but-9 do hopeyou are feeling letters now. Thisto walker a good time of yourto get leave reallyand won thatwe have got the good weather,are migll as well make less of it.I've just finished a wechauicscurse in the A.I.S. which 9”.
This transcription shows that while Janus-1.3B is able to transcribe letters accurately, it has considerable difficulty recognizing whole words and placing them in a correct sentence. It also does not contextualize the text at all. Overall, the transcription of the English handwritten text was better than that of the French letter, which did not make any sense. Still, there are many errors in the English transcription as well (for example, it reads "are feeling letters now" instead of "are feeling better now").
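Errors of this kind can be quantified rather than only listed. The sketch below is an illustration and was not part of the study: it computes a character error rate (Levenshtein edit distance divided by the length of the reference text), applied here to the one phrase whose correct reading is given above. The function names are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "are feeling better now"    # correct reading
    hypothesis = "are feeling letters now"  # Janus-1.3B output
    print(levenshtein(reference, hypothesis))  # 2 edits: b -> l, extra s
    print(round(character_error_rate(reference, hypothesis), 3))
```

Computing this over a whole page requires a trusted reference transcription of each document, which this article does not provide.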
Analysis of a piece of art containing elements of text: NOTARY (1983) by Jean-Michel Basquiat
To have a more complete overview of Janus-1.3B's abilities in OCR processing and transcription of text in images, it seemed necessary to add another type of image containing text: the same artwork by Jean-Michel Basquiat that was previously analyzed with Florence-2-base and Qwen2-VL-2B was therefore also processed by Janus-1.3B for this study.
This painting, titled NOTARY and dated 1983, blends visual and textual elements, as is often the case in Basquiat’s art.
Since Florence-2-base correctly identified the name of the artist and Qwen2-VL-2B delivered a rather accurate description of all the words in this artwork, it is interesting to compare their results with those obtained with Janus-1.3B.
The “CAPTION” command led to this first description, which was extremely brief yet correct: “A painting of a man's head and body with lots of writing on it”.
The “DETAILED_CAPTION” was in fact even shorter: “In this image we can see a painting”. While this shows that Janus-1.3B is able to detect artworks, it adds no context and neither explains nor transcribes the content.
The results with the “MORE_DETAILED_CAPTION” were more complete and accurate: "The image is a painting. The painting is colorful. There is a person's face in the middle of the painting. There are words written on the painting as well. The words are written in a different language. The face is white with black eyes. The mouth is open. The nose is black. The eyes are blue. The outline of the face is red". In this case as well, Janus-1.3B shows potential to process and analyze visual content. The aesthetic qualities are the main focus, with particular attention to one of the figures in the painting. Although it does mention that “There are words written on the painting as well”, the model did not attempt to transcribe any of them. Here again it indicates that the words are "written in a different language". As with the previous images, Janus-1.3B did not contextualize the Basquiat painting, nor did it try to identify the artist or the subject.
The OCR command for the Basquiat painting led to this transcription: “NOTARYCASCALO0MARITPLUTO0ELEASPVMAROUSSTUDY OF THE150.MALE TORSOHESSYDEUTUTUTO150RE VULDEHYDRATISICKLESDEHYDRAMATTOCKSTHIS NITEFOR ALL DE BTSSALTESALERPUBLIC+PRIVATE46.LEECHESBUCKERMANITES47.LEECCHES304. BRANNER”.
As with the results obtained with Florence-2-base, the transcription by Janus-1.3B shows that the model is able to detect and transcribe the individual letters in this image accurately, but the output is not easily readable because the words are run together without any spacing. In this case, Qwen2-VL-2B actually managed to deliver a more correct transcription of the text elements in this artwork.
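The missing spacing can also be checked with a very simple measure. The sketch below is an illustrative heuristic, not a method used in this study: it reports the longest run of characters without any whitespace, which becomes unusually large when words are fused together. The function name is an assumption made for this example.

```python
def longest_unbroken_run(text: str) -> int:
    """Length of the longest sequence of characters without whitespace."""
    return max((len(token) for token in text.split()), default=0)


if __name__ == "__main__":
    # Beginning of the Janus-1.3B OCR output for NOTARY, quoted above.
    fused = "NOTARYCASCALO0MARITPLUTO0ELEASPVMAROUSSTUDY OF THE150.MALE TORSO"
    print(longest_unbroken_run(fused))
```

Ordinary English words rarely exceed about twenty characters, so a much longer run is a reasonable hint that word boundaries were lost, although the exact cutoff is a judgment call.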
Conclusion
To conclude this study, Janus-1.3B might have potential in image analysis. While its descriptions of the images are clear, to the point and brief, the lack of context can also be problematic in some cases. Janus-1.3B is an intuitive model and it does manage to deliver accurate aesthetic descriptions, but it keeps missing important elements, especially when processing text.
As we saw in our previous study, the results were more successful with typed, printed text. For Janus-1.3B, the examples of handwritten text were much more difficult to process, even more so when the text was in French. As seen with the Vincent Van Gogh letter, the transcription did not make any sense at all, even less so than the Florence-2-base transcription, which already contained many errors.
When processing textual and visual content, Janus-1.3B makes no attempt to contextualize it. Unlike Florence-2-base and Qwen2-VL-2B, it does not use its knowledge to situate and explain the content of the image. While drawing on such knowledge led to mistakes and misunderstandings in the results of the two other vision-language models, the lack of context made the transcriptions and descriptions of Janus-1.3B incomplete and very short. It also leads to misread words and to layout issues. This problem also suggests the model might have difficulty understanding and processing the image as a whole. Still, these very literal descriptions have the benefit of not introducing any bias about the text, and they do not lead the model to misinterpret the content.
The more positive results obtained when describing the aesthetic elements of images (especially with the Basquiat painting) indicate Janus-1.3B might be better suited, as of now, to processing visual content than text. The captions produced by the model always focus mainly on the appearance of the image rather than on the text. It might be interesting, in a future study, to see how Janus-1.3B analyzes artworks and visual content, again compared with Florence-2-base and Qwen2-VL-2B.
Bibliography.
- The Constitution of the United States, 1787
- Vincent Van Gogh, Letter to Theo Van Gogh, Arles, Nov. 21, 1888, Van Gogh Letters (https://www.vangoghletters.org/vg/letters/let722/letter.html#original)
- “Titanic Sinks Four Hours After Hitting Iceberg”, from The New York Times, April 16, 1912, Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/0/04/Titanic-NYT.jpg)
- Queen Elizabeth II, Letter to Mary, April 24, 1945 (https://www.express.co.uk/news/royal/1668073/queen-handwritten-letter-friend-world-war-two-spt)
- Jean-Michel Basquiat (1960-1988), NOTARY, 1983, Princeton University Art Museum