Gemini for MacOSX User Manual

Table Of Contents

INTRODUCING GEMINI
-OBTAINING UPDATES
-HOW DOES “GEMINI” WORK?

GEMINI INTERFACE BASICS
-THE TOOL BAR

EXPORTING PAGES

EXPORTING ARTICLES
-WHAT IS AN ARTICLE?
-ADDING ARTICLE THREADS TO A DOCUMENT

EXPORTING BATCHES OF DOCUMENTS

EXPORTING TABLES
-EXPORTING A TABLE
-SPANNING IMAGE & TABLE BOXES ACROSS PAGES

EXPORTING IMAGES & VECTOR ARTWORK
-EXPORTING CHARTS AND GRAPHS
-INCLUDING TEXT IN RENDERED IMAGES
-CONVERTING WHOLE PAGES INTO IMAGES

REMOVING PAGE HEADERS & FOOTERS

SETTING PREFERENCES
-GENERAL PREFERENCES TAB
--TEXT OUTPUT FORMAT
--TRY TO PRESERVE LAYOUT
--CREATE FILE FOR EVERY PAGE / ARTICLE
--TREAT AS HEBREW
--IMAGE FORMAT MENU
--CONVERT EACH PAGE INTO AN IMAGE
--PLACE IMAGES IN SUB-FOLDER
-TEXT PREFERENCES TAB
--IDENTIFY IMAGE CAPTIONS
--RENDER TEXT IN ARTWORK
--RECOGNIZE “SPEECH MARKS”
--PRESERVE LINE BREAKS
--PAGE BREAKS
--HYPERLINKS
--BOOKMARKS
--USE UTF-8 ENCODING
-IMAGE PREFERENCES TAB
--AUTO-IDENTIFY VECTOR ARTWORK
--USE ORIGINAL OPI NAME IF AVAILABLE
-IMAGE SCALING
--FIX IMAGE RESOLUTION AT:
--ADVANCED SCALING POP-UP
--COLOUR DEPTH TO USE WHEN RENDERING
-HTML PREFERENCES
--BACKGROUND
--HEADER & FOOTER
--PAGE/ARTICLE NAVIGATION
--IMAGE BORDERS
--FRAMES

TEXT OUTPUT FORMATS EXPLAINED
-PLAIN TEXT FORMAT
-RTF FORMAT
-SIMPLE HTML FORMAT
-HTML 3.2 FORMAT
-HTML 4 FORMAT
-HTML 4 CSS FORMAT
-OPEN EBOOK FORMAT
-PALM DOC FORMAT

IMAGE FORMATS EXPLAINED
-JPEG FORMAT
-TIFF FORMAT
-TIFF (MULTIPAGE)
-BMP
-PNG
-EPS

FREQUENTLY ASKED QUESTIONS

TIPS FOR EXPORTING TEXT
TABLE OUTPUT
PALM DOC OUTPUT

Introducing Gemini
Gemini is a stand-alone application program for Macintosh computers that enables you to extract and re-use text and images held within PDF documents. The software takes care of much of the labor-intensive work involved in repurposing PDF documents such as reflowing and de-hyphenating text, translating hypertext links and converting embedded images.

Content can be extracted in pages or complete articles (using article threads) or by selecting individual tables and images. Gemini also offers a batch facility for automated conversion of hundreds of documents at a time.

A wide range of text and image output formats is supported including RTF, HTML, JPEG, PNG and Tiff. Additional controls are provided for HTML output enabling various timesaving effects to be achieved such as importing corporate headers and footers on each page.

Gemini incorporates a basic PDF viewer that allows you to zoom in and out, mark-up tables, images and article threads and add crop zones. There is also a basic ‘search’ facility that can locate words anywhere in a document.

Obtaining updates
Please visit www.iceni.com from time to time to check for news.

All updates up to version 2.0 will be free of charge to existing users.

How does “Gemini” work?
The text held within a PDF page need not be stored in any particular order and sometimes not even as discrete words but as a mix of “b roke n” words. Gemini analyses each page before output and then pieces together this jigsaw based upon many different factors such as position, font, character size and color of text.

Gemini cannot recognize tabulated data without it first being marked-up. If a table is marked-up using the “Table Box” facility, it will treat the area in a different way and generate clean, structured tables in the output. See “Exporting Tables” for details.

In a similar vein, you can also mark-up graphic regions to be rendered. This is especially useful for vector-based artwork that cannot otherwise be exported. See “Exporting Images & Vector Artwork” for more information.

Gemini Interface Basics
The main window of Gemini shows a toolbar at the top, the currently open document in the middle together with bookmarks (revealed by pressing F5) and the navigation buttons at the bottom.

Only one document may be open at any time. You can, however, close a document and open a new one.

Notes
• The current page number and total number of pages are shown in the title bar of the window.
• The dotted lines shown in the screenshot indicate the location of Article Thread boxes.

Exporting Pages
1. Open a PDF document
2. Choose the “Export Pages…” option from the menu bar.
3. Choose a page range for export
4. Modify the output settings by pressing the “Options…” button
5. Press “OK” to begin the export

Notes
• If some pages contain tabular data, mark-up the tables using the “Table Box” tool. If this is not done, the tables may appear jumbled when output. See “Exporting Tables” for more information.

• Export can be interrupted at any point by pressing the “Esc” key – there may be a short delay before Gemini responds, depending upon the function being performed at the time.

• If the text output does not appear to flow in the correct reading order, it may be that Gemini is unable to determine what the order should be. In these cases it helps to place an “Article Thread” around the problem area – see “Adding Article Threads to a Document” for more information.

Exporting Articles
Gemini can extract content from articles in documents. The advantage of article export is that only the requested articles are exported; everything else is ignored. Also, paragraphs spanning columns and even pages are reflowed to give seamless, cross-column, cross-page output.

Select “Export--> Articles…” from the main menu

What is an article?
An Article Thread is a sequence of boxes drawn around paragraphs of text to indicate the order in which the text should be read. The sequence can extend across pages.

In Gemini, Articles can be used to arbitrarily reorder documents. If you have a document that does not output in the correct reading order, use article boxes to reorder the text flow. You can then export using Article or Page mode.

Notes
• When exporting articles, layout cannot be preserved.
• If the “Create file for every page/article” option is selected, Gemini will write each complete article to a different file. Furthermore, when using HTML output with article mode, Gemini will produce a table of contents linking to each article output.

Adding Article Threads To A Document
1. To add article threads to your document, open the page containing the start of the story and select the “Article Tool” from the toolbar.

2. Drag a box around the first column of text in the article. Continue dragging boxes around subsequent columns, changing pages as needed to follow the story’s flow.

3. When the last box of your article has been drawn, press the “Esc” key to finish the article and display the “Article Properties” dialogue. Here you can enter information about the article such as the title and author. Some of this information may be exported with the article.

Notes
• You can create as many articles as you need to cover every story in a document. If you then save the document, the article boxes will be saved with it and will be ready for use next time it is loaded.
• Other PDF viewing applications such as Acrobat Reader should also take notice of the new article threads added by Gemini.

Exporting Batches of Documents
Gemini offers the ability to process a batch of documents automatically. Using this feature, many hundreds of documents may be "queued-up" and processed with little or no manual intervention.

1. Select “Export Batch…” from the main menu
2. Choose the page range to export from each document in the batch. If you want to export Articles from documents rather than pages, check the “Export article threads” box now. Gemini can store output from each file in a separate folder. Check the “Create new folder for each file” box to do this.
3. Press OK.
4. Choose your documents from the file selection dialogue. You can drag-select to choose multiple files or use the Shift and Command keys. All selected files must be from the same folder.
5. Choose an output folder for the converted documents.

If for some reason a document cannot be processed, for example if its security settings do not permit content extraction, the program will display a message and wait for a response before continuing.

Exporting Tables
Tables can be exported directly by clicking on them (when marked-up) or as part of a normal page or article export.

Gemini does not automatically detect tables in PDF documents*, so all tables should be marked-up to ensure that layout is preserved correctly.

*Gemini will automatically locate tables in “tagged” PDF documents produced from Microsoft Word using Adobe’s PDFMaker plug-in. You do not need to mark-up tables in this case.

If your document contains similar tables across a range of pages (such as a financial report), use Gemini’s “spanning” feature. This allows a single table box to repeat across a range of pages. See “Spanning image & table boxes across pages” for details.

Exporting a table
1. Select the table tool from the toolbar (or press “t”) and drag a box around the desired table.
2. Choose your output options from the Table Box dialogue. Nothing is output at this point, but Gemini will remember the settings you make for this table when it comes time to export. If Gemini has previously exported this table with incorrect formatting, try checking the “Ignore borders…” check-box and exporting it again.
3. Press OK. If you wish to export the table as part of the entire page, you can stop now and export the page as normal using the “Export Pages…” command. Gemini will remember the table box you just added and format the tabulated data appropriately within the page.
4. Select the Hand Tool from the toolbar.
5. Click the mouse within the table you just marked.
6. Choose your output format and destination for output. Note that the SYLK output format is only available when exporting tables this way. It is not available when exporting tables as part of an “Export Pages…” operation.

Spanning image & table boxes across pages
In documents such as financial reports or invoices, tables may appear at exactly the same position across a range of pages. Instead of marking tables on each page by hand, Gemini can tag all the tables at once using the same Table Box.

Just as when extracting a single table, use the table tool to select the desired table on one page only. Use the page range options to specify the scope of the box just drawn:

Current Page: The box will only cover the current page.

Current Page to End: The box will affect every page from this one until the end of the document.

Current Page to Page …: The box will affect every page from this one up to the specified page number.

When you next make use of the “Export-->Pages…” command, Gemini will retain table layout for each table found within the page range, despite the fact that only one box was actually created.

If you click on a marked table using the hand tool, Gemini will output the table and all others in the page range. They will all be output to a single file.
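
The three page range options can be pictured as a small calculation. The sketch below is only an illustration, not Gemini's own code: the function name, the scope labels and the assumption that the specified end page is included in the range are all hypothetical.

```python
def spanned_pages(current_page, page_count, scope, end_page=None):
    """Return the page numbers a single Table Box would cover.

    scope is a hypothetical label for the three dialogue options:
    "current", "to_end" or "to_page". Pages are numbered from 1.
    """
    if scope == "current":
        last = current_page
    elif scope == "to_end":
        last = page_count
    elif scope == "to_page":
        if end_page is None:
            raise ValueError("end_page is required for the 'to_page' scope")
        # Assumption: the end page is inclusive and clamped to the document.
        last = min(end_page, page_count)
    else:
        raise ValueError(f"unknown scope: {scope}")
    return list(range(current_page, last + 1))


print(spanned_pages(3, 10, "to_page", end_page=5))
```

A box drawn on page 3 of a ten-page report with "Current Page to Page 5" would, under these assumptions, apply to pages 3, 4 and 5.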

Exporting Images & Vector Artwork
Gemini can automatically locate and export any embedded images within a PDF. It cannot, however, export vector artwork directly (except as EPS) but must first convert it into an image format such as JPEG. This process is called ‘rendering’.

Although the program tries to automatically identify areas containing vector artwork such as graphs and charts, it does not always succeed. In these cases it is advisable to mark up the artwork using the image tool.

The following steps illustrate the process of converting part of a page into an image. You do not need to follow these steps if you wish to export an embedded image since Gemini will do this itself during “Export Pages…”.

Exporting charts and graphs
1. Select the image tool from the tool bar (or press “i”) and draw a box around the part of the page.
2. Choose your output options from the Image Box dialogue.
3. Press ‘Export’ to render the image now. Otherwise, the image will be rendered and exported next time the page is exported.

Including text in rendered images
If you find the text is missing from rendered diagrams, please check the preferences dialogue box for the “Render text in artwork” checkbox. This governs whether text appears in the rendered output or in the text stream (such as the HTML file).

Converting whole pages into images
To convert a range of pages into images including all text and graphics, check the “Convert Each Page into an Image” check box in the “General Preferences” pane. See “Convert each page into an image”.

Removing Page Headers & Footers
When extracting content it can be useful to ignore certain areas of the page such as headers and footers. This can be achieved by applying a crop region to each page.

1. Select the Crop Tool (or press “c”) and draw a rectangle around the part of the page you wish to keep.
2. Choose the number of pages you wish to crop from the crop dialogue box.
3. Export the range of pages as normal. Only text and images within the cropped region will be output.

You can adjust the crop box at any time by selecting the crop tool then dragging or reshaping it as required.

To delete a crop box, press the delete key while the crop tool is selected.

Setting Preferences
Choose “Gemini Preferences…” from the main menu to show the preferences dialogue. Alternatively, the dialogue box can also be accessed from any export dialogue box by pressing the “Options…” button.

The dialogue is divided into "tabs", each dealing with a different area of configuration. The tabs are General, Text, Image and HTML, and each is described in turn below.

General Preferences Tab
The “General” preferences tab allows the main text and image output options to be chosen without having to switch to other panels. It acts as a short cut for changing the major features of output, keeping the more esoteric settings hidden in other tabs.

• Text output format
Gemini supports a varied range of text output formats including HTML, text and RTF for use with Microsoft Word and other applications.

Gemini can also export table data in SYLK format. However this is only available when clicking on a table that has been marked-up with a “Table Box” annotation. See “Exporting Tables” for more information.

For an explanation of each output format and how they behave in layout and non-layout modes, see “Text Output Formats Explained”.

• Try to preserve layout
In all text formats except Palm Doc, this option forces Gemini to try to preserve the original page layout. The way in which this is achieved depends upon the output format used.
• Plain text with the layout retained uses spaces to reflect the original layout. Output should be viewed without line wrapping in a mono-spaced font (such as Courier). Most output text documents will be very wide.
• Simple HTML with retained layout will be formatted as with plain text using letter spacing to position page elements.
• HTML 3 and 4 use tables to position page elements as closely as possible to the original and will all produce a similar layout to that found in the original.
• HTML 4 CSS uses the absolute positioning functionality available with CSS2. This may not be compatible with some older browsers.
• RTF output places columns of text into floating boxes on the page. The text within the boxes may be edited and reflowed but not across boxes. Occasionally the width of the boxes may need to be adjusted in order to cater for different widths of substitute fonts.

With all output formats there may be some deviation from the original document layout requiring a degree of manual adjustment.

• Create file for every page / article
Selecting this option will cause Gemini to output each page to a separate file. When outputting to HTML or eBook formats, the output files will be hyperlinked together.

When outputting Articles, Gemini will place each article in a separate file.

This option is disabled for HTML 4 CSS + retain layout, since every page has to be written to a separate file in this instance.

• Treat as Hebrew
Use this option for documents containing text written right-to-left (such as Hebrew). Check it when you wish to output logical Hebrew in HTML or plain text modes. Do not check it if you are exporting with layout retained.

Only the HTML output formats will include the special tags necessary to help browsers deal with right-to-left text. RTF and other formats will contain no special instructions and thus may not work correctly.

• Image Format Menu
Gemini supports a rich range of image formats:
• JPEG,
• JPEG (progressive),
• Tiff,
• Tiff (multi-page),
• BMP
• PNG
• EPS (with clipping paths)

Each of these formats has different properties and is suitable for a different purpose.

• Convert each page into an image
The simplest way to convert an entire document into a set of images is to check this box. When you next export pages, Gemini will render the contents of each page at the resolution/size specified in the “Image Preferences” tab. Using this approach you can convert pages into high-resolution renders or thumbnails as required.

When this option is selected, text output is disabled and the “Render text in artwork” check box in the “Text Preferences” panel is also disabled.

The “Tiff (multipage)” image format can only be used when this option is chosen.

When combined with the “EPS” image format, pages are not rendered into images but instead converted to PostScript. These may then be rendered by a PostScript interpreter.

You can control the colour format used for rendering pages; see “Colour depth to use when rendering” for details.

• Place images in sub-folder
To store all images in a subfolder called “images”, check this option. This is useful for keeping images separate from text output when using the HTML formats, for example.

Text Preferences Tab
Use the settings on this panel to control the way text is output in all formats. Additional settings which affect only HTML output are grouped separately in the HTML settings panel.

• Identify image captions
Selecting this option will cause Gemini to identify image captions and output them along with images, rather than in the main body of text. If image output is selected and the image format supports it (Tiff, JPEG, PNG), captions will also be embedded within the image data.

• Render text in artwork
Text within a graphic (such as in a graph or a chart) will be included with the graphic when this option is checked. If you would prefer to have the text included as plain text in the output rather than in the image, uncheck this option.

This option is disabled when “Convert each page into an image” is checked in the “General Preferences” pane.

If you find that some text you expected to see in your output is missing, this option may be enabled and discarding text from areas which Gemini is treating as an image.

• Recognize “speech marks”
When "Preserve Line Breaks" is off, Gemini attempts to reflow text into paragraphs. The process is fairly reliable but may occasionally make mistakes.

However, when it comes to reported speech of the kind that may be found during a conversation in a novel, it is vital that the correct line-breaks are retained; in speech, the line-break is the reader's main indication of a change in speaker.

When this option is on, Gemini pays particular attention to quotation marks, especially those at the beginning of a line or paragraph. The result is that it is much more successful in retaining these important line breaks.

When processing documents unlikely to contain any reported speech, it is best to disable this option.

• Preserve line breaks
Checking this box will ensure that Gemini honors all line breaks in the original document. Furthermore, it will stop Gemini from removing hyphenation.

Enabling this option can make editing the output more difficult since lines of text will not re-flow after insertions or deletions. However, it may improve the appearance if layout is important.
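
To make the difference between the two modes concrete, here is a minimal sketch of reflowing with de-hyphenation. It is not Gemini's algorithm: the function name and the rule "a trailing hyphen always marks a broken word" are simplifying assumptions, and a real implementation would need to distinguish genuine hyphens (as in "well-known") from line-end hyphenation.

```python
def reflow(lines, preserve_line_breaks=False):
    """Join lines into paragraphs, or keep them as-is when preserving breaks.

    Hypothetical helper: blank lines are taken as paragraph boundaries and a
    trailing "-" is assumed to be line-end hyphenation.
    """
    if preserve_line_breaks:
        # Line breaks and hyphenation are left exactly as found.
        return "\n".join(lines)

    paragraphs, current = [], ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Re-join a word that was hyphenated across the line break.
            current = current[:-1] + stripped
        elif current:
            current += " " + stripped
        else:
            current = stripped
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


sample = ["Gemini re-", "flows text", "", "Next paragraph"]
print(reflow(sample))
print(reflow(sample, preserve_line_breaks=True))
```

With line breaks preserved the first two lines stay separate and keep their hyphen; with reflow they become a single editable paragraph.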

• Page Breaks
When enabled, Gemini will indicate page boundaries in the output. This may be done in a number of different ways depending upon the output format selected.

For example, in HTML the horizontal rule <HR> command is output between pages. In RTF, an RTF page-break command is output.

Use the text box next to this option to define your own page break text. Optional escape sequences include:

\n for new line,

\t for tab,

\000 for an octal code (three digits)

\x for a hexadecimal code (up to four digits).

Leave the box empty if you wish to use Gemini’s default page break.

When off, no indication of page boundaries is output. This may be the preferred mode when processing a novel for example.
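
As an illustration of how the escape sequences listed above might be expanded, the following sketch turns page break text into the characters it stands for. The function name and the exact parsing rules are assumptions; in particular, it reads as many hexadecimal digits as are available after \x (up to four), which may differ from Gemini's behaviour.

```python
import re

# Matches \n, \t, a three-digit octal code, or \x with one to four hex digits.
_ESCAPE = re.compile(r"\\(n|t|[0-7]{3}|x[0-9A-Fa-f]{1,4})")


def expand_page_break(text):
    """Expand the escape sequences in a page break string (hypothetical helper)."""

    def replace(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        if token == "t":
            return "\t"
        if token[0] == "x":
            return chr(int(token[1:], 16))  # hexadecimal character code
        return chr(int(token, 8))           # three-digit octal character code

    # Anything that is not a recognised sequence is left untouched.
    return _ESCAPE.sub(replace, text)


print(repr(expand_page_break(r"\n--- page break ---\n")))
```

For example, the text \014 (octal) or \x0C (hexadecimal) would both expand to the form feed character that many printers and editors treat as a page break.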

• Hyperlinks
If Gemini detects any embedded hypertext links within a PDF, it will try to retain these in the output. This is not possible for plain-text or Palm Doc output.

• Bookmarks
When enabled, Gemini will retain a document's bookmarks as hypertext destinations in all but plain text or Palm Doc output.

Gemini can also synthesize bookmarks even if a document has none of its own. This automatic generation uses simple rules relating to the text size and font to determine which parts of a document to use as bookmarks and which should be treated as plain text.

This feature can save a large amount of time which would normally have to be spent adding bookmarks to a file using Adobe Acrobat's bookmark mechanism.

When disabled, no bookmarks, synthetic or otherwise, will be output.
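
The manual does not spell out the rules used to synthesize bookmarks, so the sketch below only illustrates the general idea of a size-based rule. Everything in it is an assumption: the function name, the input shape and the threshold of 1.2 times the most common font size are hypothetical and are not taken from Gemini.

```python
from collections import Counter


def synthesize_bookmarks(lines, ratio=1.2):
    """Pick lines that look like headings, based on font size alone.

    lines is a list of (text, font_size) pairs. The most common size is
    assumed to be body text; anything at least `ratio` times larger is
    treated as a bookmark candidate.
    """
    if not lines:
        return []
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [text for text, size in lines if size >= body_size * ratio]


page = [
    ("Introduction", 18),
    ("Gemini is a stand-alone application.", 10),
    ("It extracts text and images.", 10),
    ("Obtaining updates", 14),
    ("Please visit the web site.", 10),
]
print(synthesize_bookmarks(page))
```

A rule this simple will misfire on documents that use large type for pull quotes or captions, which is one reason automatically generated bookmarks may need checking.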

• Use UTF-8 Encoding
UTF-8 is a standard means of representing the full range of Unicode characters. UTF-8 encoded text can safely be sent via email without fear of corruption and is becoming more common within web pages.

Some web browsers or text editors will only handle non-ASCII characters if encoded as UTF-8. If some characters fail to appear correctly in your output, then it may be worth trying this option.
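
The effect of UTF-8 encoding can be seen with a few lines of Python. This is a general illustration of the encoding, not of Gemini's output code; the sample string is arbitrary.

```python
text = "Café 10 €"

# ASCII characters take one byte each; accented letters and symbols take more.
encoded = text.encode("utf-8")
print(len(text), "characters ->", len(encoded), "bytes")

# Decoding with the same encoding restores the original text exactly.
assert encoded.decode("utf-8") == text

# Plain ASCII text is byte-for-byte identical in UTF-8.
assert "Gemini".encode("utf-8") == b"Gemini"
```

If a viewer decodes these bytes with a different encoding, the multi-byte characters are the ones that appear corrupted, which is the usual symptom of an encoding mismatch.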

Image Preferences Tab
Gemini can deal with two types of image in PDF:

1. Graphs and charts consisting of lines, curves and filled areas
2. Photographs or scanned images made up of many colored pixels

Unlike photos, vector illustrations do not suffer from any degradation when scaled or zoomed – they do not become blocky as you zoom-in. Gemini can only output vector formats directly as EPS and will convert (render) them into images for all other output formats.

• Auto-identify vector artwork
Many PDF documents may contain vector artwork or line-art. However, due to the nature of PDF, it is not always possible for a computer program to determine where on a page such artwork occurs since each page is stored as a general mix of text, images and line-art.

Gemini is capable, to an extent, of identifying such artwork on a page. Selecting this option will cause it to automatically convert vector artwork in the original into images in the output format selected.

If you are not sure whether a document contains vector artwork or bitmaps, use the zoom tool to zoom in on a picture. If the picture becomes blocky or pixelated then it is a bitmap image; if it stays sharp with detail, it is probably a vector-based drawing.

If you encounter problems with illustrations not being correctly identified, mark them up specifically as images using the “Image Box” tool. See “Exporting Images & Vector Artwork”.

• Use original OPI name if available
If the image being output has an OPI dictionary associated with it, the program will look for the image’s original file name in that dictionary. If found, the image will be exported with that name.

When searching for the name, the program makes use of the values of either the FileSpec sub-dictionary or the ID field. One or both of these may be present and the program will make a judgment as to which is used.

To view all of the OPI information related to a particular image, press the right mouse button over the image in question and choose “Image properties…” from the menu that appears.

Image Scaling
This set of options allows you to set the output scaling of the images that Gemini extracts from a PDF. When disabled, images are output as they are found in the PDF. This may give very large images since many PDF documents contain high-resolution versions of an image which are then scaled by the PDF viewer to the correct size.

For web use, it is often preferable to scale images to around 72 dots per inch so that people with slow internet connections do not have to wait too long for an image to download.

The following set of options offers the ability to automatically scale images in this way.

• Fix image resolution at:
This option scales output images so that they are all at the same resolution in dpi (dots per inch). Resolutions available are: 72, 100, 150, 300, 400, 600, 720, 800 and 1200 dpi.

This setting is particularly important when rendering vector artwork.
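
The arithmetic behind a fixed resolution is straightforward: pixels equal inches multiplied by dots per inch. The sketch below shows this relationship; the function names are hypothetical and the rounding rule is an assumption.

```python
def effective_dpi(pixel_width, placed_width_in):
    """Resolution of an image as placed on the page, in dots per inch."""
    return pixel_width / placed_width_in


def pixels_at_resolution(placed_width_in, placed_height_in, target_dpi):
    """Pixel dimensions needed to show the same area at target_dpi."""
    return (round(placed_width_in * target_dpi),
            round(placed_height_in * target_dpi))


# A 1200 x 900 pixel photo placed 4 x 3 inches on the page is stored at 300 dpi.
print(effective_dpi(1200, 4.0))
# Fixing the resolution at 72 dpi reduces it to 288 x 216 pixels.
print(pixels_at_resolution(4.0, 3.0, 72))
```

The same calculation explains why the setting matters for vector artwork: a chart has no pixels of its own, so the chosen resolution alone determines how large the rendered image will be.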

• Advanced Scaling Pop-up
Three types of advanced scaling can be selected from the drop down menu. For each kind of scaling, if either width or height is set to zero, then an image's aspect ratio is maintained.

Set image size: enables exact image sizes to be specified in pixels. Images will be sub-sampled or expanded accordingly.

Maximum size: enables a maximum image size to be specified. If an image is smaller than the width and height values given then the scaling is unchanged. When Max scaling occurs, the aspect ratio is maintained. So, for example, an image which is 800 by 400 when output at a Max of 200 by 200 will be rendered as 200 by 100; the shape of the image is maintained.

Scale: will resize images according to a percentage value.
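
The three scaling modes can be sketched as follows. The function names and rounding are assumptions made for illustration, as is the handling of a zero in both dimensions; only the behaviour described above (aspect ratio kept when one dimension is zero, and when maximum scaling occurs) is taken from the manual.

```python
def set_size(width, height, target_width, target_height):
    """'Set image size': a zero dimension is derived from the aspect ratio."""
    if target_width == 0 and target_height == 0:
        return width, height  # assumption: nothing specified, leave unchanged
    if target_width == 0:
        return max(1, round(width * target_height / height)), target_height
    if target_height == 0:
        return target_width, max(1, round(height * target_width / width))
    return target_width, target_height


def maximum_size(width, height, max_width, max_height):
    """'Maximum size': shrink to fit, keeping the shape; never enlarge."""
    if width <= max_width and height <= max_height:
        return width, height
    factor = min(max_width / width, max_height / height)
    return max(1, round(width * factor)), max(1, round(height * factor))


def scale(width, height, percent):
    """'Scale': resize by a percentage value."""
    return max(1, round(width * percent / 100)), max(1, round(height * percent / 100))


print(maximum_size(800, 400, 200, 200))  # the example from the text
print(set_size(800, 400, 200, 0))
print(scale(800, 400, 50))
```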

• Colour depth to use when rendering
Use this pop-up to control the colour depth for rendered areas. This includes areas marked-up with an “Image Box” annotation and entire pages when “Convert each page into an image” is selected.

The default depth is RGB – the same depth as is displayed on screen.

Bitmap: 1 bit per pixel, black and white output, 50% threshold. Useful only for text and line-art. Use with PNG, BMP or Tiff output formats. Typically requires rendering resolutions of 300dpi and above.

Dithered: Same as bitmap but colour tones are dithered. Useful for photographs and colour artwork. Typically requires rendering resolutions of 300dpi and above.

Greyscale: 8 bits per pixel, greyscale. Use with any output format.

RGB: 24 bits per pixel, default. Use with any output format.

CMYK: 32 bits per pixel. Use with Tiff output format.

If you select a colour depth that is not supported by the currently selected image output format, the software will change the depth to something more appropriate.

For example, selecting CMYK when JPEG is the chosen format, will cause rendering to proceed in RGB mode.
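
The fallback behaviour can be pictured as a lookup of which depths each format accepts. The sketch below is an assumption-laden illustration: the manual only confirms that CMYK with JPEG falls back to RGB, so falling back to RGB in every other unsupported case, and treating Dithered like Bitmap, are guesses.

```python
# Depths that the manual restricts to particular formats. Depths not listed
# here (Greyscale, RGB) are described as usable with any output format.
RESTRICTED_DEPTHS = {
    "Bitmap": {"PNG", "BMP", "Tiff"},
    "Dithered": {"PNG", "BMP", "Tiff"},  # assumption: same formats as Bitmap
    "CMYK": {"Tiff"},
}


def effective_depth(depth, image_format):
    """Return the depth that would actually be used for rendering."""
    allowed = RESTRICTED_DEPTHS.get(depth)
    if allowed is None or image_format in allowed:
        return depth
    # Assumption: RGB, the default depth, is the replacement in all cases.
    return "RGB"


print(effective_depth("CMYK", "JPEG"))
print(effective_depth("CMYK", "Tiff"))
```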

HTML Preferences
This tab contains options for tailoring the way Gemini produces HTML output.

• Background
This group of options allows a background color or image to be specified for each page output in HTML.

Whatever is entered in the “Color:” box is included in the BGCOLOR attribute of the <BODY> tag of each page output. Hence, the color could be, for example, a word such as "yellow" or a color definition such as "#FFFFFF".

The “Image:” box is used to enter the filename of an image. This name is placed into the “BACKGROUND” attribute of the <BODY> tag for each page. Be careful about the name entered here since it will be included exactly as typed.

The image name need not refer to an actual image on your hard disc but may, for example, refer to an image which is or will be stored on the computer used to host the web pages once they are complete.

• Header & Footer
The header and footer file boxes are used to specify files on disc whose contents will be merged into each page of HTML output by Gemini.

For example you may wish to add your company's own corporate graphics to the top and bottom of each page output.

The contents of the header file will be output just after the <BODY> tag of a page but before any extracted text or images. Similarly, the contents of the footer file are output just before the closing </BODY> tag of each page.

You cannot type into the text boxes directly so instead use the Browse... buttons to select existing files.

To stop using either header or footer file, un-check the check boxes next to each one.
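
The placement of the header, footer and background settings within a page can be sketched as below. This is an illustration of the structure described above, not Gemini's generator: the function name is hypothetical, and BGCOLOR is used because it is the standard HTML attribute for a body background colour, which may differ from what Gemini actually writes.

```python
from html import escape


def build_page(content, header="", footer="", color="", image=""):
    """Assemble one HTML page with optional header, footer and background."""
    attributes = ""
    if color:
        attributes += f' BGCOLOR="{escape(color)}"'
    if image:
        attributes += f' BACKGROUND="{escape(image)}"'
    return (
        "<HTML>\n"
        f"<BODY{attributes}>\n"
        f"{header}\n"    # header file contents: just after <BODY>
        f"{content}\n"   # extracted text and images
        f"{footer}\n"    # footer file contents: just before </BODY>
        "</BODY>\n"
        "</HTML>\n"
    )


page = build_page(
    "<P>Extracted text</P>",
    header="<P>Company banner</P>",
    footer="<P>Copyright notice</P>",
    color="yellow",
)
print(page)
```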

• Page/Article Navigation
If “Create file for every page/article” is enabled, each page of a document is stored in a new file and navigation links to the previous, first and next pages are placed on each page for convenience.

The controls in this group allow the appearance of these navigation links to be changed. Whatever is entered into either of the Previous, First or Next boxes will be output instead of the words “Previous”, “First” and “Next”.

This facility may be useful when converting documents for a non-English speaking audience or if navigation images are required rather than text.
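
As a rough picture of what customised navigation looks like, the sketch below builds the three links for one page. The file naming scheme (page1.html, page2.html and so on), the separator and the function name are all hypothetical; only the idea that the entered text replaces the default words comes from the description above.

```python
def navigation_links(page, page_count, previous="Previous", first="First", next_="Next"):
    """Build navigation links for one page of a multi-file export."""
    links = []
    if page > 1:
        links.append(f'<A HREF="page{page - 1}.html">{previous}</A>')
    links.append(f'<A HREF="page1.html">{first}</A>')
    if page < page_count:
        links.append(f'<A HREF="page{page + 1}.html">{next_}</A>')
    return " | ".join(links)


# Translated labels for a French-speaking audience.
print(navigation_links(2, 3, previous="Précédent", first="Début", next_="Suivant"))

# Because the text is inserted as typed, an image tag can be used instead.
print(navigation_links(1, 3, next_='<IMG SRC="next.gif" ALT="Next">'))
```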

• Image Borders
When enabled, images will have a border around them generated using the 'border' attribute of the HTML <IMG> tag.

• Frames
This setting only takes effect when bookmark output is enabled.

When enabled, Gemini creates a two-frame frameset definition which places the bookmarks frame on the left hand side of the screen and the extracted content on the right hand side. Clicking on a bookmark changes the page in the right hand side.

When disabled, no frameset is created and bookmark destinations do not reference a target frame.

Text Output Formats Explained
Each of the eight styles available is described below.

Plain Text Format
ASCII text, readable with any text editor or word processor.

When layout is retained using this format, spaces are inserted to ensure words and paragraphs are placed into the right location.
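
The idea of positioning text with spaces can be sketched by mapping page coordinates onto a grid of character cells. This is not Gemini's layout engine: the function name, the fixed cell size and the assumption that coordinates are measured in points from the top-left corner are all hypothetical.

```python
def layout_as_text(words, char_width=6.0, line_height=12.0):
    """Place words on a monospaced grid using spaces.

    words is a list of (x, y, text) tuples, with x and y in points
    measured from the top-left corner of the page (an assumption).
    """
    if not words:
        return ""
    rows = {}
    for x, y, text in words:
        row = round(y / line_height)
        column = round(x / char_width)
        rows.setdefault(row, []).append((column, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for column, text in sorted(rows.get(row, [])):
            # Never overwrite earlier text; keep at least one space between words.
            if line and column <= len(line):
                column = len(line) + 1
            line = line.ljust(column) + text
        lines.append(line)
    return "\n".join(lines)


table = [(0, 0, "Name"), (120, 0, "Total"), (0, 12, "Widget"), (120, 12, "42")]
print(layout_as_text(table))
```

The output only lines up when viewed in a mono-spaced font, and wide pages produce long lines, since every horizontal position has to be represented by a run of spaces.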

RTF Format
Microsoft's Rich Text Format, readable by virtually every word processor. An ideal input format for Microsoft Office. Gemini can embed all document images in a single RTF file.

When exporting to RTF with Retain Layout, you may encounter difficulties if your original document's page size is larger than 55cm (width or height). Some versions of Microsoft Word cannot deal with such large page sizes.

To export documents written in Japanese or other East Asian languages as RTF (for import into MS Word for example) do not select UTF-8 encoding.

Simple HTML Format
HTML using only a few simple tags. This is readable by all web browsers, but some formatting may not come out correctly. Images are linked from the document rather than shown inline. This format is intended for use with PDAs.

A list of document bookmarks will be output at the top of the first page. In File-Per-Page mode a cover page, a bookmarks page and individual contents pages will be produced, each hyperlinked together.

When retaining layout, this format inserts spaces making use of the <PRE> tag in HTML to force the browser to take notice of every space character.

HTML 3.2 Format
More complex HTML allowing for a wider range of styling. Inline images are used. Meta-tags are added to the document showing information on the creator, author and title. Different character sizes are rendered using the tags <H1>, <H2> etc.

If “Create file for every page/article” is selected then a bookmark page (if bookmark output is enabled) and several data pages will be output. When frames are enabled the index page uses frames to show the bookmarks on the left while viewing each page on the right. See “HTML Preferences” for more details.

When retaining layout this format uses a complex HTML table for each page. Font sizes in layout mode are selected using the FONT SIZE+/- technique rather than <H1>, <H2> etc. since this gives better results.

HTML 4 Format
As HTML 3, but the <FONT> tag is used to set the font face and size of text. This should be compatible with most modern web browsers.

When retaining layout this format uses a complex HTML table in the same way as HTML 3.2 with the addition of the <FONT> tag.

HTML 4 CSS Format
Using HTML with Cascading Style Sheets (v2) provides the most accurate depiction of PDF content for the web. When preserving layout all fonts and positioning are closely replicated.

When using this output format and preserving layout, a separate CSS file is produced which describes the styles of all fonts used. Edit this CSS style sheet if any adjustments need to be made to the look of the converted document.

Open eBook Format
Open eBook format is a freely available, open-standard mark-up format for defining electronic books (eBooks). The exact specification of the format can be obtained online from www.openebook.org.

Gemini can convert a PDF into Open eBook format with the additional files needed for a complete eBook definition. The output from Gemini will then need to be compiled into a self-contained eBook.

Once an eBook has been compiled, it can only be read using the eBook Reader designed for the particular compiler.

Palm Doc Format
Compatible with most eBook readers running on Palm Computing platforms such as Palm, Handspring and Sony Clie PDA devices.

The Doc format is a very basic, text-only format. It does not support text styling.

The Doc title (as shown by your Doc reader program) will be the document’s title as defined in the PDF, or its filename if it has no title. When exporting articles to individual files, each article’s title is used instead of the document title.
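
The title rule above can be summarised in a short Python sketch. The function and parameter names are hypothetical, and the sketch assumes the full filename (including its extension) is used as the fallback; the manual does not say whether Gemini strips the extension.

```python
from pathlib import Path
from typing import Optional


def doc_title(pdf_title: Optional[str],
              pdf_path: str,
              article_title: Optional[str] = None) -> str:
    """Pick the title shown by a Doc reader, following the rule in this manual.

    1. When exporting an article to its own file, use the article's title.
    2. Otherwise use the title stored in the PDF, if there is one.
    3. Otherwise fall back to the PDF's filename (assumed: with extension).
    """
    if article_title:
        return article_title
    if pdf_title and pdf_title.strip():
        return pdf_title.strip()
    return Path(pdf_path).name


print(doc_title("Annual Report", "reports/annual.pdf"))
print(doc_title(None, "reports/annual.pdf"))
print(doc_title("Annual Report", "reports/annual.pdf", article_title="Chairman's Letter"))
```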

Gemini will create one or more “.pdb” files that will need to be uploaded to a Palm device. There are many eBook readers available for Palm devices, some of which are free to download such as CSpotRun available from http://32768.com/cspotrun

(back to the table of contents)

Image Formats Explained
JPEG Format
Supports Greyscale and RGB colour formats with compression. Suitable for use in web pages.

When progressive JPEGs are viewed they appear to increase in quality during loading. This may be preferable for web publishing, as the image appears immediately in low quality and becomes sharper as it downloads. However, some older image viewers do not support progressive JPEG, so Gemini also offers baseline JPEG as an alternative.

When exporting black & white bitmap images, Gemini will convert the image into grayscale (JPEG does not support bitmaps). This will increase the size of the output file. If you intend to export many bitmap images, PNG or Tiff may prove more suitable.
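
To see why the conversion increases file size, compare the amount of raw image data before compression. A bitmap stores 1 bit per pixel, while grayscale stores 8 bits per pixel. The Python sketch below is simple arithmetic and does not model JPEG compression, so actual file sizes will differ; the page dimensions are an assumed example (an A4 page scanned at 300 dpi).

```python
def raw_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed image size, with each row padded to a whole byte."""
    row_bytes = (width * bits_per_pixel + 7) // 8
    return row_bytes * height


# Assumed example: an A4 page at 300 dpi is roughly 2480 x 3508 pixels.
width, height = 2480, 3508

bitmap = raw_size_bytes(width, height, 1)     # black & white, 1 bit per pixel
grayscale = raw_size_bytes(width, height, 8)  # grayscale, 8 bits per pixel

print(bitmap)               # 1087480
print(grayscale)            # 8699840
print(grayscale // bitmap)  # 8
```

The grayscale version starts with eight times as much data to compress, which is why formats that store bitmaps directly are usually the better choice for this kind of image.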

Tiff Format
Supports bitmap, Greyscale, RGB and CMYK colour formats all with Fax or RLE/Packbits compression.

Tiff formats support the CMYK colour space which may give improved color fidelity when converting magazine and newspaper pictures if originally stored as CMYK within the PDF.

Tiff (multipage)
As Tiff but all images are placed into a single file known as a multi-page tiff file. Your image viewer will need to be able to cope with Multi-page tiffs in order to view the individual images.

This format only operates when rendering complete pages; otherwise Gemini defaults to normal Tiff output.

BMP
Supports Bitmap, Greyscale and RGB colour formats with no compression.

PNG
Supports Bitmap, Grayscale and RGB colour formats with compression. Suitable for use in web pages.

EPS
For maximum colour fidelity, use EPS, which also has the advantage of preserving any clipping paths applied to images. The EPS files produced are compatible with Adobe Photoshop, InDesign and QuarkXPress.

When rendering, this format does not output an image but instead the PostScript language commands needed to draw the page.

(back to the table of contents)

Frequently Asked Questions

1. Why are some characters not being translated properly?

2. Why are some of the images in my document not extracted?

3. Why don't the tables come out properly when I export my documents?

4. Why do I get nothing but gibberish text when I export?

5. Why do I get no text output at all when I process a page?

6. Why don't the columns of text come out in the right order?

7. Why do the lists in my document lose all formatting when I export?

8. Why do some paragraphs overlap when I output pages using HTML 4 CSS with layout?

9. Why are all the Kanji characters missing from my East Asian documents?

10. Why does the text display look so crude?

11. How do I change the font and/or size of plain text output?

12. The page breaks do not appear to be at the correct location in my plain text or RTF output. Why?

13. Some accented characters are wrong or missing.

14. Some special symbols such as copyright or registered do not appear in the HTML output.

1. Why are some characters not being translated properly?

Gemini attempts to map all special characters in a font into the equivalent or alternate character (if the output format does not support the exact character).

However, some fonts may not be encoded correctly within a PDF document. This is usually caused by the application generating the PDF. In these cases Gemini may be unable to determine what the correct translation should be and output the wrong character.

Sometimes the only remedy is to use a different font or, if using Acrobat Distiller, to ensure that the “Embed Font Subsets” option is not selected. (back to the faq)

2. Why are some of the images in my document not extracted?

Some images may be constructed from lines, circles and boxes (so-called "vector artwork") instead of being photos or scans.

Although Gemini tries to identify the boundaries of this kind of artwork, it may sometimes make a mistake.

To ensure vector artwork is extracted, place an “Image Box” around the artwork prior to extraction. See “Exporting Images & Vector Artwork” for details. (back to the faq)

3. Why don't the tables come out properly when I export my documents?

Gemini is unable to automatically identify areas of a page such as tables which need special formatting.

Instead you need to indicate the location of tables by placing a “Table Box” around each one. Once this is done, Gemini should be able to preserve the layout of the table in any output format you select. See “Exporting Tables” for details. (back to the faq)

4. Why do I get nothing but gibberish text when I export?

Some documents use a special type of font called a “Type 3” font. Although text written using these fonts is readable on screen, it is not always stored within the PDF in a way that can be understood by Gemini or any other program. This is a common problem with PDF.

If you have such a document you may be able to re-distill it with different settings to try to coax it into exporting correctly. However, this is not always possible and it may be necessary to return to the original document from which the PDF was produced.

Generally if the document has been constructed in such a way as to break the link between the way characters look on screen and their meaning, Gemini will have problems in exporting anything but gibberish. (back to the faq)

5. Why do I get no text output at all when I process a page?

Providing that the “Text Output” option has been checked, it may be that the text you can read on the page is not made from individual letters but an image. This may be because the page has been imported from a scanner.

In order to extract text from a scanned page or image, you will need to process the page using an optical character recognition program (OCR) such as Adobe's Capture program or TextBridge from Xerox.

Another cause may be that Gemini has decided that the entire page represents vector artwork and is discarding all the text. For this to happen, the page has to contain some kind of drawing – perhaps a border. If this is the cause, make sure that both “Auto Identify Vector Artwork” and “Render Text In Artwork” are off. (back to the faq)

6. Why don't the columns of text come out in the right order?

Gemini tries to detect the correct reading order for a page of text but, especially when there are columns of text, it does not always guess correctly. One way of ensuring that the order is correct is to use Article Threads.

By drawing article threads around the columns of text you wish to extract, you can explicitly dictate the order in which they should be extracted.

If you then extract text using either page-by-page or article modes, Gemini will reorder the text using the order in which the threads were drawn. (back to the faq)
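
The effect of drawing threads can be pictured as a simple sort. The Python sketch below is illustrative only: the TextBlock type and its thread_order field are hypothetical and say nothing about how Gemini stores article threads internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """A block of text on the page (hypothetical structure for illustration)."""
    text: str
    thread_order: int  # position in which its article thread box was drawn


def in_thread_order(blocks: List[TextBlock]) -> List[str]:
    """Return the block texts in the order their threads were drawn."""
    return [block.text for block in sorted(blocks, key=lambda b: b.thread_order)]


# Blocks listed as they might be found on the page, not in reading order.
page = [
    TextBlock("Column 2 text", thread_order=2),
    TextBlock("Headline", thread_order=0),
    TextBlock("Column 1 text", thread_order=1),
]
print(in_thread_order(page))
```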

7. Why do the lists in my document lose all formatting when I export?

Gemini contains very simple list detection. It looks for paragraphs which start with a bullet or a number such as 1. or a. etc.

If your lists do not look like normal lists, or if they use an unusual start character such as a hyphen (-) instead of a bullet, Gemini will not recognize them as such and will tend to merge each line into a single large paragraph.
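
A detection rule of this kind can be approximated with a regular expression. The Python sketch below is an assumption about what such a rule might look like, not Gemini's actual implementation; the exact set of bullet characters and numbering styles Gemini accepts is not documented here.

```python
import re

# Assumed pattern: a bullet character, a number followed by a full stop,
# or a single letter followed by a full stop, then some whitespace.
LIST_ITEM = re.compile(r"^\s*(?:[•◦▪]|\d+\.|[A-Za-z]\.)\s+")


def looks_like_list_item(paragraph: str) -> bool:
    """Return True if the paragraph starts like a bulleted or numbered item."""
    return LIST_ITEM.match(paragraph) is not None


for sample in ["• First point", "1. First point", "a. First point",
               "- First point", "An ordinary paragraph."]:
    print(sample, looks_like_list_item(sample))
```

Under this rule the hyphenated line is treated as ordinary text, which matches the behaviour described above.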

It may help to select “Preserve Line Breaks” or to treat the list as a table by drawing a "Table Box" around it. (back to the faq)

8. Why do some paragraphs overlap when I output pages using HTML 4 CSS with layout?

When preserving layout in HTML 4 CSS, Gemini uses the exact positioning feature of CSS2 (Cascading Style Sheets) to mimic the original page. However, this will only ever be perfect if all of the fonts used in the original page are available when browsing the HTML version.

If certain fonts are not available, the web browser will substitute others as required. This may cause problems with layout since each font has different spacing characteristics.

Also, Gemini does not preserve letter and word spacing (tracking). If a line of text is written with condensed or expanded letters or spacing, it may look different when output.

These problems can easily be solved by editing the associated style sheet file or the HTML pages themselves. (back to the faq)

9. Why are all the Kanji characters missing from my East Asian documents?

Gemini includes basic Hiragana and Katakana character shapes but no Kanji shapes. Instead, Kanji characters will be displayed as black boxes. However, the fact that some characters may be missing will not affect the way Gemini exports the text. (back to the faq)

10. Why does the text display look so crude?

Gemini includes only a low quality rendering engine. This is fine for basic navigation of documents and is fit for most needs at medium resolutions. However, for applications requiring good quality low-resolution text output or high color fidelity, we advise the use of a more complete renderer such as that included in Adobe’s Acrobat line of products. (back to the faq)

11. How do I change the font and/or size of plain text output?

Plain text output includes no information about font or letter sizes. The only way to change its appearance is with the program you use to view the output. If you use “Notepad” on Windows platforms, try altering the “Font” settings. (back to the faq)

12. The page breaks do not appear to be at the correct location in my plain text or RTF output. Why?

There is no concept of a page in plain text output. The extent of a “page” is entirely dependent upon the program you use to view the text output. If your program does detect certain special sequences of characters to indicate a page break, try using such a sequence in the “Page Breaks” edit box in the Preferences dialogue of Gemini.

Similarly, in reflowed RTF output, programs such as Microsoft Word will apply their own page breaks dependent upon the size of the document. Try altering the font size or document size to vary where the page breaks occur in the program you use to view the RTF output. (back to the faq)
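
As an example of a page break sequence, many text viewers and printers treat the ASCII form feed character (code 12) as a page break. The Python sketch below only shows how such a separator divides plain text into pages; whether the “Page Breaks” edit box accepts a form feed is not confirmed by this manual, and the function names are hypothetical.

```python
FORM_FEED = "\f"  # ASCII 12, a common page break character in plain text


def join_pages(pages, separator=FORM_FEED):
    """Join the text of each page using the chosen page break sequence."""
    return separator.join(pages)


def split_pages(text, separator=FORM_FEED):
    """Split plain text back into pages at each page break sequence."""
    return text.split(separator)


text = join_pages(["Page one text", "Page two text"])
print(len(split_pages(text)))  # 2
```

Any other sequence your viewing program recognizes can be substituted for the form feed in the same way.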

13. Some accented characters are wrong or missing.

In some output formats such as Simple HTML and HTML 3.2 there is no HTML entity name to represent certain accented characters (such as Zcaron). In these cases Gemini will remove the accent, leaving only the base character. To avoid this, use one of the HTML 4 output formats or enable “Use UTF-8 Encoding” in the Text tab of the preferences dialogue box. (back to the faq)
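
The difference between the two behaviours can be demonstrated with Python's standard unicodedata module. This is a sketch of the general technique (Unicode decomposition followed by removal of combining marks), offered as an analogy; it is not a description of how Gemini performs the substitution.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


zcaron = "\u017d"  # LATIN CAPITAL LETTER Z WITH CARON

# Fallback behaviour: the accent is lost and only the base letter remains.
print(strip_accents(zcaron))   # Z

# With UTF-8 encoding the character is kept and written as two bytes.
print(zcaron.encode("utf-8"))  # b'\xc5\xbd'
```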

14. Some special symbols such as copyright or registered do not appear in the HTML output.

Choose “Use UTF-8 Encoding” from the Text tab of the preferences dialogue box. (back to the faq)

(back to the table of contents)

Tips For Exporting Text

Table Output

Some tips for improving table output:

• Ensure that the table box only covers things that are part of the table. If not, Gemini will try to incorporate them into the final table possibly disrupting the layout.

• If the table has the wrong structure when output (too many or too few cells or rows), try altering the value of “Ignore border when calculating table”.

Palm Doc Output

Some points to note when producing Palm Doc output:

• Before export, remove unwanted headers & footers from each page using the Crop Tool.

• If you are exporting by page, turn off page breaks.

• If you are exporting by article, you may want to make use of the page breaks feature to show when one article ends and another starts.

• If exporting pages, turn off “Create file for every page/article”.

• If exporting articles, it may be better for the reader to have multiple eBooks – one for each article. In which case ensure “Create file for every page/article” is checked.

• Ensure “Use UTF-8 Encoding” is off unless you are sure your doc reader can understand it.

• To ensure proper line breaks during sections of quoted speech, enable “Recognise speech marks”.

Copyright
Copyright © 2004 Iceni Technology. All rights reserved. No part of this publication may be reproduced, transmitted or transcribed in any form or by any means without the prior written permission of Iceni Technology.

Technical Support
Technical support for Gemini is available by email from support@iceni.com.

Upgrades and information concerning Gemini and other Iceni products can be found at www.iceni.com.

If you have any comments about this manual, our web site or any of our products, please send them to sales@iceni.com. (back to the table of contents)