Subject: Exact CSV file format specification (as exported from Microsoft Office 2002)
Category: Computers > Programming
Asked by: tomazos-ga
List Price: $5.00
Posted: 26 Jul 2004 23:50 PDT
Expires: 25 Aug 2004 23:50 PDT
Question ID: 379536
I'm looking for the technical file format specification for the "CSV
format" as referred to in Microsoft Excel and Microsoft Outlook as:
1. "Comma Separated Values (DOS)"
   Outlook 2002 on Win XP
   File menu > Import and Export > Export to file
2. "Comma Separated Values (Windows)"
   Outlook 2002 on Win XP
   File menu > Import and Export > Export to file
3. "CSV (Comma delimited)"
   Microsoft Excel 2002 on Win XP
   File menu > Save as
4. "CSV (MS-DOS)"
   Microsoft Excel 2002 on Win XP
   File menu > Save as
5. "CSV (Macintosh)"
   Microsoft Excel 2002 on Win XP
   File menu > Save as
What are the precise character set, format and escape characters that
each of these five file formats uses?
There is no answer at this time.
Subject: Re: Exact CSV file format specification (as exported from Microsoft Office 2002)
From: dreamboat-ga on 27 Jul 2004 08:30 PDT
Perhaps I misunderstand the question, because I'm not sure what "format" and "escape" characters are. To my knowledge, however, CSV files are nothing more than plain ASCII text files, which means they carry no font or formatting information of their own (a text editor simply shows them in its default font, often Courier). If necessary, I am happy to create a sample file, though not from a Mac.
Subject: Re: Exact CSV file format specification (as exported from Microsoft Office 2002)
From: crythias-ga on 27 Jul 2004 09:58 PDT
The precise character set is most likely 7-bit ASCII (not UTF-7, which
is a different encoding), that is, primarily the normal printable
characters on a US keyboard. The differences between the Windows, DOS,
and Mac variants mostly concern the termination (end-of-record)
characters. With delimited text you generally have a choice of field
separator (tab, comma, fixed width, etc.) and of a character that
marks text fields (usually the quotation mark). By default, a text
field has " on either side of the entry, though it is possible not to
use quotes. The quotes are helpful when you are using a comma
delimiter and have a comma inside a text field, which you don't want
to delimit (break apart into multiple fields).
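To illustrate the quoting point, here is a minimal sketch using Python's standard `csv` module. The choice of Python and the helper names `write_record` and `read_record` are mine, purely for illustration; this is not what Outlook or Excel run internally. It shows a comma inside a text field surviving a round trip because the field is wrapped in quotes.

```python
import csv
import io

def write_record(fields, delimiter=","):
    """Serialize one record, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,   # quote only when a field contains a special character
        lineterminator="\r\n",       # Windows-style end of record (an assumption for this demo)
    )
    writer.writerow(fields)
    return buf.getvalue()

def read_record(line, delimiter=","):
    """Parse one serialized record back into a list of fields."""
    reader = csv.reader(io.StringIO(line, newline=""),
                        delimiter=delimiter, quotechar='"')
    return next(reader)

line = write_record(["Smith, John", "42"])
print(repr(line))          # the first field is quoted, the second is not
print(read_record(line))   # the embedded comma does not split the field
```

Without the quotes, the same record would read back as three fields instead of two.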
By default, Windows CSV ends each record with both a carriage return
(CR, chr$(13), ^M, Ctrl-M) and a line feed (LF, chr$(10), ^J, Ctrl-J).
The Macintosh format uses (IIRC) just a carriage return for end of
line/end of record. DOS uses the same CR LF pair as Windows; it is
Unix that uses only a line feed. In general, this matters little for
import/export. However, if you open a Windows text file in a
Linux/Unix text editor, you may see ^M (the carriage return) at the
end of every line, whereas a Unix document opened in some Windows
text editors shows up as one long unwrapped line.
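If you need to check which end-of-record convention a given export actually uses, a rough sketch like the following can help. It is written in Python for illustration, the function names are hypothetical, and it simply counts the three conventions named above (CR LF, CR alone, LF alone) in the raw bytes. It assumes no field contains an embedded line break, which quoted fields are allowed to do.

```python
def detect_line_ending(data):
    """Return 'CRLF', 'CR', 'LF', or 'none' for the most common terminator in data (bytes)."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # carriage returns not followed by a line feed
    lf = data.count(b"\n") - crlf   # line feeds not preceded by a carriage return
    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"

def split_records(data):
    """Split raw bytes into records, accepting any of the three terminators."""
    # bytes.splitlines() breaks on \r\n, \r and \n and drops the terminators.
    return data.splitlines()

sample = b"name,city\r\nAnn,Oslo\r\n"
print(detect_line_ending(sample))
print(split_records(sample))
```

Read the file in binary mode (`open(path, "rb")`) before calling these, so that Python does not translate the terminators for you.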
Subject: Re: Exact CSV file format specification (as exported from Microsoft Office 2002)
From: crythias-ga on 27 Jul 2004 10:00 PDT
http://www.websiterepairguy.com/articles/os/crlf.html
Subject: Re: Exact CSV file format specification (as exported from Microsoft Office 2002)
From: duoas-ga on 04 Aug 2004 20:52 PDT
You have asked two questions.
1. "format", "escape", and "control" are all words used to describe
characters that have special meaning. An example is the line-feed
character (ASCII 10) which instructs the line device (printer,
terminal, etc.) to advance the print head (or cursor, or whatever) to
the next line. ASCII 13 (carriage-return) moves the print head back to
the left side of the carriage (or crt, ...).
In 7-bit ASCII, the control characters are codes 0..31 and 127. All
other codes are printable characters, such as a space and the letter
'A'. Google ASCII for a chart.
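As a quick illustration of that range, here is a tiny Python check (the function name `is_control` is just for this example):

```python
def is_control(ch):
    """True if ch is a 7-bit ASCII control character (codes 0..31 or 127)."""
    code = ord(ch)
    return code < 32 or code == 127

# CR (13) and LF (10) are control characters; space (32) and 'A' (65) are printable.
for ch in ["\r", "\n", " ", "A", "\x7f"]:
    print(ord(ch), is_control(ch))

control_codes = [c for c in range(128) if is_control(chr(c))]
```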
2. CSV (Comma-Separated) files follow a very simple syntax.
Notation
FS (Field Separator) is almost always a comma or a tab.
FD (Field Delimiter) is almost always a double-quote.
NL (New Line) is almost always the ASCII CR LF combination.
Unix systems tend to use LF only.
Macintoshes tend to use CR only.
Description
The file is ASCII encoded.
Each line represents one record. Lines are terminated by any valid NL.
Each record contains multiple fields, separated by the FS character.
Again, this is normally a comma or a tab, but it could be anything.
Whitespace surrounding the FS character is ignored (the field begins and
ends with the first and last non-space characters). Neither FS nor FD is
ever treated as whitespace, even if it would be under normal circumstances.
Thus, with a tab as FS, three tabs delimit four fields, not one.
A field is a string of text characters, which *may* be delimited by the
FD character. Again, this is usually the double-quote ("), but it can
be anything convenient. I have seen pipe (|) and single-quote (') used.
The FS character may appear in a FD-delimited field; in this case it is not
treated as the field separator.
The FD character may appear in a FD-delimited field; simply double it.
I do not know whether an FD character can appear in the middle of an
*undelimited* field.
- Is it treated as a normal character?
- Does it delimit text to be concatenated with surrounding text?
I have never come across an instance where this is a problem.
If someone knows the answer, it would be nice to see it posted.
The file should end with a single NL after the last record, so that the
final "line" is empty. This, however, is not guaranteed. The last record
might not end with a NL at all, or there might be two or more. In any
case, your algorithm should be able to discard blank lines.
Example
FS = ,
FD = "
one,two, , "three,four " ,"five ""six"" seven", " eight,""", ",nine"
indicates the following seven fields:
one                 3 characters
two                 3
(empty)             0
three,four          11 (including the trailing space)
five "six" seven    16
eight,"             8 (including the leading space)
,nine               5
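The rules and the example line above can be sketched as a small hand-written parser. This is only an illustration in Python under stated assumptions: the names `parse_record`, `parse_text` and `is_blank` are hypothetical, an FD character in the middle of an undelimited field is treated as an ordinary character (the open question above), line breaks inside delimited fields are not handled, and nothing here is claimed to match what Excel or Outlook actually do.

```python
def parse_record(line, fs=",", fd='"'):
    """Split one line into fields using the FS/FD rules described above."""
    fields = []
    i, n = 0, len(line)
    while True:
        # Skip whitespace before the field; FS and FD never count as whitespace.
        while i < n and line[i].isspace() and line[i] not in (fs, fd):
            i += 1
        if i < n and line[i] == fd:
            # Delimited field: read up to the closing FD, un-doubling FD pairs.
            i += 1
            chars = []
            while i < n:
                if line[i] == fd:
                    if i + 1 < n and line[i + 1] == fd:
                        chars.append(fd)   # doubled FD is one literal FD
                        i += 2
                        continue
                    i += 1                 # closing FD
                    break
                chars.append(line[i])
                i += 1
            # Ignore whatever follows the closing FD (normally whitespace) up to FS.
            while i < n and line[i] != fs:
                i += 1
            fields.append("".join(chars))
        else:
            # Undelimited field: runs to the next FS; an embedded FD is kept
            # as a normal character (assumption, see the lead-in).
            start = i
            while i < n and line[i] != fs:
                i += 1
            fields.append(line[start:i].rstrip())
        if i >= n:
            break
        i += 1  # step over the FS
    return fields

def is_blank(line, fs=",", fd='"'):
    """A line is blank if it holds only whitespace that is neither FS nor FD."""
    return all(c.isspace() and c not in (fs, fd) for c in line)

def parse_text(text, fs=",", fd='"'):
    """Parse a whole file's text, accepting CR LF, CR or LF and discarding blank lines."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return [parse_record(l, fs, fd) for l in lines if not is_blank(l, fs, fd)]

example = 'one,two, , "three,four " ,"five ""six"" seven", " eight,""", ",nine"'
fields = parse_record(example)
for f in fields:
    print(repr(f), len(f))
```

Run on the example line, this yields the seven fields and lengths listed above. For real-world files, a tested CSV library is the safer choice.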
Hope this helps.
Duoas
Subject: Re: Exact CSV file format specification (as exported from Microsoft Office 2002)
From: creativist-ga on 17 Aug 2004 06:55 PDT
tomazos, I've spent quite a bit of time on this issue. Like you, I have never found an official document, so I have worked out the details using experimentation and input from others. These efforts are documented here:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
The paper, including the format description, has had a few additions and corrections over the years. There's always the possibility that there is still another gotcha hiding in there, but it's pretty clean to my knowledge (I haven't had any bug reports for a while).
Hope this helps.
-cvst