Home All Groups Group Topic Archive Search About

BCP and CSV - my conclusions

Author
16 Oct 2007 1:04 PM
PhilHibbs
There have been several threads here on reading/writing CSV files
using BCP, and I suspect that they will continue to crop up for ever.
The conclusion that I have come to is that you might, if you are
persistent and lucky, be able to get something working. Be prepared,
however, for it to be tricky to achieve, and a little contrived, in a
lot of real-life circumstances. It may even prove to be impossible if
you are really unlucky with your data. So far I have only explored
exporting to CSV, because that's all I have needed to do. If anyone
has wrangled with importing CSVs and come to any useful conclusions,
please post a follow-up.

CSV is more than just putting "," in between all the fields using a
format file and then jumping through a hoop to get the leading and
trailing quotes in as well before and after the first and last fields
respectively.

For a start, numeric fields should not have quotes around them, so
your format file is going to have to be crafted so that each field
terminator is correct in respect of its own data type and the data
type of the following field. Secondly, any double-quotes characters
within a field need to be doubled up, so that the text [a piece of
wood 2" long] becomes "a piece of wood 2"" long". This means that you
can't just give a table name, you have to write an SQL query. It also
makes the query considerably longer, all character fields that might
contain a double-quotes character need to be wrapped in a translation
function that replaces " with "". Now you aren't likely to hit the
8192 character limit for the length of Command Prompt commands, but
this does make for some pretty unwieldy statements.

The icing on the cake is how to deal with embedded carriage return
characters in your data. Excel seems to be able to cope with this;
when it comes across a line-end it seems to check to see if it is
within a field that is enclosed in quotes, and accepts the line-end
character as part of the cell value. I have no idea how BCP would cope
with this - presumably if you are exporting data then it will happily
write the line-end characters out, and it is up to the receiving
program to handle them, as I said Excel seems to do this. As to what
BCP would do with line ends inside a delimited field at the import
stage, I have no idea.

Phil.

Author
16 Oct 2007 9:48 PM
Mike C#
I prefer Tab-delimited, or some other delimited format, when I need to use
delimited files with BCP.  Excel has the ability to save files in
Tab-delimited format.

Show quote
"PhilHibbs" <sna***@gmail.com> wrote in message
news:1192539859.727334.97190@i38g2000prf.googlegroups.com...
> There have been several threads here on reading/writing CSV files
> using BCP, and I suspect that they will continue to crop up for ever.
> The conclusion that I have come to is that you might, if you are
> persistent and lucky, be able to get something working. Be prepared,
> however, for it to be tricky to achieve, and a little contrived, in a
> lot of real-life circumstances. It may even prove to be impossible if
> you are really unlucky with your data. So far I have only explored
> exporting to CSV, because that's all I have needed to do. If anyone
> has wrangled with importing CSVs and come to any useful conclusions,
> please post a follow-up.
>
> CSV is more than just putting "," in between all the fields using a
> format file and then jumping through a hoop to get the leading and
> trailing quotes in as well before and after the first and last fields
> respectively.
>
> For a start, numeric fields should not have quotes around them, so
> your format file is going to have to be crafted so that each field
> terminator is correct in respect of its own data type and the data
> type of the following field. Secondly, any double-quotes characters
> within a field need to be doubled up, so that the text [a piece of
> wood 2" long] becomes "a piece of wood 2"" long". This means that you
> can't just give a table name, you have to write an SQL query. It also
> makes the query considerably longer, all character fields that might
> contain a double-quotes character need to be wrapped in a translation
> function that replaces " with "". Now you aren't likely to hit the
> 8192 character limit for the length of Command Prompt commands, but
> this does make for some pretty unwieldy statements.
>
> The icing on the cake is how to deal with embedded carriage return
> characters in your data. Excel seems to be able to cope with this;
> when it comes across a line-end it seems to check to see if it is
> within a field that is enclosed in quotes, and accepts the line-end
> character as part of the cell value. I have no idea how BCP would cope
> with this - presumably if you are exporting data then it will happily
> write the line-end characters out, and it is up to the receiving
> program to handle them, as I said Excel seems to do this. As to what
> BCP would do with line ends inside a delimited field at the import
> stage, I have no idea.
>
> Phil.
>
Author
16 Oct 2007 9:59 PM
Erland Sommarskog
PhilHibbs (sna***@gmail.com) writes:
> CSV is more than just putting "," in between all the fields using a
> format file and then jumping through a hoop to get the leading and
> trailing quotes in as well before and after the first and last fields
> respectively.
>
> For a start, numeric fields should not have quotes around them, so
> your format file is going to have to be crafted so that each field
> terminator is correct in respect of its own data type and the data
> type of the following field. Secondly, any double-quotes characters
> within a field need to be doubled up, so that the text [a piece of
> wood 2" long] becomes "a piece of wood 2"" long".

BCP is a fairly squared tool. It's good for what it does, but if you
need to create files with that special format, BCP is not your friend.

Have you looked at Integration Services? I haven't myself, but there
should be a lot more options for file-export, including CSVs.

> The icing on the cake is how to deal with embedded carriage return
> characters in your data. Excel seems to be able to cope with this;
> when it comes across a line-end it seems to check to see if it is
> within a field that is enclosed in quotes, and accepts the line-end
> character as part of the cell value. I have no idea how BCP would cope
> with this - presumably if you are exporting data then it will happily
> write the line-end characters out, and it is up to the receiving
> program to handle them, as I said Excel seems to do this. As to what
> BCP would do with line ends inside a delimited field at the import
> stage, I have no idea.

When you use a format file, BCP reads the file as stream of bytes.
CR-LF are just two bytes in the stream. If CR-LF is the terminator
for the currenr field, then the field will stop there. If the current
field has another terminator, the CR-LF will be part of the data. Thus
embedded line breaks in the last field will mean that all the stuff
after the line break will end up in the first field of the next
record.


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
17 Oct 2007 9:41 AM
PhilHibbs
On 16 Oct, 22:59, Erland Sommarskog <esq***@sommarskog.se> wrote:
> When you use a format file, BCP reads the file as stream of bytes.
> CR-LF are just two bytes in the stream. If CR-LF is the terminator
> for the currenr field, then the field will stop there. If the current
> field has another terminator, the CR-LF will be part of the data. Thus
> embedded line breaks in the last field will mean that all the stuff
> after the line break will end up in the first field of the next
> record.

IBM DataStage seems to work in exactly the same way, you can't have
embedded line-ends in the last field.

Phil Hibbs.
Author
17 Oct 2007 10:05 PM
Erland Sommarskog
PhilHibbs (sna***@gmail.com) writes:
> On 16 Oct, 22:59, Erland Sommarskog <esq***@sommarskog.se> wrote:
>> When you use a format file, BCP reads the file as stream of bytes.
>> CR-LF are just two bytes in the stream. If CR-LF is the terminator
>> for the currenr field, then the field will stop there. If the current
>> field has another terminator, the CR-LF will be part of the data. Thus
>> embedded line breaks in the last field will mean that all the stuff
>> after the line break will end up in the first field of the next
>> record.
>
> IBM DataStage seems to work in exactly the same way, you can't have
> embedded line-ends in the last field.

Well, you can can have embedded newlines in any field with BCP. What you
cannot have is the field terminator in the field itself.

Often when I run BCP in character mode, I use @!@ for field delimiters
and \n<->\n as line terminators. Of course that format may not be useful
if you are exporting to for instance Excel.


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
18 Oct 2007 3:50 AM
Linchi Shea
In my experience, SSIS (and the data import wizard) seems to deal with the
full CSV format fine.

Linchi

Show quote
"Erland Sommarskog" wrote:

> PhilHibbs (sna***@gmail.com) writes:
> > CSV is more than just putting "," in between all the fields using a
> > format file and then jumping through a hoop to get the leading and
> > trailing quotes in as well before and after the first and last fields
> > respectively.
> >
> > For a start, numeric fields should not have quotes around them, so
> > your format file is going to have to be crafted so that each field
> > terminator is correct in respect of its own data type and the data
> > type of the following field. Secondly, any double-quotes characters
> > within a field need to be doubled up, so that the text [a piece of
> > wood 2" long] becomes "a piece of wood 2"" long".
>
> BCP is a fairly squared tool. It's good for what it does, but if you
> need to create files with that special format, BCP is not your friend.
>
> Have you looked at Integration Services? I haven't myself, but there
> should be a lot more options for file-export, including CSVs.

> > The icing on the cake is how to deal with embedded carriage return
> > characters in your data. Excel seems to be able to cope with this;
> > when it comes across a line-end it seems to check to see if it is
> > within a field that is enclosed in quotes, and accepts the line-end
> > character as part of the cell value. I have no idea how BCP would cope
> > with this - presumably if you are exporting data then it will happily
> > write the line-end characters out, and it is up to the receiving
> > program to handle them, as I said Excel seems to do this. As to what
> > BCP would do with line ends inside a delimited field at the import
> > stage, I have no idea.

> When you use a format file, BCP reads the file as stream of bytes.
> CR-LF are just two bytes in the stream. If CR-LF is the terminator
> for the currenr field, then the field will stop there. If the current
> field has another terminator, the CR-LF will be part of the data. Thus
> embedded line breaks in the last field will mean that all the stuff
> after the line break will end up in the first field of the next
> record.
>
>
> --
> Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se
>
> Books Online for SQL Server 2005 at
> http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
> Books Online for SQL Server 2000 at
> http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
>

AddThis Social Bookmark Button