Data Cleaning
How to Fix CSV Encoding Issues (UTF-8 Guide)
Garbled characters, missing accents, and broken emoji in your CSV files are almost always encoding problems. Learn what causes them, how to identify the encoding of any file, and how to convert everything to clean UTF-8 without losing a single character.
What Are CSV Encoding Issues and Why Do They Happen?
Every text file on your computer is stored as a sequence of bytes. Character encoding is the rulebook that maps those bytes to the letters, numbers, and symbols you see on screen. When the software reading a file uses a different rulebook than the software that wrote it, the bytes get misinterpreted. The result is garbled text: accented characters turn into multi-character garbage, currency symbols vanish, and emoji become strings of question marks or diamond characters.
This is not a rare edge case. Encoding mismatches are among the most common data quality problems, affecting every team that exchanges CSV files between different platforms, operating systems, or countries. If you have ever seen "Ã©" where you expected "é", "â€™" instead of an apostrophe, or "Ã±" instead of "ñ", you have encountered an encoding issue firsthand.
The core problem is that the world never agreed on a single encoding standard until UTF-8 became dominant. Before that, every operating system and language region had its own encoding. Windows used Windows-1252 (also called CP-1252) for Western European languages. Mac OS used MacRoman. Unix systems often used ISO-8859-1 (Latin-1). Japanese systems used Shift-JIS or EUC-JP. Chinese systems used GB2312 or Big5. Each of these encodings maps the same byte values to different characters.
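The ambiguity is easy to demonstrate. In Python (used for the sketches throughout; any language with codec support behaves the same way), a single byte decodes to a completely different character under each legacy rulebook:

```python
raw = b"\x8e"  # one byte, three different "meanings"

print(raw.decode("mac_roman"))  # 'é' on classic Mac OS
print(raw.decode("cp1252"))     # 'Ž' on Windows

# Under UTF-8 the same byte is not even valid on its own
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

The bytes never change; only the rulebook used to interpret them does.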
The Most Common Source: Excel Saves as Windows-1252
Microsoft Excel is the single biggest source of CSV encoding problems. When you use "Save As" and choose CSV format in Excel on Windows, the file is saved in the Windows-1252 encoding by default, not UTF-8. This encoding handles basic English characters and most Western European accented letters, but it cannot represent characters from Asian languages, Arabic, Hebrew, or emoji. Worse, any software that opens this file assuming UTF-8 will misrender the accented characters that Windows-1252 does support.
Excel on Mac has historically been even more unpredictable. Older versions used MacRoman encoding, which is different from both UTF-8 and Windows-1252. Recent versions of Excel for Mac offer a "CSV UTF-8" export option, but the default CSV option still does not guarantee UTF-8 output.
Google Sheets, by contrast, always exports CSV files in UTF-8 encoding. If all your data originates in Google Sheets and stays there, encoding is rarely an issue. The problems start when you mix Google Sheets exports with Excel exports, or when you import a Google Sheets CSV into Excel without specifying the encoding.
How Accented Characters Turn to Garbage
Understanding why garbled text appears requires a quick look at how UTF-8 works. In UTF-8, standard ASCII characters (A-Z, 0-9, basic punctuation) use one byte each, the same as in every other Western encoding. But accented characters like é, ñ, or ü use two bytes. The character é, for example, is stored as the two bytes 0xC3 0xA9 in UTF-8.
Now here is the problem. In Windows-1252, the byte 0xC3 maps to the character "Ã" (A with tilde), and the byte 0xA9 maps to the copyright symbol "©". So when a UTF-8 file containing "é" is opened as Windows-1252, you see "Ã©" instead. This is not corruption. The bytes are intact. The file just needs to be read with the correct encoding.
// How UTF-8 bytes get misread as Windows-1252:
UTF-8: "café" → bytes: 63 61 66 C3 A9
Read as 1252: "cafÃ©" → C3=Ã, A9=©
Read as UTF-8: "café" → C3 A9 = é (correct)
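The round trip above can be reproduced, and reversed, in a few lines of Python. The repair step only works as long as the mojibake has not been re-saved with lossy substitutions:

```python
text = "café"

utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("cp1252")  # reads C3 as Ã, A9 as ©
print(mojibake)                         # cafÃ©

# Undo: re-encode with the wrong codec, decode with the right one
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)                         # café
```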
The reverse also happens. If a Windows-1252 file containing "é" (byte 0xE9) is opened as UTF-8, the byte 0xE9 is an invalid UTF-8 sequence start. Depending on the software, you will see a replacement character (the black diamond with a question mark), a blank space, or the character will simply be dropped.
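This failure mode is just as easy to reproduce. A strict decoder raises an error; a lenient one substitutes U+FFFD, the familiar black-diamond glyph:

```python
win_bytes = "café".encode("cp1252")  # b'caf\xe9' — é is one byte here

try:
    win_bytes.decode("utf-8")        # strict readers reject the byte
except UnicodeDecodeError as err:
    print(err)

# Lenient readers swap in the replacement character instead
print(win_bytes.decode("utf-8", errors="replace"))  # caf�
```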
Why Emoji Break CSV Imports
Emoji are a special case that exposes encoding problems dramatically. Nearly every emoji requires four bytes in UTF-8 (a handful of older symbols use three). In Windows-1252 and ISO-8859-1, these multi-byte sequences are completely invalid. Any CSV file that contains emoji and gets saved in a non-UTF-8 encoding will either lose the emoji entirely or replace them with one garbage character per byte.
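A quick sketch shows both halves of the problem: the four-byte width, and the fact that legacy codecs simply have no slot for emoji at all:

```python
emoji = "🎉"

print(len(emoji.encode("utf-8")))  # 4 — four bytes in UTF-8

try:
    emoji.encode("cp1252")         # Windows-1252 has no mapping for emoji
except UnicodeEncodeError:
    print("cannot be represented in Windows-1252")
```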
This is increasingly common because modern CRM systems, form builders, and customer support platforms allow users to enter emoji in text fields. A customer's name, a company description, or a notes field may contain emoji that were entered deliberately. When these records are exported to CSV and the encoding is wrong, the data corruption is obvious and alarming.
Salesforce exports, Zendesk ticket exports, and HubSpot contact exports can all contain emoji if your users or customers entered them. If you process these exports through Excel before importing them elsewhere, the emoji are at high risk of being destroyed.
UTF-8 BOM vs. UTF-8 Without BOM
The BOM (Byte Order Mark) is an invisible three-byte prefix (0xEF 0xBB 0xBF) placed at the very beginning of a UTF-8 file. Its purpose is to signal to software that the file is UTF-8 encoded. In theory, this eliminates ambiguity. In practice, it creates a whole new category of problems.
When to use UTF-8 with BOM: If your file will be opened in Microsoft Excel, use BOM. Without it, Excel on Windows will assume the file is Windows-1252 and will mangle every non-ASCII character. The BOM is the only reliable way to tell Excel "this is UTF-8." Excel's own "CSV UTF-8" save format writes this BOM for exactly this reason.
When to use UTF-8 without BOM: For programmatic imports, database loads, API uploads, and most non-Microsoft software, skip the BOM. Many programming languages and import tools treat the BOM bytes as literal content, which means your first column header will start with three invisible characters. This causes column-name mismatches, failed lookups, and import errors that are extremely difficult to debug because the characters are invisible.
// BOM bytes at file start (invisible in most editors):
With BOM: EF BB BF 22 6E 61 6D 65 22 → (BOM)"name"
Without BOM: 22 6E 61 6D 65 22 → "name"
Gotcha: header === "name" → false (BOM bytes are part of the string)
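The header gotcha above is reproducible with Python's built-in codecs. The stdlib's utf-8-sig codec strips a leading BOM when reading (and writes one when saving), which is the usual fix on the programmatic side:

```python
import codecs

data = codecs.BOM_UTF8 + b'"name","city"'

# Naive UTF-8 decode keeps the BOM as an invisible first character
header = data.decode("utf-8").split(",")[0]
print(repr(header))        # '\ufeff"name"'
print(header == '"name"')  # False

# utf-8-sig strips the BOM if present (and is harmless if absent)
header = data.decode("utf-8-sig").split(",")[0]
print(header == '"name"')  # True
```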
Step-by-Step: Fix CSV Encoding Manually
If you need to fix an encoding issue manually, follow these steps. Be aware that this process has pitfalls, and if the file has already been saved in the wrong encoding and then re-saved, some data may be permanently lost.
Step 1: Identify the current encoding. Open the file in a text editor that shows encoding information. On Mac, TextEdit or BBEdit will show the encoding in the document properties. On Windows, Notepad++ shows the encoding in the status bar. VS Code shows it in the bottom-right corner.
Step 2: Reopen with the correct encoding. If the text looks garbled, the editor is using the wrong encoding. In Notepad++, go to Encoding and select the encoding that makes the text readable. Common options to try are UTF-8, Windows-1252, and ISO-8859-1. In VS Code, click the encoding label in the status bar, choose "Reopen with Encoding," and select the right one.
Step 3: Convert to UTF-8. Once the text displays correctly, convert the file to UTF-8. In Notepad++, go to Encoding and select "Convert to UTF-8" (or "Convert to UTF-8-BOM" if the file will be opened in Excel). In VS Code, click the encoding label, choose "Save with Encoding," and select UTF-8.
Step 4: Verify. Open the converted file in a different application and check that all accented characters, special symbols, and emoji display correctly. Pay special attention to names with diacritics, addresses with non-English characters, and any free-text fields.
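The four steps above can also be scripted. Below is a minimal stdlib-only sketch: it tries UTF-8 first (crucial, because Windows-1252 will "successfully" decode almost any byte sequence), falls back to legacy codecs, and rewrites the file as UTF-8. Dedicated detection libraries such as chardet or charset-normalizer are more robust choices for production pipelines:

```python
def convert_to_utf8(src, dst, candidates=("utf-8", "cp1252", "mac_roman")):
    """Decode src with the first candidate encoding that fits,
    write dst as UTF-8, and return the encoding that was used.

    Order matters: UTF-8 must come first, because cp1252 accepts
    nearly every byte value and would otherwise always "win".
    """
    with open(src, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("no candidate encoding fits %r" % src)
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return enc
```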
Common Encoding Errors by Platform
Different platforms produce different encoding problems. Knowing the source of your CSV helps you diagnose the issue faster.
Microsoft Excel (Windows)
Default CSV export uses Windows-1252. Accented characters survive within the Windows ecosystem but break everywhere else. The "CSV UTF-8" export option (available in recent Excel versions) is the fix, but most users do not know it exists. Excel also strips the BOM when opening UTF-8 files and re-saving them as plain CSV, silently downgrading the encoding.
Microsoft Excel (Mac)
Historically used MacRoman encoding, which is incompatible with both Windows-1252 and UTF-8. Modern versions default to UTF-8 for CSV export, but the "Windows Comma Separated" option still produces Windows-1252. Tab-delimited exports may use a different encoding entirely.
Google Sheets
Always exports UTF-8 without BOM. This is ideal for programmatic use but causes garbled characters when the downloaded file is opened in Excel on Windows. The fix is to use Excel's Data import wizard (Get Data → From File) and manually select UTF-8 encoding, but this workflow is non-obvious for most users.
Salesforce Exports
Salesforce Data Export and Report Export produce UTF-8 files with BOM. This works well with Excel but can cause BOM-related issues with programmatic imports. Data Loader uses UTF-8 without BOM. The inconsistency between Salesforce's own export tools is a common source of confusion.
Database Exports (MySQL, PostgreSQL)
The encoding of a database CSV export depends on the database's character set configuration. Older MySQL installations often use latin1 (ISO-8859-1) as the default character set. PostgreSQL typically uses UTF-8. If your database uses a non-UTF-8 encoding, the CSV export will inherit that encoding. Always check the database character set before exporting.
How to Prevent Encoding Problems Before They Start
The best fix for encoding issues is to prevent them entirely. Adopt these practices across your organization and you will eliminate the vast majority of encoding problems.
Standardize on UTF-8 everywhere. Configure your databases, export tools, and import scripts to use UTF-8. When creating CSV files programmatically, always write UTF-8. When exchanging files with partners or vendors, specify UTF-8 in your data format requirements.
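In Python, for example, the difference between the two flavors of UTF-8 output is a single encoding argument (file names here are illustrative):

```python
import csv

rows = [["name", "note"], ["Renée", "Prefers café meetings"]]

# Plain UTF-8, no BOM: right for databases, APIs, and scripts
with open("contacts.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# UTF-8 with BOM ("utf-8-sig"): only when the file is headed for Excel
with open("contacts_excel.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```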
Never double-convert. The most common cause of permanently lost data is opening a file in the wrong encoding and then saving it. Once you save a UTF-8 file as Windows-1252, the multi-byte characters get truncated or replaced. If you then convert the damaged file back to UTF-8, the original characters are gone. Always verify the encoding before saving.
Use a validation step. Before importing a CSV into any system, check for encoding issues. Look for the telltale signs: multi-character sequences where single characters should be (Ã©, Ã±, â€™), replacement characters, or missing text in fields that should contain non-English characters.
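A validation check does not need to be sophisticated to catch most mojibake. A minimal sketch, with an illustrative (not exhaustive) list of fingerprints:

```python
# Common double-encoding fingerprints: UTF-8 bytes misread as Windows-1252,
# plus the U+FFFD replacement character from failed strict decodes.
SUSPECTS = ("Ã©", "Ã±", "Ã¼", "â€™", "\ufffd")

def looks_garbled(text):
    """Flag text containing telltale encoding-damage sequences."""
    return any(s in text for s in SUSPECTS)

print(looks_garbled("JosÃ© GarcÃ­a"))  # True — UTF-8 read as Windows-1252
print(looks_garbled("José García"))    # False — clean UTF-8
```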
How NoSheet Fixes Encoding Automatically
Manually identifying and converting encodings is tedious and error-prone. NoSheet's CSV Cleaner eliminates this problem entirely by auto-detecting the encoding of every file you upload. It uses byte-pattern analysis to identify the source encoding with high confidence, then converts the entire file to clean UTF-8 without any manual intervention.
The tool handles all the edge cases that trip up manual processes: mixed encodings within a single file (which happens when records were appended from different sources), BOM detection and normalization, and the double-encoding problem where a file has been converted incorrectly and needs to be unwound. For a comprehensive walkthrough of all the data quality issues NoSheet catches, see our complete guide to cleaning CSV data.
If you are preparing data for a specific platform, NoSheet also ensures the output encoding matches what that platform expects. For example, exports destined for Salesforce import get UTF-8 with BOM, while exports for database import get clean UTF-8 without BOM.
Fix Encoding Issues in Seconds
Upload your CSV and NoSheet will auto-detect the encoding, convert to UTF-8, and fix every garbled character. No command line, no guesswork.
Fix Your CSV Encoding Now