Fine-tuning IndexTTS to generate speech audio with controllable emotion

2025-06-15 Machine Learning

This post records the author's process of learning and fine-tuning IndexTTS to generate speech audio with controllable emotion.

About IndexTTS

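IndexTTS is a text-to-speech (TTS) model open-sourced by Bilibili's Index team. It supports zero-shot voice cloning in Chinese and English. Its distinguishing features are a small parameter count and the ability to control the pronunciation of Chinese polyphonic characters with pinyin tones. Its architecture is based on Tortoise TTS and XTTS, and it uses BigVGAN as the vocoder. Although the official report mentions support for synthesizing audio with controllable emotion, the code and usage instructions for that capability have not been released so far (see issues#13).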

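Fine-tuning results

Below are Chinese and English speech samples generated by IndexTTS after about half an hour of fine-tuning on an NVIDIA GeForce RTX 4070:

Reference audio	Synthesized speech sample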
Elise-1	Hey there my name is Elise, <GIGGLES> and I’m a speech generation model that can sound like a person.
	你好 , 我是 ELISE, 一个语音生成模型 , <GIGGLES> 我的声音听起来跟真人一样 .
Female-1	Seriously? <giggles> That’s the cutest thing I’ve ever heard!
	真的吗？ <giggles> 这也太可爱了吧！
Male-1	Wha—? Cute? <giggles> You think I’m cute?! Well, uh, thanks, I guess?
	哎呀! 忘了他还在那等我们呢！ <giggles> 我们两个动作得快点了！

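The complete experiment Jupyter Notebook is in the repository yrom/finetune-index-tts.

Model architecture and fine-tuning approach

IndexTTS is based on Tortoise TTS and combines several cooperating modules. Its main pipeline is as follows:

(Diagram drawn by the author.)

Reaching the fine-tuning goal mainly involves two core parts:

BPE tokenizer: based on sentencepiece, it encodes the text and the newly added emotion tags (such as <GIGGLES>) into a sequence of vocabulary IDs.
GPT2 autoregressive module: through fine-tuning, it learns to generate a suitable audio latent representation from the ID of the target emotion tag.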
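
The tokenizer side of this idea can be illustrated with a toy sketch. It is not the sentencepiece API and not IndexTTS code: ToyTokenizer, add_special_token and the base vocabulary size are hypothetical names and values, and ordinary text is mapped byte by byte as a stand-in for real BPE. The only point is that an emotion tag receives a fresh ID beyond the existing vocabulary and is emitted as a single token.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a BPE tokenizer (hypothetical, for illustration only).
class ToyTokenizer {
public:
    // base_vocab_size: number of IDs already used by the pretrained vocabulary.
    explicit ToyTokenizer(int base_vocab_size) : next_id_(base_vocab_size) {
        assert(base_vocab_size >= 256);  // IDs 0..255 are reserved for raw bytes here
    }

    // Registers a tag such as "<GIGGLES>" and returns its (new or existing) ID.
    int add_special_token(const std::string& tag) {
        assert(!tag.empty());
        auto it = special_.find(tag);
        if (it != special_.end()) return it->second;
        special_[tag] = next_id_;
        return next_id_++;
    }

    // Encodes text: registered tags become one ID each, everything else one ID per byte.
    std::vector<int> encode(const std::string& text) const {
        std::vector<int> ids;
        std::size_t pos = 0;
        while (pos < text.size()) {
            bool matched = false;
            for (const auto& entry : special_) {
                if (text.compare(pos, entry.first.size(), entry.first) == 0) {
                    ids.push_back(entry.second);
                    pos += entry.first.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                ids.push_back(static_cast<unsigned char>(text[pos]));
                ++pos;
            }
        }
        return ids;
    }

private:
    std::map<std::string, int> special_;
    int next_id_;
};
```

In a real setup, the embedding table of the autoregressive module would also need a row for every newly added ID, which is part of what fine-tuning has to learn.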

Porting a PyTorch model to MLX

2025-05-14 Machine Learning

This post records the process of porting a PyTorch model to MLX.

What is MLX?

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
https://github.com/ml-explore/mlx/blob/main/README.md

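MLX is a machine learning framework designed for Apple's M-series chips (Apple silicon).

The design of mlx's array is closer to numpy^[1] than to PyTorch's tensor: it stores only structural information (such as shape and data type) and has no attributes tied to deep learning training (such as gradients). Unlike numpy and torch, an mlx array lives in unified memory and can be shared between the CPU and the GPU, which is also why mlx was developed as a separate framework rather than as an extension of pytorch's mps backend^[2].

array vs. tensor

For the difference between np.array and torch.tensor, see What is a Tensor in Machine Learning?

Model conversion in practice

BigVGAN PyTorch -> mlx-BigVGAN

Basic mapping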

	torch	mlx
DataTypes	Data types	Data Types
NN	`torch.nn.*`	`mlx.nn.*`
Parameters/Weight/Buffer	`torch.Tensor`	`mlx.core.array`
ModuleList	`torch.nn.ModuleList`	`list`
ModuleDict	`torch.nn.ModuleDict`	`dict`
Transform	`torch.fft`	`mlx.core.fft`

How to quickly locate Shaders with performance problems, plus some simple and effective optimizations

2024-04-08 Performance

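Problem

A visual-effects rendering project usually contains many Shaders that implement different features, such as Gaussian blur, liquify, color calibration, and LUTs. When effects are in use, you may run into performance problems that might be caused by a Shader, such as slow rendering or stuttering during real-time rendering. At that point you need a way to quickly find the offending Shader, which is what this post is about.

Locating the problem

To do a good job, one must first sharpen one's tools.

First, you need a tool that can evaluate performance quickly (see the earlier roundup, Graphics API Debuggers). Pick a suitable physical test device and a profiling tool (I personally use Arm Performance Studio for Mobile).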

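Recommended metrics to watch:

FPS (for real-time rendering projects)
GPU utilization
CPU utilization
Memory usage

Next, create a set of benchmark cases for your usage scenarios, so that different Shader implementations can be compared and the bottleneck identified.

To pin down a Shader performance problem, the following approach is recommended:

Overall check: replace all Shaders with a simple Shader (or a do-nothing NULL-Shader) and observe how performance changes. If it improves noticeably, the bottleneck is most likely in the Shaders.
Bisection: split the Shaders into two halves, A and B, and measure two configurations, one in which only half A is replaced while the rest stay original, and one in which only half B is replaced. Compare the two, keep bisecting the half whose replacement yields the larger speed-up, and repeat until the problematic Shader is found.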
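
The bisection step can be sketched as follows. This is a minimal sketch under two assumptions: a single Shader dominates the cost, and a hypothetical callback measure(first, last) returns the frame time observed when the Shaders with indices in [first, last) are replaced by the simple Shader while all others keep their original implementation. How that measurement is taken (profiler, timer queries, FPS counter) is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical measurement hook: frame time (e.g. in ms) with the shaders in
// [first, last) replaced by a trivial shader and every other shader untouched.
using MeasureFn = std::function<double(std::size_t first, std::size_t last)>;

// Returns the index of the shader whose replacement helps the most, found by
// repeatedly halving the candidate range. Assumes one dominating culprit.
std::size_t find_slowest_shader(std::size_t count, const MeasureFn& measure) {
    assert(count > 0);
    std::size_t first = 0;
    std::size_t last = count;
    while (last - first > 1) {
        const std::size_t mid = first + (last - first) / 2;
        const double time_without_a = measure(first, mid);  // half A replaced
        const double time_without_b = measure(mid, last);   // half B replaced
        if (time_without_a <= time_without_b) {
            last = mid;   // replacing A saved more time: the culprit is in A
        } else {
            first = mid;  // replacing B saved more time: the culprit is in B
        }
    }
    return first;
}
```

With N Shaders this needs about 2·log2(N) measurements instead of N. Real measurements are noisy, so averaging several frames per configuration is advisable.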

Pitfalls of the OpenGL Uniform Buffer Object

2024-02-24 Graphics

What is a Uniform Buffer Object?

See the wiki: https://www.khronos.org/opengl/wiki/Uniform_Buffer_Object

and the example on learnopengl: https://learnopengl.com/Advanced-OpenGL/Advanced-GLSL#:~:text=the%20geometry%20shader.-,Uniform%20buffer%20objects,-We%27ve%20been%20using

I ran into a few pitfalls while using UBOs in a project, and record them here.

Pitfall 1: UBO alignment

When using a UBO, pay attention to alignment. For portability, the UBO memory layout is usually declared with std140, for example:

layout (std140) uniform ExampleBlock
{
    float value;
    vec3  vector;
    mat4  matrix;
    bool  boolean;
};

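Understanding the std140 layout rules is one thing, but alignment problems still show up when you write the matching struct in C++. For example, suppose the following C++ struct is meant to correspond to the ExampleBlock above: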

struct ExampleBlock {
    float value;
    glm::vec3 vector;
    glm::mat4 matrix;
    bool boolean;
};

To the C++ compiler, this struct follows C++'s own memory layout rules, so we have to align it to std140 by hand:

struct alignas(16) ExampleBlock {
    float value;
    alignas(16) glm::vec4 vector; // std140 puts the vec3 at offset 16; glm types are only 4-byte aligned by default
    glm::mat4 matrix;
    alignas(4) bool boolean;
};

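Note that the bool member in the struct above needs to be aligned to 4 bytes.

However, even written this way you may still hit a problem: the C++ code clearly sets the value to false, yet when the program runs, the Shader Program reads true. (ﾟДﾟ≡ﾟдﾟ)!?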
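
One plausible explanation, stated here as an assumption because the text above does not spell it out: a std140 bool occupies 4 bytes, while a C++ bool normally occupies 1, so the remaining 3 bytes are padding with indeterminate content, and the shader sees a non-zero 32-bit value. The sketch below sidesteps this by storing the flag as a 32-bit integer and checks the std140 offsets at compile time. Vec4 and Mat4 are hypothetical stand-ins for glm::vec4 and glm::mat4 so that the snippet compiles without glm.

```cpp
#include <cstddef>
#include <cstdint>

struct Vec4 { float x, y, z, w; };  // stand-in for glm::vec4 (assumption)
struct Mat4 { float m[16]; };       // stand-in for glm::mat4 (assumption)

// CPU-side mirror of the std140 ExampleBlock.
struct alignas(16) ExampleBlockStd140 {
    float value;               // offset 0
    alignas(16) Vec4 vector;   // GLSL vec3: 16-byte aligned, padded to 16 bytes
    alignas(16) Mat4 matrix;   // four 16-byte columns
    std::int32_t boolean;      // GLSL bool: 4 bytes, all of them written
};

// Converts a C++ bool into the 4-byte value stored in the block.
constexpr std::int32_t to_std140_bool(bool flag) { return flag ? 1 : 0; }

// Compile-time checks against the offsets std140 assigns to ExampleBlock.
static_assert(offsetof(ExampleBlockStd140, value) == 0, "value offset");
static_assert(offsetof(ExampleBlockStd140, vector) == 16, "vector offset");
static_assert(offsetof(ExampleBlockStd140, matrix) == 32, "matrix offset");
static_assert(offsetof(ExampleBlockStd140, boolean) == 96, "boolean offset");
static_assert(sizeof(ExampleBlockStd140) == 112, "block size");
```

Keeping the static_assert lines next to the struct means a layout mismatch fails at compile time instead of showing up as a wrong value in the shader.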

Injecting C/C++ code with dyld-interposing

2023-10-19 Cxx

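The dynamic linker on Apple systems, /usr/lib/dyld, provides a feature called dyld-interposing (available since Mac OS X 10.4) that can replace the implementation of a function when a program starts. It can be used for code injection (see Mac OS X Internals: A Systems Approach by Amit Singh, chapter 2, section 2.6.3.4, "dyld interposing").

An example

For instance, we can replace the implementation of malloc when the program runs: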

malloc_trace.c

// malloc_trace.c
#include <stdio.h>
#include <stdlib.h>

#include <mach-o/dyld-interposing.h>
#include <memory.h> // memset
#include <malloc/malloc.h> // malloc_printf

void *trace_malloc(size_t size) {
  char *p = malloc(size);
  // fill the new block with '#' (skipped if the allocation failed)
  if (p) memset(p, '#', size);
  malloc_printf("malloc(%u) = %p\n", (unsigned)size, p);
  return (void *)p;
}

DYLD_INTERPOSE(trace_malloc, malloc);

test.c

// test.c
#include <stdio.h>
#include <stdlib.h>
int main() {
    char *p = (char*)malloc(10);
    printf("malloc return %p, %s\n", p, p);
    free(p);
    return 0;
}

$ cc -dynamiclib -o libmalloctrace.dylib malloc_trace.c -install_name libmalloctrace.dylib
$ cc -o test test.c
$ DYLD_INSERT_LIBRARIES=libmalloctrace.dylib ./test

test(46555,0x11bdd3600) malloc: malloc(1536) = 0x7febbc808200
test(46555,0x11bdd3600) malloc: malloc(32) = 0x7febbc704130
test(46555,0x11bdd3600) malloc: malloc(32) = 0x7febbc704170
test(46555,0x11bdd3600) malloc: malloc(20) = 0x7febbc705550
test(46555,0x11bdd3600) malloc: malloc(422) = 0x7febbc7055d0
test(46555,0x11bdd3600) malloc: malloc(50) = 0x7febbc7057e0
test(46555,0x11bdd3600) malloc: malloc(16) = 0x7febbc705880
test(46555,0x11bdd3600) malloc: malloc(52) = 0x7febbc705900
test(46555,0x11bdd3600) malloc: malloc(12) = 0x7febbc7059b0
test(46555,0x11bdd3600) malloc: malloc(10) = 0x7febbc705b00
test(46555,0x11bdd3600) malloc: malloc(4096) = 0x7febbc808800
malloc return 0x7febbc705b00, ##########

Graphics API Debuggers

2023-09-27 Graphics

Learning Render Graph

2023-08-13 Graphics

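What is a Render Graph?

A Render Graph, also called a Frame Graph, is a high-level abstraction of a complex rendering pipeline. It represents, as a graph, the individual steps of the rendering process, the dependencies between rendering tasks, and their use of resources such as textures and buffers.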

Frame graphs are a design pattern for handling complex rendering pipelines, which are currently used in industry. Their usage is motivated by handling barriers, queue synchronization and memory aliasing in the background by abstracting the rendering pipeline of a frame on a higher level.
—— https://github.com/gfx-rs/gfx/wiki/Frame-graphs

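What problem does it solve?

In a traditional rendering pipeline, rendering is usually divided into several stages, as shown in the figure below:

These stages have input and output dependencies: the output of one stage serves as the input of the next.

The core idea of a Render Graph is to represent rendering as a directed acyclic graph (DAG) in which nodes are render passes and edges are dependencies. Each render pass performs a specific rendering operation and can have input and output resources, such as Textures and Frame Buffers, together with the Shader/Program it executes. For example, if the output Texture of node A is the input Texture of node B, then node B depends on node A.

By capturing the dependencies in the rendering flow, the graph ensures that rendering stages execute in the correct order and, where needed, allows render passes to run in parallel as much as possible on top of synchronization mechanisms (Fences, Semaphores, Resource barriers).

The goal of a Render Graph is to solve problems found in the complex pipelines of large rendering engines, such as resource lifetime management, rendering efficiency, and visual debugging of the rendering process.

Render Graphs are widely used in game engines, and the same ideas can also be glimpsed in modern graphics APIs, such as Vulkan's Render Pass, DirectX 12's Command List, and Metal's Render Pass.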
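
The ordering part of this idea can be sketched with a plain topological sort (Kahn's algorithm). This is a simplified illustration rather than the design of any particular engine: RenderPass and schedule are hypothetical names, resources are identified by string names, a pass depends on every other pass that writes a resource it reads, and synchronization, resource aliasing and parallel execution are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical pass description: the names of the resources it reads and writes.
struct RenderPass {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// True if `consumer` reads at least one resource that `producer` writes.
static bool depends_on(const RenderPass& consumer, const RenderPass& producer) {
    for (const std::string& resource : producer.writes) {
        if (std::find(consumer.reads.begin(), consumer.reads.end(), resource) !=
            consumer.reads.end()) {
            return true;
        }
    }
    return false;
}

// Returns pass indices in a valid execution order (Kahn's algorithm),
// or an empty vector if the dependencies contain a cycle.
std::vector<std::size_t> schedule(const std::vector<RenderPass>& passes) {
    const std::size_t n = passes.size();
    std::vector<std::vector<std::size_t>> dependents(n);  // edges producer -> consumer
    std::vector<std::size_t> pending(n, 0);               // unmet dependencies per pass
    for (std::size_t p = 0; p < n; ++p) {
        for (std::size_t c = 0; c < n; ++c) {
            if (p != c && depends_on(passes[c], passes[p])) {
                dependents[p].push_back(c);
                ++pending[c];
            }
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i) {
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t pass = ready.front();
        ready.pop();
        order.push_back(pass);
        for (std::size_t next : dependents[pass]) {
            assert(pending[next] > 0);
            if (--pending[next] == 0) ready.push(next);
        }
    }
    if (order.size() != n) order.clear();  // some passes never became ready: cycle
    return order;
}
```

Passes that sit in the ready queue at the same time have no dependency on each other, which is where a real implementation could record them in parallel.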

Digging useful information out of Android native libraries (.so)

2023-07-25 Android

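When an Android app needs to integrate a native library (.so) that comes from somewhere else, you may, like me, have a few questions:

Which NDK version was this .so built with? Will it conflict with other .so files in the project, especially when the project uses the shared C++ STL (ANDROID_STL=c++_shared), given that an application cannot use more than one C++ runtime?
Which Android API level does this .so target? Is it higher than the project's minSdkVersion?
Which other .so files does this .so depend on (link against)? Have all of them been added to the project?
Does this .so have a unique identifier, other than a file hash, that can be used to match crash stack traces and the like?

P.S. This post assumes that the reader has some Android native development experience and understands the basic concepts.

Checking the NDK version information of a .so

Use the readelf tool to see what Android-specific items appear in the section headers of a .so built with the Android NDK.

P.S. ndk-which can locate the prebuilt readelf inside the NDK: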

$ $ANDROID_NDK_HOME/ndk-which --abi arm64-v8a readelf
/~/ndk/21.4.7075529/prebuilt/darwin-x86_64/bin/../../../toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android-readelf

Define an alias named readelf so that aarch64-linux-android-readelf is easy to call from the terminal:

$ alias readelf=`$ANDROID_NDK_HOME/ndk-which --abi arm64-v8a readelf`
$ readelf -v
GNU readelf (GNU Binutils) 2.27.0.20170315
Copyright (C) 2016 Free Software Foundation, Inc.
...

Take the libc++_shared.so bundled with the NDK as an example. On my machine its path is $ANDROID_NDK_HOME/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/lib/aarch64-linux-android/libc++_shared.so:

$ readelf -WS $ANDROID_NDK_HOME/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/lib/aarch64-linux-android/libc++_shared.so

How to run Java standalone app (with JNI) on Android without creating an apk

2023-07-07 Android

This week I found a great POC for running a pure Java standalone app (a command-line tool, no apk) on Android. But what about running a standalone application that uses JNI (with .so files) in the same way?

Java app with JNI

Imagine a Java program that loads a JNI shared native library and uses some Android APIs:

Helloworld.java

package com.example;
import android.os.Build;
import android.util.Log;
public class Helloworld {

    static { System.loadLibrary("hello"); }
    public static native String stringFromJNI();

    public static void main(String[] args) {
        Log.i("@@", "Hello world, " + Build.MANUFACTURER + " "+ Build.MODEL + "!");
        Log.i("@@", stringFromJNI());
        System.out.println(stringFromJNI());
        System.out.println("DONE.");
    }

    public static String getBuildVersion() {
        return Build.VERSION.RELEASE;
    }
}

...and the JNI source would look like this:

hello-jni.c

// ... code omitted

JNIEXPORT jstring JNICALL
Java_com_example_Helloworld_stringFromJNI(JNIEnv *env,
                                          jclass clz) // static native method: receives the class, not an instance
{
  // ... code omitted
  jmethodID versionFunc = (*env)->GetStaticMethodID(env, clz, "getBuildVersion", "()Ljava/lang/String;");

  jstring buildVersion = (*env)->CallStaticObjectMethod(env, clz, versionFunc);
  const char *version = (*env)->GetStringUTFChars(env, buildVersion, NULL);

  if (!version)
  {
    LOGE("Unable to get version string");
  }
  else
  {
    LOGI("Build Version - %s\n", version);
    (*env)->ReleaseStringUTFChars(env, buildVersion, version);
  }
  (*env)->DeleteLocalRef(env, buildVersion);

  return (*env)->NewStringUTF(env,
                              "Hello from JNI !  Compiled with ABI " ABI ".");
}
// ...

The working directory structure

.
├── Helloworld.java
└── hello-jni.c

Compile and deploy

Now we need to compile both the Java and C sources for Android.

Use javac and dx to build a jar file that Android can load:

export BUILD_DIR=$PWD/build
export JARFILE=helloworld.jar
# Quoted so that the whole option string ends up in the variable
export JAVAC_OPTS="-source 1.8 -target 1.8 -cp .:$ANDROID_HOME/platforms/android-30/android.jar"
# Compile .java to .class
javac $JAVAC_OPTS -d $BUILD_DIR/classes Helloworld.java
# Convert the .class files into a dex file embedded in a jar file
$ANDROID_HOME/build-tools/30.0.2/dx --output=$BUILD_DIR/$JARFILE --dex $BUILD_DIR/classes

Cross-compile the C to Android shared native library via NDK:

# Using the prebuilt toolchain directly
# See https://developer.android.com/ndk/guides/other_build_systems
export ANDROID_NDK_STANDALONE=$ANDROID_NDK_HOME/toolchains/llvm/prebuilt/darwin-x86_64
$ANDROID_NDK_STANDALONE/bin/clang \
    --target=aarch64-none-linux-android21 \
    --gcc-toolchain=$ANDROID_NDK_STANDALONE \
    --sysroot $ANDROID_NDK_STANDALONE/sysroot \
    -L${ANDROID_NDK_STANDALONE}/sysroot/usr/lib \
    -shared -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables \
    -fstack-protector-strong -no-canonical-prefixes -fno-addrsig -fPIC \
    -Wl,-llog \
    -Wl,-soname,libhello.so \
    -o $BUILD_DIR/libhello.so hello-jni.c

The build directory looks like:

build
├── classes
│   └── com
│       └── example
│           └── Helloworld.class
├── helloworld.jar
└── libhello.so

Use the adb tool to deploy helloworld.jar and libhello.so to the Android device:

adb shell mkdir /data/local/tmp/helloworld
adb push $BUILD_DIR/helloworld.jar $BUILD_DIR/libhello.so /data/local/tmp/helloworld/

Run the application on Android

Run helloworld.jar via app_process on Android:

adb shell CLASSPATH="/data/local/tmp/helloworld/helloworld.jar" \
    LD_LIBRARY_PATH=/data/local/tmp/helloworld \
    app_process \
    /data/local/tmp/helloworld \
    com.example.Helloworld
# output
Hello from JNI !  Compiled with ABI arm64-v8a.
DONE.

How to detect memory leaks in C++ programs on macOS

2023-07-03 Cxx

A very typical piece of leaking C++ code looks like this:

int main(int argc, const char **argv)
{
    auto *p = new int(10);
    // other codes ...
    p = nullptr; // leaked
    // ...
    return 0;
}

How do you find problems like this in a large C++ codebase?

libgmalloc

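Apple provides a memory debugging facility, Guard Malloc, for debugging memory problems. Run man libgmalloc for more usage information.

Malloc stack logging, which records the call stack of every allocation, is enabled by setting the environment variable MallocStackLogging=1 before running the program (this switch belongs to the system malloc library and is described in man malloc), for example:

MallocStackLogging=1 ./my_tool

The log is written to a temporary file: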

my_tool(38364) MallocStackLogging: stack logs being written to /private/tmp/stack-logs.38364.103f3a000.my_tool.19n2JH.index
my_tool(38364) MallocStackLogging: recording malloc and VM allocation stacks to disk using standard recorder

Note that this log file is deleted automatically when the program exits:

my_tool(38364) MallocStackLogging: stack logs deleted from /private/tmp/stack-logs.38364.103f3a000.my_tool.19n2JH.index

So it is best to make the program block just before it exits, so that the log file can still be analyzed:

#include <iostream>
static void wait_for_input()
{
    std::cout << "Press Enter to exit." << std::endl;
    char b[1];
    std::cin.read(b, 1);
}
int main(int argc, const char **argv)
{
    auto *p = new int(10);
    // other codes ...
    p = nullptr; // leaked
    // ...

    // blocking program for analyzing malloc stack history
    wait_for_input();
    return 0;
}

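The leaks tool

In addition, macOS ships a tool that detects memory leaks directly: /usr/bin/leaks (see: the leaks Tool).

Run man leaks in a terminal to read the manual.

leaks is simple to use: pass it a pid and it attaches to the running program: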

# pid=38364
leaks $pid --outputGraph=$pid.memgraph
# open memory graph file with Xcode
open $pid.memgraph

Xcode Memory Graph Debugger

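Putting it all together:

Modify the main function of the C++ program so that it blocks before exiting, then rebuild the program.
Set the environment variable MallocStackLogging=1 and run the program.
When the program reaches the end and blocks, run leaks and save the memory graph file.
Press ⌃+C to end the program.
Open the memory graph file with the Xcode Memory Graph Debugger and analyze the leak.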
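
To avoid switching the blocking code on and off by hand, the wait can be made conditional on the same environment variable. This is a small sketch, not part of the original workflow: the helper names are hypothetical, and it assumes that the mere presence of MallocStackLogging in the environment means a leak-analysis run is in progress.

```cpp
#include <cstdlib>
#include <iostream>

// True when the process was started with MallocStackLogging set
// (assumption: any value counts as "leak analysis in progress").
bool stack_logging_requested() {
    return std::getenv("MallocStackLogging") != nullptr;
}

// Blocks until Enter is pressed, but only during a leak-analysis run,
// so that `leaks <pid>` can attach before the stack logs are deleted.
void wait_for_leak_analysis() {
    if (!stack_logging_requested()) {
        return;
    }
    std::cout << "Press Enter to exit." << std::endl;
    std::cin.get();
}
```

Calling wait_for_leak_analysis() as the last statement of main keeps normal runs unaffected, while runs started with MallocStackLogging=1 pause at the end.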